Our Feed

UX Collective - Medium02/06/2026, 21:53
AI-created document fatigue: how I designed my way out of it
Design / UXopen article
Towards Data Science02/06/2026, 16:30
Code Is Cheap. Engineering Judgement Is Now the Scarce Resource
The barriers to building have collapsed. That shifts the bottleneck to ownership, validation, taste, and deciding what should actually exist The post Code Is Cheap. Engineering Judgement Is Now the Scarce Resource appeared first on Towards Data Science.
AI / MLopen article
Towards Data Science02/06/2026, 15:00
From Local App to Public Website in Minutes
Three free ways to quickly deploy a static web app that anyone can access The post From Local App to Public Website in Minutes appeared first on Towards Data Science.
AI / MLopen article
Hugging Face - Blog02/06/2026, 14:13
Holo3.1: Fast & Local Computer Use Agents
AI / MLopen article
Towards Data Science02/06/2026, 13:30
From Regex to Vision Models: Which RAG Technique Fits Which Problem
Enterprise Document Intelligence [Vol.1 #4] - A diagnostic across PDFs and questions, and a map of the techniques the rest of the series will cover The post From Regex to Vision Models: Which RAG Technique Fits Which Problem appeared first on Towards Data Science.
AI / MLopen article
MachineLearningMastery.com02/06/2026, 12:00
Scikit-LLM vs. Traditional Text Classifiers: When Should You Use an LLM?
In recent years, generative AI models like LLMs (large language models) have gradually taken over classical machine learning ones for addressing certain tasks, for instance, text classification .
AI / MLopen article
Towards Data Science02/06/2026, 12:00
Exploring Income Patterns with Python Pandas, Matplotlib, and Seaborn
Exploratory data analysis on the US Census Dataset The post Exploring Income Patterns with Python Pandas, Matplotlib, and Seaborn appeared first on Towards Data Science.
AI / MLopen article
UX Collective - Medium02/06/2026, 11:08
The most important part of building your taste is to hand it off
What having taste trapped in your head costs you and your team Continue reading on UX Collective »
Design / UXopen article
UX Collective - Medium02/06/2026, 11:07
Default Bias: Who chose your settings?
Design / UXopen article
Martin Fowler02/06/2026, 09:21
Fragments: June 2
Greg Wilson has noticed that lots of folks are using dodgy metrics to figure out if AI tools are worth their costs. Would you measure lines of code generated, or tickets closed? Or would you send out a survey asking whether developers feel more productive? Each of those approaches is flawed in a different way; He lists lots of common metrics, and why they are flawed. Sadly he doesn’t give any suggestions on what would be better. In my view, since we cannot measure productivity, any metrics are weak evidence at the best of times. I do somewhat use one of his flawed measures: “Asking Developers If They Feel More Productive”. While I acknowledge the problems he gives with this measure, I find that in an environment where decent measures are hard to find, even such a dim light is the best we have. In this situation these kinds of qualitative metrics may not be conclusive, but they are useful. ❄ ❄ ❄ ❄ ❄ Benedict Evans observes that extensive automation didn’t mean the demise of professions in the past. we spent a century automating accounting: we built calculating machines, punch cards, mainframes, data processing, databases, PCs, spreadsheets, ERPs, cloud… in fact, we built half of the tech industry around automating this. Yet the number of accountants kept going up. He goes into the myriad of problems that exist when we’re trying to forecast the impact of a technology on jobs. There’s the much-talked-about Jevons paradox - once something becomes cheaper, people do it more, which can increase demand. Often this leads to the nature of jobs changing, even if it’s called the same thing. Accountants today aren’t doing exactly the same work that they did in 1970 or 1980 ‘but more’ - they’re still called ‘accountants’ but the job is different. New technology often starts out being used for ‘the old thing but more’, but it rarely ends up like that. Technologies often affect whole businesses - consider the impact of the internet on news publishing. Did anyone observing the rise of smart phones in the early 2000s realize that a consequence of this would change the economics of taxis due to the rise of ride-sharing apps? The conclusion is that it is, at the very least, almost impossible to forecast the impact of AI on our work. ❄ ❄ ❄ ❄ ❄ Stephen O’Grady looks at how closed and open models have performed on benchmarks over time. Closed models are setting the pace of innovation, and constantly breaking new ground from a capabilities standpoint. Open models are chasing them, and the cycle times seem to be getting shorter. There are no clear capability moats, and what is frontier today is table stakes tomorrow. It tooks 13-18 months for open models to catch up to GPT-4 on these benchmarks, but only 2-7 months to catch up to GPT-4o. There’s a bunch of caveats to this analysis, that he lists, but it’s a worthwhile survey of how various kinds of models perform against the various measures we are trying to assess them with. ❄ ❄ ❄ ❄ ❄ One of the starkest examples of sloppy AI use is hallucinated citations - a give-away of both usage of LLMs and carelessness driving them. GPTZero is a company that makes tools to detect AI writing. I’ve no insight as to whether their tool is effective or not, but they do publish investigations of AI usage, and have published several articles highlighting hallucinated citations. One post focuses on Ernst & Young Canada’s report on cyber threats to loyalty systems and found that more than half its references were hallucinations. The post uses a lot of extremely annoying animations in how it presents its information (breaking Safari’s reader mode in the process). But the harm that these kind of AI generated reports can do goes further than just some misled humans: Publishing a report online is essentially a form of data injection into the pool of knowledge that is the internet. When the report includes fake information (either vibed citations or false claims) it can “poison the well” by misleading future researchers, especially if the report is published by a well-known consulting firm and hosted on a high-traffic website. ❄ ❄ ❄ ❄ ❄ As LLMs get more capable in programming, we are rightly worried that people will use them attack software systems. But these models can also be used for defense, allowing teams to find bugs before attackers do. Some folks from Mozilla posted an article on how they’ve used AI model to identify and fix an unprecedented number of latent security bugs in Firefox. Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop. Dealing with reports that look plausibly correct but are wrong imposes an asymmetric cost on project maintainers: it’s cheap and easy to prompt an LLM to find a “problem” in code, but slow and expensive to respond to it. It is difficult to overstate how much this dynamic changed for us over a few short months. This was due to a combination of two main factors. First, the models got a lot more capable. Second, we dramatically improved our techniques for harnessing these models — steering them, scaling them, and stacking them to generate large amounts of signal and filter out the noise. During 2025, there were 17-31 security bugs fixed each month. In April 2026, they fixed 423. ❄ ❄ ❄ ❄ ❄ Pavel Voronin riffs on Unmesh Joshi’s post on What is Code. He observes that cruft in a codebase (technical debt) has always added friction to software development. But the consequences of this cruft are compounded when LLMs are using existing code as context for future work. In a degraded codebase, the model does not see “technical debt” as debt. It sees examples. It sees precedent. It sees a style to continue. LLMs multiply what’s currently happening. I hear reports that good code might take the place of much of what’s put in markdown, because LLMs will imitate what’s already in the code base. But bad code multiplies too. Inevitably he introduces another variation of rampant debt metaphors: Cognitive debt accumulates when a team uses abstractions it no longer understands. Generative debt accumulates when a codebase contains confused concepts that models are likely to continue. Cognitive debt is about what the team no longer understands. Generative debt is about what the model is now likely to reproduce. ❄ ❄ ❄ ❄ ❄ Jason Koebler, from the very worthwhile 404 media, has written a plaintive essay on how AI-generated slop is driving us crazy. Not just because its filling the web with this slop, but also because how it’s making us humans react to slop and the threat of slop. We review our own writing and notice: it’s not just reading AI slop that hurts us, it’s the risk that we write something that looks like AI slop. If I use phrasing that AI copied from me, does it seem like I’m copying AI? This has led to the appearance of “humanizers” - AI tools that make our writing look less like AI. Humanizers add typos, randomly replaces words, removes “AI tells,” and sometimes inserts random characters. It’s another step on the way to the Zombie internet: I called it the Zombie Internet because the truth is that large parts of the internet are not just bots talking to bots or bots talking to people. It’s people talking to bots, people talking to people, people creating “AI agents” and then instructing them to interact with people. […] It’s my email inbox, in which I used to occasionally get poorly-formatted, poorly written, extremely long emails from delusional people who were positive the CIA had imprisoned them in a virtual torture chamber using undisclosed secret technology but where I now get well-formatted, passably written, extremely long emails from delusional people who are positive they have proven AI sentience and have the AI transcripts to prove it. ❄ ❄ ❄ ❄ ❄ Andy Osmani points out that spawning lots of agents is like launching a bunch of parallel processes that all rely on a single orchestrating thread - yourself. Python has the Global Interpreter Lock (GIL). You can spawn as many threads as you want but only one executes python bytecode at a time because they must acquire the lock. You are the GIL of your AI agents. They all can run at once. But when any of their work needs genuine understanding of the architecture or resolving merge conflicts, that work has to acquire the lock. There is one lock. You hold it. This means you must design the workflow with the agents with that GIL in mind. You shouldn’t launch more agents than you can properly review. It’s handy to separate background tasks that can be offloaded to an agent from complex tasks that require applied attention. Don’t use that precious brain for things that the machine can verify itself. [And I’d add - do get the machine to build tools that ease human verification. For example, it’s better to surface test case data in tables rather than buried in assert statements.] Spawning agents is not the skill. Anyone can run 20. The real skill is designing the system around the one serial resource that cannot be cloned or parallelized. That resource is your attention. ❄ ❄ ❄ ❄ ❄ Jamie Hurst is a Principal Engineer at booking.com, where he works in developer experience with a focus on AI tooling. He’s written realistically about the gains and losses of using LLMs in this work. The cost of building has collapsed, but the cost of aligning organisationally has not. If anything, it’s gone up. When three different teams can each produce a working solution to the same problem in the time it used to take to write a proposal, the bottleneck moves from engineering to coordination. He thinks he’s able to do more as a senior engineer, but is concerned about how sustainable it is, both for him personally and for the organization he works for. He’s able to shape directions for multiple workstreams at once, in a way that he couldn’t three years ago. But one loss is that he doesn’t have enough time for mentoring, which will exact a toll on his employer in the longer term. He also finds he doesn’t have enough time to think. The productivity gains from AI got captured by output volume rather than output quality. The org’s expectations rose to absorb the speed-up, and the slack that used to exist between tasks, the unstructured time where strategic thinking actually happens, got eaten first because it’s invisible on a dashboard. I’m at a point in my career where thinking is supposed to be most of the job, and most of it now happens on holiday because the working week doesn’t accommodate it.
Backendopen article
UX Collective - Medium01/06/2026, 22:53
AI meets Sturgeon’s Law
Why does more content not mean more quality Continue reading on UX Collective »
Design / UXopen article
UX Collective - Medium01/06/2026, 22:53
Foreman, guardian, team builder: all this is a box
On a moving company’s box, and the long reach of a few quiet decisions Continue reading on UX Collective »
Design / UXopen article
UX Collective - Medium01/06/2026, 22:53
The register shift
Design / UXopen article
Towards Data Science01/06/2026, 18:49
RAG Is Not Machine Learning, and the ML Toolkit Solves the Wrong Problem
Enterprise Document Intelligence [Vol.1 #3] - Why the ML toolkit (hyperparameter sweeps, train/test splits, explainability frameworks) solves the wrong problem, and what to use instead The post RAG Is Not Machine Learning, and the ML Toolkit Solves the Wrong Problem appeared first on Towards Data Science.
AI / MLopen article
Towards Data Science01/06/2026, 17:30
How to Combine Claude Code and Codex for Maximum Coding Power
Get the most out of each coding model to have a very powerful coding setup The post How to Combine Claude Code and Codex for Maximum Coding Power appeared first on Towards Data Science.
AI / MLopen article
Hugging Face - Blog01/06/2026, 15:45
Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
AI / MLopen article
Towards Data Science01/06/2026, 15:00
Ensuring Data Integrity with Cryptographic Hashing and the Ethereum Blockchain
Applying blockchain primitives to dataset versioning, provenance, and integrity assurance The post Ensuring Data Integrity with Cryptographic Hashing and the Ethereum Blockchain appeared first on Towards Data Science.
AI / MLopen article
Hugging Face - Blog01/06/2026, 13:51
Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic
AI / MLopen article
Towards Data Science01/06/2026, 13:30
It’s the Lessons We Learned Along the Way. Or, Is It?
Research projects in the age of AI The post It’s the Lessons We Learned Along the Way. Or, Is It? appeared first on Towards Data Science.
AI / MLopen article
MachineLearningMastery.com01/06/2026, 12:00
The Roadmap for Mastering LLMOps in 2026
The LLMOps market is projected to grow from <a href="https://www.
AI / MLopen article
Towards Data Science01/06/2026, 12:00
Escaping the Valley of Choice in BI
Why Agentic BI threatens an entire profession The post Escaping the Valley of Choice in BI appeared first on Towards Data Science.
AI / MLopen article
UX Collective - Medium01/06/2026, 11:17
Can VR treat addiction?
Design / UXopen article
Codrops01/06/2026, 10:56
Beaucoup and the Pursuit of Memorable Experiences
Through branding, immersive web experiences, and 3D design, Beaucoup explores new ways to make brands feel memorable.
Frontendopen article
UX Collective - Medium01/06/2026, 08:12
Designing for AI, the permalink problem, vibe designing
Design / UXopen article
UX Collective - Medium01/06/2026, 08:10
Is your website ready for AI agents?
Your website just got ghosted by an AI agent and why accessibility is your new best friend. Continue reading on UX Collective »
Design / UXopen article
Hugging Face - Blog01/06/2026, 04:44
Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action
AI / MLopen article
Towards Data Science31/05/2026, 17:00
Solving a Murder Mystery Using Bayesian Inference
How Knives Out teaches Bayesian thinking (without you realizing it) The post Solving a Murder Mystery Using Bayesian Inference appeared first on Towards Data Science.
AI / MLopen article
UX Collective - Medium31/05/2026, 12:22
AI Token Scarcity and Arcade Economics
The rising cost of computing is prompting a reevaluation of designers' expectations. Continue reading on UX Collective »
Design / UXopen article
Articles on Smashing Magazine — For Web Designers And Developers31/05/2026, 08:00
June Is For Exploring (2026 Wallpapers Edition)
Let’s kick off June — and the beginning of summer — with some fresh inspiration! Artists and designers from across the globe once again tickled their creativity to welcome the new month with a new collection of desktop wallpapers. Enjoy!
Frontendopen article
Codrops30/05/2026, 13:06
The Future of Emotional Technology: Comfort Websites for Growth and Self-Awareness
Personalized digital spaces designed with care and emotional intention may offer a new way to support self-awareness, emotional growth, and the processing of difficult experiences in an increasingly AI-driven world.
Frontendopen article
MachineLearningMastery.com30/05/2026, 02:54
Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient
This article is divided into four parts; they are: • The Problem with Static Batching • Code Example of Static Batching • Continuous Batching: Dynamic Scheduling and Ragged Batching • Full Implementation The simplest way to serve multiple requests together is to use static batching, by grouping them into fixed-size batches and processing each batch together.
AI / MLopen article
Netflix TechBlog - Medium29/05/2026, 18:49
High-Throughput Graph Abstraction at Netflix: Part I
Backendopen article
NN/g latest articles and announcements29/05/2026, 17:00
How to Get Research Recommendations on the Roadmap
To influence the roadmap: join planning early, learn constraints, tie research to PM metrics, and give clear recommendations at the right time.
Design / UXopen article
NN/g latest articles and announcements29/05/2026, 17:00
Using RAS to Guide UX Research Resource Allocation and Strategy
RAS helps managers allocate resources based on actual impact, shifting focus from outputs to outcomes and enabling data-driven UX strategies.
Design / UXopen article
Netflix TechBlog - Medium29/05/2026, 14:01
From Silos to Service Topology: Why Netflix Built a Real-Time Service Map
Backendopen article
Hugging Face - Blog29/05/2026, 00:00
Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler
AI / MLopen article
Codrops28/05/2026, 13:01
The Never Ending Story: Building a Seamless Infinite Scroll Experience with GSAP & Lenis
Most pages end when you stop scrolling. This one doesn’t. Learn how to build a seamless infinite scroll with a layered parallax depth effect.
Frontendopen article
Articles on Smashing Magazine — For Web Designers And Developers28/05/2026, 13:00
Algorithmic Theming Engines: Building Self-Correcting Color Systems With `contrast-color()`
Seventy percent of websites still fail basic WCAG contrast checks in 2025. After years of design system tooling, accessibility linters, and JavaScript libraries, nothing moved the needle. We didn’t need better libraries. We needed better CSS. `contrast-color()` is that better CSS.
Frontendopen article
MachineLearningMastery.com28/05/2026, 12:00
Building a Context Pruning Pipeline for Long-Running Agents
Modern AI agents built on top of large language models (LLMs) are designed to run continuously.
AI / MLopen article
Stripe Blog28/05/2026, 00:00
Solo founding is at an all-time high: Top performers have these traits in common
In 2025, solo founders in the top decile generated 61 times the revenue of the median solo founder in their first six months. We analyzed the data to understand what drives that gap.
Backendopen article
Martin Fowler27/05/2026, 19:40
Fragments: May 27
At the GOTO Conference in Copenhagen in 2025, Kent Beck and I spent some time on stage talking and answering questions from the audience - a format I refer to as “two old geezers on a park bench”. We talk about our experiences with LLM-augmented programming (at that point - October 2025), we show our frustration that things we’ve been saying for thirty years still need to be said, we say how anything like a manifesto reunion needs to be led by a younger generation, and opine on what junior developers should be focusing on in their career. ❄ ❄ ❄ ❄ ❄ Ian Johnson has written a series of posts about restructuring a gnarly codebase The story follows a real Laravel + React codebase over ~3 months and ~258 commits from a legacy monolith with no tests to a well-structured application with automated quality gates, a React SPA migration in progress, and an AI agent that reliably ships production code with minimal supervision. The series covers the steps in decent detail, and his approach follows the kinds of steps I’d use. First get everything under the control of decent characterization tests, add static analysis, introduce the right patterns to make things flow easily. With all of this, is his use of AI, which changed during the exercise: For the first two months of this project, I used Claude Code with auto-approve turned off. Every file edit, every terminal command, every change… I reviewed it before it executed. […] The results were good. The code was clean. But I was doing most of the thinking and half the typing. The agent was a fancy autocomplete with better suggestions. I wasn’t getting the leverage I’d hoped for. I read an article about “on-the-loop” versus “in-the-loop” human-AI collaboration. The framing clicked immediately […] I was micromanaging because I didn’t trust the agent to do the right thing. And I didn’t trust the agent because there was nothing forcing it to do the right thing. His early steps put in tests, static analysis, and the right architectural patterns. With those in place, he could let the agent do more work. My role shifted from writer to curator. I don’t write most of the code anymore. I Define the patterns […] Review the test specs […] Review the output […] Update the harness […] Make strategic decisions […] He finishes the series with conclusions about how he’d generalize his experience to other circumstances. ❄ ❄ ❄ ❄ ❄ Back in the land of my birth, there was some notable groans when the National Health Service decided to close nearly all of their Open Source repositories, supposedly to the security threat of LLMs. Closing repos like this isn’t an effective counter to LLM-augmented attackers. I suspect it’s no coincidence to see GDS (Government Data Services), the highly-regarded IT enablers in the UK government publish their position Moving code from public to private as a substitute for investment in secure-by-design delivery, ownership and remediation is a warning sign because it reduces sharing and scrutiny, can slow coordinated improvement across government and suppliers, and does not remove the underlying weaknesses in a running service. Terence Eden memorably sums up his view on this: Within the UK’s Civil Service you occasionally hear the expression “being invited to a meeting without biscuits”. It implies a rather frosty discussion without any of the polite niceties of a normal meeting. ❄ ❄ ❄ ❄ ❄ I’ve seen a few cases where those developers who are most involved in working with LLMs find they are running into a problem with cognitive endurance, Adam Tornhill has joined this group: One of the big wins with agents is that they let us stay with the higher-level problem for longer. We get less sidetracked by details, dependency cleanup, and similar secondary tasks that used to break concentration. But there is a cost we are still underestimating. Agentic coding is mentally expensive. I can usually sustain the pace for a couple of hours. Then I need a break. The pace is simply too intense. And based on conversations with other engineers, I do not think I am alone in that. He explains that working with The Genie means we are making more decisions in less time, this increase in decision density is hard on the brain. He responds by keeping agent tasks small, automating everything he can, and accepting that he won’t know every line of code as long as he has good verification mechanisms in place. Notably, he has not gone in the direction of doing his work with swarms of agents that he coordinates. Instead has one long-running task that he babysits and one focus task That last point is important given the running-twenty-agents-in-parallel hype. I cannot even think about twenty meaningful things to build, and even less so about the resulting cognitive tax of the likely interruptions. It’s exactly the wrong thing to even consider. At least for humans. (And yes, I understand sub-agents and machine parallelisation. That is not what I’m objecting to. It is the parallelisation of human attention that does not scale). I liked that he included some thoughts about what folks can do in time outside this intense programming time. Not just “have a coffee” (although he includes that) but also about learning about the domain that the software supports. ❄ ❄ ❄ ❄ ❄ A couple of pithy quotes from social media Lorin Hochstein “Metaphor debt” is when all of your metaphors involve the concept of “debt” because you can’t think of any other metaphors anymore. ❄ ❄ Daniel Terhorst-North If a vegan crossfit fan is using Claude to write Rust, which thing do they tell you first? ❄ ❄ ❄ ❄ ❄ Karl Bode reacts to speakers getting booed when mentioning AI during commencement addresses. He points out that younger folks are increasingly unhappy with the tech oligarchy and their fruits. The thing is the kids aren’t stupid. They see the field clearly. They see the difference between what’s being sold to them by tech companies, the press, and commencement speakers, and what they have repeatedly seen with their own eyes. They’ve watched tech oligarchs spend the last decade mired in scandal after scandal, hype cycle after hype cycle, steadily enshittifying everything they touch along the way. […] The percentage of Gen Z that think AI’s benefits don’t counterbalance the risks now sits around fifty percent, up 11 percentage points in just the last year. Eight out of every ten believe that using AI makes the process of actual learning more difficult. He sees young people saddled with the perception of entering a worsening world - which leads them to rage against this latest fruit of the tech oligarchy. A rage that is easy for folks like me - with a comfortable retirement off-ramp - to properly appreciate. A rage that could have marked political and social consequences. ❄ ❄ ❄ ❄ ❄ Relevant to these concerns are a couple of items in last week’s Economist newspaper. The newspaper argues that historically major technological advances haven’t led to significant unemployment or drops in wages (paywalled article). The closest was the original industrial revolution in 19th Century Britain. There was a stagnation in wages during this period, but there was also a massive increase in population, from 4½ million to 12 million. It also points out that we’ll probably only understand the full consequences of all this when a recession hits, as this is when most unproductive jobs tend to be flushed out of the system. A second article (also paywalled) indicates that AI is having some effect on graduate hiring. They did an analysis of surveys of recent graduates, looking to see if employment varied depending on a job’s exposure to AI. The least exposed quintile of subjects saw employment rate fall by 1.5% over the last couple of years, while the most exposed quintile’s drop was 6.6%. ❄ ❄ ❄ ❄ ❄ Lawfare isn’t impressed with the latest efforts by the US Government to regulate AI. On [last] Wednesday, the White House invited leaders of OpenAI, Google, Anthropic, Meta, and Microsoft to the Oval Office for a signing ceremony the following afternoon. President Trump was to sign an executive order on AI and cybersecurity—the administration’s most formal effort yet to establish a voluntary process for reviewing frontier models before their release. But roughly three hours before the ceremony, when some company executives were already in the air to Washington, the White House called it off. They see the proposed regulations as mild, and including some valuable measures to harden defenses against cyber threats. But it’s worth underscoring the implications of postponing (if not outright canceling) this order, which, by its own terms, was about as modest a frontier-AI intervention as the federal government could put on paper: voluntary, focused on the government’s own defenses, and explicitly barred from becoming a licensing regime. The objection isn’t so much about government coercion as about the government having any settled role at all. Voluntary, in other words, isn’t the floor of frontier AI policy in this administration; it’s the ceiling. This is a questionable position given that the concerns animating this draft order will likely grow in the near future. It is also self-defeating for those who applauded the order’s delay or demise. Far from resolving the risk of government meddling in AI, killing the order just leaves in place what Ball has described as the “opaque and essentially lawless” alternative: government access happening through back channels, on terms set case by case, with no stable rules at all. One of the problems here is a distinct lack of governmental expertise, either in AI or in software in general. Too much is being decided at the whims of the tech oligarchy, there isn’t any attempt to engage in the broader issues at hand. That’s not entirely a bad thing, trying to regulate something that’s still evolving so fast is usually a fool’s errand - but the problem here is the impact of AI is so big that there’s real danger in being too far behind. ❄ ❄ Which leads me to a rare thing, an endorsement of a candidate for political office. If you are voting in congressional district MA-06 (North Shore of Massachusetts), I’d seriously look at Beth Anders-Beck, who is running for congress in that district. Beth has a long background in software development (including developing the notion of Forest and Desert), so would introduce expertise that Congress desperately needs. I’ve known Beth for decades, and have a high opinion of their intelligence, judgment, and ability to work with others. Congress doesn’t deserve Beth, but it does need her.
Backendopen article
Martin Fowler27/05/2026, 15:01
The test suite as a regression sensor
Birgitta Böckeler finishes her post on sensors for coding agents by examining the role of a test suite as a regression sensor, focusing on the role mutation testing can play. more…
Backendopen article
Martin Fowler27/05/2026, 14:03
The VibeSec Reckoning
Vibe coding has significantly accelerated software prototyping but AI agents frequently recommend insecure configurations, creating security problems. Gautam Koul, Lucian Moss, Neil Drew-Lopez, and Daberechi Ruth Edeokoh share their experience while building applications for Thoughtworks's global marketing. They learned that to combat this we need to write a security context file to guide the AI, be cautious with AI permission requests, create a daily security intelligence feed, and provide builders with a secure-by-default harness and templates. more…
Backendopen article
Codrops27/05/2026, 12:43
Whooshes, Snaps and Shaders: Adrien Vanderpotte and the Feeling of the Interface
An exploration of the meticulously messy hunt for a project’s digital signature: the singular, reactive behavior that makes a brand feel unforgettable.
Frontendopen article
MachineLearningMastery.com27/05/2026, 12:00
The Statistics of Token Selection: Logits, Temperature, and Top-P Walkthrough
When large language models, or LLMs for short, produce outputs, several criteria are at stake, including not only overall response relevance but also coherence and creativity.
AI / MLopen article
Stripe Blog27/05/2026, 00:00
Expanding Stripe Radar to protect more of your business
Radar now blocks high-risk transactions across all supported payment methods; defends against new fraud types like multi-account abuse and pay-as-you-go abuse, regardless of which payment processor you use; and gives platforms new tools to evaluate and mitigate merchant risk on and off Stripe.
Backendopen article
Hugging Face - Blog27/05/2026, 00:00
Reachy Mini goes fully local
AI / MLopen article
Hugging Face - Blog27/05/2026, 00:00
Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL
AI / MLopen article
Codrops26/05/2026, 13:18
Building a Scroll-Driven 3D Cube Gallery in Webflow with GSAP
A step-by-step guide to building a scroll-driven 3D cube gallery in Webflow with GSAP animations and CMS-powered content.
Frontendopen article
MachineLearningMastery.com26/05/2026, 12:00
Building a Multi-Tool Gemma 4 Agent with Error Recovery
In a <a href="https://machinelearningmastery.
AI / MLopen article
MachineLearningMastery.com25/05/2026, 12:00
Implementing Hybrid Semantic-Lexical Search in RAG
Implementing hybrid search strategies is a critical step in building modern RAG (Retrieval-Augmented Generation) systems , especially when shifting from prototype to production-ready solutions.
AI / MLopen article
Articles on Smashing Magazine — For Web Designers And Developers25/05/2026, 12:00
Your Prototype Is Not Being Honest With Your Users (And Here’s How To Fix It)
There’s a moment in almost every usability session where a participant pauses at the login screen, types something, and glances up: checking whether they’re “doing it right.” That pause is a clear sign. They’ve already clocked that this isn’t a real app, and every data point collected after that moment is filtered through that awareness.
Frontendopen article
Hugging Face - Blog25/05/2026, 00:00
Harness, Scaffold, and the AI Agent Terms Worth Getting Right
AI / MLopen article
NN/g latest articles and announcements22/05/2026, 17:00
The Case for Design Disposables
Design disposables are rough artifacts you make to think, not to deliver. Learn to tell them apart from deliverables and avoid the sunk-cost trap.
Design / UXopen article
NN/g latest articles and announcements22/05/2026, 17:00
Closing the Loop: What to Do After a Design Critique Ends
Most designers invest in running critiques but skip the followup. That missing step is often why feedback culture breaks down.
Design / UXopen article
Hugging Face - Blog22/05/2026, 15:25
Specialization Beats Scale: A Strategic Variable Most AI Procurement Decisions Overlook
AI / MLopen article
Articles on Smashing Magazine — For Web Designers And Developers22/05/2026, 13:00
Four Levels Of Customer Understanding
What people say, feel, think, and do are often very different things. To understand the underlying reasons for user behavior, it helps to look beyond the surface and explore hidden motivations, root causes, and the different layers of reality that shape how people act. Brought to you by Measuring UX Impact, **friendly video course on UX** and design patterns by Vitaly.
Frontendopen article
MachineLearningMastery.com22/05/2026, 12:00
Building Context-Aware Search in Python with LLM Embeddings + Metadata
Keyword search breaks the moment a user types something a document doesn't literally say.
AI / MLopen article
Codrops21/05/2026, 15:22
Creating Scroll-Driven SVG Map Animations with GSAP
A tutorial on building cinematic scroll-driven SVG map animations with GSAP using path drawing, motion tracking, and smooth camera movement.
Frontendopen article
MachineLearningMastery.com21/05/2026, 12:00
How to Build a Multi-Agent Research Assistant in Python
I have been experimenting with the OpenAI Agents SDK, and it has quickly become one of my favorite ways to build agentic AI applications.
AI / MLopen article
Martin Fowler21/05/2026, 11:49
Bliki: Vibe Coding
Vibe coding is building a software application by prompting an LLM, telling it what to build, trying it out, prompting for changes - but without looking at any of the code that the LLM generates. This technique can be used by people without any knowledge of programming. However the resulting software often shows problems with maintainability, correctness, and security - so is best used for disposable software written for a limited audience. The term was coined in February 2025 by Andrej Karpathy, an experienced programmer, in a post on X: There's a new kind of coding I call “vibe coding”, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like “decrease the padding on the sidebar by half” because I'm too lazy to find it. I “Accept All” always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works. -- Andrej Karpathy The key point about vibe coding is “forget that the code even exists”. This is what gives it much of its usefulness, but also its limitations. Since the November Inflection many programmers are getting LLMs to write all their code, commenting that they may never write a line of code directly again. However they do care about this code, reviewing it, paying attention to its internal structure. In that case, they aren't forgetting the code exists, so it's really a different thing that I call Agentic Programming. Sadly the term “vibe coding” really caught on, so many people use it to mean agentic programming. However I feel that despite this rapid Semantic Diffusion, it's worth trying to keep the concepts of vibe coding and agentic programming separate, as they are both different to use and different in their consequences. Because a vibe coder doesn't look at the code, they don't need programming skills, so it's perfect for someone with no programming knowledge to build applications for their own use. Experienced programmers may also find it handy for rapid development of disposable software or prototypes. Vibe coding is still new, so we are exploring its limitations, and those limitations change as the sophistication of models and their harnesses change. These limitations do introduce considerable risks, particularly if the vibed software is used widely or has access to sensitive information. Perhaps the most serious risk is that of security. LLMs are inherently vulnerable as they provide a large attack surface for predators. Vibe coded applications can often expose sensitive information or worse, credentials to attack deeper into an organization's systems. Even non-programmers need to be aware of the Lethal Trifecta. With little attention to the code, vibed software can rapidly produce many lines of code of a very low quality. Such code makes it difficult, even for an LLM, to modify and enhance the software in the future. While it's possible that growing LLM capabilities will allow it to work with even the largest bowls of spaghetti software, thus far it seems clear that well-structured software makes life easier for LLMs too. LLMs are famous for habit of hallucinating incorrect facts and presenting these with great confidence. This habit also leads them to create software that behaves incorrectly - and those errors may not be manifest to the user. Furthermore the non-determinism of LLMs means that it's likely that asking an LLM to enhance some software could easily lead it to introduce errors, even in parts of the code that shouldn't change due to the new request. We should thus treat LLM-generated software with skepticism, it can still be useful, but we need to be aware of the risks. On the whole vibe coding software is best used for disposable software that's only used by its author or a close group of collaborators who understand and accept the risks involved. Code that is more complex, more widely-used, and with more consequences to its risks should not be forgotten about.
Backendopen article
Articles on Smashing Magazine — For Web Designers And Developers21/05/2026, 08:00
Advanced Tree Counting: Mathematical Layouts With `sibling-index()` And `sibling-count()`
Meet `sibling-index()` and `sibling-count()`. Staggered cascade effect in one line of CSS without `:nth-child()` rules or JS workarounds. Works for 5 items or 5,000.
Frontendopen article
Martin Fowler20/05/2026, 14:27
Three more static code analysis sensors
Birgitta Böckeler adds discussion of three more sensors for static code analysis, focusing on checking and enforcing better modularity. Computational sensors for dependency checks were good at enforcing rules, but the rules were limited. Building a computational sensor for coupling data proved lackluster. Prompting an inferential sensor to review modularity was more effective. more…
Backendopen article
MachineLearningMastery.com20/05/2026, 14:15
Agentic Programming: A Roadmap
Here is the number that defines the current state of things: <a href="https://svitla.
AI / MLopen article
Codrops20/05/2026, 12:24
Made With Gsap: Building a Fun Gravity-Based Mouse Trail
A creative GSAP experiment where images cascade from the mouse position and rebound off the bottom of the screen with satisfying motion.
Frontendopen article
Hugging Face - Blog19/05/2026, 18:38
OlmoEarth v1.1: A more efficient family of Earth observation models
AI / MLopen article
Martin Fowler19/05/2026, 16:38
Maintainability sensors for coding agents
In her recent article about harness engineering for coding agent users, Birgitta Böckeler laid out a mental model for expanding a coding agent harness: a system of guides and sensors that increase the probability of good agent outputs and enable self-correction before issues reach human eyes. Birgitta has now started publishing an article where she walks though her experiences using sensors to keep a codebase maintainable. This part looks at static analysis with basic code linting. more…
Backendopen article
Codrops19/05/2026, 10:52
80s Business Tech and Seamless Scene Transitions: Inside Shader.se’s Scroll-Driven WebGPU Pipeline
How we built the scroll-driven WebGPU pipeline behind Shader.se, from selective scene rendering to seamless scene transitions using React Three Fiber.
Frontendopen article
NN/g latest articles and announcements19/05/2026, 06:52
UX Conference August Announced (Aug 17 - Aug 28)
Take up to 5 in-depth training courses, teaching user experience best practices for successful design. Training focused on long-lasting skills for UX professionals. August 17 - August 28, 2026.
Design / UXopen article
Codrops18/05/2026, 12:12
Merlin: The Code Boutique Turning Motion Into Digital Magic
A code boutique from Amsterdam crafting high-performance digital experiences where motion, technology, and design become one seamless language.
Frontendopen article
NN/g latest articles and announcements15/05/2026, 17:00
Small by Design: The Strength of Lean Design-System Teams
Lean design-system teams, when strategically planned, can move faster, prioritize sharply, and scale impact beyond their size.
Design / UXopen article
NN/g latest articles and announcements15/05/2026, 17:00
What Designers Actually Struggle with on Product Teams
Designers' top struggles aren't about design skills. They're about alignment, influence, and navigating org complexity — the work no one taught them to do.
Design / UXopen article
Articles on Smashing Magazine — For Web Designers And Developers15/05/2026, 10:00
Ten Data-Backed Truths Of User Experience ROI
Every extra second of friction has a measurable business cost. Carrie Webster shares ten data-backed UX facts that link user experience directly to revenue, retention, and long-term growth.
Frontendopen article
Martin Fowler14/05/2026, 21:52
Fragments: May 14
Last week I spent a day at The Orchard Retreat, hosted by Mechanical Orchard. that brought together several people working in software development to talk about the profession’s future with the rise of agentic programming. The event was help under the Chatham House Rule, so I can’t attribute the comments and stories I heard. (If anyone recognizes themselves, and would like attribution, let me know.) Here are a few tidbits that caught my notebook. ❄ ❄ One group developed a behavioral clone of GNU Cobol compiler in Rust. The result is 70K lines of Rust and was built in 3 days. This is yet another sign of the ability of LLMs to do a good job of porting existing code to a new platform. Good regression tests are extremely valuable here (and I don’t know how good GNU Cobol’s are). There’s also the possibility of building a test suite if you have access an existing implementation. ❄ ❄ Large spec documents can be complex for a human to review. One attendee shared the idea of getting the LLM to interview a human expert, asking the human questions to verify the correctness of the specification, a form of Interrogatory LLM. ❄ ❄ Not specifically about AI - but I liked how one attendee commented that the first thing they do when consulting with an organization is to read the guidelines for their change-control board. This is the scar tissue of what’s gone wrong in the past. I’ve often said that to understand why a thing is the way it is, you need to understand the history of how it got there. This seems like an excellent way to tap into important parts of that history. ❄ ❄ My colleagues who work with modernizing legacy systems have long been rather sniffy about “Lift and Shift” - porting a legacy system to a new platform while retaining Feature Parity. We see this pattern as a huge missed opportunity. Often the old systems have bloated over time, with many features unused by users (50% according to a 2014 Standish Group report) and business processes that have evolved over time. Replacing these features is a waste. Instead, try to muster the energy to take a step back and understand what users currently need, and prioritize these needs against business outcomes and metrics. But this point of view was developed before LLM’s ability to port code appeared. One attendee who does a lot of work in this field said they believed that lifting and shifting to a new platform should now be always the first step in a legacy migration. The cost is no longer as formidable as it used to be, and a better environment makes further evolution much cheaper. Just don’t stop there. ❄ ❄ Several attendees were from the financial industry, and thus were immersed the problems of complex legacy environments coupled with regulatory controls and significant risk should software do something wrong with money. One of their issues is the complexity we run into when a financial product is offered in multiple jurisdictions, each with their own regulations to satisfy. There’s a lot of software complexity in deciding which jurisdiction applies, and picking the right set of rules at the right point of the workflow. The question here is whether the rapidity of agentic programming means that we can build individual, simpler systems for each jurisdiction. We would then use LLMs to ensure consistency between them, so that as the product rules change, each system reflects that into its own environment. A large part of software design is about identifying what is the same and what differs between various business contexts. Where things are the same, and need to be the same, we are rightly wary of duplicating code, since this increases the cost of updates the dangers of inconsistency. The interesting question is what role LLMs can play to give us new tools to tackle this. ❄ ❄ As is usually the case in gatherings like this, folks were concerned about junior developers. When we work with The Genie, our value comes from good judgment - how do we teach that? This group did have one common tool - Pair Programming. One of the key benefits of pairing has always been skills transfer, and here an experienced agentic programmer can pass on their judgment for software design and how to use the genie to get there. And the junior will often have a trick or two to share too - that fresh pair of eyes in particularly valuable in the shift to our agentic future. ❄ ❄ Historically, we use computer systems to bring order to chaotic human processes. Is AI reversing that? ❄ ❄ So much software is involved in data transformation. Those records over there need to be consumed by these APIs over here, but there are differences in how the data is structured, often due to being in different Bounded Contexts, so we have to do some conversion. Agents are particularly adept at writing this kind of transformation code, which is often more tedious than we’d like. ❄ ❄ Chaos Engineering has become a valuable technique to improve resiliency, made famous by Netflix’s Chaos Monkey that randomly breaks live services to see how well the ecosystem reacts and recovers. What would a Chaos Monkey for AI look like? Would it deliberately introduce hallucinations into a pipeline to see if sensors were able to catch them? ❄ ❄ ❄ ❄ ❄ Back at my desk There’s been a bunch of questions about the article on Structured-Prompt-Driven Development (SPDD) that the authors answered in a Q&A section. One in particular caught my eye: Have you considered having an agent do the prompt/spec review itself — not a human reviewing the Canvas, but an agent that reads the REASONS Canvas alongside the code diff and verifies alignment? The reply talks about how there is an available command to do this, but there downsides. In particular one reason not to do this automatically is: Letting humans learn. Review is also where humans learn from the AI’s choices — patterns, trade-offs, options they had not thought of. Cutting humans out speeds things up, but it blocks the long-term skill growth that SPDD is designed to protect. […] Once enough decision rules build up to give us real confidence, we may shift more of the review to the agent step by step — but the part where humans learn from the AI is something we plan to keep. One of the ways we should judge the value of an AI tool is how much it helps us humans learn more about the world we inhabit and build. ❄ ❄ ❄ ❄ ❄ In some strange way I injured my elbow last week. No idea how, there was no event where I said “oh shit”. It just gradually started hurting and swelling. My life-long strategy to avoid sports injuries1 had defied me. I applied ice and ibuprofin, the swelling went down, but my range of motion got worse. I’m glad I learned to use a knife and fork in English childhood, so I normally eat with my left hand. I noticed that that loss of range of motion occurred after I got home, when I started spending all day at the computer again. I might not use my elbow directly, but my right hand does a lot of typing and mousing. My desk set up is pretty ergonomic, with a good keyboard, a wrist rest for the mouse, and arm rests on my chair. But even so, did my computer use make my elbow get worse once I got home? I can’t imagine not using the computer, for me writing has become an unstoppable habit. But maybe I should use this opportunity to explore voice input - after all most people can speak faster than they can type. I tried this many years ago, when a colleague told me how good voice recognition was once it trained to you. I tried it, and indeed the voice recognition, even in those pre-AI days, was very good. But it didn’t work for me. When I’m writing I rapidly type words into Emacs, but almost immediately I go back to edit them. Write two sentences, edit them, write another, re-edit the paragraph. The back-and-forth between seeing my words and thinking about them is tight - I can’t just dictate my words. That made me reflect further. I only started using a computer for my writing in my 20s. At school I had to write longhand, and in university to type on typewriter. But those media don’t support the constant rewriting that I do now. Would I even have become a writer had the text editor not been invented? ❄ ❄ ❄ ❄ ❄ James Pritchard thinks that many developers are over-using agents at run-time in their products, when LLMs are better used as functions. The problem with agents isn’t that they don’t work. It’s that they work unpredictably. You trade a known execution path for “autonomy” that mostly means “I don’t know what it’s going to do.” When an agent-powered feature breaks in production, you’re debugging a conversation transcript, not a stack trace. Most “agent” use cases are actually workflows, a known sequence of steps where one or two of those steps happen to involve an LLM. You don’t need autonomy for that. You need a function call. He points out that functions compose predictably, so if you know the workflow, then composing in a program text is better than agents figuring out how to coordinate themselves. It’s faster, and needs less tokens. It’s usually easier to deal with failures, since the scope of the interaction is smaller. ❄ ❄ ❄ ❄ ❄ Pritchard also thinks that people use skills far more than they should. He thinks people accumulate folders of markdown skill files but LLMs use them inconsistently, often missing them when they’re needed, or bloating context when they are not. Many things that should go in skills should be other parts of a harness, preferably computational. Skills should only be used with deliberate, infrequent workflows. The skills obsession is a symptom of a deeper pattern: people reaching for configuration when they should be reaching for architecture. “The LLM doesn’t write good tests.” Don’t write a testing skill. Are your existing tests inconsistent? Is the test setup complex? Fix those things and it’ll write good tests without being told how. Point it at a test file you’re proud of. Code is clearer than English. […] The best setup is one where you barely need to configure the LLM at all. A clean codebase with clear patterns, a short project config for the non-obvious stuff, hooks for automation, and maybe one or two skills for specific workflows you run intentionally. That’s it. ❄ ❄ ❄ ❄ ❄ An oft-stated point about the rise of agentic programming is that we have to start dealing with non-determinism in our work. Of course that’s somewhat of a simplification, because some aspects of software development have long had to face non-determinism. A notable example of this is distributed systems, and a notable figure in helping us probe the truly uncomfortable waters of distributed systems is Kyle Kingsbury (Aphyr). Last month he dropped a long article (the pdf is 32 pages) on how he sees our LLM-enabled future. The title “The Future of Everything is Lies, I Guess” betrays his lack of enthusiasm for this future. Some readers are undoubtedly upset that I have not devoted more space to the wonders of machine learning—how amazing LLMs are at code generation, how incredible it is that Suno can turn hummed melodies into polished songs. But this is not an article about how fast or convenient it is to drive a car. We all know cars are fast. I am trying to ask what will happen to the shape of cities. It’s worth the long read, even if it isn’t terribly cheerful. Kingsbury brings up many of worries about AI’s growth from the perspective of someone who is clearly well-informed about their capabilities. His view is that the best response to all this is that we should stop. He wants to avoid using AIs for his writing, software, or personal life. He thinks those working for the AI companies should quit. And yet he also knows that these tools are useful, and wants to use them. I’m both a hoper and a doomer when it comes to our AI future. Fundamentally I see any powerful technology as a big bus: we are either on it, or get run over by it. I’m onboard the bus because I don’t think putting up some barriers would stop me being crushed by its wheels. Maybe if I’m on the bus I can join some people to influence the driver a bit. I’m also very reluctant to speculate on the future outcomes of anything, let alone something as powerful as this. Did the early industrialists in the late eighteenth century have any clue what the industrial revolution they unleashed would do? While it created many harms, it also created a massive rise in the living standards of millions of people, at least those whose countries were on the bus. AI may create benefits that I can’t really dream of, although I can glimpse it when it helps a friend stave off Parkinson’s disease. Those hopes are there, but Kingsbury’s article shines a light on the darker elements of the here-and-now, asking serious questions of responsibility a part of my work as a moderator of a Mastodon instance is to respond to user reports, and occasionally those reports are for CSAM, and I am legally obligated to review and submit that content to the NCMEC. I do not want to see these images, and I really wish I could unsee them. On dark mornings, when I sit down at my computer and find a moderation report for AI-generated images of sexual assault, I sometimes wish that the engineers working at OpenAI etc. had to see these images too. Perhaps it would make them reflect on the technology they are ushering into the world, and how “alignment” is working out in practice. Don’t do sports ↩
Backendopen article
Martin Fowler14/05/2026, 15:04
Bliki: Interrogatory LLM
When we need an LLM to perform a complex task, we often need to feed it a lot of context. Coming up with a design for a new feature requires descriptions of how we want the feature to appear to the user, guidelines on how it should be implemented, information on external systems to consult, and so on. All this can be several pages of markdown. The obvious way to do this is for a human to write this context, but an alternative is to use an LLM to write this context after interviewing a human. The way I can do this is to prompt the LLM to interrogate me. It should ask me all the questions it needs to create this appropriate context. I can feed much of the information it needs, and tell it other sources it needs to consult if it can't figure those out itself. Once it's done, it can then create the context report for another session (perhaps with another model) to carry out the next step. I first saw a decent description of this approach in Harper Reed's blog. A striking element of his approach is insisting that the LLM ask only one question at a time. (When I tried it, I found it needed to be frequently reminded of this.) Another way to use an interrogatory LLM is to give it a document, such as a software specification, that captures knowledge about a domain - and then ask the LLM to interview a human expert to determine if the document is accurate. This is an alternative to getting the human expert to read the document to review it. People often find reviewing hard, so a conversation with an LLM might be more fruitful, particularly if the document isn't well-written. Naturally we can use both of these, using one interrogatory LLM to build a document, then using other interrogatory LLMs to review it with other experts. The above is getting an LLM to create or assess context for a particular use of an LLM. But the technique is more broadly applicable. I've become a natural writer, someone who finds the process of writing an essential part of thinking. To really understand something, I need to write about it. But different people are different. Many folks find writing hard, often very hard. This can be a real problem when we need to get information out of someone's head into a form that other humans can consume. Maybe such people would find it easier to ask an LLM to interview them than to write a document themselves. Certainly the result will have that tang of AI-writing that folks like me shudder at - but that's better than not having the information itself, either due to rushed writing or no writing at all.
Backendopen article
Codrops14/05/2026, 10:37
Designing Ourselves: The New Obys Identity and Website
An inside look at the new Obys – from rethinking our identity to building a system where type, motion, and structure work as one, shaping a more precise and consistent experience.
Frontendopen article
Articles on Smashing Magazine — For Web Designers And Developers13/05/2026, 13:00
Practical Interface Patterns For AI Transparency (Part 2)
Why traditional loading patterns like spinners fail in agentic AI experiences, and how interface patterns that reveal the system’s process, status, and decision-making can improve transparency and build user trust.
Frontendopen article
Martin Fowler12/05/2026, 13:30
What is Code
Increasingly humans delegate writing code to agents. Will there even be source code in the future? To wrestle with this question, we have to understand what code is. Unmesh Joshi sees code as having two distinct but intertwined purposes: instructions to a machine and a conceptual model of the problem domain. He explores why it's vital to build a vocabulary to talk to the machine, how programming languages are thinking tools, and how this affects our future as we work with LLMs. more…
Backendopen article
Stripe Blog11/05/2026, 00:00
Five vertical SaaS insights from Sessions 2026
AI is forcing platforms to expand beyond pure software. See how vertical SaaS platforms are using payments, financial services, and agentic commerce to build more durable businesses.
Backendopen article
LogRocket Blog10/05/2026, 13:00
Why are AI companies buying the teams behind your favorite dev tools?
AI companies are buying developer tools as coding agents turn runtimes, package managers, and linters into strategic infrastructure. The post Why are AI companies buying the teams behind your favorite dev tools? appeared first on LogRocket Blog.
Frontendopen article
LogRocket Blog09/05/2026, 13:00
AI-assisted development governance: A practical guide
Learn how AI-assisted development governance uses rules, agents, hooks, and protocols to help AI coding tools produce safer, more consistent code. The post AI-assisted development governance: A practical guide appeared first on LogRocket Blog.
Frontendopen article
NN/g latest articles and announcements08/05/2026, 17:00
Designing AI Agents: 4 Lessons from China’s Qwen Agent
A study of Qwen's AI agent reveals 4 design lessons: support discoverability, reuse familiar patterns, handle personal data carefully, and protect user autonomy.
Design / UXopen article
NN/g latest articles and announcements08/05/2026, 17:00
UX Writing: FAQs from Practitioners
Get answers to frequently asked questions about UX writing from attendees of NN/G’s Writing Compelling Digital Copy course.
Design / UXopen article
Netflix TechBlog - Medium08/05/2026, 15:55
Scaling ArchUnit with Nebula ArchRules
Backendopen article
LogRocket Blog07/05/2026, 15:30
The project that made me question the UX process
A three-week mobile banking project taught me that the “proper” UX process is not always realistic. Sometimes, the better approach is to work with what you know, identify what you still need to learn, and make the strongest decision possible under real constraints. The post The project that made me question the UX process appeared first on LogRocket Blog.
Frontendopen article
LogRocket Blog06/05/2026, 21:52
How to rethink scope creep as strategic flexibility
Controlled scope creep can help PMs use capacity buffers, AI tools, and clear guardrails to turn new ideas into better outcomes. The post How to rethink scope creep as strategic flexibility appeared first on LogRocket Blog.
Frontendopen article
LogRocket Blog06/05/2026, 15:07
LaunchPod: AI Agents Fail for 2 Reasons. Crowdsourcing Solved Both.
According to SVP Julia Dalton, managing humans at scale and managing AI agents have a lot more in common than most people realize. The post LaunchPod: AI Agents Fail for 2 Reasons. Crowdsourcing Solved Both. appeared first on LogRocket Blog.
Frontendopen article
Articles on Smashing Magazine — For Web Designers And Developers06/05/2026, 10:00
The Architecture Of Local-First Web Development
An honest perspective on building local-first web apps in 2026, written for developers who’ve been doing this long enough to be skeptical of silver bullets.
Frontendopen article
LogRocket Blog05/05/2026, 15:30
Understanding A/B testing in UX research
A/B testing compares two versions of a design to see which performs better with real users. Here’s how UX teams can use it to test hypotheses, measure outcomes, and make smarter product decisions. The post Understanding A/B testing in UX research appeared first on LogRocket Blog.
Frontendopen article
LogRocket Blog05/05/2026, 13:00
How to build your first MCP server with Node.js
A step-by-step guide to building your first MCP server using Node.js, covering core concepts, tool design, and upgrading from file storage to MySQL. The post How to build your first MCP server with Node.js appeared first on LogRocket Blog.
Frontendopen article
Articles on Smashing Magazine — For Web Designers And Developers05/05/2026, 08:00
Rethinking The Experience Of System Tools
Design always starts with function — function shapes form. But if that function can’t be made completely invisible and people still have to interact with it, it inevitably becomes part of their experience. In this article, Kyrylo Levashov shares four common software design assumptions.
Frontendopen article
Netflix TechBlog - Medium04/05/2026, 16:01
Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph
Backendopen article
LogRocket Blog04/05/2026, 13:30
Using Next.js security headers to strengthen app security
Using security headers in your Next.js apps is a highly effective way to secure websites from common security threats. The post Using Next.js security headers to strengthen app security appeared first on LogRocket Blog.
Frontendopen article
LogRocket Blog04/05/2026, 13:00
AI dev tool power rankings & comparison [May 2026]
A deep dive into May 2026’s AI model and tool rankings. We break down performance, usability, pricing, and real-world capabilities across 50+ features to help you pick the right tools for your development workflow. The post AI dev tool power rankings & comparison [May 2026] appeared first on LogRocket Blog.
Frontendopen article
LogRocket Blog04/05/2026, 13:00
Exploring Agent Browser: AI agents on the web
A practical guide to Agent Browser CLI. Learn how AI agents navigate, snapshot, and interact with web pages using stable references, enabling efficient automation and exploratory testing. The post Exploring Agent Browser: AI agents on the web appeared first on LogRocket Blog.
Frontendopen article
Netflix TechBlog - Medium01/05/2026, 21:03
State of Routing in Model Serving
Backendopen article
NN/g latest articles and announcements01/05/2026, 17:00
Selection Criteria: How to Pick Your Participants
Rigorous selection criteria protect study validity. Learn how to define inclusion, exclusion, and diversity criteria to avoid costly misrecruits.
Design / UXopen article
Articles on Smashing Magazine — For Web Designers And Developers01/05/2026, 08:00
Designing Stable Interfaces For Streaming Content
Streaming UIs are an easy concept on the surface, but are quite complicated in practice. There are many considerations that need to be accounted for, from layout shifts and motion preferences to proper markup and various states, that may not be instantly obvious. What happens if the stream is interrupted? Can users tab through the UI on the keyboard as it shifts? What ARIA attributes might be needed?
Frontendopen article
Stripe Blog29/04/2026, 00:00
Giving agents the ability to pay
Link’s wallet for agents gives agents programmatic access to Link, including the ability to generate a one-time-use card or Shared Payment Token (SPT) backed by the cards and bank accounts already in your wallet. It’s built on Stripe’s new Issuing for agents.
Backendopen article
Stripe Blog29/04/2026, 00:00
Everything we announced at Sessions 2026
We’re making Stripe even more programmable; protecting and propelling your business with the strength of the Stripe network; and building economic infrastructure for AI.
Backendopen article
Netflix TechBlog - Medium24/04/2026, 15:06
Scaling Camera File Processing at Netflix
Backendopen article
Netflix TechBlog - Medium17/04/2026, 15:01
The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale
Backendopen article
Netflix TechBlog - Medium10/04/2026, 16:26
Evaluating Netflix Show Synopses with LLM-as-a-Judge
Backendopen article
Stripe Blog07/04/2026, 00:00
How agents, digital wallets, and trust are rewriting checkout
We analyzed checkout activity across more than 20K businesses, surveyed shoppers and ecommerce leaders, and gathered insights from businesses on the Stripe network to understand what’s changing in online conversion.
Backendopen article
Netflix TechBlog - Medium06/04/2026, 22:15
Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale
Backendopen article
Netflix TechBlog - Medium04/04/2026, 00:44
Powering Multimodal Intelligence for Video Search
Backendopen article
Stripe Blog02/04/2026, 00:00
Insights from Shoptalk 2026: How agents are changing retail
Retailers know search and discovery have already shifted. What comes next is less settled. From embedded checkout to emerging third-party surfaces, here’s how ecommerce and AI leaders are integrating agentic commerce.
Backendopen article
Stripe Blog24/03/2026, 00:00
How Stripe Radar helps prevent free trial abuse
Radar now helps prevent free trial abuse with just one click. When enabled, Radar predicts the presence of abusive behavior that violates common trial terms, such as repeated trial signup or missed cancellations, with 90% accuracy.
Backendopen article
Stripe Blog20/03/2026, 00:00
Three of the biggest fraud trends from MRC Vegas 2026
The most sophisticated fraud teams are shifting from one-size-fits-all fraud approaches to more dynamic, tailored interventions. They are removing friction for trusted users, embedding fraud detection directly into agentic transactions, and investing in multilayered identity verification to combat deepfakes.
Backendopen article
Stripe Blog19/03/2026, 00:00
Testing the impact of Adaptive Pricing across 1.5M subscription checkout sessions
Adaptive Pricing is now available for subscriptions, allowing businesses to automatically localize prices in 150+ countries while Stripe handles currency conversion. In an A/B test across 1.5 million subscription checkouts, businesses saw 4.7% higher conversion and 5.4% higher LTV per session, on average.
Backendopen article
Google AI Blog29/03/2024, 18:03
Generative AI to quantify uncertainty in weather forecasting
Posted by Lizao (Larry) Li, Software Engineer, and Rob Carver, Research Scientist, Google Research Accurate weather forecasts can have a direct impact on people’s lives, from helping make routine decisions, like what to pack for a day’s activities, to informing urgent actions, for example, protecting people in the face of hazardous weather conditions. The importance of accurate and timely weather forecasts will only increase as the climate changes. Recognizing this, we at Google have been investing in weather and climate research to help ensure that the forecasting technology of tomorrow can meet the demand for reliable weather information. Some of our recent innovations include MetNet-3, Google's high-resolution forecasts up to 24-hours into the future, and GraphCast, a weather model that can predict weather up to 10 days ahead. Weather is inherently stochastic. To quantify the uncertainty, traditional methods rely on physics-based simulation to generate an ensemble of forecasts. However, it is computationally costly to generate a large ensemble so that rare and extreme weather events can be discerned and characterized accurately. With that in mind, we are excited to announce our latest innovation designed to accelerate progress in weather forecasting, Scalable Ensemble Envelope Diffusion Sampler (SEEDS), recently published in Science Advances. SEEDS is a generative AI model that can efficiently generate ensembles of weather forecasts at scale at a small fraction of the cost of traditional physics-based forecasting models. This technology opens up novel opportunities for weather and climate science, and it represents one of the first applications to weather and climate forecasting of probabilistic diffusion models, a generative AI technology behind recent advances in media generation. The need for probabilistic forecasts: the butterfly effect American Association for the Advancement of Science meeting in Washington, D.C., MIT meteorology professor Ed Lorenz gave a talk entitled, “Does the Flap of a Butterfly's Wings in Brazil Set Off a Tornado in Texas?” which contributed to the term “butterfly effect”. He was building on his earlier, landmark 1963 paper where he examined the feasibility of “very-long-range weather prediction” and described how errors in initial conditions grow exponentially when integrated in time with numerical weather prediction models. This exponential error growth, known as chaos, results in a deterministic predictability limit that restricts the use of individual forecasts in decision making, because they do not quantify the inherent uncertainty of weather conditions. This is particularly problematic when forecasting extreme weather events, such as hurricanes, heatwaves, or floods. Recognizing the limitations of deterministic forecasts, weather agencies around the world issue probabilistic forecasts. Such forecasts are based on ensembles of deterministic forecasts, each of which is generated by including synthetic noise in the initial conditions and stochasticity in the physical processes. Leveraging the fast error growth rate in weather models, the forecasts in an ensemble are purposefully different: the initial uncertainties are tuned to generate runs that are as different as possible and the stochastic processes in the weather model introduce additional differences during the model run. The error growth is mitigated by averaging all the forecasts in the ensemble and the variability in the ensemble of forecasts quantifies the uncertainty of the weather conditions. While effective, generating these probabilistic forecasts is computationally costly. They require running highly complex numerical weather models on massive supercomputers multiple times. Consequently, many operational weather forecasts can only afford to generate ~10–50 ensemble members for each forecast cycle. This is a problem for users concerned with the likelihood of rare but high-impact weather events, which typically require much larger ensembles to assess beyond a few days. For instance, one would need a 10,000-member ensemble to forecast the likelihood of events with 1% probability of occurrence with a relative error less than 10%. Quantifying the probability of such extreme events could be useful, for example, for emergency management preparation or for energy traders. SEEDS: AI-enabled advances paper, we present the Scalable Ensemble Envelope Diffusion Sampler (SEEDS), a generative AI technology for weather forecast ensemble generation. SEEDS is based on denoising diffusion probabilistic models, a state-of-the-art generative AI method pioneered in part by Google Research. SEEDS can generate a large ensemble conditioned on as few as one or two forecasts from an operational numerical weather prediction system. The generated ensembles not only yield plausible real-weather–like forecasts but also match or exceed physics-based ensembles in skill metrics such as the rank histogram, the root-mean-squared error (RMSE), and the continuous ranked probability score (CRPS). In particular, the generated ensembles assign more accurate likelihoods to the tail of the forecast distribution, such as ±2σ and ±3σ weather events. Most importantly, the computational cost of the model is negligible when compared to the hours of computational time needed by supercomputers to make a forecast. It has a throughput of 256 ensemble members (at 2° resolution) per 3 minutes on Google Cloud TPUv3-32 instances and can easily scale to higher throughput by deploying more accelerators. SEEDS generates an order-of-magnitude more samples to in-fill distributions of weather patterns. Generating plausible weather forecasts Global Ensemble Forecast System, GEFS) for a particular date during the 2022 European heat waves. We also compare the results to the forecasts from a Gaussian model that predicts the univariate mean and standard deviation of each atmospheric field at each location, a common and computationally efficient but less sophisticated data-driven approach. This Gaussian model is meant to characterize the output of pointwise post-processing, which ignores correlations and treats each grid point as an independent random variable. In contrast, a real weather map would have detailed correlational structures. Because SEEDS directly models the joint distribution of the atmospheric state, it realistically captures both the spatial covariance and the correlation between mid-tropospheric geopotential and mean sea level pressure, both of which are closely related and are commonly used by weather forecasters for evaluation and verification of forecasts. Gradients in the mean sea level pressure are what drive winds at the surface, while gradients in mid-tropospheric geopotential create upper-level winds that move large-scale weather patterns. The generated samples from SEEDS shown in the figure below (frames Ca–Ch) display a geopotential trough west of Portugal with spatial structure similar to that found in the operational U.S. forecasts or the reanalysis based on observations. Although the Gaussian model predicts the marginal univariate distributions adequately, it fails to capture cross-field or spatial correlations. This hinders the assessment of the effects that these anomalies may have on hot air intrusions from North Africa, which can exacerbate heat waves over Europe. Stamp maps over Europe on 2022/07/14 at 0:00 UTC. The contours are for the mean sea level pressure (dashed lines mark isobars below 1010 hPa) while the heatmap depicts the geopotential height at the 500 hPa pressure level. (A) The ERA5 reanalysis, a proxy for real observations. (Ba-Bb) 2 members from the 7-day U.S. operational forecasts used as seeds to our model. (Ca-Ch) 8 samples drawn from SEEDS. (Da-Dh) 8 non-seeding members from the 7-day U.S. operational ensemble forecast. (Ea-Ed) 4 samples from a pointwise Gaussian model parameterized by the mean and variance of the entire U.S. operational ensemble. Covering extreme events more accurately SEEDS provides better statistical coverage of the 2022/07/14 European extreme heat event, denoted by the brown star . Each plot shows the values of the total column-integrated water vapor (TCVW) vs. temperature over a grid point near Lisbon, Portugal from 16,384 samples generated by our models, shown as green dots, conditioned on 2 seeds (blue squares) taken from the 7-day U.S. operational ensemble forecasts (denoted by the sparser brown triangles). The valid forecast time is 1:00 local time. The solid contour levels correspond to iso-proportions of the kernel density of SEEDS, with the outermost one encircling 95% of the mass and 11.875% between each level. Conclusion and future outlook Acknowledgements All SEEDS authors, Lizao Li, Rob Carver, Ignacio Lopez-Gomez, Fei Sha and John Anderson, co-authored this blog post, with Carla Bromberg as Program Lead. We also thank Tom Small who designed the animation. Our colleagues at Google Research have provided invaluable advice to the SEEDS work. Among them, we thank Leonardo Zepeda-Núñez, Zhong Yi Wan, Stephan Rasp, Stephan Hoyer, and Tapio Schneider for their inputs and useful discussion. We thank Tyler Russell for additional technical program management, as well as Alex Merose for data coordination and support. We also thank Cenk Gazen, Shreya Agrawal, and Jason Hickey for discussions in the early stage of the SEEDS work.
AI / MLopen article
Google AI Blog28/03/2024, 20:53
AutoBNN: Probabilistic time series forecasting with compositional bayesian neural networks
Posted by Urs Köster, Software Engineer, Google Research Time series problems are ubiquitous, from forecasting weather and traffic patterns to understanding economic trends. Bayesian approaches start with an assumption about the data's patterns (prior probability), collecting evidence (e.g., new time series data), and continuously updating that assumption to form a posterior probability distribution. Traditional Bayesian approaches like Gaussian processes (GPs) and Structural Time Series are extensively used for modeling time series data, e.g., the commonly used Mauna Loa CO2 dataset. However, they often rely on domain experts to painstakingly select appropriate model components and may be computationally expensive. Alternatives such as neural networks lack interpretability, making it difficult to understand how they generate forecasts, and don't produce reliable confidence intervals. To that end, we introduce AutoBNN, a new open-source package written in JAX. AutoBNN automates the discovery of interpretable time series forecasting models, provides high-quality uncertainty estimates, and scales effectively for use on large datasets. We describe how AutoBNN combines the interpretability of traditional probabilistic approaches with the scalability and flexibility of neural networks. AutoBNN line of research that over the past decade has yielded improved predictive accuracy by modeling time series using GPs with learned kernel structures. The kernel function of a GP encodes assumptions about the function being modeled, such as the presence of trends, periodicity or noise. With learned GP kernels, the kernel function is defined compositionally: it is either a base kernel (such as Linear, Quadratic, Periodic, Matérn or ExponentiatedQuadratic) or a composite that combines two or more kernel functions using operators such as Addition, Multiplication, or ChangePoint. This compositional kernel structure serves two related purposes. First, it is simple enough that a user who is an expert about their data, but not necessarily about GPs, can construct a reasonable prior for their time series. Second, techniques like Sequential Monte Carlo can be used for discrete searches over small structures and can output interpretable results. Bayesian neural networks (BNNs) while retaining the compositional kernel structure. A BNN is a neural network with a probability distribution over weights rather than a fixed set of weights. This induces a distribution over outputs, capturing uncertainty in the predictions. BNNs bring the following advantages over GPs: First, training large GPs is computationally expensive, and traditional training algorithms scale as the cube of the number of data points in the time series. In contrast, for a fixed width, training a BNN will often be approximately linear in the number of data points. Second, BNNs lend themselves better to GPU and TPU hardware acceleration than GP training operations. Third, compositional BNNs can be easily combined with traditional deep BNNs, which have the ability to do feature discovery. One could imagine "hybrid" architectures, in which users specify a top-level structure of Add(Linear, Periodic, Deep), and the deep BNN is left to learn the contributions from potentially high-dimensional covariate information. How might one translate a GP with compositional kernels into a BNN then? A single layer neural network will typically converge to a GP as the number of neurons (or "width") goes to infinity. More recently, researchers have discovered a correspondence in the other direction — many popular GP kernels (such as Matern, ExponentiatedQuadratic, Polynomial or Periodic) can be obtained as infinite-width BNNs with appropriately chosen activation functions and weight distributions. Furthermore, these BNNs remain close to the corresponding GP even when the width is very much less than infinite. For example, the figures below show the difference in the covariance between pairs of observations, and regression results of the true GPs and their corresponding width-10 neural network versions. Comparison of Gram matrices between true GP kernels (top row) and their width 10 neural network approximations (bottom row). Comparison of regression results between true GP kernels (top row) and their width 10 neural network approximations (bottom row). BNN analogues of the Addition and Multiplication operators over GPs, and input warping to produce periodic kernels. BNN addition is straightforwardly given by adding the outputs of the component BNNs. BNN multiplication is achieved by multiplying the activations of the hidden layers of the BNNs and then applying a shared dense layer. We are therefore limited to only multiplying BNNs with the same hidden width. Using AutoBNN package is available within Tensorflow Probability. It is implemented in JAX and uses the flax.linen neural network library. It implements all of the base kernels and operators discussed so far (Linear, Quadratic, Matern, ExponentiatedQuadratic, Periodic, Addition, Multiplication) plus one new kernel and three new operators: a OneLayer kernel, a single hidden layer ReLU BNN, a ChangePoint operator that allows smoothly switching between two kernels, a LearnableChangePoint operator which is the same as ChangePoint except position and slope are given prior distributions and can be learnt from the data, and a WeightedSum operator. WeightedSum combines two or more BNNs with learnable mixing weights, where the learnable weights follow a Dirichlet prior. By default, a flat Dirichlet distribution with concentration 1.0 is used. WeightedSums allow a "soft" version of structure discovery, i.e., training a linear combination of many possible models at once. In contrast to structure discovery with discrete structures, such as in AutoGP, this allows us to use standard gradient methods to learn structures, rather than using expensive discrete optimization. Instead of evaluating potential combinatorial structures in series, WeightedSum allows us to evaluate them in parallel. To easily enable exploration, AutoBNN defines a number of model structures that contain either top-level or internal WeightedSums. The names of these models can be used as the first parameter in any of the estimator constructors, and include things like sum_of_stumps (the WeightedSum over all the base kernels) and sum_of_shallow (which adds all possible combinations of base kernels with all operators). Illustration of the sum_of_stumps model. The bars in the top row show the amount by which each base kernel contributes, and the bottom row shows the function represented by the base kernel. The resulting weighted sum is shown on the right. M3 dataset. The six base structures were ExponentiatedQuadratic (which is the same as the Radial Basis Function kernel, or RBF for short), Matern, Linear, Quadratic, OneLayer and Periodic kernels. The figure shows the MAP estimates of their weights over an ensemble of 32 particles. All of the high likelihood particles gave a large weight to the Periodic component, low weights to Linear, Quadratic and OneLayer, and a large weight to either RBF or Matern. Parallel coordinates plot of the MAP estimates of the base kernel weights over 32 particles. The sum_of_stumps model was trained on the N374 series from the M3 dataset (insert in blue). Darker lines correspond to particles with higher likelihoods. WeightedSums as the inputs to other operators, it is possible to express rich combinatorial structures, while keeping models compact and the number of learnable weights small. As an example, we include the sum_of_products model (illustrated in the figure below) which first creates a pairwise product of two WeightedSums, and then a sum of the two products. By setting some of the weights to zero, we can create many different discrete structures. The total number of possible structures in this model is 216, since there are 16 base kernels that can be turned on or off. All these structures are explored implicitly by training just this one model. Illustration of the "sum_of_products" model. Each of the four WeightedSums have the same structure as the "sum_of_stumps" model. Periodic and either the Matern or ExponentiatedQuadratic) lead to overfitting on many datasets. To prevent this, we have defined model classes like sum_of_safe_shallow that exclude such products when performing structure discovery with WeightedSums. For training, AutoBNN provides AutoBnnMapEstimator and AutoBnnMCMCEstimator to perform MAP and MCMC inference, respectively. Either estimator can be combined with any of the six likelihood functions, including four based on normal distributions with different noise characteristics for continuous data and two based on the negative binomial distribution for count data. Result from running AutoBNN on the Mauna Loa CO2 dataset in our example colab. The model captures the trend and seasonal component in the data. Extrapolating into the future, the mean prediction slightly underestimates the actual trend, while the 95% confidence interval gradually increases. scikit-learn–inspired estimator interface: import autobnn as ab model = ab.operators.Add( bnns=(ab.kernels.PeriodicBNN(width=50), ab.kernels.LinearBNN(width=50), ab.kernels.MaternBNN(width=50))) estimator = ab.estimators.AutoBnnMapEstimator( model, 'normal_likelihood_logistic_noise', jax.random.PRNGKey(42), periods=[12]) estimator.fit(my_training_data_xs, my_training_data_ys) low, mid, high = estimator.predict_quantiles(my_training_data_xs) Conclusion AutoBNN provides a powerful and flexible framework for building sophisticated time series prediction models. By combining the strengths of BNNs and GPs with compositional kernels, AutoBNN opens a world of possibilities for understanding and forecasting complex data. We invite the community to try the colab, and leverage this library to innovate and solve real-world challenges. Acknowledgements AutoBNN was written by Colin Carroll, Thomas Colthurst, Urs Köster and Srinivas Vasudevan. We would like to thank Kevin Murphy, Brian Patton and Feras Saad for their advice and feedback.
AI / MLopen article
Google AI Blog20/03/2024, 20:54
Computer-aided diagnosis for lung cancer screening
Posted by Atilla Kiraly, Software Engineer, and Rory Pilgrim, Product Manager, Google Research Lung cancer is the leading cause of cancer-related deaths globally with 1.8 million deaths reported in 2020. Late diagnosis dramatically reduces the chances of survival. Lung cancer screening via computed tomography (CT), which provides a detailed 3D image of the lungs, has been shown to reduce mortality in high-risk populations by at least 20% by detecting potential signs of cancers earlier. In the US, screening involves annual scans, with some countries or cases recommending more or less frequent scans. The United States Preventive Services Task Force recently expanded lung cancer screening recommendations by roughly 80%, which is expected to increase screening access for women and racial and ethnic minority groups. However, false positives (i.e., incorrectly reporting a potential cancer in a cancer-free patient) can cause anxiety and lead to unnecessary procedures for patients while increasing costs for the healthcare system. Moreover, efficiency in screening a large number of individuals can be challenging depending on healthcare infrastructure and radiologist availability. At Google we have previously developed machine learning (ML) models for lung cancer detection, and have evaluated their ability to automatically detect and classify regions that show signs of potential cancer. Performance has been shown to be comparable to that of specialists in detecting possible cancer. While they have achieved high performance, effectively communicating findings in realistic environments is necessary to realize their full potential. To that end, in “Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the US and Japan”, published in Radiology AI, we investigate how ML models can effectively communicate findings to radiologists. We also introduce a generalizable user-centric interface to help radiologists leverage such models for lung cancer screening. The system takes CT imaging as input and outputs a cancer suspicion rating using four categories (no suspicion, probably benign, suspicious, highly suspicious) along with the corresponding regions of interest. We evaluate the system’s utility in improving clinician performance through randomized reader studies in both the US and Japan, using the local cancer scoring systems (Lung-RADSs V1.1 and Sendai Score) and image viewers that mimic realistic settings. We found that reader specificity increases with model assistance in both reader studies. To accelerate progress in conducting similar studies with ML models, we have open-sourced code to process CT images and generate images compatible with the picture archiving and communication system (PACS) used by radiologists. Developing an interface to communicate model results alpha-numeric score to indicate the lung cancer risk and follow-up recommendations. When assessing patients, radiologists load the CT in their workstation to read the case, find lung nodules or lesions, and apply set guidelines to determine follow-up decisions. Our first step was to improve the previously developed ML models through additional training data and architectural improvements, including self-attention. Then, instead of targeting specific guidelines, we experimented with a complementary way of communicating AI results independent of guidelines or their particular versions. Specifically, the system output offers a suspicion rating and localization (regions of interest) for the user to consider in conjunction with their own specific guidelines. The interface produces output images directly associated with the CT study, requiring no changes to the user’s workstation. The radiologist only needs to review a small set of additional images. There is no other change to their system or interaction with the system. Example of the assistive lung cancer screening system outputs. Results for the radiologist’s evaluation are visualized on the location of the CT volume where the suspicious lesion is found. The overall suspicion is displayed at the top of the CT images. Circles highlight the suspicious lesions while squares show a rendering of the same lesion from a different perspective, called a sagittal view. prior work. The models coordinate with each other to first segment the lungs, obtain an overall assessment, locate three suspicious regions, then use the information to assign a suspicion rating to each region. The system was deployed on Google Cloud using a Google Kubernetes Engine (GKE) that pulled the images, ran the ML models, and provided results. This allows scalability and directly connects to servers where the images are stored in DICOM stores. Outline of the Google Cloud deployment of the assistive lung cancer screening system and the directional calling flow for the individual components that serve the images and compute results. Images are served to the viewer and to the system using Google Cloud services. The system is run on a Google Kubernetes Engine that pulls the images, processes them, and writes them back into the DICOM store. Reader studies area under the ROC curve (AUC) values. These were compared with and without assistance. A multi-case multi-reader study involves each case being reviewed by each reader twice, once with ML system assistance and once without. In this visualization one reader first reviews Set A without assistance (blue) and then with assistance (orange) after a wash-out period. A second reader group follows the opposite path by reading the same set of cases Set A with assistance first. Readers are randomized to these groups to remove the effect of ordering. specificity) by an absolute 5–7% compared to when they didn’t use the assistive system. This potentially means that for every 15–20 patients screened, one may be able to avoid unnecessary follow-up procedures, thus reducing their anxiety and the burden on the health care system. This can, in turn, help improve the sustainability of lung cancer screening programs, particularly as more people become eligible for screening. Reader specificity increases with ML model assistance in both the US-based and Japan-based reader studies. Specificity values were derived from reader scores from actionable findings (something suspicious was found) versus no actionable findings, compared against the true cancer outcome of the individual. Under model assistance, readers flagged fewer cancer-negative individuals for follow-up visits. Sensitivity for cancer positive individuals remained the same. Translating this into real-world impact through partnership DeepHealth, a leading AI-powered health informatics provider; and Apollo Radiology International a leading provider of Radiology services in India to explore paths for incorporating this system into future products. In addition, we are looking to help other researchers studying how best to integrate ML model results into clinical workflows by open sourcing code used for the reader study and incorporating the insights described in this blog. We hope that this will help accelerate medical imaging researchers looking to conduct reader studies for their AI models, and catalyze translational research in the field. Acknowledgements Key contributors to this project include Corbin Cunningham, Zaid Nabulsi, Ryan Najafi, Jie Yang, Charles Lau, Joseph R. Ledsam, Wenxing Ye, Diego Ardila, Scott M. McKinney, Rory Pilgrim, Hiroaki Saito, Yasuteru Shimamura, Mozziyar Etemadi, Yun Liu, David Melnick, Sunny Jansen, Nadia Harhen, David P. Nadich, Mikhail Fomitchev, Ziyad Helali, Shabir Adeel, Greg S. Corrado, Lily Peng, Daniel Tse, Shravya Shetty, Shruthi Prabhakara, Neeral Beladia, and Krish Eswaran. Thanks to Arnav Agharwal and Andrew Sellergren for their open sourcing support and Vivek Natarajan and Michael D. Howell for their feedback. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study, and Jonny Wong and Carli Sampson for coordinating the reader studies.
AI / MLopen article
Google AI Blog20/03/2024, 16:06
Using AI to expand global access to reliable flood forecasts
Posted by Yossi Matias, VP Engineering & Research, and Grey Nearing, Research Scientist, Google Research Floods are the most common natural disaster, and are responsible for roughly $50 billion in annual financial damages worldwide. The rate of flood-related disasters has more than doubled since the year 2000 partly due to climate change. Nearly 1.5 billion people, making up 19% of the world’s population, are exposed to substantial risks from severe flood events. Upgrading early warning systems to make accurate and timely information accessible to these populations can save thousands of lives per year. Driven by the potential impact of reliable flood forecasting on people’s lives globally, we started our flood forecasting effort in 2017. Through this multi-year journey, we advanced research over the years hand-in-hand with building a real-time operational flood forecasting system that provides alerts on Google Search, Maps, Android notifications and through the Flood Hub. However, in order to scale globally, especially in places where accurate local data is not available, more research advances were required. In “Global prediction of extreme floods in ungauged watersheds”, published in Nature, we demonstrate how machine learning (ML) technologies can significantly improve global-scale flood forecasting relative to the current state-of-the-art for countries where flood-related data is scarce. With these AI-based technologies we extended the reliability of currently-available global nowcasts, on average, from zero to five days, and improved forecasts across regions in Africa and Asia to be similar to what are currently available in Europe. The evaluation of the models was conducted in collaboration with the European Center for Medium Range Weather Forecasting (ECMWF). These technologies also enable Flood Hub to provide real-time river forecasts up to seven days in advance, covering river reaches across over 80 countries. This information can be used by people, communities, governments and international organizations to take anticipatory action to help protect vulnerable populations. Flood forecasting at Google launched a pilot early warning system in the Ganges-Brahmaputra river basin in India, with the hypothesis that ML could help address the challenging problem of reliable flood forecasting at scale. The pilot was further expanded the following year via the combination of an inundation model, real-time water level measurements, the creation of an elevation map and hydrologic modeling. In collaboration with academics, and, in particular, with the JKU Institute for Machine Learning we explored ML-based hydrologic models, showing that LSTM-based models could produce more accurate simulations than traditional conceptual and physics-based hydrology models. This research led to flood forecasting improvements that enabled the expansion of our forecasting coverage to include all of India and Bangladesh. We also worked with researchers at Yale University to test technological interventions that increase the reach and impact of flood warnings. Our hydrological models predict river floods by processing publicly available weather data like precipitation and physical watershed information. Such models must be calibrated to long data records from streamflow gauging stations in individual rivers. A low percentage of global river watersheds (basins) have streamflow gauges, which are expensive but necessary to supply relevant data, and it’s challenging for hydrological simulation and forecasting to provide predictions in basins that lack this infrastructure. Lower gross domestic product (GDP) is correlated with increased vulnerability to flood risks, and there is an inverse correlation between national GDP and the amount of publicly available data in a country. ML helps to address this problem by allowing a single model to be trained on all available river data and to be applied to ungauged basins where no data are available. In this way, models can be trained globally, and can make predictions for any river location. There is an inverse (log-log) correlation between the amount of publicly available streamflow data in a country and national GDP. Streamflow data from the Global Runoff Data Center. estimate uncertainty in river forecasts and showed how ML river forecast models synthesize information from multiple data sources. They demonstrated that these models can simulate extreme events reliably, even when those events are not part of the training data. In an effort to contribute to open science, in 2023 we open-sourced a community-driven dataset for large-sample hydrology in Nature Scientific Data. The river forecast model LSTMs perform well on the task of river forecasting. A diagram of the LSTM, which is a neural network that operates sequentially in time. An accessible primer can be found here. mixture density networks to produce a probabilistic forecast (i.e., predicted parameters of a probability distribution over streamflow). Specifically, the model predicts the parameters of a mixture of heavy-tailed probability density functions, called asymmetric Laplacian distributions, at each forecast time step. The result is a mixture density function, called a Countable Mixture of Asymmetric Laplacians (CMAL) distribution, which represents a probabilistic prediction of the volumetric flow rate in a particular river at a particular time. LSTM-based river forecast model architecture. Two LSTMs are applied in sequence, one ingesting historical weather data and one ingesting forecasted weather data. The model outputs are the parameters of a probability distribution over streamflow at each forecasted timestep. Input and training data Static watershed attributes representing geographical and geophysical variables: From the HydroATLAS project, including data like long-term climate indexes (precipitation, temperature, snow fractions), land cover, and anthropogenic attributes (e.g., a nighttime lights index as a proxy for human development). Historical meteorological time-series data: Used to spin up the model for one year prior to the issue time of a forecast. The data comes from NASA IMERG, NOAA CPC Global Unified Gauge-Based Analysis of Daily Precipitation, and the ECMWF ERA5-land reanalysis. Variables include daily total precipitation, air temperature, solar and thermal radiation, snowfall, and surface pressure. Forecasted meteorological time series over a seven-day forecast horizon: Used as input for the forecast LSTM. These data are the same meteorological variables listed above, and come from the ECMWF HRES atmospheric model. Training data are daily streamflow values from the Global Runoff Data Center over the time period 1980 - 2023. A single streamflow forecast model is trained using data from 5,680 diverse watershed streamflow gauges (shown below) to improve accuracy. Location of 5,680 streamflow gauges that supply training data for the river forecast model from the Global Runoff Data Center. Improving on the current state-of-the-art GloFAS version 4, the current state-of-the-art global flood forecasting system. These experiments showed that ML can provide accurate warnings earlier and over larger and more impactful events. The figure below shows the distribution of F1 scores when predicting different severity events at river locations around the world, with plus or minus 1 day accuracy. F1 scores are an average of precision and recall and event severity is measured by return period. For example, a 2-year return period event is a volume of streamflow that is expected to be exceeded on average once every two years. Our model achieves reliability scores at up to 4-day or 5-day lead times that are similar to or better, on average, than the reliability of GloFAS nowcasts (0-day lead time). Distributions of F1 scores over 2-year return period events in 2,092 watersheds globally during the time period 2014-2023 from GloFAS (blue) and our model (orange) at different lead times. On average, our model is statistically as accurate as GloFAS nowcasts (0–day lead time) up to 5 days in advance over 2-year (shown) and 1-year, 5-year, and 10-year events (not shown). paper for more information. Looking into the future Adaptation and Resilience efforts and reflects Google's commitment to address climate change while helping global communities become more resilient. We believe that AI and ML will continue to play a critical role in helping advance science and research towards climate action. We actively collaborate with several international aid organizations (e.g., the Centre for Humanitarian Data and the Red Cross) to provide actionable flood forecasts. Additionally, in an ongoing collaboration with the World Meteorological Organization (WMO) to support early warning systems for climate hazards, we are conducting a study to help understand how AI can help address real-world challenges faced by national flood forecasting agencies. While the work presented here demonstrates a significant step forward in flood forecasting, future work is needed to further expand flood forecasting coverage to more locations globally and other types of flood-related events and disasters, including flash floods and urban floods. We are looking forward to continuing collaborations with our partners in the academic and expert communities, local governments and the industry to reach these goals.
AI / MLopen article
Google AI Blog19/03/2024, 20:15
ScreenAI: A visual language model for UI and visually-situated language understanding
Posted by Srinivas Sunkara and Gilles Baechler, Software Engineers, Google Research Screen user interfaces (UIs) and infographics, such as charts, diagrams and tables, play important roles in human communication and human-machine interaction as they facilitate rich and interactive user experiences. UIs and infographics share similar design principles and visual language (e.g., icons and layouts), that offer an opportunity to build a single model that can understand, reason, and interact with these interfaces. However, because of their complexity and varied presentation formats, infographics and UIs present a unique modeling challenge. To that end, we introduce “ScreenAI: A Vision-Language Model for UI and Infographics Understanding”. ScreenAI improves upon the PaLI architecture with the flexible patching strategy from pix2struct. We train ScreenAI on a unique mixture of datasets and tasks, including a novel Screen Annotation task that requires the model to identify UI element information (i.e., type, location and description) on a screen. These text annotations provide large language models (LLMs) with screen descriptions, enabling them to automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. At only 5B parameters, ScreenAI achieves state-of-the-art results on UI- and infographic-based tasks (WebSRC and MoTIF), and best-in-class performance on Chart QA, DocVQA, and InfographicVQA compared to models of similar size. We are also releasing three new datasets: Screen Annotation to evaluate the layout understanding capability of the model, as well as ScreenQA Short and Complex ScreenQA for a more comprehensive evaluation of its QA capability. ScreenAI PaLI, composed of a multimodal encoder block and an autoregressive decoder. The PaLI encoder uses a vision transformer (ViT) that creates image embeddings and a multimodal encoder that takes the concatenation of the image and text embeddings as input. This flexible architecture allows ScreenAI to solve vision tasks that can be recast as text+image-to-text problems. On top of the PaLI architecture, we employ a flexible patching strategy introduced in pix2struct. Instead of using a fixed-grid pattern, the grid dimensions are selected such that they preserve the native aspect ratio of the input image. This enables ScreenAI to work well across images of various aspect ratios. The ScreenAI model is trained in two stages: a pre-training stage followed by a fine-tuning stage. First, self-supervised learning is applied to automatically generate data labels, which are then used to train ViT and the language model. ViT is frozen during the fine-tuning stage, where most data used is manually labeled by human raters. ScreenAI model architecture. Data generation publicly accessible web pages and following the programmatic exploration approach used for the RICO dataset for mobile apps. We then apply a layout annotator, based on the DETR model, that identifies and labels a wide range of UI elements (e.g., image, pictogram, button, text) and their spatial relationships. Pictograms undergo further analysis using an icon classifier capable of distinguishing 77 different icon types. This detailed classification is essential for interpreting the subtle information conveyed through icons. For icons that are not covered by the classifier, and for infographics and images, we use the PaLI image captioning model to generate descriptive captions that provide contextual information. We also apply an optical character recognition (OCR) engine to extract and annotate textual content on screen. We combine the OCR text with the previous annotations to create a detailed description of each screen. A mobile app screenshot with generated annotations that include UI elements and their descriptions, e.g., TEXT elements also contain the text content from OCR, IMAGE elements contain image captions, LIST_ITEMs contain all their child elements. LLM-based data generation PaLM 2 to generate input-output pairs in a two-step process. First, screen annotations are generated using the technique outlined above, then we craft a prompt around this schema for the LLM to create synthetic data. This process requires prompt engineering and iterative refinement to find an effective prompt. We assess the generated data's quality through human validation against a quality threshold. You only speak JSON. Do not write text that isn’t JSON. You are given the following mobile screenshot, described in words. Can you generate 5 questions regarding the content of the screenshot as well as the corresponding short answers to them? The answer should be as short as possible, containing only the necessary information. Your answer should be structured as follows: questions: [ {{question: the question, answer: the answer }}, ... ] {THE SCREEN SCHEMA} A sample prompt for QA data generation. Question answering: The model is asked to answer questions regarding the content of the screenshots, e.g., “When does the restaurant open?” Screen navigation: The model is asked to convert a natural language utterance into an executable action on a screen, e.g., “Click the search button.” Screen summarization: The model is asked to summarize the screen content in one or two sentences. Block diagram of our workflow for generating data for QA, summarization and navigation tasks using existing ScreenAI models and LLMs. Each task uses a custom prompt to emphasize desired aspects, like questions related to counting, involving reasoning, etc. LLM-generated data. Examples for screen QA, navigation and summarization. For navigation, the action bounding box is displayed in red on the screenshot. Experiments and results ChartQA, DocVQA, Multi page DocVQA, InfographicVQA, OCR VQA, Web SRC and ScreenQA. For navigation, datasets used include Referring Expressions, MoTIF, Mug, and Android in the Wild. Finally, we use Screen2Words for screen summarization and Widget Captioning for describing specific UI elements. Along with the fine-tuning datasets, we evaluate the fine-tuned ScreenAI model using three novel benchmarks: Screen Annotation: Enables the evaluation model layout annotations and spatial understanding capabilities. ScreenQA Short: A variation of ScreenQA, where its ground truth answers have been shortened to contain only the relevant information that better aligns with other QA tasks. Complex ScreenQA: Complements ScreenQA Short with more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios. The fine-tuned ScreenAI model achieves state-of-the-art results on various UI and infographic-based tasks (WebSRC and MoTIF) and best-in-class performance on Chart QA, DocVQA, and InfographicVQA compared to models of similar size. ScreenAI achieves competitive performance on Screen2Words and OCR-VQA. Additionally, we report results on the new benchmark datasets introduced to serve as a baseline for further research. Comparing model performance of ScreenAI with state-of-the-art (SOTA) models of similar size. Model performance increases with size, and the performance has not saturated even at the largest size of 5B params. Conclusion Acknowledgements This project is the result of joint work with Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen and Abhanshu Sharma. We thank Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li,Yang Li, Radu Soricut, and Tania Bedrax-Weiss for their insightful feedback and discussions, along with Rahul Aralikatte, Hao Cheng and Daniel Kim for their support in data preparation. We also thank Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their leadership, vision and support. We are very grateful toTom Small for helping us create the animation in this post.
AI / MLopen article
Google AI Blog19/03/2024, 15:00
SCIN: A new resource for representative dermatology images
Posted by Pooja Rao, Research Scientist, Google Research Health datasets play a crucial role in research and medical education, but it can be challenging to create a dataset that represents the real world. For example, dermatology conditions are diverse in their appearance and severity and manifest differently across skin tones. Yet, existing dermatology image datasets often lack representation of everyday conditions (like rashes, allergies and infections) and skew towards lighter skin tones. Furthermore, race and ethnicity information is frequently missing, hindering our ability to assess disparities or create solutions. To address these limitations, we are releasing the Skin Condition Image Network (SCIN) dataset in collaboration with physicians at Stanford Medicine. We designed SCIN to reflect the broad range of concerns that people search for online, supplementing the types of conditions typically found in clinical datasets. It contains images across various skin tones and body parts, helping to ensure that future AI tools work effectively for all. We've made the SCIN dataset freely available as an open-access resource for researchers, educators, and developers, and have taken careful steps to protect contributor privacy. Example set of images and metadata from the SCIN dataset. Dataset composition tanning propensity (self-reported Fitzpatrick Skin Type, i.e., sFST), and to describe the texture, duration and symptoms related to their concern. One to three dermatologists labeled each contribution with up to five dermatology conditions, along with a confidence score for each label. The SCIN dataset contains these individual labels, as well as an aggregated and weighted differential diagnosis derived from them that could be useful for model testing or training. These labels were assigned retrospectively and are not equivalent to a clinical diagnosis, but they allow us to compare the distribution of dermatology conditions in the SCIN dataset with existing datasets. The SCIN dataset contains largely allergic, inflammatory and infectious conditions while datasets from clinical sources focus on benign and malignant neoplasms. Monk Skin Tone (eMST) for the images. This allowed comparison of the skin condition and skin type distributions to those in existing dermatology datasets. Although we did not selectively target any skin types or skin tones, the SCIN dataset has a balanced Fitzpatrick skin type distribution (with more of Types 3, 4, 5, and 6) compared to similar datasets from clinical sources. Self-reported and dermatologist-estimated Fitzpatrick Skin Type distribution in the SCIN dataset compared with existing un-enriched dermatology datasets (Fitzpatrick17k, PH², SKINL2, and PAD-UFES-20). Fitzpatrick Skin Type scale was originally developed as a photo-typing scale to measure the response of skin types to UV radiation, and it is widely used in dermatology research. The Monk Skin Tone scale is a newer 10-shade scale that measures skin tone rather than skin phototype, capturing more nuanced differences between the darker skin tones. While neither scale was intended for retrospective estimation using images, the inclusion of these labels is intended to enable future research into skin type and tone representation in dermatology. For example, the SCIN dataset provides an initial benchmark for the distribution of these skin types and tones in the US population. The SCIN dataset has a high representation of women and younger individuals, likely reflecting a combination of factors. These could include differences in skin condition incidence, propensity to seek health information online, and variations in willingness to contribute to research across demographics. Crowdsourcing method research paper co-authored with investigators at Stanford Medicine. This approach empowers individuals to play an active role in healthcare research. It allows us to reach people at earlier stages of their health concerns, potentially before they seek formal care. Crucially, this method uses advertisements on web search result pages — the starting point for many people’s health journey — to connect with participants. Our results demonstrate that crowdsourcing can yield a high-quality dataset with a low spam rate. Over 97.5% of contributions were genuine images of skin conditions. After performing further filtering steps to exclude images that were out of scope for the SCIN dataset and to remove duplicates, we were able to release nearly 90% of the contributions received over the 8-month study period. Most images were sharp and well-exposed. Approximately half of the contributions include self-reported demographics, and 80% contain self-reported information relating to the skin condition, such as texture, duration, or other symptoms. We found that dermatologists’ ability to retrospectively assign a differential diagnosis depended more on the availability of self-reported information than on image quality. Dermatologist confidence in their labels (scale from 1-5) depended on the availability of self-reported demographic and symptom information. Data Use License prohibits attempts to re-identify contributors. We hope the SCIN dataset will be a helpful resource for those working to advance inclusive dermatology research, education, and AI tool development. By demonstrating an alternative to traditional dataset creation methods, SCIN paves the way for more representative datasets in areas where self-reported data or retrospective labeling is feasible. Acknowledgements We are grateful to all our co-authors Abbi Ward, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, Pradeep Kumar S, Tiya Tiyasirisokchai, Sunny Virmani, Renee Wong, Yossi Matias, Greg S. Corrado, Dale R. Webster, Dawn Siegel (Stanford Medicine), Steven Lin (Stanford Medicine), Justin Ko (Stanford Medicine), Alan Karthikesalingam and Christopher Semturs. We also thank Yetunde Ibitoye, Sami Lachgar, Lisa Lehmann, Javier Perez, Margaret Ann Smith (Stanford Medicine), Rachelle Sico, Amit Talreja, Annisah Um’rani and Wayne Westerlind for their essential contributions to this work. Finally, we are grateful to Heather Cole-Lewis, Naama Hammel, Ivor Horn, Michael Howell, Yun Liu, and Eric Teasley for their insightful comments on the study design and manuscript.
AI / MLopen article
Google AI Blog18/03/2024, 18:41
MELON: Reconstructing 3D objects from images with unknown poses
Posted by Mark Matthews, Senior Software Engineer, and Dmitry Lagun, Research Scientist, Google Research A person's prior experience and understanding of the world generally enables them to easily infer what an object looks like in whole, even if only looking at a few 2D pictures of it. Yet the capacity for a computer to reconstruct the shape of an object in 3D given only a few images has remained a difficult algorithmic problem for years. This fundamental computer vision task has applications ranging from the creation of e-commerce 3D models to autonomous vehicle navigation. A key part of the problem is how to determine the exact positions from which images were taken, known as pose inference. If camera poses are known, a range of successful techniques — such as neural radiance fields (NeRF) or 3D Gaussian Splatting — can reconstruct an object in 3D. But if these poses are not available, then we face a difficult “chicken and egg” problem where we could determine the poses if we knew the 3D object, but we can’t reconstruct the 3D object until we know the camera poses. The problem is made harder by pseudo-symmetries — i.e., many objects look similar when viewed from different angles. For example, square objects like a chair tend to look similar every 90° rotation. Pseudo-symmetries of an object can be revealed by rendering it on a turntable from various angles and plotting its photometric self-similarity map. Self-Similarity map of a toy truck model. Left: The model is rendered on a turntable from various azimuthal angles, θ. Right: The average L2 RGB similarity of a rendering from θ with that of θ*. The pseudo-similarities are indicated by the dashed red lines. ill-posed, with naïve approaches often converging to local minima. In practice, such an approach might mistake the back view as the front view of an object, because they share a similar silhouette. Previous techniques (such as BARF or SAMURAI) side-step this problem by relying on an initial pose estimate that starts close to the global minima. But how can we approach this if those aren’t available? Methods, such as GNeRF and VMRF leverage generative adversarial networks (GANs) to overcome the problem. These techniques have the ability to artificially “amplify” a limited number of training views, aiding reconstruction. GAN techniques, however, often have complex, sometimes unstable, training processes, making robust and reliable convergence difficult to achieve in practice. A range of other successful methods, such as SparsePose or RUST, can infer poses from a limited number views, but require pre-training on a large dataset of posed images, which aren’t always available, and can suffer from “domain-gap” issues when inferring poses for different types of images. In “MELON: NeRF with Unposed Images in SO(3)”, spotlighted at 3DV 2024, we present a technique that can determine object-centric camera poses entirely from scratch while reconstructing the object in 3D. MELON (Modulo Equivalent Latent Optimization of NeRF) is one of the first techniques that can do this without initial pose camera estimates, complex training schemes or pre-training on labeled data. MELON is a relatively simple technique that can easily be integrated into existing NeRF methods. We demonstrate that MELON can reconstruct a NeRF from unposed images with state-of-the-art accuracy while requiring as few as 4–6 images of an object. MELON convolutional neural network (CNN) encoder that regresses camera poses from training images. We pass a downscaled training image to a four layer CNN that infers the camera pose. This CNN is initialized from noise and requires no pre-training. Its capacity is so small that it forces similar looking images to similar poses, providing an implicit regularization greatly aiding convergence. The second technique is a modulo loss that simultaneously considers pseudo symmetries of an object. We render the object from a fixed set of viewpoints for each training image, backpropagating the loss only through the view that best fits the training image. This effectively considers the plausibility of multiple views for each image. In practice, we find N=2 views (viewing an object from the other side) is all that’s required in most cases, but sometimes get better results with N=4 for square objects. These two techniques are integrated into standard NeRF training, except that instead of fixed camera poses, poses are inferred by the CNN and duplicated by the modulo loss. Photometric gradients back-propagate through the best-fitting cameras into the CNN. We observe that cameras generally converge quickly to globally optimal poses (see animation below). After training of the neural field, MELON can synthesize novel views using standard NeRF rendering methods. We simplify the problem by using the NeRF-Synthetic dataset, a popular benchmark for NeRF research and common in the pose-inference literature. This synthetic dataset has cameras at precisely fixed distances and a consistent “up” orientation, requiring us to infer only the polar coordinates of the camera. This is the same as an object at the center of a globe with a camera always pointing at it, moving along the surface. We then only need the latitude and longitude (2 degrees of freedom) to specify the camera pose. MELON uses a dynamically trained lightweight CNN encoder that predicts a pose for each image. Predicted poses are replicated by the modulo loss, which only penalizes the smallest L2 distance from the ground truth color. At evaluation time, the neural field can be used to generate novel views. Results peak signal-to-noise ratio (PSNR) against held out test views. We see that MELON quickly converges to the approximate poses of most cameras within the first 1,000 steps of training, and achieves a competitive PSNR of 27.5 dB after 50k steps. Convergence of MELON on a toy truck model during optimization. Left: Rendering of the NeRF. Right: Polar plot of predicted (blue x), and ground truth (red dot) cameras. Reconstruction quality comparison between ground-truth (GT) and MELON on NeRF-Synthetic scenes after 100k training steps. Noisy images novel view synthesis from extremely noisy, unposed images. We add varying amounts, σ, of white Gaussian noise to the training images. For example, the object in σ=1.0 below is impossible to make out, yet MELON can determine the pose and generate novel views of the object. Novel view synthesis from noisy unposed 128×128 images. Top: Example of noise level present in training views. Bottom: Reconstructed model from noisy training views and mean angular pose error. RawNeRF have demonstrated NeRF’s excellent de-noising capabilities with known camera poses. The fact that MELON works for noisy images of unknown camera poses so robustly was unexpected. Conclusion paper and MELON site to learn more. Acknowledgements We would like to thank our paper co-authors Axel Levy, Matan Sela, and Gordon Wetzstein, as well as Florian Schroff and Hartwig Adam for continuous help in building this technology. We also thank Matthew Brown, Ricardo Martin-Brualla and Frederic Poitevin for their helpful feedback on the paper draft. We also acknowledge the use of the computational resources at the SLAC Shared Scientific Data Facility (SDF).
AI / MLopen article
Google AI Blog15/03/2024, 18:22
HEAL: A framework for health equity assessment of machine learning performance
Posted by Mike Schaekermann, Research Scientist, Google Research, and Ivor Horn, Chief Health Equity Officer & Director, Google Core Health equity is a major societal concern worldwide with disparities having many causes. These sources include limitations in access to healthcare, differences in clinical treatment, and even fundamental differences in the diagnostic technology. In dermatology for example, skin cancer outcomes are worse for populations such as minorities, those with lower socioeconomic status, or individuals with limited healthcare access. While there is great promise in recent advances in machine learning (ML) and artificial intelligence (AI) to help improve healthcare, this transition from research to bedside must be accompanied by a careful understanding of whether and how they impact health equity. Health equity is defined by public health organizations as fairness of opportunity for everyone to be as healthy as possible. Importantly, equity may be different from equality. For example, people with greater barriers to improving their health may require more or different effort to experience this fair opportunity. Similarly, equity is not fairness as defined in the AI for healthcare literature. Whereas AI fairness often strives for equal performance of the AI technology across different patient populations, this does not center the goal of prioritizing performance with respect to pre-existing health disparities. Health equity considerations. An intervention (e.g., an ML-based tool, indicated in dark blue) promotes health equity if it helps reduce existing disparities in health outcomes (indicated in lighter blue). Health Equity Assessment of machine Learning performance (HEAL): a framework and dermatology AI model case study”, published in The Lancet eClinicalMedicine, we propose a methodology to quantitatively assess whether ML-based health technologies perform equitably. In other words, does the ML model perform well for those with the worst health outcomes for the condition(s) the model is meant to address? This goal anchors on the principle that health equity should prioritize and measure model performance with respect to disparate health outcomes, which may be due to a number of factors that include structural inequities (e.g., demographic, social, cultural, political, economic, environmental and geographic). The health equity framework (HEAL) Framework for Health Equity Assessment of machine Learning performance (HEAL). Our guiding principle is to avoid exacerbating health inequities, and these steps help us identify disparities and assess for inequitable model performance to move towards better outcomes for all. Case study on a dermatology model prior work. This example dermatology model was trained to classify 288 skin conditions using a development dataset of 29k cases. The input to the model consists of three photos of a skin concern along with demographic information and a brief structured medical history. The output consists of a ranked list of possible matching skin conditions. Using the HEAL framework, we evaluated this model by assessing whether it prioritized performance with respect to pre-existing health outcomes. The model was designed to predict possible dermatologic conditions (from a list of hundreds) based on photos of a skin concern and patient metadata. Evaluation of the model is done using a top-3 agreement metric, which quantifies how often the top 3 output conditions match the most likely condition as suggested by a dermatologist panel. The HEAL metric is computed via the anticorrelation of this top-3 agreement with health outcome rankings. We used a dataset of 5,420 teledermatology cases, enriched for diversity in age, sex and race/ethnicity, to retrospectively evaluate the model’s HEAL metric. The dataset consisted of “store-and-forward” cases from patients of 20 years or older from primary care providers in the USA and skin cancer clinics in Australia. Based on a review of the literature, we decided to explore race/ethnicity, sex and age as potential factors of inequity, and used sampling techniques to ensure that our evaluation dataset had sufficient representation of all race/ethnicity, sex and age groups. To quantify pre-existing health outcomes for each subgroup we relied on measurements from public databases endorsed by the World Health Organization, such as Years of Life Lost (YLLs) and Disability-Adjusted Life Years (DALYs; years of life lost plus years lived with disability). HEAL metric for all dermatologic conditions across race/ethnicity subpopulations, including health outcomes (YLLs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* Higher is better; measures the likelihood the model performs equitably with respect to the axes in this table.) HEAL metric for all dermatologic conditions across sexes, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.) HEAL metrics for all cancer and non-cancer dermatologic conditions across age groups, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.) Putting things in context Pareto condition (discussed further in the paper), which restricts model changes so that outcomes for each subpopulation are either unchanged or improved compared to the status quo, and performance does not worsen for any subpopulation. The HEAL framework, in its current form, assesses the likelihood that an ML-based model prioritizes performance for subpopulations with respect to pre-existing health disparities for specific subpopulations. This differs from the goal of understanding whether ML will reduce disparities in outcomes across subpopulations in reality. Specifically, modeling improvements in outcomes requires a causal understanding of steps in the care journey that happen both before and after use of any given model. Future research is needed to address this gap. Conclusion Acknowledgements The research described here is joint work across many teams at Google. We are grateful to all our co-authors: Terry Spitz, Malcolm Pyles, Heather Cole-Lewis, Ellery Wulczyn, Stephen R. Pfohl, Donald Martin, Jr., Ronnachai Jaroensri, Geoff Keeling, Yuan Liu, Stephanie Farquhar, Qinghan Xue, Jenna Lester, Cían Hughes, Patricia Strachan, Fraser Tan, Peggy Bui, Craig H. Mermel, Lily H. Peng, Yossi Matias, Greg S. Corrado, Dale R. Webster, Sunny Virmani, Christopher Semturs, Yun Liu, and Po-Hsuan Cameron Chen. We also thank Lauren Winer, Sami Lachgar, Ting-An Lin, Aaron Loh, Morgan Du, Jenny Rizk, Renee Wong, Ashley Carrick, Preeti Singh, Annisah Um'rani, Jessica Schrouff, Alexander Brown, and Anna Iurchenko for their support of this project.
AI / MLopen article
Google AI Blog14/03/2024, 19:38
Cappy: Outperforming and boosting large multi-task language models with a small scorer
Posted by Yun Zhu and Lijuan Liu, Software Engineers, Google Research Large language model (LLM) advancements have led to a new paradigm that unifies various natural language processing (NLP) tasks within an instruction-following framework. This paradigm is exemplified by recent multi-task LLMs, such as T0, FLAN, and OPT-IML. First, multi-task data is gathered with each task following a task-specific template, where each labeled example is converted into an instruction (e.g., "Put the concepts together to form a sentence: ski, mountain, skier”) paired with a corresponding response (e.g., "Skier skis down the mountain"). These instruction-response pairs are used to train the LLM, resulting in a conditional generation model that takes an instruction as input and generates a response. Moreover, multi-task LLMs have exhibited remarkable task-wise generalization capabilities as they can address unseen tasks by understanding and solving brand-new instructions. The demonstration of the instruction-following pre-training of multi-task LLMs, e.g., FLAN. Pre-training tasks under this paradigm improves the performance for unseen tasks. FLAN-11B, T0-11B and OPT-IML-175B). As a result, operating such sizable models poses significant challenges because they demand considerable computational power and impose substantial requirements on the memory capacities of GPUs and TPUs, making their training and inference expensive and inefficient. Extensive storage is required to maintain a unique LLM copy for each downstream task. Moreover, the most powerful multi-task LLMs (e.g., FLAN-PaLM-540B) are closed-sourced, making them impossible to be adapted. However, in practical applications, harnessing a single multi-task LLM to manage all conceivable tasks in a zero-shot manner remains difficult, particularly when dealing with complex tasks, personalized tasks and those that cannot be succinctly defined using instructions. On the other hand, the size of downstream training data is usually insufficient to train a model well without incorporating rich prior knowledge. Hence, it is long desired to adapt LLMs with downstream supervision while bypassing storage, memory, and access issues. Certain parameter-efficient tuning strategies, including prompt tuning and adapters, substantially diminish storage requirements, but they still perform back-propagation through LLM parameters during the tuning process, thereby keeping their memory demands high. Additionally, some in-context learning techniques circumvent parameter tuning by integrating a limited number of supervised examples into the instruction. However, these techniques are constrained by the model's maximum input length, which permits only a few samples to guide task resolution. In “Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer”, presented at NeurIPS 2023, we propose a novel approach that enhances the performance and efficiency of multi-task LLMs. We introduce a lightweight pre-trained scorer, Cappy, based on continual pre-training on top of RoBERTa with merely 360 million parameters. Cappy takes in an instruction and a candidate response as input, and produces a score between 0 and 1, indicating an estimated correctness of the response with respect to the instruction. Cappy functions either independently on classification tasks or serves as an auxiliary component for LLMs, boosting their performance. Moreover, Cappy efficiently enables downstream supervision without requiring any finetuning, which avoids the need for back-propagation through LLM parameters and reduces memory requirements. Finally, adaptation with Cappy doesn’t require access to LLM parameters as it is compatible with closed-source multi-task LLMs, such as those only accessible via WebAPIs. Cappy takes an instruction and response pair as input and outputs a score ranging from 0 to 1, indicating an estimation of the correctness of the response with respect to the instruction. Pre-training PromptSource that were used to train T0. This collection encompasses a wide range of task types, such as question answering, sentiment analysis, and summarization. Each dataset is associated with one or more templates that convert each instance from the original datasets into an instruction paired with its ground truth response. Cappy's regression modeling requires each pre-training data instance to include an instruction-response pair along with a correctness annotation for the response, so we produce a dataset with correctness annotations that range from 0 to 1. For every instance within a generation task, we leverage an existing multi-task LLM to generate multiple responses by sampling, conditioned on the given instruction. Subsequently, we assign an annotation to the pair formed by the instruction and every response, using the similarity between the response and the ground truth response of the instance. Specifically, we employ Rouge-L, a commonly-used metric for measuring overall multi-task performance that has demonstrated a strong alignment with human evaluation, to calculate this similarity as a form of weak supervision. As a result, we obtain an effective regression dataset of 160 million instances paired with correctness score annotations. The final Cappy model is the result of continuous pre-training using the regression dataset on top of the RoBERTa model. The pre-training of Cappy is conducted on Google's TPU-v4, with RedCoast, a lightweight toolkit for automating distributed training. Data augmentation with a multi-task LLM to construct a weakly supervised regression dataset for Cappy’s pre-training and fine-tuning. Applying Cappy Adapting multi-task LLMs with Cappy Downstream adaptation comparison between Cappy and approaches that rely on an LLM’s parameters, such as fine-tuning and prompt tuning. Cappy’s application enhances multi-task LLMs. Results PromptSource. We demonstrate that Cappy, with 360M parameters, outperforms OPT-175B and OPT-IML-30B, and matches the accuracy of the best existing multi-task LLMs (T0-11B and OPT-IML-175B). These findings highlight Cappy’s capabilities and parameter efficiency, which can be credited to its scoring-based pre-training strategy that integrates contrastive information by differentiating between high-quality and low-quality responses. On the contrary, previous multi-task LLMs depend exclusively on teacher-forcing training that utilizes only the ground truth responses. The overall accuracy averaged over eleven test tasks from PromptSource. “RM” refers to a pre-trained RLHF reward model. Cappy matches the best ones among existing multi-task LLMs. BIG-Bench, a set of manually curated tasks that are considered beyond the capability of many LLMs. We focus on all the 45 generation BIG-Bench tasks, specifically those that do not offer pre-established answer choices. We evaluate the performance using the Rouge-L score (representing the overall similarity between model generations and corresponding ground truths) on every test set, reporting the average score across 45 tests. In this experiment, all variants of FLAN-T5 serve as the backbone LLMs, and the foundational FLAN-T5 models are frozen. These results, shown below, suggest that Cappy enhances the performance of FLAN-T5 models by a large margin, consistently outperforming the most effective baseline achieved through sample selection using self-scoring of the LLM itself. The averaged Rouge-L score over 45 complex tasks within BIG-Bench. The x-axis refers to FLAN-T5 models of different sizes. Every dashed line represents an approach working on FLAN-T5s. Self-scoring refers to using the cross-entropy of LLM to select responses. Cappy enhances the performance of FLAN-T5 models by a large margin. Conclusion Acknowledgments Thanks to Bowen Tan, Jindong Chen, Lei Meng, Abhanshu Sharma and Ewa Dominowska for their valuable feedback. We would also like to thank Eric Xing and Zhiting Hu for their suggestions.
AI / MLopen article
Google AI Blog12/03/2024, 21:15
Talk like a graph: Encoding graphs for large language models
Posted by Bahare Fatemi and Bryan Perozzi, Research Scientists, Google Research Imagine all the things around you — your friends, tools in your kitchen, or even the parts of your bike. They are all connected in different ways. In computer science, the term graph is used to describe connections between objects. Graphs consist of nodes (the objects themselves) and edges (connections between two nodes, indicating a relationship between them). Graphs are everywhere now. The internet itself is a giant graph of websites linked together. Even the knowledge search engines use is organized in a graph-like way. Furthermore, consider the remarkable advancements in artificial intelligence — such as chatbots that can write stories in seconds, and even software that can interpret medical reports. This exciting progress is largely thanks to large language models (LLMs). New LLM technology is constantly being developed for different uses. Since graphs are everywhere and LLM technology is on the rise, in “Talk like a Graph: Encoding Graphs for Large Language Models”, presented at ICLR 2024, we present a way to teach powerful LLMs how to better reason with graph information. Graphs are a useful way to organize information, but LLMs are mostly trained on regular text. The objective is to test different techniques to see what works best and gain practical insights. Translating graphs into text that LLMs can understand is a remarkably complex task. The difficulty stems from the inherent complexity of graph structures with multiple nodes and the intricate web of edges that connect them. Our work studies how to take a graph and translate it into a format that an LLM can understand. We also design a benchmark called GraphQA to study different approaches on different graph reasoning problems and show how to phrase a graph-related problem in a way that enables the LLM to solve the graph problem. We show that LLM performance on graph reasoning tasks varies on three fundamental levels: 1) the graph encoding method, 2) the nature of the graph task itself, and 3) interestingly, the very structure of the graph considered. These findings give us clues on how to best represent graphs for LLMs. Picking the right method can make the LLM up to 60% better at graph tasks! Pictured, the process of encoding a graph as text using two different approaches and feeding the text and a question about the graph to the LLM. Graphs as text GraphQA. Think of GraphQA as an exam designed to evaluate powerful LLMs on graph-specific problems. We want to see how well LLMs can understand and solve problems that involve graphs in different setups. To create a comprehensive and realistic exam for LLMs, we don’t just use one type of graph, we use a mix of graphs ensuring breadth in the number of connections. This is mainly because different graph types make solving such problems easier or harder. This way, GraphQA can help expose biases in how an LLM thinks about the graphs, and the whole exam gets closer to a realistic setup that LLMs might encounter in the real world. Overview of our framework for reasoning with graphs using LLMs. Erdős-Rényi, scale-free networks, Barabasi-Albert model, and stochastic block model, as well as simpler graph structures like paths, complete graphs, and star graphs, providing a diverse set of data for training. When working with graphs, we also need to find ways to ask graph-related questions that LLMs can understand. Prompting heuristics are different strategies for doing this. Let's break down the common ones: Zero-shot: simply describe the task ("Is there a cycle in this graph?") and tell the LLM to go for it. No examples provided. Few-shot: This is like giving the LLM a mini practice test before the real deal. We provide a few example graph questions and their correct answers. Chain-of-Thought: Here, we show the LLM how to break down a problem step-by-step with examples. The goal is to teach it to generate its own "thought process" when faced with new graphs. Zero-CoT: Similar to CoT, but instead of training examples, we give the LLM a simple prompt, like "Let's think step-by-step," to trigger its own problem-solving breakdown. BAG (build a graph): This is specifically for graph tasks. We add the phrase "Let's build a graph..." to the description, helping the LLM focus on the graph structure. We explored different ways to translate graphs into text that LLMs can work with. Our key questions were: Node encoding: How do we represent individual nodes? Options tested include simple integers, common names (people, characters), and letters. Edge encoding: How do we describe the relationships between nodes? Methods involved parenthesis notation, phrases like "are friends", and symbolic representations like arrows. Various node and edge encodings were combined systematically. This led to functions like the ones in the following figure: Examples of graph encoding functions used to encode graphs via text. Analysis and results How LLMs handle graph tasks LLMs struggle: On most of these basic tasks, LLMs did not do much better than a random guess. Encoding matters significantly: How we represent the graph as text has a great effect on LLM performance. The "incident" encoding excelled for most of the tasks in general. Our results are summarized in the following chart. Comparison of various graph encoder functions based on their accuracy on different graph tasks. The main conclusion from this figure is that the graph encoding functions matter significantly. Bigger is (usually) better PaLM 2. Here is a summary of our findings: In general, bigger models did better on graph reasoning tasks. It seems like the extra parameters gave them space to learn more complex patterns. Oddly, size didn't matter as much for the “edge existence” task (finding out if two nodes in a graph are connected). Even the biggest LLM couldn't consistently beat a simple baseline solution on the cycle check problem (finding out if a graph contains a cycle or not). This shows LLMs still have room to improve with certain graph tasks. Effect of model capacity on graph reasoning task for PaLM 2-XXS, XS, S, and L. Do different graph shapes confuse LLMs Samples of graphs generated with different graph generators from GraphQA. ER, BA, SBM, and SFN refers to Erdős–Rényi, Barabási–Albert, Stochastic Block Model, and Scale-Free Network respectively. Comparing different graph generators on different graph tasks. The main observation here is that graph structure has a significant impact on the LLM’s performance. ER, BA, SBM, and SFN refers to Erdős–Rényi, Barabási–Albert, Stochastic Block Model, and Scale-Free Network respectively. Conclusion How to translate the graph to text: how we represent the graph as text significantly influences LLM performance. The incident encoding excelled for most of the tasks in general.. Task type: Certain types of graph questions tend to be harder for LLMs, even with a good translation from graph to text. Graph structure: Surprisingly, the "shape" of the graph that on which we do inference (dense with connections, sparse, etc.) influences how well an LLM does. This study revealed key insights about how to prepare graphs for LLMs. The right encoding techniques can significantly boost an LLM's accuracy on graph problems (ranging from around 5% to over 60% improvement). Our new benchmark, GraphQA, will help drive further research in this area. Acknowledgements We would like to express our gratitude to our co-author, Jonathan Halcrow, for his valuable contributions to this work. We express our sincere gratitude to Anton Tsitsulin, Dustin Zelle, Silvio Lattanzi, Vahab Mirrokni, and the entire graph mining team at Google Research, for their insightful comments, thorough proofreading, and constructive feedback which greatly enhanced the quality of our work. We would also like to extend special thanks to Tom Small for creating the animation used in this post.
AI / MLopen article

AI-created document fatigue: how I designed my way out of it

Code Is Cheap. Engineering Judgement Is Now the Scarce Resource

From Local App to Public Website in Minutes

Holo3.1: Fast & Local Computer Use Agents

From Regex to Vision Models: Which RAG Technique Fits Which Problem

Scikit-LLM vs. Traditional Text Classifiers: When Should You Use an LLM?

Exploring Income Patterns with Python Pandas, Matplotlib, and Seaborn

The most important part of building your taste is to hand it off

Default Bias: Who chose your settings?

Fragments: June 2

AI meets Sturgeon’s Law

Foreman, guardian, team builder: all this is a box

The register shift

RAG Is Not Machine Learning, and the ML Toolkit Solves the Wrong Problem

How to Combine Claude Code and Codex for Maximum Coding Power

Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains

Ensuring Data Integrity with Cryptographic Hashing and the Ethereum Blockchain

Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic

It’s the Lessons We Learned Along the Way. Or, Is It?

The Roadmap for Mastering LLMOps in 2026

Escaping the Valley of Choice in BI

Can VR treat addiction?

Beaucoup and the Pursuit of Memorable Experiences

Designing for AI, the permalink problem, vibe designing

Is your website ready for AI agents?

Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action

Solving a Murder Mystery Using Bayesian Inference

AI Token Scarcity and Arcade Economics

June Is For Exploring (2026 Wallpapers Edition)

The Future of Emotional Technology: Comfort Websites for Growth and Self-Awareness

Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient

High-Throughput Graph Abstraction at Netflix: Part I

How to Get Research Recommendations on the Roadmap

Using RAS to Guide UX Research Resource Allocation and Strategy

From Silos to Service Topology: Why Netflix Built a Real-Time Service Map

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

The Never Ending Story: Building a Seamless Infinite Scroll Experience with GSAP & Lenis

Algorithmic Theming Engines: Building Self-Correcting Color Systems With `contrast-color()`

Building a Context Pruning Pipeline for Long-Running Agents

Solo founding is at an all-time high: Top performers have these traits in common

Fragments: May 27

The test suite as a regression sensor

The VibeSec Reckoning

Whooshes, Snaps and Shaders: Adrien Vanderpotte and the Feeling of the Interface

The Statistics of Token Selection: Logits, Temperature, and Top-P Walkthrough

Expanding Stripe Radar to protect more of your business

Reachy Mini goes fully local

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

Building a Scroll-Driven 3D Cube Gallery in Webflow with GSAP

Building a Multi-Tool Gemma 4 Agent with Error Recovery

Implementing Hybrid Semantic-Lexical Search in RAG

Your Prototype Is Not Being Honest With Your Users (And Here’s How To Fix It)

Harness, Scaffold, and the AI Agent Terms Worth Getting Right

The Case for Design Disposables

Closing the Loop: What to Do After a Design Critique Ends

Specialization Beats Scale: A Strategic Variable Most AI Procurement Decisions Overlook

Four Levels Of Customer Understanding

Building Context-Aware Search in Python with LLM Embeddings + Metadata

Creating Scroll-Driven SVG Map Animations with GSAP

How to Build a Multi-Agent Research Assistant in Python

Bliki: Vibe Coding

Advanced Tree Counting: Mathematical Layouts With `sibling-index()` And `sibling-count()`

Three more static code analysis sensors

Agentic Programming: A Roadmap

Made With Gsap: Building a Fun Gravity-Based Mouse Trail

OlmoEarth v1.1: A more efficient family of Earth observation models

Maintainability sensors for coding agents

80s Business Tech and Seamless Scene Transitions: Inside Shader.se’s Scroll-Driven WebGPU Pipeline

UX Conference August Announced (Aug 17 - Aug 28)

Merlin: The Code Boutique Turning Motion Into Digital Magic

Small by Design: The Strength of Lean Design-System Teams

What Designers Actually Struggle with on Product Teams

Ten Data-Backed Truths Of User Experience ROI

Fragments: May 14

Bliki: Interrogatory LLM

Designing Ourselves: The New Obys Identity and Website

Practical Interface Patterns For AI Transparency (Part 2)

What is Code

Five vertical SaaS insights from Sessions 2026

Why are AI companies buying the teams behind your favorite dev tools?