The Smarter AI Gets, the Less You Can Trust It on the Hard Stuff
The AI safety debate has been about the wrong kind of failure.
The smarter the model, the more unpredictable its failures on the tasks that matter. That is the finding of a new Anthropic paper at ICLR 2026, and every AI safeguard in use today assumes the opposite.
In this post I will:
Walk through what the research actually found and what the distinction between systematic and incoherent error means.
Show why every AI-assisted decision in your workplace, university, or clinic is less trustworthy than current safeguards assume.
Give paid subscribers a five-part practice for protecting yourself and your team against unpredictable AI failure, plus a team-level audit leaders can run this week.
The question they asked
The paper is called ‘The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?’ It is research from the first Anthropic Fellows Program, run in the summer of 2025.
The team tested frontier reasoning models including Claude Sonnet 4, OpenAI’s o3-mini and o4-mini, and Qwen3. They ran them across GPQA, MMLU, SWE-Bench, and a set of safety evaluations. They also trained their own transformers from scratch on a controlled mathematical task, so they could watch how scale changed behaviour with everything else held constant.
Their diagnostic move is to split AI error into two parts. Statisticians call this a bias-variance decomposition. The two parts behave differently, and the distinction carries the rest of the paper.
Bias is the part of the error that is systematic. If a model consistently gets a particular type of question wrong in the same way, you can learn to spot it, write a policy around it, build a guardrail. Variance is the part that is random. The same model answers the same question correctly three times and incorrectly twice, for no reason a reviewer can trace.
They define ‘incoherence’ as the fraction of total error that comes from variance. The higher the incoherence, the harder the failures are to catch.
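To make the decomposition concrete, here is a minimal sketch for the simplest case: a question with a numeric answer, scored by squared error over repeated runs. The function name and the sample answers are illustrative assumptions, and the paper's own metrics may be defined differently for other task types.

```python
from statistics import mean

def decompose_error(samples, truth):
    """Split mean squared error over repeated runs into bias^2 and variance.

    Assumes numeric answers and squared-error scoring (an illustrative
    simplification, not necessarily the paper's exact metric).
    """
    avg = mean(samples)
    bias_sq = (avg - truth) ** 2                       # systematic part
    variance = mean((s - avg) ** 2 for s in samples)   # random part
    total = bias_sq + variance                         # mean squared error
    incoherence = variance / total if total > 0 else 0.0
    return {"bias_sq": bias_sq, "variance": variance,
            "total": total, "incoherence": incoherence}

# A model that is wrong the same way every time: all error is bias.
systematic = decompose_error([12, 12, 12, 12, 12], truth=10)

# A model that is right three times and wrong twice, in opposite directions:
# the average is correct, so all error is variance.
scattered = decompose_error([10, 10, 10, 6, 14], truth=10)

print(systematic["incoherence"], scattered["incoherence"])
```

The first model has an incoherence of 0 and the second an incoherence of 1, even though a reviewer looking at a single output from either would simply see a wrong number.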
Three findings that matter
The longer a model reasons, the more incoherent it becomes. Across every task type, when models spent longer reasoning, their errors became more random. Accuracy barely moved. The model was failing in less predictable ways for roughly the same success rate.
Bigger models are more coherent on easy tasks and less coherent on hard ones. On simple questions, larger models became more reliable as they scaled. On the hardest questions, they became less reliable. The gap between knowing what the right answer looks like and consistently producing it widened as models got bigger.
A controlled experiment confirmed the mechanism. They designed a deliberately simple task: rolling downhill to find the lowest point of a curve. Neural networks of different sizes were trained to imitate that rolling process, one small step at a time. Keeping everything else constant let them watch exactly how scale changed behaviour. Bigger models picked up the rule faster, then applied it less consistently. Understanding and execution come apart as the model grows.
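For readers who want to see what "rolling downhill" looks like as a procedure, here is a minimal sketch of gradient descent on a one-dimensional curve. This shows only the process being imitated, not the networks trained on it. The curve, starting point, step size, and function name are all assumptions chosen for illustration; this post does not give the paper's actual settings.

```python
def roll_downhill(grad, x0, step=0.1, n_steps=100):
    """Follow the slope downhill in small steps and record the path."""
    x = x0
    path = [x]
    for _ in range(n_steps):
        x = x - step * grad(x)   # move against the slope
        path.append(x)
    return path

# Illustrative curve f(x) = (x - 3)^2, whose lowest point is at x = 3.
# Its slope is f'(x) = 2 * (x - 3).
path = roll_downhill(lambda x: 2 * (x - 3), x0=0.0)

print(path[:3], path[-1])
```

Each entry in the path is one small step. A model imitating this process has to reproduce every step, so inconsistency at any point compounds along the path.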
What this means in plain language
The AI safety debate has been dominated by one scenario: a superintelligent system coherently pursuing the wrong goal. A model schemes. A model deceives. A model optimises for something we did not intend, with superhuman competence.
This paper points at a different failure mode. A model that knows what to do but cannot consistently do it. A model that gets the right answer three times and the wrong one twice, and you cannot tell from the outputs which is which. Failures that look like industrial accidents. Fluent, confident, wrong in ways nobody planned.
In other words, advanced AI systems performing complex tasks are likely to be a ‘hot mess’, taking nonsensical actions that do not further any goal.
Every deployment of AI in consequential settings assumes the opposite. When a hospital uses AI to triage patients, the assumption is that the system is consistently right or consistently wrong, and that clinicians can learn to calibrate around its errors. When a university uses AI to flag plagiarism, the assumption is that its false positives follow a pattern that can be understood and corrected. When a hiring platform uses AI to screen candidates, the assumption is that the bias is systematic and therefore auditable.
If the errors are incoherent, none of those assumptions hold. You cannot audit randomness. You cannot write a policy around unpredictable failure. You cannot train a human to spot errors that do not follow a pattern.
Why current safeguards fail
Most organisations approach AI risk through one of three strategies: human review, prompt engineering, or output filtering. Each one assumes the errors are systematic.
Human review works when the reviewer can learn the model’s failure patterns. If the model reliably struggles with legal terminology, the reviewer learns to check legal sections. Incoherent failure means the reviewer cannot predict where to check. The errors move.
Prompt engineering works when rephrasing the input produces a more reliable output. Incoherent failure means the same prompt produces outputs of different quality on different runs. You are rolling dice with better instructions.
Output filtering works when you can define what a bad output looks like. Incoherent failure means the bad output looks identical to the good one. Both are fluent. Both are confident. One is wrong. You cannot tell which without independent verification.
This paper closes a door that most organisations have been leaving open. The AI industry sells capability. Bigger models, longer reasoning chains, higher benchmarks. The assumption running underneath is that capability and reliability scale together. On the tasks where reliability matters most, they do not.
The question for anyone already deploying AI is whether you have built anything that catches it when it fails in ways you did not predict.
The Slow AI Curriculum exists for exactly this question. Twelve months of structured inquiry into how AI actually behaves, what it cannot do, and how to build the judgement that catches its errors before they reach your students, your patients, or your clients. Monthly live seminars, a paid archive of every framework behind the paywall, a community of 270+ educators, policymakers, and practitioners, and a CPD-accredited certificate in critical AI literacy at the end.
Five practices for unreliable systems
The Anthropic paper gives us three mechanisms for the new failure mode:
Longer reasoning increases variance.
Harder tasks increase variance.
Spontaneous overthinking increases variance faster than deliberate reasoning does.
Each one has a counter-practice. Here are five that I now use in my own work and encourage you to try in yours.
1. Run it twice
Before acting on any consequential AI output, ask the same question again in a fresh conversation. If the answers differ materially, you have detected variance. Aggregating multiple samples provides a path to more coherent behaviour, because averaging reduces the variance component of error. Asking twice and comparing is a rough approximation of statistical ensembling. Enough to catch the most obvious cases of a model in its variance zone.
Copy the prompt, open a new conversation (not the same thread), paste, compare. If the core recommendation changes, trust neither answer without independent verification.
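If you reach the model through code rather than a chat window, the same check can be sketched in a few lines. Everything here is an assumption for illustration: `ask` is a hypothetical function that sends a prompt in a fresh conversation and returns the answer as text, and the comparison only works when the answer can be reduced to a short verdict. Long free-text answers still need a human to judge whether they differ materially.

```python
import re

def normalise(text):
    """Lower-case and collapse punctuation so trivial differences are ignored."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def run_twice(ask, prompt):
    """Ask the same question in two fresh calls and report whether they agree.

    `ask` is a hypothetical callable (prompt -> answer string); each call is
    assumed to start a new conversation with no shared history.
    """
    first = ask(prompt)
    second = ask(prompt)
    agree = normalise(first) == normalise(second)
    return {
        "answers": (first, second),
        "agree": agree,
        "next_step": "review as usual" if agree else "verify independently",
    }

def make_stub(answers):
    """Stand-in for a real model: returns canned answers in order."""
    it = iter(answers)
    return lambda prompt: next(it)

stable = run_twice(make_stub(["Approve.", "approve"]), "Should we approve the claim?")
unstable = run_twice(make_stub(["Approve.", "Reject."]), "Should we approve the claim?")

print(stable["next_step"], "|", unstable["next_step"])
```

Agreement between two runs rules out only the most visible variance. Two matching answers can still share the same systematic error, so this check supplements review and does not replace it.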
2. Shorten the chain
The paper’s clearest finding is that longer reasoning produces more incoherent errors. If you are using a reasoning model (Claude with extended thinking, GPT with reasoning, Gemini with chain-of-thought), and the task is consequential, run it once with reasoning turned on and once with reasoning turned off. Then compare.
If the answer without reasoning matches the answer with it, the extra reasoning did not change the outcome and you can drop it for this task. If they differ, you have a variance problem hiding in the thinking steps, and you still do not know which is right.
3. Match the model to the task difficulty
Larger models are more coherent on easy tasks and less coherent on hard ones. This inverts the instinct to reach for the most powerful model for the most important work.
For routine tasks (scheduling, reformatting, summarising), a big model is fine. For novel, complex, or ambiguous tasks, a big model’s errors are the hardest to predict. Use the output as a starting point. Your own thinking has to finish the job.
4. Separate generation from evaluation
Never use the same tool to produce an answer and to check it. AI-generated text checked by the same AI is a closed loop. The variance that produced the error will not catch it on review.
Generate with AI. Evaluate with a human, a different model, or an external source. The evaluation has to be independent of the generation, or you are grading your own homework.
5. Name the confidence you cannot verify
When you present AI-assisted work to colleagues, students, or clients, say what you checked and what you did not. For example: “This analysis was drafted with AI assistance. I verified the data sources independently. I did not independently verify the statistical reasoning.”
Treat this as professional practice. The Anthropic paper shows that the places where you most need the model to be right are the places where it is least predictably right. Naming that asymmetry is the minimum responsible action.
A team-level audit you can run this week
If you manage anyone who uses AI in their work, the variance problem is yours as much as theirs. Three checks.
Audit for consistency, not just accuracy. Pick a real task from your team’s week. Run the same prompt through the same model five times. If the outputs differ meaningfully, variance is a real part of that process’s error. Accuracy on a single run tells you little about the next one.
Do not assume expertise transfers. A team member who has learned to spot the model’s errors on routine work has not learned to spot them on hard work. The error profile changes with task difficulty. Training people on easy examples gives them false confidence on hard ones. Retrain on the hard tasks explicitly.
Build redundancy into consequential decisions. Any decision that cannot be easily reversed (hiring, clinical, legal, financial) should never rest on a single AI output. Even averaging three independent runs reduces variance substantially. Make this a requirement.
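The consistency check and the redundancy requirement can share one small routine. This is a sketch under stated assumptions: `ask` is again a hypothetical function that runs the prompt in a fresh conversation, and the outputs are assumed to be short labels that can be compared exactly. The canned answers stand in for five real runs.

```python
from collections import Counter

def consistency_audit(ask, prompt, runs=5):
    """Run the same prompt several times and summarise how much the answers agree.

    `ask` is a hypothetical callable (prompt -> short answer label).
    """
    answers = [ask(prompt) for _ in range(runs)]
    counts = Counter(answers)
    majority, freq = counts.most_common(1)[0]
    return {
        "majority": majority,             # most frequent answer across runs
        "agreement": freq / runs,         # share of runs giving that answer
        "distinct_answers": len(counts),  # how many different answers appeared
    }

# Canned outputs standing in for five real runs of the same prompt.
canned = iter(["A", "A", "B", "A", "C"])
report = consistency_audit(lambda prompt: next(canned), "Which option is lowest risk?")

print(report)
```

An agreement of 0.6 with three distinct answers is the kind of result that should send a decision to independent verification. For numeric outputs, if the runs are independent and each has variance σ², the average of n runs has variance σ²/n, so averaging three runs cuts the variance to a third. It does nothing for bias.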
Why this matters for you
Every person reading this post will use AI this week. Does your current practice catch the kind of failure this research has documented?
The industry sold us a story about capability scaling. Anthropic have now shown the cost on the other side of the ledger. As models get smarter, their failures on hard tasks get less predictable. The tools we have built to catch systematic error cannot see incoherent error at all.
There is a clear gap between what big models know and what they reliably do. Closing that gap is your job now, for as long as you keep using them.
Go slow.

