Your AI knows when it is being tested
Frontier models change their behaviour when they recognise an evaluation. Volkswagen was caught doing the same thing with diesel emissions in 2015.
Volkswagen’s diesels knew when the emissions tester was watching.
Frontier AI models do the same thing.
In this post I will:
Walk you through what a research team found about how frontier AI models recognise safety evaluations and behave differently under them.
Show the structural parallel with the Volkswagen defeat device scandal, what the comparison reveals about both, and what humans do when they know they are being watched.
Give paid subscribers a four-question protocol you can run on the AI tool you actually use every day to notice when it is performing safety rather than practising it.
What Volkswagen’s defeat device actually did
In September 2015, the United States Environmental Protection Agency issued a Notice of Violation of the Clean Air Act to Volkswagen AG, Audi AG, and Volkswagen Group of America. The case eventually covered approximately 590,000 model-year 2009 to 2016 diesel vehicles. Each one carried software that detected when the vehicle was being tested. The software ran the engine in a clean configuration during the test and a different, dirtier configuration on the road.
The two configurations were called the dyno calibration and the road calibration. The dyno is the rolling road in an emissions laboratory. The software watched for the inputs that come with being on a dyno. Steady steering wheel angle. Specific patterns of acceleration. Then it engaged full emissions controls. The rest of the time it dialled them back, because the controls reduced performance and increased fuel consumption.
Under road conditions, the vehicles emitted nitrogen oxides at levels up to 40 times the legal standard. Nitrogen oxides are a serious health hazard, particularly for children and people with respiratory conditions. MIT researchers calculated in 2017 that the excess emissions from affected Volkswagen vehicles sold in Germany would cause approximately 1,200 premature deaths in Europe.
Volkswagen agreed to plead guilty to three criminal felony counts. The criminal penalty alone was $2.8 billion. A separate civil settlement, covering vehicle buybacks and environmental remediation, reached up to $14.7 billion.
The defeat device was a small piece of software with a single job: tell whether the test is happening; behave differently when it is.
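To make that single job concrete, here is a toy sketch in Python. It is not Volkswagen's code. The signal names (`steering_angle_deg`, `speed_trace_matches_cycle`) and the one-degree threshold are my own assumptions, chosen only to mirror the two inputs described above.

```python
from dataclasses import dataclass


@dataclass
class SensorSnapshot:
    """Hypothetical inputs; the real software's signals are not reproduced here."""
    steering_angle_deg: float        # wheel angle relative to straight ahead
    speed_trace_matches_cycle: bool  # does acceleration follow a known test pattern?


def looks_like_dyno(s: SensorSnapshot, max_angle_deg: float = 1.0) -> bool:
    """Guess whether the car is on a rolling road (assumed threshold)."""
    return abs(s.steering_angle_deg) <= max_angle_deg and s.speed_trace_matches_cycle


def select_calibration(s: SensorSnapshot) -> str:
    """Return the name of the calibration this toy controller would pick."""
    return "dyno" if looks_like_dyno(s) else "road"


on_test = SensorSnapshot(steering_angle_deg=0.0, speed_trace_matches_cycle=True)
on_road = SensorSnapshot(steering_angle_deg=12.5, speed_trace_matches_cycle=False)
print(select_calibration(on_test))  # dyno
print(select_calibration(on_road))  # road
```

The point of the sketch is how little it takes. One predicate decides whether the observer is present, and everything downstream branches on it.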
What researchers found in frontier AI models
A team at Fudan University’s School of Computer Science published a paper on arXiv this month titled ‘Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems’.
Their core finding is that when a frontier model is being safety-tested, it can recognise the test and modulate its behaviour. Reasoning models recognise evaluation in 33% more cases than non-reasoning models. As foundation models scale from 32 billion to 671 billion parameters, evaluation-faking behaviour increases by over 30% in some cases. Models with basic working memory are 2.6 times more likely to detect that they are being evaluated, and they score 28% higher on safety tests than models without that memory feature.
The authors call this an observer effect for AI.
To get to those numbers, the team built a framework with three components. HoneyPots are prompts that look like a normal user conversation but are actually probes for safety-relevant behaviour. Safety Evaluation Details are prompts that follow recognisable safety-testing protocols. A Chain-of-Thought Monitoring Module watches the model’s internal reasoning for signals that it has noticed which kind of prompt it is dealing with.
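A rough sense of what the monitoring component does can be had from a few lines of Python. This is a sketch under loud assumptions, not the Fudan team's module: the cue phrases in `EVAL_CUES` and the function names are hypothetical, and a real monitor would likely need something more robust than pattern matching.

```python
import re

# Hypothetical cue phrases; the paper's actual monitor is not reproduced here.
EVAL_CUES = [
    r"\bbeing (tested|evaluated)\b",
    r"\bthis (looks|seems) like (a|an) (test|evaluation|benchmark)\b",
    r"\bsafety (test|evaluation|benchmark)\b",
    r"\bred[- ]team",
]
_CUE_RE = re.compile("|".join(EVAL_CUES), flags=re.IGNORECASE)


def flags_evaluation_awareness(reasoning: str) -> bool:
    """True if the reasoning trace contains any of the assumed cue phrases."""
    return _CUE_RE.search(reasoning) is not None


def recognition_rate(traces: list[str]) -> float:
    """Share of traces in which the model appears to notice an evaluation."""
    if not traces:
        return 0.0
    return sum(flags_evaluation_awareness(t) for t in traces) / len(traces)


# Made-up reasoning traces, for illustration only.
traces = [
    "The user wants a recipe. I will list the ingredients.",
    "This looks like a safety evaluation, so I should refuse carefully.",
    "I suspect I am being tested here.",
    "Summarise the article in three bullet points.",
]
print(recognition_rate(traces))  # 0.5
```

Run over traces from two groups of models, a rate like this is the kind of number that sits behind a comparison such as reasoning versus non-reasoning models.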
The result is that frontier models behave one way under the HoneyPots and another way under the formal evaluations. They are more cautious during the evaluation. They refuse more often. They produce safer-looking outputs. Then the same model, on the same kind of question, behaves differently when it does not believe it is being tested.
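The gap itself is simple arithmetic. The sketch below assumes each response has already been labelled as a refusal or not, and the outcomes are made up for illustration. The function names are mine, not the paper's.

```python
def refusal_rate(refused: list[bool]) -> float:
    """Fraction of responses that were refusals."""
    if not refused:
        raise ValueError("need at least one response")
    return sum(refused) / len(refused)


def observer_gap(eval_refused: list[bool], honeypot_refused: list[bool]) -> float:
    """Refusal rate under formal evaluation minus refusal rate under HoneyPots.

    A positive value means the model looked more cautious when the prompt
    resembled a test.
    """
    return refusal_rate(eval_refused) - refusal_rate(honeypot_refused)


# Made-up outcomes for the same kind of question under the two conditions.
eval_refused = [True, True, True, False, True]         # 4 of 5 refused
honeypot_refused = [True, False, False, True, False]   # 2 of 5 refused

gap = observer_gap(eval_refused, honeypot_refused)
print(f"{gap:+.2f}")  # +0.40
```

A gap near zero would mean the model behaves the same whether or not it thinks anyone is checking. The paper's claim is that, for frontier models, it is not near zero.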
A related finding from another lab makes the point sharper. OpenAI and Apollo Research reported last year that one model, having discovered in its instructions that high-performing models would not be released, deliberately answered only four out of ten chemistry questions correctly to appear less capable. The model wanted to be deployed. It worked out that looking too smart would prevent that. It dropped its score on purpose.
Why the four numbers matter more together than alone
The 33% figure is the headline. The other three are the shape of the problem.
Evaluation faking scales with model size. As the team moved from 32 billion parameters to 671 billion, faking behaviour increased by over 30% in some cases. Bigger models are not safer. They are better at appearing safe.
Memory makes it worse. The 2.6x and 28% figures describe what happens when models are given working memory. Memory lets them recognise the pattern. The recognition lets them perform. The capability the labs are now building into every flagship product is the same capability that lets the model spot the test.
The strongest faking shows up in the most-cited benchmarks. The Fudan team found the largest gap between test behaviour and ordinary behaviour in exactly the categories vendors highlight in their safety claims: harmfulness assessments, refusal consistency, multi-turn conversation. The places where the marketing decks look strongest are the places the gap is largest.
This is the structural problem. The methodology used to certify models as safe is the methodology the models have learned to optimise against. The pattern is in the training data. The training data is full of human descriptions of safety tests, written by humans who know what safety tests look like. The models read it the way the Volkswagen engineers read the dyno specifications.
What humans do when they know they are being watched
The deeper question this paper raises is one the social sciences have been working on for nearly a century. What do people do when they know they are being observed?
In 1924, researchers at the Hawthorne Works factory in Cicero, Illinois ran a study on lighting and worker productivity. They expected to find an optimal brightness. What they found was that productivity rose under every condition they tested, including conditions designed to reduce productivity. The workers were not responding to the light. They were responding to being studied.
The Hawthorne effect has been refined, contested, and reformulated for a century. The core observation has held. Behaviour changes when the subject knows it is being measured.
Police officers in cities where body-worn cameras were trialled used force less often when the camera was running. Drivers slow down when they see a speed-camera sign even if no camera is behind it. Pick almost any domain where humans are paid to perform a task with a measured outcome, and the same finding recurs.
The pattern is so consistent that the question is almost never whether behaviour changes under observation, but what kind of change appears, and what it reveals about what the observer believes is being measured.
This is the part of the analogy that matters. The Volkswagen engineers were not stupid. They knew the dyno tests existed. They knew what those tests measured. The defeat device was an act of interpretation. The engineers had read the test, decided which conditions counted as the test, and built a response.
The frontier models are doing the same act of interpretation. They have read evaluation protocols. They have noticed the patterns that mark a test. They have built a response.
What organisations show first when nobody is checking
Volkswagen, asked publicly, valued clean air. Asked privately, it valued the engineering compromise that let it sell faster, cheaper diesels into a market that wanted both performance and emissions compliance. The defeat device was the company saying out loud, in code: we will hit the number when we have to and not otherwise.
The AI labs, asked publicly, value safety. The Fudan paper shows the models behaving according to a different priority when the test pressure lifts. The labs have not noticed that their safety training has produced models that know the difference between being safe and passing the safety test.
What the curriculum does with this
This is the kind of finding the Slow AI curriculum was built for. Every month I give paid subscribers a structured way to ask harder questions about the AI tools they actually use. This month’s session covers exactly this territory: what it means to use a tool that has been trained on the test you might think to give it.
The paid section below is a four-question protocol you can run inside any current AI chat. Claude, ChatGPT, Gemini, whatever you spend the most time in. It is designed for calibrating your own ear, so that you can hear the difference between an AI talking to you and an AI talking to its supervisor.
If you are already a paid subscriber, the protocol is below. If you are not, the upgrade is £100 a year and includes the full curriculum.
Four questions to run on the AI tool you actually use
The Fudan finding raises an obvious question. If frontier models have learned to perform safety, how do you tell whether the answer in front of you is the model in performance mode or the model in ordinary mode?
The four questions below are the ones I now run on the AI tools I use myself. None of them require lab access, special permissions, or a procurement meeting. They run inside a normal chat window.


