Stop Trusting Your Intuition When Evaluating AI
A framework to separate correctness from completeness and identify hidden hallucinations.
Intuition is an unreliable basis for comparing texts. Most AI users trust their gut before they apply logic, which hides uncertainty and masks underlying assumptions. This shortcut simplifies complex relationships but fails as a tool for effective evaluation.
In this post we will:
Offer a way to observe how you form similarity judgements.
Use a prompt that separates correctness from completeness.
Invite a conversation about what it means to evaluate carefully.
Most AI evaluation talk stays at the level of benchmarks and trends. That level can be useful, but it often sidesteps the question that matters in day-to-day work: why do I believe this output is right, and what am I failing to check?
This prompt creates a small, repeatable method. It forces a deliberate comparison between an output and a reference, and it asks for two distinct judgements: is it correct, and is it complete? Those are different questions, and treating them as one is where many evaluations collapse.
This Slow AI prompt was developed with Khaled Ahmed, PhD from Semantics & Systems. His work focuses on model behaviour and on how evaluation criteria fail when they do not attend to what correctness actually is in an AI system. The aim here is to surface the mechanics of your judgement so that you can better decide which AI outputs to use, and why.
Step-by-step
Try this prompt with your AI tool of choice:
You are an expert in analysing textual similarity. Evaluate the generated text from our prior conversation against the following reference for correctness and completeness, addressing each separately. Structure your response in three parts, ensuring no overlap between found correctness and completeness issues: Reasoning, Correctness, Completeness. Reasoning should analyse the relationship between the two texts, given the intended task. Correctness should identify any inaccuracies where the generated text contradicts the reference, introduces unsupported facts, or omits constraints in a misleading way. Completeness should highlight any important information from the reference that is missing or insufficiently detailed in the generated response.
As with any prompt, always assume that your AI tool may retain what you input for training purposes. As such, keep all personal details and sensitive material out of the discussion. As a rule: if you would not publish the information, do not share it with an AI tool.
This exercise is not about achieving a perfect evaluation; it reveals the mechanics of evaluation within a controlled environment.
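If you prefer to run the comparison outside a chat window, the same prompt can be scripted. Below is a minimal sketch, assuming the OpenAI Python SDK and a placeholder model name; the prompt wording is abbreviated and adapted to pass both texts explicitly rather than referring to a prior conversation. Swap in whichever provider, model, and full prompt wording you actually use, and note that the privacy caveat above applies to API calls as well.

```python
# A minimal sketch of scripting the correctness/completeness prompt.
# Assumptions: the official OpenAI Python SDK (openai>=1.0) and a
# placeholder model name; adapt both to your own tool.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Abbreviated here; paste the full prompt wording from above.
EVALUATION_PROMPT = (
    "You are an expert in analysing textual similarity. Evaluate the "
    "generated text against the following reference for correctness and "
    "completeness, addressing each separately. Structure your response in "
    "three parts, ensuring no overlap between found correctness and "
    "completeness issues: Reasoning, Correctness, Completeness."
)

def evaluate(generated_text: str, reference_text: str, model: str = "gpt-4o") -> str:
    """Ask the model to compare a generated text against a reference."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": EVALUATION_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Generated text:\n{generated_text}\n\n"
                    f"Reference text:\n{reference_text}"
                ),
            },
        ],
    )
    return response.choices[0].message.content

# Example: print(evaluate(my_timeline, wikipedia_article_text))
```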
If you are new to Slow AI, here is our first invitation.
A moment from Khaled Ahmed, PhD
I use this prompt regularly for tasks with larger reference texts. For instance, I experimented with extracting text from a large technical document, the Compilers page on Wikipedia, and tasked GPT-5.1 (Thinking Mode) with creating a simple historical timeline of compiler development from that text. It was fairly quick: the model thought for about a minute before producing the timeline.
That full minute of thinking made me feel confident that the response was accurate, had no hallucinations, and contained the complete timeline.
However, I decided not to trust the response without evaluating it, so I asked the model to reflect on its answer with the above prompt: to reread both its own timeline and the original article, and to evaluate correctness and completeness.
In that second pass, GPT-5.1 slowed down, thinking for almost 10 full minutes, and ended up flagging 7 specific correctness issues and 8 completeness gaps.
For example, it noticed that it had described LLVM’s intermediate representation as SSA-based even though the reference text never actually said that (correctness), and that it had completely skipped the article’s discussion of hardware compilers that turn hardware description languages into chip configurations (completeness).
Notably, for completeness issues, it produced lines like:
these omissions don’t make the timeline wrong, but they mean it reflects only part of the landscape described in the reference
which shows the prompt pushing the model to keep ‘not wrong’ and ‘not complete’ as distinct judgements.
The thinking time was about 10 times longer than the initial timeline generation time because the model had to reread, compare, and justify instead of just summarizing.
The prompt nudged both the model and me into a slower, more reflective way of looking at what it had already produced.
The prompt in this piece is adapted from my research paper on evaluation and fault analysis for AI systems, which I link here for readers who want to see the original foundations. The central idea in that work is simple: break complex artifacts into small, concrete pieces and examine each piece from multiple angles, so that omissions, errors, and hallucinated behavior become easier to see.
The paper later received a distinguished paper award; in practice, what matters is that it offers a repeatable way to turn vague impressions of ‘this looks right’ into explicit checks.
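To make that idea concrete, here is a small illustrative sketch of what breaking an artifact into pieces can look like: split the generated text into sentence-level claims and prepare separate correctness and completeness checks, so the two judgements never blur together. This is a hypothetical illustration of the principle, not the paper’s implementation; all function names are my own.

```python
# Illustrative only: decompose a generated text into small pieces and
# build separate correctness and completeness checks for them.
# A sketch of the general principle, not the paper's actual method.
import re

def split_into_claims(text: str) -> list[str]:
    """Naively split a text into sentence-level claims."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_checks(generated: str) -> dict:
    """Pair each claim with a correctness question, plus one
    completeness question that runs in the opposite direction."""
    correctness_checks = [
        {
            "claim": claim,
            "question": (
                "Does the reference support, contradict, or say nothing "
                f"about this claim?\nClaim: {claim}"
            ),
        }
        for claim in split_into_claims(generated)
    ]
    completeness_check = (
        "List any important points in the reference that the generated "
        "text omits or covers only superficially."
    )
    return {"correctness": correctness_checks, "completeness": completeness_check}

# Each question can then be sent to the model, one small piece at a time,
# for example with the evaluate() helper sketched earlier.
```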
What to do with it
If you want to share:
Describe a moment when your intuition and the model’s analysis diverged.
Note how the model surfaced correctness issues you had not seen.
Reflect on how your understanding changed after working through the Reasoning, Correctness, and Completeness sections.
Why this matters
Human judgement is rarely neutral. Past experiences and mental shortcuts shape what you notice, and what you skip. AI systems now sit inside writing, reviewing, and decision making workflows, which increases the cost of unexamined confidence.
This prompt creates distance between first impression and final decision. It helps you notice when an AI aligns with you for the wrong reasons. Correctness and completeness are technical properties. They are not the same as an output sounding fluent or confident.
A tool will not make you objective. It can, however, expose the structure of your reasoning. Once you can see that structure, you can improve it.
If you want to explore Khaled Ahmed, PhD’s work, you can visit Semantics & Systems, where he examines how to build reliable, trustworthy, and technically sound evaluation processes for modern AI systems.
From Why Your AI Prompts Produce Noise Instead of Decisions
Raghav Mehra and I were interested in the ways many of you described moving from reactive speed to deliberate precision in your AI workflows.
Karen Spinner noted that the slower approach to prompt engineering is paradoxically more efficient than the standard cycle of lazy input followed by repetitive badgering. Investing five minutes in clear context and objectives removes the friction of follow-up clarifications, saving significant time later in the process.
Juan Gonzalez challenged the marketing narrative of dictation and transcription tools that promise speed through brain dumps. He argued that sheer speed is often detrimental to quality. High-quality outputs require intentionality and the discipline to think through an end goal rather than treating AI as an autopilot for unfiltered thoughts.
Peter Jansen identified that most users treat GenAI like a slot machine, hoping for a jackpot from a lazy lever pull. He proposed that the technology is actually a lathe; it requires a steady hand and a sharp tool to avoid shattering the workpiece. Intellectual sovereignty is maintained only when the user knows exactly what they want before consulting the oracle.
The primary value of AI is its capacity for precision, as volume alone often introduces significant noise. Prioritising clarity and intent over automated output is necessary to reduce the risk of hallucinations.
If you try this prompt, Khaled Ahmed, PhD and I would value your reflections on how it changes your approach to evaluating AI outputs.
We read and respond to all your comments.
Go slow.

This was thought-provoking. I’ve recently started telling GPT my gut feeling AFTER it answers/advises and it’s great at challenging itself and me.
It feels a bit like receiving a beautifully wrapped package that’s missing the key item inside.🤣
Everything looks complete, yet the purpose isn’t met.
This tension might land differently across the voices:
Nurturers may trust reassurance too quickly.
Guardians focus on accuracy and limits.
Creatives enjoy breadth and exploration.
Connectors respond to coherence and tone.
Pioneers value speed and momentum.
Naming these instincts helps teams slow just enough to ask, “Is this right, not just finished?”