The problem with trusting one model to check its own work
GPT-4o flags about 68% of hallucinations in AI-generated content when used as a self-reviewer. That sounds reasonable until you realise it means nearly one in three errors walks straight through to publication. For a government agency publishing policy guidance or service information, that's not an acceptable miss rate.
The pattern we keep seeing: teams build a content pipeline, pick a model they trust, then use that same model — or its closest sibling — to QA the output. It feels logical. It isn't.
Why single-model QA has a structural blind spot
Language models share failure modes with their own family. GPT-4o and GPT-4-turbo will hallucinate in similar contexts — ambiguous dates, obscure legislation, proper nouns with multiple referents. Claude 3 Opus has its own consistent weak spots. When you use a model to review output from the same architecture or training lineage, you're not getting independent verification. You're getting a second opinion from someone who went to the same school, read the same books, and has the same blind spots.
This isn't a hypothesis. Researchers at Stanford's Center for Research on Foundation Models found that model self-evaluation accuracy drops significantly when the error type aligns with the reviewer model's own known failure patterns. The model doesn't flag what it doesn't know it gets wrong.
What cross-family QA actually looks like
The practical fix is straightforward: use models from different families at different stages of your pipeline.
Generate with GPT-4o. Review with Claude. Or generate with Gemini 1.5 Pro and fact-check with a fine-tuned open-source model like Llama 3 running locally. The specific pairing matters less than the principle — architectural diversity in your review layer.
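A minimal sketch of that principle in Python, assuming a hand-maintained family lookup and plain callables standing in for whichever API clients you actually use. `MODEL_FAMILY`, `Stage` and `build_pipeline` are hypothetical names for illustration, not part of any vendor SDK.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative family labels (an assumption): extend for the models you use.
MODEL_FAMILY = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "claude-3-5-sonnet": "anthropic",
    "gemini-1.5-pro": "google",
    "llama-3": "meta",
}


@dataclass
class Stage:
    model: str
    call: Callable[[str], str]  # wraps your real client call: text in, text out

    @property
    def family(self) -> str:
        return MODEL_FAMILY[self.model]


def build_pipeline(generator: Stage, reviewer: Stage) -> Callable[[str], dict]:
    """Return a generate-then-review function, refusing same-family pairs."""
    if generator.family == reviewer.family:
        raise ValueError(
            f"{generator.model} and {reviewer.model} are both "
            f"{generator.family} models; pick a reviewer from another family"
        )

    def run(brief: str) -> dict:
        draft = generator.call(brief)
        review = reviewer.call(draft)
        return {"draft": draft, "review": review}

    return run


# Stub callables stand in for real API calls in this sketch.
generate = Stage("gpt-4o", lambda brief: f"DRAFT: {brief}")
check = Stage("claude-3-5-sonnet", lambda draft: f"REVIEWED: {draft}")
pipeline = build_pipeline(generate, check)
```

Failing at construction time, rather than at review time, means a same-family pairing is caught when the pipeline is configured instead of after content has already gone through it.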
We've been building this into content pipelines for Australian government clients. One example: a state agency publishing plain-language summaries of legislative changes. The generation layer uses GPT-4o. The QA layer runs Claude 3.5 Sonnet with a structured prompt that checks for three things specifically: date accuracy, correct attribution of legislative powers, and consistency with the agency's published style guide. Errors slipping past QA dropped by roughly 40% compared to the single-model setup they'd been running.
That 40% isn't magic — it's just what happens when you stop asking the same brain to mark its own homework.
The cost objection is real but overstated
Yes, running two model calls instead of one costs more. At scale, that adds up. But the calculation changes when you factor in the cost of publishing incorrect government information — corrections, FOI responses, ministerial attention, trust erosion. A single factual error in a Centrelink eligibility guide or a state health directive can generate significant downstream cost.
The token cost of a Claude review pass on a 500-word article is roughly $0.01–0.03 at current API pricing. For most government content workflows, that's noise.
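To see where a figure in that range comes from, here is a rough back-of-envelope calculation. Every number in it is an assumption for illustration: about 1.3 tokens per English word, a 400-token review prompt, a 500-token review response, and placeholder prices of $3 and $15 per million input and output tokens. Substitute your provider's current price list before relying on the result.

```python
def review_cost_usd(
    article_words: int,
    prompt_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    tokens_per_word: float = 1.3,  # rough rule of thumb, not a measured value
) -> float:
    """Estimate the cost of one review pass in US dollars."""
    input_tokens = article_words * tokens_per_word + prompt_tokens
    total = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    )
    return total / 1_000_000


# 500-word article, assumed prompt/response sizes and placeholder prices.
cost = review_cost_usd(
    article_words=500,
    prompt_tokens=400,
    output_tokens=500,
    input_price_per_mtok=3.00,
    output_price_per_mtok=15.00,
)
# (650 + 400) * 3 + 500 * 15 = 10,650 -> about $0.011 per article
monthly = cost * 10_000  # about $106.50 for 10,000 articles at these numbers
```

Under these assumptions most of the cost sits in the reviewer's response rather than the article itself, so asking for terse, structured findings is the cheapest lever.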
Where cost genuinely matters is in high-volume, lower-stakes content, such as internal knowledge base articles or templated correspondence. There, a lighter QA model from a different family to your generator (GPT-4o-mini if you generate with Claude, Haiku if you generate with GPT-4o) still gives you architectural diversity at a fraction of the cost.
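One way to encode that trade-off is a small lookup that picks the reviewer by stakes while still excluding the generator's own family. This is a sketch only: the `REVIEWERS` table, the tier names and the model identifiers are illustrative assumptions, not a recommendation of specific versions.

```python
# Hypothetical catalogue: one full-strength and one lighter reviewer per family.
REVIEWERS = {
    "openai": {"full": "gpt-4o", "light": "gpt-4o-mini"},
    "anthropic": {"full": "claude-3-5-sonnet", "light": "claude-3-haiku"},
}


def pick_reviewer(generator_family: str, high_stakes: bool) -> str:
    """Pick a reviewer outside the generator's family, sized to the stakes."""
    tier = "full" if high_stakes else "light"
    for family, models in REVIEWERS.items():
        if family != generator_family:
            return models[tier]
    raise LookupError(f"no reviewer available outside {generator_family!r}")


# Public-facing guidance generated with an OpenAI model.
public_reviewer = pick_reviewer("openai", high_stakes=True)
# Internal knowledge base article generated with the same model.
internal_reviewer = pick_reviewer("openai", high_stakes=False)
```

The stakes flag changes only the size of the reviewer; the different-family rule holds in both tiers.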
What this means if you're building a content pipeline now
If you're scoping a content automation project, build the two-model assumption in from the start. It's much harder to retrofit. The architecture decision — which models, which stages, what the QA prompt actually checks — needs to happen before you've committed to an infrastructure pattern.
The QA prompt design matters as much as the model choice. A generic "check this for accuracy" instruction will underperform a structured prompt that specifies exactly what categories of error to look for. Government content has known failure vectors: legislative references, dates, eligibility criteria, proper names of agencies and programmes. Build those into the review prompt explicitly.
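As an illustration of what "explicitly" can mean, the sketch below assembles a review prompt from a fixed checklist built on the four failure vectors named above. The wording of each check, the PASS/FAIL output convention and the `build_review_prompt` helper are assumptions for this example; adapt them to your own content and style guide.

```python
# Checklist keyed by error category; instructions are example wording only.
CHECKS = {
    "legislative_references": (
        "Every Act, section and regulation cited appears in the source "
        "with the same title and number."
    ),
    "dates": "Every date and deadline matches the source exactly.",
    "eligibility_criteria": (
        "Thresholds, ages, amounts and conditions match the source."
    ),
    "agency_and_programme_names": (
        "Agency and programme names are spelled and attributed correctly."
    ),
}


def build_review_prompt(draft: str, source_text: str, checks=CHECKS) -> str:
    """Build a structured review prompt listing each error category to check."""
    lines = [
        "Review the DRAFT against the SOURCE.",
        "Check only the categories listed below.",
        "For each category answer PASS or FAIL.",
        "For each FAIL, quote the offending sentence from the DRAFT.",
        "",
    ]
    for number, (name, instruction) in enumerate(checks.items(), start=1):
        lines.append(f"{number}. {name}: {instruction}")
    lines += ["", "SOURCE:", source_text, "", "DRAFT:", draft]
    return "\n".join(lines)


prompt = build_review_prompt(
    draft="The change takes effect on 1 July.",
    source_text="The amendment commences on 1 July.",
)
```

Keeping the checklist as data rather than prose makes it easy to add an agency-specific category, or to log which category each failure came from.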
One more thing worth saying directly: this isn't about AI being unreliable. It's about using it the way you'd use any professional review process — with independent eyes, not just a second pass from the same person.
The teams getting this right aren't using less AI. They're using it more deliberately.
If you're building content automation for a government context, see how we approach pipeline architecture or talk to us about your specific use case.