
When executives ask for “the hallucination rate,” they usually mean: “How often will this system embarrass us?” That’s a reasonable question. It’s also the wrong shape of a question.
Hallucinations are not a single measurable property like battery life. They are a family of failure modes that spike or shrink depending on the task, the scoring incentives, whether the model can abstain, whether retrieval is used, and whether the evaluation treats “I’m not sure” as an acceptable outcome. In 2025, the best research doesn’t give you one number. It gives you a map: where hallucinations concentrate, how models trade correctness against refusal, and what mitigation techniques actually move the needle.
In practice, benchmarks usually operationalize hallucination in one of three ways.
Some measure short-form factuality: the model must produce a single correct fact or admit it can’t. SimpleQA is a canonical example. It grades each response as correct, incorrect, or not attempted, explicitly rewarding models that “know what they know” and abstain when uncertain.
Some measure grounding faithfulness: the model is given a document and must summarize or answer questions without inventing details not supported by the source. Vectara’s hallucination leaderboard targets exactly this behavior and describes a large, curated dataset spanning thousands of articles and an evaluation model that scores factual consistency.
Some measure hallucination/refusal tradeoffs on knowledge tasks at scale: HalluLens is designed to quantify both hallucination and refusal across different task types, including a setup (PreciseWikiQA) where the evaluation focuses on whether an answer is grounded and correct versus fabricated, and separately measures refusal behaviors.
If you put these three measurement families together, a pattern emerges: hallucinations aren’t a constant. They’re a symptom of pressure—pressure to answer, pressure to be fluent, pressure to appear helpful, pressure embedded in how we score and train.
SimpleQA provides an unusually clean view of the “guessing vs abstaining” dynamic because it treats abstention as a first-class outcome, and the questions are constructed to have a single indisputable answer. The paper describes a 4,326-question benchmark whose reference answers are created and independently verified by human trainers, which keeps grading reliable at scale.
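To make that three-way grading concrete, here is a minimal sketch of how per-question grades turn into the rates the benchmark reports. Only the aggregation is shown; how each response gets its label (the grader comparing it against the verified reference answer) is abstracted away.

```python
# Sketch: turning SimpleQA-style per-question grades into reported rates.
# How each label is assigned (the grader) is out of scope here.
from collections import Counter

def aggregate(grades: list[str]) -> dict[str, float]:
    counts = Counter(grades)
    n = len(grades)
    attempted = n - counts["not_attempted"]
    return {
        "accuracy": counts["correct"] / n,
        "error_rate": counts["incorrect"] / n,
        "abstention_rate": counts["not_attempted"] / n,
        # Accuracy among attempted questions rewards knowing what you know:
        # abstaining on hard questions does not drag this number down.
        "accuracy_given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }

print(aggregate(["correct", "incorrect", "not_attempted", "correct"]))
# {'accuracy': 0.5, 'error_rate': 0.25, 'abstention_rate': 0.25, 'accuracy_given_attempted': 0.666...}
```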
OpenAI’s 2025 analysis makes the incentive issue explicit by showing SimpleQA-style metrics in a concrete comparison: in their published example, one model’s behavior yielded an error rate of 75% with only 1% abstentions, while another model abstained 52% of the time and cut errors dramatically, even though its raw accuracy number was, if anything, slightly lower. The key insight is not the absolute percentages; it’s the mechanism. Accuracy-only scoreboards push models toward guessing, and guessing produces hallucinations.
HalluLens (ACL 2025) shows the same dynamic at a broader benchmark level. In the PreciseWikiQA evaluation results, the authors report a range of hallucination rates across models and highlight the refusal/hallucination tradeoff explicitly. One of the most-cited figures from the paper is that GPT-4o has a reported hallucination rate of ~45% “when not refusing” in that setup, while other models show different balances of refusal frequency and hallucination frequency.
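The metric behind that figure is worth spelling out, because it conditions on answered questions: a model cannot improve its “hallucination rate when not refusing” simply by refusing more often, since refusals are reported separately. A minimal sketch of the two rates, with illustrative counts chosen only to roughly reproduce the numbers quoted above:

```python
# Illustrative counts only; the paper reports rates, not these raw numbers.
def rates_when_not_refusing(n_total: int, n_refused: int, n_hallucinated: int) -> dict[str, float]:
    answered = n_total - n_refused
    return {
        "refusal_rate": n_refused / n_total,
        # Conditioned on answered questions, so more refusals don't hide fabrication.
        "hallucination_rate_when_not_refusing": n_hallucinated / answered if answered else 0.0,
    }

print(rates_when_not_refusing(n_total=1000, n_refused=41, n_hallucinated=433))
# {'refusal_rate': 0.041, 'hallucination_rate_when_not_refusing': 0.4515...}
```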
Vectara’s hallucination leaderboard focuses on grounded summarization: the model is supposed to stay faithful to the provided documents. Their methodology notes a dataset of 7,700+ articles and the use of a specialized hallucination evaluation model (HHEM) to score consistency; they publish model-level rates on that task. The important governance lesson is that hallucinations persist even when the model is given the source—because summarization is still generation, and generation still “fills gaps” unless tightly constrained.
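The measurement pattern behind such leaderboards is simple to sketch: score each (source, summary) pair for factual consistency with a judge model, then count summaries that fall below a threshold as hallucinated. The judge is abstracted here as a callable and the threshold is an assumption of the sketch; Vectara’s HHEM plays the judge role on their leaderboard, but any entailment-style checker fits the same shape.

```python
# Sketch of grounded-summarization scoring; `consistency_score` stands in for
# whatever judge model you use (an entailment/consistency classifier).
from typing import Callable

def grounded_hallucination_rate(
    pairs: list[tuple[str, str]],                     # (source_document, model_summary)
    consistency_score: Callable[[str, str], float],   # 1.0 = fully supported by source
    threshold: float = 0.5,
) -> float:
    hallucinated = sum(
        1 for source, summary in pairs if consistency_score(source, summary) < threshold
    )
    return hallucinated / len(pairs)
```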
Taken together, the “solid numbers” answer in late 2025 looks like this: on common factuality benchmarks, error and hallucination rates are substantial, often reaching tens of percent depending on the task and setting, and the measured rate depends heavily on whether the model is allowed to abstain and on whether the evaluation penalizes abstention or punishes confident errors.
OpenAI’s “Why language models hallucinate” argument is blunt: hallucinations persist because we reward behavior that produces them. When evals and scoreboards care mostly about accuracy, a model that guesses can outperform a model that admits uncertainty, even if the guessing model produces many more confident wrong answers. In that environment, “helpfulness” becomes indistinguishable from “always answer,” and “always answer” becomes indistinguishable from “sometimes fabricate.”
The companion paper formalizes this intuition and frames hallucinations as a statistical phenomenon arising from the mismatch between what is predictable from text patterns and what is essentially arbitrary, low-frequency truth. Some facts do not have enough signal in the training distribution to be reliably recovered as “truth,” yet the model is still pressured to output something that appears to be the truth. This is why the “just add more data” executive instinct fails. More data can reduce some errors, but it doesn’t eliminate the structural incentive to guess.
The most important strategic shift is that hallucinations are increasingly treated as a product behavior with downstream harm, not an academic curiosity.
In 2025, courts continued to confront AI-assisted filings polluted by fabricated or incorrect citations. Reuters’ coverage illustrates how judges are responding variably, from formal sanctions to procedural consequences, but with a consistent message: if you submit machine-generated fiction under your name, it’s still your filing.
Defamation and personal harm theories are also being tested. In Walters v. OpenAI, the court granted summary judgment for OpenAI on a defamation claim arising from allegedly false output, and the published order and reporting show how courts may analyze claims about AI output within existing legal frameworks. Even when an AI provider prevails, the fact pattern is instructive for governance: hallucinations about real people convert “model error” into “publication risk.”
Meanwhile, consumer-facing hallucinations have begun to generate public enforcement and policy reaction. The Apple notification summary suspension demonstrates how quickly a hallucination-like failure mode can become a reputational and platform-governance event when packaged as news.
In the EU context, organizations like NOYB have filed complaints alleging that fabricated personal allegations violate expectations of data accuracy. Whether or not every complaint succeeds, the direction is clear: “the system made it up” is increasingly being evaluated through compliance lenses, not just PR lenses.
For strategy, the conclusion is straightforward: if your deployment environment includes regulated advice, legal exposure, personal data, or safety-critical decision support, hallucination mitigation is not an optional enhancement. It’s a core control.
The most effective current mitigation pattern is to stop treating the model as an oracle and start treating it as a generator operating inside a verification loop. In practice, that means reducing the model’s freedom to improvise and increasing the system’s ability to check what was said.
Grounding via retrieval is the first lever: when a model is forced to answer from supplied sources, you reduce open-domain guessing. But retrieval doesn’t solve hallucinations by itself because retrieval can return irrelevant or incomplete material, and models will still “smooth” gaps into fluent conclusions unless they are trained and scored to prefer abstention over invention. This is exactly why grounded summarization benchmarks still show non-trivial inconsistency rates: the model is still rewarded for producing a coherent narrative, and coherence is not the same as truth.
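In code, the verification-loop pattern looks less like “ask the model” and more like a gate around it. The sketch below assumes three placeholder interfaces (a retriever, a generator, and a groundedness check); none of them is a specific vendor API, and a production system would typically verify claim by claim rather than the whole answer at once.

```python
# Generator inside a verification loop: retrieve -> generate -> check -> answer or abstain.
# `retrieve`, `generate`, and `is_supported` are placeholder interfaces, not vendor APIs.
from typing import Callable

def grounded_answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    is_supported: Callable[[str, list[str]], bool],
    abstain_message: str = "I can't answer that from the available sources.",
) -> str:
    passages = retrieve(question)
    if not passages:
        return abstain_message      # no evidence: abstain rather than improvise
    draft = generate(question, passages)
    if not is_supported(draft, passages):
        return abstain_message      # draft not grounded in the evidence: abstain
    return draft
```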
The second lever is explicit abstention and calibration. OpenAI’s 2025 analysis emphasizes that a major driver of hallucinations is evaluation design that punishes uncertainty; if you change scoring to penalize confident errors more heavily and reward appropriate uncertainty, you push models toward safer behavior. The SimpleQA metrics they share are the clearest operational demonstration of this mechanism: abstention can be a safety feature, and low abstention can be a liability feature.
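One way to see why scoring design matters is to write the incentive down. Under an illustrative rule of +1 for a correct answer, -penalty for a wrong one, and 0 for abstaining (the shape of the fix, not any benchmark’s official rubric), a model should answer only when its confidence exceeds penalty / (1 + penalty); under accuracy-only scoring the penalty is 0, so any guess is worth taking.

```python
# Illustrative scoring rule: +1 correct, -penalty wrong, 0 abstain.
# Abstaining scores exactly 0, so answering is rational only when the
# expected score of answering is positive.
def expected_score_of_answering(confidence: float, penalty: float) -> float:
    return confidence * 1.0 - (1.0 - confidence) * penalty

def should_answer(confidence: float, penalty: float) -> bool:
    return expected_score_of_answering(confidence, penalty) > 0.0

print(should_answer(0.30, penalty=0))   # True: accuracy-only scoring rewards any guess
print(should_answer(0.30, penalty=3))   # False: below the 75% threshold, abstain
print(should_answer(0.80, penalty=3))   # True: confident enough to answer
```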
The third lever is task design: the more a prompt invites narrative synthesis, the more you should expect invented connective tissue. HalluLens explicitly measures hallucination and refusal as interacting behaviors; the paper’s reported rates underscore that when models rarely refuse, hallucination rates can spike on long-tail or difficult knowledge questions.
What makes all of this difficult is that the failure is not binary. Models can be partially correct in ways that look entirely right. They can cite a real paper with a wrong year, use a real author with a wrong title, or correctly explain a concept while inventing the attribution. Anthropic’s legal filing citation issue—where a citation was “real” but incorrectly specified—captures the practical reality: even when systems are used for “formatting,” they can introduce subtle factual errors that humans miss because the output looks professionally shaped.
Finally, there is an unavoidable governance constraint: if you demand maximum helpfulness and minimum refusal, you are explicitly choosing higher hallucination risk. Benchmarks increasingly quantify this tradeoff rather than pretending it doesn’t exist.
The future fix is not one technique. It is a system architecture shift: models should not merely “generate answers”; they should generate claims with provenance, each claim attached to a traceable source or an explicit uncertainty label.
In practical terms, that means a standard interface where every answer is internally structured into atomic assertions, each assertion either linked to retrieved evidence, linked to a trusted structured database, or flagged as unverified. Instead of one blob of persuasive text, you get a stack: what’s supported, what’s inferred, what’s unknown. The model becomes a narrator sitting on top of a fact layer, not a storyteller improvising reality.
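One possible shape for that interface, written out as a data structure; the field names and support categories below are illustrative, not a standard.

```python
# Illustrative "claims with provenance" structure: the fluent narrative sits on top
# of atomic assertions, each either linked to evidence or explicitly unverified.
from dataclasses import dataclass, field
from enum import Enum

class Support(Enum):
    RETRIEVED_EVIDENCE = "retrieved_evidence"   # backed by a retrieved passage
    STRUCTURED_SOURCE = "structured_source"     # backed by a trusted database record
    UNVERIFIED = "unverified"                   # model inference, flagged as such

@dataclass
class Claim:
    text: str
    support: Support
    sources: list[str] = field(default_factory=list)   # URLs, document IDs, record keys

@dataclass
class Answer:
    narrative: str          # the persuasive text shown to the user
    claims: list[Claim]     # the auditable fact layer underneath it

    def unverified_claims(self) -> list[Claim]:
        return [c for c in self.claims if c.support is Support.UNVERIFIED]
```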
In parallel, a serious future approach would treat hallucination as an insurance and compliance variable. Imagine enterprise deployments where answers above a certain risk tier require multi-model consensus plus external verification, and where organizations maintain auditable logs of “claims emitted” vs “claims verified” similar to how financial controls track approvals. The model is no longer a single point of failure; it becomes a component in a controlled decision pipeline.
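A sketch of that control, again with illustrative policy choices: the risk tiers, the two-of-three consensus rule, and the log fields are assumptions, not an existing standard.

```python
# Illustrative risk-tier gate with an audit trail of claims emitted vs. verified.
from dataclasses import dataclass

@dataclass
class AuditRecord:
    claim: str
    risk_tier: str      # e.g. "low" or "high"
    emitted: bool
    verified: bool

def gate_claim(claim: str, risk_tier: str, model_votes: list[bool],
               externally_verified: bool, log: list[AuditRecord]) -> bool:
    """Decide whether a claim may be shown to the user; log the outcome either way."""
    if risk_tier == "high":
        verified = sum(model_votes) >= 2 and externally_verified   # consensus + check
        emitted = verified
    else:
        verified = False     # low-risk claims pass without verification...
        emitted = True       # ...but still land in the audit log
    log.append(AuditRecord(claim, risk_tier, emitted, verified))
    return emitted

audit_log: list[AuditRecord] = []
gate_claim("Drug X interacts with drug Y", "high",
           model_votes=[True, True, False], externally_verified=False, log=audit_log)
print(audit_log[-1])   # emitted=False, verified=False: blocked and recorded
```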
The philosophical point is simple: hallucinations shrink when truth becomes cheaper than performance. That’s not a model problem. It’s an incentive design problem.
On a task where the model must summarize a provided short document (so there is a clear reference), Vectara’s hallucination leaderboard reports the best listed model at 1.8% hallucination rate (last updated December 18, 2025). This number is specifically “how often the summary introduces information not supported by the source text,” not “how often the model lies in open chat.”
HalluLens reports GPT-4o at 45.15% hallucination rate when not refusing on the PreciseWikiQA task, while also noting GPT-4o rarely refuses (4.13%) in that setup—illustrating the core trade-off: fewer refusals can correlate with more fabricated answers under uncertainty.
OpenAI’s 2025 analysis includes a SimpleQA example where o4-mini shows 1% abstention and 75% error rate, while gpt-5-thinking-mini shows 52% abstention and 26% error rate—a concrete demonstration that optimizing for “never say ‘I don’t know’” can drive dramatically higher wrong-answer rates.
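Putting those two sets of numbers side by side shows why the scoreboard matters. Since correct, incorrect, and not-attempted responses account for every question, accuracy follows by subtraction, and ranking by accuracy alone quietly prefers the model making roughly three times as many confident errors.

```python
# Accuracies derived from the reported error and abstention rates,
# assuming correct + incorrect + not attempted = 100% of questions.
models = {
    "o4-mini":             {"error": 0.75, "abstention": 0.01},
    "gpt-5-thinking-mini": {"error": 0.26, "abstention": 0.52},
}

for name, m in models.items():
    accuracy = 1.0 - m["error"] - m["abstention"]
    print(f"{name}: accuracy = {accuracy:.0%}, confident errors = {m['error']:.0%}")

# o4-mini:             accuracy = 24%, confident errors = 75%
# gpt-5-thinking-mini: accuracy = 22%, confident errors = 26%
```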
SimpleQA is intentionally built to be easy to grade and hard to “game”: it contains 4,326 short, fact-seeking questions, uses human trainers to create and independently verify answers, and grades each response as correct / incorrect / not attempted. That design makes it useful for measuring “does the model know what it knows?” rather than long-form fluency.
These figures are all legitimate—but they measure different things. “Hallucination rate” is not one global constant; it is a function of task design, grounding, and whether the system is allowed (and incentivized) to abstain.