Why language models hallucinate—and how OpenAI is changing evaluations to boost reliability
Sources: https://openai.com/index/why-language-models-hallucinate, OpenAI
TL;DR
- Hallucinations occur when language models confidently output false information, even as they become more capable.
- Standard accuracy-centric evaluations reward guessing, which can drive models to provide confident but incorrect answers.
- OpenAI proposes uncertainty-aware evaluation and scoring to deter confident errors and reward appropriate abstentions.
- GPT‑5 shows fewer hallucinations than earlier models, yet hallucinations persist; improving evaluation is key to broader reductions.
- A structured approach—combining abstention, uncertainty signaling, and calibrated scoring—could advance safer, more reliable AI deployment.
Context and background
OpenAI reports ongoing work to make AI systems more useful and reliable, acknowledging that hallucinations remain a stubborn challenge for language models. Hallucinations are defined as plausible but false statements generated by models, and even widely used systems like ChatGPT exhibit this behavior. The company notes that GPT‑5 produces significantly fewer hallucinations, particularly when reasoning, but they still occur. The broader point is that hallucinations are a fundamental challenge for large language models, and reducing them requires changes beyond improving model scale alone.
A central claim of OpenAI’s new research is that hallucinations are not solely a problem of data quality or model size; they are exacerbated by the incentives embedded in standard training and evaluation procedures. In practice, many evaluations measure accuracy (the proportion of questions answered correctly) rather than whether the model should abstain or acknowledge uncertainty. The paper argues that this incentive structure encourages models to guess rather than refrain from answering when unsure. A simple analogy makes the point: on a multiple-choice test, leaving a question blank guarantees zero points, while a guess might get lucky and earn credit. Over thousands of questions, this dynamic biases models toward confident but incorrect outputs.
OpenAI distinguishes three categories of responses for questions with a single correct answer: accurate responses, errors, and abstentions (the model does not hazard a guess). Abstaining is framed as humility, a core value for the organization. Yet most scoreboards prioritize accuracy, even though errors are worse than abstentions. This framing motivates evaluation schemes that reward uncertainty and clarification when appropriate and penalize confident errors more heavily.
A concrete example discussed in the paper compares models on the SimpleQA evaluation. The table contrasts GPT‑5 thinking mini with OpenAI o4‑mini and highlights how strategic guessing can boost apparent accuracy while increasing errors and, more importantly, hallucinations. The takeaway is that accuracy alone cannot fully capture a model’s reliability in real-world use, where some questions have no definitive answer or require calibration and context.
The authors also provide a broader rationale for their approach. They argue that the root cause of certain hallucinations lies in the data distribution encountered during pretraining. Language models learn by predicting the next word in vast corpora of text; the text is never labeled as true or false. As a result, models must approximate a distribution of fluent language without explicit negative labels, making it hard to distinguish valid from invalid statements. This creates a propensity for low-frequency facts, such as a person’s birthday, to be misremembered or fabricated when patterns alone cannot anchor the truth. The paper clarifies that while spurious outputs such as misspellings or punctuation errors tend to diminish with scale, arbitrary low-frequency facts can still produce hallucinations, and post‑pretraining stages do not fully eliminate these issues for reasons described in the work. The authors present their analysis as a statistical lens on where hallucinations come from and how evaluation shapes model behavior. OpenAI’s stance is not merely diagnostic; it calls for a concrete shift in how models are evaluated and how success is defined.
The authors argue that fixing scoreboards—so that uncertainty is recognized and rewarded—can broaden the adoption of hallucination-reduction techniques, both new and those from prior research. The aim is to move beyond the dichotomy of right versus wrong to a spectrum that includes appropriate expressions of uncertainty and requests for clarification.
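To make the guessing incentive described above concrete, here is a minimal back-of-the-envelope sketch of the multiple-choice analogy. The four-option setup and point values are illustrative assumptions, not figures from the paper:

```python
# Illustrative only: expected score of guessing vs. abstaining under
# accuracy-only grading, assuming a 4-option multiple-choice question
# answered purely at random. These numbers are assumptions, not from the paper.

num_options = 4                      # assumed number of answer choices
p_correct_guess = 1 / num_options    # chance a blind guess happens to be right

expected_score_guess = p_correct_guess * 1 + (1 - p_correct_guess) * 0
expected_score_abstain = 0           # leaving the question blank always scores zero

print(f"Guessing:   {expected_score_guess:.2f} expected points")
print(f"Abstaining: {expected_score_abstain:.2f} expected points")
# Over thousands of questions, the guessing strategy strictly dominates under
# accuracy-only scoring, even though every wrong guess is a confident error.
```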
What’s new
The core contribution of the paper is to connect hallucinations with evaluation incentives through a statistical lens. The authors argue that current accuracy-based benchmarks drive models to guess, which in turn increases the likelihood of confident errors and hallucinations. They propose a straightforward fix: penalize confident errors more than uncertainty, and provide partial credit for appropriate abstention or cautious language.
This stance is not presented as wholly novel: negative marking and partial credit have appeared in standardized tests and prior research. The authors' point is that widely used accuracy-based evaluations remain the dominant force on leaderboards and model cards, and that updating the main scoreboards to discourage guessing is essential for broader, real-world reductions in hallucinations. In their view, simply adding a few uncertainty-focused tests is insufficient; the primary evaluation framework itself must change to align incentives with honesty about uncertainty.
The analysis also revisits the training dynamics that generate hallucinations. Pretraining targets a next-word distribution from large-scale text without explicit labels for “true” or “false.” Because some statements are inherently uncertain or impossible to determine from available information, the model’s tendency to guess can produce confident but incorrect outputs. The authors argue that better post‑pretraining safeguards can mitigate some of these issues, but a reliable long-term solution requires rethinking evaluation and scoring. GPT‑5 is highlighted as having fewer hallucinations relative to earlier iterations, particularly in reasoning tasks, but the authors stress that hallucinations remain a persistent risk across all large language models. They emphasize that progress will come from both model improvements and, crucially, evaluation reforms that reward calibrated language and discourage blind guessing.
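One way to operationalize "penalize confident errors more than uncertainty" is a confidence-threshold rule in the spirit of negative marking. The threshold t and the penalty t/(1−t) below are illustrative assumptions, not the paper's exact specification:

```python
# Sketch of a confidence-aware scoring rule (illustrative assumption): correct
# answers earn 1 point, abstentions earn 0, and wrong answers are penalized so
# that answering only pays off when the model's confidence exceeds a threshold t.

def score(outcome: str, t: float = 0.75) -> float:
    """outcome is 'correct', 'wrong', or 'abstain'."""
    penalty = t / (1 - t)  # chosen so that guessing breaks even exactly at confidence t
    return {"correct": 1.0, "abstain": 0.0, "wrong": -penalty}[outcome]

# Break-even check: at confidence p, the expected value of answering is
#   p * 1 + (1 - p) * (-t / (1 - t)),
# which is positive only when p > t, so a calibrated model should abstain below t.
p, t = 0.6, 0.75
ev_answer = p * score("correct", t) + (1 - p) * score("wrong", t)
print(f"EV of answering at confidence {p:.2f}: {ev_answer:.2f} (abstaining scores 0.00)")
```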
Why it matters (impact for developers/enterprises)
For developers and enterprises, the paper outlines a practical pathway to safer, more trustworthy AI systems. If evaluation methods are redesigned to penalize confident errors and reward humility, models will learn to withhold judgment when information is uncertain and to ask clarifying questions when needed. This has several concrete implications:
- Safer deployment: systems are less likely to provide confidently incorrect information in critical domains, reducing risk for users and organizations.
- Better user experience: abstention and request for clarification can improve transparency and reliability, especially in complex or ambiguous scenarios.
- Clearer compliance signals: uncertainty-aware outputs can align with governance and risk management requirements that favor cautious, well-supported answers.
- Adoption of reduction techniques: more robust evaluation can accelerate the adoption of existing and new hallucination-reduction techniques by ensuring that benchmarks reflect real-world reliability, not just raw accuracy.
The authors also underscore that reductions in hallucinations do not eliminate the need for careful model use and human oversight in high-stakes applications. Instead, they see a path toward more reliable AI that combines improved pretraining with evaluation reforms and calibrated responses that acknowledge uncertainty.
Technical details and implementation
A central technical thread is the mismatch between how models are trained and how they are evaluated. Pretraining teaches models to predict the next word from large text corpora, with no explicit negative labels. Consequently, arbitrary low-frequency facts can become hallucinations because patterns alone cannot reliably anchor truth. The authors argue that later-stage training can reduce errors but is not sufficient to eliminate them, especially for facts that require current knowledge or context beyond the training data. The proposed implementation focuses on two levers:
- Evaluation redesign: move beyond accuracy as the sole objective. Introduce scoring that penalizes confident errors more than uncertainty and offers partial credit for appropriate abstention or cautious language (a scoring sketch appears at the end of this section).
- Uncertainty signaling: encourage models to present uncertainty, ask clarifying questions, or provide conditional answers when information is insufficient to determine a single truth.
A practical example presented in the work uses the SimpleQA evaluation to illustrate the trade-offs between abstention, accuracy, and error rates. The table compares GPT‑5 thinking mini with OpenAI o4‑mini, showing that higher abstention can correlate with lower error rates yet a different accuracy profile. The broader point is that high accuracy on a narrow benchmark can mask a higher rate of confident errors when models are deployed in the wild. The authors reference the Model Spec, which endorses indicating uncertainty or seeking clarification as a preferred strategy over confidently asserting an uncertain fact. They also discuss broader research on uncertainty-aware evaluations that account for calibration and uncertainty quantification. In their view, updating the main scoreboards to discourage guessing is a practical, scalable step toward broader adoption of uncertainty-aware methodologies.
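As a rough sketch of the evaluation-redesign lever, the snippet below scores a set of results by accuracy, error rate, abstention rate, and a penalized composite. The Result structure, the penalty weight, and the two hypothetical models' numbers are assumptions for illustration; they are not SimpleQA data or an OpenAI scoring specification:

```python
# Minimal sketch of an uncertainty-aware scorer (field names and penalty weight
# are illustrative assumptions). Each item records whether the model answered
# and, if so, whether the answer was right.

from dataclasses import dataclass

@dataclass
class Result:
    answered: bool   # False means the model abstained ("I don't know", etc.)
    correct: bool    # only meaningful when answered is True

def summarize(results: list[Result], wrong_penalty: float = 2.0) -> dict:
    n = len(results)
    correct = sum(r.answered and r.correct for r in results)
    wrong = sum(r.answered and not r.correct for r in results)
    abstained = n - correct - wrong
    return {
        "accuracy": correct / n,          # the usual leaderboard number
        "error_rate": wrong / n,          # confident errors, i.e. hallucination risk
        "abstention_rate": abstained / n,
        # Penalized score: rewards correct answers, ignores abstentions,
        # and charges extra for confident errors.
        "penalized_score": (correct - wrong_penalty * wrong) / n,
    }

# Two hypothetical models over 100 questions: one always guesses, one abstains when unsure.
guesser = [Result(True, True)] * 24 + [Result(True, False)] * 76
cautious = [Result(True, True)] * 22 + [Result(True, False)] * 26 + [Result(False, False)] * 52
print(summarize(guesser))   # slightly higher accuracy, but far more confident errors
print(summarize(cautious))  # more abstentions, fewer errors, better penalized score
```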
Key takeaways
- Hallucinations arise in part because evaluation incentives reward guessing over acknowledging uncertainty.
- Accuracy-only benchmarks can mask the prevalence of confident errors and other forms of hallucinations.
- A practical fix is to penalize confident errors more than uncertainty, with partial credit for appropriate abstention.
- Model improvements (e.g., GPT‑5) reduce hallucinations but do not eliminate them; evaluation reform is essential for further gains.
- Implementing uncertainty signaling and reformed benchmarks can support safer deployment and broader adoption of hallucination-reduction techniques.
FAQ
- What causes hallucinations according to the paper?
Hallucinations are driven by evaluation incentives that reward guessing rather than acknowledging uncertainty, combined with pretraining on next-word prediction without explicit truth labels.
- How do current evaluations influence model behavior?
Accuracy-focused benchmarks encourage models to guess, which can increase confident errors and hallucinations, especially on questions without a clear right answer.
- What is the proposed fix?
Penalize confident errors more than uncertainty, and provide partial credit for leaving questions blank or expressing uncertainty, effectively rewarding calibrated responses.
- How do newer models compare to older ones in terms of hallucinations?
GPT‑5 has significantly fewer hallucinations, particularly in reasoning tasks, but hallucinations still occur; ChatGPT also hallucinates.
- What is the SimpleQA example illustrating?
It demonstrates how strategies that maximize short-term accuracy can coincide with higher error and hallucination rates, highlighting the need for uncertainty-aware evaluation.