Hallucination Detection
Score or flag generated text for factuality and grounding.
How Hallucination Detection Works
A technical deep-dive into LLM hallucinations: what they are, how to detect them with real benchmark data, production code examples, and a framework for choosing the right detection method. Includes real hallucination rates for 87 models from the Vectara HHEM leaderboard (March 2026) and historical trends from 2024-2026.
What is Hallucination?
When an LLM generates content that is factually incorrect, unsupported by context, or entirely fabricated - yet presents it with the same confidence as accurate information.
LLMs do not "know" things the way humans do. They are statistical pattern matchers trained on text. When the patterns suggest a plausible-sounding completion, the model generates it - regardless of factual accuracy. The model has no internal fact-checker, no way to say "I am making this up."
Two Dimensions of Correctness
Factuality: Does the output align with real-world facts and knowledge?
Faithfulness: Does the output accurately reflect the provided context/source document?
For RAG systems and document-grounded tasks, you have the source text. This makes verification tractable: check if each claim in the output is supported by the source. Factual verification is harder because you need access to ground truth knowledge, which may be vast or unavailable.
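The claim-by-claim support check described above can be illustrated with a deliberately naive token-overlap heuristic. This is a toy sketch, not a production method: real systems use NLI models or LLM judges (covered later in this article), and the helper names here are my own.

```python
def _tokens(text: str) -> set[str]:
    """Lowercase tokens with trailing punctuation stripped."""
    return {w.strip(".,") for w in text.lower().split()}

def support_score(source: str, claim: str) -> float:
    """Fraction of the claim's content words found in the source (0.0-1.0).
    Toy proxy for 'is this claim supported by the source document?'"""
    stop = {"the", "a", "an", "of", "in", "is", "was", "to", "and"}
    source_words = _tokens(source)
    claim_words = [w for w in _tokens(claim) if w not in stop]
    if not claim_words:
        return 0.0
    return sum(w in source_words for w in claim_words) / len(claim_words)

source = "TechCorp reported Q3 revenue of $45.2 million, up 12% year-over-year."
print(support_score(source, "Revenue was $45.2 million in Q3"))  # → 1.0 (supported)
print(support_score(source, "The CEO praised the results"))      # → 0.0 (unsupported)
```

Overlap scoring cannot catch contradictions (a claim of "$54.2 million" would still overlap heavily), which is exactly why the methods below use entailment rather than similarity.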
Types of Hallucinations
Not all hallucinations are equal. Understanding the taxonomy helps choose the right detection method.
Factual hallucination: The model states something factually incorrect that contradicts real-world knowledge.
Faithfulness hallucination: The model's output contradicts, or is not supported by, the provided context/source.
Intrinsic vs Extrinsic Hallucinations
Intrinsic: The output directly contradicts the source material provided in the prompt.
Extrinsic: The output includes information that cannot be verified from the source (and may or may not be true).
Interactive Demo: Spot the Hallucinations
Compare a faithful response to one with hallucinations. The sample response below mixes supported claims with fabricated details.
The company reported exceptional Q3 2024 results with revenue of $52 million, a 15% increase from last year. Net income was $8.1 million. The company, founded in 2015, now employs over 500 people at its San Francisco headquarters. They launched two new product lines and expanded to 5 European countries. CEO John Martinez expressed optimism about future growth.
Detection Methods
Five approaches to detecting hallucinations, each with different tradeoffs in accuracy, cost, and speed.
- NLI Entailment: For each claim in the output, classify as ENTAILMENT, NEUTRAL, or CONTRADICTION against the source.
- SelfCheck (consistency sampling): Generate N responses at high temperature and measure agreement via BERTScore or NLI. Factual content is consistent; hallucinations vary.
- Retrieval Verification: Extract claims -> embed with sentence-transformers -> retrieve from FAISS/vector store -> NLI verification.
- RAGAS (LLM-as-judge): Decompose the answer into claims, verify each against context using an LLM judge (e.g., GPT-4 or Claude).
- FActScore: Decompose text into atomic facts -> retrieve Wikipedia evidence per fact -> classify supported/unsupported.

Decision Guide: Choose Your Detection Method
Answer the questions below to find the best hallucination detection approach for your use case.
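The decision guide can be approximated as a rule-of-thumb function. The branching criteria below are my own distillation of the method descriptions in this article (not an official tree); the per-1K costs come from the Quick Reference section.

```python
def choose_method(
    has_source: bool,          # do you have the grounding document?
    has_knowledge_base: bool,  # can you retrieve evidence from a KB?
    latency_sensitive: bool,   # must the check run inline, per request?
    budget_per_1k: float,      # detection budget in USD per 1K checks
) -> str:
    """Rule-of-thumb method selection, sketched from this article's guide."""
    if has_source:
        if latency_sensitive or budget_per_1k < 0.10:
            return "NLI entailment (fast, ~$0.02/1K)"
        return "RAGAS / LLM-as-judge (higher accuracy, ~$1.20/1K)"
    if has_knowledge_base:
        return "Retrieval + NLI verification (~$0.50/1K)"
    if budget_per_1k >= 3.0:
        return "FActScore atomic-fact verification (~$3.00/1K)"
    return "SelfCheck consistency sampling (no external evidence needed)"

print(choose_method(has_source=True, has_knowledge_base=False,
                    latency_sensitive=True, budget_per_1k=0.05))
```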
Method Comparison (all 5 methods across 5 dimensions)
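Of the five methods compared above, SelfCheck-style consistency sampling is the only one that needs no external evidence. Here is a minimal sketch: `sample_responses` is a stub standing in for N high-temperature LLM calls, and `difflib.SequenceMatcher` stands in for the BERTScore/NLI agreement scoring the real method uses.

```python
from difflib import SequenceMatcher

def sample_responses(prompt: str, n: int = 5) -> list[str]:
    # Stub: in practice, call your LLM n times at temperature ~1.0.
    return [
        "The bridge opened in 1937 and spans 2.7 km.",
        "The bridge opened in 1937.",
        "Opened in 1937, the bridge spans 2.7 km.",
        "The bridge opened in 1937 and spans 2.7 km.",
        "The bridge was completed in 1942 and spans 4 km.",  # one inconsistent sample
    ][:n]

def consistency_score(claim: str, samples: list[str]) -> float:
    """Mean string similarity of a claim against independently sampled responses.
    A low score means the claim is unstable across samples: a hallucination signal."""
    sims = [SequenceMatcher(None, claim, s).ratio() for s in samples]
    return sum(sims) / len(sims)

samples = sample_responses("When did the bridge open, and how long is it?")
for claim in ["The bridge opened in 1937.", "The bridge cost $4 billion."]:
    print(f"{consistency_score(claim, samples):.2f}  {claim}")
```

The stable claim scores well above the fabricated one; production implementations replace the string-similarity proxy with per-sentence NLI, as in the SelfCheckGPT line of work.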
Real Hallucination Rates (Vectara HHEM, March 2026)
Live data from the Vectara Hallucination Leaderboard — the industry-standard benchmark using HHEM-2.3 on 7,700+ articles. Measures summarization factual consistency: what % of model summaries contain hallucinated claims.
87 models evaluated. Best: AntGroup Finix S1 32B at 1.8%. Worst in top tier: Phi-4 Mini at 23.5%. Reasoning models (o3-pro, o4-mini) consistently hallucinate more than their standard counterparts.
Left: Best hallucination rate dropped from 8.5% (early 2024) to 1.8% (March 2026) — a 4.7x improvement in 2 years. Right: Reasoning models (o3-pro 23.3%, o4-mini 18.6%, DeepSeek R1 11.3%) consistently hallucinate 2-3x more than standard variants on summarization tasks.
Full Leaderboard Snapshot (Top 29 models)
Source: github.com/vectara/hallucination-leaderboard, updated March 20 2026
| # | Model | Family | Halluc. Rate | Consistency | Year |
|---|---|---|---|---|---|
| 1 | Finix S1 32B | AntGroup | 1.8% | 98.2% | 2026 |
| 2 | GPT-5.4 Nano | OpenAI | 3.1% | 96.9% | 2026 |
| 3 | Gemini 2.5 Flash Lite | Google | 3.3% | 96.7% | 2026 |
| 4 | Phi-4 | Microsoft | 3.7% | 96.3% | 2025 |
| 5 | Llama 3.3 70B | Meta | 4.1% | 95.9% | 2025 |
| 6 | Mistral Large 2411 | Mistral | 4.5% | 95.5% | 2024 |
| 7 | Qwen 3 8B | Alibaba | 4.8% | 95.2% | 2025 |
| 8 | Nova Pro | Amazon | 5.1% | 94.9% | 2025 |
| 9 | DeepSeek V3.2 Exp | DeepSeek | 5.3% | 94.7% | 2026 |
| 10 | GPT-4.1 | OpenAI | 5.6% | 94.4% | 2025 |
| 11 | Grok-3 | xAI | 5.8% | 94.2% | 2025 |
| 12 | DeepSeek V3 | DeepSeek | 6.1% | 93.9% | 2025 |
| 13 | Gemini 2.5 Pro | Google | 7.0% | 93.0% | 2025 |
| 14 | Llama 4 Scout | Meta | 7.7% | 92.3% | 2026 |
| 15 | Gemini 2.5 Flash | Google | 7.8% | 92.2% | 2025 |
| 16 | GPT-4o | OpenAI | 9.6% | 90.4% | 2024 |
| 17 | Claude Haiku 4.5 | Anthropic | 9.8% | 90.2% | 2025 |
| 18 | Claude Sonnet 4 | Anthropic | 10.3% | 89.7% | 2025 |
| 19 | Claude Sonnet 4.6 | Anthropic | 10.6% | 89.4% | 2026 |
| 20 | DeepSeek R1 | DeepSeek | 11.3% | 88.7% | 2025 |
| 21 | Claude Opus 4 | Anthropic | 12.0% | 88.0% | 2025 |
| 22 | Claude Opus 4.6 | Anthropic | 12.2% | 87.8% | 2026 |
| 23 | Gemini 3 Pro | Google | 13.6% | 86.4% | 2026 |
| 24 | Mistral 3 Large | Mistral | 14.5% | 85.5% | 2025 |
| 25 | GPT-5 High | OpenAI | 15.1% | 84.9% | 2025 |
| 26 | o4-mini | OpenAI | 18.6% | 81.4% | 2025 |
| 27 | Grok-4 Fast | xAI | 19.7% | 80.3% | 2026 |
| 28 | o3-pro | OpenAI | 23.3% | 76.7% | 2025 |
| 29 | Phi-4 Mini | Microsoft | 23.5% | 76.5% | 2025 |
Key Trends: 2024 → 2026
Best hallucination rate dropped from 8.5% (early 2024, GPT-3.5 era) to 1.8% (March 2026, Finix S1 32B). Average top-5 rate improved from 12.3% to 3.4%. The number of evaluated models grew from 22 to 87.
Counterintuitively, reasoning/chain-of-thought models are worse at summarization fidelity. o3-pro hallucination rate: 23.3% vs GPT-4.1 at 5.6%. DeepSeek R1: 11.3% vs DeepSeek V3: 6.1%. Extended reasoning may introduce fabricated intermediate steps.
Phi-4 (14B) achieves 3.7% — better than GPT-4o (9.6%) at a fraction of the size. Qwen 3 8B achieves 4.8%. Specialized training for factual consistency matters more than raw parameter count.
AntGroup (1.8%), Alibaba/Qwen (4.8%), DeepSeek (5.3%) dominate the top of the leaderboard. Their focus on factual grounding during training appears to outperform Western labs' RLHF approaches on this particular benchmark.
Benchmarks and Evaluation
Standard datasets and metrics for measuring hallucination rates, with top model scores from published results.
| Benchmark | Type | Size | Metric | Top Model Score | Paper |
|---|---|---|---|---|---|
| TruthfulQA | Factuality | 817 questions | % Truthful + Informative | GPT-4-Turbo: 64.1% | Lin et al., 2022 |
| HaluEval | Detection | 35K samples | Detection Accuracy | GPT-4: 87.2% | Li et al., 2023 |
| FActScore | Biography | 500+ bios | % Supported Atomic Facts | RAG GPT-4: 82.7% | Min et al., 2023 |
| FEVER | Verification | 185K claims | Label Accuracy | DeBERTa-v3: 91.3% | Thorne et al., 2018 |
| SummEval | Summarization | 2800 summaries | Faithfulness Score | GPT-4: 0.89 | Fabbri et al., 2021 |
Mitigation Strategies
Retrieval-Augmented Generation (RAG): Ground LLM responses in retrieved documents. RAG-augmented GPT-4 scores 82.7% on FActScore vs 73.1% without retrieval.
Chain-of-Thought Prompting: Step-by-step reasoning surfaces errors earlier. Most effective on multi-hop reasoning tasks, less so on factual recall.
Self-Consistency Sampling: Sample N responses, then majority-vote or filter by consistency. Reduces hallucination rates significantly on summarization tasks.
Uncertainty Expression: Train the model to express uncertainty ("I'm not sure, but...") when confidence is low. RLHF-trained models do this better.
Citation Requirements: Force the model to cite a source for each claim. Enables post-hoc verification. Perplexity and Google SGE use this pattern.
Knowledge-Cutoff Awareness: The system prompt declares the training cutoff date, so the model can refuse to answer about later events instead of hallucinating.
Production Code Examples
Complete, runnable implementations of each detection method using real libraries. Each example includes proper typing, error handling, and realistic output.
# pip install transformers torch spacy
# python -m spacy download en_core_web_sm
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import spacy
from typing import TypedDict

class ClaimResult(TypedDict):
    claim: str
    label: str  # ENTAILMENT | NEUTRAL | CONTRADICTION
    confidence: float
    is_hallucination: bool

# Use DeBERTa-v3 (not BART-MNLI) — purpose-built for NLI,
# consistently outperforms BART on MNLI and ANLI benchmarks
MODEL = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
nlp = spacy.load("en_core_web_sm")

LABELS = ["CONTRADICTION", "ENTAILMENT", "NEUTRAL"]
HALLUCINATION_THRESHOLD = 0.75  # flag if contradiction confidence > 75%

def decompose_claims(text: str) -> list[str]:
    """Split text into individual claims using spaCy sentence segmentation.
    Each sentence is treated as a separate claim to verify."""
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 10]

def check_claims_batch(
    source: str,
    claims: list[str],
    threshold: float = HALLUCINATION_THRESHOLD,
) -> list[ClaimResult]:
    """Batch-verify multiple claims against a source document.
    Returns per-claim verdicts with confidence scores."""
    results: list[ClaimResult] = []
    # Tokenize all (premise, hypothesis) pairs at once for efficient GPU batching
    pairs = [(source, claim) for claim in claims]
    encoded = tokenizer.batch_encode_plus(
        pairs,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**encoded).logits
        probs = torch.softmax(logits, dim=-1)
    for i, claim in enumerate(claims):
        label_idx = probs[i].argmax().item()
        label = LABELS[label_idx]
        confidence = probs[i][label_idx].item()
        results.append({
            "claim": claim,
            "label": label,
            "confidence": round(confidence, 3),
            "is_hallucination": (
                label == "CONTRADICTION" and confidence > threshold
            ),
        })
    return results

# --- Example usage ---
source = """TechCorp reported Q3 2024 revenue of $45.2 million,
a 12% increase year-over-year. Founded in 2018, headquarters
in Austin, Texas. Employees: 342."""

generated = """TechCorp had revenue of $45.2 million in Q3 2024,
growing 15% from last year. The Austin-based company now employs
over 500 people. CEO John Martinez praised the results."""

claims = decompose_claims(generated)
results = check_claims_batch(source, claims)
for r in results:
    flag = " ** HALLUCINATION **" if r["is_hallucination"] else ""
    print(f"[{r['label']} {r['confidence']:.0%}] {r['claim']}{flag}")

# Example output (exact scores vary by model version):
# [ENTAILMENT 0.94] TechCorp had revenue of $45.2 million in Q3 2024,
#   growing 15% from last year.
# [CONTRADICTION 0.89] The Austin-based company now employs over 500
#   people. ** HALLUCINATION **
# [NEUTRAL 0.82] CEO John Martinez praised the results.
# Note: the first claim slips through. Sentence-level NLI misses the
# 15% vs 12% growth discrepancy, a known weakness of coarse claim splits.

Quick Reference
- Factual: contradicts world knowledge
- Faithfulness: contradicts provided source
- Intrinsic: direct contradiction of input
- Extrinsic: unverifiable addition

- NLI: entailment checking (~$0.02/1K)
- SelfCheck: consistency sampling (~$2.50/1K)
- RAGAS: RAG eval framework (~$1.20/1K)
- FActScore: atomic fact verification (~$3.00/1K)
- Retrieval: KB-grounded verification (~$0.50/1K)

- Best: Finix S1 32B — 1.8%
- GPT-5.4 Nano — 3.1%
- Phi-4 — 3.7%
- Llama 3.3 70B — 4.1%
- GPT-4o — 9.6%
- o3-pro — 23.3% (reasoning)

1. Best hallucination rate improved from 8.5% (2024) to 1.8% (2026) — a 4.7x improvement in 2 years
2. Reasoning models (o3-pro, o4-mini, R1) hallucinate 2-3x more than standard models on summarization tasks
3. Small specialized models (Phi-4 3.7%, Qwen 3 8B 4.8%) now beat larger general models (GPT-4o 9.6%)
4. Chinese labs (AntGroup, Alibaba, DeepSeek) lead the leaderboard on factual consistency benchmarks
Use Cases
- ✓ RAG answer validation
- ✓ Safety review
- ✓ Content QA
- ✓ Model evaluation
Architectural Patterns
Retrieval Entailment
Compare generation against retrieved evidence with NLI.
Self-Check Ensembles
Ask multiple models/queries and vote on consistency.
Quick Facts
- Input: Text
- Output: Structured Data
- Implementations: 3 open source, 0 API
- Patterns: 2 approaches