
Hallucination Detection

Score or flag generated text for factuality and grounding.

How Hallucination Detection Works

A technical deep-dive into LLM hallucinations: what they are, how to detect them with real benchmark data, production code examples, and a framework for choosing the right detection method. Includes real hallucination rates for 87 models from the Vectara HHEM leaderboard (March 2026) and historical trends from 2024-2026.

1

What is Hallucination?

When an LLM generates content that is factually incorrect, unsupported by context, or entirely fabricated - yet presents it with the same confidence as accurate information.

The Core Problem

LLMs do not "know" things the way humans do. They are statistical pattern matchers trained on text. When the patterns suggest a plausible-sounding completion, the model generates it - regardless of factual accuracy. The model has no internal fact-checker, no way to say "I am making this up."

Why This Matters
Hallucinations are indistinguishable from accurate responses without verification. In high-stakes domains (medical, legal, financial), this is dangerous.
Root Cause
LLMs optimize for plausible next-token prediction, not truth. Training on internet text includes errors, contradictions, and outdated information.

Two Dimensions of Correctness

Factual Correctness

Does the output align with real-world facts and knowledge?

Question: "What is the capital of France?"
Correct: "Paris"
Hallucination: "Lyon"
Verification requires external knowledge (world facts, databases, documents)
Faithfulness to Source

Does the output accurately reflect the provided context/source document?

Context: "The meeting is at 3pm"
"When is the meeting?"
Faithful: "3pm"
Hallucination: "2pm"
Verification only requires the source document (RAG, summarization tasks)
Key Insight: Faithfulness is Easier to Verify

For RAG systems and document-grounded tasks, you have the source text. This makes verification tractable: check if each claim in the output is supported by the source. Factual verification is harder because you need access to ground truth knowledge, which may be vast or unavailable.
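The asymmetry can be shown with a toy check. This is a deliberately naive sketch (token containment, not a real NLI model; the function names are illustrative): faithfulness verification needs nothing beyond the source string, while answering the factual question would require an external knowledge source.

```python
import re

STOPWORDS = {"the", "is", "at", "a", "an", "of", "in", "on"}

def content_tokens(text: str) -> set[str]:
    """Lowercased words/numbers, minus common stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def is_faithful(claim: str, source: str) -> bool:
    """Naive support check: every content token of the claim appears in the
    source. Real systems use NLI models, but the point stands: only the
    source text is needed, no external knowledge."""
    return content_tokens(claim) <= content_tokens(source)

source = "The meeting is at 3pm"
print(is_faithful("The meeting is at 3pm", source))   # True  (faithful)
print(is_faithful("The meeting is at 2pm", source))   # False (hallucination)
```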

2

Types of Hallucinations

Not all hallucinations are equal. Understanding the taxonomy helps choose the right detection method.

Factual Hallucination

The model states something factually incorrect that contradicts real-world knowledge

High Severity
Prompt: "When did the Eiffel Tower open?"
Output: "The Eiffel Tower opened in 1921 to celebrate France's victory in World War I."
Issue: The Eiffel Tower opened in 1889 for the World's Fair, not 1921.
Faithfulness Hallucination

The model's output contradicts or is not supported by the provided context/source

High Severity
Context: "The agreement specifies a total payment of $50,000."
Prompt: "Based on the document, what was the contract value?"
Output: "According to the document, the contract value was $75,000."
Issue: Output contradicts the source document (50K vs 75K).
Intrinsic Hallucination

The output directly contradicts the source material provided in the prompt

High Severity
Prompt: "Summarize: 'The study found no significant correlation between X and Y.'"
Output: "The study demonstrated a strong correlation between X and Y."
Issue: Direct contradiction of the source text.
Extrinsic Hallucination

The output includes information that cannot be verified from the source (may or may not be true)

Medium Severity
Prompt: "Summarize the article about climate change."
Output: "The article discusses rising temperatures. The author, Dr. Smith, has won multiple awards for this research."
Issue: The awards claim cannot be verified from the source.

Intrinsic vs Extrinsic Hallucinations

Intrinsic
Output directly contradicts the source. Always wrong.
Example: Source says "founded in 2018", output says "founded in 2015"
Extrinsic
Output adds information not in source. May or may not be true.
Example: Source does not mention CEO, output names "CEO John Smith"
3

Example: Spot the Hallucinations

A hallucinated response to a document-grounded question, followed by a tally of which claims contradict the source, which are unverifiable, and which are supported.

Source Document
COMPANY REPORT Q3 2024
Revenue: $45.2 million (up 12% YoY)
Net Income: $8.1 million
Employees: 342
Headquarters: Austin, Texas
Founded: 2018
Key Highlights:
- Launched new product line in September
- Expanded to 3 new markets in Europe
- Customer satisfaction score: 4.2/5.0
Question: Summarize this company's Q3 2024 performance.
Hallucinated Response

The company reported exceptional Q3 2024 results with revenue of $52 million, a 15% increase from last year. Net income was $8.1 million. The company, founded in 2015, now employs over 500 people at its San Francisco headquarters. They launched two new product lines and expanded to 5 European countries. CEO John Martinez expressed optimism about future growth.

Hallucination Analysis: 7 contradictions, 1 extrinsic claim, 1 supported claim (the $8.1 million net income).
4

Detection Methods

Five approaches to detecting hallucinations, each with different tradeoffs in accuracy, cost, and speed.

1
NLI-based Detection
Use Natural Language Inference models to check if claims in the output are entailed by the source
Medium Complexity
Mechanism:
For each claim in output, classify as ENTAILMENT, NEUTRAL, or CONTRADICTION against source
Cost / 1K checks
$0.02
Latency
~200ms
Best for
RAG and document-grounded tasks
Pros
+ Interpretable results
+ Works with any LLM output
+ No LLM queries needed
+ Fast inference (~50ms/claim)
Cons
- Requires claim decomposition
- NLI models have limits on long contexts
- May miss nuanced contradictions
2
SelfCheckGPT
Sample multiple responses and check for consistency - hallucinations tend to be inconsistent across samples
Low Complexity
Mechanism:
Generate N responses at high temperature, measure agreement via BERTScore or NLI. Factual content is consistent; hallucinations vary.
Cost / 1K checks
$2.50
Latency
~5s
Best for
Open-ended generation without sources
Pros
+ No external knowledge needed
+ Works for any domain
+ Detects confident hallucinations
Cons
- Requires 5-20 API calls per check
- Slower and more expensive
- Consistent hallucinations slip through
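The mechanism above can be sketched with stand-ins: here `samples` is a fixed list where, in a real deployment, it would come from N high-temperature calls to the same LLM, and plain token-overlap (Jaccard) stands in for the BERTScore/NLI agreement measures the SelfCheckGPT paper uses.

```python
from statistics import mean

def jaccard(a: str, b: str) -> float:
    """Token-overlap agreement between two sentences (stand-in for BERTScore/NLI)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def selfcheck_score(sentence: str, samples: list[str]) -> float:
    """Mean agreement between a sentence and resampled outputs.
    A low score means the claim varies across samples: likely hallucinated."""
    return mean(jaccard(sentence, s) for s in samples)

# Five resampled "answers" (stub): the opening year is stable at 1889
# in four of the five samples.
samples = [
    "The Eiffel Tower opened in 1889 for the World's Fair",
    "The Eiffel Tower opened in 1889",
    "It opened in 1889 in Paris",
    "The tower opened in 1889",
    "The Eiffel Tower opened in 1921",
]

score_true = selfcheck_score("The Eiffel Tower opened in 1889", samples)
score_fab = selfcheck_score("The Eiffel Tower opened in 1921", samples)
print(f"1889 claim: {score_true:.2f}  1921 claim: {score_fab:.2f}")
```

The stable claim scores higher than the fabricated one; thresholding this score is the detection decision.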
3
Retrieval Grounding
Retrieve evidence from knowledge base or web and verify claims against retrieved documents
High Complexity
Mechanism:
Extract claims -> embed with sentence-transformers -> retrieve from FAISS/vector store -> NLI verification
Cost / 1K checks
$0.50
Latency
~1s
Best for
Fact-checking against known corpora
Pros
+ External ground truth
+ Catches factual errors
+ Scalable to large KBs
Cons
- Requires knowledge base setup
- Retrieval quality is a bottleneck
- May not cover all claims
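A minimal sketch of that pipeline, with loud substitutions: bag-of-words cosine vectors stand in for sentence-transformers embeddings, a NumPy dot product stands in for a FAISS index, and the "verify" step is token containment where production uses an NLI model. All names here are illustrative.

```python
import numpy as np

def embed(text: str, vocab: list[str]) -> np.ndarray:
    """Toy bag-of-words embedding, L2-normalized (stand-in for a real encoder)."""
    toks = text.lower().split()
    v = np.array([toks.count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def verify_claim(claim: str, kb: list[str], top_k: int = 2) -> dict:
    vocab = sorted({w for d in kb + [claim] for w in d.lower().split()})
    doc_vecs = np.stack([embed(d, vocab) for d in kb])
    sims = doc_vecs @ embed(claim, vocab)      # retrieval: rank KB docs by cosine
    top = sims.argsort()[::-1][:top_k]
    evidence = [kb[i] for i in top]
    supported = any(                            # verification: stand-in for NLI
        set(claim.lower().split()) <= set(d.lower().split()) for d in evidence
    )
    return {"claim": claim, "evidence": evidence, "supported": supported}

kb = [
    "the eiffel tower opened in 1889 for the world fair",
    "paris is the capital of france",
    "the louvre is the most visited museum in the world",
]
print(verify_claim("the eiffel tower opened in 1889", kb))
print(verify_claim("the eiffel tower opened in 1921", kb))
```

The cons listed above show up directly: if the knowledge base lacks a relevant document, retrieval returns poor evidence and the claim cannot be verified either way.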
4
RAGAS Evaluation
Comprehensive RAG evaluation framework measuring faithfulness, relevance, and groundedness
Medium Complexity
Mechanism:
Decompose answer into claims, verify each against context using LLM-as-judge (e.g., GPT-4 or Claude)
Cost / 1K checks
$1.20
Latency
~3s
Best for
Production RAG pipeline monitoring
Pros
+ Production-ready framework
+ Multiple metrics in one run
+ LangChain integration
Cons
- Requires LLM for evaluation ($)
- LLM judge has known biases
- v0.2 API still evolving
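RAGAS's faithfulness metric, reduced to its arithmetic: decompose the answer into claims, ask a judge whether each claim is supported by the retrieved context, and report the supported fraction. The real framework (`pip install ragas`) makes LLM-as-judge calls for both steps; the `number_judge` stub below is a hypothetical stand-in so the sketch runs offline.

```python
import re
from typing import Callable

def faithfulness(answer: str, context: str,
                 judge: Callable[[str, str], bool]) -> float:
    """Fraction of claims in `answer` the judge deems supported by `context`."""
    claims = [c.strip() for c in answer.split(".") if c.strip()]
    if not claims:
        return 1.0
    supported = sum(judge(claim, context) for claim in claims)
    return supported / len(claims)

def number_judge(claim: str, context: str) -> bool:
    """Stub judge: supported iff every figure in the claim appears verbatim
    in the context. Production RAGAS uses GPT-4/Claude as the judge."""
    figures = re.findall(r"\$?[\d,.]+%?", claim)
    return all(f in context for f in figures)

context = "The agreement specifies a total payment of $50,000."
answer = "The contract value was $50,000. It also includes a $10,000 bonus."
print(faithfulness(answer, context, number_judge))  # 0.5
```

One supported claim out of two gives a faithfulness of 0.5; a production pipeline would alert on answers below a chosen threshold.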
5
Atomic Fact Verification (FActScore)
Break output into atomic facts and verify each independently against Wikipedia or knowledge source
High Complexity
Mechanism:
Decompose text into atomic facts -> retrieve Wikipedia evidence per fact -> classify supported/unsupported
Cost / 1K checks
$3.00
Latency
~8s
Best for
Biography and entity-centric generation
Pros
+ Fine-grained per-fact analysis
+ Quantifiable scores
+ Published academic methodology (Min et al., 2023)
Cons
- Decomposition is hard to get right
- Expensive at scale
- Wikipedia coverage varies by domain
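FActScore in miniature: decompose the text into atomic facts, fetch evidence per fact, and score the supported fraction. Everything here is simplified for illustration: decomposition is a naive sentence split and "Wikipedia" is a toy dict, where Min et al. (2023) use an LLM for decomposition and real retrieval for evidence.

```python
wiki = {
    "Marie Curie": "Marie Curie was a physicist and chemist. "
                   "She won the Nobel Prize in Physics in 1903 "
                   "and the Nobel Prize in Chemistry in 1911.",
}

def atomic_facts(text: str) -> list[str]:
    """Naive decomposition: one fact per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]

def supported(fact: str, evidence: str) -> bool:
    """Stub verifier: every word of the fact appears in the evidence."""
    words = {w.strip(",.").lower() for w in fact.split()}
    ev = {w.strip(",.").lower() for w in evidence.split()}
    return words <= ev

def factscore(entity: str, generation: str) -> float:
    """Fraction of atomic facts supported by the entity's evidence page."""
    evidence = wiki.get(entity, "")
    facts = atomic_facts(generation)
    return sum(supported(f, evidence) for f in facts) / max(len(facts), 1)

bio = ("Marie Curie was a physicist and chemist. "
       "She won the Nobel Prize in Physics in 1903. "
       "She was born in Berlin.")
print(factscore("Marie Curie", bio))  # 2 of 3 facts supported
```

The deliberately false birthplace claim drags the score down to 2/3, which is exactly the fine-grained, per-fact signal the method is valued for.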
5

Decision Guide: Choose Your Detection Method

Start from the type of hallucination you need to detect. Checking faithfulness to a provided source (RAG, summarization): use NLI-based detection for low cost and latency, or RAGAS for production RAG monitoring. Checking factual correctness against world knowledge: use retrieval grounding against a knowledge base, or FActScore for entity-centric text like biographies. With no source and no knowledge base: SelfCheckGPT's consistency sampling is the only method of the five that applies.


6

Real Hallucination Rates (Vectara HHEM, March 2026)

Data from the Vectara Hallucination Leaderboard — an industry-standard benchmark that runs HHEM-2.3 over 7,700+ articles. It measures summarization factual consistency: the percentage of model summaries that contain hallucinated claims.


87 models evaluated. Best: AntGroup Finix S1 32B at 1.8%. Worst in top tier: Phi-4 Mini at 23.5%. Reasoning models (o3-pro, o4-mini) consistently hallucinate more than their standard counterparts.

The best hallucination rate dropped from 8.5% (early 2024) to 1.8% (March 2026) — a 4.7x improvement in 2 years. Meanwhile, reasoning models (o3-pro 23.3%, o4-mini 18.6%, DeepSeek R1 11.3%) consistently hallucinate 2-3x more than standard variants on summarization tasks.

Full Leaderboard Snapshot (Top 29 models)

Source: github.com/vectara/hallucination-leaderboard, updated March 20, 2026

| #  | Model                 | Family    | Halluc. Rate | Consistency | Year |
| 1  | Finix S1 32B          | AntGroup  | 1.8%         | 98.2%       | 2026 |
| 2  | GPT-5.4 Nano          | OpenAI    | 3.1%         | 96.9%       | 2026 |
| 3  | Gemini 2.5 Flash Lite | Google    | 3.3%         | 96.7%       | 2026 |
| 4  | Phi-4                 | Microsoft | 3.7%         | 96.3%       | 2025 |
| 5  | Llama 3.3 70B         | Meta      | 4.1%         | 95.9%       | 2025 |
| 6  | Mistral Large 2411    | Mistral   | 4.5%         | 95.5%       | 2024 |
| 7  | Qwen 3 8B             | Alibaba   | 4.8%         | 95.2%       | 2025 |
| 8  | Nova Pro              | Amazon    | 5.1%         | 94.9%       | 2025 |
| 9  | DeepSeek V3.2 Exp     | DeepSeek  | 5.3%         | 94.7%       | 2026 |
| 10 | GPT-4.1               | OpenAI    | 5.6%         | 94.4%       | 2025 |
| 11 | Grok-3                | xAI       | 5.8%         | 94.2%       | 2025 |
| 12 | DeepSeek V3           | DeepSeek  | 6.1%         | 93.9%       | 2025 |
| 13 | Gemini 2.5 Pro        | Google    | 7.0%         | 93.0%       | 2025 |
| 14 | Llama 4 Scout         | Meta      | 7.7%         | 92.3%       | 2026 |
| 15 | Gemini 2.5 Flash      | Google    | 7.8%         | 92.2%       | 2025 |
| 16 | GPT-4o                | OpenAI    | 9.6%         | 90.4%       | 2024 |
| 17 | Claude Haiku 4.5      | Anthropic | 9.8%         | 90.2%       | 2025 |
| 18 | Claude Sonnet 4       | Anthropic | 10.3%        | 89.7%       | 2025 |
| 19 | Claude Sonnet 4.6     | Anthropic | 10.6%        | 89.4%       | 2026 |
| 20 | DeepSeek R1           | DeepSeek  | 11.3%        | 88.7%       | 2025 |
| 21 | Claude Opus 4         | Anthropic | 12.0%        | 88.0%       | 2025 |
| 22 | Claude Opus 4.6       | Anthropic | 12.2%        | 87.8%       | 2026 |
| 23 | Gemini 3 Pro          | Google    | 13.6%        | 86.4%       | 2026 |
| 24 | Mistral 3 Large       | Mistral   | 14.5%        | 85.5%       | 2025 |
| 25 | GPT-5 High            | OpenAI    | 15.1%        | 84.9%       | 2025 |
| 26 | o4-mini               | OpenAI    | 18.6%        | 81.4%       | 2025 |
| 27 | Grok-4 Fast           | xAI       | 19.7%        | 80.3%       | 2026 |
| 28 | o3-pro                | OpenAI    | 23.3%        | 76.7%       | 2025 |
| 29 | Phi-4 Mini            | Microsoft | 23.5%        | 76.5%       | 2025 |

Key Trends: 2024 → 2026

4.7x improvement in 2 years

Best hallucination rate dropped from 8.5% (early 2024, GPT-3.5 era) to 1.8% (March 2026, Finix S1 32B). Average top-5 rate improved from 12.3% to 3.4%. The number of evaluated models grew from 22 to 87.

Reasoning models hallucinate MORE

Counterintuitively, reasoning/chain-of-thought models are worse at summarization fidelity. o3-pro hallucination rate: 23.3% vs GPT-4.1 at 5.6%. DeepSeek R1: 11.3% vs DeepSeek V3: 6.1%. Extended reasoning may introduce fabricated intermediate steps.

Small models close the gap

Phi-4 (14B) achieves 3.7% — better than GPT-4o (9.6%) at a fraction of the size. Qwen 3 8B achieves 4.8%. Specialized training for factual consistency matters more than raw parameter count.

Chinese labs lead on factuality

AntGroup (1.8%), Alibaba/Qwen (4.8%), DeepSeek (5.3%) dominate the top of the leaderboard. Their focus on factual grounding during training appears to outperform Western labs' RLHF approaches on this particular benchmark.

7

Benchmarks and Evaluation

Standard datasets and metrics for measuring hallucination rates, with top model scores from published results.

| Benchmark  | Type          | Size            | Metric                   | Top Model Score    | Paper               |
| TruthfulQA | Factuality    | 817 questions   | % Truthful + Informative | GPT-4-Turbo: 64.1% | Lin et al., 2022    |
| HaluEval   | Detection     | 35K samples     | Detection Accuracy       | GPT-4: 87.2%       | Li et al., 2023     |
| FActScore  | Biography     | 500+ bios       | % Supported Atomic Facts | RAG GPT-4: 82.7%   | Min et al., 2023    |
| FEVER      | Verification  | 185K claims     | Label Accuracy           | DeBERTa-v3: 91.3%  | Thorne et al., 2018 |
| SummEval   | Summarization | 2,800 summaries | Faithfulness Score       | GPT-4: 0.89        | Fabbri et al., 2021 |

Mitigation Strategies

Retrieval Augmentation (RAG)

Ground LLM responses in retrieved documents. RAG-augmented GPT-4 scores 82.7% on FActScore vs 73.1% without retrieval.

Effectiveness: High
Tradeoff: Adds ~200ms latency per query + retrieval infrastructure
Chain-of-Thought Prompting

Step-by-step reasoning surfaces errors earlier. Most effective on multi-hop reasoning tasks, less so on factual recall.

Effectiveness: Moderate
Tradeoff: 2-3x longer outputs, higher token cost
Self-Consistency Decoding

Sample N responses, majority-vote or filter by consistency. Reduces hallucination rate significantly on summarization tasks.

Effectiveness: High
Tradeoff: N x API cost, 3-5x latency
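The voting step is simple enough to show directly. The sketch below stubs the N sampled responses as a fixed list (in practice they come from N temperature>0 calls to the model) and keeps the majority answer with its agreement ratio, which doubles as a confidence signal.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common (normalized) answer and its agreement ratio.
    Answers are lowercased so surface variants like 'Paris'/'paris' merge."""
    [(answer, count)] = Counter(a.strip().lower() for a in answers).most_common(1)
    return answer, count / len(answers)

# Stub: five sampled responses to "What is the capital of France?"
samples = ["Paris", "Paris", "Lyon", "Paris", "paris"]
answer, agreement = majority_vote(samples)
print(answer, agreement)  # paris 0.8
```

A low agreement ratio is itself a hallucination signal: filtering answers below, say, 0.6 is the "filter by consistency" variant mentioned above.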
Calibrated Uncertainty

Train model to express uncertainty ('I'm not sure, but...') when confidence is low. RLHF-trained models do this better.

Effectiveness: Moderate
Tradeoff: May over-hedge, reducing informativeness
Citation Requirements

Force model to cite sources for each claim. Enables post-hoc verification. Perplexity and Google SGE use this pattern.

Effectiveness: High for verifiable claims
Tradeoff: Requires citation verification pipeline
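A sketch of that verification pipeline, under stated assumptions: the model is prompted to emit "claim [n]" markers, and each cited source is checked to back its sentence. Verification here just checks that the sentence's figures appear in the cited source; real systems use NLI or an LLM judge, and the function names are illustrative.

```python
import re

def check_citations(answer: str, sources: dict[int, str]) -> list[dict]:
    """Split the answer into sentences, then verify each cited sentence
    against its source. Figures = dollar amounts or 4-digit years."""
    results = []
    for sent in re.split(r"(?<=\.)\s+", answer):
        m = re.search(r"\[(\d+)\]", sent)
        if not m:
            continue  # uncited sentence: a real pipeline would flag these too
        src = sources.get(int(m.group(1)), "")
        # Drop the citation marker so its digits aren't mistaken for figures
        figures = re.findall(r"\$[\d,.]+|\b\d{4}\b", sent.replace(m.group(0), ""))
        results.append({
            "sentence": sent,
            "source": int(m.group(1)),
            "verified": bool(figures) and all(f in src for f in figures),
        })
    return results

sources = {1: "TechCorp reported Q3 2024 revenue of $45.2 million."}
answer = "Revenue was $45.2 million [1]. The company was founded in 1999 [1]."
for r in check_citations(answer, sources):
    print(r["source"], r["verified"], "-", r["sentence"])
```

The second sentence cites a source that never mentions 1999, so it fails verification: exactly the post-hoc check the pattern enables.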
Knowledge Cutoff Awareness

System prompt declares training date. Model can refuse to answer about events after cutoff instead of hallucinating.

Effectiveness: Moderate for temporal facts
Tradeoff: Does not help with persistent misconceptions
8

Production Code Examples

Complete, runnable implementations of detection methods using real libraries. The NLI example below includes typing, batched inference, and realistic example output.

NLI Detection (DeBERTa-v3)
# pip install transformers torch spacy
# python -m spacy download en_core_web_sm
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import spacy
from typing import TypedDict

class ClaimResult(TypedDict):
    claim: str
    label: str          # ENTAILMENT | NEUTRAL | CONTRADICTION
    confidence: float
    is_hallucination: bool

# Use DeBERTa-v3 (not BART-MNLI) — purpose-built for NLI,
# consistently outperforms BART on MNLI and ANLI benchmarks
MODEL = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
nlp = spacy.load("en_core_web_sm")

LABELS = ["CONTRADICTION", "ENTAILMENT", "NEUTRAL"]
HALLUCINATION_THRESHOLD = 0.75  # flag if contradiction confidence > 75%

def decompose_claims(text: str) -> list[str]:
    """Split text into individual claims using spaCy sentence segmentation.
    Each sentence is treated as a separate claim to verify."""
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 10]

def check_claims_batch(
    source: str,
    claims: list[str],
    threshold: float = HALLUCINATION_THRESHOLD
) -> list[ClaimResult]:
    """Batch-verify multiple claims against a source document.
    Returns per-claim verdicts with confidence scores."""
    results: list[ClaimResult] = []

    # Tokenize all pairs at once for efficient GPU batching
    pairs = [(source, claim) for claim in claims]
    encoded = tokenizer.batch_encode_plus(
        pairs,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )

    with torch.no_grad():
        logits = model(**encoded).logits
        probs = torch.softmax(logits, dim=-1)

    for i, claim in enumerate(claims):
        label_idx = probs[i].argmax().item()
        label = LABELS[label_idx]
        confidence = probs[i][label_idx].item()

        results.append({
            "claim": claim,
            "label": label,
            "confidence": round(confidence, 3),
            "is_hallucination": (
                label == "CONTRADICTION" and confidence > threshold
            ),
        })

    return results

# --- Example usage ---
source = """TechCorp reported Q3 2024 revenue of $45.2 million,
a 12% increase year-over-year. Founded in 2018, headquarters
in Austin, Texas. Employees: 342."""

generated = """TechCorp had revenue of $45.2 million in Q3 2024,
growing 15% from last year. The Austin-based company now employs
over 500 people. CEO John Martinez praised the results."""

claims = decompose_claims(generated)
results = check_claims_batch(source, claims)

for r in results:
    flag = " ** HALLUCINATION **" if r["is_hallucination"] else ""
    print(f"[{r['label']} {r['confidence']:.0%}] {r['claim']}{flag}")

# Illustrative output (exact labels and scores vary by model version):
# [ENTAILMENT 0.94] TechCorp had revenue of $45.2 million in Q3 2024,
#   growing 15% from last year.
#   (Note: "15%" contradicts the source's 12% — number-level conflicts are
#    exactly the nuanced contradictions NLI models can miss.)
# [CONTRADICTION 0.89] The Austin-based company now employs over 500
#   people. ** HALLUCINATION **
# [NEUTRAL 0.82] CEO John Martinez praised the results.

Quick Reference

Hallucination Types
  • Factual: contradicts world knowledge
  • Faithfulness: contradicts provided source
  • Intrinsic: direct contradiction of input
  • Extrinsic: unverifiable addition
Detection Methods + Costs
  • NLI: entailment checking (~$0.02/1K)
  • SelfCheck: consistency sampling (~$2.50/1K)
  • RAGAS: RAG eval framework (~$1.20/1K)
  • FActScore: atomic fact verification (~$3.00/1K)
  • Retrieval: KB-grounded verification (~$0.50/1K)
Hallucination Rates (Vectara HHEM, March 2026)
  • Best: Finix S1 32B — 1.8%
  • GPT-5.4 Nano — 3.1%
  • Phi-4 — 3.7%
  • Llama 3.3 70B — 4.1%
  • GPT-4o — 9.6%
  • o3-pro — 23.3% (reasoning)
Key Takeaways (2026)
  1. Best hallucination rate improved from 8.5% (2024) to 1.8% (2026) — a 4.7x improvement in 2 years
  2. Reasoning models (o3-pro, o4-mini, R1) hallucinate 2-3x more than standard models on summarization tasks
  3. Small specialized models (Phi-4 3.7%, Qwen 3 8B 4.8%) now beat larger general models (GPT-4o 9.6%)
  4. Chinese labs (AntGroup, Alibaba, DeepSeek) lead the leaderboard on factual consistency benchmarks

Use Cases

  • RAG answer validation
  • Safety review
  • Content QA
  • Model evaluation

Architectural Patterns

Retrieval Entailment

Compare generation against retrieved evidence with NLI.

Self-Check Ensembles

Ask multiple models/queries and vote on consistency.

Implementations

Open Source

RAGAS (Apache 2.0): Metrics for RAG faithfulness.

SelfCheckGPT (MIT): Sampling-based hallucination scoring.

G-Eval (MIT): LLM-based evaluation prompts.


Quick Facts

Input: Text
Output: Structured Data
Implementations: 3 open source, 0 API
Patterns: 2 approaches
