Hallucination Detection
Score or flag generated text for factuality and grounding.
How Hallucination Detection Works
A technical deep-dive into LLM hallucinations: what they are, how to detect them with real benchmark data, production code examples, and a framework for choosing the right detection method. Includes real hallucination rates for 87 models from the Vectara HHEM leaderboard (March 2026) and historical trends from 2024-2026.
What is Hallucination?
When an LLM generates content that is factually incorrect, unsupported by context, or entirely fabricated - yet presents it with the same confidence as accurate information.
LLMs do not "know" things the way humans do. They are statistical pattern matchers trained on text. When the patterns suggest a plausible-sounding completion, the model generates it - regardless of factual accuracy. The model has no internal fact-checker, no way to say "I am making this up."
Two Dimensions of Correctness
Factuality: Does the output align with real-world facts and knowledge?
Faithfulness: Does the output accurately reflect the provided context/source document?
For RAG systems and document-grounded tasks, you have the source text. This makes verification tractable: check if each claim in the output is supported by the source. Factual verification is harder because you need access to ground truth knowledge, which may be vast or unavailable.
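The claim-by-claim support check described above can be illustrated with a deliberately naive token-overlap heuristic. This is a toy sketch, not a production method: real systems use NLI models or LLM judges (covered later in this article), and the helper names here are my own.

```python
def _tokens(text: str) -> set[str]:
    """Lowercase tokens with trailing punctuation stripped."""
    return {w.strip(".,") for w in text.lower().split()}

def support_score(source: str, claim: str) -> float:
    """Fraction of the claim's content words found in the source (0.0-1.0).
    Toy proxy for 'is this claim supported by the source document?'"""
    stop = {"the", "a", "an", "of", "in", "is", "was", "to", "and"}
    source_words = _tokens(source)
    claim_words = [w for w in _tokens(claim) if w not in stop]
    if not claim_words:
        return 0.0
    return sum(w in source_words for w in claim_words) / len(claim_words)

source = "TechCorp reported Q3 revenue of $45.2 million, up 12% year-over-year."
print(support_score(source, "Revenue was $45.2 million in Q3"))  # → 1.0 (supported)
print(support_score(source, "The CEO praised the results"))      # → 0.0 (unsupported)
```

Overlap scoring cannot catch contradictions (a claim of "$54.2 million" would still overlap heavily), which is exactly why the methods below use entailment rather than similarity.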
Types of Hallucinations
Not all hallucinations are equal. Understanding the taxonomy helps choose the right detection method.
Factual hallucination: The model states something factually incorrect that contradicts real-world knowledge.
Faithfulness hallucination: The model's output contradicts, or is not supported by, the provided context/source.
Intrinsic vs Extrinsic Hallucinations
Intrinsic: The output directly contradicts the source material provided in the prompt.
Extrinsic: The output includes information that cannot be verified from the source (and may or may not be true).
Interactive Demo: Spot the Hallucinations
Compare a faithful response to one with hallucinations. The sample response below mixes supported claims with fabricated details.
The company reported exceptional Q3 2024 results with revenue of $52 million, a 15% increase from last year. Net income was $8.1 million. The company, founded in 2015, now employs over 500 people at its San Francisco headquarters. They launched two new product lines and expanded to 5 European countries. CEO John Martinez expressed optimism about future growth.
Detection Methods
Five approaches to detecting hallucinations, each with different tradeoffs in accuracy, cost, and speed.
- NLI Entailment: For each claim in the output, classify as ENTAILMENT, NEUTRAL, or CONTRADICTION against the source.
- SelfCheck (consistency sampling): Generate N responses at high temperature and measure agreement via BERTScore or NLI. Factual content is consistent; hallucinations vary.
- Retrieval Verification: Extract claims -> embed with sentence-transformers -> retrieve from FAISS/vector store -> NLI verification.
- RAGAS (LLM-as-judge): Decompose the answer into claims, verify each against context using an LLM judge (e.g., GPT-4 or Claude).
- FActScore: Decompose text into atomic facts -> retrieve Wikipedia evidence per fact -> classify supported/unsupported.

Decision Guide: Choose Your Detection Method
Answer the questions below to find the best hallucination detection approach for your use case.
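The decision guide can be approximated as a rule-of-thumb function. The branching criteria below are my own distillation of the method descriptions in this article (not an official tree); the per-1K costs come from the Quick Reference section.

```python
def choose_method(
    has_source: bool,          # do you have the grounding document?
    has_knowledge_base: bool,  # can you retrieve evidence from a KB?
    latency_sensitive: bool,   # must the check run inline, per request?
    budget_per_1k: float,      # detection budget in USD per 1K checks
) -> str:
    """Rule-of-thumb method selection, sketched from this article's guide."""
    if has_source:
        if latency_sensitive or budget_per_1k < 0.10:
            return "NLI entailment (fast, ~$0.02/1K)"
        return "RAGAS / LLM-as-judge (higher accuracy, ~$1.20/1K)"
    if has_knowledge_base:
        return "Retrieval + NLI verification (~$0.50/1K)"
    if budget_per_1k >= 3.0:
        return "FActScore atomic-fact verification (~$3.00/1K)"
    return "SelfCheck consistency sampling (no external evidence needed)"

print(choose_method(has_source=True, has_knowledge_base=False,
                    latency_sensitive=True, budget_per_1k=0.05))
```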
Method Comparison (all 5 methods across 5 dimensions)
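Of the five methods compared above, SelfCheck-style consistency sampling is the only one that needs no external evidence. Here is a minimal sketch: `sample_responses` is a stub standing in for N high-temperature LLM calls, and `difflib.SequenceMatcher` stands in for the BERTScore/NLI agreement scoring the real method uses.

```python
from difflib import SequenceMatcher

def sample_responses(prompt: str, n: int = 5) -> list[str]:
    # Stub: in practice, call your LLM n times at temperature ~1.0.
    return [
        "The bridge opened in 1937 and spans 2.7 km.",
        "The bridge opened in 1937.",
        "Opened in 1937, the bridge spans 2.7 km.",
        "The bridge opened in 1937 and spans 2.7 km.",
        "The bridge was completed in 1942 and spans 4 km.",  # one inconsistent sample
    ][:n]

def consistency_score(claim: str, samples: list[str]) -> float:
    """Mean string similarity of a claim against independently sampled responses.
    A low score means the claim is unstable across samples: a hallucination signal."""
    sims = [SequenceMatcher(None, claim, s).ratio() for s in samples]
    return sum(sims) / len(sims)

samples = sample_responses("When did the bridge open, and how long is it?")
for claim in ["The bridge opened in 1937.", "The bridge cost $4 billion."]:
    print(f"{consistency_score(claim, samples):.2f}  {claim}")
```

The stable claim scores well above the fabricated one; production implementations replace the string-similarity proxy with per-sentence NLI, as in the SelfCheckGPT line of work.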
Real Hallucination Rates (Vectara HHEM, March 2026)
Live data from the Vectara Hallucination Leaderboard — the industry-standard benchmark using HHEM-2.3 on 7,700+ articles. Measures summarization factual consistency: what % of model summaries contain hallucinated claims.
87 models evaluated. Best: AntGroup Finix S1 32B at 1.8%. Worst in top tier: Phi-4 Mini at 23.5%. Reasoning models (o3-pro, o4-mini) consistently hallucinate more than their standard counterparts.
Left: Best hallucination rate dropped from 8.5% (early 2024) to 1.8% (March 2026) — a 4.7x improvement in 2 years. Right: Reasoning models (o3-pro 23.3%, o4-mini 18.6%, DeepSeek R1 11.3%) consistently hallucinate 2-3x more than standard variants on summarization tasks.
Full Leaderboard Snapshot (Top 29 models)
Source: github.com/vectara/hallucination-leaderboard, updated March 20 2026
| # | Model | Family | Halluc. Rate | Consistency | Year |
|---|---|---|---|---|---|
| 1 | Finix S1 32B | AntGroup | 1.8% | 98.2% | 2026 |
| 2 | GPT-5.4 Nano | OpenAI | 3.1% | 96.9% | 2026 |
| 3 | Gemini 2.5 Flash Lite | Google | 3.3% | 96.7% | 2026 |
| 4 | Phi-4 | Microsoft | 3.7% | 96.3% | 2025 |
| 5 | Llama 3.3 70B | Meta | 4.1% | 95.9% | 2025 |
| 6 | Mistral Large 2411 | Mistral | 4.5% | 95.5% | 2024 |
| 7 | Qwen 3 8B | Alibaba | 4.8% | 95.2% | 2025 |
| 8 | Nova Pro | Amazon | 5.1% | 94.9% | 2025 |
| 9 | DeepSeek V3.2 Exp | DeepSeek | 5.3% | 94.7% | 2026 |
| 10 | GPT-4.1 | OpenAI | 5.6% | 94.4% | 2025 |
| 11 | Grok-3 | xAI | 5.8% | 94.2% | 2025 |
| 12 | DeepSeek V3 | DeepSeek | 6.1% | 93.9% | 2025 |
| 13 | Gemini 2.5 Pro | Google | 7.0% | 93.0% | 2025 |
| 14 | Llama 4 Scout | Meta | 7.7% | 92.3% | 2026 |
| 15 | Gemini 2.5 Flash | Google | 7.8% | 92.2% | 2025 |
| 16 | GPT-4o | OpenAI | 9.6% | 90.4% | 2024 |
| 17 | Claude Haiku 4.5 | Anthropic | 9.8% | 90.2% | 2025 |
| 18 | Claude Sonnet 4 | Anthropic | 10.3% | 89.7% | 2025 |
| 19 | Claude Sonnet 4.6 | Anthropic | 10.6% | 89.4% | 2026 |
| 20 | DeepSeek R1 | DeepSeek | 11.3% | 88.7% | 2025 |
| 21 | Claude Opus 4 | Anthropic | 12.0% | 88.0% | 2025 |
| 22 | Claude Opus 4.6 | Anthropic | 12.2% | 87.8% | 2026 |
| 23 | Gemini 3 Pro | Google | 13.6% | 86.4% | 2026 |
| 24 | Mistral 3 Large | Mistral | 14.5% | 85.5% | 2025 |
| 25 | GPT-5 High | OpenAI | 15.1% | 84.9% | 2025 |
| 26 | o4-mini | OpenAI | 18.6% | 81.4% | 2025 |
| 27 | Grok-4 Fast | xAI | 19.7% | 80.3% | 2026 |
| 28 | o3-pro | OpenAI | 23.3% | 76.7% | 2025 |
| 29 | Phi-4 Mini | Microsoft | 23.5% | 76.5% | 2025 |
Key Trends: 2024 → 2026
Best hallucination rate dropped from 8.5% (early 2024, GPT-3.5 era) to 1.8% (March 2026, Finix S1 32B). Average top-5 rate improved from 12.3% to 3.4%. The number of evaluated models grew from 22 to 87.
Counterintuitively, reasoning/chain-of-thought models are worse at summarization fidelity. o3-pro hallucination rate: 23.3% vs GPT-4.1 at 5.6%. DeepSeek R1: 11.3% vs DeepSeek V3: 6.1%. Extended reasoning may introduce fabricated intermediate steps.
Phi-4 (14B) achieves 3.7% — better than GPT-4o (9.6%) at a fraction of the size. Qwen 3 8B achieves 4.8%. Specialized training for factual consistency matters more than raw parameter count.
AntGroup (1.8%), Alibaba/Qwen (4.8%), DeepSeek (5.3%) dominate the top of the leaderboard. Their focus on factual grounding during training appears to outperform Western labs' RLHF approaches on this particular benchmark.
Benchmarks and Evaluation
Standard datasets and metrics for measuring hallucination rates, with top model scores from published results.
| Benchmark | Type | Size | Metric | Top Model Score | Paper |
|---|---|---|---|---|---|
| TruthfulQA | Factuality | 817 questions | % Truthful + Informative | GPT-4-Turbo: 64.1% | Lin et al., 2022 |
| HaluEval | Detection | 35K samples | Detection Accuracy | GPT-4: 87.2% | Li et al., 2023 |
| FActScore | Biography | 500+ bios | % Supported Atomic Facts | RAG GPT-4: 82.7% | Min et al., 2023 |
| FEVER | Verification | 185K claims | Label Accuracy | DeBERTa-v3: 91.3% | Thorne et al., 2018 |
| SummEval | Summarization | 2800 summaries | Faithfulness Score | GPT-4: 0.89 | Fabbri et al., 2021 |
Mitigation Strategies
Retrieval-Augmented Generation (RAG): Ground LLM responses in retrieved documents. RAG-augmented GPT-4 scores 82.7% on FActScore vs 73.1% without retrieval.
Chain-of-Thought Prompting: Step-by-step reasoning surfaces errors earlier. Most effective on multi-hop reasoning tasks, less so on factual recall.
Self-Consistency Sampling: Sample N responses, then majority-vote or filter by consistency. Reduces hallucination rates significantly on summarization tasks.
Uncertainty Expression: Train the model to express uncertainty ("I'm not sure, but...") when confidence is low. RLHF-trained models do this better.
Citation Requirements: Force the model to cite a source for each claim. Enables post-hoc verification. Perplexity and Google SGE use this pattern.
Knowledge-Cutoff Awareness: The system prompt declares the training cutoff date, so the model can refuse to answer about later events instead of hallucinating.
Production Code Examples
Complete, runnable implementations of each detection method using real libraries. Each example includes proper typing, error handling, and realistic output.
# pip install transformers torch spacy
# python -m spacy download en_core_web_sm
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import spacy
from typing import TypedDict

class ClaimResult(TypedDict):
    claim: str
    label: str  # ENTAILMENT | NEUTRAL | CONTRADICTION
    confidence: float
    is_hallucination: bool

# Use DeBERTa-v3 (not BART-MNLI) — purpose-built for NLI,
# consistently outperforms BART on MNLI and ANLI benchmarks
MODEL = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
nlp = spacy.load("en_core_web_sm")

LABELS = ["CONTRADICTION", "ENTAILMENT", "NEUTRAL"]
HALLUCINATION_THRESHOLD = 0.75  # flag if contradiction confidence > 75%

def decompose_claims(text: str) -> list[str]:
    """Split text into individual claims using spaCy sentence segmentation.
    Each sentence is treated as a separate claim to verify."""
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 10]

def check_claims_batch(
    source: str,
    claims: list[str],
    threshold: float = HALLUCINATION_THRESHOLD,
) -> list[ClaimResult]:
    """Batch-verify multiple claims against a source document.
    Returns per-claim verdicts with confidence scores."""
    results: list[ClaimResult] = []
    # Tokenize all (premise, hypothesis) pairs at once for efficient GPU batching
    pairs = [(source, claim) for claim in claims]
    encoded = tokenizer.batch_encode_plus(
        pairs,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**encoded).logits
        probs = torch.softmax(logits, dim=-1)
    for i, claim in enumerate(claims):
        label_idx = probs[i].argmax().item()
        label = LABELS[label_idx]
        confidence = probs[i][label_idx].item()
        results.append({
            "claim": claim,
            "label": label,
            "confidence": round(confidence, 3),
            "is_hallucination": (
                label == "CONTRADICTION" and confidence > threshold
            ),
        })
    return results

# --- Example usage ---
source = """TechCorp reported Q3 2024 revenue of $45.2 million,
a 12% increase year-over-year. Founded in 2018, headquarters
in Austin, Texas. Employees: 342."""

generated = """TechCorp had revenue of $45.2 million in Q3 2024,
growing 15% from last year. The Austin-based company now employs
over 500 people. CEO John Martinez praised the results."""

claims = decompose_claims(generated)
results = check_claims_batch(source, claims)
for r in results:
    flag = " ** HALLUCINATION **" if r["is_hallucination"] else ""
    print(f"[{r['label']} {r['confidence']:.0%}] {r['claim']}{flag}")

# Example output (exact scores vary by model version):
# [ENTAILMENT 0.94] TechCorp had revenue of $45.2 million in Q3 2024,
#   growing 15% from last year.
# [CONTRADICTION 0.89] The Austin-based company now employs over 500
#   people. ** HALLUCINATION **
# [NEUTRAL 0.82] CEO John Martinez praised the results.
# Note: the first claim slips through. Sentence-level NLI misses the
# 15% vs 12% growth discrepancy, a known weakness of coarse claim splits.

Quick Reference
- Factual: contradicts world knowledge
- Faithfulness: contradicts provided source
- Intrinsic: direct contradiction of input
- Extrinsic: unverifiable addition

- NLI: entailment checking (~$0.02/1K)
- SelfCheck: consistency sampling (~$2.50/1K)
- RAGAS: RAG eval framework (~$1.20/1K)
- FActScore: atomic fact verification (~$3.00/1K)
- Retrieval: KB-grounded verification (~$0.50/1K)

- Best: Finix S1 32B — 1.8%
- GPT-5.4 Nano — 3.1%
- Phi-4 — 3.7%
- Llama 3.3 70B — 4.1%
- GPT-4o — 9.6%
- o3-pro — 23.3% (reasoning)

1. Best hallucination rate improved from 8.5% (2024) to 1.8% (2026) — a 4.7x improvement in 2 years
2. Reasoning models (o3-pro, o4-mini, R1) hallucinate 2-3x more than standard models on summarization tasks
3. Small specialized models (Phi-4 3.7%, Qwen 3 8B 4.8%) now beat larger general models (GPT-4o 9.6%)
4. Chinese labs (AntGroup, Alibaba, DeepSeek) lead the leaderboard on factual consistency benchmarks
Use Cases
- ✓ RAG answer validation
- ✓ Safety review
- ✓ Content QA
- ✓ Model evaluation
Architectural Patterns
Retrieval Entailment
Compare generation against retrieved evidence with NLI.
Self-Check Ensembles
Ask multiple models/queries and vote on consistency.
Quick Facts
- Input: Text
- Output: Structured Data
- Implementations: 3 open source, 0 API
- Patterns: 2 approaches