Hallucination Detection
LLMs generate fluent, confident text that can be factually wrong. Understanding why this happens, how to detect it, and how to mitigate it is the difference between a demo and a production system.
Why LLMs Hallucinate: Root Causes
Language models are not databases. They are next-token predictors trained on statistical regularities in text. Every hallucination traces back to a fundamental mismatch between what we want from these models (truthful answers) and what they were trained to do (produce probable continuations). Understanding the root causes is essential for choosing the right detection and mitigation strategy.
The term "hallucination" itself is borrowed from psychology, first applied to neural text generation by Rohrbach et al. (2018) in the context of image captioning, where models described objects not present in the image. It has since become the umbrella term for any generated content that is nonsensical or unfaithful to the provided source.
"We define object hallucination as a generated caption that includes an object that is not present in the image being described."
— Rohrbach, A. et al. (2018). Object Hallucination in Image Captioning. EMNLP.
Root Cause 1: The Training Objective Is Fluency, Not Truth
The standard language modeling objective is to minimize the cross-entropy loss on next-token prediction. The model learns: given the previous tokens, what token is statistically likely to come next? This is fundamentally different from: given the previous tokens, what token is factually correct?
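The mismatch can be made concrete with a toy example. Below is a minimal sketch of the cross-entropy objective in pure Python; the vocabulary and probabilities are invented for illustration, not taken from any real model:

```python
import math

# Toy next-token distribution after some prefix. The probabilities
# are illustrative: the statistically common continuation outranks
# the factually correct one.
next_token_probs = {
    "common_but_wrong": 0.55,
    "factually_correct": 0.40,
    "other": 0.05,
}

def cross_entropy_loss(probs: dict[str, float], target: str) -> float:
    """Standard LM loss for one step: -log p(target | prefix)."""
    return -math.log(probs[target])

# Training minimizes loss on whatever token the corpus contains next.
# If the corpus most often contains the wrong continuation, the model
# is rewarded for ranking it first -- truth never enters the loss.
print(f"{cross_entropy_loss(next_token_probs, 'factually_correct'):.3f}")
print(f"{cross_entropy_loss(next_token_probs, 'common_but_wrong'):.3f}")
```

The loss is strictly lower for the more frequent continuation, regardless of which one is true.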
RLHF (Reinforcement Learning from Human Feedback) partially addresses this by penalizing outputs that humans rate as unhelpful or incorrect. But RLHF introduces its own failure mode: sycophancy. Models trained with RLHF learn that agreeing with the user gets higher reward, so they sometimes confirm false premises rather than push back.
— Sharma, M. et al. (2023). Towards Understanding Sycophancy in Language Models. ICLR 2024.
Root Cause 2: Incomplete and Conflicting Knowledge
Models have a training data cutoff. Facts that changed after the cutoff are simply unknown. But the problem runs deeper: even within the training data, facts are often stated incorrectly, are contradicted across sources, or are underrepresented (the "long tail" problem).
- Temporal decay: Training data has a cutoff. The model doesn't know the current Prime Minister if they took office after training.
- Source conflict: Wikipedia says one thing, a news article says another. The model averages over contradictions.
- Long-tail entities: Rare entities have few training examples. The model fills gaps by pattern-matching from more common entities, producing plausible-sounding but fabricated details.
- Frequency bias: Common associations dominate. Ask about a less-known author and the model may attribute books written by a more famous one.
Root Cause 3: Attention Failures and Context Misuse
Even when the correct information exists in the context window, the model can fail to attend to it properly. Liu et al. (2024) demonstrated the "lost in the middle" phenomenon: models perform best when relevant information is at the beginning or end of the context, but degrade significantly when it's in the middle.
This means that even in RAG systems where the correct passage is retrieved and placed in the prompt, the model may ignore it and hallucinate from parametric memory instead.
— Liu, N. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL.
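A common workaround, given this position sensitivity, is to reorder retrieved passages so the top-ranked ones sit at the edges of the prompt rather than the middle. The function below is an illustrative sketch of that reordering (it is not from the paper); it assumes `docs` is already sorted by relevance:

```python
def reorder_for_long_context(docs: list[str]) -> list[str]:
    """
    Place documents so the most relevant (earliest in `docs`, which is
    assumed sorted by descending relevance) land at the beginning and
    end of the context, and the least relevant end up in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(docs):
        # Alternate: rank 1 -> front, rank 2 -> back, rank 3 -> front...
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so rank 2 sits at the very end.
    return front + back[::-1]

docs = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_long_context(docs))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

The two strongest passages end up first and last, exactly where "lost in the middle" says attention is strongest.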
Root Cause 4: Decoding Strategy Amplifies Error
Autoregressive generation is sequential: each token is conditioned on all previous tokens, including previously hallucinated ones. Once the model generates a single incorrect fact early in a response, subsequent tokens are conditioned on that falsehood, leading to snowballing hallucination — a cascade of fabricated details that are internally consistent but factually wrong.
Temperature and top-p sampling add randomness, which trades off between diversity and faithfulness. Higher temperature increases hallucination rate; lower temperature makes output more repetitive but more grounded. There is no free lunch.
— Zhang, Y. et al. (2023). How Language Model Hallucinations Can Snowball. arXiv.
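Temperature's effect on the output distribution is easy to see numerically. Dividing logits by T before the softmax flattens the distribution, shifting probability mass onto low-likelihood tokens. A toy sketch in pure Python with made-up logits:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: one well-supported token, two long shots
logits = [4.0, 1.0, 0.5]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top token p={probs[0]:.3f}, tail mass={1 - probs[0]:.3f}")
```

As T rises, the tail mass (the chance of sampling one of the long-shot tokens) grows, which is exactly the diversity/faithfulness trade-off described above.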
Grounded vs Ungrounded Response
Taxonomy: Types of Hallucination
Not all hallucinations are the same. The taxonomy matters because different types require different detection methods. Maynez et al. (2020) established the foundational distinction between intrinsic and extrinsic hallucinations in abstractive summarization, and Ji et al. (2023) provided a comprehensive survey across all generative tasks.
Intrinsic Hallucination
The output contradicts the provided source/context. Detectable via NLI models because the source provides a clear ground truth to compare against.
Example:
Context: "The company was founded in 2015 and has 50 employees."
LLM: "Founded in 2012, the company has grown to over 200 employees."
Both the year and employee count contradict the source.
Extrinsic Hallucination
The output adds information not present in the source. Harder to detect because the claim may be true — it's just not supported by the provided context.
Example:
Context: "The CEO announced quarterly earnings."
LLM: "The CEO, John Smith, announced record earnings of $5B, exceeding analyst expectations."
Name, amount, and analyst comparison are all unsupported by context.
Factual Fabrication
Inventing facts, dates, statistics, or citations that do not exist. The most dangerous type because it's presented with full confidence.
Entity Conflation
Merging attributes of different entities, such as correctly recalling that a Nobel Prize was awarded but attributing it to the wrong person.
Unfaithful Reasoning
Chain-of-thought that reaches a correct conclusion through incorrect intermediate steps, or vice versa. The reasoning looks valid but is logically flawed.
"We find that roughly 70% of the hallucinated content in neural abstractive summaries is extrinsic — information not present in the source document — while intrinsic hallucinations (contradicting the source) account for 30%."
— Maynez, J. et al. (2020). On Faithfulness and Factuality in Abstractive Summarization. ACL.
— Ji, Z. et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys.
Hallucination Detection Pipeline
Method 1: NLI-Based Verification
Natural Language Inference (NLI) is the workhorse of hallucination detection. An NLI model takes a premise (the source context) and a hypothesis (the LLM's claim) and classifies the relationship as entailment, neutral, or contradiction.
This directly maps to hallucination detection: entailment means the claim is grounded, contradiction means intrinsic hallucination, and neutral means extrinsic hallucination (the context neither supports nor refutes the claim).
The approach was formalized by Honovich et al. (2022) with TRUE (a benchmark for evaluating NLI-based factual consistency), and by Laban et al. (2022) with SummaC, which showed that aggregating NLI scores over sentence-level pairs substantially outperforms document-level NLI.
— Honovich, O. et al. (2022). TRUE: Re-evaluating Factual Consistency. NAACL.
— Laban, P. et al. (2022). SummaC: Re-Visiting NLI-based Models for Inconsistency Detection. TACL.
NLI Classification for Hallucination

| Verdict | Meaning | Interpretation |
|---|---|---|
| ENTAILMENT | Claim is supported by context | Grounded |
| NEUTRAL | Context doesn't confirm or deny | Unverifiable |
| CONTRADICTION | Claim conflicts with context | Hallucination |
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

class NLIHallucinationDetector:
    """
    Detect hallucinations by checking if claims are entailed by context.
    Uses DeBERTa fine-tuned on MNLI — stronger than BART-MNLI for
    this task (see Honovich et al. 2022 TRUE benchmark).
    """

    def __init__(self, model_name="microsoft/deberta-large-mnli"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()
        # DeBERTa-MNLI label order: contradiction=0, neutral=1, entailment=2
        self.labels = ["contradiction", "neutral", "entailment"]

    def check_claim(self, context: str, claim: str) -> dict:
        """Check if a single claim is entailed by context."""
        inputs = self.tokenizer(
            context, claim,
            return_tensors="pt",
            truncation=True,
            max_length=512,
        )
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = torch.softmax(logits, dim=1)[0]
        scores = {l: p.item() for l, p in zip(self.labels, probs)}
        verdict = max(scores, key=scores.get)
        return {
            "claim": claim,
            "verdict": verdict,
            "is_grounded": verdict == "entailment",
            "is_hallucination": verdict == "contradiction",
            "scores": scores,
        }

    def check_response(self, context: str, claims: list[str]) -> dict:
        """Check all claims in an LLM response against context."""
        results = [self.check_claim(context, c) for c in claims]
        grounded = sum(1 for r in results if r["is_grounded"])
        return {
            "faithfulness": grounded / len(results) if results else 1.0,
            "grounded_count": grounded,
            "total_claims": len(results),
            "results": results,
        }

# Usage
detector = NLIHallucinationDetector()
context = "Apple Inc. reported revenue of $94.8 billion in Q1 2024."
claims = [
    "Apple's Q1 2024 revenue was approximately $94.8 billion.",  # Entailed
    "This represented a 2% year-over-year increase.",  # Neutral (not in context)
    "Apple reported a loss in Q1 2024.",  # Contradiction
]
result = detector.check_response(context, claims)
print(f"Faithfulness: {result['faithfulness']:.0%}")
# Faithfulness: 33% (only 1/3 claims entailed by the context)

Example per-claim output:

Claim 1: "Apple's Q1 2024 revenue was approximately $94.8 billion." → ENTAILMENT (0.94) ✓ Grounded
Claim 2: "This represented a 2% year-over-year increase." → NEUTRAL (0.82) — Not verifiable from context
Claim 3: "Apple reported a loss in Q1 2024." → CONTRADICTION (0.97) ✗ Hallucination
Choosing the right NLI model
DeBERTa (Microsoft) consistently outperforms BART-MNLI on factual consistency benchmarks. For production, consider microsoft/deberta-base-mnli for speed (~350ms/claim on CPU) or microsoft/deberta-large-mnli for accuracy. AlignScore (Zha et al., 2023) builds on this foundation with a unified alignment function that further improves detection.
Method 2: SelfCheckGPT
What if you have no source context to check against? This is the case for open-ended generation where the model is answering from parametric memory alone. Manakul et al. (2023) proposed SelfCheckGPT: sample multiple responses to the same prompt and measure consistency. If the model "knows" a fact, it will consistently state it across samples. If it's hallucinating, the details will vary randomly.
The insight is elegant: a model's own consistency is a proxy for its confidence, and low confidence on factual claims signals hallucination. The paper tests multiple consistency measures (BERTScore, QA, NLI, and n-gram overlap) and finds NLI-based consistency performs best.
— Manakul, P. et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection. EMNLP.
from openai import OpenAI
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import numpy as np

client = OpenAI()

def selfcheck_gpt(prompt: str, n_samples: int = 5,
                  temperature: float = 0.7) -> dict:
    """
    SelfCheckGPT: detect hallucinations via sampling consistency.
    No reference context needed — uses the model's own consistency.

    Steps:
    1. Generate one 'main' response at temperature 0
    2. Generate n_samples stochastic responses at higher temperature
    3. For each sentence in the main response, check if it's consistent
       with the stochastic samples using NLI
    4. Sentences contradicted across samples = likely hallucinated
    """
    # Step 1: Generate main response (greedy)
    main = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content

    # Step 2: Generate stochastic samples
    samples = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        samples.append(resp.choices[0].message.content)

    # Step 3: Split main response into sentences
    # (naive split on '.'; use a proper sentence splitter in production)
    sentences = [s.strip() for s in main.split('.') if s.strip()]

    # Step 4: Check each sentence against each sample via NLI
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base-mnli")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-base-mnli"
    )
    model.eval()

    sentence_scores = []
    for sentence in sentences:
        contradictions = 0
        for sample in samples:
            inputs = tokenizer(sample, sentence,
                               return_tensors="pt", truncation=True)
            with torch.no_grad():
                probs = torch.softmax(model(**inputs).logits, dim=1)[0]
            # contradiction = index 0 for DeBERTa-MNLI
            if probs[0].item() > 0.5:
                contradictions += 1
        score = contradictions / n_samples
        sentence_scores.append({
            "sentence": sentence,
            "hallucination_score": score,  # 0 = consistent, 1 = always contradicted
            "likely_hallucinated": score > 0.4,
        })

    avg_score = np.mean([s["hallucination_score"] for s in sentence_scores])
    return {
        "overall_score": avg_score,
        "sentences": sentence_scores,
        "main_response": main,
    }

# Usage
result = selfcheck_gpt("Tell me about the founding of SpaceX.")
for s in result["sentences"]:
    flag = " [!]" if s["likely_hallucinated"] else ""
    print(f" {s['hallucination_score']:.2f} | {s['sentence'][:60]}...{flag}")

When to use SelfCheckGPT
SelfCheckGPT is most valuable when you don't have reference documents to check against — for example, when evaluating a model's factual knowledge in open-domain QA. For RAG systems where you have the retrieved context, NLI-based verification is more direct and cheaper (one inference vs. n_samples + NLI).
Cost trade-off: SelfCheckGPT requires 5-10x the generation cost (multiple samples). In production, consider using it selectively — e.g., only on high-stakes responses or as a periodic evaluation metric rather than on every request.
Method 3: FActScore (Atomic Fact Decomposition)
Min et al. (2023) introduced FActScore (Factual precision in Atomicity Score), which breaks a generated text into individual atomic facts and verifies each one independently against a knowledge source. This is the most granular hallucination detection method available.
The key insight: a paragraph might be 80% accurate, but the 20% that is hallucinated could be the most critical part. Sentence-level detection misses this because hallucinated facts are often embedded within otherwise accurate sentences. Atomic decomposition catches them.
Their evaluation on biographies generated by InstructGPT, ChatGPT, and PerplexityAI found that even the best models hallucinate in 20-40% of atomic facts in biographical text — far higher than sentence-level metrics suggest.
— Min, S. et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision. EMNLP.
from openai import OpenAI
import json

client = OpenAI()

def decompose_to_atomic_facts(text: str) -> list[str]:
    """
    Break text into atomic facts using an LLM.
    Each atomic fact is a single, verifiable claim.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Decompose the text into atomic facts.
Each atomic fact should be:
- A single, self-contained claim
- Verifiable as true or false
- As specific as possible (include names, dates, numbers)
Return JSON: {"facts": ["fact1", "fact2", ...]}"""
        }, {
            "role": "user",
            "content": text
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["facts"]

def factscore(text: str, knowledge_source: str) -> dict:
    """
    FActScore: atomic fact precision against a knowledge source.
    1. Decompose text into atomic facts
    2. Verify each fact against the knowledge source (via NLI or LLM)
    3. Return the proportion of supported facts
    """
    facts = decompose_to_atomic_facts(text)
    verified = []
    for fact in facts:
        # Use LLM-as-judge for verification (see Method 4)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Determine if the claim is supported by the source.
Respond with JSON: {"supported": true/false, "reason": "brief explanation"}"""
            }, {
                "role": "user",
                "content": f"Source: {knowledge_source}\n\nClaim: {fact}"
            }],
            response_format={"type": "json_object"},
        )
        result = json.loads(response.choices[0].message.content)
        verified.append({
            "fact": fact,
            "supported": result["supported"],
            "reason": result["reason"],
        })
    supported = sum(1 for v in verified if v["supported"])
    return {
        "factscore": supported / len(verified) if verified else 1.0,
        "supported": supported,
        "total": len(verified),
        "details": verified,
    }

# Usage
bio = """Marie Curie was born in Warsaw, Poland in 1867. She moved to Paris
in 1891 to study at the Sorbonne. She was the first woman to win a Nobel
Prize, receiving the Physics prize in 1903. She later won a second Nobel
Prize in Chemistry in 1911, making her the first person to win two Nobel
Prizes in different fields."""

source = """Marie Sklodowska Curie (1867-1934) was born in Warsaw. She moved
to Paris in 1891 and studied at the University of Paris. She won the Nobel
Prize in Physics in 1903 (shared with Pierre Curie and Henri Becquerel)
and the Nobel Prize in Chemistry in 1911."""

result = factscore(bio, source)
print(f"FActScore: {result['factscore']:.0%}")
# Note: "first person to win two Nobel Prizes in different fields" is
# true of Curie, but the source never states it — a strict verifier
# should mark it unsupported (extrinsic, not intrinsic).

Example output:

Input: "Marie Curie was born in Warsaw in 1867 and won two Nobel Prizes."
Atomic facts:
1. "Marie Curie was born in Warsaw." → SUPPORTED ✓
2. "Marie Curie was born in 1867." → SUPPORTED ✓
3. "Marie Curie won two Nobel Prizes." → SUPPORTED ✓
vs. a hallucinated response:
1. "Marie Curie was born in Krakow." → NOT SUPPORTED ✗ (Warsaw)
2. "She studied at Oxford University." → NOT SUPPORTED ✗ (Sorbonne)
3. "She won the Nobel Prize in 1901." → NOT SUPPORTED ✗ (1903)
Method 4: LLM-as-Judge
The simplest and increasingly popular approach: use a strong LLM to evaluate a weaker LLM's output. Zheng et al. (2023) showed that GPT-4's judgments correlate highly with human preferences (over 80% agreement), and Chiang & Lee (2023) demonstrated that LLM judges can detect factual errors with F1 scores competitive with specialized models.
The approach is attractive because it requires no model training and can be customized with natural language instructions. But it has important failure modes: LLM judges exhibit position bias (preferring the first option in comparisons), verbosity bias (preferring longer responses), and self-enhancement bias (models rate their own outputs higher). Careful prompt engineering and calibration are essential.
— Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench. NeurIPS.
— Chiang, C.-H. & Lee, H. (2023). Can Large Language Models Be an Alternative to Human Evaluations? ACL.
from openai import OpenAI
import json

client = OpenAI()

JUDGE_PROMPT = """You are a factual consistency judge. Given a source context
and an AI-generated response, evaluate each claim in the response.

For EACH claim, determine:
1. Is it SUPPORTED by the source context?
2. Is it CONTRADICTED by the source context?
3. Is it UNVERIFIABLE from the source context alone?

Be strict: if the source doesn't explicitly state something, mark it unverifiable.
Do not use your own knowledge — only judge against the provided source.

Respond in JSON:
{
  "claims": [
    {"text": "claim text", "verdict": "supported|contradicted|unverifiable",
     "evidence": "quote from source or explanation"},
  ],
  "overall_faithfulness": 0.0-1.0,
  "summary": "one sentence overall assessment"
}"""

def llm_judge(context: str, response: str,
              model: str = "gpt-4o") -> dict:
    """
    Use a strong LLM to judge factual consistency.
    Cost: ~$0.01-0.03 per evaluation with GPT-4o.
    """
    result = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"""Source context:
{context}

AI-generated response to evaluate:
{response}"""}
        ],
        response_format={"type": "json_object"},
        temperature=0,  # Deterministic for consistency
    )
    return json.loads(result.choices[0].message.content)

# Usage
context = """Tesla, Inc. was founded in 2003 by Martin Eberhard and
Marc Tarpenning. Elon Musk joined as chairman in 2004 after leading
the Series A funding round. The company went public in 2010."""

response = """Tesla was founded by Elon Musk in 2003. The company
went public in 2010 with an IPO price of $17 per share. It is
headquartered in Austin, Texas."""

judgment = llm_judge(context, response)
print(f"Faithfulness: {judgment['overall_faithfulness']}")
print(f"Summary: {judgment['summary']}")
for claim in judgment["claims"]:
    print(f" [{claim['verdict'].upper()}] {claim['text']}")
    print(f"   Evidence: {claim['evidence']}")

Example output:

Faithfulness: 0.33
Summary: "Only 1 of 3 claims is supported by the source."
[CONTRADICTED] "Tesla was founded by Elon Musk in 2003"
  Evidence: Source says founded by Eberhard and Tarpenning; Musk joined in 2004
[SUPPORTED] "The company went public in 2010"
  Evidence: Source states "The company went public in 2010"
[UNVERIFIABLE] "with an IPO price of $17 per share"
  Evidence: Source does not mention IPO price

Detection Methods Compared
| Method | Needs Context? | Cost | Latency | Best For |
|---|---|---|---|---|
| NLI (DeBERTa) | Yes | Free (local) | ~50ms/claim | RAG, summarization |
| SelfCheckGPT | No | 5-10x gen cost | 5-10s | Open-domain QA |
| FActScore | Yes | Moderate | 2-5s | Biography, long-form |
| LLM-as-Judge | Yes | ~$0.01-0.03 | 1-3s | General-purpose |
Mitigation Strategies
Detection tells you what is hallucinated. Mitigation prevents it from happening in the first place — or catches it before it reaches the user. The most effective production systems layer multiple strategies.
Retrieval-Augmented Generation (RAG)
The single most effective mitigation. Provide the model with relevant source documents and instruct it to answer only from those documents. This shifts the model from parametric recall (unreliable) to reading comprehension (more reliable). Lewis et al. (2020) introduced RAG and showed it substantially reduces hallucination on knowledge-intensive tasks.
SYSTEM_PROMPT = """Answer based ONLY on the provided context.

Rules:
1. If the context doesn't contain the answer, say "I don't have enough information to answer this."
2. Never add facts that aren't in the context.
3. Cite sources using [1], [2] format after each claim.
4. If you're uncertain about an interpretation, say so explicitly."""
— Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
Chain-of-Verification (CoVe)
Dhuliawala et al. (2023) from Meta AI proposed having the model verify its own output through a structured process: (1) generate an initial response, (2) plan verification questions that could expose errors, (3) answer those questions independently, (4) produce a revised response incorporating the verification. This reduced hallucination by up to 50% on biography generation.
def chain_of_verification(query: str, context: str) -> dict:
    # Assumes a generate(prompt, context=None, system=None) helper
    # that wraps your LLM call and returns a string.

    # Step 1: Initial response
    draft = generate(query, context)

    # Step 2: Generate verification questions (one per line)
    questions_text = generate(f"""Given this draft response:
{draft}
What specific factual claims should be verified?
List 3-5 verification questions, one per line.""")
    questions = [q.strip() for q in questions_text.split("\n") if q.strip()]

    # Step 3: Answer verification questions independently
    verifications = []
    for q in questions:
        answer = generate(q, context)  # Answer from context only
        verifications.append({"question": q, "answer": answer})

    # Step 4: Revise based on verification
    revised = generate(f"""Original: {draft}
Verification results: {verifications}
Revise the original, correcting any claims that failed verification.""")
    return {"draft": draft, "revised": revised, "checks": verifications}

— Dhuliawala, S. et al. (2023). Chain-of-Verification Reduces Hallucination. arXiv.
Confidence Estimation and Selective Abstention
Teach the model to say "I don't know." Kadavath et al. (2022) showed that LLMs do have some notion of uncertainty in their logits, but it's poorly calibrated — they are overconfident on wrong answers. Practical approaches include prompting for self-assessed confidence, using token-level log probabilities as uncertainty signals, and setting abstention thresholds.
import re
from openai import OpenAI

client = OpenAI()

def generate_with_abstention(query: str, context: str,
                             threshold: float = 0.7) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Answer from the context provided.
After your answer, state your confidence (0-100).
If confidence < 70, instead say: 'I cannot answer this
confidently based on the available information.'
Format: [answer]\n\nConfidence: [0-100]"""
        }, {
            "role": "user",
            "content": f"Context: {context}\n\nQuestion: {query}"
        }],
        temperature=0,
    )
    text = response.choices[0].message.content
    match = re.search(r'Confidence: (\d+)', text)
    confidence = int(match.group(1)) / 100 if match else 0.5
    if confidence < threshold:
        return {"answer": None, "abstained": True, "confidence": confidence}
    answer = text.split("\nConfidence")[0].strip()
    return {"answer": answer, "abstained": False, "confidence": confidence}

— Kadavath, S. et al. (2022). Language Models (Mostly) Know What They Know. arXiv.
Post-Generation Verification Pipeline
The production-grade approach: generate first, then verify before returning to the user. Layer multiple detection methods for defense in depth.
def verified_generate(query: str, context: str) -> dict:
    """Production pipeline: generate → extract claims → verify → respond."""
    # 1. Generate response
    answer = generate(query, context)

    # 2. Extract atomic claims
    claims = decompose_to_atomic_facts(answer)

    # 3. Verify each claim via NLI
    nli_results = detector.check_response(context, claims)

    # 4. Decision based on faithfulness score
    if nli_results["faithfulness"] >= 0.9:
        return {"answer": answer, "status": "verified", **nli_results}
    elif nli_results["faithfulness"] >= 0.6:
        # Partial hallucination — flag unverified claims
        unsupported = [r["claim"] for r in nli_results["results"]
                       if not r["is_grounded"]]
        return {
            "answer": answer,
            "status": "partial",
            "warning": f"{len(unsupported)} claims could not be verified",
            "unsupported_claims": unsupported,
        }
    else:
        # Major hallucination — regenerate with stricter prompt
        strict_answer = generate(query, context,
            system="Answer ONLY with information explicitly stated in context.")
        return {"answer": strict_answer, "status": "regenerated"}

Evaluation Metrics for Production
You cannot improve what you do not measure. These are the standard metrics for tracking hallucination in production systems. RAGAS (Es et al., 2023) provides an open-source framework that computes most of these automatically.
— Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv.
Faithfulness
Proportion of claims in the answer that are supported by the provided context. This is the primary metric for RAG systems.
Faithfulness = Supported Claims / Total Claims
Target: > 0.90 | Measured by: NLI, FActScore, LLM-judge
Factuality (Open-Domain)
Proportion of claims that are objectively true, verified against external knowledge. Harder to compute than faithfulness because it requires a ground truth source.
Factuality = True Claims / Total Claims
Target: > 0.95 | Measured by: FActScore, human eval
Hallucination Rate
Fraction of responses containing at least one hallucination. A response-level metric (binary per response), useful for monitoring overall system health.
HalRate = Responses with Halluc. / Total Responses
Target: < 0.10 | Measured by: Any detection method
Abstention Accuracy
How often the model correctly declines to answer when it should. Both false positives (answering when it shouldn't) and false negatives (refusing when it could) matter.
Abstention Acc = Correct Abstentions / Should Abstain
Target: > 0.80 | Measured by: Confidence thresholding
Example Production Monitoring Dashboard
| Metric | This Week | Last Week | Target | Status |
|---|---|---|---|---|
| Faithfulness (NLI) | 0.93 | 0.89 | > 0.90 | Pass |
| Hallucination Rate | 7.1% | 11.5% | < 10% | Pass |
| Abstention Accuracy | 0.76 | 0.68 | > 0.80 | Improve |
| FActScore (sampled) | 0.87 | 0.84 | > 0.85 | Pass |
Tip: Run FActScore on a daily sample (100-500 responses) rather than every response, due to cost. Use NLI for real-time gating and LLM-judge for detailed weekly audits.
Where the Field Is Heading
Hallucination detection is one of the most active research areas in NLP. Several directions are converging toward more robust solutions.
Representation Engineering for Truthfulness
Li et al. (2024) and others are finding that LLMs have internal representations of truthfulness that can be identified and amplified. "Inference-Time Intervention" modifies specific attention heads during generation to steer the model toward truthful outputs without retraining. This is a fundamentally different approach from post-hoc detection.
— Li, K. et al. (2024). Inference-Time Intervention: Eliciting Truthful Answers. NeurIPS.
Attribution and Provenance
Rather than detecting hallucination after the fact, attribution-based systems generate text with inline citations from the start. Gao et al. (2023) proposed ALCE (Automatic LLMs' Citation Evaluation), a benchmark for evaluating how well models cite their sources. The goal is models that are transparent about what they know and where they learned it.
— Gao, T. et al. (2023). Enabling Large Language Models to Generate Text with Citations. EMNLP.
Uncertainty Quantification via Conformal Prediction
Statistical methods from conformal prediction are being adapted to provide formal guarantees on LLM outputs. Instead of "this might be hallucinated," the goal is "this answer is correct with 95% probability." Early work is promising but the gap between theoretical guarantees and practical deployment remains significant.
Multi-Agent Debate and Verification
Du et al. (2023) showed that having multiple LLM instances debate their answers reduces factual errors. When agents must defend their claims against skeptical counterparts, hallucinations that survive peer scrutiny are substantially reduced. This is computationally expensive but effective for high-stakes applications.
— Du, Y. et al. (2023). Improving Factuality and Reasoning via Multi-Agent Debate. arXiv.
Key Takeaways
1. Hallucination is inherent to the architecture -- LLMs are next-token predictors, not knowledge bases. They will hallucinate as long as they generate text beyond what is provably grounded. Detection is not optional.
2. NLI is the cheapest, fastest detection method -- For RAG systems, running DeBERTa-MNLI on extracted claims gives you sub-100ms per-claim verification at zero API cost. It should be your first line of defense.
3. Layer detection methods for defense in depth -- NLI for real-time gating, SelfCheckGPT for open-domain, FActScore for granular auditing, LLM-as-judge for nuanced evaluation. No single method catches everything.
4. Mitigation is as important as detection -- RAG grounding, chain-of-verification, selective abstention, and mandatory citations prevent hallucinations before they happen. Post-generation verification catches the rest.
5. Measure faithfulness, not just accuracy -- Track faithfulness, hallucination rate, and abstention accuracy in production. Use RAGAS for automated evaluation. Sample FActScore regularly for deeper audits.