Hallucination Detection
LLMs make false statements with complete confidence. Learn to detect and prevent hallucinations in production systems.
What Causes Hallucinations
LLMs are trained to predict the next token, not to be truthful. They hallucinate for several fundamental reasons that cannot be fully eliminated through prompting alone.
Training Data Issues
- Outdated information (knowledge cutoff)
- Incorrect facts in training data
- Underrepresented topics (long tail)
- Conflicting information from multiple sources
Model Architecture Issues
- Poor confidence calibration
- Tendency to produce fluent but false text
- Pattern matching over factual recall
- Context window limitations
Hallucination Examples
Intrinsic hallucination (contradicts context)
Context: "The company was founded in 2015."
LLM: "The company, established in 2012, has grown significantly."
Extrinsic hallucination (adds unsupported facts)
Context: "The CEO announced quarterly earnings."
LLM: "The CEO, John Smith, announced record quarterly earnings of $5B."
Detection Method 1: NLI-Based Verification
Natural Language Inference (NLI) models classify whether a hypothesis is entailed, neutral, or contradicted by a premise. We use this to verify if LLM claims are supported by the source context.
NLI Classification
| NLI Label | Meaning | Verdict |
|---|---|---|
| ENTAILMENT | Claim is supported by the context | Grounded |
| NEUTRAL | Context doesn't confirm or deny the claim | Uncertain |
| CONTRADICTION | Claim conflicts with the context | Hallucination |
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load NLI model (BART fine-tuned on MNLI) once at module level
model_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def check_grounded(claim: str, context: str) -> dict:
    """
    Check if a claim is supported by the context.
    Returns label and confidence scores.
    """
    # NLI: does the context (premise) entail the claim (hypothesis)?
    inputs = tokenizer(context, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)[0]

    # bart-large-mnli label order: contradiction, neutral, entailment
    labels = ["contradiction", "neutral", "entailment"]
    label_probs = {l: p.item() for l, p in zip(labels, probs)}

    return {
        "is_grounded": label_probs["entailment"] > 0.5,
        "label": max(label_probs, key=label_probs.get),
        "scores": label_probs
    }

# Usage
context = "The company reported revenue of $10 million in Q3 2024."
claim = "The company made $10 million in the third quarter."

result = check_grounded(claim, context)
print(f"Grounded: {result['is_grounded']}")
print(f"Label: {result['label']}")
print(f"Entailment score: {result['scores']['entailment']:.3f}")
```

```
# Grounded claim
Context: "The company reported revenue of $10 million in Q3 2024."
Claim:   "The company made $10 million in Q3."
>>> Grounded: True, Entailment: 0.89

# Hallucinated claim
Context: "The company reported revenue of $10 million in Q3 2024."
Claim:   "The company achieved record profits this quarter."
>>> Grounded: False, Entailment: 0.12 (Neutral: 0.71)
```
Detection Method 2: Self-Consistency
Generate multiple answers to the same question. If the model gives inconsistent answers, it's likely hallucinating. Consistent answers across samples are more reliable.
```python
from openai import OpenAI
from collections import Counter
import re

client = OpenAI()

def self_consistency_check(question: str, context: str,
                           n_samples: int = 5,
                           temperature: float = 0.7) -> dict:
    """
    Generate multiple answers and check for consistency.
    Returns the majority answer and a confidence score.
    """
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer based ONLY on the context provided. Be concise."},
                {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}
            ],
            temperature=temperature
        )
        answers.append(response.choices[0].message.content.strip())

    # Normalize answers for comparison (lowercase, strip punctuation)
    def normalize(text):
        return re.sub(r'[^a-z0-9\s]', '', text.lower()).strip()

    normalized = [normalize(a) for a in answers]

    # Count occurrences
    counter = Counter(normalized)
    most_common, count = counter.most_common(1)[0]

    # Find the original answer that matches the normalized version
    majority_answer = next(a for a, n in zip(answers, normalized) if n == most_common)
    consistency_score = count / n_samples

    return {
        "answer": majority_answer,
        "consistency_score": consistency_score,
        "is_consistent": consistency_score >= 0.6,
        "all_answers": answers,
        "sample_count": n_samples
    }

# Usage
context = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976."
question = "When was Apple founded?"

result = self_consistency_check(question, context, n_samples=5)
print(f"Answer: {result['answer']}")
print(f"Consistency: {result['consistency_score']:.0%}")
print(f"Reliable: {result['is_consistent']}")
```

Self-Consistency Insight
Factual questions should yield consistent answers (100% consistency). Subjective questions naturally have variance. Low consistency on factual questions signals potential hallucination or insufficient context.
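Exact string matching undercounts agreement when the samples phrase the same fact differently. A hedged alternative is to compare the sampled answers by embedding similarity; the sketch below assumes the OpenAI text-embedding-3-small model and a 0.9 similarity threshold, both illustrative choices rather than fixed recommendations.

```python
import numpy as np

def semantic_consistency(answers: list[str], threshold: float = 0.9) -> dict:
    """Mean pairwise cosine similarity between sampled answers."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=answers)
    vecs = np.array([d.embedding for d in resp.data])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize rows
    sims = vecs @ vecs.T
    n = len(answers)
    mean_sim = (sims.sum() - n) / (n * (n - 1))  # average of off-diagonal entries
    return {"mean_similarity": float(mean_sim), "is_consistent": mean_sim >= threshold}
```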
Detection Method 3: Retrieval Verification
Extract claims from the LLM response, then verify that each claim is supported by the retrieved context. This is how the RAGAS faithfulness metric works.
```python
from openai import OpenAI
import json

client = OpenAI()

def extract_claims(answer: str) -> list:
    """Extract atomic claims from an answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Extract all factual claims from the text.
Return a JSON object with a "claims" key containing an array of strings,
each being one atomic claim.
Only include verifiable factual statements, not opinions."""},
            {"role": "user", "content": answer}
        ],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("claims", [])

def verify_claims(claims: list, context: str) -> dict:
    """Verify each claim against the context using NLI."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    import torch

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
    model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

    verified = []
    for claim in claims:
        inputs = tokenizer(context, claim, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1)[0]
        entailment_score = probs[2].item()  # Index 2 is entailment

        verified.append({
            "claim": claim,
            "supported": entailment_score > 0.5,
            "score": entailment_score
        })

    supported_count = sum(1 for v in verified if v["supported"])
    faithfulness = supported_count / len(claims) if claims else 1.0

    return {
        "faithfulness_score": faithfulness,
        "verified_claims": verified,
        "supported_count": supported_count,
        "total_claims": len(claims)
    }

def full_hallucination_check(answer: str, context: str) -> dict:
    """Complete hallucination detection pipeline."""
    claims = extract_claims(answer)
    verification = verify_claims(claims, context)

    return {
        "is_faithful": verification["faithfulness_score"] >= 0.8,
        "faithfulness_score": verification["faithfulness_score"],
        "claims": verification["verified_claims"],
        "summary": f"{verification['supported_count']}/{verification['total_claims']} claims verified"
    }

# Usage
context = """
Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.
The company is headquartered in Cupertino, California. As of 2023, Apple is
the world's largest technology company by revenue, with $383 billion in 2023.
"""

answer = """
Apple was founded in 1976 by Steve Jobs and Steve Wozniak.
The company is based in Cupertino and generated over $380 billion in revenue in 2023.
Apple is known for the iPhone, which was first released in 2008.
"""

result = full_hallucination_check(answer, context)
print(f"Faithful: {result['is_faithful']}")
print(f"Score: {result['faithfulness_score']:.0%}")
print(f"Verified: {result['summary']}")
```

```
Faithful: False
Score: 67%
Verified: 2/3 claims verified

Claims:
[SUPPORTED]     "Apple was founded in 1976 by Steve Jobs and Steve Wozniak" (0.91)
[SUPPORTED]     "The company is based in Cupertino and generated over $380B in 2023" (0.87)
[NOT SUPPORTED] "iPhone was first released in 2008" (0.12)
                ^ Context doesn't mention iPhone release date (actual: 2007)
```
Mitigation Strategies
Detection is only half the battle. Here's how to prevent and mitigate hallucinations.
1. Grounding with RAG
Always provide source context. Instruct the model to ONLY use provided information.
system_prompt = """Answer based ONLY on the provided context. If the context doesn't contain the answer, say "I don't have enough information." Never add facts that aren't in the context. Cite sources using [1], [2] format."""
2. Confidence Thresholds
Ask the model to express uncertainty. Filter low-confidence answers.
```python
import re

def generate_with_confidence(query: str, context: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Answer the question based on context.
After your answer, rate your confidence from 0-100.
Format: [Answer]\n\nConfidence: [0-100]"""},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"}
        ]
    )
    text = response.choices[0].message.content

    # Parse the self-reported confidence score (treat a missing score as 0)
    match = re.search(r'Confidence:\s*(\d+)', text)
    confidence = int(match.group(1)) if match else 0

    if confidence < 70:
        return {"answer": None, "reason": "Low confidence", "confidence": confidence}
    return {"answer": text.split("\n\nConfidence")[0], "confidence": confidence}
```

3. Mandatory Citations
Force the model to cite sources for every claim. Uncited claims are suspect.
system_prompt = """Every factual statement MUST have a citation. Use format: [statement] [1] where [1] refers to a source. If you cannot cite a source for a claim, do not make that claim. Unsupported claims are forbidden."""
4. Post-Generation Verification
Run hallucination detection on every response before returning it to the user.
```python
def safe_generate(query: str, context: str) -> dict:
    # Generate answer (generate_answer is your existing RAG generation call)
    answer = generate_answer(query, context)

    # Verify
    verification = full_hallucination_check(answer, context)

    if not verification["is_faithful"]:
        # Option 1: Regenerate with stricter prompt
        # Option 2: Filter out unverified claims
        # Option 3: Add warning to user
        return {
            "answer": answer,
            "warning": "Some claims could not be verified",
            "unverified_claims": [c["claim"] for c in verification["claims"]
                                  if not c["supported"]]
        }

    return {"answer": answer, "verified": True}
```

Evaluation Metrics
Measure hallucination rates systematically to track improvement over time.
| Metric | Definition | Formula | Target |
|---|---|---|---|
| Faithfulness | % of claims in the answer that are supported by the context | Supported Claims / Total Claims | > 0.90 |
| Factuality | % of claims that are objectively true (vs. ground truth) | True Claims / Total Claims | > 0.95 |
| Hallucination Rate | % of responses containing at least one hallucination | Responses with Hallucinations / Total Responses | < 0.10 |
| Abstention Rate | % of times the model correctly says "I don't know" when it should | Correct IDK / Should IDK | > 0.80 |
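These metrics are computed over an evaluation set rather than a single response. Below is a minimal sketch that aggregates faithfulness and hallucination rate by reusing full_hallucination_check from above; the record format (a list of dicts with "answer" and "context" keys) is an assumption. Factuality and abstention rate need labeled ground truth, so they are left out of this sketch.

```python
def evaluate_dataset(records: list[dict]) -> dict:
    """Aggregate hallucination metrics over {"answer", "context"} records."""
    scores = []
    hallucinated = 0
    for rec in records:
        result = full_hallucination_check(rec["answer"], rec["context"])
        scores.append(result["faithfulness_score"])
        if result["faithfulness_score"] < 1.0:  # at least one unsupported claim
            hallucinated += 1
    n = len(records)
    return {
        "faithfulness": sum(scores) / n,
        "hallucination_rate": hallucinated / n,
        "n_responses": n,
    }
```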
Production Monitoring Dashboard
| Metric | This Week | Last Week | Target | Status |
|---|---|---|---|---|
| Faithfulness | 0.92 | 0.89 | > 0.90 | Pass |
| Hallucination Rate | 8.2% | 11.5% | < 10% | Pass |
| Abstention Rate | 0.72 | 0.68 | > 0.80 | Improve |
| Avg Confidence | 0.84 | 0.81 | > 0.75 | Pass |
Key Takeaways
1. Hallucinations are fundamental - LLMs are trained to be fluent, not truthful. Detection is essential for production.
2. NLI is the workhorse - Use NLI models to verify if claims are entailed by the source context.
3. Multi-method detection - Combine NLI, self-consistency, and claim extraction for robust detection.
4. Mitigate with grounding and citations - Force models to cite sources, verify post-generation, and track metrics.