PII Detection & Anonymization
Detect and redact personally identifiable information to stay compliant.
How PII Detection Works
A technical deep-dive into Personally Identifiable Information detection. From regex patterns to LLM-powered extraction and redaction strategies.
What is Personally Identifiable Information?
PII is any data that can identify a specific individual. Unlike general named entities, PII carries legal obligations under regulations such as GDPR, CCPA, and HIPAA. Detecting and protecting PII is not optional; it is a compliance requirement.
The Problem
Imagine you are building a customer support chatbot. Every conversation contains names, emails, addresses, and sometimes credit card numbers. If this data leaks into your logs, training data, or third-party APIs, you face:
Common PII Types
Detection finds PII in text and returns entity spans with types and confidence scores. It is the first step: you cannot protect what you cannot find.
Output: [{name: "John", phone: "555-1234"}]
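A minimal sketch of what "entity spans with types and confidence scores" can look like, using only regular expressions. The `Span` class, the two patterns, and the fixed scores are illustrative assumptions, not part of any library; names such as "John" are not covered here because they need an NER model rather than a pattern.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    score: float # hypothetical fixed confidence per pattern


# Illustrative patterns only; real-world formats vary far more than this.
PATTERNS = {
    "EMAIL": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 0.95),
    "PHONE": (re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])?\b\d{3}[-.]\d{4}\b"), 0.75),
}


def detect(text: str) -> list[Span]:
    """Return every pattern match as a typed span, ordered by position."""
    spans = []
    for entity_type, (pattern, score) in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(Span(entity_type, match.start(), match.end(), score))
    return sorted(spans, key=lambda s: s.start)


if __name__ == "__main__":
    sample = "Call John at 555-1234 or mail john@example.com"
    for span in detect(sample):
        print(span.entity_type, sample[span.start:span.end], span.score)
```

Returning offsets rather than the matched strings keeps the detector independent of whatever redaction strategy is applied afterwards.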
Redaction removes or masks detected PII. The strategy depends on your use case: analytics might need hashing, while logs need full removal.
Output: "Call [PERSON] at [PHONE]"
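The output above can be produced from detected spans with a few lines of code. This is a sketch that assumes spans arrive as non-overlapping `(start, end, entity_type)` tuples; the function name `redact` is hypothetical.

```python
def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each span with a [TYPE] placeholder.

    Spans are applied from right to left so that earlier offsets
    stay valid while the string changes length.
    """
    out = text
    for start, end, entity_type in sorted(spans, reverse=True):
        out = out[:start] + f"[{entity_type}]" + out[end:]
    return out


if __name__ == "__main__":
    text = "Call John at 555-1234"
    spans = [(5, 9, "PERSON"), (13, 21, "PHONE")]
    print(redact(text, spans))  # Call [PERSON] at [PHONE]
```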
PII Detection in Action
See how PII detection works on real-world examples.
Hi, my name is Sarah Johnson and I'm having issues with my order. You can reach me at sarah.johnson@gmail.com or call me at (415) 555-0123. My address is 742 Evergreen Terrace, Springfield, IL 62704. For verification, my SSN ends in 6789.
Redaction Strategies
Once you detect PII, what do you do with it? The answer depends on your downstream use case. Each strategy has tradeoffs between privacy, utility, and reversibility.
Replace with asterisks or X's
- + Maintains text structure
- + Clear something was removed
- - May reveal format/length
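A small sketch of masking, assuming the goal is to hide letters and digits while leaving separators in place. The `mask` helper and its `keep_last` parameter are hypothetical; the output shows why this strategy can reveal format and length.

```python
def mask(value: str, mask_char: str = "*", keep_last: int = 0) -> str:
    """Mask letters and digits; keep separators so the shape stays visible."""
    cut = max(len(value) - keep_last, 0)
    head, tail = value[:cut], value[cut:]
    return "".join(mask_char if ch.isalnum() else ch for ch in head) + tail


if __name__ == "__main__":
    print(mask("123-45-6789"))               # ***-**-****
    print(mask("123-45-6789", keep_last=4))  # ***-**-6789
```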
Swap with placeholder text
- + Preserves semantic meaning
- + Good for analysis
- - Loses original context
One-way cryptographic hash
- + Consistent per value
- + Allows matching
- - Irreversible
- - Loses format
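One way to sketch this strategy is a keyed hash (HMAC-SHA256) from the Python standard library. Note that this swaps in a keyed variant rather than a bare hash: low-entropy values such as phone numbers can be brute-forced against a plain unkeyed digest. The function name, the truncation length, and the demo key are assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, length: int = 16) -> str:
    """Return a stable, irreversible token for a PII value.

    The same value and key always give the same token, which allows
    joins and deduplication without storing the original.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


if __name__ == "__main__":
    demo_key = b"replace-with-a-secret-key"  # hypothetical; load from a secret store
    a = pseudonymize("555-1234", demo_key)
    b = pseudonymize("555-1234", demo_key)
    print(a == b)  # True: consistent per value, so records can still be matched
```

The token is hex text of fixed length, so the original format (digits, separators) is lost, as the tradeoff list notes.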
Reversible with key
- + Recoverable if needed
- + Secure storage
- - Key management overhead
Replace with fake but realistic data
- + Maintains data utility
- + Good for testing
- - Complex to implement
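A rough sketch of synthetic replacement using only the standard library. It assumes you want the same original value to map to the same fake value, which is done here by seeding a random generator from a hash of the original. The name lists and helper names are hypothetical, and a real system would need far larger pools and per-type generators.

```python
import hashlib
import random

# Tiny illustrative pools; hypothetical values only.
FAKE_FIRST = ["Alex", "Jordan", "Taylor", "Morgan"]
FAKE_LAST = ["Rivera", "Chen", "Okafor", "Novak"]


def _rng_for(value: str) -> random.Random:
    """Derive a deterministic generator from the original value."""
    seed = int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")
    return random.Random(seed)


def fake_name(original: str) -> str:
    rng = _rng_for(original)
    return f"{rng.choice(FAKE_FIRST)} {rng.choice(FAKE_LAST)}"


def fake_phone(original: str) -> str:
    rng = _rng_for(original)
    # 555-0100 through 555-0199 is reserved for fictional use in North America.
    return f"(555) 555-01{rng.randrange(100):02d}"


if __name__ == "__main__":
    print(fake_name("Sarah Johnson"))
    print(fake_phone("(415) 555-0123"))
```

Because the mapping is deterministic, a replaced name stays consistent across a dataset, which keeps test data and analytics usable.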
Choosing a Strategy
Detection Methods Compared
There is no single best method. Production systems typically combine multiple approaches: regex for structured formats, NER for names, and LLMs for edge cases.
| Method | Type | Accuracy | Speed | Best For |
|---|---|---|---|---|
| Presidio | Hybrid | High (85-95%) | Fast (~1ms/entity) | Enterprise PII redaction |
| spaCy NER | ML-based | High (90%+ for names) | Medium (~10ms/doc) | Name/organization detection |
| Regex Patterns | Rule-based | Variable (format-dependent) | Very fast (<1ms) | Structured formats (SSN, CC, phone) |
| LLM-based | Deep Learning | Very high (95%+) | Slow (500-2000ms) | Complex/ambiguous cases |
| GLiNER | Zero-shot NER | High (85-90%) | Medium (~50ms) | Custom entity definitions |
Use Presidio when:
- - Production-grade PII detection needed
- - You need customizable recognizers
- - Compliance (GDPR, HIPAA) is required
- - Multiple languages supported
Use spaCy NER when:
- - You already use spaCy for NLP
- - Names and organizations are primary targets
- - You want fine-grained control
- - Custom patterns needed
Use regex patterns when:
- - Only structured PII (SSN, CC, phone)
- - Maximum speed is critical
- - No ML infrastructure available
- - Predictable, well-formatted input
Use an LLM when:
- - Complex, ambiguous cases
- - Context-aware detection needed
- - You need explanations
- - Low volume, high value documents
Production systems should layer methods: regex first for structured formats (SSN, CC, phone), then NER for names and addresses, and optionally LLM for edge cases or validation. This gives you speed where possible and accuracy where needed.
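The layering described above can be sketched as an ordered list of detector functions, where spans from earlier layers win when a later layer reports an overlapping span. Everything here is an assumption for illustration: the `run_layers` name, the tuple layout, and especially `ner_layer_stub`, which stands in for a real NER model with a hard-coded lookup.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)
Detector = Callable[[str], List[Span]]

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def regex_layer(text: str) -> List[Span]:
    """Fast first pass for structured formats."""
    return [(m.start(), m.end(), "US_SSN") for m in SSN_RE.finditer(text)]


def ner_layer_stub(text: str) -> List[Span]:
    """Placeholder for an NER model; uses a hypothetical lookup list."""
    known_names = ["John Smith"]
    spans = []
    for name in known_names:
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), "PERSON"))
    return spans


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def run_layers(text: str, layers: List[Detector]) -> List[Span]:
    """Run detectors in priority order; drop spans that overlap accepted ones."""
    accepted: List[Span] = []
    for layer in layers:
        for span in layer(text):
            if not any(overlaps(span, kept) for kept in accepted):
                accepted.append(span)
    return sorted(accepted)


if __name__ == "__main__":
    text = "John Smith, SSN 123-45-6789"
    print(run_layers(text, [regex_layer, ner_layer_stub]))
    # [(0, 10, 'PERSON'), (16, 27, 'US_SSN')]
```

An optional LLM layer would be appended last in the same list, so it only sees text the cheaper layers left unresolved.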
Code Examples
Get started with PII detection in Python. From Microsoft Presidio to custom regex patterns.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = """
Customer: John Smith
Email: john.smith@email.com
SSN: 123-45-6789
Phone: (555) 123-4567
"""

# Analyze - detect PII entities
results = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "US_SSN", "PHONE_NUMBER"],
    language="en",
)

# Print detected entities
for result in results:
    print(f"{result.entity_type}: {text[result.start:result.end]} (score: {result.score:.2f})")

# Anonymize with custom operators
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
    # The mask operator needs masking_char in addition to chars_to_mask and from_end
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 10, "from_end": False}),
    "US_SSN": OperatorConfig("replace", {"new_value": "[SSN REDACTED]"}),
    "PHONE_NUMBER": OperatorConfig("hash"),
}

anonymized = anonymizer.anonymize(text=text, analyzer_results=results, operators=operators)
print(anonymized.text)
Quick Reference
- - Microsoft Presidio (full suite)
- - or spaCy + custom regex
- - Always validate with test data
- - Start with regex patterns
- - Add NER for names
- - LLM for edge case analysis
- - False negatives are worse than false positives
- - Context matters: "John" vs "John Smith at 123 Main"
- - Always log what was redacted (not the values)
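The last point, logging what was redacted without logging the values, can be sketched as an audit record built from entity types and counts only. The record layout, the `pii_audit` logger name, and the function names are assumptions for illustration.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("pii_audit")  # hypothetical logger name


def audit_record(doc_id: str, spans: list[tuple[int, int, str]]) -> dict:
    """Summarize redactions by entity type; never include the matched text."""
    counts = Counter(entity_type for _, _, entity_type in spans)
    return {
        "doc_id": doc_id,
        "redacted": dict(sorted(counts.items())),
        "total": sum(counts.values()),
    }


def log_redaction(doc_id: str, spans: list[tuple[int, int, str]]) -> dict:
    record = audit_record(doc_id, spans)
    logger.info(json.dumps(record))
    return record


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_redaction("ticket-1", [(0, 10, "PERSON"), (16, 27, "US_SSN"), (30, 40, "PERSON")])
```

Counts per type are usually enough to spot a detector that suddenly stops firing, which is how false negatives tend to surface in practice.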
Use Cases
- ✓ GDPR/CCPA compliance
- ✓ Log redaction
- ✓ Dataset cleaning
- ✓ Customer support transcripts
Architectural Patterns
Sequence Labeling
Token-level tagging of PII spans.
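As a sketch of how token-level tags become PII spans, the helper below merges per-token labels into character spans. It assumes the common BIO tagging scheme (`B-` begins an entity, `I-` continues it, `O` is outside), which the text above does not specify; the function name and token layout are hypothetical.

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (start, end, entity_type) spans.

    tokens: list of (text, start, end) with character offsets
    tags:   list of labels such as "B-PERSON", "I-PERSON", "O"
    """
    spans = []
    current = None  # [start, end, label] of the entity being built
    for (_, start, end), tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[2] != label)
        )
        if starts_new:
            if current:
                spans.append(tuple(current))
            current = [start, end, label]
        elif prefix == "I":
            current[1] = end  # extend the running entity
        else:  # "O" closes any open entity
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


if __name__ == "__main__":
    tokens = [("Email", 0, 5), ("John", 6, 10), ("Smith", 11, 16), ("today", 17, 22)]
    tags = ["O", "B-PERSON", "I-PERSON", "O"]
    print(bio_to_spans(tokens, tags))  # [(6, 16, 'PERSON')]
```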
Rule + ML Hybrid
Regex for high-precision entities plus ML for recall.
Implementations
API Services
AWS Comprehend PII
Managed PII detection service from AWS.
Benchmarks
Quick Facts
- Input: Text
- Output: Structured Data
- Implementations: 2 open source, 1 API
- Patterns: 2 approaches