
PII Detection & Anonymization

Detect and redact personally identifiable information to stay compliant.

How PII Detection Works

A technical deep dive into detecting personally identifiable information (PII), from regex patterns to LLM-powered extraction and redaction strategies.

1

What is Personally Identifiable Information?

PII is any data that can identify a specific individual, either on its own or combined with other data. Unlike general named entities, PII carries legal obligations under regulations such as GDPR, CCPA, and HIPAA. Detecting and protecting it is not optional; it is a compliance requirement.

The Problem

Imagine you are building a customer support chatbot. Every conversation contains names, emails, addresses, and sometimes credit card numbers. If this data leaks into your logs, training data, or third-party APIs, you face:

Legal Penalties: GDPR fines can reach EUR 20 million or 4% of worldwide annual turnover, whichever is higher
Reputation Damage: Customer trust is hard to rebuild
Identity Theft: Real harm to real people

Common PII Types

Type | Description | Example | Risk
PERSON | Full names, nicknames | "John Smith" | Identity theft, social engineering
EMAIL | Email addresses | "john@example.com" | Spam, phishing, account takeover
PHONE | Phone numbers | "(555) 123-4567" | SMS phishing, identity verification
SSN | Social Security Numbers | "123-45-6789" | Complete identity theft, fraud
ADDRESS | Physical addresses | "123 Main St, NYC" | Physical security, stalking
CREDIT_CARD | Credit card numbers | "4111-1111-1111-1111" | Financial fraud
DOB | Date of birth | "01/15/1990" | Identity verification, fraud
IP_ADDRESS | IP addresses | "192.168.1.1" | Location tracking, network attacks

Detection

Finding PII in text. Returns entity spans with types and confidence scores. This is the first step - you cannot protect what you cannot find.

Input: "Call John at 555-1234"
Output: [{type: "PERSON", text: "John", start: 5, end: 9}, {type: "PHONE", text: "555-1234", start: 13, end: 21}]
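
A minimal sketch of the detection step for structured formats, assuming hand-written regular expressions. The `Span` class, the `PATTERNS` table, and the fixed per-pattern scores are hypothetical names and values chosen for illustration, not calibrated confidences. Names such as "John" are not caught here, because they need an NER model rather than a regex.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    text: str
    score: float  # hypothetical fixed score per pattern, not a calibrated probability


# Illustrative patterns only; real-world formats vary far more than this.
PATTERNS = {
    "EMAIL": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 0.9),
    "SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.8),
    "PHONE": (
        re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])?\d{3}[-.]\d{4}(?!\d)"),
        0.6,
    ),
}


def detect(text: str) -> list[Span]:
    """Return every regex match as a typed span, ordered by start offset."""
    spans = []
    for entity_type, (pattern, score) in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append(Span(entity_type, m.start(), m.end(), m.group(), score))
    return sorted(spans, key=lambda s: s.start)


for span in detect("Call John at 555-1234"):
    print(span.entity_type, span.start, span.end, span.text)
```

Returning character offsets rather than only the matched strings lets a later redaction step rewrite the exact positions in the original text.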
Redaction

Removing or masking detected PII. The strategy depends on your use case: analytics might need hashing, while logs need full removal.

Input: "Call John at 555-1234"
Output: "Call [PERSON] at [PHONE]"
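
The Replace-style redaction shown in this example can be sketched as follows, assuming the detector has already produced non-overlapping `(start, end, entity_type)` tuples. The `redact` function name and the tuple layout are hypothetical.

```python
def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, entity_type) span with a [TYPE] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"[{entity_type}]" + text[end:]
    return text


text = "Call John at 555-1234"
spans = [(5, 9, "PERSON"), (13, 21, "PHONE")]
print(redact(text, spans))  # Call [PERSON] at [PHONE]
```

Overlapping spans would need to be merged or prioritized before this step; the sketch does not handle them.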
2

PII Detection in Action

See how PII detection works on a real-world example.

Hi, my name is Sarah Johnson and I'm having issues with my order. You can reach me at sarah.johnson@gmail.com or call me at (415) 555-0123. My address is 742 Evergreen Terrace, Springfield, IL 62704. For verification, my SSN ends in 6789.

Entity types: PERSON (Full names, nicknames), EMAIL (Email addresses), PHONE (Phone numbers), ADDRESS (Physical addresses), SSN_PARTIAL (Social Security Numbers)

PII Entities Found: 5
Unique Types: 5
Risk Level: HIGH
Coverage: 100%

3

Redaction Strategies

Once you detect PII, what do you do with it? The answer depends on your downstream use case. Each strategy has tradeoffs between privacy, utility, and reversibility.

Mask

Replace with asterisks or X's

john@email.com
****@*****.***
Pros:
  • + Maintains text structure
  • + Clear something was removed
Cons:
  • - May reveal format/length
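
A minimal sketch of masking, assuming you want to keep separators so the shape of the value stays visible. The function names `mask_all` and `mask_keep_last` and the default `keep` set are hypothetical.

```python
def mask_all(value: str, mask_char: str = "*", keep: str = "@.-() ") -> str:
    """Mask every character except the separators listed in `keep`."""
    return "".join(c if c in keep else mask_char for c in value)


def mask_keep_last(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask letters and digits but leave the last `visible` of them readable."""
    out = []
    remaining = visible
    for c in reversed(value):
        if not c.isalnum():
            out.append(c)  # keep separators such as "-" or " "
        elif remaining > 0:
            out.append(c)  # still inside the visible tail
            remaining -= 1
        else:
            out.append(mask_char)
    return "".join(reversed(out))


print(mask_all("john@email.com"))             # ****@*****.***
print(mask_keep_last("4111-1111-1111-1111"))  # ****-****-****-1111
```

Both variants leak the length and format of the original value, which is the tradeoff listed under Cons.
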
Replace

Swap with placeholder text

John Smith
[PERSON]
Pros:
  • + Preserves semantic meaning
  • + Good for analysis
Cons:
  • - Loses original context
Hash

One-way cryptographic hash

123-45-6789
a7b9c2d1...
Pros:
  • + Consistent per value
  • + Allows matching
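Cons:
  • - Irreversible
  • - Loses format
  • - Unsalted hashes of low-entropy values (SSNs, phone numbers) can be reversed by brute force; prefer a keyed hash (HMAC)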
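
A minimal sketch of hashing for analytics, assuming a keyed hash (HMAC-SHA256) rather than a bare hash, since values like SSNs have so few possible inputs that an unkeyed digest can be enumerated. The `pseudonymize` name, the truncation length, and the demo key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, length: int = 16) -> str:
    """Return a truncated HMAC-SHA256 hex digest of `value` under `key`."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical key for demonstration; load a real one from a secret store.
demo_key = b"replace-with-a-secret-key"

a = pseudonymize("123-45-6789", demo_key)
b = pseudonymize("123-45-6789", demo_key)
c = pseudonymize("987-65-4321", demo_key)
print(a == b)  # True: the same value always maps to the same token
print(a == c)  # False: different values map to different tokens
```

The same key must be reused wherever tokens need to match across datasets, and rotating it breaks that linkage by design.
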
Encrypt

Reversible with key

(555) 123-4567
xK9mP2...
Pros:
  • + Recoverable if needed
  • + Secure storage
Cons:
  • - Key management overhead
Synthetic

Replace with fake but realistic data

Sarah Johnson
Emily Williams
Pros:
  • + Maintains data utility
  • + Good for testing
Cons:
  • - Complex to implement
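
A minimal sketch of synthetic replacement, assuming names should map consistently so the same person gets the same fake name throughout a dataset. The name pools, the `synthetic_name` function, and the key are hypothetical, and the pools are far too small for real use: distinct people can collide on the same fake name.

```python
import hashlib
import hmac

# Tiny illustrative pools; a real system would use much larger, locale-aware lists.
FIRST_NAMES = ["Emily", "Daniel", "Priya", "Marcus", "Elena", "Tomas"]
LAST_NAMES = ["Williams", "Garcia", "Chen", "Okafor", "Novak", "Reid"]


def synthetic_name(real_name: str, key: bytes) -> str:
    """Deterministically pick a fake name for `real_name` under a secret key."""
    # Normalize case and whitespace so "Sarah Johnson" and "sarah  johnson" agree.
    normalized = " ".join(real_name.lower().split())
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).digest()
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


key = b"hypothetical-demo-key"
print(synthetic_name("Sarah Johnson", key))
print(synthetic_name("sarah johnson", key))  # same fake name as the line above
```
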

Choosing a Strategy

For Logging
Use Replace with type tags. Maintains log readability while removing sensitive data.
For Analytics
Use Hash to allow entity counting and matching without exposing values.
For Testing
Use Synthetic to maintain realistic data distributions.
4

Detection Methods Compared

There is no single best method. Production systems typically combine multiple approaches: regex for structured formats, NER for names, and LLMs for edge cases.

Method | Type | Accuracy | Speed | Best For
Presidio | Hybrid | High (85-95%) | Fast (~1ms/entity) | Enterprise PII redaction
spaCy NER | ML-based | High (90%+ for names) | Medium (~10ms/doc) | Name/organization detection
Regex Patterns | Rule-based | Variable (format-dependent) | Very fast (<1ms) | Structured formats (SSN, CC, phone)
LLM-based | Deep Learning | Very high (95%+) | Slow (500-2000ms) | Complex/ambiguous cases
GLiNER | Zero-shot NER | High (85-90%) | Medium (~50ms) | Custom entity definitions

Use Presidio when:
  • - Production-grade PII detection needed
  • - You need customizable recognizers
  • - Compliance (GDPR, HIPAA) is required
  • - You need multi-language support
Use spaCy + Regex when:
  • - You already use spaCy for NLP
  • - Names and organizations are primary targets
  • - You want fine-grained control
  • - Custom patterns needed
Use Regex alone when:
  • - Only structured PII (SSN, CC, phone)
  • - Maximum speed is critical
  • - No ML infrastructure available
  • - Predictable, well-formatted input
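
Regex alone tends to over-match on long digit runs, so a checksum is a common way to cut false positives for card numbers. The sketch below pairs an illustrative pattern with the Luhn check; `CARD_RE`, `luhn_valid`, and `find_cards` are hypothetical names, and the pattern is an assumption rather than a complete card-format grammar.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"(?<!\d)(?:\d[ -]?){12,18}\d(?!\d)")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[str]:
    """Return regex candidates that also pass the Luhn check."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]


print(find_cards("Card 4111-1111-1111-1111 on file"))  # ['4111-1111-1111-1111']
print(find_cards("Order 1234567890123456"))            # []
```
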
Use LLM when:
  • - Complex, ambiguous cases
  • - Context-aware detection needed
  • - You need explanations
  • - Low volume, high value documents
The Hybrid Approach (Recommended)

Production systems should layer methods: regex first for structured formats (SSN, CC, phone), then NER for names and addresses, and optionally LLM for edge cases or validation. This gives you speed where possible and accuracy where needed.
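The layering described above can be sketched as an ordered list of detectors where earlier layers win on overlap. Everything here is an assumption for illustration: `Span`, `regex_layer`, `name_layer`, and `detect_layered` are hypothetical names, and `name_layer` is a hard-coded stand-in for a real NER model.

```python
import re
from typing import Callable, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    entity_type: str
    source: str


Detector = Callable[[str], list[Span]]


def regex_layer(text: str) -> list[Span]:
    """Layer 1: structured formats via illustrative regex patterns."""
    patterns = {
        "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
        "EMAIL": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    }
    return [
        Span(m.start(), m.end(), entity_type, "regex")
        for entity_type, pattern in patterns.items()
        for m in re.finditer(pattern, text)
    ]


def name_layer(text: str) -> list[Span]:
    """Layer 2: stand-in for an NER model, matching a tiny hard-coded name list."""
    known_names = ["John Smith", "Sarah Johnson"]
    return [
        Span(m.start(), m.end(), "PERSON", "ner-stub")
        for name in known_names
        for m in re.finditer(re.escape(name), text)
    ]


def detect_layered(text: str, layers: list[Detector]) -> list[Span]:
    """Run layers in order; a span is dropped if it overlaps one already accepted."""
    accepted: list[Span] = []
    for layer in layers:
        for span in layer(text):
            if all(span.end <= a.start or span.start >= a.end for a in accepted):
                accepted.append(span)
    return sorted(accepted)


text = "John Smith, john.smith@email.com, SSN 123-45-6789"
for span in detect_layered(text, [regex_layer, name_layer]):
    print(span.entity_type, text[span.start:span.end], span.source)
```

Giving precedence to the first layer is one possible policy; a score-based or longest-match policy would be equally reasonable, depending on which detector you trust more.
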

5

Code Examples

Get started with PII detection in Python, from Microsoft Presidio to custom regex patterns.

Presidio
pip install presidio-analyzer presidio-anonymizer
Production Ready
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = """
Customer: John Smith
Email: john.smith@email.com
SSN: 123-45-6789
Phone: (555) 123-4567
"""

# Analyze - detect PII entities
results = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "US_SSN", "PHONE_NUMBER"],
    language="en"
)

# Print detected entities
for result in results:
    print(f"{result.entity_type}: {text[result.start:result.end]} (score: {result.score:.2f})")

# Anonymize with custom operators
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
    # The mask operator requires masking_char in addition to chars_to_mask and from_end
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 10, "from_end": False}),
    "US_SSN": OperatorConfig("replace", {"new_value": "[SSN REDACTED]"}),
    "PHONE_NUMBER": OperatorConfig("hash"),
}

anonymized = anonymizer.anonymize(text=text, analyzer_results=results, operators=operators)
print(anonymized.text)

Quick Reference

For Production
  • - Microsoft Presidio (full suite)
  • - or spaCy + custom regex
  • - Always validate with test data
For Prototyping
  • - Start with regex patterns
  • - Add NER for names
  • - LLM for edge case analysis
Key Considerations
  • - False negatives are worse than false positives
  • - Context matters: "John" vs "John Smith at 123 Main"
  • - Always log which entity types were redacted, never the values themselves
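
The last consideration can be sketched as an audit record that carries only counts per entity type. The logger name, the `audit_summary` and `log_redaction` functions, and the `(start, end, entity_type)` tuple layout are hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("pii_audit")  # hypothetical logger name


def audit_summary(spans: list[tuple[int, int, str]]) -> dict[str, int]:
    """Count redactions per entity type; values and offsets are left out on purpose."""
    return dict(Counter(entity_type for _, _, entity_type in spans))


def log_redaction(doc_id: str, spans: list[tuple[int, int, str]]) -> None:
    """Emit one audit line per document with totals only."""
    summary = audit_summary(spans)
    logger.info("redacted doc=%s total=%d by_type=%s", doc_id, len(spans), summary)


spans = [(5, 9, "PERSON"), (13, 21, "PHONE"), (30, 34, "PERSON")]
print(audit_summary(spans))  # {'PERSON': 2, 'PHONE': 1}
```
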

Use Cases

  • GDPR/CCPA compliance
  • Log redaction
  • Dataset cleaning
  • Customer support transcripts

Architectural Patterns

Sequence Labeling

Token-level tagging of PII spans.
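
As a rough illustration of what token-level tagging produces, the sketch below converts BIO tags (B- begins an entity, I- continues it, O is outside) into entity spans. The tagging scheme is a common convention assumed here, and `bio_to_spans` is a hypothetical helper, not part of any specific library.

```python
def bio_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans: list[tuple[str, str]] = []
    current_type = None
    current_tokens: list[str] = []

    def flush() -> None:
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the open entity
        else:  # "O": close any open entity
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = ["Call", "John", "Smith", "at", "555-1234"]
tags = ["O", "B-PERSON", "I-PERSON", "O", "B-PHONE"]
print(bio_to_spans(tokens, tags))  # [('PERSON', 'John Smith'), ('PHONE', '555-1234')]
```
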

Rule + ML Hybrid

Regex for high-precision entities plus ML for recall.

Implementations

API Services

AWS Comprehend PII (AWS, API)

Managed PII detection service.

Open Source

Microsoft Presidio (Apache 2.0, Open Source)

Production-grade PII detection/redaction.

spaCy PII Pipelines (MIT, Open Source)

Customizable NER + rules.

Quick Facts

Input: Text
Output: Structured Data
Implementations: 2 open source, 1 API
Patterns: 2 approaches
