Text→Structured Data

PII Detection & Anonymization

Detect and redact personally identifiable information to stay compliant.

How PII Detection Works

A technical deep-dive into Personally Identifiable Information detection. From regex patterns to LLM-powered extraction and redaction strategies.

What is Personally Identifiable Information?

PII is any data that can identify a specific individual. Unlike general entities, PII has legal implications under GDPR, CCPA, and HIPAA. Detecting and protecting PII is not optional - it is a compliance requirement.

The Problem

Imagine you are building a customer support chatbot. Every conversation contains names, emails, addresses, and sometimes credit card numbers. If this data leaks into your logs, training data, or third-party APIs, you face:

Legal Penalties

GDPR: fines of up to EUR 20 million or 4% of global annual turnover, whichever is higher

Reputation Damage

Customer trust is hard to rebuild

Identity Theft

Real harm to real people

Common PII Types

PERSON

Full names, nicknames

"John Smith"

Risk: Identity theft, social engineering

EMAIL

Email addresses

"john@example.com"

Risk: Spam, phishing, account takeover

PHONE

Phone numbers

"(555) 123-4567"

Risk: SMS phishing, bypassing phone-based identity verification

SSN

Social Security Numbers

"123-45-6789"

Risk: Complete identity theft, fraud

ADDRESS

Physical addresses

"123 Main St, NYC"

Risk: Physical security, stalking

CREDIT_CARD

Credit card numbers

"4111-1111-1111-1111"

Risk: Financial fraud

DOB

Date of birth

"01/15/1990"

Risk: Bypassing identity verification, fraud

IP_ADDRESS

IP addresses

"192.168.1.1"

Risk: Location tracking, network attacks

Detection

Finding PII in text. Returns entity spans with types and confidence scores. This is the first step - you cannot protect what you cannot find.

Input: "Call John at 555-1234"
Output: [{type: "PERSON", text: "John", start: 5, end: 9}, {type: "PHONE", text: "555-1234", start: 13, end: 21}]
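
As a minimal sketch of what a detector returns, the following uses two simplified regular expressions to produce typed spans with character offsets. The patterns and the `Span` and `detect` names are illustrative assumptions, not production-grade rules. Names such as "John" need NER and are deliberately not covered, and a real detector would also attach a confidence score to each span.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    text: str


# Simplified, illustrative patterns (assumptions, far from exhaustive).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"(?:\(\d{3}\)\s?|\d{3}[-.\s])?\d{3}[-.\s]\d{4}"),
}


def detect(text: str) -> list[Span]:
    """Return every pattern match as a typed span, ordered by position."""
    spans = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(Span(entity_type, match.start(), match.end(), match.group()))
    return sorted(spans, key=lambda s: s.start)


spans = detect("Call John at 555-1234")
for span in spans:
    print(span)
```

Returning offsets rather than just the matched strings is what lets a later redaction step edit the original text in place.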

Redaction

Removing or masking detected PII. The strategy depends on your use case: analytics might need hashing so that values can still be matched, while logs usually need the value removed and replaced with a type tag.

Input: "Call John at 555-1234"
Output: "Call [PERSON] at [PHONE]"
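
The replacement shown above can be sketched in a few lines, assuming the detector hands back `(entity_type, start, end)` tuples. The span list here is hard-coded, hypothetical detector output. Spans are applied from right to left so that earlier offsets stay valid while the string changes length.

```python
def redact(text: str, spans: list[tuple[str, int, int]]) -> str:
    """Replace each (entity_type, start, end) span with a [TYPE] tag."""
    # Work from the end of the string so earlier offsets are not shifted.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{entity_type}]" + text[end:]
    return text


text = "Call John at 555-1234"
spans = [("PERSON", 5, 9), ("PHONE", 13, 21)]  # hypothetical detector output
redacted = redact(text, spans)
print(redacted)  # Call [PERSON] at [PHONE]
```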

PII Detection in Action

See how PII detection works on a realistic support message. Each detected entity is listed with its type below the text.

Hi, my name is Sarah Johnson and I'm having issues with my order. You can reach me at sarah.johnson@gmail.com or call me at (415) 555-0123. My address is 742 Evergreen Terrace, Springfield, IL 62704. For verification, my SSN ends in 6789.

PERSON (Full names, nicknames): Sarah Johnson
EMAIL (Email addresses): sarah.johnson@gmail.com
PHONE (Phone numbers): (415) 555-0123
ADDRESS (Physical addresses): 742 Evergreen Terrace, Springfield, IL 62704
SSN_PARTIAL (Social Security Numbers): 6789

PII Entities Found: 5

Unique Types: 5

Risk Level: HIGH

Coverage: 100%

Redaction Strategies

Once you detect PII, what do you do with it? The answer depends on your downstream use case. Each strategy has tradeoffs between privacy, utility, and reversibility.

Mask

Replace with asterisks or X's

john@email.com → ****@*****.***

Pros:

+ Maintains text structure
+ Clear something was removed

Cons:

- May reveal format/length
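
A minimal sketch of masking, assuming the PII value has already been extracted. The helper names are hypothetical. The first function hides every letter and digit but keeps punctuation, which is exactly why masking reveals format and length. The second keeps the last few characters visible, a common variant for card numbers.

```python
def mask(value: str, mask_char: str = "*") -> str:
    """Hide every letter and digit, keeping punctuation such as @ and dots."""
    return "".join(mask_char if ch.isalnum() else ch for ch in value)


def mask_keep_last(value: str, keep: int = 4, mask_char: str = "*") -> str:
    """Hide all letters and digits except the last `keep` of them."""
    total = sum(ch.isalnum() for ch in value)
    out, seen = [], 0
    for ch in value:
        if ch.isalnum():
            seen += 1
            out.append(ch if seen > total - keep else mask_char)
        else:
            out.append(ch)
    return "".join(out)


print(mask("john@email.com"))                 # ****@*****.***
print(mask_keep_last("4111-1111-1111-1111"))  # ****-****-****-1111
```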

Replace

Swap with placeholder text

John Smith → [PERSON]

Pros:

+ Preserves semantic meaning
+ Good for analysis

Cons:

- Loses original context

Hash

One-way cryptographic hash

123-45-6789 → a7b9c2d1...

Pros:

+ Consistent per value
+ Allows matching

Cons:

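- Irreversible by design, but low-entropy values such as SSNs or phone numbers can be brute-forced unless the hash uses a secret key or salt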
- Loses format
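
One way to get consistent, matchable tokens without exposing values to brute-force guessing is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library. The `pseudonymize` name, the truncation length, and the placeholder key are assumptions for illustration. In practice the key would come from a secret store, and truncating the digest trades collision resistance for shorter tokens.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, length: int = 16) -> str:
    """Return a keyed, deterministic token for a PII value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Placeholder key for the example only; load a real secret from a key store.
KEY = b"replace-with-a-secret-from-your-key-store"

a = pseudonymize("123-45-6789", KEY)
b = pseudonymize("123-45-6789", KEY)
c = pseudonymize("987-65-4321", KEY)

print(a == b)  # same value, same token: counting and joins still work
print(a == c)  # different values give different tokens
```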

Encrypt

Reversible with key

(555) 123-4567 → xK9mP2...

Pros:

+ Recoverable if needed
+ Secure storage

Cons:

- Key management overhead

Synthetic

Replace with fake but realistic data

Sarah Johnson → Emily Williams

Pros:

+ Maintains data utility
+ Good for testing

Cons:

- Complex to implement
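
A rough sketch of synthetic replacement for names, assuming a small hand-written pool of fake names. The pools, the `synthetic_name` helper, and the demo secret are all hypothetical. The real name is hashed with a secret to pick a replacement, so the same person always maps to the same fake name within a dataset. A real implementation would use much larger pools or a data-generation library and would handle collisions.

```python
import hashlib

# Tiny illustrative pools; real systems need far larger ones.
FIRST_NAMES = ["Emily", "Daniel", "Priya", "Marcus", "Elena", "Tomas"]
LAST_NAMES = ["Williams", "Garcia", "Chen", "Okafor", "Novak", "Brown"]


def synthetic_name(real_name: str, secret: bytes = b"demo-secret") -> str:
    """Deterministically map a real name to a fake but realistic one."""
    normalized = real_name.strip().lower().encode("utf-8")
    digest = hashlib.sha256(secret + normalized).digest()
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


fake = synthetic_name("Sarah Johnson")
print(fake)
print(fake == synthetic_name("sarah johnson"))  # case-insensitive, so consistent
```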

Choosing a Strategy

For Logging

Use Replace with type tags. Maintains log readability while removing sensitive data.

For Analytics

Use Hash to allow entity counting and matching without exposing values.

For Testing

Use Synthetic to maintain realistic data distributions.

Detection Methods Compared

There is no single best method. Production systems typically combine multiple approaches: regex for structured formats, NER for names, and LLMs for edge cases.

Method	Type	Accuracy (approx.)	Speed (approx.)	Best For
Presidio	Hybrid	High (85-95%)	Fast (~1ms/entity)	Enterprise PII redaction
spaCy NER	ML-based	High (90%+ for names)	Medium (~10ms/doc)	Name/organization detection
Regex Patterns	Rule-based	Variable (format-dependent)	Very fast (<1ms)	Structured formats (SSN, CC, phone)
LLM-based	Deep Learning	Very high (95%+)	Slow (500-2000ms)	Complex/ambiguous cases
GLiNER	Zero-shot NER	High (85-90%)	Medium (~50ms)	Custom entity definitions

Use Presidio when:

- Production-grade PII detection needed
- You need customizable recognizers
- Compliance (GDPR, HIPAA) is required
- You need support for multiple languages

Use spaCy + Regex when:

- You already use spaCy for NLP
- Names and organizations are primary targets
- You want fine-grained control
- Custom patterns needed

Use Regex alone when:

- Only structured PII (SSN, CC, phone)
- Maximum speed is critical
- No ML infrastructure available
- Predictable, well-formatted input
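
A regex-only detector for structured formats might look like the sketch below. The patterns are simplified assumptions for US-style SSNs and 13 to 19 digit card numbers. Because digit patterns alone produce many false positives, card candidates are also checked with the Luhn checksum, which real card numbers satisfy.

```python
import re

# Simplified, illustrative patterns (assumptions, not exhaustive).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[str]:
    """Return card-like matches that also pass the Luhn check."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]


sample = "Card 4111-1111-1111-1111, order 1234-5678-9012-3456, SSN 123-45-6789"
print(find_cards(sample))      # ['4111-1111-1111-1111']
print(SSN_RE.findall(sample))  # ['123-45-6789']
```

The order number has the same shape as a card number but fails the checksum, so it is left alone.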

Use LLM when:

- Complex, ambiguous cases
- Context-aware detection needed
- You need explanations
- Low volume, high value documents

The Hybrid Approach (Recommended)

Production systems should layer methods: regex first for structured formats (SSN, CC, phone), then NER for names and addresses, and optionally LLM for edge cases or validation. This gives you speed where possible and accuracy where needed.
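
The layering described above can be sketched as follows. The regex layer runs first and its spans always win. A second layer, passed in as a callable, only contributes spans that do not overlap anything already accepted. The `toy_ner` function is a hypothetical stand-in for a real NER model such as spaCy. An optional LLM layer could be added as another callable of the same shape.

```python
import re
from typing import Callable, NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int
    source: str


# Layer 1: simplified, illustrative patterns for structured formats.
REGEX_LAYER = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def regex_layer(text: str) -> list[Span]:
    return [
        Span(entity_type, m.start(), m.end(), "regex")
        for entity_type, pattern in REGEX_LAYER.items()
        for m in pattern.finditer(text)
    ]


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def detect(text: str, ner_layer: Callable[[str], list[Span]]) -> list[Span]:
    """Run regex first, then add NER spans that do not overlap regex hits."""
    accepted = regex_layer(text)
    for span in ner_layer(text):
        if not any(overlaps(span, kept) for kept in accepted):
            accepted.append(span)
    return sorted(accepted, key=lambda s: s.start)


def toy_ner(text: str) -> list[Span]:
    # Hypothetical stand-in for a real NER model: matches one known name.
    pattern = re.compile(r"john[ .]smith", re.IGNORECASE)
    return [Span("PERSON", m.start(), m.end(), "ner") for m in pattern.finditer(text)]


text = "John Smith (john.smith@email.com), SSN 123-45-6789"
results = detect(text, toy_ner)
for span in results:
    print(span.entity_type, text[span.start:span.end], span.source)
```

The toy NER layer also flags "john.smith" inside the email address, but that span overlaps the regex EMAIL hit and is dropped, so the address is redacted as a single entity.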

Code Examples

Get started with PII detection in Python. From Microsoft Presidio to custom regex patterns.

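Presidio

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

The second command installs the spaCy English model that Presidio's default analyzer configuration loads.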

Production Ready

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = """
Customer: John Smith
Email: john.smith@email.com
SSN: 123-45-6789
Phone: (555) 123-4567
"""

# Analyze - detect PII entities
results = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "US_SSN", "PHONE_NUMBER"],
    language="en"
)

# Print detected entities
for result in results:
    print(f"{result.entity_type}: {text[result.start:result.end]} (score: {result.score:.2f})")

# Anonymize with custom operators, one per entity type
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
    # The mask operator requires masking_char along with chars_to_mask and from_end
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 10, "from_end": False}),
    "US_SSN": OperatorConfig("replace", {"new_value": "[SSN REDACTED]"}),
    "PHONE_NUMBER": OperatorConfig("hash"),
}

anonymized = anonymizer.anonymize(text=text, analyzer_results=results, operators=operators)
print(anonymized.text)