Named Entity Recognition
Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from unstructured text, making it foundational to knowledge graphs, financial compliance, and clinical NLP. CoNLL-2003 English F1 scores have been above 93% since BERT, and current leaders like UniNER and GLiNER push past 95%, but these numbers mask the real difficulty: nested entities, emerging entity types, and cross-lingual transfer where performance drops 10-20 points. The shift from sequence labeling to generative NER (framing extraction as text generation) has opened the door for LLMs to compete, though latency-sensitive production systems still rely on encoder models like DeBERTa-v3 and SpanBERT.
History
2003: CoNLL-2003 shared task establishes the standard NER benchmark; CRF-based systems achieve ~88 F1
2005: Stanford NER (Finkel et al.) becomes the go-to production system with CRF + feature engineering
2016: Lample et al. introduce the BiLSTM-CRF architecture for NER, eliminating manual feature engineering
2018: BERT fine-tuned on CoNLL-2003 pushes F1 to 92.8, establishing transformer dominance
2018: Flair embeddings (Akbik et al.) combine character-level and contextual embeddings, reaching 93.09 F1
2020: LUKE (Yamada et al.) introduces entity-aware pretraining, achieving 94.3 F1 on CoNLL-2003
2023: UniversalNER explores instruction-tuned LLMs for open-domain entity extraction
2023: GLiNER (Zaratiana et al.) enables zero-shot NER for arbitrary entity types using a bidirectional encoder with span matching
2024: GPT-4o and Claude demonstrate strong NER via structured extraction prompts, rivaling fine-tuned models on standard types
How Named Entity Recognition Works
Tokenization
Text is tokenized into subwords; entity labels use BIO/BIOES tagging to mark span boundaries
Contextual encoding
Each token gets a contextualized representation from transformer layers that capture surrounding context
Token classification
A linear layer predicts an entity tag (B-PER, I-ORG, O, etc.) for each token position
CRF decoding (optional)
A conditional random field layer enforces valid tag sequences (e.g., I-PER can't follow B-ORG)
Span aggregation
Consecutive BIO tags are merged into entity spans with their types and confidence scores
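The final two steps above can be sketched in plain Python: given per-token BIO tags, merge consecutive tags into typed spans, treating invalid transitions (e.g. a stray I- tag) with a common repair heuristic. This is an illustrative sketch, not any particular library's decoder; production pipelines also track character offsets and per-span confidence.

```python
from typing import List, Tuple

def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (entity_text, entity_type) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous entity
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open entity
        else:  # "O", or an invalid transition such as I-PER after B-ORG
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
            if tag.startswith("I-"):  # heuristic: treat a stray I- as a new entity
                current, current_type = [token], tag[2:]
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Angela", "Merkel", "visited", "Paris", "in", "May"]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE"]
print(decode_bio(tokens, tags))
# [('Angela Merkel', 'PER'), ('Paris', 'LOC'), ('May', 'DATE')]
```

A CRF decoding layer makes the "stray I-" branch unnecessary by forbidding invalid transitions at inference time; without one, a repair heuristic like the above is the usual fallback.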
Current Landscape
NER in 2025 sits at a crossroads: fine-tuned encoders remain faster and cheaper for standard entity types, but LLMs and zero-shot models like GLiNER are eating into the long tail of custom entity recognition. The CoNLL-2003 benchmark is effectively saturated above 94 F1 and no longer differentiates systems. Real progress is measured on harder benchmarks like MultiNERD, CrossNER, and domain-specific corpora (NCBI Disease, MIT Restaurant). Production systems increasingly combine fast encoder NER for common types with LLM extraction for complex or novel entity schemas.
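The hybrid pattern described above can be sketched as a simple span merge: trust the fast encoder's output for its common types, and keep LLM-extracted spans only where they do not collide with an encoder span. The extractors themselves and the precedence policy are assumptions for illustration; spans here are (start, end, label) character offsets.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]

def merge_extractions(encoder_spans: List[Span], llm_spans: List[Span]) -> List[Span]:
    """Combine spans from a fast encoder NER pass with spans from an LLM pass.

    Policy (an assumption, not a standard): encoder spans win on overlap,
    since they are cheaper to produce and calibrated on the target domain.
    """
    def overlaps(a: Span, b: Span) -> bool:
        return a[0] < b[1] and b[0] < a[1]

    merged = list(encoder_spans)
    for span in llm_spans:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)

encoder = [(0, 12, "PER"), (26, 31, "LOC")]            # common types, fast path
llm     = [(0, 12, "POLITICIAN"), (40, 52, "TREATY")]  # custom schema, slow path
print(merge_extractions(encoder, llm))
# [(0, 12, 'PER'), (26, 31, 'LOC'), (40, 52, 'TREATY')]
```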
Key Challenges
Nested entities (e.g., '[Bank of America]' is an ORG while the inner '[America]' is also a GPE) require specialized architectures like biaffine span models
Domain adaptation: medical NER (diseases, drugs, genes) and legal NER (statutes, parties, holdings) need domain-specific training data
Entity boundary detection on noisy text (tweets, OCR output, chat logs) degrades significantly vs. clean news text
Cross-lingual NER for low-resource languages has far fewer labeled corpora than English
Emerging/novel entity types not in any training set require zero-shot approaches
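The nested-entity challenge is easiest to see with character-offset spans: a single BIO tag sequence can assign only one label per token, so the two overlapping labels below cannot coexist in a flat tagging scheme. A minimal, illustrative check for proper nesting:

```python
from typing import Tuple

Span = Tuple[int, int, str]  # (start, end, label) character offsets

def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` is nested inside `outer` (and they are not identical)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner

# "Bank of America" (chars 0-15) is an ORG; the inner "America" (8-15) is a GPE.
text = "Bank of America reported earnings."
spans = [(0, 15, "ORG"), (8, 15, "GPE")]

nested = [(o, i) for o in spans for i in spans if contains(o, i)]
print(nested)
# [((0, 15, 'ORG'), (8, 15, 'GPE'))]
```

Span-based models sidestep this by scoring every candidate (start, end) pair independently, which is why nested NER architectures (biaffine scorers, GLiNER-style span matching) classify spans rather than tokens.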
Quick Recommendations
Best accuracy (English)
DeBERTa-v3-large + CRF fine-tuned on target domain
Consistently achieves 93+ F1 on CoNLL; CRF layer ensures valid sequences
Zero-shot / custom entity types
GLiNER-large-v2.1
Handles arbitrary entity types without training; 6x faster than LLM extraction
Production NER (speed)
spaCy v3 with transformer pipeline
Optimized for throughput with built-in entity linking and rule integration
Biomedical NER
PubMedBERT fine-tuned or BioBERT
Domain-specific pretraining on millions of PubMed abstracts dramatically improves medical entity recognition
Complex extraction with relations
GPT-4o with structured output
Can extract entities + relationships in a single pass with JSON schema enforcement
What's Next
The future is unified information extraction: single models that jointly perform NER, relation extraction, event extraction, and coreference resolution. GLiNER-style approaches will expand to handle nested and discontinuous entities natively. Expect on-device NER to improve rapidly as quantized encoder models under 100M params reach 90+ F1, enabling privacy-preserving extraction on mobile and edge devices.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.