Named Entity Recognition
Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from unstructured text, making it foundational to knowledge graphs, financial compliance, and clinical NLP. CoNLL-2003 English F1 scores have been above 93% since BERT, and current leaders like UniNER and GLiNER push past 95%, but these numbers mask the real difficulty: nested entities, emerging entity types, and cross-lingual transfer where performance drops 10-20 points. The shift from sequence labeling to generative NER (framing extraction as text generation) has opened the door for LLMs to compete, though latency-sensitive production systems still rely on encoder models like DeBERTa-v3 and SpanBERT.
History
2003: CoNLL-2003 shared task establishes the standard NER benchmark; CRF-based systems achieve ~88 F1
2005: Stanford NER (Finkel et al.) becomes the go-to production system with CRF + feature engineering
2016: Lample et al. introduce the BiLSTM-CRF architecture for NER, eliminating manual feature engineering
2018: BERT fine-tuned on CoNLL-2003 pushes F1 to 92.8, establishing transformer dominance
2018: Flair embeddings (Akbik et al.) combine character-level and contextual embeddings, reaching 93.09 F1
2020: LUKE (Yamada et al.) introduces entity-aware pretraining, achieving 94.3 F1 on CoNLL-2003
2023: UniversalNER explores instruction-tuned LLMs for open-domain entity extraction
2023: GLiNER (Zaratiana et al.) enables zero-shot NER for arbitrary entity types using a bidirectional encoder with span matching
2024: GPT-4o and Claude demonstrate strong NER via structured extraction prompts, rivaling fine-tuned models on standard types
How Named Entity Recognition Works
Tokenization
Text is tokenized into subwords; entity labels use BIO/BIOES tagging to mark span boundaries
Contextual encoding
Each token gets a contextualized representation from transformer layers that capture surrounding context
Token classification
A linear layer predicts an entity tag (B-PER, I-ORG, O, etc.) for each token position
CRF decoding (optional)
A conditional random field layer enforces valid tag sequences (e.g., I-PER can't follow B-ORG)
Span aggregation
Consecutive BIO tags are merged into entity spans with their types and confidence scores
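The final two steps above can be sketched in plain Python: given per-token BIO tags, merge consecutive tags into typed spans, treating invalid transitions (e.g. a stray I- tag) with a common repair heuristic. This is an illustrative sketch, not any particular library's decoder; production pipelines also track character offsets and per-span confidence.

```python
from typing import List, Tuple

def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (entity_text, entity_type) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous entity
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open entity
        else:  # "O", or an invalid transition such as I-PER after B-ORG
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
            if tag.startswith("I-"):  # heuristic: treat a stray I- as a new entity
                current, current_type = [token], tag[2:]
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Angela", "Merkel", "visited", "Paris", "in", "May"]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE"]
print(decode_bio(tokens, tags))
# [('Angela Merkel', 'PER'), ('Paris', 'LOC'), ('May', 'DATE')]
```

A CRF decoding layer makes the "stray I-" branch unnecessary by forbidding invalid transitions at inference time; without one, a repair heuristic like the above is the usual fallback.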
Current Landscape
NER in 2025 sits at a crossroads: fine-tuned encoders remain faster and cheaper for standard entity types, but LLMs and zero-shot models like GLiNER are eating into the long tail of custom entity recognition. The CoNLL-2003 benchmark is effectively saturated above 94 F1 and no longer differentiates systems. Real progress is measured on harder benchmarks like MultiNERD, CrossNER, and domain-specific corpora (NCBI Disease, MIT Restaurant). Production systems increasingly combine fast encoder NER for common types with LLM extraction for complex or novel entity schemas.
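The hybrid pattern described above can be sketched as a simple span merge: trust the fast encoder's output for its common types, and keep LLM-extracted spans only where they do not collide with an encoder span. The extractors themselves and the precedence policy are assumptions for illustration; spans here are (start, end, label) character offsets.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]

def merge_extractions(encoder_spans: List[Span], llm_spans: List[Span]) -> List[Span]:
    """Combine spans from a fast encoder NER pass with spans from an LLM pass.

    Policy (an assumption, not a standard): encoder spans win on overlap,
    since they are cheaper to produce and calibrated on the target domain.
    """
    def overlaps(a: Span, b: Span) -> bool:
        return a[0] < b[1] and b[0] < a[1]

    merged = list(encoder_spans)
    for span in llm_spans:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)

encoder = [(0, 12, "PER"), (26, 31, "LOC")]            # common types, fast path
llm     = [(0, 12, "POLITICIAN"), (40, 52, "TREATY")]  # custom schema, slow path
print(merge_extractions(encoder, llm))
# [(0, 12, 'PER'), (26, 31, 'LOC'), (40, 52, 'TREATY')]
```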
Key Challenges
Nested entities (e.g., '[Bank of America]' is an ORG while the inner '[America]' is also a GPE) require specialized architectures like biaffine span models
Domain adaptation: medical NER (diseases, drugs, genes) and legal NER (statutes, parties, holdings) need domain-specific training data
Entity boundary detection on noisy text (tweets, OCR output, chat logs) degrades significantly vs. clean news text
Cross-lingual NER for low-resource languages has far fewer labeled corpora than English
Emerging/novel entity types not in any training set require zero-shot approaches
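The nested-entity challenge is easiest to see with character-offset spans: a single BIO tag sequence can assign only one label per token, so the two overlapping labels below cannot coexist in a flat tagging scheme. A minimal, illustrative check for proper nesting:

```python
from typing import Tuple

Span = Tuple[int, int, str]  # (start, end, label) character offsets

def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` is nested inside `outer` (and they are not identical)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner

# "Bank of America" (chars 0-15) is an ORG; the inner "America" (8-15) is a GPE.
text = "Bank of America reported earnings."
spans = [(0, 15, "ORG"), (8, 15, "GPE")]

nested = [(o, i) for o in spans for i in spans if contains(o, i)]
print(nested)
# [((0, 15, 'ORG'), (8, 15, 'GPE'))]
```

Span-based models sidestep this by scoring every candidate (start, end) pair independently, which is why nested NER architectures (biaffine scorers, GLiNER-style span matching) classify spans rather than tokens.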
Quick Recommendations
Best accuracy (English)
DeBERTa-v3-large + CRF fine-tuned on target domain
Consistently achieves 93+ F1 on CoNLL; CRF layer ensures valid sequences
Zero-shot / custom entity types
GLiNER-large-v2.1
Handles arbitrary entity types without training; 6x faster than LLM extraction
Production NER (speed)
spaCy v3 with transformer pipeline
Optimized for throughput with built-in entity linking and rule integration
Biomedical NER
PubMedBERT fine-tuned or BioBERT
Domain-specific pretraining on millions of PubMed abstracts dramatically improves medical entity recognition
Complex extraction with relations
GPT-4o with structured output
Can extract entities + relationships in a single pass with JSON schema enforcement
What's Next
The future is unified information extraction: single models that jointly perform NER, relation extraction, event extraction, and coreference resolution. GLiNER-style approaches will expand to handle nested and discontinuous entities natively. Expect on-device NER to improve rapidly as quantized encoder models under 100M params reach 90+ F1, enabling privacy-preserving extraction on mobile and edge devices.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.