
Named Entity Recognition

Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from unstructured text, making it foundational to knowledge graphs, financial compliance, and clinical NLP. CoNLL-2003 English F1 scores have sat above 92 since BERT, and current leaders like UniversalNER and GLiNER push toward 94-95, but these numbers mask the real difficulty: nested entities, emerging entity types, and cross-lingual transfer, where performance drops 10-20 points. The shift from sequence labeling to generative NER (framing extraction as text generation) has opened the door for LLMs to compete, though latency-sensitive production systems still rely on encoder models like DeBERTa-v3 and SpanBERT.


NER extracts structured entities (people, organizations, locations, dates) from unstructured text. Fine-tuned transformers achieve 93+ F1 on CoNLL-2003, but real-world NER over noisy, domain-specific text (medical, legal, social media) remains significantly harder. GLiNER and LLM-based extraction are closing the gap for zero-shot entity types.

History

2003

CoNLL-2003 shared task establishes the standard NER benchmark; CRF-based systems achieve ~88 F1

2011

Stanford NER (Finkel et al.) becomes the go-to production system with CRF + feature engineering

2015

Huang et al. introduce BiLSTM-CRF for sequence tagging; Lample et al. (2016) extend it for NER with character-level features, eliminating manual feature engineering

2018

BERT fine-tuned on CoNLL-2003 pushes F1 to 92.8, establishing transformer dominance

2018

Flair embeddings (Akbik et al.) combine character-level and contextual embeddings, reaching 93.09 F1

2020

LUKE (Yamada et al.) introduces entity-aware pretraining, achieving 94.3 F1 on CoNLL-2003

2023

UniversalNER explores instruction-tuned LLMs for open-domain entity extraction

2023

GLiNER (Zaratiana et al.) enables zero-shot NER for arbitrary entity types using a bidirectional encoder with span matching

2024

GPT-4o and Claude demonstrate strong NER via structured extraction prompts, rivaling fine-tuned models on standard types

How Named Entity Recognition Works

Named Entity Recognition Pipeline
1

Tokenization

Text is tokenized into subwords; entity labels use BIO/BIOES tagging to mark span boundaries

2

Contextual encoding

Each token gets a contextualized representation from transformer layers that capture surrounding context

3

Token classification

A linear layer predicts an entity tag (B-PER, I-ORG, O, etc.) for each token position

4

CRF decoding (optional)

A conditional random field layer enforces valid tag sequences (e.g., I-PER can't follow B-ORG)

5

Span aggregation

Consecutive BIO tags are merged into entity spans with their types and confidence scores
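The span-aggregation step above can be sketched in a few lines of plain Python. This is an illustrative implementation, not any particular library's: the input format (parallel token/tag/score lists) is an assumption, and real pipelines (e.g. Hugging Face's `aggregation_strategy`) also merge subword tokens back into words first.

```python
def aggregate_bio(tokens, tags, scores):
    """Merge consecutive BIO tags into (text, type, confidence) spans."""
    spans, current = [], None
    for token, tag, score in zip(tokens, tags, scores):
        if tag.startswith("B-"):
            if current:  # a new B- tag closes any open span
                spans.append(current)
            current = {"text": [token], "type": tag[2:], "scores": [score]}
        elif tag.startswith("I-") and current and tag[2:] == current["type"]:
            current["text"].append(token)
            current["scores"].append(score)
        else:  # "O" or an inconsistent I- tag ends the current span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    # Report mean token confidence as the span confidence
    return [
        (" ".join(s["text"]), s["type"],
         round(sum(s["scores"]) / len(s["scores"]), 3))
        for s in spans
    ]

tokens = ["Angela", "Merkel", "visited", "Paris", "."]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O"]
scores = [0.99, 0.98, 0.99, 0.97, 0.99]
print(aggregate_bio(tokens, tags, scores))
# → [('Angela Merkel', 'PER', 0.985), ('Paris', 'LOC', 0.97)]
```

Note how the `else` branch silently repairs invalid sequences (an `I-` tag with no matching `B-`), which is exactly the class of error a CRF decoding layer prevents upstream.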

Current Landscape

NER in 2025 sits at a crossroads: fine-tuned encoders remain faster and cheaper for standard entity types, but LLMs and zero-shot models like GLiNER are eating into the long tail of custom entity recognition. The CoNLL-2003 benchmark is effectively saturated above 94 F1 and no longer differentiates systems; real progress is measured on harder benchmarks like MultiNERD, CrossNER, and domain-specific corpora (NCBI Disease, MIT Restaurant). Production systems increasingly combine fast encoder NER for common types with LLM extraction for complex or novel entity schemas.
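The hybrid pattern described above can be sketched as a simple router: common types go to a fast encoder model, anything outside its label set falls back to an LLM extractor. The backends here are hypothetical stubs (standing in for, say, a fine-tuned token classifier and an LLM API call), and the fixed label set is an assumption.

```python
ENCODER_TYPES = {"PER", "ORG", "LOC", "DATE"}  # assumed encoder label set

def encoder_ner(text, types):
    # Stub: a fast fine-tuned token classifier would run here.
    return [("encoder", t) for t in sorted(types)]

def llm_ner(text, types):
    # Stub: a slower LLM structured-extraction call would run here.
    return [("llm", t) for t in sorted(types)]

def extract(text, requested_types):
    """Route each requested entity type to the cheapest capable backend."""
    fast = requested_types & ENCODER_TYPES
    slow = requested_types - ENCODER_TYPES  # novel schemas → LLM fallback
    results = []
    if fast:
        results += encoder_ner(text, fast)
    if slow:
        results += llm_ner(text, slow)
    return results

print(extract("Pfizer recalled aspirin in Berlin", {"ORG", "DRUG"}))
# → [('encoder', 'ORG'), ('llm', 'DRUG')]
```

The design choice worth noting: routing happens per entity type, not per document, so a single request can hit both backends and the expensive LLM path is only paid for types the encoder was never trained on.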

Key Challenges

Nested entities (e.g., in "Bank of America" the full span is an ORG while the inner "America" is also a GPE) require specialized architectures like biaffine span classifiers

Domain adaptation: medical NER (diseases, drugs, genes) and legal NER (statutes, parties, holdings) need domain-specific training data

Entity boundary detection on noisy text (tweets, OCR output, chat logs) degrades significantly vs. clean news text

Cross-lingual NER for low-resource languages has far fewer labeled corpora than English

Emerging/novel entity types not in any training set require zero-shot approaches

Quick Recommendations

Best accuracy (English)

DeBERTa-v3-large + CRF fine-tuned on target domain

Consistently achieves 93+ F1 on CoNLL; CRF layer ensures valid sequences

Zero-shot / custom entity types

GLiNER-large-v2.1

Handles arbitrary entity types without training; 6x faster than LLM extraction

Production NER (speed)

spaCy v3 with transformer pipeline

Optimized for throughput with built-in entity linking and rule integration

Biomedical NER

PubMedBERT fine-tuned or BioBERT

Domain-specific pretraining on millions of PubMed abstracts dramatically improves medical entity recognition

Complex extraction with relations

GPT-4o with structured output

Can extract entities + relationships in a single pass with JSON schema enforcement
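For the LLM-extraction route in the table above, the core engineering is prompt construction and response validation. The sketch below mocks the model's reply with a hard-coded JSON string; a real system would send `build_prompt`'s output to an LLM API with JSON-mode or schema enforcement enabled. The schema shape and prompt wording are illustrative assumptions, not any provider's API.

```python
import json

SCHEMA = {"entities": [{"text": "str", "type": "str"}]}

def build_prompt(text, entity_types):
    """Ask the model for entities as JSON matching SCHEMA."""
    return (
        f"Extract all entities of types {sorted(entity_types)} from the text.\n"
        f"Respond only with JSON matching {json.dumps(SCHEMA)}.\n\n"
        f"Text: {text}"
    )

def parse_response(raw, allowed_types):
    """Validate the model's JSON and drop malformed or off-schema entities."""
    data = json.loads(raw)
    entities = []
    for ent in data.get("entities", []):
        if isinstance(ent.get("text"), str) and ent.get("type") in allowed_types:
            entities.append((ent["text"], ent["type"]))
    return entities

# Mocked model output standing in for a real API call:
mock = ('{"entities": [{"text": "Pfizer", "type": "ORG"},'
        ' {"text": "aspirin", "type": "DRUG"}]}')
print(parse_response(mock, {"ORG", "DRUG"}))
# → [('Pfizer', 'ORG'), ('aspirin', 'DRUG')]
```

Validating against an allow-list of types matters in practice: even with schema enforcement, models occasionally hallucinate entity types outside the requested set, and silently dropping them is usually safer than failing the whole extraction.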

What's Next

The future is unified information extraction: single models that jointly perform NER, relation extraction, event extraction, and coreference resolution. GLiNER-style approaches will expand to handle nested and discontinuous entities natively. Expect on-device NER to improve rapidly as quantized encoder models under 100M params reach 90+ F1, enabling privacy-preserving extraction on mobile and edge devices.

Benchmarks & SOTA

Related Tasks

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).

Reading Comprehension

Understanding and answering questions about passages.

Text Ranking

Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.

Something wrong or missing?

Help keep Named Entity Recognition benchmarks accurate. Report outdated results, missing benchmarks, or errors.
