Natural Language Processing

Processing and understanding text? Evaluate your models on language understanding, generation, translation, and information extraction benchmarks.

16 tasks, 27 datasets, 7436 results

NLP is no longer one leaderboard. Production choices split by output shape: embeddings for retrieval, encoders for cheap labels and entities, MT systems for translation, RAG for factual QA, and frontier LLMs only when generation or reasoning is the actual task.

Tasks & Benchmarks

State of the Field (2025)

  • Text embeddings should be selected from retrieval and reranking evidence such as MTEB, not from chat-model leaderboards
  • Classification, NER, and extraction are still often cheaper and easier to validate with fine-tuned encoders or small instruction models than with frontier chat models
  • Question answering in production is mostly a retrieval, citation, and evaluation problem; the generator is only one component of the system
  • Translation quality depends on language pair, terminology, and domain. WMT-style evidence and human review beat generic LLM rankings

Quick Recommendations

Semantic search, RAG retrieval, deduplication

Pick from MTEB retrieval and reranking evidence

Embeddings are infrastructure, not chat. Compare retrieval score, vector size, latency, license, language coverage, and serving cost before committing.
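For the retrieval-score part of that comparison, retrieval leaderboards such as BEIR commonly report nDCG@10 and MRR@10. The sketch below computes both for a single query. The function names and the linear-gain form of DCG are assumptions made for illustration, not a particular library's API.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    # Discounted cumulative gain with linear gain: gain / log2(rank + 1), ranks start at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    # relevance maps doc id -> graded relevance; missing ids count as 0.
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def mrr_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    # Reciprocal rank of the first relevant document within the top k, else 0.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if relevance.get(doc_id, 0.0) > 0:
            return 1.0 / rank
    return 0.0
```

Averaging these per-query values over your own query set gives a score you can weigh against vector size, latency, and serving cost.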

High-volume labels, moderation, routing, intent

Fine-tuned encoder or compact instruction model

For known label sets, small supervised models are easier to evaluate, cheaper to run, and less variable than a frontier chat model.

Machine translation for production

Dedicated MT system first; LLM second-pass only when needed

Route by language pair and domain. Use WMT, human adequacy/fluency review, terminology checks, and post-edit distance instead of a generic LLM pick.
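Post-edit distance can be approximated as the word-level edit distance between the raw MT output and the human post-edited version, divided by the length of the post-edited text. The sketch below is a simplified stand-in (plain Levenshtein on whitespace tokens, no block shifts), and the function names are hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    # Classic Levenshtein distance with a rolling row; works on token lists or strings.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def post_edit_rate(mt_output: str, post_edited: str) -> float:
    # Edits needed to turn the MT output into the post-edited text,
    # normalized by the number of post-edited tokens.
    mt_tokens = mt_output.split()
    pe_tokens = post_edited.split()
    if not pe_tokens:
        return 0.0 if not mt_tokens else 1.0
    return edit_distance(mt_tokens, pe_tokens) / len(pe_tokens)
```

Tracking this rate per language pair and domain shows where a dedicated MT system already needs little correction and where an LLM second pass might pay off.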

Factual question answering over private data

RAG with citations, reranking, and answerability checks

The retrieval layer, citation discipline, and refusal policy usually decide quality. Benchmark with your documents and adversarial missing-answer questions.
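One way to score answerability is to mix answerable questions with adversarial missing-answer questions and report refusals separately from answer accuracy. The sketch below assumes you have already labeled each system response; the `QAResult` fields and metric names are illustrative, not a standard schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QAResult:
    answerable: bool  # gold label: do the documents contain the answer?
    abstained: bool   # did the system refuse or say the answer was not found?
    correct: bool     # was the given answer right? ignored when abstained


def answerability_report(results: List[QAResult]) -> Dict[str, float]:
    answerable = [r for r in results if r.answerable]
    missing = [r for r in results if not r.answerable]

    def rate(hits: int, total: int) -> float:
        return hits / total if total else 0.0

    return {
        # Answerable questions that were answered and answered correctly.
        "answer_accuracy": rate(
            sum(r.correct and not r.abstained for r in answerable), len(answerable)
        ),
        # Answerable questions the system wrongly refused.
        "false_refusal_rate": rate(sum(r.abstained for r in answerable), len(answerable)),
        # Missing-answer questions the system correctly refused.
        "correct_refusal_rate": rate(sum(r.abstained for r in missing), len(missing)),
    }
```

A system that answers everything can look strong on accuracy alone; the refusal rates expose whether it invents answers when the documents contain none.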

Named entities and structured extraction

Fine-tuned token classifier, GLiNER-style model, or schema-constrained LLM

Use token classifiers for stable entity types and throughput. Use schema-constrained LLM extraction when entity definitions change often.

Summarization and rewriting

Current frontier LLM with factuality checks

Use a frontier generator when the output is prose. Still evaluate for dropped facts, invented claims, citation coverage, and controllable length.

Long-form reasoning or agentic workflows

Use the current frontier shortlist, then verify with task traces

Reasoning models change quickly. Do not hard-code an old model family; compare success rate, latency, tool-call reliability, and cost on your own workflows.


Polish LLM General

Open PL LLM Leaderboard (2025)
60296.3 (poleval2018-task3) internlm2-1_8b

Polish Cultural Competency

PLCC (2025)
100 (geography) Gemini-3.0-Pro-Preview

Polish Text Understanding

CPTU-Bench (2025)
4.702247 (tricky-questions) Qwen/Qwen3.5-35B-A3B thinking (API)

Polish Conversation Quality

Polish MT-Bench (2025)
10 (humanities) Mistral-Small-Instruct-2409

Polish Emotional Intelligence

Polish EQ-Bench (2025)
78.07 (eq-score) mistralai/Mistral-Large-Instruct-2407

Question Answering

BrowseComp (2025)
83.4 (accuracy) DeepSeek-V4-Pro Max
DROP (2019)
87.8 (f1) MiniMax-Text-01
FRAMES (2024)
HotpotQA (2018)
71.3 (f1) GPT-4o
KILT (2021)
MuSiQue (2021)
Natural Questions (2019)
39.9 (accuracy) LLaMA-65B
SQuAD v2.0 (2018)
92.2 (f1) ALBERT ensemble
SimpleQA (2024)
54 (accuracy) Gemini 2.5 Pro
TriviaQA (2017)
85 (accuracy) Llama 2 70B (5-shot)
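
Several of the question answering entries above report token-level F1 (DROP, HotpotQA, SQuAD v2.0). As a rough sketch of how that metric is usually computed, the code below follows the common SQuAD-style convention of lowercasing and stripping punctuation and articles before counting overlapping tokens. Exact normalization differs between benchmarks, so treat the details as assumptions.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    # Lowercase, drop punctuation and English articles, split on whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection gives the number of shared tokens.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

When a question has several gold answers, the usual practice is to take the maximum F1 over them.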

Feature Extraction

75.97 (mteb-score) QZhou-Embedding

Text Summarization

CNN/DailyMail (2015)
47.78 (rouge-1) BRIO

Text Ranking

BEIR (2021)
62.65 (ndcg@10) NV-Embed-v2
41.8 (mrr@10) RankLLaMA-7B

Natural Language Inference

SNLI (2015)
92.6 (accuracy) GPT-4o

Named Entity Recognition

CoNLL-2003 (2003)
93.8 (f1) GLiNER-multitask

Zero-Shot Classification

XNLI (2018)
87.4 (accuracy) GPT-4

Fill-Mask

GLUE (2018)
91.37 (avg-score) DeBERTa-v3-large

Semantic Textual Similarity

88.4 (spearman) GTE-Qwen2-7B-instruct

Table Question Answering

SQA (2017)
75.3 (accuracy) GPT-4

Reading Comprehension

RACE (2017)
89.4 (accuracy) ALBERT ensemble

Honest Takes

Benchmark scores are mostly contamination

Models perform substantially better on problems released before their training cutoff than after. GLUE and SuperGLUE are saturated. Focus on domain-specific evaluation on your actual production data, not generic leaderboard rankings.
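
A simple way to look for this effect on your own evaluation set is to split items by release date relative to the model's training cutoff and compare accuracy on each side. The sketch below assumes each result is a `(release_date, is_correct)` pair; the function name and return keys are hypothetical.

```python
from datetime import date
from typing import Dict, Iterable, Tuple


def accuracy_split_by_cutoff(
    results: Iterable[Tuple[date, bool]], cutoff: date
) -> Dict[str, float]:
    items = list(results)
    # Items released on or before the cutoff could have appeared in training data.
    before = [ok for released, ok in items if released <= cutoff]
    after = [ok for released, ok in items if released > cutoff]

    def accuracy(flags) -> float:
        return sum(flags) / len(flags) if flags else 0.0

    acc_before, acc_after = accuracy(before), accuracy(after)
    return {"before": acc_before, "after": acc_after, "gap": acc_before - acc_after}
```

A large positive gap is a hint of contamination rather than proof, since newer problems may also simply be harder.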

Longer context doesn't mean better reasoning

Models now support 2M-token context windows, but reasoning can degrade even when retrieval succeeds. Do not assume more context improves performance; test carefully, because adding documents can hurt accuracy.

RAG beats fine-tuning for most use cases

Unless you have domain-specific style requirements or truly unique knowledge patterns, RAG provides better ROI. Knowledge can be updated without retraining, costs are lower, and retrieval quality matters more than model size.

Edge deployment is viable now

A 90% size reduction through quantization, while maintaining 95%+ accuracy, enables capable models on mobile devices. Edge deployment grew 340% in 2025; for many applications, privacy and latency benefits outweigh cloud convenience.
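
As a back-of-the-envelope check on size-reduction figures, weight storage scales with bits per weight. The sketch below ignores quantization overhead such as per-group scales, as well as activations and KV cache. The parameter count and bit widths in the usage note are assumptions chosen for illustration.

```python
def weight_size_gb(n_params: int, bits_per_weight: float) -> float:
    # Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes).
    return n_params * bits_per_weight / 8 / 1e9


def size_reduction(bits_before: float, bits_after: float) -> float:
    # Fraction of weight storage saved when moving between bit widths.
    return 1 - bits_after / bits_before
```

Under these assumptions, a hypothetical 7B-parameter model takes about 28 GB at 32 bits per weight and 3.5 GB at 4 bits, an 87.5% reduction. Going from 16-bit to 4-bit weights saves 75%, so a 90% figure implies a 32-bit baseline and roughly 3 bits per weight.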

Specialized models still win on translation

For production translation at scale, domain-specific MT systems and WMT-style evaluations are still the right first stop. Use frontier LLMs for context repair, terminology checks, or low-volume editorial workflows, not as the default answer.
