
Semantic Textual Similarity

Semantic similarity measures how close two pieces of text are in meaning, the foundation of duplicate detection, paraphrase mining, and retrieval. STS Benchmark Spearman scores climbed from around 70 with averaged GloVe vectors to 86+ with Sentence-BERT, and now reach the high 80s with models like GTE-Qwen2 and E5-Mistral that leverage billion-parameter backbones. The real shift was from symmetric similarity (are these two sentences paraphrases?) to asymmetric retrieval (does this passage answer this query?), driven by the RAG revolution, which made embedding quality a production-critical metric. Cross-lingual semantic similarity remains a hard frontier: models trained primarily on English still lose 5-10 points when comparing sentences across language families, despite multilingual pretraining.
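In practice, symmetric similarity is computed as the cosine of the angle between two sentence embeddings. A minimal sketch, using hand-written toy vectors as stand-ins for real encoder outputs (models like all-MiniLM-L6-v2 emit 384-dimensional embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; real models emit hundreds to
# thousands of dimensions, but the computation is identical.
paraphrase_a = [0.7, 0.1, 0.6, 0.2]
paraphrase_b = [0.6, 0.2, 0.7, 0.1]
unrelated    = [0.1, 0.9, 0.0, 0.4]

print(cosine_similarity(paraphrase_a, paraphrase_b))  # near 1: similar
print(cosine_similarity(paraphrase_a, unrelated))     # much lower
```

With a real model the only change is where the vectors come from; paraphrase pairs should score near 1, unrelated pairs far lower.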

Datasets: 1 · Results: 3 · Canonical metric: Spearman correlation

Canonical benchmark:

STS Benchmark

Semantic textual similarity with human-annotated sentence pairs

Primary metric: Spearman correlation
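Spearman correlation scores a model by how well the ranking of its predicted similarities matches the ranking of human-annotated gold scores; it is Pearson correlation computed over ranks. A minimal sketch, assuming no tied values for simplicity (production code should use scipy.stats.spearmanr, which handles ties):

```python
def ranks(values: list[float]) -> list[int]:
    # 1-based rank of each value; assumes no ties for simplicity.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean = (n + 1) / 2  # mean rank of 1..n
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var  # rank variances are equal when there are no ties

# Hypothetical model similarity scores vs. human gold ratings (0-5 scale).
model_scores = [0.92, 0.31, 0.77, 0.15, 0.55]
gold_ratings = [4.8, 1.5, 4.1, 0.3, 2.9]
print(spearman(model_scores, gold_ratings))  # 1.0: identical ranking
```

Because only the ranking matters, the metric is insensitive to the absolute scale of a model's similarity scores, which is why it is preferred over Pearson for comparing embedding models.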

Top 10

Leading models on STS Benchmark.

Rank  Model                   Spearman  Year  Source
1     GTE-Qwen2-7B-instruct   88.4      2024  paper
2     E5-Mistral-7B-instruct  84.7      2024  paper
3     all-MiniLM-L6-v2        82.8      2022  paper

All datasets

1 dataset tracked for this task.


Run Inference

Looking to run a model? Hugging Face hosts inference for the sentence-similarity task type.
