Semantic Textual Similarity
Semantic similarity measures how close two pieces of text are in meaning, the foundation of duplicate detection, paraphrase mining, and retrieval. STS Benchmark Spearman scores (scaled to 100) climbed from around 70 with GloVe averages to 86+ with Sentence-BERT, and instruction-tuned models with billion-parameter backbones such as GTE-Qwen2 and E5-Mistral push higher still. The bigger shift was from symmetric similarity (are these two sentences paraphrases?) to asymmetric retrieval (does this passage answer this query?), driven by the RAG boom that made embedding quality a production-critical metric. Cross-lingual semantic similarity remains a hard frontier: models trained primarily on English still lose 5-10 points when comparing sentences across language families, despite multilingual pretraining.
History
2012: SemEval STS shared task launches, establishing standard evaluation for semantic textual similarity
2015: Word Mover's Distance over pretrained word embeddings (word2vec, GloVe) provides the first competitive neural approach to sentence similarity
2017: InferSent (Conneau et al.) trains sentence embeddings on NLI data, among the first neural models to beat strong baselines on STS
2019: Sentence-BERT (Reimers & Gurevych) achieves SOTA STS with siamese BERT fine-tuning, cutting the search for the most similar pair among 10K sentences from ~65 hours with a BERT cross-encoder to about 5 seconds
2021: SimCSE (Gao et al.) shows that contrastive learning with dropout noise alone produces strong sentence embeddings
2022: E5 and INSTRUCTOR embeddings outperform SBERT on STS with task-specific instructions
2022: MTEB benchmark unifies STS evaluation with retrieval, classification, and clustering into one leaderboard
2024: NV-Embed-v2 and GTE-Qwen2 achieve 0.87+ Spearman correlation on STS-B, approaching human inter-annotator agreement
How Semantic Textual Similarity Works
Text encoding
Each text is independently encoded into a dense vector using a sentence transformer (mean pooling over token embeddings)
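Mean pooling can be sketched in a few lines. This is an illustrative stand-in: a real sentence transformer pools contextual token embeddings produced by the model, and the vectors and mask below are made-up toy values.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    return [s / count for s in summed]

# Three token vectors; the last position is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(mean_pool(tokens, [1, 1, 0]))  # → [2.0, 3.0]
```

Masking out padding matters: averaging over pad tokens would drag every sentence vector toward the same point and blur similarity scores.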
Similarity computation
Cosine similarity between the two vectors yields a score in [-1, 1]; scores above roughly 0.8 usually indicate paraphrases, though the exact threshold is model-dependent
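The scoring step is just vector math. A minimal stdlib-only sketch, with made-up embedding values standing in for real model output:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [0.1, 0.9, 0.4]   # embedding of sentence A (toy values)
b = [0.12, 0.85, 0.45]  # embedding of a close paraphrase (toy values)
print(cosine(a, b) > 0.8)  # → True
```

Many embedding models ship unit-normalized vectors, in which case cosine similarity reduces to a plain dot product, which is what vector databases exploit for fast search.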
Cross-encoder reranking (optional)
For higher accuracy, a cross-encoder jointly processes both texts and directly outputs a similarity score, but at O(n²) cost
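The cost gap is easy to make concrete by counting model forward passes, the expensive operation in both designs. The functions below are illustrative accounting, not real model code:

```python
def bi_encoder_passes(n):
    """One encode per text; all pairwise comparisons are cheap vector math."""
    return n

def cross_encoder_passes(n):
    """One joint forward pass per unordered pair: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (10, 1_000, 10_000):
    print(n, bi_encoder_passes(n), cross_encoder_passes(n))
```

At 10K texts the bi-encoder needs 10K forward passes while the cross-encoder needs nearly 50 million, which is why cross-encoders are reserved for rescoring a small shortlist.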
Calibration
Raw similarity scores are often linearly mapped to application-specific thresholds for matching, deduplication, or clustering
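A linear calibration can be sketched as below; the cut-off points `lo`, `hi`, and the duplicate threshold are hypothetical values that would normally be tuned on labeled pairs for the target application.

```python
def calibrate(score, lo=0.3, hi=0.9):
    """Linearly map a raw cosine score in [lo, hi] onto [0, 1], clamped."""
    return min(1.0, max(0.0, (score - lo) / (hi - lo)))

def is_duplicate(score, threshold=0.75):
    """Decision rule on the calibrated score (threshold is application-specific)."""
    return calibrate(score) >= threshold

print(calibrate(0.6))   # mid-range raw score maps near 0.5
print(is_duplicate(0.88))
```

Calibration matters because different embedding models occupy different score ranges: one model's 0.7 can mean "near-duplicate" while another's means "vaguely related".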
Current Landscape
Semantic similarity in 2025 is effectively solved for English sentence-level comparison — top models achieve Spearman correlations near human inter-annotator agreement. The practical focus has shifted to efficiency (how fast can you compare millions of pairs), cross-lingual alignment, and domain specialization. The bi-encoder vs. cross-encoder tradeoff remains fundamental: bi-encoders scale linearly but miss nuance, cross-encoders capture subtle meaning but don't scale. Production systems use both: bi-encoder for retrieval, cross-encoder for reranking.
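The retrieve-then-rerank pattern can be sketched end to end. Everything here is a toy stand-in: hand-written 2-d "embeddings" replace a bi-encoder, and a token-overlap function replaces the cross-encoder, but the two-stage control flow matches what production systems do.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, doc_vecs, k=2):
    """Stage 1: bi-encoder shortlist via dot-product ranking (returns indices)."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True)
    return order[:k]

def rerank(query, docs, candidates, cross_score):
    """Stage 2: rescore only the shortlist with the (expensive) joint scorer."""
    return max(candidates, key=lambda i: cross_score(query, docs[i]))

docs = ["cat on mat", "dog in park", "feline on rug"]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]  # toy embeddings
query_vec = [1.0, 0.2]

# Stand-in cross-encoder: token overlap (a real one is a transformer).
cross_score = lambda q, d: len(set(q.split()) & set(d.split()))

shortlist = retrieve(query_vec, doc_vecs, k=2)
best = rerank("cat sat on the mat", docs, shortlist, cross_score)
print(docs[best])  # → cat on mat
```

The design point is that the cross-encoder only ever sees `k` candidates, so its quadratic cost is paid over a constant-size shortlist rather than the whole corpus.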
Key Challenges
Negation sensitivity: 'The cat sat on the mat' and 'The cat didn't sit on the mat' have high vector similarity despite opposite meanings
Granularity: sentence-level models struggle with paragraph and document-level similarity comparison
Asymmetric relevance: query-document similarity differs from document-document similarity, requiring different model training
Domain-specific similarity (legal, medical) needs fine-tuning on domain pairs to capture specialized equivalences
Cross-lingual similarity still lags monolingual by 5-10 points on STS benchmarks
Quick Recommendations
Best STS accuracy
Cross-encoder: ms-marco-MiniLM-L-12-v2 (reranking)
Cross-encoders achieve the highest accuracy but scale quadratically — use for reranking top candidates
Fast bi-encoder similarity
all-MiniLM-L6-v2 or E5-large-v2
Encode once, compare via dot product; ideal for large-scale similarity search
Multilingual similarity
multilingual-e5-large or paraphrase-multilingual-MiniLM-L12-v2
Cross-lingual similarity for 100+ languages with single model
Highest quality bi-encoder
NV-Embed-v2 or GTE-Qwen2-7B-instruct
Top MTEB STS scores; worth the extra compute for quality-critical applications
What's Next
The future involves late-interaction models (ColBERT) that offer a middle ground between bi-encoder speed and cross-encoder accuracy, learned multi-vector representations that capture multiple aspects of meaning, and tighter integration with search and recommendation systems. Expect semantic similarity to become invisible infrastructure — embedded in every search box and recommendation engine rather than a standalone task.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.