Semantic Textual Similarity
Semantic similarity measures how close two pieces of text are in meaning, the foundation of duplicate detection, paraphrase mining, and retrieval. STS Benchmark Spearman scores (scaled to 100) climbed from around 70 with GloVe averages to 86+ with Sentence-BERT, and instruction-tuned models with billion-parameter backbones such as GTE-Qwen2 and E5-Mistral push higher still. The bigger shift was from symmetric similarity (are these two sentences paraphrases?) to asymmetric retrieval (does this passage answer this query?), driven by the RAG boom that made embedding quality a production-critical metric. Cross-lingual semantic similarity remains a hard frontier: models trained primarily on English still lose 5-10 points when comparing sentences across language families, despite multilingual pretraining.
History
2012: SemEval STS shared task launches, establishing standard evaluation for semantic textual similarity
2015: Word Mover's Distance over pretrained word embeddings (word2vec, GloVe) provides the first competitive neural approach to sentence similarity
2017: InferSent (Conneau et al.) trains sentence embeddings on NLI data, among the first neural models to beat strong baselines on STS
2019: Sentence-BERT (Reimers & Gurevych) achieves SOTA STS with siamese BERT fine-tuning, cutting the search for the most similar pair among 10K sentences from ~65 hours with a BERT cross-encoder to about 5 seconds
2021: SimCSE (Gao et al.) shows that contrastive learning with dropout noise alone produces strong sentence embeddings
2022: E5 and INSTRUCTOR embeddings outperform SBERT on STS with task-specific instructions
2022: MTEB benchmark unifies STS evaluation with retrieval, classification, and clustering into one leaderboard
2024: NV-Embed-v2 and GTE-Qwen2 achieve 0.87+ Spearman correlation on STS-B, approaching human inter-annotator agreement
How Semantic Textual Similarity Works
Text encoding
Each text is independently encoded into a dense vector using a sentence transformer (mean pooling over token embeddings)
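Mean pooling can be sketched in a few lines. This is an illustrative stand-in: a real sentence transformer pools contextual token embeddings produced by the model, and the vectors and mask below are made-up toy values.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    return [s / count for s in summed]

# Three token vectors; the last position is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(mean_pool(tokens, [1, 1, 0]))  # → [2.0, 3.0]
```

Masking out padding matters: averaging over pad tokens would drag every sentence vector toward the same point and blur similarity scores.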
Similarity computation
Cosine similarity between the two vectors yields a score in [-1, 1]; scores above roughly 0.8 usually indicate paraphrases, though the exact threshold is model-dependent
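The scoring step is just vector math. A minimal stdlib-only sketch, with made-up embedding values standing in for real model output:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [0.1, 0.9, 0.4]   # embedding of sentence A (toy values)
b = [0.12, 0.85, 0.45]  # embedding of a close paraphrase (toy values)
print(cosine(a, b) > 0.8)  # → True
```

Many embedding models ship unit-normalized vectors, in which case cosine similarity reduces to a plain dot product, which is what vector databases exploit for fast search.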
Cross-encoder reranking (optional)
For higher accuracy, a cross-encoder jointly processes both texts and directly outputs a similarity score, but at O(n²) cost
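The cost gap is easy to make concrete by counting model forward passes, the expensive operation in both designs. The functions below are illustrative accounting, not real model code:

```python
def bi_encoder_passes(n):
    """One encode per text; all pairwise comparisons are cheap vector math."""
    return n

def cross_encoder_passes(n):
    """One joint forward pass per unordered pair: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (10, 1_000, 10_000):
    print(n, bi_encoder_passes(n), cross_encoder_passes(n))
```

At 10K texts the bi-encoder needs 10K forward passes while the cross-encoder needs nearly 50 million, which is why cross-encoders are reserved for rescoring a small shortlist.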
Calibration
Raw similarity scores are often linearly mapped to application-specific thresholds for matching, deduplication, or clustering
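A linear calibration can be sketched as below; the cut-off points `lo`, `hi`, and the duplicate threshold are hypothetical values that would normally be tuned on labeled pairs for the target application.

```python
def calibrate(score, lo=0.3, hi=0.9):
    """Linearly map a raw cosine score in [lo, hi] onto [0, 1], clamped."""
    return min(1.0, max(0.0, (score - lo) / (hi - lo)))

def is_duplicate(score, threshold=0.75):
    """Decision rule on the calibrated score (threshold is application-specific)."""
    return calibrate(score) >= threshold

print(calibrate(0.6))   # mid-range raw score maps near 0.5
print(is_duplicate(0.88))
```

Calibration matters because different embedding models occupy different score ranges: one model's 0.7 can mean "near-duplicate" while another's means "vaguely related".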
Current Landscape
Semantic similarity in 2025 is effectively solved for English sentence-level comparison — top models achieve Spearman correlations near human inter-annotator agreement. The practical focus has shifted to efficiency (how fast can you compare millions of pairs), cross-lingual alignment, and domain specialization. The bi-encoder vs. cross-encoder tradeoff remains fundamental: bi-encoders scale linearly but miss nuance, cross-encoders capture subtle meaning but don't scale. Production systems use both: bi-encoder for retrieval, cross-encoder for reranking.
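The retrieve-then-rerank pattern can be sketched end to end. Everything here is a toy stand-in: hand-written 2-d "embeddings" replace a bi-encoder, and a token-overlap function replaces the cross-encoder, but the two-stage control flow matches what production systems do.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, doc_vecs, k=2):
    """Stage 1: bi-encoder shortlist via dot-product ranking (returns indices)."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True)
    return order[:k]

def rerank(query, docs, candidates, cross_score):
    """Stage 2: rescore only the shortlist with the (expensive) joint scorer."""
    return max(candidates, key=lambda i: cross_score(query, docs[i]))

docs = ["cat on mat", "dog in park", "feline on rug"]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]  # toy embeddings
query_vec = [1.0, 0.2]

# Stand-in cross-encoder: token overlap (a real one is a transformer).
cross_score = lambda q, d: len(set(q.split()) & set(d.split()))

shortlist = retrieve(query_vec, doc_vecs, k=2)
best = rerank("cat sat on the mat", docs, shortlist, cross_score)
print(docs[best])  # → cat on mat
```

The design point is that the cross-encoder only ever sees `k` candidates, so its quadratic cost is paid over a constant-size shortlist rather than the whole corpus.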
Key Challenges
Negation sensitivity: 'The cat sat on the mat' and 'The cat didn't sit on the mat' have high vector similarity despite opposite meanings
Granularity: sentence-level models struggle with paragraph and document-level similarity comparison
Asymmetric relevance: query-document similarity differs from document-document similarity, requiring different model training
Domain-specific similarity (legal, medical) needs fine-tuning on domain pairs to capture specialized equivalences
Cross-lingual similarity still lags monolingual by 5-10 points on STS benchmarks
Quick Recommendations
Best STS accuracy
Cross-encoder: ms-marco-MiniLM-L-12-v2 (reranking)
Cross-encoders achieve the highest accuracy but scale quadratically — use for reranking top candidates
Fast bi-encoder similarity
all-MiniLM-L6-v2 or E5-large-v2
Encode once, compare via dot product; ideal for large-scale similarity search
Multilingual similarity
multilingual-e5-large or paraphrase-multilingual-MiniLM-L12-v2
Cross-lingual similarity for 100+ languages with single model
Highest quality bi-encoder
NV-Embed-v2 or GTE-Qwen2-7B-instruct
Top MTEB STS scores; worth the extra compute for quality-critical applications
What's Next
The future involves late-interaction models (ColBERT) that offer a middle ground between bi-encoder speed and cross-encoder accuracy, learned multi-vector representations that capture multiple aspects of meaning, and tighter integration with search and recommendation systems. Expect semantic similarity to become invisible infrastructure — embedded in every search box and recommendation engine rather than a standalone task.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.