{"service":"CodeSOTA · /api/sota","version":"0.1","description":"Programmatic lookup for the current state-of-the-art per task. CodeSOTA is the registry, not a router — this endpoint returns the dated, sourced pick. Inference happens at your own provider.","endpoints":{"index":"https://www.codesota.com/api/sota","pick":"https://www.codesota.com/api/sota/{task_id}?tier=sota"},"supported_tiers":["sota"],"tasks":[{"id":"polish-llm-general","alias":null,"name":"Polish LLM General","description":"General-purpose evaluation of language models on Polish language tasks: sentiment, reading comprehension, question answering, cyberbullying detection, and emotional intelligence.","result_count":5100,"url":"https://www.codesota.com/api/sota/polish-llm-general"},{"id":"polish-cultural-competency","alias":null,"name":"Polish Cultural Competency","description":"Evaluating language models on Polish linguistic and cultural knowledge across art & entertainment, culture & tradition, geography, grammar, history, and vocabulary.","result_count":1155,"url":"https://www.codesota.com/api/sota/polish-cultural-competency"},{"id":"document-ocr","alias":"ocr","name":"Optical Character Recognition","description":"Extracting text from document images","result_count":831,"url":"https://www.codesota.com/api/sota/ocr"},{"id":"scene-text-detection","alias":null,"name":"Scene Text Detection","description":"Detecting text regions in natural scene images","result_count":581,"url":"https://www.codesota.com/api/sota/scene-text-detection"},{"id":"speech-recognition","alias":"asr","name":"Speech Recognition","description":"Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and became the de facto open-source standard overnight. 
Whisper large-v3 hits under 5% word error rate on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.","result_count":526,"url":"https://www.codesota.com/api/sota/asr"},{"id":"polish-text-understanding","alias":null,"name":"Polish Text Understanding","description":"Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky questions, and hallucination resistance.","result_count":465,"url":"https://www.codesota.com/api/sota/polish-text-understanding"},{"id":"polish-conversation-quality","alias":null,"name":"Polish Conversation Quality","description":"Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.","result_count":450,"url":"https://www.codesota.com/api/sota/polish-conversation-quality"},{"id":"code-generation","alias":"code","name":"Code Generation","description":"Generating code from natural language descriptions (HumanEval, MBPP).","result_count":270,"url":"https://www.codesota.com/api/sota/code"},{"id":"multi-step-reasoning","alias":null,"name":"Multi-step Reasoning","description":"Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. 
Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the widest failure mode of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but error accumulation across steps means that 95% per-step accuracy yields only 60% accuracy over 10 steps — a fundamental scaling challenge.","result_count":161,"url":"https://www.codesota.com/api/sota/multi-step-reasoning"},{"id":"document-parsing","alias":null,"name":"Document Parsing","description":"Parsing document structure and content","result_count":149,"url":"https://www.codesota.com/api/sota/document-parsing"},{"id":"visual-question-answering","alias":"vqa","name":"Visual Question Answering","description":"Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural language question, produce the correct answer. VQAv2 (2017) defined the field, but modern benchmarks like GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely \"solved\" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. 
VQA's legacy is establishing that vision-language models need more than pattern matching — they need genuine visual understanding.","result_count":147,"url":"https://www.codesota.com/api/sota/vqa"},{"id":"document-layout-analysis","alias":null,"name":"Document Layout Analysis","description":"Analyzing the layout structure of documents","result_count":133,"url":"https://www.codesota.com/api/sota/document-layout-analysis"},{"id":"scene-text-recognition","alias":null,"name":"Scene Text Recognition","description":"Recognizing text in natural scene images","result_count":127,"url":"https://www.codesota.com/api/sota/scene-text-recognition"},{"id":"mathematical-reasoning","alias":null,"name":"Mathematical Reasoning","description":"Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.","result_count":127,"url":"https://www.codesota.com/api/sota/mathematical-reasoning"},{"id":"commonsense-reasoning","alias":null,"name":"Commonsense Reasoning","description":"Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social world works — is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated early benchmarks (HellaSwag went from 95% to near-ceiling by 2023), forcing a shift to harder tests like ARC-Challenge and Winoground. 
The uncomfortable insight is that scale alone buys enormous commonsense performance, but adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.","result_count":109,"url":"https://www.codesota.com/api/sota/commonsense-reasoning"},{"id":"object-detection","alias":null,"name":"Object Detection","description":"Object Detection is a computer vision task that involves identifying and localizing objects within an image. The goal is to detect instances or objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Object detection models typically output a set of bounding boxes with corresponding predicted class names.","result_count":104,"url":"https://www.codesota.com/api/sota/object-detection"},{"id":"polish-emotional-intelligence","alias":null,"name":"Polish Emotional Intelligence","description":"Evaluating language models on emotional intelligence in Polish: understanding emotional states, predicting emotional responses, and nuanced sentiment analysis.","result_count":101,"url":"https://www.codesota.com/api/sota/polish-emotional-intelligence"},{"id":"image-classification","alias":null,"name":"Image Classification","description":"Image Classification is a fundamental task in computer vision that aims to assign a label or class to an entire image. The goal is to train a model that can recognize and categorize images into predefined classes.","result_count":87,"url":"https://www.codesota.com/api/sota/image-classification"},{"id":"swe-bench","alias":null,"name":"SWE-bench","description":"SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. 
What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.","result_count":81,"url":"https://www.codesota.com/api/sota/swe-bench"},{"id":"time-series-forecasting","alias":null,"name":"Time-series forecasting","description":"Time series forecasting uses historical, time-stamped data to create models that predict future events by identifying patterns in the data. This method analyzes trends, seasonality, and other fluctuations over time to anticipate outcomes, improve decision-making, and reduce risks in fields like business, finance, weather prediction, and resource allocation.","result_count":75,"url":"https://www.codesota.com/api/sota/time-series-forecasting"},{"id":"table-recognition","alias":null,"name":"Table Recognition","description":"Detecting and parsing tables in documents","result_count":71,"url":"https://www.codesota.com/api/sota/table-recognition"},{"id":"ocr-capabilities","alias":null,"name":"General OCR Capabilities","description":"Comprehensive benchmarks covering multiple aspects of OCR performance.","result_count":70,"url":"https://www.codesota.com/api/sota/ocr-capabilities"},{"id":"question-answering","alias":null,"name":"Question Answering","description":"Question answering now spans extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning, factuality, long-context QA, and web-browsing agents. 
SQuAD is historical; current QA evaluation needs Natural Questions, TriviaQA, HotpotQA, MuSiQue, DROP, KILT, SimpleQA, FRAMES, and BrowseComp.","result_count":67,"url":"https://www.codesota.com/api/sota/question-answering"},{"id":"document-classification","alias":null,"name":"Document Image Classification","description":"Classifying documents by type or category","result_count":63,"url":"https://www.codesota.com/api/sota/document-classification"},{"id":"image-text-to-text","alias":null,"name":"Image-Text-to-Text","description":"Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF produces models that refuse to hallucinate about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.","result_count":57,"url":"https://www.codesota.com/api/sota/image-text-to-text"},{"id":"disease-classification","alias":null,"name":"Disease Classification","description":"Diagnosing diseases from medical images or data.","result_count":57,"url":"https://www.codesota.com/api/sota/disease-classification"},{"id":"agents","alias":null,"name":"Task agents","description":"AI agents are autonomous software systems that use artificial intelligence to achieve goals and complete tasks on behalf of users, acting independently to perceive their environment, make decisions, and take actions without constant human intervention. 
They use advanced capabilities like reasoning, memory, planning, and learning, often leveraging large language models (LLMs) and other AI tools to interpret information and perform complex workflows across various industries.","result_count":45,"url":"https://www.codesota.com/api/sota/agents"},{"id":"feature-extraction","alias":null,"name":"Feature Extraction","description":"Feature extraction — generating dense vector embeddings from text — is the unsung infrastructure layer powering semantic search, RAG pipelines, clustering, and recommendation systems. Sentence-BERT (2019) made it practical, but the field exploded in 2023-2024 with instruction-tuned embedding models like E5-Mistral, GTE-Qwen2, and Nomic Embed that turned decoder-only LLMs into embedding engines, pushing MTEB scores past 70 average across 50+ tasks. The key insight was that pre-training scale transfers to embedding quality — a 7B parameter embedding model crushes a 110M one on zero-shot retrieval. Matryoshka representation learning (Kusupati et al., 2022) added the ability to truncate embeddings to any dimension without retraining, making deployment flexible across latency and storage budgets.","result_count":44,"url":"https://www.codesota.com/api/sota/feature-extraction"},{"id":"video-understanding","alias":null,"name":"Video Understanding","description":"Video understanding asks models to reason over temporal sequences — answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches like VideoBERT and TimeSformer processed short clips, but Gemini 1.5 Pro's million-token context (2024) enabled reasoning over hour-long videos natively, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models can describe individual frames well but struggle to track causal chains, count repetitions, or understand temporal ordering across long sequences. 
Video-MME and EgoSchema are pushing evaluation beyond simple recognition toward genuine temporal understanding.","result_count":44,"url":"https://www.codesota.com/api/sota/video-understanding"},{"id":"react-native-code-generation","alias":null,"name":"React Native Code Generation","description":"Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation, navigation, state management, lists, and platform APIs using real-world libraries (Reanimated, React Navigation, Zustand, FlashList).","result_count":40,"url":"https://www.codesota.com/api/sota/react-native-code-generation"},{"id":"handwriting-recognition","alias":null,"name":"Handwriting Recognition","description":"Recognizing handwritten text","result_count":40,"url":"https://www.codesota.com/api/sota/handwriting-recognition"},{"id":"web-agents","alias":null,"name":"Web & Desktop Agents","description":"Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions (\"book a flight under $300\") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.","result_count":39,"url":"https://www.codesota.com/api/sota/web-agents"},{"id":"document-understanding","alias":null,"name":"Document Understanding","description":"Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables — where layout and typography carry as much meaning as the text itself. 
LayoutLMv3 (2022) and Donut pioneered layout-aware pretraining, but the game changed when GPT-4V and Claude 3 demonstrated that general-purpose multimodal LLMs could match or exceed specialist models on DocVQA and InfographicsVQA without fine-tuning. The persistent challenges are multi-page reasoning, handling handwritten text mixed with print, and accurately extracting structured data from complex table layouts. This task sits at the intersection of OCR, layout analysis, and language understanding, making it one of the highest-value enterprise AI applications.","result_count":28,"url":"https://www.codesota.com/api/sota/document-understanding"},{"id":"anomaly-detection","alias":null,"name":"Anomaly Detection","description":"Detecting defects and anomalies in manufacturing (MVTec AD, VisA).","result_count":27,"url":"https://www.codesota.com/api/sota/anomaly-detection"},{"id":"medical-image-segmentation","alias":null,"name":"Medical Image Segmentation","description":"Segmenting organs and abnormalities in medical images.","result_count":26,"url":"https://www.codesota.com/api/sota/medical-image-segmentation"},{"id":"semantic-segmentation","alias":null,"name":"Semantic Segmentation","description":"Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins autonomous driving, medical imaging, and satellite analysis. FCN (2015) showed you could repurpose classifiers for pixel labeling, DeepLab introduced atrous convolutions and CRFs, and SegFormer (2021) proved transformers dominate here too. State-of-the-art on Cityscapes exceeds 85 mIoU, but ADE20K with its 150 classes remains brutally challenging. 
The frontier has moved toward universal segmentation models like Mask2Former that handle semantic, instance, and panoptic segmentation in a single architecture.","result_count":24,"url":"https://www.codesota.com/api/sota/semantic-segmentation"},{"id":"autonomous-coding","alias":null,"name":"Autonomous Coding","description":"Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.","result_count":23,"url":"https://www.codesota.com/api/sota/autonomous-coding"},{"id":"tool-use","alias":null,"name":"Tool Use","description":"Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like retail and airline customer service.","result_count":19,"url":"https://www.codesota.com/api/sota/tool-use"},{"id":"text-summarization","alias":null,"name":"Text Summarization","description":"Text summarization compresses documents while preserving key information — a task that became dramatically more capable with LLMs but also harder to evaluate. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude produce summaries that human evaluators often prefer over reference summaries, breaking ROUGE as a meaningful metric. CNN/DailyMail and XSum remain standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls) where 100K+ token context windows are finally making single-pass summarization feasible. The core unsolved problem is faithfulness — even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the critical metric that separates production-ready from demo-ready.","result_count":16,"url":"https://www.codesota.com/api/sota/text-summarization"},{"id":"language-modeling","alias":null,"name":"Language Modeling","description":"Language Modeling is the task of predicting the next word or character in a sequence given the previous context. 
Language models learn the probability distribution of word sequences and are foundational for many NLP applications including text generation, machine translation, and speech recognition.","result_count":14,"url":"https://www.codesota.com/api/sota/language-modeling"},{"id":"text-classification","alias":null,"name":"Text classification","description":"Text classification is a machine learning process of automatically assigning predefined categories or labels to text based on its content, often using natural language processing (NLP). It involves analyzing text to understand its meaning and then applying the most appropriate label, with common applications including sentiment analysis (e.g., positive/negative reviews), spam detection, and topic categorization (e.g., organizing news articles).","result_count":13,"url":"https://www.codesota.com/api/sota/text-classification"},{"id":"video-classification","alias":null,"name":"Video classification","description":"The task of classifying videos into predefined categories or classes. Video classification involves analyzing temporal sequences of frames to understand the content and assign appropriate labels to entire video clips.","result_count":13,"url":"https://www.codesota.com/api/sota/video-classification"},{"id":"atari-games","alias":null,"name":"Atari Games","description":"Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pixels, but the goalposts keep moving. Agent57 (2020) was the first to achieve superhuman scores on all 57 games, and recent work like BBF and MEME shows that sample efficiency — not just final performance — is the new frontier. 
The benchmark's age is both its strength (decades of comparable results) and weakness (it doesn't capture the open-ended reasoning modern RL needs).","result_count":12,"url":"https://www.codesota.com/api/sota/atari-games"},{"id":"logical-reasoning","alias":null,"name":"Logical Reasoning","description":"Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.","result_count":12,"url":"https://www.codesota.com/api/sota/logical-reasoning"},{"id":"text-to-speech","alias":"tts","name":"Text-to-speech","description":"Text-to-speech (TTS) is technology that converts written text into natural-sounding audio, also known as \"read aloud\" technology or speech synthesis. It works by analyzing text to understand words, punctuation, and sentence structure, then generating phonetic representations of those words before synthesizing them into a human-like voice output. TTS is a crucial form of assistive technology and a key component of natural language processing, making digital content accessible and improving user interaction in numerous applications.","result_count":11,"url":"https://www.codesota.com/api/sota/tts"},{"id":"continuous-control","alias":null,"name":"Continuous Control","description":"Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. 
SAC (2018) and TD3 became reliable baselines, but the field shifted toward harder locomotion (humanoid parkour, dexterous hands) and sim-to-real transfer after DeepMind's dm_control and Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.","result_count":9,"url":"https://www.codesota.com/api/sota/continuous-control"},{"id":"text-ranking","alias":null,"name":"Text Ranking","description":"Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.","result_count":9,"url":"https://www.codesota.com/api/sota/text-ranking"},{"id":"text-to-image","alias":"t2i","name":"Text-to-Image Generation","description":"Text-to-image generation went from \"interesting research\" to cultural phenomenon in 18 months. DALL-E 2 (2022) proved diffusion models could produce photorealistic images from text, Stable Diffusion democratized it as open source, and Midjourney v5/v6 set the aesthetic bar that even non-technical users now expect. DALL-E 3 (2023) solved the prompt-following problem by training on highly descriptive captions, Flux pushed open-source quality to near-commercial levels, and Ideogram cracked reliable text rendering in images. 
The remaining frontiers are compositional generation (multiple objects with specified spatial relationships), consistent character identity across images, and the still-unsolved challenge of reliable hand and finger anatomy.","result_count":8,"url":"https://www.codesota.com/api/sota/t2i"},{"id":"natural-language-inference","alias":null,"name":"Natural Language Inference","description":"Determining entailment relationships between sentences (SNLI, MNLI).","result_count":8,"url":"https://www.codesota.com/api/sota/natural-language-inference"},{"id":"image-captioning","alias":"caption","name":"Image Captioning","description":"Image captioning — generating natural language descriptions of images — was the task that launched the modern vision-language era when Show and Tell (2015) paired CNNs with RNNs. The field progressed through BLIP, BLIP-2, and CoCa, each improving grounding and descriptive richness, until multimodal LLMs effectively subsumed it as a special case of image-text-to-text. COCO Captions and NoCaps remain standard benchmarks, but CIDEr and SPICE scores have largely saturated — the real frontier is dense captioning, generating paragraph-level descriptions that capture spatial relationships, attributes, and background context that brief captions miss. 
Captioning's importance now lies more in its role as training signal for other vision-language tasks than as a standalone evaluation.","result_count":7,"url":"https://www.codesota.com/api/sota/caption"},{"id":"audio-captioning","alias":null,"name":"Audio Captioning","description":"Generating text descriptions of audio content.","result_count":7,"url":"https://www.codesota.com/api/sota/audio-captioning"},{"id":"code-translation","alias":null,"name":"Code Translation","description":"Converting code between programming languages.","result_count":7,"url":"https://www.codesota.com/api/sota/code-translation"},{"id":"named-entity-recognition","alias":null,"name":"Named Entity Recognition","description":"Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from unstructured text, making it foundational to knowledge graphs, financial compliance, and clinical NLP. CoNLL-2003 English F1 scores have been above 93% since BERT, and current leaders like UniNER and GLiNER push past 95%, but these numbers mask the real difficulty: nested entities, emerging entity types, and cross-lingual transfer where performance drops 10-20 points. The shift from sequence labeling to generative NER (framing extraction as text generation) has opened the door for LLMs to compete, though latency-sensitive production systems still rely on encoder models like DeBERTa-v3 and SpanBERT.","result_count":7,"url":"https://www.codesota.com/api/sota/named-entity-recognition"},{"id":"hcast","alias":null,"name":"HCAST","description":"HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. 
The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.","result_count":6,"url":"https://www.codesota.com/api/sota/hcast"},{"id":"node-classification","alias":null,"name":"Node Classification","description":"Node classification — assigning labels to vertices in a graph using both node features and neighborhood structure — is the flagship task for Graph Neural Networks. GCN (Kipf & Welling, 2017) established the Cora/Citeseer/PubMed benchmark trinity, but these datasets are tiny by modern standards and results have saturated well above 85% accuracy. The field has moved toward large-scale heterogeneous graphs (ogbn-arxiv, ogbn-products from OGB) and the unsettled debate over whether simple MLPs with neighborhood features can match GNNs, as shown by SIGN and SGC ablations.","result_count":6,"url":"https://www.codesota.com/api/sota/node-classification"},{"id":"code-completion","alias":null,"name":"Code Completion","description":"Predicting the next tokens in code sequences.","result_count":6,"url":"https://www.codesota.com/api/sota/code-completion"},{"id":"bug-detection","alias":null,"name":"Bug Detection","description":"Identifying bugs and vulnerabilities in code.","result_count":6,"url":"https://www.codesota.com/api/sota/bug-detection"},{"id":"arithmetic-reasoning","alias":null,"name":"Arithmetic Reasoning","description":"Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K's arithmetic subset, but systematic errors on large-number multiplication and multi-digit division persist. 
Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.","result_count":6,"url":"https://www.codesota.com/api/sota/arithmetic-reasoning"},{"id":"audio-classification","alias":null,"name":"Audio Classification","description":"Classification of audio signals into predefined categories such as music genres, environmental sounds, or speaker identification.","result_count":5,"url":"https://www.codesota.com/api/sota/audio-classification"},{"id":"machine-translation","alias":null,"name":"Machine Translation","description":"Machine Translation is the task of automatically translating text from one natural language to another. The goal is to produce translations that preserve the meaning, style, and grammatical correctness of the source text while being fluent in the target language.","result_count":5,"url":"https://www.codesota.com/api/sota/machine-translation"},{"id":"re-bench","alias":null,"name":"RE-Bench","description":"RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks requiring genuine experimentation — training models, analyzing data, and iterating on approaches over extended time horizons up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures quality of results, capturing the difference between a mediocre and excellent solution. 
It revealed a critical finding: current frontier models (as of late 2024) plateau after ~2 hours of autonomous work while human experts continue improving, exposing the \"long-horizon reliability\" gap in agentic AI.","result_count":5,"url":"https://www.codesota.com/api/sota/re-bench"},{"id":"tabular-classification","alias":null,"name":"Tabular Classification","description":"Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain where gradient-boosted trees (XGBoost, LightGBM, CatBoost) stubbornly rival deep learning. Despite years of effort, neural approaches like TabNet (2019) and FT-Transformer (2021) only match tree methods on certain splits, and a 2022 NeurIPS study by Grinsztajn et al. confirmed that trees still dominate on medium-sized datasets. The real frontier is AutoML systems (AutoGluon, FLAML) that ensemble both paradigms, and the emerging question of whether foundation models pretrained on millions of tables can finally tip the balance.","result_count":5,"url":"https://www.codesota.com/api/sota/tabular-classification"},{"id":"robot-manipulation","alias":null,"name":"Robot Manipulation","description":"Robot manipulation — grasping, placing, and using tools — is where sim-to-real and foundation models meet physical dexterity. DexNet (2017) pioneered data-driven grasp planning, but the field accelerated when contact-rich manipulation was tackled with RL in simulation (DexterousHands, 2023) and then transferred to real hardware. 
Current state-of-the-art combines diffusion policies (Chi et al., 2023) with large pretrained vision encoders to achieve robust 6-DOF manipulation from a handful of demonstrations, though deformable objects and multi-step assembly remain unsolved.","result_count":5,"url":"https://www.codesota.com/api/sota/robot-manipulation"},{"id":"program-repair","alias":null,"name":"Program Repair","description":"Automatically fixing bugs in code.","result_count":5,"url":"https://www.codesota.com/api/sota/program-repair"},{"id":"time-horizon","alias":null,"name":"Time Horizon","description":"Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the single most important meta-metric for agentic AI. METR's evaluations suggest current frontier agents degrade significantly after 30-60 minutes of autonomous operation, while human software engineers can sustain productive work for hours. The metric matters because economic value scales exponentially with reliable autonomy duration: an agent that works reliably for 8 hours is not 16x more valuable than one that works for 30 minutes — it's qualitatively different, enabling entirely new categories of delegatable work.","result_count":5,"url":"https://www.codesota.com/api/sota/time-horizon"},{"id":"audio-text-to-text","alias":null,"name":"Audio-Text-to-Text","description":"Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. 
Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far more demanding than what leaderboards capture.","result_count":4,"url":"https://www.codesota.com/api/sota/audio-text-to-text"},{"id":"coding-agents","alias":null,"name":"Coding Agents","description":"Coding agents are autonomous, AI-powered software development tools that understand natural language prompts and execute multi-step tasks to automate coding, bug fixing, and entire software workflows. They act as intelligent assistants within the software development lifecycle, capable of understanding code, generating new code, optimizing existing code, debugging, and handling tasks like documentation and feature scaffolding with minimal user supervision. Examples of coding agents include Claude Code and Cursor Agent.","result_count":4,"url":"https://www.codesota.com/api/sota/coding-agents"},{"id":"video-language-models","alias":null,"name":"Video-Language Models","description":"Video Language Models (Video LLMs) are advanced AI systems that combine large language models with video processing capabilities to understand and generate descriptive content from videos. 
They bridge the gap between visual and textual information by using special encoders to convert video data into a format that a standard text-based large language model (LLM) can process, enabling tasks like video analysis, content generation, and question answering about video content.","result_count":4,"url":"https://www.codesota.com/api/sota/video-language-models"},{"id":"sound-event-detection","alias":null,"name":"Sound Event Detection","description":"Detecting and localizing sound events in audio.","result_count":3,"url":"https://www.codesota.com/api/sota/sound-event-detection"},{"id":"knowledge-graph-completion","alias":null,"name":"Knowledge Graph Completion","description":"Predicting missing links in knowledge graphs.","result_count":3,"url":"https://www.codesota.com/api/sota/knowledge-graph-completion"},{"id":"relation-extraction","alias":null,"name":"Relation Extraction","description":"Extracting relationships between entities from text.","result_count":3,"url":"https://www.codesota.com/api/sota/relation-extraction"},{"id":"voice-cloning","alias":null,"name":"Voice cloning","description":"Voice cloning is a type of audio deepfake technology that uses machine learning to create a digital replica of a specific person's voice, synthesizing spoken audio that mimics their vocal characteristics like pitch and tone. While it has positive uses, such as generating audiobooks or helping people who have lost their voice, it is also used for malicious purposes, including creating convincing scams where fraudsters impersonate individuals.","result_count":3,"url":"https://www.codesota.com/api/sota/voice-cloning"},{"id":"image-segmentation","alias":null,"name":"Image segmentation","description":"Image segmentation is a computer vision technique that divides a digital image into multiple parts or \"segments,\" where each segment contains pixels with similar characteristics, such as color, texture, or brightness. 
The goal is to simplify an image by changing its representation into something more meaningful and easier to analyze, often by identifying and locating objects, their boundaries, and different regions within the image. This process has wide-ranging applications, from medical image analysis to autonomous vehicles and satellite imagery.","result_count":3,"url":"https://www.codesota.com/api/sota/image-segmentation"},{"id":"fill-mask","alias":null,"name":"Fill-Mask","description":"Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict what goes there. It powered the encoder revolution that dominated NLP from 2018 to 2022 and remains the training signal behind models like RoBERTa, DeBERTa, and XLM-RoBERTa that still run most production classification and NER systems. As a standalone task it has limited direct applications, but probing what a model predicts for masked slots became a key technique for analyzing bias, factual knowledge, and linguistic competence stored in model weights. The task has faded from the research spotlight as decoder-only (GPT-style) pretraining proved more scalable, but encoder models trained with MLM remain the most cost-efficient option for tasks that need fast inference on structured prediction.","result_count":3,"url":"https://www.codesota.com/api/sota/fill-mask"},{"id":"link-prediction","alias":null,"name":"Link Prediction","description":"Link prediction — inferring missing or future edges in a graph — underpins knowledge graph completion, drug-target discovery, and social network recommendation. TransE (2013) launched the knowledge graph embedding era, and the field matured through DistMult, RotatE, and CompGCN, benchmarked on FB15k-237 and WN18RR. 
The current frontier is inductive link prediction (generalizing to unseen entities), where GNN-based methods like NBFNet and foundation models like ULTRA (2024) show that a single model can transfer across entirely different knowledge graphs without retraining.","result_count":3,"url":"https://www.codesota.com/api/sota/link-prediction"},{"id":"semantic-similarity","alias":null,"name":"Semantic Textual Similarity","description":"Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detection, paraphrase mining, and retrieval. STS Benchmark scores climbed from 70 (GloVe averages) to 86+ with Sentence-BERT, and now exceed 92 with models like GTE-Qwen2 and E5-Mistral that leverage billion-parameter backbones. The real shift was from symmetric similarity (are these two sentences paraphrases?) to asymmetric retrieval (does this passage answer this query?), driven by the RAG revolution that made embedding quality a production-critical metric. Cross-lingual semantic similarity remains a hard frontier — models trained primarily on English still lose 5-10 points when comparing sentences across language families, despite multilingual pretraining.","result_count":3,"url":"https://www.codesota.com/api/sota/semantic-similarity"},{"id":"entity-linking","alias":null,"name":"Entity Linking","description":"Linking mentions to knowledge base entities.","result_count":3,"url":"https://www.codesota.com/api/sota/entity-linking"},{"id":"music-generation","alias":null,"name":"Music Generation","description":"Generating music from text, audio, or other inputs.","result_count":3,"url":"https://www.codesota.com/api/sota/music-generation"},{"id":"molecular-property-prediction","alias":null,"name":"Molecular Property Prediction","description":"Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from molecular structure — is the workhorse task of AI-driven drug discovery. 
GNNs operate on molecular graphs while transformer approaches (ChemBERTa, Uni-Mol) use SMILES strings or 3D coordinates. MoleculeNet (2018) and the Therapeutics Data Commons (TDC) provide standardized benchmarks, but the real bottleneck is distribution shift: models trained on known chemical space struggle with novel scaffolds, and the gap between leaderboard accuracy and actual wet-lab utility remains the field's central challenge.","result_count":3,"url":"https://www.codesota.com/api/sota/molecular-property-prediction"},{"id":"speaker-verification","alias":null,"name":"Speaker Verification","description":"Verifying speaker identity from voice samples.","result_count":3,"url":"https://www.codesota.com/api/sota/speaker-verification"},{"id":"table-question-answering","alias":null,"name":"Table Question Answering","description":"Table question answering bridges natural language and structured data — asking \"what was Q3 revenue?\" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. 
The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.","result_count":3,"url":"https://www.codesota.com/api/sota/table-question-answering"},{"id":"speech-translation","alias":null,"name":"Speech Translation","description":"Translating spoken audio directly to another language.","result_count":3,"url":"https://www.codesota.com/api/sota/speech-translation"},{"id":"code-summarization","alias":null,"name":"Code Summarization","description":"Generating natural language descriptions of code.","result_count":3,"url":"https://www.codesota.com/api/sota/code-summarization"},{"id":"zero-shot-classification","alias":null,"name":"Zero-Shot Classification","description":"Zero-shot classification asks a model to categorize text into labels it has never been explicitly trained on — the ultimate test of language understanding and generalization. The breakthrough was the natural language inference (NLI) trick: reframe classification as \"does this text entail the label?\" using models fine-tuned on MNLI, pioneered by Yin et al. (2019) and popularized by BART-large-MNLI. Today, instruction-tuned LLMs have largely subsumed this approach — GPT-4, Claude, and Llama 3 can classify into arbitrary taxonomies via prompting with near-supervised accuracy. The remaining challenge is consistency and calibration: LLMs are powerful but their predictions can be brittle to prompt phrasing, making them unreliable for high-stakes automated pipelines without careful engineering.","result_count":3,"url":"https://www.codesota.com/api/sota/zero-shot-classification"},{"id":"video-to-video","alias":null,"name":"Video-to-Video","description":"Video-to-video translation transforms existing footage — applying style transfer, temporal super-resolution, relighting, or motion retargeting while preserving temporal coherence across frames. 
The naive approach of processing frames independently produces unwatchable flicker, so the core technical challenge is enforcing cross-frame consistency. Diffusion-based approaches like Rerender-A-Video and TokenFlow (2023) showed that propagating attention features between frames solves this elegantly. The practical frontier is real-time processing for live video — current methods are offline and slow, but the creative potential for film post-production, video editing, and content repurposing is enormous.","result_count":2,"url":"https://www.codesota.com/api/sota/video-to-video"},{"id":"bioinformatics-agents","alias":null,"name":"Bioinformatics Agents","description":"LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpreting biological results.","result_count":2,"url":"https://www.codesota.com/api/sota/bioinformatics-agents"},{"id":"tabular-regression","alias":null,"name":"Tabular Regression","description":"Tabular regression — predicting continuous values from structured data — powers everything from house-price estimation to demand forecasting and shares the same tree-vs-neural tension as classification. XGBoost and LightGBM remain brutally effective defaults, but recent work on differentiable trees and table-aware transformers (TabPFN, 2022) showed that meta-learned priors can beat tuned GBDTs on small datasets in seconds. 
The challenge is distribution shift: real-world regression targets drift over time, and most benchmarks (UCI, Kaggle) are static snapshots that hide this problem entirely.","result_count":2,"url":"https://www.codesota.com/api/sota/tabular-regression"},{"id":"reading-comprehension","alias":null,"name":"Reading Comprehension","description":"Understanding and answering questions about passages.","result_count":2,"url":"https://www.codesota.com/api/sota/reading-comprehension"},{"id":"ocr","alias":null,"name":"OCR","description":"OCR, or Optical Character Recognition, is the task of converting an image containing text into machine-readable, editable, and searchable digital text data. This involves converting scanned documents, photos, or image-only PDFs to text from their static visual format, enabling the document to be edited, searched, or used for data entry and other applications. Examples include digitizing receipts for your bank app, translating signs with Google Translate, or creating searchable archives from old documents.","result_count":1,"url":"https://www.codesota.com/api/sota/ocr"},{"id":"keypoint-detection","alias":null,"name":"Keypoint Detection","description":"Keypoint detection localizes specific anatomical or structural landmarks — body joints, facial features, hand articulations — enabling pose estimation, gesture recognition, and motion capture. OpenPose (2017) first demonstrated real-time multi-person pose estimation, and the field has since progressed through HRNet, ViTPose, and RTMPose pushing both accuracy and speed. Modern systems detect 133 whole-body keypoints (body + hands + face) in real-time on mobile devices. The applications span from sports biomechanics (analyzing an athlete's form frame-by-frame) to sign language recognition and AR avatar puppeteering.","result_count":1,"url":"https://www.codesota.com/api/sota/keypoint-detection"}],"retrieved_at":"2026-07-28T15:13:03.183Z"}