| # | Task | Description | Leading Benchmark | Top Model | Score | Metric/Subset | Entries |
|---|------|-------------|-------------------|-----------|-------|---------------|---------|
| 1 | Polish LLM General | General-purpose evaluation of language models on Polish language tasks: sentiment, reading comprehension, ques… | Open Polish LLM Leaderboard | Meta-Llama-3.1-405B-Instruct-FP8 | 93.44 | belebele | 3,728 |
| 2 | Polish Cultural Competency | Evaluating language models on Polish linguistic and cultural knowledge across art & entertainment, culture & t… | Polish Linguistic and Cultural Competency Benchmark | Gemini-3.1-Pro-Preview | 100.0 | geography | 1,155 |
| 3 | Polish Text Understanding | Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky question… | Complex Polish Text Understanding Benchmark | Qwen/Qwen3.5-35B-A3B thinking (API) | 4.702 | tricky-questions | 465 |
| 4 | Polish Conversation Quality | Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities,… | Polish Multi-Turn Benchmark | Phi-4 | 10.00 | stem | 450 |
| 5 | Polish Emotional Intelligence | Evaluating language models on emotional intelligence in Polish: understanding emotional states, predicting emo… | Polish Emotional Intelligence Benchmark (EQ-Bench v2 PL) | Mistral-Large-Instruct-2407 | 78.07 | eq-score | 101 |
| 6 | Question Answering | Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (20… | Stanford Question Answering Dataset v2.0 | DeBERTa-v3-large | 91.4% | f1 | 24 |
| 7 | Text Summarization | Text summarization compresses documents while preserving key information — a task that became dramatically mor… | CNN/DailyMail Summarization | BRIO | 47.8% | rouge-1 | 15 |
| 8 | Text Classification | Text classification is the gateway drug of NLP — sentiment analysis, spam detection, topic labeling — and the… | SuperGLUE | DeBERTa-v3-large | 91.40 | average-score | 12 |
| 9 | Natural Language Inference | Determining entailment relationships between sentences (SNLI, MNLI). | Stanford Natural Language Inference | GPT-4o | 92.6% | accuracy | 8 |
| 10 | Text Ranking | Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by C… | BEIR | NV-Embed-v2 | 62.65 | ndcg@10 | 8 |
| 11 | Named Entity Recognition | Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from u… | CoNLL-2003 Named Entity Recognition | GLiNER-multitask | 93.8% | f1 | 7 |
| 12 | Feature Extraction | Feature extraction — generating dense vector embeddings from text — is the unsung infrastructure layer powerin… | MTEB Leaderboard | NV-Embed-v2 | 72.31 | avg-score | 6 |
| 13 | Machine Translation | Machine translation is the oldest AI grand challenge, from rule-based systems in the 1950s to the transformer… | WMT'23 | GPT-4 | 84.10 | comet | 4 |
| 14 | Fill-Mask | Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict w… | GLUE | DeBERTa-v3-large | 91.37 | avg-score | 3 |
| 15 | Semantic Textual Similarity | Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detecti… | STS Benchmark | GTE-Qwen2-7B-instruct | 88.40 | spearman | 3 |
| 16 | Table Question Answering | Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a s… | WikiTableQuestions | GPT-4 | 75.3% | accuracy | 3 |
| 17 | Zero-Shot Classification | Zero-shot classification asks a model to categorize text into labels it has never been explicitly trained on —… | XNLI | GPT-4 | 87.4% | accuracy | 3 |
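
Several of the metrics above reduce to short formulas. The BEIR score in row 10 is nDCG@10: graded relevance of the top-10 retrieved documents, discounted by log rank and normalized against the ideal ordering. A minimal pure-Python sketch (function names here are illustrative, not part of any BEIR tooling):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: graded relevance of each result,
    # discounted by log2 of its 1-based rank plus one.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # so a perfect ranking scores exactly 1.0.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that puts a highly relevant document low in the list is penalized more than one that misorders the tail, which is why nDCG@10 is the standard headline number for retrieval leaderboards.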
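
Likewise, the CoNLL-2003 score in row 11 is entity-level F1: a predicted entity counts as correct only if both its span boundaries and its type match a gold entity exactly. A minimal sketch (this `span_f1` helper is hypothetical, not the official conlleval script):

```python
def span_f1(gold, pred):
    # Entity-level F1 in the CoNLL-2003 style: each entity is a
    # (start, end, type) triple, and only exact matches count.
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note the strictness: a prediction with the right span but the wrong type scores zero for that entity, which is why entity-level F1 runs well below token-level accuracy on the same data.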