Rows and columns — churn, credit risk, diagnoses, sensor logs — are still where most production ML lives. For a decade the answer was a gradient-boosted tree. Then TabPFN put a tabular foundation model in Nature, TabArena gave the field its first living, Elo-rated benchmark, and the frontier moved: today the top single models on the board are all pre-trained transformers doing in-context learning, with LightGBM and CatBoost holding the cost-efficiency line.
Below: the live TabArena board (best config per model family), the small-data slice where foundation models won outright, and how to pick. Every number links to its primary source.
Best configuration per model family, from the official board’s 64 rated configs (defaults · tuned · tuned + post-hoc-ensembled). Foundation models run a single forward pass on GPU; “tuned + ensembled” entries reflect a full hyperparameter search — read the train-time column before comparing Elo alone.
| # | Model (best config on board) | Notes | Family | Elo | 95% CI | Train s/1K | Predict s/1K | HW | Model paper |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AutoGluon 1.5 (extreme, 4h) | Reference pipeline, not a single model: a 4-hour multi-model ensemble. The ceiling everything else is measured against. | AutoML pipeline | 1695 | +82/−69 | 289.1 | 4.03 | GPU | arXiv:2003.06505 |
| 2 | TabPFN-3 (default) | Top single model on the board: a single forward pass, no tuning, within the CI of the 4-hour AutoGluon ensemble. | Foundation model | 1673 | +83/−62 | 4.97 | 0.58 | GPU | arXiv:2605.13986 |
| 3 | TabPFN-2.6 (default) | Prior Labs' successor line to the Nature-published TabPFN v2. Default config, zero tuning. | Foundation model | 1624 | +79/−52 | 5.48 | 0.56 | GPU | arXiv:2511.08667 |
| 4 | RealTabPFN-2.5 (tuned + ensembled) | TabPFN-2.5 fine-tuned on real (not just synthetic) data, then tuned + post-hoc ensembled. | Foundation model | 1600 | +79/−61 | 2040 | 8.92 | GPU | arXiv:2511.08667 |
| 5 | TabICLv2 (default) | The Inria in-context-learning line. Fastest predict time in the top tier. | Foundation model | 1596 | +76/−63 | 4.02 | 0.38 | GPU | arXiv:2602.11139 |
| 6 | RealMLP (tuned + ensembled) | Best trained-from-scratch neural net on the board: an MLP with a bag of carefully ablated tricks. | Neural net | 1513 | +56/−46 | 2951 | 11.99 | GPU | arXiv:2407.04491 |
| 7 | TabDPT (tuned + ensembled) | Retrieval-based tabular foundation model trained on real data (Layer 6 AI). | Foundation model | 1459 | +63/−53 | 4908 | 286.7 | GPU | arXiv:2410.18164 |
| 8 | TabM (tuned + ensembled) | Parameter-efficient MLP ensembling (Yandex Research). Co-led the original TabArena-v0.1 board in 2025. | Neural net | 1447 | +53/−43 | 3286 | 1.47 | GPU | arXiv:2410.24210 |
| 9 | LightGBM (tuned + ensembled) | Best gradient-boosted tree on the board, and it runs on CPU, no GPU required. | Tree-based | 1433 | +34/−30 | 417.0 | 2.64 | CPU | NeurIPS 2017 |
| 10 | CatBoost (tuned + ensembled) | Its default config scores Elo 1369 with 7 s/1K train time, making it the strongest out-of-the-box classical model. | Tree-based | 1417 | +40/−38 | 1658 | 0.65 | CPU | arXiv:1706.09516 |
| 11 | iLTM (tuned + ensembled) | Large tabular model; accuracy comes at heavy train and inference cost. | Foundation model | 1407 | +42/−45 | 12683 | 464.4 | GPU | arXiv:2511.15941 |
| 12 | ModernNCA (tuned + ensembled) | Neighbourhood-component-analysis revival: a retrieval-flavoured deep net. | Neural net | 1390 | +77/−53 | 4622 | 8.15 | GPU | arXiv:2407.03257 |
| 13 | XGBoost (tuned + ensembled) | The 2015–2022 default. Still solid; no longer the frontier, even among trees. | Tree-based | 1375 | +32/−34 | 693.5 | 1.69 | CPU | arXiv:1603.02754 |
| 14 | LimiX (default) | Open large structured-data model out of China; default config only on the board. | Foundation model | 1361 | +79/−62 | 26.5 | 6.25 | GPU | arXiv:2509.03505 |
| 15 | xRFM (tuned + ensembled) | Recursive feature machines: kernel methods scaled up. The strongest non-tree, non-NN, non-FM entry. | Kernel / other | 1350 | +49/−41 | 846.9 | 2.55 | GPU | arXiv:2508.10053 |
| 16 | EBM (tuned + ensembled) | Explainable Boosting Machine, a glass-box additive model. The price of full interpretability: ~160 Elo vs LightGBM. | Tree-based | 1272 | +41/−40 | 2930 | 0.42 | CPU | Lou et al., KDD 2013 |
| 17 | Random Forest (tuned + ensembled) | Default-config Random Forest (Elo 1000) is the board's calibration anchor. Even tuned, it trails boosting by ~260 Elo. | Tree-based | 1171 | +51/−45 | n/a | n/a | CPU | Breiman, 2001 |
| 18 | Linear / Logistic (tuned + ensembled) | The sanity-check baseline. If your model can't clearly beat this, your features carry the signal, not the model. | Baseline | 962 | +64/−98 | n/a | n/a | CPU | scikit-learn |
Calibration: default-config Random Forest = Elo 1000; a 400-point gap ≈ 10:1 expected win odds (about a 91% win rate). Times are median seconds per 1,000 samples as published on the board.
Source: official TabArena leaderboard (tabarena.ai), v0.1.4 board data, retrieved 2026-06-10. Methodology: arXiv:2506.16791. Per-model paper links in the last column.
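To read Elo gaps as head-to-head expectations, the sketch below converts a gap into win odds and a win rate. It assumes the board uses the standard 400-point logistic Elo curve, which is consistent with the stated "400-point gap ≈ 10:1" calibration but is not confirmed here as TabArena's exact formula; the function names are illustrative.

```python
def win_odds(elo_gap: float) -> float:
    """Expected wins per loss for the higher-rated model (standard Elo curve)."""
    return 10.0 ** (elo_gap / 400.0)


def expected_win_rate(elo_gap: float) -> float:
    """The same expectation expressed as a probability between 0 and 1."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Gaps computed from the Elo column of the table above.
gaps = {
    "calibration gap": 400,
    "AutoGluon 1.5 vs TabPFN-3": 1695 - 1673,
    "TabPFN-3 vs LightGBM": 1673 - 1433,
    "LightGBM vs EBM": 1433 - 1272,
}

for name, gap in gaps.items():
    print(f"{name}: gap={gap}, odds={win_odds(gap):.2f}:1, "
          f"win rate={expected_win_rate(gap):.1%}")
```

Under this assumption a gap of a few dozen points is close to a coin flip, which is why the confidence intervals matter as much as the rank order.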
36 of the 51 TabArena datasets are small, the regime most real business problems live in and the regime the TabPFN Nature paper targeted (≤10,000 samples × 500 features). Here the top four single models are all foundation models, and the best gradient-boosted tree trails the top single model, TabPFN-3, by ~250 Elo, which corresponds to roughly 4:1 expected win odds for TabPFN-3.
| # | Model (best config on board) | Notes | Family | Elo | 95% CI | Train s/1K | Predict s/1K | HW | Model paper |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AutoGluon 1.5 (extreme, 4h) | Reference pipeline. | AutoML pipeline | 1643 | n/a | n/a | n/a | GPU | arXiv:2003.06505 |
| 2 | TabPFN-3 (default) | Statistically tied with the 4-hour AutoGluon ensemble, from a single forward pass. | Foundation model | 1642 | +85/−56 | n/a | n/a | GPU | arXiv:2605.13986 |
| 3 | TabPFN-2.6 (default) | | Foundation model | 1602 | +76/−49 | n/a | n/a | GPU | arXiv:2511.08667 |
| 4 | RealTabPFN-2.5 (tuned + ensembled) | | Foundation model | 1599 | +94/−64 | n/a | n/a | GPU | arXiv:2511.08667 |
| 5 | TabICLv2 (default) | | Foundation model | 1575 | +105/−83 | n/a | n/a | GPU | arXiv:2602.11139 |
| 6 | RealMLP (tuned + ensembled) | Best non-foundation single model on small data. | Neural net | 1483 | +62/−46 | n/a | n/a | GPU | arXiv:2407.04491 |
| 7 | LightGBM (tuned + ensembled) | Best GBDT on small data, 253 Elo below TabPFN-3. This is the regime where trees lost. | Tree-based | 1389 | +36/−32 | n/a | n/a | CPU | NeurIPS 2017 |
| 8 | CatBoost (tuned + ensembled) | | Tree-based | 1362 | +46/−40 | n/a | n/a | CPU | arXiv:1706.09516 |
| 9 | XGBoost (tuned + ensembled) | | Tree-based | 1324 | +32/−36 | n/a | n/a | CPU | arXiv:1603.02754 |
Source: official TabArena leaderboard, small-dataset subset (36 datasets), v0.1.4 board data, retrieved 2026-06-10. Train/predict times omitted where the subset board does not republish them.
2015–2022 · Trees rule
XGBoost (2016), LightGBM (2017) and CatBoost (2017) win essentially every tabular Kaggle competition and every fair benchmark. Repeated studies find deep learning fails to beat tuned GBDTs on typical tabular tasks. “Just use XGBoost” becomes the default advice.
2025 · The Nature moment
TabPFN v2 (Hollmann et al., Nature 637, 319–326) shows a transformer pre-trained on ~100 million synthetic datasets can outperform every baseline on datasets up to 10,000 samples × 500 features — in 2.8 seconds, beating ensembles tuned for 4 hours. Tabular ML gets its foundation-model era.
2025 · TabArena ships
The community gets a living benchmark (arXiv:2506.16791): 51 datasets curated from a pool of 1,053 across 14 prior benchmarks, 16 model families, Elo ratings with bootstrapped CIs, and a maintained public leaderboard. Its launch finding: validation protocol and post-hoc ensembling change rankings more than architecture choice.
2026 · Foundation models take the board
On the current board (v0.1.4), the top four single models are all tabular foundation models — TabPFN-3, TabPFN-2.6, RealTabPFN-2.5, TabICLv2. The best GBDT config sits ~240 Elo lower. Trees still win on CPU cost, large data, and operational simplicity — but they no longer set the accuracy frontier.
The honest caveats, straight from the TabArena paper: rankings are protocol-sensitive. Post-hoc ensembling of hyperparameter configurations reshuffles the order: without it, CatBoost beats the tuned neural nets that rank above it once ensembling is applied. And foundation-model dominance is currently a small-and-medium-data result; context-length limits still bound how much data fits in a forward pass.
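To make "post-hoc ensembling of hyperparameter configurations" concrete, here is a minimal sketch of greedy ensemble selection (Caruana et al., 2004), one common way to do it. Whether TabArena uses exactly this procedure is an assumption; the config names and validation predictions are made-up toy values, not board data.

```python
import math


def log_loss(y_true, probs, eps=1e-12):
    """Mean binary log loss; probs are predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def greedy_ensemble(val_preds, y_val, max_rounds=20):
    """Greedy forward selection with replacement over held-out predictions.

    Each round adds the config that most reduces validation loss of the
    running average. The weights of the best round seen are returned.
    """
    names = list(val_preds)
    running_sum = [0.0] * len(y_val)
    counts = {name: 0 for name in names}
    best_loss, best_counts = float("inf"), None
    for step in range(1, max_rounds + 1):
        step_name, step_loss = None, float("inf")
        for name in names:
            candidate = [(s + p) / step
                         for s, p in zip(running_sum, val_preds[name])]
            loss = log_loss(y_val, candidate)
            if loss < step_loss:
                step_name, step_loss = name, loss
        running_sum = [s + p for s, p in zip(running_sum, val_preds[step_name])]
        counts[step_name] += 1
        if step_loss < best_loss:
            best_loss, best_counts = step_loss, dict(counts)
    total = sum(best_counts.values())
    weights = {n: c / total for n, c in best_counts.items() if c > 0}
    return weights, best_loss


# Hypothetical validation labels and per-config predictions (toy values).
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
val_preds = {
    "config_a": [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1],
    "config_b": [0.6, 0.4, 0.9, 0.5, 0.1, 0.2, 0.6, 0.4],
    "config_c": [0.5] * 8,
}
weights, ens_loss = greedy_ensemble(val_preds, y_val)
print("weights:", weights)
print("ensemble loss:", round(ens_loss, 4))
```

Because the first round always picks the best single config, the returned ensemble can never score worse on the validation data than that config. This is one reason a model family with many diverse configurations can gain more from this step than a family whose default is already strong.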
Practical reading: if you benchmark candidates yourself, fix the validation protocol first (outer folds, consistent tuning budget, same ensembling policy) — otherwise you are measuring your harness, not your model. TabArena’s code is open if you want to reuse its protocol: github.com/autogluon/tabarena.
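As an illustration of fixing the protocol first, the sketch below builds the outer folds once and scores every candidate on exactly those folds. It is a standard-library toy, not TabArena's harness: the helper names, the synthetic data, and the two stand-in candidates are all hypothetical, and it covers only the shared-folds part of the protocol (not tuning budget or ensembling policy).

```python
import random
from statistics import mean


def make_folds(n_rows, n_folds=5, seed=0):
    """Shuffle row indices once and split them into fixed outer folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def evaluate(fit_predict, X, y, folds):
    """Mean held-out accuracy of one candidate over the shared folds."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        scores.append(mean(int(p == y[i]) for p, i in zip(preds, test_idx)))
    return mean(scores)


def majority_class(X_train, y_train, X_test):
    """Baseline candidate: always predict the most frequent training label."""
    label = max(set(y_train), key=y_train.count)
    return [label] * len(X_test)


def one_nearest_neighbour(X_train, y_train, X_test):
    """Stand-in 'real' candidate: copy the label of the closest training row."""
    def predict(row):
        nearest = min(range(len(X_train)),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(row, X_train[j])))
        return y_train[nearest]
    return [predict(row) for row in X_test]


# Synthetic toy data: two features, label is 1 when their sum exceeds 1.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(a + b > 1.0) for a, b in X]

folds = make_folds(len(y), n_folds=5, seed=0)  # built once, reused for all
candidates = {
    "majority class": majority_class,
    "1-nearest-neighbour": one_nearest_neighbour,
}
results = {name: evaluate(fn, X, y, folds) for name, fn in candidates.items()}
for name, acc in results.items():
    print(f"{name}: mean accuracy {acc:.3f}")
```

The design point is that `folds` is created before any model is touched and never regenerated per candidate, so score differences come from the models rather than from different splits.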
Elo is one axis. The others are dataset size, hardware (foundation models want a GPU; GBDTs don’t), tuning budget, and whether you must explain every prediction.
Small / medium data (≲50K rows) · max accuracy
TabPFN-3 · TabPFN-2.6 · TabICLv2
Foundation models hold the top single-model slots on TabArena, in one forward pass with no tuning. This regime — most real business datasets — is where in-context learning won.
Large data · CPU-only · production serving
LightGBM · CatBoost
Best Elo per CPU-second on the board. LightGBM (tuned + ensembled, Elo 1433) is the strongest tree; CatBoost's default config (Elo 1369, ~7 s/1K train) is the best zero-tuning classical pick.
Absolute ceiling · accuracy at any cost
AutoGluon (extreme preset)
The 4-hour multi-model ensemble tops the overall board (Elo 1695). If a Kaggle-style last percent matters and you have the compute budget, ensemble across families.
Regulated / interpretable
EBM · Linear + strong features
Explainable Boosting Machines are glass-box additive models — every prediction decomposes into per-feature curves. You pay ~160 Elo vs LightGBM for that property.
Deep-learning stack already in place
TabM · RealMLP
The two strongest trained-from-scratch nets (Elo 1447 / 1513 tuned + ensembled). They beat tuned GBDTs on the 2026 board — but only with tuning and post-hoc ensembling.
Honest baseline before anything fancy
Logistic / linear regression
Elo 962 tuned. If your candidate model doesn't clearly separate from this, invest in features and data quality, not architecture.
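The picks above can be condensed into a small lookup. This is a hypothetical helper that only restates the cards: the order in which the conditions are checked and the hard 50,000-row cutoff (the page says "≲50K") are assumptions, and the deep-learning-stack and baseline cards are left out because they apply regardless of regime.

```python
def recommend(n_rows, has_gpu, needs_interpretability=False,
              accuracy_at_any_cost=False):
    """Return (regime, suggested models) following the cards on this page."""
    if needs_interpretability:
        return "Regulated / interpretable", ["EBM", "Linear + strong features"]
    if accuracy_at_any_cost:
        return "Absolute ceiling", ["AutoGluon (extreme preset)"]
    if has_gpu and n_rows <= 50_000:
        return ("Small / medium data, max accuracy",
                ["TabPFN-3", "TabPFN-2.6", "TabICLv2"])
    # Large data, no GPU, or serving constraints: fall back to boosted trees.
    return ("Large data / CPU-only / production serving",
            ["LightGBM", "CatBoost"])


for case in [
    dict(n_rows=8_000, has_gpu=True),
    dict(n_rows=5_000_000, has_gpu=False),
    dict(n_rows=20_000, has_gpu=True, needs_interpretability=True),
]:
    regime, models = recommend(**case)
    print(case, "->", regime, models)
```

Whatever the helper returns, the logistic or linear baseline is still worth fitting first as a sanity check.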
Tabular ML spent a decade with mutually contradicting benchmarks. As of 2026 the field has converged on one living board plus a handful of primary papers — these four cover the evidence on this page.
The official living benchmark for tabular ML (paper: arXiv:2506.16791). Datasets curated from 1,053 candidates across 14 prior benchmarks; every model run with defaults, tuned, and tuned + post-hoc-ensembled protocols; Elo calibrated to default Random Forest = 1000. The single board to read first in 2026.
Source: tabarena.ai leaderboard.
The peer-reviewed evaluation behind the foundation-model turn: TabPFN outperforms all baselines on small data, with reported speedups of 5,140× (classification) and 3,000× (regression) versus 4-hour-tuned ensembles. Hollmann et al., Nature 637, 319–326.
Source: Nature paper.
The subset where the GBDT-vs-foundation-model question is most decisively settled: TabPFN-3 (Elo 1642) ties the 4-hour AutoGluon reference pipeline and leads the best GBDT config by ~250 Elo. If your datasets are small, this is the slice that predicts your experience.
Source: TabArena leaderboard (small-data filter).
Erickson, Purucker et al. describe the methodology behind the board: why prior tabular benchmarks disagreed (dataset licensing, leakage, weak validation protocols), how post-hoc ensembling reshuffles rankings, and the maintenance protocol that keeps the leaderboard living.
Source: arXiv:2506.16791.