Structured data · the benchmark finally exists

Tabular Machine Learning.

Rows and columns — churn, credit risk, diagnoses, sensor logs — are still where most production ML lives. For a decade the answer was a gradient-boosted tree. Then TabPFN put a tabular foundation model in Nature, TabArena gave the field its first living, Elo-rated benchmark, and the frontier moved: today the top single models on the board are all pre-trained transformers doing in-context learning, with LightGBM and CatBoost holding the cost-efficiency line.

Below: the live TabArena board (best config per model family), the small-data slice where foundation models won outright, and how to pick. Every number links to its primary source.

§ 01 · The board

TabArena overall — 51 datasets, Elo-rated.

Best configuration per model family, from the official board’s 64 rated configs (defaults · tuned · tuned + post-hoc-ensembled). Foundation models run a single forward pass on GPU; “tuned + ensembled” entries reflect a full hyperparameter search — read the train-time column before comparing Elo alone.

| # | Model (best config on board) | Note | Family | Elo | 95% CI | Train s/1K | Predict s/1K | HW | Model paper |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AutoGluon 1.5 (extreme, 4h) | Reference pipeline, not a single model: a 4-hour multi-model ensemble. The ceiling everything else is measured against. | AutoML pipeline | 1695 | +82/−69 | 289.1 | 4.03 | GPU | arXiv:2003.06505 |
| 2 | TabPFN-3 (default) | Top single model on the board: a single forward pass, no tuning, within the CI of the 4-hour AutoGluon ensemble. | Foundation model | 1673 | +83/−62 | 4.97 | 0.58 | GPU | arXiv:2605.13986 |
| 3 | TabPFN-2.6 (default) | Prior Labs' successor line to the Nature-published TabPFN v2. Default config, zero tuning. | Foundation model | 1624 | +79/−52 | 5.48 | 0.56 | GPU | arXiv:2511.08667 |
| 4 | RealTabPFN-2.5 (tuned + ensembled) | TabPFN-2.5 fine-tuned on real (not just synthetic) data, then tuned + post-hoc ensembled. | Foundation model | 1600 | +79/−61 | 2040 | 8.92 | GPU | arXiv:2511.08667 |
| 5 | TabICLv2 (default) | The Inria in-context-learning line. Fastest predict time in the top tier. | Foundation model | 1596 | +76/−63 | 4.02 | 0.38 | GPU | arXiv:2602.11139 |
| 6 | RealMLP (tuned + ensembled) | Best trained-from-scratch neural net on the board: an MLP with a bag of carefully ablated tricks. | Neural net | 1513 | +56/−46 | 2951 | 11.99 | GPU | arXiv:2407.04491 |
| 7 | TabDPT (tuned + ensembled) | Retrieval-based tabular foundation model trained on real data (Layer 6 AI). | Foundation model | 1459 | +63/−53 | 4908 | 286.7 | GPU | arXiv:2410.18164 |
| 8 | TabM (tuned + ensembled) | Parameter-efficient MLP ensembling (Yandex Research). Co-led the original TabArena-v0.1 board in 2025. | Neural net | 1447 | +53/−43 | 3286 | 1.47 | GPU | arXiv:2410.24210 |
| 9 | LightGBM (tuned + ensembled) | Best gradient-boosted tree on the board, and it runs on CPU, no GPU required. | Tree-based | 1433 | +34/−30 | 417.0 | 2.64 | CPU | NeurIPS 2017 |
| 10 | CatBoost (tuned + ensembled) | Its default config scores Elo 1369 with 7 s/1K train time, making it the strongest out-of-the-box classical model. | Tree-based | 1417 | +40/−38 | 1658 | 0.65 | CPU | arXiv:1706.09516 |
| 11 | iLTM (tuned + ensembled) | Large tabular model; accuracy comes at heavy train and inference cost. | Foundation model | 1407 | +42/−45 | 12683 | 464.4 | GPU | arXiv:2511.15941 |
| 12 | ModernNCA (tuned + ensembled) | Neighbourhood-component-analysis revival: a retrieval-flavoured deep net. | Neural net | 1390 | +77/−53 | 4622 | 8.15 | GPU | arXiv:2407.03257 |
| 13 | XGBoost (tuned + ensembled) | The 2015–2022 default. Still solid; no longer the frontier, even among trees. | Tree-based | 1375 | +32/−34 | 693.5 | 1.69 | CPU | arXiv:1603.02754 |
| 14 | LimiX (default) | Open large structured-data model out of China; default config only on the board. | Foundation model | 1361 | +79/−62 | 26.5 | 6.25 | GPU | arXiv:2509.03505 |
| 15 | xRFM (tuned + ensembled) | Recursive feature machines: kernel methods scaled up. The strongest non-tree, non-NN, non-FM entry. | Kernel / other | 1350 | +49/−41 | 846.9 | 2.55 | GPU | arXiv:2508.10053 |
| 16 | EBM (tuned + ensembled) | Explainable Boosting Machine, a glass-box additive model. The price of full interpretability: ~160 Elo vs LightGBM. | Tree-based | 1272 | +41/−40 | 2930 | 0.42 | CPU | Lou et al., KDD 2013 |
| 17 | Random Forest (tuned + ensembled) | Default-config Random Forest (Elo 1000) is the board's calibration anchor. Even tuned, it trails boosting by ~260 Elo. | Tree-based | 1171 | +51/−45 |  |  | CPU | Breiman, 2001 |
| 18 | Linear / Logistic (tuned + ensembled) | The sanity-check baseline. If your model can't clearly beat this, your features carry the signal, not the model. | Baseline | 962 | +64/−98 |  |  | CPU | scikit-learn |

Calibration: default-config Random Forest = Elo 1000; a 400-point gap ≈ 10:1 expected win rate. Times are median seconds per 1,000 samples as published on the board.
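The calibration rule above can be turned into a two-line calculation. The sketch below assumes the standard logistic Elo curve with a scale of 400 points, which is what "a 400-point gap ≈ 10:1" implies; the function names are illustrative, and how the board itself handles ties is not covered here.

```python
def elo_odds(elo_gap: float) -> float:
    """Expected wins per loss for the higher-rated model (standard Elo, scale 400)."""
    return 10.0 ** (elo_gap / 400.0)


def elo_win_probability(elo_gap: float) -> float:
    """Expected win rate of the higher-rated model for a given Elo gap."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


if __name__ == "__main__":
    # 400 is the calibration gap; 253 and 160 are gaps quoted on this page.
    for gap in (0, 160, 253, 400):
        print(f"gap={gap:>3}  odds={elo_odds(gap):5.2f}:1  "
              f"win rate={elo_win_probability(gap):.1%}")
```

Read this way, a gap of a few dozen Elo points is close to a coin flip, which is why the confidence intervals matter as much as the point ratings.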

Source: official TabArena leaderboard (tabarena.ai), v0.1.4 board data, retrieved 2026-06-10. Methodology: arXiv:2506.16791. Per-model paper links in the last column.
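One way to read Elo and train time together is to keep only the entries that no other entry beats on both axes. The sketch below is an illustration, not part of TabArena: it copies seven rows from the board above and computes that Pareto front in plain Python. `BOARD` and `pareto_front` are hypothetical names.

```python
# (model, Elo, median train seconds per 1K samples), copied from the overall board.
BOARD = [
    ("AutoGluon 1.5 (extreme, 4h)", 1695, 289.1),
    ("TabPFN-3 (default)", 1673, 4.97),
    ("TabPFN-2.6 (default)", 1624, 5.48),
    ("TabICLv2 (default)", 1596, 4.02),
    ("LightGBM (tuned + ensembled)", 1433, 417.0),
    ("CatBoost (tuned + ensembled)", 1417, 1658.0),
    ("XGBoost (tuned + ensembled)", 1375, 693.5),
]


def pareto_front(rows):
    """Return the rows not dominated on (higher Elo, lower train time)."""
    front = []
    best_elo = float("-inf")
    # Walk from cheapest to most expensive; keep a row only if it improves Elo.
    for name, elo, train_s in sorted(rows, key=lambda r: (r[2], -r[1])):
        if elo > best_elo:
            front.append((name, elo, train_s))
            best_elo = elo
    return front


if __name__ == "__main__":
    for name, elo, train_s in pareto_front(BOARD):
        print(f"{name:32s} Elo {elo}  train {train_s} s/1K")
```

On these seven rows the front is TabICLv2, TabPFN-3 and AutoGluon. The comparison ignores the HW column: the foundation-model times are GPU seconds and the tree times are CPU seconds, so a CPU-only deployment would need its own front.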

§ 02 · Small data

The small-data slice — where the argument ended.

36 of the 51 TabArena datasets are small: the regime most real business problems live in, and the regime the TabPFN Nature paper targeted (≤10,000 samples × 500 features). Here the top four single models are all foundation models, and the best gradient-boosted tree trails the top single model, TabPFN-3, by 253 Elo, roughly a 4:1 expected win rate in TabPFN-3's favour.

| # | Model (best config on board) | Note | Family | Elo | 95% CI | HW | Model paper |
|---|---|---|---|---|---|---|---|
| 1 | AutoGluon 1.5 (extreme, 4h) | Reference pipeline. | AutoML pipeline | 1643 |  | GPU | arXiv:2003.06505 |
| 2 | TabPFN-3 (default) | Statistically tied with the 4-hour AutoGluon ensemble, from a single forward pass. | Foundation model | 1642 | +85/−56 | GPU | arXiv:2605.13986 |
| 3 | TabPFN-2.6 (default) |  | Foundation model | 1602 | +76/−49 | GPU | arXiv:2511.08667 |
| 4 | RealTabPFN-2.5 (tuned + ensembled) |  | Foundation model | 1599 | +94/−64 | GPU | arXiv:2511.08667 |
| 5 | TabICLv2 (default) |  | Foundation model | 1575 | +105/−83 | GPU | arXiv:2602.11139 |
| 6 | RealMLP (tuned + ensembled) | Best non-foundation single model on small data. | Neural net | 1483 | +62/−46 | GPU | arXiv:2407.04491 |
| 7 | LightGBM (tuned + ensembled) | Best GBDT on small data, 253 Elo below TabPFN-3. This is the regime where trees lost. | Tree-based | 1389 | +36/−32 | CPU | NeurIPS 2017 |
| 8 | CatBoost (tuned + ensembled) |  | Tree-based | 1362 | +46/−40 | CPU | arXiv:1706.09516 |
| 9 | XGBoost (tuned + ensembled) |  | Tree-based | 1324 | +32/−36 | CPU | arXiv:1603.02754 |

Source: official TabArena leaderboard, small-dataset subset (36 datasets), v0.1.4 board data, retrieved 2026-06-10. Train/predict times omitted where the subset board does not republish them.

§ 03 · The storyline

From “just use XGBoost” to foundation models in four acts.

2015–2022 · Trees rule

XGBoost (2016), LightGBM (2017) and CatBoost (2017) win essentially every tabular Kaggle competition and every fair benchmark. Repeated studies find deep learning fails to beat tuned GBDTs on typical tabular tasks. “Just use XGBoost” becomes the default advice.

2025 · The Nature moment

TabPFN v2 (Hollmann et al., Nature 637, 319–326) shows a transformer pre-trained on ~100 million synthetic datasets can outperform every baseline on datasets up to 10,000 samples × 500 features — in 2.8 seconds, beating ensembles tuned for 4 hours. Tabular ML gets its foundation-model era.

2025 · TabArena ships

The community gets a living benchmark (arXiv:2506.16791): 51 datasets curated from a pool of 1,053 across 14 prior benchmarks, 16 model families, Elo ratings with bootstrapped CIs, and a maintained public leaderboard. Its launch finding: validation protocol and post-hoc ensembling change rankings more than architecture choice.
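To make "Elo ratings with bootstrapped CIs" concrete, here is a deliberately simplified two-model sketch: it resamples datasets with replacement and converts each resampled win rate into an Elo gap. This is an assumption-laden illustration, not TabArena's code; the real board fits ratings across all models at once, and the 36-of-51 win record below is hypothetical.

```python
import math
import random


def elo_gap_from_win_rate(p: float) -> float:
    """Invert the Elo curve: win rate p -> rating gap in Elo points."""
    p = min(max(p, 1e-6), 1.0 - 1e-6)  # clamp so a 100% win rate stays finite
    return 400.0 * math.log10(p / (1.0 - p))


def bootstrap_elo_gap(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for the Elo gap between two models.

    outcomes: one entry per dataset, 1.0 = model A wins, 0.0 = loses, 0.5 = tie.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    point = elo_gap_from_win_rate(sum(outcomes) / n)
    gaps = []
    for _ in range(n_boot):
        # Resample datasets with replacement, then recompute the gap.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        gaps.append(elo_gap_from_win_rate(sum(resample) / n))
    gaps.sort()
    lower = gaps[int(n_boot * alpha / 2)]
    upper = gaps[int(n_boot * (1 - alpha / 2)) - 1]
    return point, lower, upper


if __name__ == "__main__":
    # Hypothetical record: model A beats model B on 36 of 51 datasets.
    outcomes = [1.0] * 36 + [0.0] * 15
    point, lower, upper = bootstrap_elo_gap(outcomes)
    print(f"Elo gap {point:.0f} (95% CI {lower:.0f} to {upper:.0f})")
```

With only 51 datasets the resampled gaps spread widely, which is consistent with the wide intervals printed on the board.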

2026 · Foundation models take the board

On the current board (v0.1.4), the top four single models are all tabular foundation models — TabPFN-3, TabPFN-2.6, RealTabPFN-2.5, TabICLv2. The best GBDT config sits ~240 Elo lower. Trees still win on CPU cost, large data, and operational simplicity — but they no longer set the accuracy frontier.

The honest caveats, straight from the TabArena paper: rankings are protocol-sensitive. Post-hoc ensembling of hyperparameter configurations reshuffles the order: the tuned neural nets that outrank CatBoost with ensembling fall behind it without. And foundation-model dominance is currently a small-and-medium-data result; context-length limits still bound how much data fits in a forward pass.
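For readers who have not met post-hoc ensembling, one common form is greedy ensemble selection in the style of Caruana et al. (2004): repeatedly add, with replacement, whichever tuned configuration most improves the validation loss of the averaged prediction. The sketch below assumes binary classification and Brier loss and uses made-up validation predictions; it is not TabArena's implementation, and all names are illustrative.

```python
def brier(probs, labels):
    """Mean squared error between predicted P(y=1) and the 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def greedy_ensemble(val_preds, labels, n_rounds=10):
    """Greedy ensemble selection with replacement.

    val_preds: dict mapping config name -> list of P(y=1) on the validation set.
    Returns (weights, loss) for the best-scoring round.
    """
    running_sum = [0.0] * len(labels)
    picks = []
    best_loss, best_picks = float("inf"), []
    for round_idx in range(1, n_rounds + 1):
        round_best = None
        for name, preds in val_preds.items():
            # Validation loss if this config were added to the current average.
            candidate = [(s + p) / round_idx for s, p in zip(running_sum, preds)]
            loss = brier(candidate, labels)
            if round_best is None or loss < round_best[1]:
                round_best = (name, loss)
        name, loss = round_best
        picks.append(name)
        running_sum = [s + p for s, p in zip(running_sum, val_preds[name])]
        if loss < best_loss:  # remember the best prefix of picks
            best_loss, best_picks = loss, list(picks)
    weights = {n: best_picks.count(n) / len(best_picks) for n in set(best_picks)}
    return weights, best_loss


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    val_preds = {  # hypothetical validation predictions from three configs
        "config_a": [0.9, 0.4, 0.6, 0.1],
        "config_b": [0.6, 0.1, 0.9, 0.4],
        "config_c": [0.5, 0.5, 0.5, 0.5],
    }
    print(greedy_ensemble(val_preds, labels))
```

Because the weights are fitted on validation predictions, a model family with many diverse configurations can gain more from this step than one whose configurations all make the same errors, which is one reason rankings shift when it is switched on.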

Practical reading: if you benchmark candidates yourself, fix the validation protocol first (outer folds, consistent tuning budget, same ensembling policy) — otherwise you are measuring your harness, not your model. TabArena’s code is open if you want to reuse its protocol: github.com/autogluon/tabarena.
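A minimal sketch of "fix the validation protocol first": build the outer folds once from a fixed seed and score every candidate on exactly those folds. It uses only the standard library, synthetic one-feature data, and two toy candidates that are purely illustrative; a real harness would also fix the tuning budget and ensembling policy, which this sketch does not cover.

```python
import random
from statistics import mean


def make_folds(n_samples, n_folds=5, seed=0):
    """Shuffle indices once with a fixed seed and split them into test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def evaluate(fit, X, y, folds):
    """Mean accuracy of one candidate over the shared outer folds."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return mean(scores)


def fit_majority(X_train, y_train):
    """Toy baseline: always predict the most frequent training label."""
    label = max(set(y_train), key=y_train.count)
    return lambda x: label


def fit_threshold(X_train, y_train):
    """Toy model: threshold at the midpoint between the two class means."""
    pos = [x for x, t in zip(X_train, y_train) if t == 1]
    neg = [x for x, t in zip(X_train, y_train) if t == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: int(x > cut)


# Synthetic data: class 1 is centred at 2.0, class 0 at 0.0.
rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [rng.gauss(2.0 * label, 1.0) for label in y]

FOLDS = make_folds(len(y), n_folds=5, seed=0)  # built once, reused for every candidate
candidates = {"majority": fit_majority, "threshold": fit_threshold}
results = {name: evaluate(fit, X, y, FOLDS) for name, fit in candidates.items()}
print(results)
```

The design point is that `FOLDS` is created outside the loop over candidates: if each candidate drew its own split, differences in score would mix model quality with split luck.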

§ 04 · Decision shortcuts

Which should I use?

Elo is one axis. The others are dataset size, hardware (foundation models want a GPU; GBDTs don’t), tuning budget, and whether you must explain every prediction.

Small / medium data (≲50K rows) · max accuracy

TabPFN-3 · TabPFN-2.6 · TabICLv2

Foundation models hold the top single-model slots on TabArena, in one forward pass with no tuning. This regime — most real business datasets — is where in-context learning won.

Large data · CPU-only · production serving

LightGBM · CatBoost

Best Elo per CPU-second on the board. LightGBM (tuned + ensembled, Elo 1433) is the strongest tree; CatBoost's default config (Elo 1369, ~7 s/1K train) is the best zero-tuning classical pick.

Absolute ceiling · accuracy at any cost

AutoGluon (extreme preset)

The 4-hour multi-model ensemble tops the overall board (Elo 1695). If a Kaggle-style last percent matters and you have the compute budget, ensemble across families.

Regulated / interpretable

EBM · Linear + strong features

Explainable Boosting Machines are glass-box additive models — every prediction decomposes into per-feature curves. You pay ~160 Elo vs LightGBM for that property.

Deep-learning stack already in place

TabM · RealMLP

The two strongest trained-from-scratch nets (Elo 1447 / 1513 tuned + ensembled). They beat tuned GBDTs on the 2026 board — but only with tuning and post-hoc ensembling.

Honest baseline before anything fancy

Logistic / linear regression

Elo 962 tuned. If your candidate model doesn't clearly separate from this, invest in features and data quality, not architecture.

§ 05 · Reference sources

The boards and papers that matter.

Tabular ML spent a decade with mutually contradicting benchmarks. As of 2026 the field has converged on one living board plus a handful of primary papers — these four cover the evidence on this page.

TabArena (v0.1.4)

51 datasets · 64 rated configs · Elo + 95% CI · 2025–live

The official living benchmark for tabular ML (paper: arXiv:2506.16791). Datasets curated from 1,053 candidates across 14 prior benchmarks; every model run with defaults, tuned, and tuned + post-hoc-ensembled protocols; Elo calibrated to default Random Forest = 1000. The single board to read first in 2026.

tabarena.ai leaderboard

TabPFN v2 Nature evaluation

Datasets ≤10K samples × 500 features · 2025

The peer-reviewed evaluation behind the foundation-model turn: TabPFN outperforms all baselines on small data, with reported speedups of 5,140× (classification) and 3,000× (regression) versus 4-hour-tuned ensembles. Hollmann et al., Nature 637, 319–326.

Nature paper

TabArena small-data slice

36 of the 51 datasets · 2025–live

The subset where the GBDT-vs-foundation-model question is most decisively settled: TabPFN-3 (Elo 1642) ties the 4-hour AutoGluon reference pipeline and leads the best GBDT config by ~250 Elo. If your datasets are small, this is the slice that predicts your experience.

Leaderboard (small-data filter)

TabArena paper

Benchmark methodology · 2025

Erickson, Purucker et al. The methodology behind the board: why prior tabular benchmarks disagreed (dataset licensing, leakage, weak validation protocols), how post-hoc ensembling reshuffles rankings, and the maintenance protocol that keeps the leaderboard living.

arXiv:2506.16791