
Structured data · the benchmark finally exists

Tabular Machine Learning.

Rows and columns — churn, credit risk, diagnoses, sensor logs — are still where most production ML lives. For a decade the answer was a gradient-boosted tree. Then TabPFN put a tabular foundation model in Nature, TabArena gave the field its first living, Elo-rated benchmark, and the frontier moved: today the top single models on the board are all pre-trained transformers doing in-context learning, with LightGBM and CatBoost holding the cost-efficiency line.

Below: the live TabArena board (best config per model family), the small-data slice where foundation models won outright, and how to pick. Every number links to its primary source.


§ 01 · The board

TabArena overall — 51 datasets, Elo-rated.

Best configuration per model family, from the official board’s 64 rated configs (defaults · tuned · tuned + post-hoc-ensembled). Foundation models run a single forward pass on GPU; “tuned + ensembled” entries reflect a full hyperparameter search — read the train-time column before comparing Elo alone.

#	Model (best config on board)	Family	Elo	95% CI	Train s/1K	Predict s/1K	HW	Model paper
1	AutoGluon 1.5 (extreme, 4h) Reference pipeline, not a single model — a 4-hour multi-model ensemble. The ceiling everything else is measured against.	AutoML pipeline	1695	+82/−69	289.1	4.03	GPU	arXiv:2003.06505 →
2	TabPFN-3 (default) Top single model on the board — a single forward pass, no tuning, within the CI of the 4-hour AutoGluon ensemble.	Foundation model	1673	+83/−62	4.97	0.58	GPU	arXiv:2605.13986 →
3	TabPFN-2.6 (default) Prior Labs' successor line to the Nature-published TabPFN v2. Default config, zero tuning.	Foundation model	1624	+79/−52	5.48	0.56	GPU	arXiv:2511.08667 →
4	RealTabPFN-2.5 (tuned + ensembled) TabPFN-2.5 fine-tuned on real (not just synthetic) data, then tuned + post-hoc ensembled.	Foundation model	1600	+79/−61	2040	8.92	GPU	arXiv:2511.08667 →
5	TabICLv2 (default) The Inria in-context-learning line. Fastest predict time in the top tier.	Foundation model	1596	+76/−63	4.02	0.38	GPU	arXiv:2602.11139 →
6	RealMLP (tuned + ensembled) Best trained-from-scratch neural net on the board — an MLP with a bag of carefully ablated tricks.	Neural net	1513	+56/−46	2951	11.99	GPU	arXiv:2407.04491 →
7	TabDPT (tuned + ensembled) Retrieval-based tabular foundation model trained on real data (Layer 6 AI).	Foundation model	1459	+63/−53	4908	286.7	GPU	arXiv:2410.18164 →
8	TabM (tuned + ensembled) Parameter-efficient MLP ensembling (Yandex Research). Co-led the original TabArena-v0.1 board in 2025.	Neural net	1447	+53/−43	3286	1.47	GPU	arXiv:2410.24210 →
9	LightGBM (tuned + ensembled) Best gradient-boosted tree on the board — and it runs on CPU, no GPU required.	Tree-based	1433	+34/−30	417.0	2.64	CPU	NeurIPS 2017 →
10	CatBoost (tuned + ensembled) Its default config scores Elo 1369 with 7 s/1K train time — the strongest out-of-the-box classical model.	Tree-based	1417	+40/−38	1658	0.65	CPU	arXiv:1706.09516 →
11	iLTM (tuned + ensembled) Large tabular model; accuracy comes at heavy train and inference cost.	Foundation model	1407	+42/−45	12683	464.4	GPU	arXiv:2511.15941 →
12	ModernNCA (tuned + ensembled) Neighbourhood-component-analysis revival — retrieval-flavoured deep net.	Neural net	1390	+77/−53	4622	8.15	GPU	arXiv:2407.03257 →
13	XGBoost (tuned + ensembled) The 2015–2022 default. Still solid; no longer the frontier, even among trees.	Tree-based	1375	+32/−34	693.5	1.69	CPU	arXiv:1603.02754 →
14	LimiX (default) Open large structured-data model out of China; default config only on the board.	Foundation model	1361	+79/−62	26.5	6.25	GPU	arXiv:2509.03505 →
15	xRFM (tuned + ensembled) Recursive feature machines — kernel methods scaled up. The strongest non-tree, non-NN, non-FM entry.	Kernel / other	1350	+49/−41	846.9	2.55	GPU	arXiv:2508.10053 →
16	EBM (tuned + ensembled) Explainable Boosting Machine — glass-box additive model. The price of full interpretability: ~160 Elo vs LightGBM.	Tree-based	1272	+41/−40	2930	0.42	CPU	Lou et al., KDD 2013 →
17	Random Forest (tuned + ensembled) Default-config Random Forest (Elo 1000) is the board's calibration anchor. Even tuned, it trails boosting by ~260 Elo.	Tree-based	1171	+51/−45	—	—	CPU	Breiman, 2001 →
18	Linear / Logistic (tuned + ensembled) The sanity-check baseline. If your model can't clearly beat this, your features carry the signal — not the model.	Baseline	962	+64/−98	—	—	CPU	scikit-learn →

Calibration: default-config Random Forest = Elo 1000; a 400-point gap ≈ 10:1 expected win rate. Times are median seconds per 1,000 samples as published on the board.
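To make the calibration note concrete, here is a minimal sketch of how an Elo gap maps to an expected win rate. It assumes the board uses the standard base-10, 400-point Elo curve, which is what the "400-point gap ≈ 10:1" statement implies; the function names are illustrative and not part of TabArena's code.

```python
def win_odds(elo_gap: float) -> float:
    """Expected wins per loss for the higher-rated model: 10 ** (gap / 400)."""
    return 10.0 ** (elo_gap / 400.0)


def win_probability(elo_gap: float) -> float:
    """Expected win probability for the higher-rated model (standard Elo logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


if __name__ == "__main__":
    # Gaps quoted on this page: EBM vs LightGBM (~160), best GBDT vs TabPFN-3 (~240 overall).
    for gap in (0, 160, 240, 400):
        print(f"gap {gap:>3}: p(win) = {win_probability(gap):.3f}, odds = {win_odds(gap):.2f}:1")
```

Under that assumption, a gap only becomes a near-certain win once it approaches several hundred points; gaps of a few dozen Elo are close to a coin flip.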

Source: official TabArena leaderboard (tabarena.ai), v0.1.4 board data, retrieved 2026-06-10. Methodology: arXiv:2506.16791. Per-model paper links in the last column.
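One way to read Elo and train time together is to keep only the entries that no other entry beats on both axes. The sketch below does that for the rows of the table above that publish a train time; the Pareto filter is an illustration added here, not something the board computes, and it ignores the GPU/CPU difference in the HW column.

```python
# (model, Elo, median train seconds per 1K samples), copied from the table above.
BOARD = [
    ("AutoGluon 1.5", 1695, 289.1), ("TabPFN-3", 1673, 4.97),
    ("TabPFN-2.6", 1624, 5.48), ("RealTabPFN-2.5", 1600, 2040.0),
    ("TabICLv2", 1596, 4.02), ("RealMLP", 1513, 2951.0),
    ("TabDPT", 1459, 4908.0), ("TabM", 1447, 3286.0),
    ("LightGBM", 1433, 417.0), ("CatBoost", 1417, 1658.0),
    ("iLTM", 1407, 12683.0), ("ModernNCA", 1390, 4622.0),
    ("XGBoost", 1375, 693.5), ("LimiX", 1361, 26.5),
    ("xRFM", 1350, 846.9), ("EBM", 1272, 2930.0),
]


def pareto_front(rows):
    """Return rows that no other row beats on both Elo (higher) and train time (lower)."""
    front = []
    best_elo = float("-inf")
    # Walk from cheapest to most expensive; a row survives only if it raises the best Elo so far.
    for name, elo, train_s in sorted(rows, key=lambda r: (r[2], -r[1])):
        if elo > best_elo:
            front.append((name, elo, train_s))
            best_elo = elo
    return front


if __name__ == "__main__":
    for name, elo, train_s in pareto_front(BOARD):
        print(f"{name:<15} Elo {elo}  train {train_s} s/1K")
```

Because the comparison mixes GPU and CPU timings, a CPU-only deployment would want to run the same filter over the CPU rows alone.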

§ 02 · Small data

The small-data slice — where the argument ended.

36 of the 51 TabArena datasets are small: the regime most real business problems live in, and the regime the TabPFN Nature paper targeted (≤10,000 samples × 500 features). Here the top four single models are all foundation models, and the best gradient-boosted tree trails the leader by ~250 Elo, roughly a 4:1 expected win rate in the leader's favour.

#	Model (best config on board)	Family	Elo	95% CI	Train s/1K	Predict s/1K	HW	Model paper
1	AutoGluon 1.5 (extreme, 4h) Reference pipeline.	AutoML pipeline	1643	—	—	—	GPU	arXiv:2003.06505 →
2	TabPFN-3 (default) Statistically tied with the 4-hour AutoGluon ensemble — from a single forward pass.	Foundation model	1642	+85/−56	—	—	GPU	arXiv:2605.13986 →
3	TabPFN-2.6 (default)	Foundation model	1602	+76/−49	—	—	GPU	arXiv:2511.08667 →
4	RealTabPFN-2.5 (tuned + ensembled)	Foundation model	1599	+94/−64	—	—	GPU	arXiv:2511.08667 →
5	TabICLv2 (default)	Foundation model	1575	+105/−83	—	—	GPU	arXiv:2602.11139 →
6	RealMLP (tuned + ensembled) Best non-foundation single model on small data.	Neural net	1483	+62/−46	—	—	GPU	arXiv:2407.04491 →
7	LightGBM (tuned + ensembled) Best GBDT on small data — 253 Elo below TabPFN-3. This is the regime where trees lost.	Tree-based	1389	+36/−32	—	—	CPU	NeurIPS 2017 →
8	CatBoost (tuned + ensembled)	Tree-based	1362	+46/−40	—	—	CPU	arXiv:1706.09516 →
9	XGBoost (tuned + ensembled)	Tree-based	1324	+32/−36	—	—	CPU	arXiv:1603.02754 →

Source: official TabArena leaderboard, small-dataset subset (36 datasets), v0.1.4 board data, retrieved 2026-06-10. Train/predict times omitted where the subset board does not republish them.
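A quick way to sanity-check phrases like "statistically tied" is to compare the published 95% intervals. The sketch below uses the Elo and CI values from the small-data table; treating interval overlap as the tie criterion is a simplification assumed here, not the board's own statistical test.

```python
def interval(elo, plus, minus):
    """Turn 'Elo +plus/-minus' from the board into a (low, high) pair."""
    return (elo - minus, elo + plus)


def overlaps(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


tabpfn3 = interval(1642, 85, 56)    # TabPFN-3 (default)
tabiclv2 = interval(1575, 105, 83)  # TabICLv2 (default)
realmlp = interval(1483, 62, 46)    # RealMLP (tuned + ensembled)
lightgbm = interval(1389, 36, 32)   # LightGBM (tuned + ensembled)
autogluon_elo = 1643                # no CI published for this row on the subset board

if __name__ == "__main__":
    print("AutoGluon point estimate inside TabPFN-3 CI:", tabpfn3[0] <= autogluon_elo <= tabpfn3[1])
    print("TabICLv2 vs RealMLP overlap:", overlaps(tabiclv2, realmlp))
    print("TabPFN-3 vs LightGBM overlap:", overlaps(tabpfn3, lightgbm))
```

Overlapping intervals do not prove two models are equivalent, but non-overlapping ones, as with TabPFN-3 and LightGBM here, are a reasonable signal that the gap is not noise.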

§ 03 · The storyline

From “just use XGBoost” to foundation models in four acts.

2015–2022 · Trees rule

XGBoost (2016), LightGBM (2017) and CatBoost (2017) win essentially every tabular Kaggle competition and every fair benchmark. Repeated studies find deep learning fails to beat tuned GBDTs on typical tabular tasks. “Just use XGBoost” becomes the default advice.

2025 · The Nature moment

TabPFN v2 (Hollmann et al., Nature 637, 319–326) shows a transformer pre-trained on ~100 million synthetic datasets can outperform every baseline on datasets up to 10,000 samples × 500 features — in 2.8 seconds, beating ensembles tuned for 4 hours. Tabular ML gets its foundation-model era.

2025 · TabArena ships

The community gets a living benchmark (arXiv:2506.16791): 51 datasets curated from a pool of 1,053 across 14 prior benchmarks, 16 model families, Elo ratings with bootstrapped CIs, and a maintained public leaderboard. Its launch finding: validation protocol and post-hoc ensembling change rankings more than architecture choice.

2026 · Foundation models take the board

On the current board (v0.1.4), the top four single models are all tabular foundation models — TabPFN-3, TabPFN-2.6, RealTabPFN-2.5, TabICLv2. The best GBDT config sits ~240 Elo lower. Trees still win on CPU cost, large data, and operational simplicity — but they no longer set the accuracy frontier.

The honest caveats, straight from the TabArena paper: rankings are protocol-sensitive. Post-hoc ensembling of hyperparameter configurations reshuffles the order: without ensembling, CatBoost beats the tuned neural nets that outrank it once ensembling is applied. And foundation-model dominance is currently a small-and-medium-data result; context-length limits still bound how much data fits in a forward pass.
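To see why post-hoc ensembling can reorder a leaderboard, here is a minimal sketch of greedy forward selection with replacement over the validation predictions of already-trained configurations. Whether the board uses exactly this procedure is an assumption; the config names and numbers below are invented toy data.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def greedy_ensemble(val_preds, y_val, rounds=10):
    """Pick configs one at a time (repeats allowed) to minimise validation MSE.

    val_preds maps a config name to its validation predictions.
    Returns a dict of config name -> ensemble weight (weights sum to 1).
    """
    total = [0.0] * len(y_val)  # running sum of the selected configs' predictions
    counts = {}
    for step in range(1, rounds + 1):
        best_name, best_err = None, float("inf")
        for name, preds in val_preds.items():
            candidate = [(t + p) / step for t, p in zip(total, preds)]
            err = mse(candidate, y_val)
            if err < best_err:
                best_name, best_err = name, err
        total = [t + p for t, p in zip(total, val_preds[best_name])]
        counts[best_name] = counts.get(best_name, 0) + 1
    return {name: c / rounds for name, c in counts.items()}


# Toy data: two configs with complementary errors and one uninformative config.
y_val = [1.0, 0.0, 1.0, 0.0]
val_preds = {
    "config_a": [0.9, 0.4, 0.6, 0.1],
    "config_b": [0.6, 0.1, 0.9, 0.4],
    "config_c": [0.5, 0.5, 0.5, 0.5],
}
weights = greedy_ensemble(val_preds, y_val)
blend = [sum(w * val_preds[n][i] for n, w in weights.items()) for i in range(len(y_val))]

if __name__ == "__main__":
    print(weights, mse(blend, y_val))
```

A model family whose configurations make complementary errors gains more from this step than one whose configurations all agree, which is one plausible reason rankings shift when ensembling is switched on.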

Practical reading: if you benchmark candidates yourself, fix the validation protocol first (outer folds, consistent tuning budget, same ensembling policy) — otherwise you are measuring your harness, not your model. TabArena’s code is open if you want to reuse its protocol: github.com/autogluon/tabarena.
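As a rough illustration of fixing the protocol first, the sketch below builds the outer folds once and reuses them for every candidate. It is plain Python with two toy stand-in models; `make_folds`, `cross_validate` and the candidates are hypothetical names for this example and are unrelated to the TabArena codebase.

```python
import random


def make_folds(n_rows, k, seed=0):
    """Shuffle row indices once and split them into k disjoint test folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(fit, X, y, folds):
    """fit(X_train, y_train) must return a predict(x) callable. Returns per-fold accuracy."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


def fit_majority(X_train, y_train):
    """Baseline: always predict the most common training label."""
    label = max(set(y_train), key=y_train.count)
    return lambda x: label


def fit_threshold(X_train, y_train):
    """One-feature rule: predict 1 when x[0] exceeds the midpoint of the class means."""
    pos = [x[0] for x, t in zip(X_train, y_train) if t == 1]
    neg = [x[0] for x, t in zip(X_train, y_train) if t == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x[0] > cut)


# Toy dataset: one feature, label is 1 for the upper half of the range.
X = [[float(i)] for i in range(40)]
y = [int(i >= 20) for i in range(40)]

folds = make_folds(len(X), k=5, seed=42)  # built once, shared by every candidate
candidates = {"majority": fit_majority, "threshold": fit_threshold}
results = {name: cross_validate(fit, X, y, folds) for name, fit in candidates.items()}

if __name__ == "__main__":
    for name, scores in results.items():
        print(name, [round(s, 2) for s in scores])
```

The point of the design is that `folds` is created outside the loop over candidates: any difference in scores then comes from the models, not from each model seeing a different split.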

§ 04 · Decision shortcuts

Which should I use?

Elo is one axis. The others are dataset size, hardware (foundation models want a GPU; GBDTs don’t), tuning budget, and whether you must explain every prediction.

Small / medium data (≲50K rows) · max accuracy

TabPFN-3 · TabPFN-2.6 · TabICLv2

Foundation models hold the top single-model slots on TabArena, in one forward pass with no tuning. This regime — most real business datasets — is where in-context learning won.

Large data · CPU-only · production serving

LightGBM · CatBoost

Best Elo per CPU-second on the board. LightGBM (tuned + ensembled, Elo 1433) is the strongest tree; CatBoost's default config (Elo 1369, ~7 s/1K train) is the best zero-tuning classical pick.

Absolute ceiling · accuracy at any cost

AutoGluon (extreme preset)

The 4-hour multi-model ensemble tops the overall board (Elo 1695). If a Kaggle-style last percent matters and you have the compute budget, ensemble across families.

Regulated / interpretable

EBM · Linear + strong features

Explainable Boosting Machines are glass-box additive models — every prediction decomposes into per-feature curves. You pay ~160 Elo vs LightGBM for that property.
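The "decomposes into per-feature curves" property can be sketched with a toy additive model in which each feature owns a step function and a prediction is the intercept plus one lookup per feature. This is an illustration of the idea only: the class, feature names and values are invented, and it is not the API of any EBM library.

```python
import bisect


class AdditiveModel:
    """Toy glass-box model: score = intercept + sum of per-feature step functions."""

    def __init__(self, intercept, shapes):
        # shapes maps feature name -> (bin_edges, bin_values), with len(values) == len(edges) + 1
        self.intercept = intercept
        self.shapes = shapes

    def contributions(self, row):
        """Per-feature contribution for one row: the value of the bin the feature falls in."""
        return {
            name: values[bisect.bisect_right(edges, row[name])]
            for name, (edges, values) in self.shapes.items()
        }

    def predict(self, row):
        """Raw score; a classifier would pass this through a sigmoid."""
        return self.intercept + sum(self.contributions(row).values())


model = AdditiveModel(
    intercept=-1.0,
    shapes={
        "tenure_months": ([6, 24], [0.8, 0.1, -0.6]),
        "support_tickets": ([1, 4], [-0.3, 0.2, 0.9]),
    },
)
row = {"tenure_months": 3, "support_tickets": 5}
parts = model.contributions(row)
score = model.predict(row)

if __name__ == "__main__":
    print(parts, score)
```

Every prediction can be audited by reading `parts`: there are no interactions hidden between features unless they are added as explicit terms.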

Deep-learning stack already in place

TabM · RealMLP

The two strongest trained-from-scratch nets (Elo 1447 / 1513 tuned + ensembled). They beat tuned GBDTs on the 2026 board — but only with tuning and post-hoc ensembling.

Honest baseline before anything fancy

Logistic / linear regression

Elo 962 tuned. If your candidate model doesn't clearly separate from this, invest in features and data quality, not architecture.
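The shortcuts above can be written down as a small lookup, shown here as a sketch. The precedence between rules and the flag names are assumptions made for this example; the 50,000-row cut-off and the model names come from the shortcuts themselves.

```python
def shortlist(n_rows, has_gpu, must_explain=False, accuracy_at_any_cost=False, dl_stack=False):
    """Map a project's constraints to the candidate models suggested on this page."""
    if must_explain:
        # Regulated or interpretable settings take priority over raw Elo.
        return ["EBM", "Linear + strong features"]
    if accuracy_at_any_cost:
        return ["AutoGluon (extreme preset)"]
    if has_gpu and n_rows <= 50_000:
        return ["TabPFN-3", "TabPFN-2.6", "TabICLv2"]
    if has_gpu and dl_stack:
        return ["TabM", "RealMLP"]
    # Large data, CPU-only, or production serving.
    return ["LightGBM", "CatBoost"]


if __name__ == "__main__":
    print(shortlist(n_rows=8_000, has_gpu=True))
    print(shortlist(n_rows=2_000_000, has_gpu=False))
    print(shortlist(n_rows=8_000, has_gpu=True, must_explain=True))
```

Whatever the function returns, the linear or logistic baseline is still worth fitting first so the shortlist has something to beat.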

§ 05 · Reference sources

The boards and papers that matter.

Tabular ML spent a decade with mutually contradicting benchmarks. As of 2026 the field has converged on one living board plus a handful of primary papers — these four cover the evidence on this page.

TabArena (v0.1.4)

51 datasets · 64 rated configs · Elo + 95% CI · 2025–live

The official living benchmark for tabular ML (paper: arXiv:2506.16791). Datasets curated from 1,053 candidates across 14 prior benchmarks; every model run with defaults, tuned, and tuned + post-hoc-ensembled protocols; Elo calibrated to default Random Forest = 1000. The single board to read first in 2026.

tabarena.ai leaderboard →

TabPFN v2 Nature evaluation

Datasets ≤10K samples × 500 features · 2025

The peer-reviewed evaluation behind the foundation-model turn: TabPFN outperforms all baselines on small data, with reported speedups of 5,140× (classification) and 3,000× (regression) versus 4-hour-tuned ensembles. Hollmann et al., Nature 637, 319–326.
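As a sanity check on the classification figure, the speedup is consistent with dividing the 4-hour tuning budget by the 2.8-second TabPFN runtime quoted in the storyline above. This is a back-of-the-envelope reading, assuming those are the two quantities being compared:

```latex
\[
\frac{4\ \text{h}}{2.8\ \text{s}}
  = \frac{4 \times 3600\ \text{s}}{2.8\ \text{s}}
  = \frac{14\,400}{2.8}
  \approx 5\,143
\]
```

That rounds to the reported 5,140×.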

Nature paper →

TabArena small-data slice

36 of the 51 datasets · 2025–live

The subset where the GBDT-vs-foundation-model question is most decisively settled: TabPFN-3 (Elo 1642) ties the 4-hour AutoGluon reference pipeline and leads the best GBDT config by ~250 Elo. If your datasets are small, this is the slice that predicts your experience.

Leaderboard (small-data filter) →

TabArena paper

Benchmark methodology · 2025

Erickson, Purucker et al. The methodology behind the board: why prior tabular benchmarks disagreed (dataset licensing, leakage, weak validation protocols), how post-hoc ensembling reshuffles rankings, and the maintenance protocol that keeps the leaderboard living.

arXiv:2506.16791 →

