Time Series Forecasting
Time-series forecasting exploded in 2023-2025 when foundation models crossed over from NLP. Nixtla's TimeGPT (2023), Google's TimesFM (2024), and Amazon's Chronos showed that a single pretrained model can zero-shot forecast diverse series, rivaling task-specific statistical models like ETS and ARIMA. Yet the Monash benchmark and M-competition lineage (M4, M5) reveal an uncomfortable truth: simple ensembles of statistical methods still win on many univariate tasks. The real battle now is multivariate long-horizon forecasting, where PatchTST and iTransformer compete with state-space models like Mamba.
Time series forecasting predicts future values of temporal sequences — demand planning, financial markets, energy load, weather. Pretrained foundation models (TimesFM, Chronos, Moirai) now enable zero-shot forecasting that rivals or beats task-specific models, challenging decades of statistical and deep-learning method development.
History
1970: Box-Jenkins ARIMA methodology established for univariate time series
2017: Prophet (Facebook) makes decomposition-based forecasting accessible to practitioners
2017: DeepAR (Amazon) applies autoregressive RNNs with probabilistic outputs for demand forecasting
2019: N-BEATS achieves strong performance with a pure MLP architecture and interpretable decomposition
2021: Temporal Fusion Transformer (TFT) combines attention with multi-horizon forecasting
2023: PatchTST applies ViT-style patching to time series, setting a new transformer SOTA
2023: TSMixer (Google) shows MLPs rival transformers on long-term forecasting
2023: TimeGPT (Nixtla) launches as the first commercial time series foundation model
2024: Chronos (Amazon), TimesFM (Google), and Moirai (Salesforce) release as open time series foundation models
2024: Foundation models show zero-shot forecasting competitive with tuned statistical methods
How Time Series Forecasting Works
Data Preparation
Handle missing values, detect and adjust for seasonality/trends, and create train/validation/test splits respecting temporal order (no future leakage).
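The "no future leakage" rule means splits must respect time order, never random shuffling. A minimal sketch in pure Python (the 70/15/15 ratios are illustrative assumptions, not a standard):

```python
def temporal_split(series, train_frac=0.7, val_frac=0.15):
    """Split a time-ordered sequence into train/val/test without shuffling,
    so no future observation leaks into an earlier split."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return series[:train_end], series[train_end:val_end], series[val_end:]

train, val, test = temporal_split(list(range(100)))
# Every training timestamp precedes every validation and test timestamp.
```

Any preprocessing statistics (imputation values, scaling parameters) should likewise be fit on the training portion only and then applied to validation and test.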
Feature Engineering
Create time features (day-of-week, month, holiday), lag features, rolling statistics, and optional external covariates (weather, events).
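Lag and rolling-statistic features can be sketched in a few lines of pure Python; the lag choices `(1, 7)` and `window=3` below are illustrative assumptions you would tune to the data's seasonality:

```python
from statistics import mean

def make_lag_features(y, lags=(1, 7), window=3):
    """Build lag and rolling-mean features for one series.
    Rows before max(lags) are dropped because their lags are undefined."""
    rows = []
    start = max(max(lags), window)
    for t in range(start, len(y)):
        feats = {f"lag_{k}": y[t - k] for k in lags}
        # Rolling mean over the window ending just before t (no leakage of y[t]).
        feats[f"rollmean_{window}"] = mean(y[t - window:t])
        feats["target"] = y[t]
        rows.append(feats)
    return rows

rows = make_lag_features([10, 12, 11, 13, 14, 15, 16, 18])
```

The same rows feed directly into a gradient-boosted tree model, which is exactly the "LightGBM on lags" recipe referenced throughout this page.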
Model Selection
Choose between statistical (ARIMA, ETS), ML (LightGBM on lags), deep learning (TFT, PatchTST), or foundation models (Chronos, TimesFM).
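Whichever family you choose, it should first be compared against a seasonal-naive baseline, which repeats the last observed seasonal cycle. A sketch (the season length `m` is an assumption you set from the data, e.g. 7 for daily data with weekly seasonality):

```python
def seasonal_naive(history, horizon, m=7):
    """Forecast by repeating the last full seasonal cycle of length m.
    Surprisingly hard to beat; any candidate model should outperform it."""
    last_cycle = history[-m:]
    return [last_cycle[h % m] for h in range(horizon)]

print(seasonal_naive([1, 2, 3, 1, 2, 3], horizon=4, m=3))  # [1, 2, 3, 1]
```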
Training / Fine-Tuning
Task-specific models are trained on the target series; foundation models can be used zero-shot or fine-tuned with few examples.
Probabilistic Forecasting
Output prediction intervals via quantile regression, conformal prediction, or learned distributional parameters.
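Of the three options, split conformal prediction is the easiest to retrofit onto any point forecaster: widen the forecast by an empirical quantile of held-out calibration errors. A sketch (90% target coverage and the tiny calibration set are illustrative assumptions):

```python
def conformal_interval(point_forecasts, calib_errors, alpha=0.1):
    """Split conformal prediction: pad point forecasts with the
    (1 - alpha) empirical quantile of absolute calibration errors."""
    errs = sorted(abs(e) for e in calib_errors)
    # Conservative finite-sample quantile index, capped at the largest error.
    k = min(len(errs) - 1, int((1 - alpha) * (len(errs) + 1)))
    q = errs[k]
    return [(f - q, f + q) for f in point_forecasts]

intervals = conformal_interval([100.0, 105.0],
                               calib_errors=[-2, 1, 3, -1, 2, -3, 1, 2, -2])
```

Note this basic variant assumes exchangeable errors; for drifting series, weighted or adaptive conformal methods are the usual refinement.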
Current Landscape
Time series forecasting in 2025 is in the middle of a paradigm shift. Foundation models (Chronos, TimesFM, Moirai) can forecast new time series zero-shot, rivaling task-specific models that previously required per-series training. However, well-tuned LightGBM on engineered lag features remains extremely competitive and is the workhorse of production forecasting, while specialized deep architectures (PatchTST, iTransformer) excel on long-horizon multivariate tasks. The honest assessment: for most business use cases, a well-engineered LightGBM pipeline still wins.
Key Challenges
Distribution shift — the data-generating process changes over time (concept drift), invalidating learned patterns
Evaluation pitfalls — improper cross-validation, lookahead bias, and inconsistent metrics plague time series evaluation
Long-horizon degradation — forecast accuracy drops rapidly with prediction horizon length
Multivariate complexity — modeling dependencies between hundreds of correlated time series remains challenging
Foundation model limitations — zero-shot works for common patterns but fails on domain-specific dynamics
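The evaluation pitfalls above are typically addressed with rolling-origin (walk-forward) cross-validation instead of random k-fold. A minimal sketch of the fold generation:

```python
def rolling_origin_splits(n, initial_train, horizon, step=1):
    """Yield (train_indices, test_indices) pairs: each fold trains on all
    data up to the forecast origin and tests on the next `horizon` points,
    so the model never sees the future it is evaluated on."""
    origin = initial_train
    while origin + horizon <= n:
        yield list(range(origin)), list(range(origin, origin + horizon))
        origin += step

folds = list(rolling_origin_splits(n=10, initial_train=6, horizon=2, step=2))
# Fold 1: train on t=0..5, test on t=6..7; fold 2: train on t=0..7, test on t=8..9.
```

Averaging errors across folds also gives a rough read on long-horizon degradation: errors on later positions within each test window are typically larger.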
Quick Recommendations
Quick baseline / zero-shot
Chronos / TimesFM
Zero-shot foundation models that rival tuned models on many benchmarks
Production forecasting
LightGBM on lag features / TFT
Reliable, fast, and interpretable for business applications
Probabilistic demand planning
DeepAR / TFT
Proven at scale for inventory and supply chain forecasting
Long-term forecasting
PatchTST / iTransformer
Best transformer architectures for long-horizon prediction
What's Next
The frontier is multimodal forecasting — combining numerical time series with text (news, reports), images (satellite data), and external knowledge graphs. Foundation models will improve through pretraining on larger, more diverse time series corpora. Expect hybrid approaches that use foundation models for initialization and task-specific fine-tuning for production accuracy.
Benchmarks & SOTA
M4 Competition
M4 Forecasting Competition
100,000 time series from diverse domains (finance, demographic, macro, micro, industry, other). Competition ran in 2018. Lower sMAPE/MASE/OWA is better.
State of the Art: TiDE, 13.95 sMAPE
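The M-competition metrics are straightforward to implement. A sketch of sMAPE and MASE (the toy inputs in the test are illustrative, and this MASE uses a non-seasonal naive scale, m=1):

```python
def smape(actual, forecast):
    """Symmetric MAPE as used in M4, in percent (range 0..200)."""
    return 100 / len(actual) * sum(
        2 * abs(f - a) / (abs(a) + abs(f)) for a, f in zip(actual, forecast))

def mase(actual, forecast, insample, m=1):
    """Mean Absolute Scaled Error: out-of-sample MAE divided by the
    in-sample MAE of the seasonal-naive forecast (season length m)."""
    mae = sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)
    scale = sum(abs(insample[t] - insample[t - m])
                for t in range(m, len(insample))) / (len(insample) - m)
    return mae / scale

# MASE < 1 means the model beats the naive forecast on average.
```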
Weather
Weather Time Series Benchmark
The Weather dataset contains 21 meteorological indicators (temperature, humidity, wind speed, etc.) recorded every 10 minutes at a weather station in Germany for 2020. Widely used for long-term multivariate forecasting benchmarks. Results reported as averages across prediction horizons {96, 192, 336, 720}.
State of the Art: DLinear (THUML), 0.317 MAE
ETTh1
Electricity Transformer Temperature - hourly (ETTh1)
ETTh1 is one of four ETT benchmark datasets for long-term time series forecasting. It records electricity transformer oil temperature and load at hourly granularity from a power station in China (July 2016 – July 2018). Results reported as averages across prediction horizons {96, 192, 336, 720}.
State of the Art: Chronos-Large (Amazon), 0.588 MSE
ETTh2
Electricity Transformer Temperature - hourly 2 (ETTh2)
ETTh2 is a second hourly ETT dataset from a different transformer station in China. Results reported as averages across prediction horizons {96, 192, 336, 720}.
State of the Art: Chronos-Large (Amazon), 0.455 MSE
ETTm1
Electricity Transformer Temperature - 15-minute (ETTm1)
ETTm1 is sampled at 15-minute intervals from the same station as ETTh1. Results reported as averages across prediction horizons {96, 192, 336, 720}.
State of the Art: Chronos-Large (Amazon), 0.555 MSE
ETTm2
Electricity Transformer Temperature - 15-minute 2 (ETTm2)
ETTm2 is sampled at 15-minute intervals from the same station as ETTh2. Results reported as averages across prediction horizons {96, 192, 336, 720}.
State of the Art: TimesFM (Google Research), 0.346 MAE
Related Tasks
Time Series Classification
Assigning discrete labels to whole time series, e.g. ECG arrhythmia detection or human activity recognition.
Tabular Classification
Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain where gradient-boosted trees (XGBoost, LightGBM, CatBoost) stubbornly rival deep learning. Despite years of effort, neural approaches like TabNet (2019) and FT-Transformer (2021) only match tree methods on certain splits, and a 2022 NeurIPS study by Grinsztajn et al. confirmed that trees still dominate on medium-sized datasets. The real frontier is AutoML systems (AutoGluon, FLAML) that ensemble both paradigms, and the emerging question of whether foundation models pretrained on millions of tables can finally tip the balance.
Tabular Regression
Tabular regression — predicting continuous values from structured data — powers everything from house-price estimation to demand forecasting and shares the same tree-vs-neural tension as classification. XGBoost and LightGBM remain brutally effective defaults, but recent work on differentiable trees and table-aware transformers (TabPFN, 2022) showed that meta-learned priors can beat tuned GBDTs on small datasets in seconds. The challenge is distribution shift: real-world regression targets drift over time, and most benchmarks (UCI, Kaggle) are static snapshots that hide this problem entirely.