Time Series Forecasting
Time-series forecasting exploded in 2023-2025 when foundation models crossed over from NLP. Nixtla's TimeGPT (2023), Google's TimesFM (2024), and Amazon's Chronos showed that a single pretrained model can zero-shot forecast diverse series, rivaling task-specific statistical models like ETS and ARIMA. Yet the Monash benchmark and M-competition lineage (M4, M5) reveal an uncomfortable truth: simple ensembles of statistical methods still win on many univariate tasks. The real battle now is multivariate long-horizon forecasting, where PatchTST and iTransformer compete with state-space models like Mamba.
Time series forecasting predicts future values of temporal sequences — demand planning, financial markets, energy load, weather. Pretrained foundation models (TimesFM, Chronos, Moirai) now enable zero-shot forecasting that rivals or beats task-specific models, challenging decades of statistical and deep-learning method development.
History
1970: Box-Jenkins ARIMA methodology established for univariate time series
2017: Prophet (Facebook) makes decomposition-based forecasting accessible to practitioners
2017: DeepAR (Amazon) applies autoregressive RNNs with probabilistic outputs for demand forecasting
2019: N-BEATS achieves strong performance with a pure MLP architecture and interpretable decomposition
2021: Temporal Fusion Transformer (TFT) combines attention with multi-horizon forecasting
2023: PatchTST applies ViT-style patching to time series, setting a new transformer SOTA
2023: TSMixer (Google) shows MLPs rival transformers on long-term forecasting
2023: TimeGPT (Nixtla) launches as the first commercial time series foundation model
2024: Chronos (Amazon), TimesFM (Google), and Moirai (Salesforce) release as open time series foundation models
2024: Foundation models show zero-shot forecasting competitive with tuned statistical methods
How Time Series Forecasting Works
Data Preparation
Handle missing values, detect and adjust for seasonality/trends, and create train/validation/test splits respecting temporal order (no future leakage).
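The "no future leakage" rule means splits must respect time order, never random shuffling. A minimal sketch in pure Python (the 70/15/15 ratios are illustrative assumptions, not a standard):

```python
def temporal_split(series, train_frac=0.7, val_frac=0.15):
    """Split a time-ordered sequence into train/val/test without shuffling,
    so no future observation leaks into an earlier split."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return series[:train_end], series[train_end:val_end], series[val_end:]

train, val, test = temporal_split(list(range(100)))
# Every training timestamp precedes every validation and test timestamp.
```

Any preprocessing statistics (imputation values, scaling parameters) should likewise be fit on the training portion only and then applied to validation and test.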
Feature Engineering
Create time features (day-of-week, month, holiday), lag features, rolling statistics, and optional external covariates (weather, events).
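Lag and rolling-statistic features can be sketched in a few lines of pure Python; the lag choices `(1, 7)` and `window=3` below are illustrative assumptions you would tune to the data's seasonality:

```python
from statistics import mean

def make_lag_features(y, lags=(1, 7), window=3):
    """Build lag and rolling-mean features for one series.
    Rows before max(lags) are dropped because their lags are undefined."""
    rows = []
    start = max(max(lags), window)
    for t in range(start, len(y)):
        feats = {f"lag_{k}": y[t - k] for k in lags}
        # Rolling mean over the window ending just before t (no leakage of y[t]).
        feats[f"rollmean_{window}"] = mean(y[t - window:t])
        feats["target"] = y[t]
        rows.append(feats)
    return rows

rows = make_lag_features([10, 12, 11, 13, 14, 15, 16, 18])
```

The same rows feed directly into a gradient-boosted tree model, which is exactly the "LightGBM on lags" recipe referenced throughout this page.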
Model Selection
Choose between statistical (ARIMA, ETS), ML (LightGBM on lags), deep learning (TFT, PatchTST), or foundation models (Chronos, TimesFM).
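Whichever family you choose, it should first be compared against a seasonal-naive baseline, which repeats the last observed seasonal cycle. A sketch (the season length `m` is an assumption you set from the data, e.g. 7 for daily data with weekly seasonality):

```python
def seasonal_naive(history, horizon, m=7):
    """Forecast by repeating the last full seasonal cycle of length m.
    Surprisingly hard to beat; any candidate model should outperform it."""
    last_cycle = history[-m:]
    return [last_cycle[h % m] for h in range(horizon)]

print(seasonal_naive([1, 2, 3, 1, 2, 3], horizon=4, m=3))  # [1, 2, 3, 1]
```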
Training / Fine-Tuning
Task-specific models are trained on the target series; foundation models can be used zero-shot or fine-tuned with few examples.
Probabilistic Forecasting
Output prediction intervals via quantile regression, conformal prediction, or learned distributional parameters.
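Of the three options, split conformal prediction is the easiest to retrofit onto any point forecaster: widen the forecast by an empirical quantile of held-out calibration errors. A sketch (90% target coverage and the tiny calibration set are illustrative assumptions):

```python
def conformal_interval(point_forecasts, calib_errors, alpha=0.1):
    """Split conformal prediction: pad point forecasts with the
    (1 - alpha) empirical quantile of absolute calibration errors."""
    errs = sorted(abs(e) for e in calib_errors)
    # Conservative finite-sample quantile index, capped at the largest error.
    k = min(len(errs) - 1, int((1 - alpha) * (len(errs) + 1)))
    q = errs[k]
    return [(f - q, f + q) for f in point_forecasts]

intervals = conformal_interval([100.0, 105.0],
                               calib_errors=[-2, 1, 3, -1, 2, -3, 1, 2, -2])
```

Note this basic variant assumes exchangeable errors; for drifting series, weighted or adaptive conformal methods are the usual refinement.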
Current Landscape
Time series forecasting in 2025 is in the middle of a paradigm shift. Foundation models (Chronos, TimesFM, Moirai) can forecast new time series zero-shot, rivaling task-specific models that previously required per-series training. However, well-tuned LightGBM on engineered lag features remains extremely competitive and is the workhorse of production forecasting, while specialized deep architectures (PatchTST, iTransformer) excel on long-horizon multivariate tasks. The honest assessment: for most business use cases, a well-engineered LightGBM pipeline still wins.
Key Challenges
Distribution shift — the data-generating process changes over time (concept drift), invalidating learned patterns
Evaluation pitfalls — improper cross-validation, lookahead bias, and inconsistent metrics plague time series evaluation
Long-horizon degradation — forecast accuracy drops rapidly with prediction horizon length
Multivariate complexity — modeling dependencies between hundreds of correlated time series remains challenging
Foundation model limitations — zero-shot works for common patterns but fails on domain-specific dynamics
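The evaluation pitfalls above are typically addressed with rolling-origin (walk-forward) cross-validation instead of random k-fold. A minimal sketch of the fold generation:

```python
def rolling_origin_splits(n, initial_train, horizon, step=1):
    """Yield (train_indices, test_indices) pairs: each fold trains on all
    data up to the forecast origin and tests on the next `horizon` points,
    so the model never sees the future it is evaluated on."""
    origin = initial_train
    while origin + horizon <= n:
        yield list(range(origin)), list(range(origin, origin + horizon))
        origin += step

folds = list(rolling_origin_splits(n=10, initial_train=6, horizon=2, step=2))
# Fold 1: train on t=0..5, test on t=6..7; fold 2: train on t=0..7, test on t=8..9.
```

Averaging errors across folds also gives a rough read on long-horizon degradation: errors on later positions within each test window are typically larger.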
Quick Recommendations
Quick baseline / zero-shot
Chronos / TimesFM
Zero-shot foundation models that rival tuned models on many benchmarks
Production forecasting
LightGBM on lag features / TFT
Reliable, fast, and interpretable for business applications
Probabilistic demand planning
DeepAR / TFT
Proven at scale for inventory and supply chain forecasting
Long-term forecasting
PatchTST / iTransformer
Best transformer architectures for long-horizon prediction
What's Next
The frontier is multimodal forecasting — combining numerical time series with text (news, reports), images (satellite data), and external knowledge graphs. Foundation models will improve through pretraining on larger, more diverse time series corpora. Expect hybrid approaches that use foundation models for initialization and task-specific fine-tuning for production accuracy.
Benchmarks & SOTA
M4 Competition
M4 Forecasting Competition
100,000 time series from diverse domains (finance, demographic, macro, micro, industry, other). Competition ran in 2018. Lower sMAPE/MASE/OWA is better.
State of the Art: TiDE, 13.95 sMAPE
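The M-competition metrics are straightforward to implement. A sketch of sMAPE and MASE (the toy inputs in the test are illustrative, and this MASE uses a non-seasonal naive scale, m=1):

```python
def smape(actual, forecast):
    """Symmetric MAPE as used in M4, in percent (range 0..200)."""
    return 100 / len(actual) * sum(
        2 * abs(f - a) / (abs(a) + abs(f)) for a, f in zip(actual, forecast))

def mase(actual, forecast, insample, m=1):
    """Mean Absolute Scaled Error: out-of-sample MAE divided by the
    in-sample MAE of the seasonal-naive forecast (season length m)."""
    mae = sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)
    scale = sum(abs(insample[t] - insample[t - m])
                for t in range(m, len(insample))) / (len(insample) - m)
    return mae / scale

# MASE < 1 means the model beats the naive forecast on average.
```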
Weather
Weather Time Series Benchmark
The Weather dataset contains 21 meteorological indicators (temperature, humidity, wind speed, etc.) recorded every 10 minutes at a weather station in Germany for 2020. Widely used for long-term multivariate forecasting benchmarks. Results reported as averages across prediction horizons {96, 192, 336, 720}.
State of the Art: DLinear (THUML), 0.317 MAE
ETTh1
Electricity Transformer Temperature - hourly (ETTh1)
ETTh1 is one of four ETT benchmark datasets for long-term time series forecasting. It records electricity transformer oil temperature and load at hourly granularity from a power station in China (July 2016 – July 2018). Results reported as averages across prediction horizons {96, 192, 336, 720}.
State of the Art: Chronos-Large (Amazon), 0.588 MSE
ETTh2
Electricity Transformer Temperature - hourly 2 (ETTh2)
ETTh2 is a second hourly ETT dataset from a different transformer station in China. Results reported as averages across prediction horizons {96, 192, 336, 720}.
State of the Art: Chronos-Large (Amazon), 0.455 MSE
ETTm1
Electricity Transformer Temperature - 15-minute (ETTm1)
ETTm1 is sampled at 15-minute intervals from the same station as ETTh1. Results reported as averages across prediction horizons {96, 192, 336, 720}.
State of the Art: Chronos-Large (Amazon), 0.555 MSE
ETTm2
Electricity Transformer Temperature - 15-minute 2 (ETTm2)
ETTm2 is sampled at 15-minute intervals from the same station as ETTh2. Results reported as averages across prediction horizons {96, 192, 336, 720}.
State of the Art: TimesFM (Google Research), 0.346 MAE
Related Tasks
Time Series Classification
Assigning discrete labels to whole time series, e.g. ECG arrhythmia detection or human activity recognition.
Tabular Classification
Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain where gradient-boosted trees (XGBoost, LightGBM, CatBoost) stubbornly rival deep learning. Despite years of effort, neural approaches like TabNet (2019) and FT-Transformer (2021) only match tree methods on certain splits, and a 2022 NeurIPS study by Grinsztajn et al. confirmed that trees still dominate on medium-sized datasets. The real frontier is AutoML systems (AutoGluon, FLAML) that ensemble both paradigms, and the emerging question of whether foundation models pretrained on millions of tables can finally tip the balance.
Tabular Regression
Tabular regression — predicting continuous values from structured data — powers everything from house-price estimation to demand forecasting and shares the same tree-vs-neural tension as classification. XGBoost and LightGBM remain brutally effective defaults, but recent work on differentiable trees and table-aware transformers (TabPFN, 2022) showed that meta-learned priors can beat tuned GBDTs on small datasets in seconds. The challenge is distribution shift: real-world regression targets drift over time, and most benchmarks (UCI, Kaggle) are static snapshots that hide this problem entirely.