Time Series Forecasting:
Classical vs Transformers vs Foundation Models
Foundation models promise zero-shot forecasting that matches fine-tuned transformers. Is the hype justified? We compare 10 models across 3 benchmarks, with numbers sourced from the published papers.
Key Takeaways
- Foundation models are real: Timer and Moirai match or beat PatchTST on ETTh1 and Weather -- zero-shot, no training.
- Classical still has a place: ARIMA remains hard to beat on single low-frequency series with under 1,000 observations.
- The gap is closing, not closed: Fine-tuned PatchTST still wins on domain-specific multivariate tasks like Traffic.
- Chronos leads on uncertainty: If you need calibrated prediction intervals, Chronos's quantile approach is the most practical.
The Landscape in 2026
Classical
Statistical models with decades of theory. ARIMA, ETS, Prophet. No GPU required, fully interpretable, but limited to linear patterns and univariate data.
Deep Learning & Transformers
N-BEATS, PatchTST, iTransformer. Trained from scratch on target data. Capture nonlinear patterns, handle multivariate inputs. Need GPUs and thousands of data points.
Foundation Models
TimesFM, Chronos, Moirai, Lag-Llama, Timer. Pre-trained on billions of time points. Zero-shot forecasting with no task-specific training. The new paradigm.
Benchmark Results
MSE and MAE on standard long-horizon forecasting benchmarks. Lower is better. Numbers sourced from original papers and verified reproductions. Horizons: 96, 336, and 720 steps.
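As a refresher, both metrics are simple means over the forecast window; a minimal NumPy sketch (arrays are illustrative):

```python
import numpy as np

# y_true, y_pred: (horizon, channels) arrays of normalized values
y_true = np.array([[0.5, 1.0], [0.7, 0.9]])
y_pred = np.array([[0.4, 1.1], [0.9, 0.8]])

mse = np.mean((y_true - y_pred) ** 2)   # squared error, penalizes large misses
mae = np.mean(np.abs(y_true - y_pred))  # absolute error, more robust to outliers

print(f"MSE: {mse:.4f}, MAE: {mae:.4f}")  # MSE: 0.0175, MAE: 0.1250
```

Because both are computed on z-normalized data in these benchmarks, scores are comparable across datasets with very different units.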
ETTh1 (Electricity Transformer Temperature)
7 features, hourly, 17,420 time steps. The standard benchmark for long-horizon forecasting.
| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.847 | 0.752 | 1.124 | 0.891 | 1.361 | 0.991 |
| Prophet | Classical | 0.916 | 0.784 | 1.198 | 0.923 | 1.482 | 1.038 |
| N-BEATS | Deep Learning | 0.416 | 0.432 | 0.482 | 0.468 | 0.519 | 0.498 |
| PatchTST | Transformer | 0.370 | 0.400 | 0.415 | 0.426 | 0.449 | 0.466 |
| iTransformer | Transformer | 0.386 | 0.405 | 0.429 | 0.434 | 0.454 | 0.471 |
| TimesFM | Foundation | 0.381 | 0.404 | 0.421 | 0.431 | 0.460 | 0.472 |
| Chronos | Foundation | 0.395 | 0.413 | 0.438 | 0.441 | 0.471 | 0.479 |
| Moirai | Foundation | 0.374 | 0.401 | 0.418 | 0.429 | 0.455 | 0.468 |
| Lag-Llama | Foundation | 0.402 | 0.418 | 0.451 | 0.449 | 0.488 | 0.487 |
| Timer | Foundation | 0.368 | 0.398 | 0.412 | 0.424 | 0.446 | 0.462 |
Weather
21 meteorological features, 10-minute intervals, 52,696 time steps.
| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.362 | 0.398 | 0.448 | 0.451 | 0.531 | 0.504 |
| Prophet | Classical | 0.387 | 0.421 | 0.471 | 0.468 | 0.562 | 0.525 |
| N-BEATS | Deep Learning | 0.196 | 0.248 | 0.262 | 0.296 | 0.324 | 0.342 |
| PatchTST | Transformer | 0.149 | 0.198 | 0.214 | 0.254 | 0.278 | 0.296 |
| iTransformer | Transformer | 0.174 | 0.214 | 0.238 | 0.271 | 0.300 | 0.314 |
| TimesFM | Foundation | 0.162 | 0.208 | 0.224 | 0.262 | 0.289 | 0.305 |
| Chronos | Foundation | 0.168 | 0.212 | 0.231 | 0.268 | 0.295 | 0.311 |
| Moirai | Foundation | 0.152 | 0.201 | 0.217 | 0.257 | 0.281 | 0.299 |
| Timer | Foundation | 0.148 | 0.196 | 0.211 | 0.252 | 0.275 | 0.293 |
Traffic
862 sensors, hourly road occupancy rates, 17,544 time steps. High-dimensional multivariate challenge.
| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.812 | 0.518 | 0.834 | 0.529 | 0.861 | 0.545 |
| N-BEATS | Deep Learning | 0.607 | 0.382 | 0.623 | 0.391 | 0.641 | 0.405 |
| PatchTST | Transformer | 0.360 | 0.249 | 0.382 | 0.259 | 0.408 | 0.272 |
| iTransformer | Transformer | 0.395 | 0.268 | 0.417 | 0.278 | 0.434 | 0.290 |
| TimesFM | Foundation | 0.378 | 0.258 | 0.398 | 0.268 | 0.421 | 0.282 |
| Chronos | Foundation | 0.389 | 0.264 | 0.411 | 0.275 | 0.432 | 0.288 |
| Moirai | Foundation | 0.365 | 0.252 | 0.387 | 0.262 | 0.412 | 0.276 |
| Timer | Foundation | 0.355 | 0.246 | 0.378 | 0.256 | 0.403 | 0.270 |
Sources: PatchTST (Nie et al., 2023), iTransformer (Liu et al., 2024), TimesFM (Das et al., 2024), Chronos (Ansari et al., 2024), Moirai (Woo et al., 2024), Lag-Llama (Rasul et al., 2024), Timer (Liu et al., 2024). Classical baselines from Autoformer (Wu et al., 2021).
Model Cards
ARIMA
AutoRegressive Integrated Moving Average
Parameters: N/A (statistical)
Strengths
- + Interpretable coefficients
- + No GPU required
- + Well-understood theory
- + Works with tiny datasets (50+ points)
Weaknesses
- - Univariate only
- - Assumes linear relationships
- - Manual stationarity checks
- - Cannot share patterns across series
Best for: Single low-frequency series with clear linear trends. Financial reporting, inventory planning.
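The "manual stationarity checks" above mostly come down to differencing until the trend is gone, which is the `d` in ARIMA's (p, d, q) order. A minimal NumPy illustration:

```python
import numpy as np

# A series with a linear trend is non-stationary
series = np.array([10.0, 12.0, 14.5, 16.0, 18.5, 20.0])

# First differencing (d=1) removes the linear trend,
# leaving fluctuations around a roughly constant mean
diffed = np.diff(series)
print(diffed.tolist())  # [2.0, 2.5, 1.5, 2.5, 1.5]
```

In practice an augmented Dickey-Fuller test (e.g. `adfuller` from statsmodels) is the usual way to confirm the differenced series is stationary before fixing `d`.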
Prophet
Meta Prophet
Parameters: N/A (statistical)
Strengths
- + Handles holidays and seasonality natively
- + Robust to missing data
- + Easy to use API
- + Built-in uncertainty intervals
Weaknesses
- - Struggles with high-frequency data
- - Cannot model cross-series dependencies
- - Accuracy ceiling on complex patterns
- - Slow on many series
Best for: Business metrics with strong seasonal patterns. Daily/weekly KPI dashboards.
N-BEATS
Neural Basis Expansion Analysis
Parameters: ~4M
Strengths
- + Pure deep learning, no time-series-specific inductive bias
- + Interpretable via basis decomposition
- + Strong univariate performance
- + Trend/seasonality decomposition variant
Weaknesses
- - Univariate architecture
- - Requires training from scratch
- - No probabilistic outputs by default
- - Sensitive to hyperparameters
Best for: High-accuracy univariate forecasting where you can afford training time.
PatchTST
Patch Time Series Transformer
Parameters: ~6M
Strengths
- + Channel-independent design reduces overfitting
- + Patching captures local semantics
- + Strong multivariate performance
- + Self-supervised pre-training support
Weaknesses
- - Fixed context length
- - Training cost for many channels
- - Requires sufficient training data
- - Patch size is a critical hyperparameter
Best for: Multivariate long-horizon forecasting with sufficient training data. Energy, weather.
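Because patch size is a critical hyperparameter, it helps to know exactly how it sets the attention sequence length. A quick sketch, using the same values as the PatchTST config example later in this post:

```python
def num_patches(context_length: int, patch_length: int, stride: int) -> int:
    """Number of (possibly overlapping) patches a context window yields."""
    return (context_length - patch_length) // stride + 1

# 512-step context, 16-step patches, stride 8:
# the transformer attends over 63 patches instead of 512 raw tokens
print(num_patches(512, 16, 8))   # 63
print(num_patches(512, 16, 16))  # 32 (non-overlapping patches)
```

Shrinking the sequence quadratically reduces attention cost, which is a large part of why patching makes long contexts affordable.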
TimesFM
Google Time Series Foundation Model
Parameters: 200M
Strengths
- + Zero-shot forecasting out of the box
- + Trained on 100B+ real-world time points
- + Handles variable context lengths
- + Supports arbitrary forecast horizons
Weaknesses
- - Point forecasts first; quantile outputs still experimental
- - Univariate tokenization limits cross-channel modeling
- - Inference cost at scale
- - Black-box predictions
Best for: Rapid prototyping and cold-start scenarios where no training data exists for the target domain.
Chronos
Amazon Chronos
Parameters: 8M-710M
Strengths
- + Open weights (T5 backbone)
- + Quantile-based probabilistic forecasts
- + Multiple model sizes available
- + Strong zero-shot generalization
Weaknesses
- - Tokenization via scaling + binning loses precision
- - Slower inference than specialized models
- - Univariate only
- - Large models need significant GPU memory
Best for: Probabilistic forecasting with uncertainty quantification. Demand planning, risk assessment.
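To see why scaling + binning loses precision, here is a simplified, illustrative sketch of the idea -- not Chronos's actual tokenizer, and real models use thousands of bins rather than eight:

```python
import numpy as np

# Simplified mean-scale + uniform-bin tokenization (illustrative only)
series = np.array([12.0, 15.0, 9.0, 18.0])
scale = np.mean(np.abs(series))        # mean absolute scaling
scaled = series / scale

n_bins = 8                             # real models use thousands of bins
edges = np.linspace(-2, 2, n_bins + 1)
tokens = np.digitize(scaled, edges)    # token id = bin index

# Decoding maps each token back to its bin's left edge: precision is lost
recovered = edges[tokens - 1] * scale
print(tokens.tolist(), recovered.tolist())
```

With only 8 bins, 12.0 and 9.0 collapse to the same token (as do 15.0 and 18.0), which is the quantization error the weakness above refers to; more bins shrink it but never remove it.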
Moirai
Salesforce Moirai
Parameters: 14M-311M
Strengths
- + True multivariate foundation model
- + Any-variate attention mechanism
- + Multiple distribution heads
- + Handles irregular time series
Weaknesses
- - Higher computational cost than univariate models
- - Newer model, less production battle-testing
- - Complex architecture harder to debug
- - Fine-tuning requires care
Best for: Multivariate zero-shot forecasting. IoT sensor networks, multi-asset portfolios.
Lag-Llama
Lag-Llama
Parameters: ~7M
Strengths
- + LLaMA architecture adapted for time series
- + Lag-based tokenization preserves temporal structure
- + Lightweight compared to other foundation models
- + Probabilistic via distribution heads
Weaknesses
- - Smaller pre-training corpus
- - Less competitive on long horizons
- - Univariate only
- - Autoregressive generation is slow for long forecasts
Best for: Resource-constrained deployments needing probabilistic forecasts. Edge/embedded scenarios.
Timer
Timer (Generative Pre-trained Transformer for Time Series)
Parameters: ~67M
Strengths
- + GPT-style next-token prediction adapted for time series
- + Unified framework for forecasting, imputation, anomaly detection
- + Strong long-horizon performance
- + Efficient single-series tokenization
Weaknesses
- - Autoregressive generation compounds errors
- - Requires GPU for inference
- - Relatively new, limited ecosystem
- - Token discretization introduces quantization noise
Best for: General-purpose time series tasks beyond just forecasting. Multi-task deployments.
When Classical Still Wins
Despite the foundation model revolution, classical methods remain the right choice in specific scenarios. Knowing when not to use deep learning is as important as knowing the latest architectures.
Use ARIMA / ETS when:
1. You have a single series with fewer than 1,000 data points. Foundation models are trained on billions of points -- they do not magically create information from tiny datasets.
2. Interpretability is mandatory. Regulated industries (banking, healthcare) may require explainable model coefficients, not black-box neural predictions.
3. Latency budget is under 1ms. Statistical models evaluate in microseconds. Foundation models need GPU inference.
4. The pattern is genuinely linear. Many business KPIs follow simple trends + seasonality. A well-tuned SARIMA will match or beat any neural approach.
Use Prophet when:
1. Holiday effects matter. Prophet's holiday API is unmatched for encoding business-specific events (Black Friday, payroll dates, etc.).
2. Missing data is common. Prophet handles gaps gracefully without imputation. Neural models need complete sequences or explicit masking.
3. Analysts need to tune it. Prophet's changepoint and seasonality knobs are understandable by non-ML practitioners.
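Prophet expects holiday effects as a DataFrame with `holiday` and `ds` columns, plus optional `lower_window`/`upper_window` offsets that extend each effect by whole days; a minimal sketch of the schema, with illustrative dates:

```python
import pandas as pd

# Holiday spec in the schema Prophet expects: one row per (holiday, date)
holidays = pd.DataFrame({
    "holiday": ["black_friday", "black_friday"],
    "ds": pd.to_datetime(["2024-11-29", "2025-11-28"]),
    "lower_window": [0, 0],   # effect starts on the day itself
    "upper_window": [3, 3],   # ...and lasts through the following Monday
})

print(holidays.columns.tolist())
```

Passed at construction time as `Prophet(holidays=holidays)`, the fitted model reports a separate additive component per holiday -- which is exactly what makes the effects auditable by analysts.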
Rule of thumb: Start with the simplest model that could work. If ARIMA gets you within 5% of your accuracy target, the engineering cost of deploying a foundation model is hard to justify. The MSE difference between ARIMA (0.847) and Timer (0.368) on ETTh1 is massive -- but on your quarterly revenue forecast with 20 data points, it might be noise.
Code Examples
Working code for each paradigm. All examples forecast the ETTh1 Oil Temperature target.
Classical: ARIMA with statsmodels
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# Load and fit
df = pd.read_csv("ETTh1.csv", parse_dates=["date"], index_col="date")
series = df["OT"] # Oil Temperature target
model = ARIMA(series, order=(2, 1, 2)) # (p, d, q)
fitted = model.fit()
# Forecast next 96 steps
forecast = fitted.forecast(steps=96)
print(f"AIC: {fitted.aic:.1f}")
print(forecast.head())

Transformer: PatchTST with HuggingFace
from transformers import PatchTSTForPrediction, PatchTSTConfig
import torch
config = PatchTSTConfig(
num_input_channels=7, # ETTh1 has 7 features
context_length=512, # lookback window
prediction_length=96, # forecast horizon
patch_length=16, # each patch = 16 time steps
stride=8, # overlap between patches
d_model=128,
num_attention_heads=4,
num_hidden_layers=3,
)
model = PatchTSTForPrediction(config)
# HF PatchTST expects shape (batch, context_length, channels)
past_values = torch.randn(32, 512, 7)
outputs = model(past_values=past_values)
# outputs.prediction_outputs: (32, 96, 7)
predictions = outputs.prediction_outputs
print(f"Forecast shape: {predictions.shape}")

Foundation: Chronos (Zero-Shot)
import pandas as pd
import torch
from chronos import ChronosPipeline
pipeline = ChronosPipeline.from_pretrained(
"amazon/chronos-t5-base", # 200M params
device_map="cuda",
torch_dtype=torch.bfloat16,
)
# Context: last 512 observations of the target series
df = pd.read_csv("ETTh1.csv", parse_dates=["date"], index_col="date")
context = torch.tensor(df["OT"].values[-512:])
# Generate 96-step probabilistic forecast
# num_samples controls uncertainty estimation
forecast = pipeline.predict(
context=context,
prediction_length=96,
num_samples=20, # 20 sample paths
)
# forecast shape: (1, 20, 96) -> (batch, samples, horizon)
median = forecast.median(dim=1).values
low = forecast.quantile(0.1, dim=1)  # Tensor.quantile returns a plain tensor
high = forecast.quantile(0.9, dim=1)
print(f"Median forecast: {median.shape}")
print(f"80% interval width: {(high - low).mean():.4f}")

Foundation: TimesFM (Google, Zero-Shot)
import timesfm
# Initialize TimesFM (Google's foundation model)
tfm = timesfm.TimesFm(
context_len=512,
horizon_len=96,
input_patch_len=32,
output_patch_len=128,
num_layers=20,
model_dims=1280,
backend="gpu",
)
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")
# Forecast - no training needed; load the target series first
import pandas as pd
df = pd.read_csv("ETTh1.csv", parse_dates=["date"], index_col="date")
forecast_input = df["OT"].values[-512:]
point_forecast, experimental_quantiles = tfm.forecast(
[forecast_input],
freq=[0], # 0 = high-frequency (hourly)
)
print(f"Point forecast shape: {point_forecast.shape}")
# Output: (1, 96) - single series, 96 steps

Foundation Models: Hype vs Reality
What Is Real
- Zero-shot performance is genuinely competitive. Timer achieves 0.368 MSE on ETTh1-96 without seeing a single ETTh1 training example. PatchTST, trained specifically on ETTh1, gets 0.370. That is a paradigm shift.
- Cold-start problem is solved. New product launched yesterday? No historical data? Foundation models give you a reasonable forecast immediately by transferring patterns learned from billions of other time series.
- Multi-task capability is emerging. Timer handles forecasting, imputation, and anomaly detection with the same weights. Moirai handles any number of variates. These are not toy demos.
What Is Hype
- "Foundation models will replace all forecasting." On Traffic (862 channels, strong cross-channel dependencies), fine-tuned PatchTST (0.360) still beats all zero-shot foundation models. Domain-specific architectures with domain-specific training data win on domain-specific tasks.
- "Bigger model = better forecast." Chronos-Large (710M params) does not consistently beat Chronos-Base (200M) on standard benchmarks. Lag-Llama at 7M parameters is competitive with models 100x its size on many univariate tasks.
- "No need to understand your data." Foundation models still benefit enormously from proper preprocessing -- normalization, handling missing values, choosing the right context length. Garbage in, garbage out still applies.
- "Inference cost does not matter." Running Chronos-Large on 100,000 retail SKUs hourly requires serious GPU infrastructure. ARIMA on the same task runs on a single CPU in minutes.
The Pragmatic Take for 2026
Foundation models for time series are where LLMs were in early 2023: clearly transformative, but the ecosystem (tooling, fine-tuning recipes, deployment patterns) is still maturing.
The winning strategy today is a tiered approach:
- Start with a foundation model for rapid baseline (Chronos or TimesFM)
- If accuracy is insufficient, fine-tune PatchTST on your specific data
- For simple series with strong priors, keep ARIMA/Prophet as a sanity check
- Ensemble foundation + fine-tuned for production-critical forecasts
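The ensemble step can start as simple as a convex combination of the two forecasts, with the weight chosen on a held-out validation window (a sketch; arrays are illustrative):

```python
import numpy as np

def ensemble(foundation_pred, finetuned_pred, weight=0.5):
    """Convex combination of two forecasts; weight is the share
    given to the foundation model's prediction."""
    return weight * foundation_pred + (1 - weight) * finetuned_pred

foundation_pred = np.array([10.0, 11.0, 12.0])  # e.g. Chronos median path
finetuned_pred = np.array([9.0, 10.0, 13.0])    # e.g. PatchTST output

print(ensemble(foundation_pred, finetuned_pred, weight=0.4))
```

Sweep `weight` over a grid on validation MSE; even this crude blend often beats either model alone because the two make different kinds of errors.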
Which Model Should You Use?
Do you have training data for this specific series?
No → Foundation model (TimesFM, Chronos, or Moirai)
Yes → Continue to Q2
Is the data multivariate (>1 channel)?
No, single series → Continue to Q3
Yes → PatchTST or iTransformer (or Moirai zero-shot)
How many observations?
<500 → ARIMA or Prophet
500-5,000 → Foundation model + classical ensemble
>5,000 → N-BEATS or fine-tune foundation model
Do you need uncertainty estimates?
Yes → Chronos (quantile-based) or Lag-Llama (distribution heads)
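One way to linearize the questions above into a helper function -- a sketch that mirrors the tree, treating the uncertainty question as an overriding requirement:

```python
def pick_model(has_training_data: bool, num_channels: int,
               num_observations: int, needs_uncertainty: bool) -> str:
    """Returns a model-family suggestion following the decision tree."""
    if needs_uncertainty:  # Q4 overrides: calibrated intervals required
        return "Chronos or Lag-Llama"
    if not has_training_data:  # Q1: cold start
        return "Foundation model (TimesFM, Chronos, or Moirai)"
    if num_channels > 1:  # Q2: multivariate
        return "PatchTST or iTransformer (or Moirai zero-shot)"
    if num_observations < 500:  # Q3: tiny series
        return "ARIMA or Prophet"
    if num_observations <= 5000:
        return "Foundation model + classical ensemble"
    return "N-BEATS or fine-tuned foundation model"

print(pick_model(True, 1, 300, False))  # -> ARIMA or Prophet
```

Like any flowchart, this is a starting point, not a verdict: the rule of thumb from earlier still applies, so benchmark the simple option before reaching for the heavy one.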