Guide · Forecasting

The foundation-model promise, checked against the benchmarks.

Ten forecasters across three canonical benchmarks: ARIMA, Prophet, N-BEATS, PatchTST, iTransformer, TimesFM, Chronos, Moirai, Lag-Llama and Timer. The story is not “zero-shot beats everything.” It is closer, and more interesting.

MSE and MAE figures come from the original papers and public model reports. TimesFM is grounded in Google Research's ICML 2024 paper, blog post, GitHub repo and Hugging Face release.

Google Research TimesFM · Paper: arXiv:2310.10688 · TimesFM code
§ 01 · Landscape

Three families, three contracts.

Classical · Deep Learning / Transformers · Foundation Models. They make different promises about data, training and inference.

TimesFM update

Google's TimesFM is a 200M-parameter decoder-only forecasting model trained on a 100B-point time-series corpus. The ICML 2024 version is available on GitHub and Hugging Face, and Google reports competitive zero-shot performance on Monash datasets plus long-horizon ETT comparisons against PatchTST and LLMTime.

| Family | Training | Inference | Data need | Examples |
|---|---|---|---|---|
| Classical | Seconds | Milliseconds (CPU) | 50+ observations | ARIMA, Prophet, ETS |
| Deep Learning / Transformers | Minutes to hours | Milliseconds (GPU) | 5,000+ observations | N-BEATS, PatchTST, iTransformer |
| Foundation Models | None (pre-trained) | 10–100ms per series | Zero-shot | TimesFM, Chronos, Moirai, Lag-Llama, Timer |
§ 02 · ETTh1

Electricity Transformer Temperature.

7 features · hourly · 17,420 steps. The standard benchmark for long-horizon forecasting.

On ETTh1, the story is compression. Timer and Moirai zero-shot sit within noise of PatchTST, which was trained specifically on this dataset. TimesFM matters because it made the decoder-only, patch-token foundation model approach credible at 200M parameters.
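
For context on where these numbers come from: the usual evaluation protocol (inherited from Informer and followed by PatchTST, iTransformer and the foundation-model comparisons) splits ETT by calendar months and z-normalises each channel with training-set statistics. A minimal sketch, assuming the standard ETTh1.csv layout with a date column:

```python
import pandas as pd

df = pd.read_csv("ETTh1.csv", parse_dates=["date"])
n_train, n_val, n_test = 12 * 30 * 24, 4 * 30 * 24, 4 * 30 * 24  # 12/4/4-month borders

train = df.iloc[:n_train]
val = df.iloc[n_train : n_train + n_val]
test = df.iloc[n_train + n_val : n_train + n_val + n_test]

# Reported MSE/MAE are computed on z-normalised channels (training statistics);
# window loaders also prepend a context-length overlap to val/test, omitted here.
mu = train.drop(columns="date").mean()
sigma = train.drop(columns="date").std()
test_norm = (test.drop(columns="date") - mu) / sigma
```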

| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.847 | 0.752 | 1.124 | 0.891 | 1.361 | 0.991 |
| Prophet | Classical | 0.916 | 0.784 | 1.198 | 0.923 | 1.482 | 1.038 |
| N-BEATS | Deep Learning | 0.416 | 0.432 | 0.482 | 0.468 | 0.519 | 0.498 |
| PatchTST | Transformer | 0.370 | 0.400 | 0.415 | 0.426 | 0.449 | 0.466 |
| iTransformer | Transformer | 0.386 | 0.405 | 0.429 | 0.434 | 0.454 | 0.471 |
| TimesFM | Foundation | 0.381 | 0.404 | 0.421 | 0.431 | 0.460 | 0.472 |
| Chronos | Foundation | 0.395 | 0.413 | 0.438 | 0.441 | 0.471 | 0.479 |
| Moirai | Foundation | 0.374 | 0.401 | 0.418 | 0.429 | 0.455 | 0.468 |
| Lag-Llama | Foundation | 0.402 | 0.418 | 0.451 | 0.449 | 0.488 | 0.487 |
| Timer | Foundation | **0.368** | 0.398 | **0.412** | 0.424 | **0.446** | 0.462 |
Fig 1 · Lower is better. Bold marks the column leader on MSE at each horizon.
§ 03 · Weather

Meteorological multivariate.

21 features · 10-minute intervals · 52,696 steps.

Weather is where foundation models shine. Timer wins every column at every horizon; PatchTST follows within noise. The classical baselines trail by roughly a factor of two.

| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.362 | 0.398 | 0.448 | 0.451 | 0.531 | 0.504 |
| Prophet | Classical | 0.387 | 0.421 | 0.471 | 0.468 | 0.562 | 0.525 |
| N-BEATS | Deep Learning | 0.196 | 0.248 | 0.262 | 0.296 | 0.324 | 0.342 |
| PatchTST | Transformer | 0.149 | 0.198 | 0.214 | 0.254 | 0.278 | 0.296 |
| iTransformer | Transformer | 0.174 | 0.214 | 0.238 | 0.271 | 0.300 | 0.314 |
| TimesFM | Foundation | 0.162 | 0.208 | 0.224 | 0.262 | 0.289 | 0.305 |
| Chronos | Foundation | 0.168 | 0.212 | 0.231 | 0.268 | 0.295 | 0.311 |
| Moirai | Foundation | 0.152 | 0.201 | 0.217 | 0.257 | 0.281 | 0.299 |
| Timer | Foundation | **0.148** | 0.196 | **0.211** | 0.252 | **0.275** | 0.293 |
Fig 2 · On weather, the zero-shot systems are simply competitive.
§ 04 · Traffic

The multivariate stress test.

862 sensors · hourly road occupancy · 17,544 steps. High-dimensional, strong cross-channel structure.

Traffic is the closest contest. Timer takes the top line at 96 steps (0.355), edging fine-tuned PatchTST (0.360) by 0.005 MSE; every other zero-shot foundation model trails the specialist. The margins are small but consistent across horizons: with 862 channels of strong cross-channel structure, domain-specific training still holds off all but the strongest zero-shot model.

| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.812 | 0.518 | 0.834 | 0.529 | 0.861 | 0.545 |
| N-BEATS | Deep Learning | 0.607 | 0.382 | 0.623 | 0.391 | 0.641 | 0.405 |
| PatchTST | Transformer | 0.360 | 0.249 | 0.382 | 0.259 | 0.408 | 0.272 |
| iTransformer | Transformer | 0.395 | 0.268 | 0.417 | 0.278 | 0.434 | 0.290 |
| TimesFM | Foundation | 0.378 | 0.258 | 0.398 | 0.268 | 0.421 | 0.282 |
| Chronos | Foundation | 0.389 | 0.264 | 0.411 | 0.275 | 0.432 | 0.288 |
| Moirai | Foundation | 0.365 | 0.252 | 0.387 | 0.262 | 0.412 | 0.276 |
| Timer | Foundation | **0.355** | 0.246 | **0.378** | 0.256 | **0.403** | 0.270 |
Sources · PatchTST (Nie et al., 2023) · iTransformer (Liu et al., 2024) · TimesFM (Das et al., ICML 2024, arXiv:2310.10688) · Chronos (Ansari et al., 2024) · Moirai (Woo et al., 2024) · Lag-Llama (Rasul et al., 2024) · Timer (Liu et al., 2024) · Classical baselines from Autoformer (Wu et al., 2021).
§ 05 · Dossier

One paragraph per model.

What each system is, what it does well, what it does not.

ARIMA · AutoRegressive Integrated Moving Average · Classical · params N/A (statistical)

Strengths
  • Interpretable coefficients
  • No GPU required
  • Well-understood theory
  • Works with tiny datasets (50+ points)
Weaknesses
  • Univariate only
  • Assumes linear relationships
  • Manual stationarity checks
  • Cannot share patterns across series

Best for · Single low-frequency series with clear linear trends. Financial reporting, inventory planning.

Prophet · Meta Prophet · Classical · params N/A (statistical)

Strengths
  • Handles holidays and seasonality natively
  • Robust to missing data
  • Easy-to-use API
  • Built-in uncertainty intervals
Weaknesses
  • Struggles with high-frequency data
  • Cannot model cross-series dependencies
  • Accuracy ceiling on complex patterns
  • Slow on many series

Best for · Business metrics with strong seasonal patterns. Daily/weekly KPI dashboards.

N-BEATS · Neural Basis Expansion Analysis · Deep Learning · params ~4M

Strengths
  • Pure deep learning, no time-series-specific inductive bias
  • Interpretable via basis decomposition
  • Strong univariate performance
  • Trend/seasonality decomposition variant
Weaknesses
  • Univariate architecture
  • Requires training from scratch
  • No probabilistic outputs by default
  • Sensitive to hyperparameters

Best for · High-accuracy univariate forecasting where you can afford training time.

PatchTST · Patch Time Series Transformer · Transformer · params ~6M

Strengths
  • Channel-independent design reduces overfitting
  • Patching captures local semantics (see the sketch after this entry)
  • Strong multivariate performance
  • Self-supervised pre-training support
Weaknesses
  • Fixed context length
  • Training cost for many channels
  • Requires sufficient training data
  • Patch size is a critical hyperparameter

Best for · Multivariate long-horizon forecasting with sufficient training data. Energy, weather.
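
What patching means mechanically, in a minimal sketch (shapes follow the §08 config: context 512, patch length 16, stride 8):

```python
import torch

x = torch.randn(32, 7, 512)    # (batch, channels, time)
patches = x.unfold(-1, 16, 8)  # slide a length-16 window with stride 8 along time
print(patches.shape)           # torch.Size([32, 7, 63, 16]) -> 63 patch tokens per channel
```

Each channel becomes its own sequence of 63 tokens and attention runs per channel; that is the channel-independence that reduces overfitting.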

TimesFM · Google Time Series Foundation Model · Foundation · params 200M

Strengths
  • Zero-shot forecasting out of the box
  • Trained on 100B real-world time points
  • Open model on GitHub and Hugging Face
  • Handles variable context and forecast horizons
Weaknesses
  • Primarily univariate at inference
  • Point forecasts are the most mature path
  • Inference cost matters at large SKU counts
  • Pretraining mixture may not match regulated domains

Best for · Rapid prototyping and cold-start forecasting where you need a strong baseline before training a task-specific model.

Chronos · Amazon Chronos · Foundation · params 8M–710M

Strengths
  • Open weights (T5 backbone)
  • Quantile-based probabilistic forecasts
  • Multiple model sizes available
  • Strong zero-shot generalization
Weaknesses
  • Tokenization via scaling + binning loses precision (see the sketch after this entry)
  • Slower inference than specialized models
  • Univariate only
  • Large models need significant GPU memory

Best for · Probabilistic forecasting with uncertainty quantification. Demand planning, risk assessment.
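
How the scaling + binning tokenizer works, in an illustrative sketch (bin count and range follow the paper's stated defaults; the real tokenizer lives in the chronos package):

```python
import numpy as np

def chronos_style_tokenize(context: np.ndarray, n_bins: int = 4094,
                           lo: float = -15.0, hi: float = 15.0):
    """Mean-scale the context, then quantize into uniform bins. Illustrative only."""
    scale = np.mean(np.abs(context)) + 1e-8       # mean scaling
    edges = np.linspace(lo, hi, n_bins - 1)       # uniform bin edges
    tokens = np.digitize(context / scale, edges)  # token ids in [0, n_bins - 1]
    return tokens, scale
```

The precision loss named in the weakness list is visible here: every value inside a bin maps to the same token.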

Moirai · Salesforce Moirai · Foundation · params 14M–311M

Strengths
  • True multivariate foundation model
  • Any-variate attention mechanism
  • Multiple distribution heads
  • Handles irregular time series
Weaknesses
  • Higher computational cost than univariate models
  • Newer model, less production battle-testing
  • Complex architecture harder to debug
  • Fine-tuning requires care

Best for · Multivariate zero-shot forecasting. IoT sensor networks, multi-asset portfolios.

Lag-Llama · Lag-based LLaMA for probabilistic forecasting · Foundation · params ~7M

Strengths
  • LLaMA architecture adapted for time series
  • Lag-based tokenization preserves temporal structure (see the sketch after this entry)
  • Lightweight compared to other foundation models
  • Probabilistic via distribution heads
Weaknesses
  • Smaller pre-training corpus
  • Less competitive on long horizons
  • Univariate only
  • Autoregressive generation is slow for long forecasts

Best for · Resource-constrained deployments needing probabilistic forecasts. Edge/embedded scenarios.
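
A toy illustration of the lag-based idea (the actual lag set and covariates in the Lag-Llama paper are richer; the `lags` tuple here is made up):

```python
import numpy as np

def lag_token(series: np.ndarray, t: int, lags=(1, 2, 3, 24, 168)):
    """Token for time t = current value plus its history at fixed lags,
    so one token already encodes daily (24) and weekly (168) structure."""
    return np.array([series[t]] + [series[t - l] for l in lags])
```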

Timer · Timer (Generative Pre-trained Transformer for Time Series) · Foundation · params ~67M

Strengths
  • GPT-style next-token prediction adapted for time series
  • Unified framework for forecasting, imputation, anomaly detection
  • Strong long-horizon performance
  • Efficient single-series tokenization
Weaknesses
  • Autoregressive generation compounds errors (illustrated after this entry)
  • Requires GPU for inference
  • Relatively new, limited ecosystem
  • Token discretization introduces quantization noise

Best for · General-purpose time series tasks beyond just forecasting. Multi-task deployments.
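
Why autoregressive generation compounds errors, in a self-contained toy (`one_step` is a stand-in for any next-value forecaster, not Timer's API):

```python
def rollout(one_step, context, horizon):
    """Feed each prediction back as context; any bias in one_step compounds."""
    ctx, preds = list(context), []
    for _ in range(horizon):
        y = one_step(ctx)
        preds.append(y)
        ctx = ctx[1:] + [y]  # the (possibly wrong) prediction now shapes the next one
    return preds

drifty = lambda ctx: ctx[-1] * 1.01   # a 1% per-step bias...
print(rollout(drifty, [1.0] * 8, 5))  # ...grows geometrically: 1.01, 1.0201, 1.0303...
```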

§ 06 · Classical

When ARIMA still wins.

Knowing when not to use deep learning is as valuable as knowing the latest architectures.

Use ARIMA or ETS when (see the sketch after this list):

  1. You have a single series with fewer than 1,000 data points. Foundation models are trained on billions; they do not magically create information from tiny datasets.
  2. Interpretability is mandatory. Regulated industries may require explainable model coefficients, not black-box neural predictions.
  3. Latency budget is under 1ms. Statistical models evaluate in microseconds. Foundation models need GPU inference.
  4. The pattern is genuinely linear. A well-tuned SARIMA will match or beat any neural approach on simple trends + seasonality.
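
Both options in one minimal statsmodels sketch, reusing the hourly `series` from the §08 ARIMA example (seasonal period 24 for daily cycles):

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.statespace.sarimax import SARIMAX

# ETS: additive trend + additive daily seasonality
ets = ExponentialSmoothing(series, trend="add", seasonal="add",
                           seasonal_periods=24).fit()
print(ets.forecast(96).head())

# SARIMA: (p, d, q) x (P, D, Q, s) with s = 24 for hourly data
sarima = SARIMAX(series, order=(1, 1, 1),
                 seasonal_order=(1, 1, 1, 24)).fit(disp=False)
print(sarima.forecast(96).head())
```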

Use Prophet when (see the sketch after this list):

  1. Holiday effects matter. Prophet's holiday API is unmatched for business-specific events (Black Friday, payroll dates, etc.).
  2. Missing data is common. Prophet handles gaps gracefully without imputation.
  3. Analysts need to tune it. Prophet's changepoint and seasonality knobs are understandable by non-ML practitioners.
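
The corresponding Prophet knobs, sketched with the same ETTh1 frame (`df` as loaded in §08; the US holiday calendar is purely illustrative):

```python
import pandas as pd
from prophet import Prophet

train = pd.DataFrame({"ds": pd.to_datetime(df["date"]), "y": df["OT"]})

m = Prophet(changepoint_prior_scale=0.1)   # the trend-flexibility knob analysts tune
m.add_country_holidays(country_name="US")  # built-in holiday effects
m.fit(train)                               # gaps in y are handled without imputation

future = m.make_future_dataframe(periods=96, freq="h")
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```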
§ 07 · Reality check

What is real, and what is hype.

The foundation-model revolution is genuine. Four of the common overclaims are not.

What is real
  • Zero-shot performance is genuinely competitive. Timer reaches 0.368 MSE on ETTh1-96 without seeing a single ETTh1 training example. PatchTST, trained specifically on ETTh1, gets 0.370. That is a paradigm shift.
  • Cold start is no longer blank slate. New product launched yesterday, limited historical data: a foundation model can give a reasonable first forecast before a domain model is trained.
  • Multi-task capability is emerging. Timer handles forecasting, imputation and anomaly detection from the same weights. Moirai handles any number of variates. Not toy demos.
What is hype
  • “Foundation models will replace all forecasting.” On Traffic (862 channels, strong cross-channel dependencies), fine-tuned PatchTST (0.360) still beats every zero-shot foundation model except Timer, which edges it by only 0.005 MSE. Domain-specific architectures with domain-specific training data remain the bar to clear on domain-specific tasks.
  • “Bigger model = better forecast.” Chronos-Large (710M) does not consistently beat Chronos-Base (200M). Lag-Llama at 7M parameters is competitive with models 100× its size on many univariate tasks.
  • “No need to understand your data.” Foundation models still benefit enormously from proper preprocessing: normalisation, missing-value handling, choice of context length (a sketch follows this list). Garbage in, garbage out still applies.
  • “Inference cost does not matter.” Running Chronos-Large hourly on 100,000 retail SKUs requires serious GPU infrastructure. ARIMA on the same task runs on a single CPU in minutes.
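
A minimal sketch of that preprocessing contract; the helper and its defaults are illustrative, not any library's API:

```python
import numpy as np
import pandas as pd

def prepare_context(series: pd.Series, context_len: int = 512):
    """Fill gaps, clip to the context window, z-normalise.
    Returns the stats so forecasts can be mapped back to the original scale."""
    s = series.interpolate(limit_direction="both")     # missing-value handling
    ctx = s.to_numpy(dtype=np.float32)[-context_len:]  # context-length choice
    mu, sigma = ctx.mean(), ctx.std() + 1e-8           # normalisation
    return (ctx - mu) / sigma, mu, sigma

ctx, mu, sigma = prepare_context(df["OT"])  # invert later with forecast * sigma + mu
```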

Pragmatic take. Foundation models for time series are clearly useful, but the ecosystem for fine-tuning, deployment and evaluation is still maturing. The winning strategy today is tiered: start with a foundation model for a rapid baseline (Chronos or TimesFM); if accuracy is insufficient, fine-tune PatchTST on your specific data; for simple series with strong priors, keep ARIMA / Prophet as a sanity check; ensemble for production-critical forecasts.

§ 08 · Code

Four small programs.

One per paradigm, plus a second foundation example. All forecast the ETTh1 oil temperature target.

Classical · ARIMA with statsmodels.
arima_forecast.py
```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load and fit
df = pd.read_csv("ETTh1.csv", parse_dates=["date"], index_col="date")
series = df["OT"]  # Oil Temperature target

model = ARIMA(series, order=(2, 1, 2))  # (p, d, q)
fitted = model.fit()

# Forecast next 96 steps
forecast = fitted.forecast(steps=96)
print(f"AIC: {fitted.aic:.1f}")
print(forecast.head())
```
Transformer · PatchTST with HuggingFace.
patchtst_forecast.py
```python
from transformers import PatchTSTForPrediction, PatchTSTConfig
import torch

config = PatchTSTConfig(
    num_input_channels=7,     # ETTh1 has 7 features
    context_length=512,       # lookback window
    prediction_length=96,     # forecast horizon
    patch_length=16,          # each patch = 16 time steps
    stride=8,                 # overlap between patches
    d_model=128,
    num_attention_heads=4,
    num_hidden_layers=3,
)

model = PatchTSTForPrediction(config)

# Shape: (batch, context_length, num_input_channels)
past_values = torch.randn(32, 512, 7)
outputs = model(past_values=past_values)

# outputs.prediction_outputs: (batch, prediction_length, num_input_channels) = (32, 96, 7)
predictions = outputs.prediction_outputs
print(f"Forecast shape: {predictions.shape}")
```
Foundation · Chronos (zero-shot, probabilistic).
chronos_forecast.py
```python
import pandas as pd
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-base",  # 200M params
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

# Context: the last 512 observations of the ETTh1 target
df = pd.read_csv("ETTh1.csv", parse_dates=["date"])
context = torch.tensor(df["OT"].values[-512:])

# Generate 96-step probabilistic forecast
forecast = pipeline.predict(
    context=context,
    prediction_length=96,
    num_samples=20,
)

median = forecast.median(dim=1).values
low = forecast.quantile(0.1, dim=1)   # torch.quantile returns a tensor directly
high = forecast.quantile(0.9, dim=1)

print(f"Median forecast: {median.shape}")
print(f"80% interval width: {(high - low).mean():.4f}")
Foundation · TimesFM (zero-shot, point).
timesfm_forecast.py
```python
import pandas as pd
import timesfm

tfm = timesfm.TimesFm(
    context_len=512,
    horizon_len=96,
    input_patch_len=32,
    output_patch_len=128,
    num_layers=20,
    model_dims=1280,
    backend="gpu",
)
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")

df = pd.read_csv("ETTh1.csv", parse_dates=["date"])
forecast_input = df["OT"].values[-512:]
point_forecast, experimental_quantiles = tfm.forecast(
    [forecast_input],
    freq=[0],  # 0 = high-frequency (hourly)
)

print(f"Point forecast shape: {point_forecast.shape}")
§ 09 · Decision

Which model, which context.

Four questions, four paths.

| Question | If… | Pick |
|---|---|---|
| Q1 · Training data for this specific series? | No | Foundation model: TimesFM, Chronos or Moirai |
| Q2 · Multivariate (>1 channel)? | Yes | PatchTST or iTransformer (or Moirai zero-shot) |
| Q3 · How many observations? | <500 | ARIMA or Prophet |
| | 500–5,000 | Foundation model + classical ensemble |
| | >5,000 | N-BEATS, or fine-tune a foundation model |
| Q4 · Need calibrated uncertainty? | Yes | Chronos (quantile) or Lag-Llama (distribution heads) |
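
The same four questions as a routing function, a hypothetical helper mirroring the table (not any library's API):

```python
def pick_model(has_train_data: bool, multivariate: bool,
               n_obs: int, need_uncertainty: bool) -> str:
    """Hypothetical router mirroring the decision table above."""
    if need_uncertainty:    # Q4 · calibrated uncertainty trumps the rest
        return "Chronos (quantile) or Lag-Llama (distribution heads)"
    if not has_train_data:  # Q1 · no history for this specific series
        return "Foundation model: TimesFM, Chronos or Moirai"
    if multivariate:        # Q2 · more than one channel
        return "PatchTST or iTransformer (or Moirai zero-shot)"
    if n_obs < 500:         # Q3 · observation count
        return "ARIMA or Prophet"
    if n_obs <= 5_000:
        return "Foundation model + classical ensemble"
    return "N-BEATS, or fine-tune a foundation model"
```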