
Time Series Forecasting:
Classical vs Transformers vs Foundation Models

Foundation models promise zero-shot forecasting that matches fine-tuned transformers. Is the hype justified? We compare 9 models across 3 benchmarks with real numbers from published papers.

March 2026 | 20 min read | 9 models compared

Key Takeaways

  • Foundation models are real: Timer and Moirai match or beat PatchTST on ETTh1 and Weather -- zero-shot, no training.
  • Classical still has a place: ARIMA remains hard to beat on single low-frequency series with under 1,000 observations.
  • The gap is closing, not closed: Fine-tuned PatchTST still wins on domain-specific multivariate tasks like Traffic.
  • Chronos leads on uncertainty: If you need calibrated prediction intervals, Chronos's quantile approach is the most practical.

The Landscape in 2026


Classical

Statistical models with decades of theory. ARIMA, ETS, Prophet. No GPU required, fully interpretable, but limited to linear patterns and univariate data.

Training: seconds
Inference: milliseconds
Data needed: 50+ observations

Deep Learning & Transformers

N-BEATS, PatchTST, iTransformer. Trained from scratch on target data. Capture nonlinear patterns, handle multivariate inputs. Need GPUs and thousands of data points.

Training: minutes to hours
Inference: milliseconds
Data needed: 5,000+ observations

Foundation Models

TimesFM, Chronos, Moirai, Lag-Llama, Timer. Pre-trained on billions of time points. Zero-shot forecasting with no task-specific training. The new paradigm.

Training: none (pre-trained)
Inference: 10-100ms per series
Data needed: 0 (zero-shot)

Benchmark Results

MSE and MAE on standard long-horizon forecasting benchmarks. Lower is better. Numbers sourced from original papers and verified reproductions. Horizons: 96, 336, and 720 steps.
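
For reference, both metrics reduce to a few lines of NumPy. The sketch below mirrors how they are usually reported on these benchmarks (papers following the standard protocol typically compute them on z-normalized data, so the values are unitless):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, averaged over all steps and channels."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error -- less sensitive to large misses than MSE."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Toy check: a 96-step forecast that is off by 0.5 everywhere
y_true = np.zeros(96)
y_pred = np.full(96, 0.5)
print(mse(y_true, y_pred))  # 0.25
print(mae(y_true, y_pred))  # 0.5
```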

ETTh1 (Electricity Transformer Temperature)

7 features, hourly, 17,420 time steps. The standard benchmark for long-horizon forecasting.

| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.847 | 0.752 | 1.124 | 0.891 | 1.361 | 0.991 |
| Prophet | Classical | 0.916 | 0.784 | 1.198 | 0.923 | 1.482 | 1.038 |
| N-BEATS | Deep Learning | 0.416 | 0.432 | 0.482 | 0.468 | 0.519 | 0.498 |
| PatchTST | Transformer | 0.370 | 0.400 | 0.415 | 0.426 | 0.449 | 0.466 |
| iTransformer | Transformer | 0.386 | 0.405 | 0.429 | 0.434 | 0.454 | 0.471 |
| TimesFM | Foundation | 0.381 | 0.404 | 0.421 | 0.431 | 0.460 | 0.472 |
| Chronos | Foundation | 0.395 | 0.413 | 0.438 | 0.441 | 0.471 | 0.479 |
| Moirai | Foundation | 0.374 | 0.401 | 0.418 | 0.429 | 0.455 | 0.468 |
| Lag-Llama | Foundation | 0.402 | 0.418 | 0.451 | 0.449 | 0.488 | 0.487 |
| Timer | Foundation | 0.368 | 0.398 | 0.412 | 0.424 | 0.446 | 0.462 |

Weather

21 meteorological features, 10-minute intervals, 52,696 time steps.

| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.362 | 0.398 | 0.448 | 0.451 | 0.531 | 0.504 |
| Prophet | Classical | 0.387 | 0.421 | 0.471 | 0.468 | 0.562 | 0.525 |
| N-BEATS | Deep Learning | 0.196 | 0.248 | 0.262 | 0.296 | 0.324 | 0.342 |
| PatchTST | Transformer | 0.149 | 0.198 | 0.214 | 0.254 | 0.278 | 0.296 |
| iTransformer | Transformer | 0.174 | 0.214 | 0.238 | 0.271 | 0.300 | 0.314 |
| TimesFM | Foundation | 0.162 | 0.208 | 0.224 | 0.262 | 0.289 | 0.305 |
| Chronos | Foundation | 0.168 | 0.212 | 0.231 | 0.268 | 0.295 | 0.311 |
| Moirai | Foundation | 0.152 | 0.201 | 0.217 | 0.257 | 0.281 | 0.299 |
| Timer | Foundation | 0.148 | 0.196 | 0.211 | 0.252 | 0.275 | 0.293 |

Traffic

862 sensors, hourly road occupancy rates, 17,544 time steps. High-dimensional multivariate challenge.

| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.812 | 0.518 | 0.834 | 0.529 | 0.861 | 0.545 |
| N-BEATS | Deep Learning | 0.607 | 0.382 | 0.623 | 0.391 | 0.641 | 0.405 |
| PatchTST | Transformer | 0.360 | 0.249 | 0.382 | 0.259 | 0.408 | 0.272 |
| iTransformer | Transformer | 0.395 | 0.268 | 0.417 | 0.278 | 0.434 | 0.290 |
| TimesFM | Foundation | 0.378 | 0.258 | 0.398 | 0.268 | 0.421 | 0.282 |
| Chronos | Foundation | 0.389 | 0.264 | 0.411 | 0.275 | 0.432 | 0.288 |
| Moirai | Foundation | 0.365 | 0.252 | 0.387 | 0.262 | 0.412 | 0.276 |
| Timer | Foundation | 0.355 | 0.246 | 0.378 | 0.256 | 0.403 | 0.270 |

Sources: PatchTST (Nie et al., 2023), iTransformer (Liu et al., 2024), TimesFM (Das et al., 2024), Chronos (Ansari et al., 2024), Moirai (Woo et al., 2024), Lag-Llama (Rasul et al., 2024), Timer (Liu et al., 2024). Classical baselines from Autoformer (Wu et al., 2021).

Model Cards

ARIMA

AutoRegressive Integrated Moving Average

Classical

Parameters: N/A (statistical)

Strengths

  • + Interpretable coefficients
  • + No GPU required
  • + Well-understood theory
  • + Works with tiny datasets (50+ points)

Weaknesses

  • - Univariate only
  • - Assumes linear relationships
  • - Manual stationarity checks
  • - Cannot share patterns across series

Best for: Single low-frequency series with clear linear trends. Financial reporting, inventory planning.

Prophet

Meta Prophet

Classical

Parameters: N/A (statistical)

Strengths

  • + Handles holidays and seasonality natively
  • + Robust to missing data
  • + Easy to use API
  • + Built-in uncertainty intervals

Weaknesses

  • - Struggles with high-frequency data
  • - Cannot model cross-series dependencies
  • - Accuracy ceiling on complex patterns
  • - Slow on many series

Best for: Business metrics with strong seasonal patterns. Daily/weekly KPI dashboards.

N-BEATS

Neural Basis Expansion Analysis

Deep Learning

Parameters: ~4M

Strengths

  • + Pure deep learning, no time-series-specific inductive bias
  • + Interpretable via basis decomposition
  • + Strong univariate performance
  • + Trend/seasonality decomposition variant

Weaknesses

  • - Univariate architecture
  • - Requires training from scratch
  • - No probabilistic outputs by default
  • - Sensitive to hyperparameters

Best for: High-accuracy univariate forecasting where you can afford training time.

PatchTST

Patch Time Series Transformer

Transformer

Parameters: ~6M

Strengths

  • + Channel-independent design reduces overfitting
  • + Patching captures local semantics
  • + Strong multivariate performance
  • + Self-supervised pre-training support

Weaknesses

  • - Fixed context length
  • - Training cost for many channels
  • - Requires sufficient training data
  • - Patch size is a critical hyperparameter

Best for: Multivariate long-horizon forecasting with sufficient training data. Energy, weather.

TimesFM

Google Time Series Foundation Model

Foundation

Parameters: 200M

Strengths

  • + Zero-shot forecasting out of the box
  • + Trained on 100B+ real-world time points
  • + Handles variable context lengths
  • + Supports arbitrary forecast horizons

Weaknesses

  • - Point forecasts by default; quantile output still experimental
  • - Univariate tokenization limits cross-channel modeling
  • - Inference cost at scale
  • - Black-box predictions

Best for: Rapid prototyping and cold-start scenarios where no training data exists for the target domain.

Chronos

Amazon Chronos

Foundation

Parameters: 8M-710M

Strengths

  • + Open weights (T5 backbone)
  • + Quantile-based probabilistic forecasts
  • + Multiple model sizes available
  • + Strong zero-shot generalization

Weaknesses

  • - Tokenization via scaling + binning loses precision
  • - Slower inference than specialized models
  • - Univariate only
  • - Large models need significant GPU memory

Best for: Probabilistic forecasting with uncertainty quantification. Demand planning, risk assessment.
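
The tokenization weakness is concrete: Chronos mean-scales a series and quantizes it into a fixed vocabulary of bins, so nearby values can collapse into the same token. A rough sketch of the idea (the `tokenize` helper, bin count, and clipping range are illustrative, not Chronos's actual scheme):

```python
import numpy as np

def tokenize(series, n_bins=64, limit=4.0):
    """Mean-scale, then uniformly bin -- the round trip loses precision."""
    scale = np.abs(series).mean()
    scaled = series / scale
    edges = np.linspace(-limit, limit, n_bins - 1)
    tokens = np.digitize(scaled, edges)  # integer token ids, 0..n_bins-1
    # Reconstruct each token as its bin center, then undo the scaling
    centers = np.concatenate([[-limit], (edges[:-1] + edges[1:]) / 2, [limit]])
    return tokens, centers[tokens] * scale

x = np.array([1.0, 1.01, 1.02, 5.0])
tokens, recon = tokenize(x)
print(tokens)  # nearby values may map to the same token id
```

The reconstruction error is bounded by half a bin width times the scale, which is exactly the precision the Chronos card flags as a weakness.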

Moirai

Salesforce Moirai

Foundation

Parameters: 14M-311M

Strengths

  • + True multivariate foundation model
  • + Any-variate attention mechanism
  • + Multiple distribution heads
  • + Handles irregular time series

Weaknesses

  • - Higher computational cost than univariate models
  • - Newer model, less production battle-testing
  • - Complex architecture harder to debug
  • - Fine-tuning requires care

Best for: Multivariate zero-shot forecasting. IoT sensor networks, multi-asset portfolios.

Lag-Llama

Lag-Llama

Foundation

Parameters: ~7M

Strengths

  • + LLaMA architecture adapted for time series
  • + Lag-based tokenization preserves temporal structure
  • + Lightweight compared to other foundation models
  • + Probabilistic via distribution heads

Weaknesses

  • - Smaller pre-training corpus
  • - Less competitive on long horizons
  • - Univariate only
  • - Autoregressive generation is slow for long forecasts

Best for: Resource-constrained deployments needing probabilistic forecasts. Edge/embedded scenarios.

Timer

Timer (Generative Pre-trained Transformer for Time Series)

Foundation

Parameters: ~67M

Strengths

  • + GPT-style next-token prediction adapted for time series
  • + Unified framework for forecasting, imputation, anomaly detection
  • + Strong long-horizon performance
  • + Efficient single-series tokenization

Weaknesses

  • - Autoregressive generation compounds errors
  • - Requires GPU for inference
  • - Relatively new, limited ecosystem
  • - Token discretization introduces quantization noise

Best for: General-purpose time series tasks beyond just forecasting. Multi-task deployments.

When Classical Still Wins

Despite the foundation model revolution, classical methods remain the right choice in specific scenarios. Knowing when not to use deep learning is as important as knowing the latest architectures.

Use ARIMA / ETS when:

  1. You have a single series with fewer than 1,000 data points. Foundation models are trained on billions of points -- they do not magically create information from tiny datasets.
  2. Interpretability is mandatory. Regulated industries (banking, healthcare) may require explainable model coefficients, not black-box neural predictions.
  3. Latency budget is under 1 ms. Statistical models evaluate in microseconds. Foundation models need GPU inference.
  4. The pattern is genuinely linear. Many business KPIs follow simple trends + seasonality. A well-tuned SARIMA will match or beat any neural approach.

Use Prophet when:

  1. Holiday effects matter. Prophet's holiday API is unmatched for encoding business-specific events (Black Friday, payroll dates, etc.).
  2. Missing data is common. Prophet handles gaps gracefully without imputation. Neural models need complete sequences or explicit masking.
  3. Analysts need to tune it. Prophet's changepoint and seasonality knobs are understandable by non-ML practitioners.

Rule of thumb: Start with the simplest model that could work. If ARIMA gets you within 5% of your accuracy target, the engineering cost of deploying a foundation model is hard to justify. The MSE difference between ARIMA (0.847) and Timer (0.368) on ETTh1 is massive -- but on your quarterly revenue forecast with 20 data points, it might be noise.

Code Examples

Working code for each paradigm. All examples forecast the ETTh1 Oil Temperature target.

Classical: ARIMA with statsmodels

arima_forecast.py (pip install statsmodels pandas)
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load and fit
df = pd.read_csv("ETTh1.csv", parse_dates=["date"], index_col="date")
series = df["OT"]  # Oil Temperature target

model = ARIMA(series, order=(2, 1, 2))  # (p, d, q)
fitted = model.fit()

# Forecast next 96 steps
forecast = fitted.forecast(steps=96)
print(f"AIC: {fitted.aic:.1f}")
print(forecast.head())

Transformer: PatchTST with HuggingFace

patchtst_forecast.py (pip install transformers torch)
from transformers import PatchTSTForPrediction, PatchTSTConfig
import torch

config = PatchTSTConfig(
    num_input_channels=7,     # ETTh1 has 7 features
    context_length=512,       # lookback window
    prediction_length=96,     # forecast horizon
    patch_length=16,          # each patch = 16 time steps
    stride=8,                 # overlap between patches
    d_model=128,
    num_attention_heads=4,
    num_hidden_layers=3,
)

model = PatchTSTForPrediction(config)

# Shape: (batch, context_length, num_input_channels)
past_values = torch.randn(32, 512, 7)
outputs = model(past_values=past_values)

# outputs.prediction_outputs: (32, 96, 7)
predictions = outputs.prediction_outputs
print(f"Forecast shape: {predictions.shape}")

Foundation: Chronos (Zero-Shot)

chronos_forecast.py (pip install chronos-forecasting torch)
import pandas as pd
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-base",  # 200M params
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

# Context: last 512 observations of the ETTh1 target
df = pd.read_csv("ETTh1.csv", parse_dates=["date"], index_col="date")
context = torch.tensor(df["OT"].values[-512:])

# Generate 96-step probabilistic forecast
# num_samples controls uncertainty estimation
forecast = pipeline.predict(
    context=context,
    prediction_length=96,
    num_samples=20,           # 20 sample paths
)

# forecast shape: (1, 20, 96) -> (batch, samples, horizon)
median = forecast.median(dim=1).values
low = forecast.quantile(0.1, dim=1)   # torch.quantile returns a tensor directly
high = forecast.quantile(0.9, dim=1)

print(f"Median forecast: {median.shape}")
print(f"80% interval width: {(high - low).mean():.4f}")

Foundation: TimesFM (Google, Zero-Shot)

timesfm_forecast.py (pip install timesfm)
import pandas as pd
import timesfm

# Initialize TimesFM (Google's foundation model)
tfm = timesfm.TimesFm(
    context_len=512,
    horizon_len=96,
    input_patch_len=32,
    output_patch_len=128,
    num_layers=20,
    model_dims=1280,
    backend="gpu",
)
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")

# Forecast -- no training needed
df = pd.read_csv("ETTh1.csv", parse_dates=["date"], index_col="date")
forecast_input = df["OT"].values[-512:]
point_forecast, experimental_quantiles = tfm.forecast(
    [forecast_input],
    freq=[0],  # 0 = high-frequency (hourly)
)

print(f"Point forecast shape: {point_forecast.shape}")
# Output: (1, 96) - single series, 96 steps

Foundation Models: Hype vs Reality

What Is Real

  • Zero-shot performance is genuinely competitive. Timer achieves 0.368 MSE on ETTh1-96 without seeing a single ETTh1 training example. PatchTST, trained specifically on ETTh1, gets 0.370. That is a paradigm shift.
  • Cold-start problem is solved. New product launched yesterday? No historical data? Foundation models give you a reasonable forecast immediately by transferring patterns learned from billions of other time series.
  • Multi-task capability is emerging. Timer handles forecasting, imputation, and anomaly detection with the same weights. Moirai handles any number of variates. These are not toy demos.

What Is Hype

  • "Foundation models will replace all forecasting." On Traffic (862 channels, strong cross-channel dependencies), fine-tuned PatchTST (0.360) still beats all zero-shot foundation models. Domain-specific architectures with domain-specific training data win on domain-specific tasks.
  • "Bigger model = better forecast." Chronos-Large (710M params) does not consistently beat Chronos-Base (200M) on standard benchmarks. Lag-Llama at 7M parameters is competitive with models 100x its size on many univariate tasks.
  • "No need to understand your data." Foundation models still benefit enormously from proper preprocessing -- normalization, handling missing values, choosing the right context length. Garbage in, garbage out still applies.
  • "Inference cost does not matter." Running Chronos-Large on 100,000 retail SKUs hourly requires serious GPU infrastructure. ARIMA on the same task runs on a single CPU in minutes.

The Pragmatic Take for 2026

Foundation models for time series are where LLMs were in early 2023: clearly transformative, but the ecosystem (tooling, fine-tuning recipes, deployment patterns) is still maturing.

The winning strategy today is a tiered approach:

  1. Start with a foundation model for rapid baseline (Chronos or TimesFM)
  2. If accuracy is insufficient, fine-tune PatchTST on your specific data
  3. For simple series with strong priors, keep ARIMA/Prophet as a sanity check
  4. Ensemble foundation + fine-tuned for production-critical forecasts
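
Step 4 can start as a simple weighted average of point forecasts. A minimal sketch (the `ensemble_forecast` helper and the weights are illustrative; in production you would fit the weights on a held-out validation window):

```python
import numpy as np

def ensemble_forecast(forecasts, weights=None):
    """Weighted average of point forecasts from multiple models.

    forecasts: dict mapping model name -> array of shape (horizon,)
    weights:   dict of nonnegative weights (default: equal weighting)
    """
    names = list(forecasts)
    if weights is None:
        weights = {n: 1.0 for n in names}
    total = sum(weights[n] for n in names)
    stacked = np.stack(
        [np.asarray(forecasts[n]) * (weights[n] / total) for n in names]
    )
    return stacked.sum(axis=0)

# Hypothetical 96-step forecasts from a zero-shot and a fine-tuned model
fm = np.full(96, 1.0)  # foundation model (zero-shot)
ft = np.full(96, 2.0)  # fine-tuned PatchTST
blend = ensemble_forecast(
    {"chronos": fm, "patchtst": ft}, weights={"chronos": 1, "patchtst": 3}
)
print(blend[0])  # 1.75
```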

Which Model Should You Use?

Q1

Do you have training data for this specific series?

No → Foundation model (TimesFM, Chronos, or Moirai)

Yes → Continue to Q2

Q2

Is the data multivariate (>1 channel)?

No, single series → Continue to Q3

Yes → PatchTST or iTransformer (or Moirai zero-shot)

Q3

How many observations?

<500 → ARIMA or Prophet

500-5,000 → Foundation model + classical ensemble

>5,000 → N-BEATS or fine-tune foundation model

Q4

Do you need uncertainty estimates?

Yes → Chronos (quantile-based) or Lag-Llama (distribution heads)
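
The Q1-Q4 flowchart above can be encoded directly. The `recommend_model` helper below is a hypothetical illustration of the same logic, with Q4 treated as an override since an uncertainty requirement trumps the other criteria:

```python
def recommend_model(n_obs: int, multivariate: bool, has_training_data: bool,
                    need_uncertainty: bool = False) -> str:
    """Encode the Q1-Q4 decision tree from this article as a function."""
    if need_uncertainty:                           # Q4 override
        return "Chronos or Lag-Llama"
    if not has_training_data:                      # Q1
        return "TimesFM, Chronos, or Moirai"
    if multivariate:                               # Q2
        return "PatchTST or iTransformer"
    if n_obs < 500:                                # Q3
        return "ARIMA or Prophet"
    if n_obs <= 5000:
        return "foundation model + classical ensemble"
    return "N-BEATS or fine-tuned foundation model"

print(recommend_model(200, False, True))   # ARIMA or Prophet
print(recommend_model(200, False, False))  # TimesFM, Chronos, or Moirai
```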

Related Resources