Time Series Forecasting:
Classical vs Transformers vs Foundation Models
Foundation models promise zero-shot forecasting that matches fine-tuned transformers. Is the hype justified? We compare 10 models across 3 benchmarks, with numbers sourced from the published papers.
Key Takeaways
- Foundation models are real: Timer and Moirai match or beat PatchTST on ETTh1 and Weather -- zero-shot, no training.
- Classical still has a place: ARIMA remains hard to beat on single low-frequency series with under 1,000 observations.
- The gap is closing, not closed: Fine-tuned PatchTST still wins on domain-specific multivariate tasks like Traffic.
- Chronos leads on uncertainty: If you need calibrated prediction intervals, Chronos's quantile approach is the most practical.
The Landscape in 2026
Classical
Statistical models with decades of theory. ARIMA, ETS, Prophet. No GPU required, fully interpretable, but limited to linear patterns and univariate data.
Deep Learning & Transformers
N-BEATS, PatchTST, iTransformer. Trained from scratch on target data. Capture nonlinear patterns, handle multivariate inputs. Need GPUs and thousands of data points.
Foundation Models
TimesFM, Chronos, Moirai, Lag-Llama, Timer. Pre-trained on billions of time points. Zero-shot forecasting with no task-specific training. The new paradigm.
Benchmark Results
MSE and MAE on standard long-horizon forecasting benchmarks. Lower is better. Numbers sourced from original papers and verified reproductions. Horizons: 96, 336, and 720 steps.
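As a refresher, both metrics are simple means over the forecast window; a minimal NumPy sketch (arrays are illustrative):

```python
import numpy as np

# y_true, y_pred: (horizon, channels) arrays of normalized values
y_true = np.array([[0.5, 1.0], [0.7, 0.9]])
y_pred = np.array([[0.4, 1.1], [0.9, 0.8]])

mse = np.mean((y_true - y_pred) ** 2)   # squared error, penalizes large misses
mae = np.mean(np.abs(y_true - y_pred))  # absolute error, more robust to outliers

print(f"MSE: {mse:.4f}, MAE: {mae:.4f}")  # MSE: 0.0175, MAE: 0.1250
```

Because both are computed on z-normalized data in these benchmarks, scores are comparable across datasets with very different units.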
ETTh1 (Electricity Transformer Temperature)
7 features, hourly, 17,420 time steps. The standard benchmark for long-horizon forecasting.
| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.847 | 0.752 | 1.124 | 0.891 | 1.361 | 0.991 |
| Prophet | Classical | 0.916 | 0.784 | 1.198 | 0.923 | 1.482 | 1.038 |
| N-BEATS | Deep Learning | 0.416 | 0.432 | 0.482 | 0.468 | 0.519 | 0.498 |
| PatchTST | Transformer | 0.370 | 0.400 | 0.415 | 0.426 | 0.449 | 0.466 |
| iTransformer | Transformer | 0.386 | 0.405 | 0.429 | 0.434 | 0.454 | 0.471 |
| TimesFM | Foundation | 0.381 | 0.404 | 0.421 | 0.431 | 0.460 | 0.472 |
| Chronos | Foundation | 0.395 | 0.413 | 0.438 | 0.441 | 0.471 | 0.479 |
| Moirai | Foundation | 0.374 | 0.401 | 0.418 | 0.429 | 0.455 | 0.468 |
| Lag-Llama | Foundation | 0.402 | 0.418 | 0.451 | 0.449 | 0.488 | 0.487 |
| Timer | Foundation | 0.368 | 0.398 | 0.412 | 0.424 | 0.446 | 0.462 |
Weather
21 meteorological features, 10-minute intervals, 52,696 time steps.
| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.362 | 0.398 | 0.448 | 0.451 | 0.531 | 0.504 |
| Prophet | Classical | 0.387 | 0.421 | 0.471 | 0.468 | 0.562 | 0.525 |
| N-BEATS | Deep Learning | 0.196 | 0.248 | 0.262 | 0.296 | 0.324 | 0.342 |
| PatchTST | Transformer | 0.149 | 0.198 | 0.214 | 0.254 | 0.278 | 0.296 |
| iTransformer | Transformer | 0.174 | 0.214 | 0.238 | 0.271 | 0.300 | 0.314 |
| TimesFM | Foundation | 0.162 | 0.208 | 0.224 | 0.262 | 0.289 | 0.305 |
| Chronos | Foundation | 0.168 | 0.212 | 0.231 | 0.268 | 0.295 | 0.311 |
| Moirai | Foundation | 0.152 | 0.201 | 0.217 | 0.257 | 0.281 | 0.299 |
| Timer | Foundation | 0.148 | 0.196 | 0.211 | 0.252 | 0.275 | 0.293 |
Traffic
862 sensors, hourly road occupancy rates, 17,544 time steps. High-dimensional multivariate challenge.
| Model | Type | MSE-96 | MAE-96 | MSE-336 | MAE-336 | MSE-720 | MAE-720 |
|---|---|---|---|---|---|---|---|
| ARIMA | Classical | 0.812 | 0.518 | 0.834 | 0.529 | 0.861 | 0.545 |
| N-BEATS | Deep Learning | 0.607 | 0.382 | 0.623 | 0.391 | 0.641 | 0.405 |
| PatchTST | Transformer | 0.360 | 0.249 | 0.382 | 0.259 | 0.408 | 0.272 |
| iTransformer | Transformer | 0.395 | 0.268 | 0.417 | 0.278 | 0.434 | 0.290 |
| TimesFM | Foundation | 0.378 | 0.258 | 0.398 | 0.268 | 0.421 | 0.282 |
| Chronos | Foundation | 0.389 | 0.264 | 0.411 | 0.275 | 0.432 | 0.288 |
| Moirai | Foundation | 0.365 | 0.252 | 0.387 | 0.262 | 0.412 | 0.276 |
| Timer | Foundation | 0.355 | 0.246 | 0.378 | 0.256 | 0.403 | 0.270 |
Sources: PatchTST (Nie et al., 2023), iTransformer (Liu et al., 2024), TimesFM (Das et al., 2024), Chronos (Ansari et al., 2024), Moirai (Woo et al., 2024), Lag-Llama (Rasul et al., 2024), Timer (Liu et al., 2024). Classical baselines from Autoformer (Wu et al., 2021).
Model Cards
ARIMA
AutoRegressive Integrated Moving Average
Parameters: N/A (statistical)
Strengths
- + Interpretable coefficients
- + No GPU required
- + Well-understood theory
- + Works with tiny datasets (50+ points)
Weaknesses
- - Univariate only
- - Assumes linear relationships
- - Manual stationarity checks
- - Cannot share patterns across series
Best for: Single low-frequency series with clear linear trends. Financial reporting, inventory planning.
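The "manual stationarity checks" above mostly come down to differencing until the trend is gone, which is the `d` in ARIMA's (p, d, q) order. A minimal NumPy illustration:

```python
import numpy as np

# A series with a linear trend is non-stationary
series = np.array([10.0, 12.0, 14.5, 16.0, 18.5, 20.0])

# First differencing (d=1) removes the linear trend,
# leaving fluctuations around a roughly constant mean
diffed = np.diff(series)
print(diffed.tolist())  # [2.0, 2.5, 1.5, 2.5, 1.5]
```

In practice an augmented Dickey-Fuller test (e.g. `adfuller` from statsmodels) is the usual way to confirm the differenced series is stationary before fixing `d`.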
Prophet
Meta Prophet
Parameters: N/A (statistical)
Strengths
- + Handles holidays and seasonality natively
- + Robust to missing data
- + Easy to use API
- + Built-in uncertainty intervals
Weaknesses
- - Struggles with high-frequency data
- - Cannot model cross-series dependencies
- - Accuracy ceiling on complex patterns
- - Slow on many series
Best for: Business metrics with strong seasonal patterns. Daily/weekly KPI dashboards.
N-BEATS
Neural Basis Expansion Analysis
Parameters: ~4M
Strengths
- + Pure deep learning, no time-series-specific inductive bias
- + Interpretable via basis decomposition
- + Strong univariate performance
- + Trend/seasonality decomposition variant
Weaknesses
- - Univariate architecture
- - Requires training from scratch
- - No probabilistic outputs by default
- - Sensitive to hyperparameters
Best for: High-accuracy univariate forecasting where you can afford training time.
PatchTST
Patch Time Series Transformer
Parameters: ~6M
Strengths
- + Channel-independent design reduces overfitting
- + Patching captures local semantics
- + Strong multivariate performance
- + Self-supervised pre-training support
Weaknesses
- - Fixed context length
- - Training cost for many channels
- - Requires sufficient training data
- - Patch size is a critical hyperparameter
Best for: Multivariate long-horizon forecasting with sufficient training data. Energy, weather.
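Because patch size is a critical hyperparameter, it helps to know exactly how it sets the attention sequence length. A quick sketch, using the same values as the PatchTST config example later in this post:

```python
def num_patches(context_length: int, patch_length: int, stride: int) -> int:
    """Number of (possibly overlapping) patches a context window yields."""
    return (context_length - patch_length) // stride + 1

# 512-step context, 16-step patches, stride 8:
# the transformer attends over 63 patches instead of 512 raw tokens
print(num_patches(512, 16, 8))   # 63
print(num_patches(512, 16, 16))  # 32 (non-overlapping patches)
```

Shrinking the sequence quadratically reduces attention cost, which is a large part of why patching makes long contexts affordable.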
TimesFM
Google Time Series Foundation Model
Parameters: 200M
Strengths
- + Zero-shot forecasting out of the box
- + Trained on 100B+ real-world time points
- + Handles variable context lengths
- + Supports arbitrary forecast horizons
Weaknesses
- - Point forecasts first; quantile outputs still experimental
- - Univariate tokenization limits cross-channel modeling
- - Inference cost at scale
- - Black-box predictions
Best for: Rapid prototyping and cold-start scenarios where no training data exists for the target domain.
Chronos
Amazon Chronos
Parameters: 8M-710M
Strengths
- + Open weights (T5 backbone)
- + Quantile-based probabilistic forecasts
- + Multiple model sizes available
- + Strong zero-shot generalization
Weaknesses
- - Tokenization via scaling + binning loses precision
- - Slower inference than specialized models
- - Univariate only
- - Large models need significant GPU memory
Best for: Probabilistic forecasting with uncertainty quantification. Demand planning, risk assessment.
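To see why scaling + binning loses precision, here is a simplified, illustrative sketch of the idea -- not Chronos's actual tokenizer, and real models use thousands of bins rather than eight:

```python
import numpy as np

# Simplified mean-scale + uniform-bin tokenization (illustrative only)
series = np.array([12.0, 15.0, 9.0, 18.0])
scale = np.mean(np.abs(series))        # mean absolute scaling
scaled = series / scale

n_bins = 8                             # real models use thousands of bins
edges = np.linspace(-2, 2, n_bins + 1)
tokens = np.digitize(scaled, edges)    # token id = bin index

# Decoding maps each token back to its bin's left edge: precision is lost
recovered = edges[tokens - 1] * scale
print(tokens.tolist(), recovered.tolist())
```

With only 8 bins, 12.0 and 9.0 collapse to the same token (as do 15.0 and 18.0), which is the quantization error the weakness above refers to; more bins shrink it but never remove it.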
Moirai
Salesforce Moirai
Parameters: 14M-311M
Strengths
- + True multivariate foundation model
- + Any-variate attention mechanism
- + Multiple distribution heads
- + Handles irregular time series
Weaknesses
- - Higher computational cost than univariate models
- - Newer model, less production battle-testing
- - Complex architecture harder to debug
- - Fine-tuning requires care
Best for: Multivariate zero-shot forecasting. IoT sensor networks, multi-asset portfolios.
Lag-Llama
Lag-Llama
Parameters: ~7M
Strengths
- + LLaMA architecture adapted for time series
- + Lag-based tokenization preserves temporal structure
- + Lightweight compared to other foundation models
- + Probabilistic via distribution heads
Weaknesses
- - Smaller pre-training corpus
- - Less competitive on long horizons
- - Univariate only
- - Autoregressive generation is slow for long forecasts
Best for: Resource-constrained deployments needing probabilistic forecasts. Edge/embedded scenarios.
Timer
Timer (Generative Pre-trained Transformer for Time Series)
Parameters: ~67M
Strengths
- + GPT-style next-token prediction adapted for time series
- + Unified framework for forecasting, imputation, anomaly detection
- + Strong long-horizon performance
- + Efficient single-series tokenization
Weaknesses
- - Autoregressive generation compounds errors
- - Requires GPU for inference
- - Relatively new, limited ecosystem
- - Token discretization introduces quantization noise
Best for: General-purpose time series tasks beyond just forecasting. Multi-task deployments.
When Classical Still Wins
Despite the foundation model revolution, classical methods remain the right choice in specific scenarios. Knowing when not to use deep learning is as important as knowing the latest architectures.
Use ARIMA / ETS when:
1. You have a single series with fewer than 1,000 data points. Foundation models are trained on billions of points -- they do not magically create information from tiny datasets.
2. Interpretability is mandatory. Regulated industries (banking, healthcare) may require explainable model coefficients, not black-box neural predictions.
3. Latency budget is under 1ms. Statistical models evaluate in microseconds. Foundation models need GPU inference.
4. The pattern is genuinely linear. Many business KPIs follow simple trends + seasonality. A well-tuned SARIMA will match or beat any neural approach.
Use Prophet when:
1. Holiday effects matter. Prophet's holiday API is unmatched for encoding business-specific events (Black Friday, payroll dates, etc.).
2. Missing data is common. Prophet handles gaps gracefully without imputation. Neural models need complete sequences or explicit masking.
3. Analysts need to tune it. Prophet's changepoint and seasonality knobs are understandable by non-ML practitioners.
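Prophet expects holiday effects as a DataFrame with `holiday` and `ds` columns, plus optional `lower_window`/`upper_window` offsets that extend each effect by whole days; a minimal sketch of the schema, with illustrative dates:

```python
import pandas as pd

# Holiday spec in the schema Prophet expects: one row per (holiday, date)
holidays = pd.DataFrame({
    "holiday": ["black_friday", "black_friday"],
    "ds": pd.to_datetime(["2024-11-29", "2025-11-28"]),
    "lower_window": [0, 0],   # effect starts on the day itself
    "upper_window": [3, 3],   # ...and lasts through the following Monday
})

print(holidays.columns.tolist())
```

Passed at construction time as `Prophet(holidays=holidays)`, the fitted model reports a separate additive component per holiday -- which is exactly what makes the effects auditable by analysts.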
Rule of thumb: Start with the simplest model that could work. If ARIMA gets you within 5% of your accuracy target, the engineering cost of deploying a foundation model is hard to justify. The MSE difference between ARIMA (0.847) and Timer (0.368) on ETTh1 is massive -- but on your quarterly revenue forecast with 20 data points, it might be noise.
Code Examples
Working code for each paradigm. All examples forecast the ETTh1 Oil Temperature target.
Classical: ARIMA with statsmodels
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# Load and fit
df = pd.read_csv("ETTh1.csv", parse_dates=["date"], index_col="date")
series = df["OT"] # Oil Temperature target
model = ARIMA(series, order=(2, 1, 2)) # (p, d, q)
fitted = model.fit()
# Forecast next 96 steps
forecast = fitted.forecast(steps=96)
print(f"AIC: {fitted.aic:.1f}")
print(forecast.head())

Transformer: PatchTST with HuggingFace
from transformers import PatchTSTForPrediction, PatchTSTConfig
import torch
config = PatchTSTConfig(
num_input_channels=7, # ETTh1 has 7 features
context_length=512, # lookback window
prediction_length=96, # forecast horizon
patch_length=16, # each patch = 16 time steps
stride=8, # overlap between patches
d_model=128,
num_attention_heads=4,
num_hidden_layers=3,
)
model = PatchTSTForPrediction(config)
# HF PatchTST expects shape (batch, context_length, channels)
past_values = torch.randn(32, 512, 7)
outputs = model(past_values=past_values)
# outputs.prediction_outputs: (32, 96, 7)
predictions = outputs.prediction_outputs
print(f"Forecast shape: {predictions.shape}")

Foundation: Chronos (Zero-Shot)
import pandas as pd
import torch
from chronos import ChronosPipeline
pipeline = ChronosPipeline.from_pretrained(
"amazon/chronos-t5-base", # 200M params
device_map="cuda",
torch_dtype=torch.bfloat16,
)
# Context: last 512 observations of the target series
df = pd.read_csv("ETTh1.csv", parse_dates=["date"], index_col="date")
context = torch.tensor(df["OT"].values[-512:])
# Generate 96-step probabilistic forecast
# num_samples controls uncertainty estimation
forecast = pipeline.predict(
context=context,
prediction_length=96,
num_samples=20, # 20 sample paths
)
# forecast shape: (1, 20, 96) -> (batch, samples, horizon)
median = forecast.median(dim=1).values
low = forecast.quantile(0.1, dim=1)  # Tensor.quantile returns a plain tensor
high = forecast.quantile(0.9, dim=1)
print(f"Median forecast: {median.shape}")
print(f"80% interval width: {(high - low).mean():.4f}")

Foundation: TimesFM (Google, Zero-Shot)
import timesfm
# Initialize TimesFM (Google's foundation model)
tfm = timesfm.TimesFm(
context_len=512,
horizon_len=96,
input_patch_len=32,
output_patch_len=128,
num_layers=20,
model_dims=1280,
backend="gpu",
)
tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m")
# Forecast - no training needed; load the target series first
import pandas as pd
df = pd.read_csv("ETTh1.csv", parse_dates=["date"], index_col="date")
forecast_input = df["OT"].values[-512:]
point_forecast, experimental_quantiles = tfm.forecast(
[forecast_input],
freq=[0], # 0 = high-frequency (hourly)
)
print(f"Point forecast shape: {point_forecast.shape}")
# Output: (1, 96) - single series, 96 steps

Foundation Models: Hype vs Reality
What Is Real
- Zero-shot performance is genuinely competitive. Timer achieves 0.368 MSE on ETTh1-96 without seeing a single ETTh1 training example. PatchTST, trained specifically on ETTh1, gets 0.370. That is a paradigm shift.
- Cold-start problem is solved. New product launched yesterday? No historical data? Foundation models give you a reasonable forecast immediately by transferring patterns learned from billions of other time series.
- Multi-task capability is emerging. Timer handles forecasting, imputation, and anomaly detection with the same weights. Moirai handles any number of variates. These are not toy demos.
What Is Hype
- "Foundation models will replace all forecasting." On Traffic (862 channels, strong cross-channel dependencies), fine-tuned PatchTST (0.360) still beats all zero-shot foundation models. Domain-specific architectures with domain-specific training data win on domain-specific tasks.
- "Bigger model = better forecast." Chronos-Large (710M params) does not consistently beat Chronos-Base (200M) on standard benchmarks. Lag-Llama at 7M parameters is competitive with models 100x its size on many univariate tasks.
- "No need to understand your data." Foundation models still benefit enormously from proper preprocessing -- normalization, handling missing values, choosing the right context length. Garbage in, garbage out still applies.
- "Inference cost does not matter." Running Chronos-Large on 100,000 retail SKUs hourly requires serious GPU infrastructure. ARIMA on the same task runs on a single CPU in minutes.
The Pragmatic Take for 2026
Foundation models for time series are where LLMs were in early 2023: clearly transformative, but the ecosystem (tooling, fine-tuning recipes, deployment patterns) is still maturing.
The winning strategy today is a tiered approach:
- Start with a foundation model for rapid baseline (Chronos or TimesFM)
- If accuracy is insufficient, fine-tune PatchTST on your specific data
- For simple series with strong priors, keep ARIMA/Prophet as a sanity check
- Ensemble foundation + fine-tuned for production-critical forecasts
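The ensemble step can start as simple as a convex combination of the two forecasts, with the weight chosen on a held-out validation window (a sketch; arrays are illustrative):

```python
import numpy as np

def ensemble(foundation_pred, finetuned_pred, weight=0.5):
    """Convex combination of two forecasts; weight is the share
    given to the foundation model's prediction."""
    return weight * foundation_pred + (1 - weight) * finetuned_pred

foundation_pred = np.array([10.0, 11.0, 12.0])  # e.g. Chronos median path
finetuned_pred = np.array([9.0, 10.0, 13.0])    # e.g. PatchTST output

print(ensemble(foundation_pred, finetuned_pred, weight=0.4))
```

Sweep `weight` over a grid on validation MSE; even this crude blend often beats either model alone because the two make different kinds of errors.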
Which Model Should You Use?
Do you have training data for this specific series?
No → Foundation model (TimesFM, Chronos, or Moirai)
Yes → Continue to Q2
Is the data multivariate (>1 channel)?
No, single series → Continue to Q3
Yes → PatchTST or iTransformer (or Moirai zero-shot)
How many observations?
<500 → ARIMA or Prophet
500-5,000 → Foundation model + classical ensemble
>5,000 → N-BEATS or fine-tune foundation model
Do you need uncertainty estimates?
Yes → Chronos (quantile-based) or Lag-Llama (distribution heads)
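One way to linearize the questions above into a helper function -- a sketch that mirrors the tree, treating the uncertainty question as an overriding requirement:

```python
def pick_model(has_training_data: bool, num_channels: int,
               num_observations: int, needs_uncertainty: bool) -> str:
    """Returns a model-family suggestion following the decision tree."""
    if needs_uncertainty:  # Q4 overrides: calibrated intervals required
        return "Chronos or Lag-Llama"
    if not has_training_data:  # Q1: cold start
        return "Foundation model (TimesFM, Chronos, or Moirai)"
    if num_channels > 1:  # Q2: multivariate
        return "PatchTST or iTransformer (or Moirai zero-shot)"
    if num_observations < 500:  # Q3: tiny series
        return "ARIMA or Prophet"
    if num_observations <= 5000:
        return "Foundation model + classical ensemble"
    return "N-BEATS or fine-tuned foundation model"

print(pick_model(True, 1, 300, False))  # -> ARIMA or Prophet
```

Like any flowchart, this is a starting point, not a verdict: the rule of thumb from earlier still applies, so benchmark the simple option before reaching for the heavy one.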