# ARIMA

> Source: https://aiwiki.ai/wiki/arima
> Updated: 2026-05-02
> Categories: Machine Learning, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Time series](/wiki/time_series), [Time series analysis](/wiki/time_series_analysis), [Forecasting](/wiki/forecasting), [Stationarity](/wiki/stationarity), [Linear regression](/wiki/linear_regression)*

**ARIMA** (Autoregressive Integrated Moving Average) is a class of statistical models for analyzing and [forecasting](/wiki/forecasting) [time series](/wiki/time_series) data. The model is specified by three non-negative integer orders written as ARIMA(p, d, q): `p` is the number of autoregressive lags, `d` is the order of differencing applied to make the series [stationary](/wiki/stationarity), and `q` is the number of moving-average lags applied to past forecast errors. ARIMA was popularized by [George E. P. Box](/wiki/george_box) and [Gwilym M. Jenkins](/wiki/gwilym_jenkins) in their 1970 book *Time Series Analysis: Forecasting and Control*, which gave the field a complete workflow for fitting these models and which is why the procedure is still called the Box-Jenkins methodology.

For most of the late twentieth century ARIMA was the default forecasting method in economics, finance, demand planning, and operations research. It still ships in every serious statistical package, it still wins on small clean univariate datasets, and it still appears as the baseline that every modern method has to beat. The M-competitions, run by Spyros Makridakis and colleagues since 1982, repeatedly found that simple statistical methods including ARIMA and exponential smoothing were hard to beat in practice, and that finding only began to flip when the M4 (2018) and M5 (2020) competitions saw hybrid and pure machine-learning entries take the top spots. In the era of [deep learning](/wiki/deep_learning) forecasters and time-series foundation models, ARIMA is no longer state of the art, but it is still where most practitioners and most textbooks start.

## Facts

| Field | Value |
|-------|-------|
| Model class | Linear, parametric, univariate time-series model |
| Year formalized | 1970 (Box & Jenkins, *Time Series Analysis: Forecasting and Control*) |
| Methodology | Box-Jenkins (identification, estimation, diagnostic checking) |
| Parameters | p (AR order), d (differencing order), q (MA order); seasonal extension adds (P, D, Q) at period s |
| Estimation | Conditional sum of squares or maximum likelihood, usually via Kalman filter on state-space form |
| Stationarity tests | Augmented Dickey-Fuller, KPSS, Phillips-Perron |
| Order selection | AIC, AICc, BIC, HQIC; Hyndman-Khandakar (2008) auto.arima algorithm |
| Key references | Box & Jenkins (1970, 1976, 1994 with Reinsel, 2015 with Reinsel & Ljung); Hyndman & Athanasopoulos *Forecasting: Principles and Practice* (otexts.com/fpp3) |
| Standard implementations | R `forecast::auto.arima`, `fable::ARIMA`; Python `statsmodels.SARIMAX`, `pmdarima.auto_arima`, `nixtla statsforecast` |

## ELI5: explain like I'm 5

Imagine you are tracking the temperature outside every day. ARIMA is a recipe for guessing tomorrow's temperature using three ingredients. First, today's temperature usually looks a lot like yesterday's, so you give some weight to the last few days (that is the autoregressive or AR part). Second, weather has trends and you cannot just use the raw numbers; you look at how much the temperature *changed* from one day to the next, because changes are more stable than absolute levels (that is the integrated or I part, called differencing). Third, you remember how wrong your guesses were on the last few days and you correct for those mistakes (that is the moving-average or MA part).

You pick three small numbers that say how many days back you want to look for each of the three ingredients. The computer then finds the best weights to combine them. After that, you can ask the model what tomorrow looks like, or next week, or next month. The further out you go, the more uncertain the forecast gets, which is why the prediction comes with a fan of error bars rather than a single line.

## History

The components of ARIMA are older than the name. [Autoregressive models](/wiki/autoregressive_model) go back to George Udny Yule's 1927 paper on sunspot numbers, with Gilbert Walker's 1931 paper on the Indian monsoon adding the higher-order AR formulation now associated with the Yule-Walker equations. Eugen Slutsky's 1927 paper showed that random shocks summed in a [moving-average](/wiki/moving_average_model) pattern can produce convincing-looking cycles, the result that Slutsky later expanded in his 1937 *Econometrica* article. Herman Wold's 1938 doctoral thesis proved a decomposition theorem that any covariance-stationary process can be written as a sum of a deterministic part and an infinite moving average of uncorrelated shocks, which is why ARMA representations are so general. Differencing as a way to handle nonstationary series was already standard practice in econometrics by the 1950s.

The contribution of Box and Jenkins was to package these ingredients into a single, repeatable procedure. Their 1970 book laid out an iterative three-stage workflow (identification, estimation, diagnostic checking), introduced the seasonal extension SARIMA, and supplied worked examples on series like airline passengers, IBM stock prices, and gas furnace data that became standard datasets for decades. The book has gone through five editions: the original Holden-Day edition in 1970, a revised printing in 1976, a third edition with Gregory Reinsel in 1994 (Prentice Hall), a fourth edition in 2008, and the fifth edition (Wiley, 2015) co-authored with Reinsel and Greta Ljung. Box and Jenkins were not the only people working on this in the late 1960s, but their book made the methodology teachable and reproducible, and the term *Box-Jenkins models* became more or less synonymous with ARIMA in industry.

The 1980s and 1990s extended ARIMA in several directions: vector versions for multivariate series (Tiao and Box, 1981), fractional differencing for long-memory processes (Granger and Joyeux 1980; Hosking 1981), volatility extensions through GARCH (Engle 1982; Bollerslev 1986), and nonlinear cousins like threshold AR (Tong 1978) and smooth transition AR. State-space representations, championed by Harvey (1989) and others, showed that almost every ARIMA model could be cast as a Kalman filter problem, which is how most modern software actually fits them.

The Hyndman-Khandakar algorithm published in 2008 (the engine behind R's `auto.arima` in the `forecast` package by [Rob J. Hyndman](/wiki/rob_hyndman)) automated the order-selection step that had previously required a human looking at autocorrelation plots. That is the version of ARIMA most people actually use today.

## Mathematical definition

ARIMA builds up from three simpler models. Throughout, `y_t` denotes the observed value at time `t`, `c` is a constant, and `ε_t` is white noise (uncorrelated with mean zero and constant variance).

### Autoregressive model AR(p)

An autoregressive process of order `p` regresses the current value on its own previous `p` values:

```
y_t = c + φ_1 y_{t-1} + φ_2 y_{t-2} + ... + φ_p y_{t-p} + ε_t
```

The coefficients `φ_1, ..., φ_p` are estimated from data. AR(1) is a simple persistence model where today depends on yesterday; AR(2) can produce damped oscillations; higher orders can fit richer dynamics. For the process to be stationary, the roots of the characteristic polynomial `1 - φ_1 z - ... - φ_p z^p` must lie outside the unit circle.

### Moving-average model MA(q)

A moving-average process of order `q` regresses the current value on its own previous `q` forecast errors:

```
y_t = c + ε_t + θ_1 ε_{t-1} + θ_2 ε_{t-2} + ... + θ_q ε_{t-q}
```

This is not the same thing as a moving-average filter on the data; it is a regression on past shocks. The coefficients `θ_1, ..., θ_q` are estimated, and the process is invertible (the shocks can be recovered from the observations) when the roots of `1 + θ_1 z + ... + θ_q z^q` lie outside the unit circle. Note that the sign convention varies across textbooks and software; some authors write the MA terms with minus signs.

### ARMA(p, q)

Combining the two:

```
y_t = c + φ_1 y_{t-1} + ... + φ_p y_{t-p} + ε_t + θ_1 ε_{t-1} + ... + θ_q ε_{t-q}
```

ARMA models are appropriate for stationary series. By Wold's decomposition theorem, any covariance-stationary process can be approximated arbitrarily well by an ARMA model with sufficiently large `p` and `q`, which is the theoretical justification for the whole framework.

### ARIMA(p, d, q)

If the series is not stationary, you difference it `d` times until it is, then fit an ARMA model to the differenced series. Differencing once means working with `Δy_t = y_t - y_{t-1}`. Differencing twice means working with `Δ²y_t = Δy_t - Δy_{t-1}`. Using the lag operator `L` (defined by `L y_t = y_{t-1}`), the ARIMA(p, d, q) model can be written compactly as:

```
(1 - φ_1 L - ... - φ_p L^p)(1 - L)^d y_t = c + (1 + θ_1 L + ... + θ_q L^q) ε_t
```

or in shorter notation:

```
φ(L) (1 - L)^d y_t = c + θ(L) ε_t
```

where `φ(L) = 1 - φ_1 L - ... - φ_p L^p` and `θ(L) = 1 + θ_1 L + ... + θ_q L^q`. Most series in practice need `d = 0` or `d = 1`; `d = 2` shows up for things like cumulative log-prices but is rare. Higher orders of `d` over-difference the series and inflate variance.

### Model variants

Special cases and extensions of ARIMA include:

| Notation | Meaning |
|----------|---------|
| AR(p) | Pure autoregressive on stationary series |
| MA(q) | Pure moving average |
| ARMA(p, q) | Mixed AR and MA on stationary series |
| ARIMA(0, 0, 0) | White noise |
| ARIMA(0, 1, 0) | Random walk |
| ARIMA(0, 1, 0) with drift | Random walk with drift |
| ARIMA(p, 0, 0) | Pure AR(p) on a stationary series |
| ARIMA(0, 0, q) | Pure MA(q) |
| ARIMA(0, 1, 1) | Equivalent to simple exponential smoothing |
| ARIMA(0, 2, 2) | Equivalent to Holt's linear method |
| SARIMA(p, d, q)(P, D, Q)_s | Seasonal ARIMA with seasonal period s |
| ARIMAX | ARIMA with exogenous regressors X |
| SARIMAX | Seasonal ARIMA with exogenous regressors |
| VAR(p) | Vector autoregression for multiple series |
| VARMA(p, q) | Vector ARMA |
| VARIMA(p, d, q) | Vector ARIMA |
| ARFIMA(p, d, q) | Fractional ARIMA, real-valued d |
| ARIMA-GARCH | ARIMA mean equation, GARCH conditional variance |

### Seasonal ARIMA: SARIMA(p, d, q)(P, D, Q)_s

Many real-world series have seasonal patterns: daily traffic peaks in the morning rush, retail sales spike in December, electricity load follows weekly cycles. SARIMA adds a seasonal block of orders `(P, D, Q)` at lag `s` (the seasonal period, for example 12 for monthly data with annual seasonality). The full model is:

```
Φ(L^s) φ(L) (1 - L)^d (1 - L^s)^D y_t = c + Θ(L^s) θ(L) ε_t
```

Here `Φ(L^s)` and `Θ(L^s)` are the seasonal AR and MA polynomials in the seasonal lag operator. SARIMA can capture combined non-seasonal and seasonal autocorrelation but only one seasonality at a time. Hourly electricity data has at least three (daily, weekly, yearly), and that is one of the cases where ARIMA-style models start to creak.

### ARIMAX and SARIMAX

ARIMAX adds external predictor variables (the X stands for exogenous) to a non-seasonal ARIMA model. SARIMAX is the seasonal version. These let the forecaster include things like price, advertising spend, holiday flags, or weather covariates. In the standard SARIMAX formulation, the response is regressed on the exogenous variables and ARIMA is fit to the residuals, giving:

```
y_t = β'x_t + n_t,    where n_t follows ARIMA(p, d, q)
```

In [statsmodels](/wiki/statsmodels) and many other packages, SARIMAX is the most general estimator and the plain ARIMA classes are special cases.

## Box-Jenkins methodology

Box and Jenkins prescribed a three-step iterative procedure. The steps are usually run by hand on small datasets and automated on large ones.

### 1. Identification

The goal of identification is to pick `(p, d, q)` (and the seasonal orders if relevant). Standard tools:

- Plot the series. Look for trend, seasonality, structural breaks, obvious outliers.
- Apply unit-root tests to decide `d`. The [augmented Dickey-Fuller (ADF) test](/wiki/adf_test) has a null of unit root; the [KPSS test](/wiki/kpss_test) (Kwiatkowski-Phillips-Schmidt-Shin) has a null of stationarity. Running both is standard because they answer the question from opposite sides. The Phillips-Perron test is a robust variant of ADF.
- Difference the series until it looks stationary, then examine the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the differenced series.
- A pure AR(p) typically has an ACF that decays geometrically and a PACF that cuts off after lag `p`. A pure MA(q) shows the opposite pattern. Mixed ARMA models show decays in both. In practice these clean patterns rarely appear and the ACF/PACF are used as suggestions rather than proof.

For seasonal data, the ACF and PACF are also examined at the seasonal lags to set `(P, D, Q)`.

### 2. Estimation

Given `(p, d, q)`, the coefficients are estimated from the differenced series. The two main objective functions are conditional sum of squares (CSS) and full [maximum likelihood estimation](/wiki/mle) (MLE). Most modern software writes the model in state-space form and runs the Kalman filter to evaluate the likelihood, which handles missing data and unobserved components naturally. Standard errors come from the inverse Hessian or from outer-product-of-gradient (OPG) approximations.

### 3. Diagnostic checking

After fitting, the residuals should look like white noise. Standard checks:

- ACF of residuals: bars should mostly stay inside the 95% confidence band.
- Ljung-Box portmanteau test for joint autocorrelation up to lag h.
- Normal Q-Q plot of residuals.
- Plot of standardized residuals over time, looking for changing variance (a sign that you might need a GARCH layer).

If the residuals fail any of these, you go back to identification, change the orders, refit, and recheck. The whole loop is iterative, which is why Box and Jenkins drew it as a flowchart with arrows pointing back to step 1.

## Stationarity and differencing

ARIMA assumes weak (covariance) stationarity of the differenced series: constant mean, constant variance, and an autocovariance that depends only on the lag. Most economic and financial series are visibly nonstationary, with trends, drifts, or shifting variances. Two common transformations:

- Differencing handles trend and unit roots. One difference removes a linear trend; two removes a quadratic. Seasonal differencing (`y_t - y_{t-s}`) handles seasonal trends.
- Logarithms or Box-Cox transformations stabilize variance when the series shows multiplicative behavior (variance growing with level).

Over-differencing is a real risk and inflates the variance of the differenced series, often introducing a spurious negative autocorrelation at lag 1. A common heuristic is to choose the smallest `d` that makes the unit-root test fail to reject stationarity. Hyndman and Khandakar's auto.arima implementation uses successive KPSS tests for the non-seasonal `d` and successive Canova-Hansen or OCSB tests for the seasonal `D`.

The ARIMA framework cannot handle structural breaks or regime changes natively. If the data-generating process changes (a new tax regime, a pandemic, a policy shift), refitting on a window after the break is the usual workaround.

### Stationarity tests at a glance

| Test | Null hypothesis | Notes |
|------|-----------------|-------|
| Augmented Dickey-Fuller (ADF) | Series has a unit root (nonstationary) | Lag selection by AIC or BIC; multiple variants for trend and intercept |
| KPSS | Series is trend or level stationary | Run alongside ADF; agreement gives more confidence |
| Phillips-Perron (PP) | Unit root | Nonparametric correction for serial correlation and heteroskedasticity |
| Canova-Hansen | Stationarity at the seasonal lag | Used for seasonal differencing decisions |
| OCSB (Osborn-Chui-Smith-Birchenhall) | No seasonal unit root | Default in fable's ARIMA |

## Model selection

With manual identification you pick orders by reading ACF/PACF plots. With automated selection you pick orders by minimizing an information criterion on a grid of candidates. The standard criteria are:

| Criterion | Formula | Behavior |
|-----------|---------|----------|
| [AIC](/wiki/aic) (Akaike) | `-2 log L + 2k` | Picks slightly larger models, asymptotically efficient for prediction |
| AICc (corrected AIC) | AIC plus a small-sample bias correction | Recommended when n/k < 40 |
| [BIC](/wiki/bic) (Schwarz) | `-2 log L + k log n` | Penalizes complexity more heavily, consistent for true model order |
| HQIC (Hannan-Quinn) | `-2 log L + 2k log log n` | A compromise |

Here `L` is the maximized likelihood, `k` the number of parameters, and `n` the sample size. None of the criteria is universally best; they encode different priors about complexity. Cross-validation on rolling forecast windows is more expensive but tends to give better out-of-sample performance.

### auto.arima and the Hyndman-Khandakar algorithm

Rob J. Hyndman and Yeasmin Khandakar published a step-wise search algorithm in 2008 (Journal of Statistical Software) that combines unit-root testing for `d` and `D` with a fast greedy search over `(p, q, P, Q)` minimizing AICc. It starts from a small set of seed models, perturbs the orders one at a time, and keeps the best until no neighbor improves. The implementation in the R `forecast` package became the de-facto reference for automated ARIMA fitting and was ported to Python as `pmdarima`.

auto.arima is not a replacement for thinking. It can pick weird models on noisy data, it does not know about structural breaks, and its default settings impose a maximum on the orders that the user can change. But for routine forecasting on hundreds of series, it is hard to beat in terms of effort-per-forecast.

## Variants and extensions

### ARFIMA (fractional ARIMA)

Granger and Joyeux (1980) and Hosking (1981) generalized differencing to fractional orders, allowing `d` to take any real value (typically `0 < d < 0.5` for stationarity). ARFIMA models long-memory processes whose autocorrelation decays as a power law rather than exponentially. They show up in hydrology, network traffic, and some financial volatility series.

### VAR and VARIMA

Vector autoregression (VAR), popularized by Christopher Sims's 1980 paper *Macroeconomics and Reality*, generalizes AR(p) to multiple series at once: each variable is regressed on lags of itself and all the others. VARIMA adds differencing, but in practice the multivariate world has been dominated by VAR rather than VARIMA, partly because identifying multivariate MA components is harder and partly because VAR has a clean interpretation in terms of impulse-response functions.

Cointegration, the Nobel-winning Engle-Granger and Johansen frameworks, sits on top of VAR. When several nonstationary series share a common stochastic trend, you can model their differences and the long-run equilibrium together in a vector error-correction model (VECM).

### GARCH and ARIMA-GARCH

ARIMA models the conditional mean and assumes constant variance. Robert Engle's 1982 ARCH model (Nobel Prize 2003) and Tim Bollerslev's 1986 generalization to GARCH model the conditional variance, which is essential for financial returns. A common pipeline is to fit ARIMA to the mean equation and GARCH to the residuals, giving an ARIMA-GARCH composite. See [GARCH](/wiki/garch).

### Threshold and nonlinear AR

Howard Tong's 1978 threshold autoregressive (TAR) model lets the AR coefficients switch between regimes depending on whether a state variable crosses a threshold. Smooth transition AR (STAR) replaces the hard switch with a logistic-like function. These nonlinear cousins of ARIMA capture asymmetries that linear models cannot, but they need more data and are harder to identify.

### State-space and unobserved components

Almost every ARIMA model has an equivalent state-space representation. Andrew Harvey's *Forecasting, Structural Time Series Models and the Kalman Filter* (1989) reformulated the field around the state-space view, which makes irregular time stamps, missing data, time-varying coefficients, and exogenous regressors much easier to handle. The Bayesian counterpart is BSTS (Bayesian Structural Time Series), used at Google for causal impact analysis.

## Software implementations

| Package | Language | Notes |
|---------|----------|-------|
| `forecast` (`auto.arima`, `Arima`) | R | Hyndman-Khandakar reference implementation |
| `fable` | R | Successor to forecast in the tidyverts |
| `statsmodels.tsa.arima.model.ARIMA` | Python | Standard ARIMA class in [statsmodels](/wiki/statsmodels) |
| `statsmodels.tsa.statespace.SARIMAX` | Python | More general state-space estimator |
| `pmdarima.auto_arima` | Python | Port of auto.arima |
| `darts` | Python | Wraps statsmodels and exposes ARIMA alongside neural forecasters |
| `nixtla statsforecast` | Python | Fast vectorized AutoARIMA for parallel fitting |
| `prophet` (Stan backend) | Python / R | Not strictly ARIMA but the most common alternative for the same problems |
| `TimeSeries.jl`, `StateSpaceModels.jl` | Julia | Julia ecosystem ARIMA tooling |
| PROC ARIMA | SAS | Used heavily in pharma and finance |
| EViews ARIMA / SARIMAX | EViews | Standard in academic econometrics |
| `arima` command | Stata | Estimation by ML or CSS |
| `forecast::Arima`, `prophet`, `fpp3` book code | R / book | Hyndman and Athanasopoulos's open-access *Forecasting: Principles and Practice* uses fable throughout |

Most of these compute the likelihood with a Kalman filter on the state-space form, which means they handle missing observations and irregular data naturally. The coefficient estimates and standard errors should agree across packages to several decimal places when the optimizer converges, but the default starting values, convergence tolerances, and order-selection heuristics differ.

## ARIMA versus modern deep-learning forecasters

The last decade brought a wave of neural and foundation-model forecasters. They sit at different points on the bias-variance and effort-to-deploy trade-offs.

| Method | Year | Type | Multivariate | Probabilistic | Notes |
|--------|------|------|--------------|---------------|-------|
| ARIMA | 1970 | Linear, parametric | No (use VAR or SARIMAX) | Gaussian residuals | Fast, interpretable, baseline |
| Exponential smoothing (ETS) | 1957-1985 | Linear, parametric | No | Gaussian residuals | Often tied with ARIMA in M-competitions |
| Theta method | 2000 | Decomposition | No | Gaussian residuals | Won M3; surprisingly hard to beat |
| [Prophet](/wiki/prophet) | 2017 | Additive components | No | Bayesian intervals | Built at Facebook for business series with multiple seasonalities |
| DeepAR | 2017 | Autoregressive RNN | Cross-series | Quantile or Gaussian | Built at Amazon, trains across many related series; uses [LSTM](/wiki/lstm) |
| [N-BEATS](/wiki/n_beats) | 2019 | Residual MLP stack | No | Point forecasts | Yoshua Bengio's group at Element AI; outperformed M4 winner |
| [Temporal Fusion Transformer](/wiki/temporal_fusion_transformer) | 2019 | Attention model | Yes | Quantile | Google Brain, handles known and unknown future inputs |
| N-HiTS | 2022 | Hierarchical interpolation | No | Point forecasts | Faster N-BEATS variant |
| TimesNet | 2023 | 2D inception | Yes | Point forecasts | Tsinghua group |
| Lag-Llama | 2023 | Decoder-only transformer | No | Probabilistic | Open foundation model trained on diverse series |
| [TimeGPT](/wiki/timegpt) | 2023 | Encoder-decoder transformer | Yes | Quantile | Nixtla; first commercial time-series foundation model API |
| Chronos | 2024 | T5 over tokenized values | No | Probabilistic | Built at Amazon, zero-shot |
| Moirai | 2024 | Masked encoder | Yes | Probabilistic | Salesforce, any-frequency any-variate |
| TimesFM | 2024 | Decoder transformer | No | Point forecasts | Built at Google, zero-shot |
| Time-LLM | 2024 | Reprogrammed LLM | No | Point forecasts | Wraps a frozen LLM with a series adapter |

A few honest comparisons:

- On a single short clean univariate series (a few hundred to a few thousand points), a well-chosen ARIMA or ETS often matches or beats neural methods. The bias-variance trade-off favors low-variance estimators when there is little data to estimate from.
- On large panels of related series (thousands of SKUs, retail stores, energy meters), neural models that share parameters across series usually win. DeepAR was the first published method to demonstrate this clearly. ARIMA fits each series independently and cannot borrow strength.
- On series with multiple seasonalities, holiday effects, and known future regressors, methods like Prophet, TFT, and N-BEATS-X tend to handle the structure more naturally than SARIMAX without extra feature engineering.
- Foundation models (Chronos, Moirai, TimesFM, TimeGPT, Lag-Llama) offer something ARIMA cannot: zero-shot forecasting on a new series with no fitting at all. The trade-off is that they are large, slow, opaque, and sometimes wrong in ways that are hard to diagnose.
- Hybrid models combining ARIMA with LSTM or other neural networks on the residuals (Zhang 2003 was an early proposal) sometimes outperform either component alone on series with a smooth linear trend plus nonlinear noise.

## M-competition history

The M-competitions, organized by Spyros Makridakis since 1982, are large empirical forecasting bake-offs. They are the closest thing the field has to a leaderboard.

| Competition | Year | Series | Winner | ARIMA finish |
|-------------|------|--------|--------|--------------|
| M (M1) | 1982 | 1,001 | Simple methods | Mid-pack; no clear ARIMA advantage |
| M2 | 1993 | 29 | Mixed | Mixed |
| M3 | 2000 | 3,003 | Theta method (later beaten by ForecastPro) | ARIMA competitive on yearly series, weaker on monthly |
| M4 | 2018 | 100,000 | Slawek Smyl's hybrid ES-RNN (Uber) | Pure ARIMA below ensemble baselines, still beat many ML methods |
| M5 | 2020 | 42,840 (Walmart) | LightGBM-based gradient boosting ensembles | ARIMA in the bottom half |
| M6 | 2022 | 50 (financial) | Forecasting plus investment | Not the focus |

The M3 paper (Makridakis and Hibon, 2000) became famous for the conclusion that "statistically sophisticated or complex methods do not necessarily produce more accurate forecasts than simpler ones," which was a hard sell for machine-learning advocates at the time. The M4 paper (2018) flipped this: the top six entries were all hybrids of statistical and machine-learning methods, and the pure-statistical baselines came in below them. The M5 result was even more decisive for machine learning, with gradient-boosting ensembles dominating. Even so, the M5 organizers noted that the simple combination benchmark (a mix of statistical methods including ARIMA and ETS) beat the majority of the 5,558 submitted entries.

## Where ARIMA still wins

Despite the press around foundation models, ARIMA holds up in several settings:

- Small datasets. With a few dozen to a few hundred observations and no related series to borrow from, low-parameter linear models are often the best you can do. Neural models overfit on small data.
- Univariate series with clean structure. A monthly series with a single annual seasonality and a few hundred points is the ideal SARIMA target.
- Interpretable forecasts. Coefficients have a meaning, confidence intervals are well-defined, and the diagnostic plots tell you when the model is wrong.
- Latency-sensitive deployment. ARIMA fits and predicts in milliseconds and ships in a few KB. Foundation models are often gigabytes.
- Regulatory or audit settings. In pharma trial monitoring, financial risk reporting, or any setting where you have to explain the model to a regulator, the linear-Gaussian story is much easier than "a transformer trained on five billion time points."
- As a baseline. Even when it loses, ARIMA is the reference that any new method has to clear. If a new model does not beat auto.arima, it has not yet proven anything.

## Limitations

ARIMA's weaknesses are essentially the inverse of its strengths:

- Linear and Gaussian. Real series often have heavy tails, asymmetries, and nonlinear dynamics that ARIMA cannot capture. The standard answer is to wrap a GARCH layer for volatility or move to TAR / STAR for nonlinearity.
- Single seasonality. SARIMA only handles one period at a time. Hourly electricity load needs at least daily, weekly, and yearly cycles, which is awkward in SARIMA but natural in TBATS, Prophet, or neural models.
- Univariate by default. Adding exogenous variables through SARIMAX works, but cross-series learning (where the model trains on a panel of related series and shares parameters) is not part of the framework. DeepAR and global neural models are built for that.
- Stationarity assumption. The differenced series has to be stationary. Series with structural breaks, regime changes, or evolving variance violate this. In practice you fit on a window after the most recent break.
- Long-range dependence. ARIMA's autocorrelation decays exponentially, which is too fast for long-memory series. ARFIMA fixes this at the cost of harder estimation.
- Order selection is fragile. ACF/PACF identification is more art than science, and auto.arima can pick odd orders on noisy data. Slightly different `(p, d, q)` choices can give noticeably different forecasts.
- Forecast uncertainty grows with horizon. The confidence intervals widen quickly for long horizons, often to the point of uselessness. This is honest but unhelpful when stakeholders want a single number.
- Manual tuning effort. Even with auto.arima, tuning seasonal orders, exogenous regressors, and Box-Cox transformations on a new dataset is not point-and-click. Foundation models bypass that step entirely at the cost of opacity.

## Practical applications

ARIMA still has a long tail of working deployments:

- Demand forecasting in retail and supply chain, where it competes with gradient-boosted trees and neural models.
- Macroeconomic forecasting at central banks, often inside larger structural or VAR models.
- Inventory planning, where the classical Box-Jenkins workflow is taught in operations research courses.
- Health-care monitoring, including hospital admissions, drug demand, and disease incidence.
- Energy load forecasting at short horizons, where it shows up in ensembles.
- Network traffic engineering, especially where ARFIMA's long-memory variants matter.
- Climate and weather as a baseline, although numerical weather prediction and now graph neural network models like GraphCast dominate the operational picture.

It is also still the most-taught time-series method in MBA programs, statistics departments, and engineering curricula, which means most working analysts have an ARIMA model in their toolkit even if they reach for something else first.

## See also

- [Time series](/wiki/time_series)
- [Time series analysis](/wiki/time_series_analysis)
- [Forecasting](/wiki/forecasting)
- [Stationarity](/wiki/stationarity)
- [Linear regression](/wiki/linear_regression)
- [Regression](/wiki/regression)
- [Machine learning](/wiki/machine_learning)
- [Deep learning](/wiki/deep_learning)
- [LSTM](/wiki/lstm)
- [GARCH](/wiki/garch)
- [Prophet](/wiki/prophet)
- [DeepAR](/wiki/deepar)
- [N-BEATS](/wiki/n_beats)
- [Temporal Fusion Transformer](/wiki/temporal_fusion_transformer)
- [TimeGPT](/wiki/timegpt)
- [TimesFM](/wiki/timesfm)
- [Chronos](/wiki/chronos)
- [Lag-Llama](/wiki/lag_llama)
- [Moirai](/wiki/moirai)
- [Time-LLM](/wiki/time_llm)
- [Statsmodels](/wiki/statsmodels)
- [Foundation model](/wiki/foundation_model)
- [Mean squared error](/wiki/mean_squared_error_mse)
- [Augmented Dickey-Fuller test](/wiki/adf_test)
- [KPSS test](/wiki/kpss_test)
- [AIC](/wiki/aic)
- [BIC](/wiki/bic)
- [Maximum likelihood estimation](/wiki/mle)
- [George Box](/wiki/george_box)
- [Gwilym Jenkins](/wiki/gwilym_jenkins)
- [Rob Hyndman](/wiki/rob_hyndman)

## References

1. Box, G. E. P., & Jenkins, G. M. (1970). *Time Series Analysis: Forecasting and Control*. Holden-Day.
2. Box, G. E. P., & Jenkins, G. M. (1976). *Time Series Analysis: Forecasting and Control* (revised ed.). Holden-Day.
3. Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1994). *Time Series Analysis: Forecasting and Control* (3rd ed.). Prentice Hall.
4. Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). *Time Series Analysis: Forecasting and Control* (5th ed.). Wiley.
5. Brockwell, P. J., & Davis, R. A. (1991). *Time Series: Theory and Methods* (2nd ed.). Springer.
6. Hamilton, J. D. (1994). *Time Series Analysis*. Princeton University Press.
7. Hyndman, R. J., & Athanasopoulos, G. (2021). *Forecasting: Principles and Practice* (3rd ed.). OTexts. https://otexts.com/fpp3/
8. Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. *Journal of Statistical Software*, 27(3), 1-22.
9. Wold, H. (1938). *A Study in the Analysis of Stationary Time Series*. Almqvist & Wiksell.
10. Yule, G. U. (1927). On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers. *Philosophical Transactions of the Royal Society A*, 226, 267-298.
11. Walker, G. T. (1931). On periodicity in series of related terms. *Proceedings of the Royal Society A*, 131(818), 518-532.
12. Slutsky, E. (1937). The summation of random causes as the source of cyclic processes. *Econometrica*, 5(2), 105-146.
13. Granger, C. W. J., & Joyeux, R. (1980). An introduction to long-memory time series models and fractional differencing. *Journal of Time Series Analysis*, 1(1), 15-29.
14. Hosking, J. R. M. (1981). Fractional differencing. *Biometrika*, 68(1), 165-176.
15. Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. *Econometrica*, 50(4), 987-1007.
16. Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. *Journal of Econometrics*, 31(3), 307-327.
17. Sims, C. A. (1980). Macroeconomics and reality. *Econometrica*, 48(1), 1-48.
18. Tong, H. (1978). On a threshold model. In C. H. Chen (ed.), *Pattern Recognition and Signal Processing*. Sijthoff & Noordhoff.
19. Harvey, A. C. (1989). *Forecasting, Structural Time Series Models and the Kalman Filter*. Cambridge University Press.
20. Ljung, G. M., & Box, G. E. P. (1978). On a measure of lack of fit in time series models. *Biometrika*, 65(2), 297-303.
21. Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. *Journal of the American Statistical Association*, 74(366), 427-431.
22. Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., & Shin, Y. (1992). Testing the null hypothesis of stationarity against the alternative of a unit root. *Journal of Econometrics*, 54(1-3), 159-178.
23. Makridakis, S., & Hibon, M. (2000). The M3-Competition: results, conclusions and implications. *International Journal of Forecasting*, 16(4), 451-476.
24. Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). The M4 Competition: results, findings, conclusion and way forward. *International Journal of Forecasting*, 34(4), 802-808.
25. Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2022). M5 accuracy competition: results, findings, and conclusions. *International Journal of Forecasting*, 38(4), 1346-1364.
26. Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). DeepAR: probabilistic forecasting with autoregressive recurrent networks. *International Journal of Forecasting*, 36(3), 1181-1191.
27. Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2020). N-BEATS: neural basis expansion analysis for interpretable time series forecasting. *International Conference on Learning Representations*.
28. Lim, B., Arik, S. O., Loeff, N., & Pfister, T. (2021). Temporal fusion transformers for interpretable multi-horizon time series forecasting. *International Journal of Forecasting*, 37(4), 1748-1764.
29. Taylor, S. J., & Letham, B. (2018). Forecasting at scale. *The American Statistician*, 72(1), 37-45.
30. Garza, A., Mergenthaler-Canseco, M., & Challu, C. (2023). TimeGPT-1. *Nixtla preprint*. https://docs.nixtla.io/
31. Ansari, A. F., Stella, L., Turkmen, C., Zhang, X., et al. (2024). Chronos: learning the language of time series. *Amazon Science*.
32. Das, A., Kong, W., Sen, R., & Zhou, Y. (2024). A decoder-only foundation model for time-series forecasting. *International Conference on Machine Learning*.
33. Woo, G., Liu, C., Kumar, A., Xiong, C., et al. (2024). Unified training of universal time series forecasting transformers (Moirai). *International Conference on Machine Learning*.
34. Rasul, K., Ashok, A., Williams, A. R., Khorasani, A., et al. (2023). Lag-Llama: towards foundation models for time series forecasting. *NeurIPS Time Series Workshop*.
35. Jin, M., Wang, S., Ma, L., Chu, Z., et al. (2024). Time-LLM: time series forecasting by reprogramming large language models. *International Conference on Learning Representations*.
36. Zhang, G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. *Neurocomputing*, 50, 159-175.
37. Assimakopoulos, V., & Nikolopoulos, K. (2000). The theta model: a decomposition approach to forecasting. *International Journal of Forecasting*, 16(4), 521-530.
38. statsmodels documentation: ARIMA and SARIMAX. https://www.statsmodels.org/
39. R `forecast` package documentation: auto.arima. https://pkg.robjhyndman.com/forecast/
40. pmdarima documentation: auto_arima. https://alkaline-ml.com/pmdarima/

