See also: Machine learning terms
Temporal data is data where each observation is tagged with a timestamp, so the order in which observations arrive carries meaning. The most familiar form is a time series, a sequence of numerical values measured at successive points in time, but the term covers a wider family of structures including event streams, panel data, and irregularly sampled clinical records. Because adjacent observations are usually correlated rather than independent, temporal data breaks the i.i.d. assumption that underpins most of classical machine learning. That is why it has its own modeling traditions, stretching back to Yule's autoregressive work in the 1920s and crystallizing in 1970 with the Box and Jenkins textbook on ARIMA.
The defining property of temporal data is that every record carries a timestamp and the timestamp matters. Four common shapes show up in practice:
| shape | example | typical task |
|---|---|---|
| time series (regular) | hourly electricity load, daily stock close, monthly retail sales | forecasting, anomaly detection, classification |
| time series (irregular) | trades on an order book, hospital lab results, IoT telemetry with dropouts | imputation, intensity modeling, point processes |
| panel or longitudinal | the same set of patients measured monthly for two years | mixed-effects models, survival analysis |
| event stream | clickstream, log lines, server traces | sessionization, sequence prediction |
A regular time series has equally spaced timestamps, which is the easy case and the one most textbooks assume. Irregular time series, where the gap between observations varies, are the rule rather than the exception in healthcare and finance and need either resampling, masking, or models that ingest the time deltas directly. Panel data sits between cross-sectional and time-series data: many short series, one per entity, often handled with hierarchical or global models that share information across series.
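The resampling option can be sketched in pandas; the timestamps and values below are illustrative, and the choice of hourly bins with forward-fill is just one of several reasonable policies.

```python
import pandas as pd

# Irregularly timestamped readings (illustrative values).
raw = pd.DataFrame(
    {"value": [10.0, 12.0, 11.0, 15.0]},
    index=pd.to_datetime(
        ["2024-01-01 00:03", "2024-01-01 00:58",
         "2024-01-01 02:40", "2024-01-01 04:05"]
    ),
)

# Resample onto a regular hourly grid: average the readings that fall in
# each hour, then forward-fill the hours that received no observation.
regular = raw.resample("60min").mean().ffill()
```

Resampling trades fidelity for convenience: the forward-filled rows are fabricated, which is exactly why masking or delta-aware models are preferred when the gaps themselves carry information.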
A series is strictly stationary if its joint distribution is invariant under time shifts, and weakly stationary (the version actually used in practice) if its mean, variance, and autocovariance depend only on the lag, not on the absolute time. Most classical methods assume weak stationarity, which is why preprocessing such as differencing, detrending, or seasonal adjustment is so common. If you fit an ARMA model to a series with a strong upward trend, you will get nonsense, because the parameters cannot represent a moving mean.
The Augmented Dickey-Fuller (ADF) test (Said and Dickey 1984) is the standard hypothesis test for a unit root, the technical condition that rules out stationarity. The KPSS test inverts the null hypothesis and is often used alongside ADF as a sanity check. When a series is non-stationary, the typical fix is differencing: replace each value with the gap from the previous value. One pass of differencing removes a linear trend; seasonal differencing at lag m removes a fixed seasonal pattern of period m.
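Differencing itself is a one-liner. A small sketch on synthetic data shows both claims: one pass turns a linear trend into a constant, and seasonal differencing at lag m cancels a fixed period-m pattern.

```python
import numpy as np

t = np.arange(48, dtype=float)
trend = 2.0 + 0.5 * t                            # linear trend, slope 0.5
seasonal = np.tile([0.0, 3.0, -1.0, -2.0], 12)   # fixed pattern, period m = 4
y = trend + seasonal

# First difference y_t - y_{t-1}: a linear trend becomes a constant.
d1 = np.diff(trend)

# Seasonal difference at lag m, y_t - y_{t-m}: the period-m pattern cancels,
# leaving only the trend's contribution of slope * m = 2.0.
dm = y[4:] - y[:-4]
```

In practice the differenced series is what gets handed to an ARMA model, and the number of passes needed is the d in ARIMA(p, d, q).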
A classical view splits a time series into three components: trend, seasonality, and a residual. The split can be additive (y_t = T_t + S_t + R_t), which is appropriate when the seasonal swing has a roughly constant size, or multiplicative (y_t = T_t * S_t * R_t), which fits series where the seasonal effect grows with the level. STL (Seasonal-Trend decomposition using Loess, Cleveland et al. 1990) is the most widely used implementation and handles a single seasonal period gracefully. MSTL extends it to multiple seasonalities, which matters for hourly data that has both daily and weekly cycles.
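A naive additive decomposition, in the spirit of (but much cruder than) STL, can be written in a few lines: estimate the trend with a moving average, then average the detrended values position by position within the period.

```python
import numpy as np

m = 12                                   # seasonal period
t = np.arange(120, dtype=float)
season = 5.0 * np.sin(2 * np.pi * t / m)
y = 0.3 * t + season                     # additive: y_t = T_t + S_t, no noise

# Trend: moving average of length m. (Textbooks use a centered 2 x m MA when
# m is even; a plain length-m window keeps the sketch short.)
trend_hat = np.convolve(y, np.ones(m) / m, mode="valid")

# Seasonal: average the detrended values position by position within the
# period, then constrain the seasonal component to sum to zero over one cycle.
detrended = y[m // 2 : m // 2 + len(trend_hat)] - trend_hat
seasonal_hat = np.array([detrended[i::m].mean() for i in range(m)])
seasonal_hat -= seasonal_hat.mean()
```

STL replaces the crude moving average with Loess smoothing and iterates with robustness weights, which is what lets it tolerate outliers and slowly drifting seasonality.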
The autocorrelation function (ACF) measures the linear correlation between a series and a lagged copy of itself. The partial autocorrelation function (PACF) strips out the indirect effect of intermediate lags. Together they are the diagnostic backbone of the Box-Jenkins workflow: a PACF that cuts off after lag p and an ACF that decays exponentially suggests an AR(p); the mirror image suggests an MA(q).
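The sample ACF is simple enough to write by hand (the PACF is usually computed via the Durbin-Levinson recursion, omitted here). On a simulated AR(1) with coefficient 0.8, the ACF decays roughly like 0.8^k, which is the exponential-decay signature the Box-Jenkins workflow looks for.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation: correlation of the series with its lag-k copy."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

# Simulate an AR(1): x_t = 0.8 * x_{t-1} + e_t.
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
x = np.zeros_like(e)
for i in range(1, len(e)):
    x[i] = 0.8 * x[i - 1] + e[i]

acf = sample_acf(x, nlags=5)   # acf[1] should be near 0.8, decaying after
```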
Most of the classical toolkit was assembled between 1950 and 1985 and is still the right answer for many small datasets, especially when the series is short and the signal-to-noise ratio is poor.
| family | full name | typical use | reference |
|---|---|---|---|
| AR | autoregression | series whose value depends on its own past | Yule 1927 |
| MA | moving average | shocks that decay over a few steps | Slutsky 1937 |
| ARMA | autoregressive moving average | stationary series with both effects | Box and Jenkins 1970 |
| ARIMA | autoregressive integrated moving average | non-stationary series after differencing | Box and Jenkins 1970 |
| SARIMA | seasonal ARIMA | series with a fixed seasonal cycle | Box and Jenkins 1970 |
| SARIMAX | SARIMA with exogenous regressors | when external drivers are known | Box, Jenkins, Reinsel, Ljung 2015 |
| ETS / Holt-Winters | exponential smoothing state-space | level, trend, and seasonality with exponential weights | Hyndman et al. 2008 |
| state-space model / Kalman filter | linear Gaussian state-space | structural decomposition, online updates | Kalman 1960 |
| VAR | vector autoregression | small multivariate systems | Sims 1980 |
| GARCH | generalized autoregressive conditional heteroskedasticity | volatility clustering in returns | Bollerslev 1986 |
Box and Jenkins (1970) gave the field its first cohesive methodology: identify the model order using the ACF and PACF, estimate the parameters by maximum likelihood, then check the residuals for whiteness using the Ljung-Box test. The same recipe still appears in statsmodels, forecast (R), and pmdarima decades later.
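The residual-whiteness check at the end of that recipe can be sketched by hand. The Ljung-Box statistic is Q = n(n+2) Σ_{k=1}^{h} ρ̂_k² / (n−k), compared against a χ² distribution with h degrees of freedom; statsmodels exposes it as `acorr_ljungbox`. A minimal sketch:

```python
import numpy as np

def ljung_box_q(x, h):
    """Ljung-Box statistic Q = n(n+2) * sum_k acf_k^2 / (n - k), lags 1..h.
    Large Q means the residuals are autocorrelated, i.e. not white noise."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    denom = np.dot(x, x)
    acf = [np.dot(x[k:], x[:n - k]) / denom for k in range(1, h + 1)]
    return n * (n + 2) * sum(r * r / (n - k) for k, r in enumerate(acf, start=1))

rng = np.random.default_rng(1)
white = rng.standard_normal(1000)
# For white noise, Q is approximately chi-squared with h degrees of freedom
# (mean h); a random walk's residual-like series scores orders of magnitude higher.
q = ljung_box_q(white, h=10)
```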
Gradient-boosted trees changed the practical landscape of forecasting in the late 2010s. The trick is to convert the forecasting problem into supervised regression by creating lag features (the value at t-1, t-7, t-28, and so on), rolling statistics (rolling mean, rolling standard deviation), and calendar features (day of week, month, holiday flags). Once the data is in this tabular form, XGBoost, LightGBM, and CatBoost can be applied directly.
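The feature construction is straightforward in pandas; this sketch uses the lags and windows mentioned above on a toy daily series, shifting the rolling statistics by one step so that no feature peeks at the value being predicted.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=120, freq="D")
df = pd.DataFrame({"y": np.arange(120, dtype=float)}, index=idx)

# Lag features: the value 1, 7, and 28 steps back.
for lag in (1, 7, 28):
    df[f"lag_{lag}"] = df["y"].shift(lag)

# Rolling statistics, shifted by one step so they only use past values.
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["roll_std_7"] = df["y"].shift(1).rolling(7).std()

# Calendar features.
df["dayofweek"] = df.index.dayofweek
df["month"] = df.index.month

# Rows whose lags reach before the start of the series are dropped.
train = df.dropna()
```

The resulting frame can be passed directly to any tabular learner; for multi-step horizons one either forecasts recursively or trains one model per horizon step.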
This approach won the M5 competition (Makridakis, Spiliotis, and Assimakopoulos 2022), the largest forecasting competition to date, which used hierarchical Walmart sales data covering more than 42,000 series. LightGBM models featured prominently in the winning ensembles for both the accuracy and uncertainty tracks, and the M5 organizers explicitly noted that gradient boosting outperformed both classical statistical baselines and the pure deep learning entries on this dataset. The result was a useful corrective to the assumption, common around 2019, that deep learning had already taken over forecasting.
Random forests with the same lag-feature setup are a reasonable baseline and rarely embarrassing, though they tend to lose to boosted trees by a few percent on most datasets.
Deep models for temporal data fall into three rough generations: recurrent, then attention-based, then foundation models.
The vanilla recurrent neural network processes a sequence one step at a time, carrying a hidden state forward. The 1990s diagnosis of the vanishing gradient problem (Hochreiter 1991, Bengio et al. 1994) explained why plain RNNs fail to learn long dependencies, and the LSTM (Hochreiter and Schmidhuber 1997) fixed the issue with a gated cell that preserves information across long stretches. The GRU (Cho et al. 2014) is a lighter alternative with comparable accuracy in most benchmarks. RNN-style models still dominate when sequences are short and predictions are streamed in real time.
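A single GRU step can be written out directly from the Cho et al. (2014) equations. This numpy sketch uses small random weights and is only meant to show the gating mechanics, not to be trained.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, params):
    """One GRU step: gates decide how much of the old hidden state to keep
    and how much of the new candidate state to write in."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1.0 - z) * h + z * h_tilde         # interpolate old and new

rng = np.random.default_rng(0)
d_in, d_hid = 3, 4
params = [rng.normal(scale=0.1, size=s)
          for s in [(d_in, d_hid), (d_hid, d_hid)] * 3]

# Process a short sequence one step at a time, carrying the hidden state forward.
h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):
    h = gru_step(x, h, params)
```

The interpolation in the last line of `gru_step` is the fix for vanishing gradients: when z is near zero, the old state passes through almost unchanged.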
The most influential recurrent forecaster is DeepAR (Salinas, Flunkert, and Gasthaus 2017, published in the International Journal of Forecasting in 2020). DeepAR trains a single LSTM globally across many related series and outputs the parameters of a likelihood (Gaussian or negative binomial) at each step, giving probabilistic forecasts rather than point predictions. It is the workhorse model behind Amazon SageMaker's built-in forecaster.
Between 2020 and 2024 a wave of transformer variants targeted the long-horizon forecasting problem.
| model | year | venue | key idea |
|---|---|---|---|
| Informer | 2021 | AAAI (best paper) | ProbSparse attention with O(L log L) complexity for long sequences |
| Temporal Fusion Transformer | 2021 | International Journal of Forecasting | gated variable selection plus interpretable multi-head attention |
| Autoformer | 2021 | NeurIPS | Auto-Correlation block in place of self-attention, series decomposition inside the model |
| FEDformer | 2022 | ICML | frequency-domain attention using Fourier or wavelet bases, linear complexity |
| PatchTST | 2023 | ICLR | patch the series like ViT patches an image, channel-independent encoding |
| iTransformer | 2024 | ICLR (spotlight) | invert the axes, treat each variable as a token instead of each timestamp |
The field has not been entirely smooth. Zeng et al. (2023, AAAI) caused a stir with a paper titled "Are Transformers Effective for Time Series Forecasting?" showing that a simple linear model called DLinear matched or beat several published transformer baselines. PatchTST, iTransformer, and the foundation models below were partly responses to that critique.
N-BEATS (Oreshkin, Carpov, Chapados, and Bengio 2019, presented at ICLR 2020) is a stack of fully connected residual blocks that learn additive basis functions. It set the state of the art on the M3, M4, and TOURISM datasets without any time-series-specific machinery, and remains a strong baseline that runs without a GPU.
N-HiTS (Challu, Olivares, Oreshkin, Garza, Mergenthaler-Canseco, and Dubrawski 2023) extended N-BEATS with multi-rate signal pooling and hierarchical interpolation. The published numbers report a 25% improvement over Informer on long-horizon benchmarks while running about 50x faster.
The transformer wave eventually spawned an attempt to do for time series what GPT did for text: pretrain a single large model on a huge cross-domain corpus and use it zero-shot or with light fine-tuning.
| model | maker | year | size / training data | notes |
|---|---|---|---|---|
| TimeGPT | Nixtla | 2023 | 100B+ datapoints, closed source | first commercial foundation forecaster, served via API |
| Lag-Llama | ServiceNow / Morgan Stanley / Mila | Oct 2023 | decoder-only transformer using lags as covariates | first open-weights foundation forecaster |
| TimesFM | Google Research | Oct 2023, ICML 2024 | 200M params, 100B time-points including Google Trends and Wikipedia pageviews | decoder-only, patch-based, open weights |
| Chronos | Amazon | Mar 2024 | 20M to 710M params, T5 backbone | tokenizes values via scaling and quantization, trains with cross-entropy |
| Moirai | Salesforce | 2024 (ICML) | masked encoder, LOTSA dataset of 27B observations across 9 domains | any-variate attention, mixture output distribution |
Chronos is probably the most distinctive of the bunch. It treats forecasting as language modeling: scale the series, quantize values into a finite vocabulary, then train a T5 to predict the next token using cross-entropy. The Chronos-Bolt variant released in November 2024 reports up to 250x speedup over the original at slightly better accuracy.
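The scale-and-quantize step can be sketched in a few lines. This is a simplified illustration of the idea, not the exact Chronos recipe; the bin range and vocabulary size here are illustrative.

```python
import numpy as np

def tokenize(series, n_bins=4096, low=-15.0, high=15.0):
    """Scale by the mean absolute value, then quantize into a fixed vocabulary
    of bin indices -- a simplified sketch of Chronos-style tokenization."""
    series = np.asarray(series, dtype=float)
    scale = np.mean(np.abs(series))
    scale = scale if scale > 0 else 1.0
    edges = np.linspace(low, high, n_bins - 1)   # uniform bin edges
    tokens = np.digitize(series / scale, edges)  # token ids in [0, n_bins - 1]
    return tokens, scale

def detokenize(tokens, scale, n_bins=4096, low=-15.0, high=15.0):
    """Map each token back to its bin center, then undo the scaling."""
    edges = np.linspace(low, high, n_bins - 1)
    centers = np.concatenate([[low], (edges[:-1] + edges[1:]) / 2, [high]])
    return centers[tokens] * scale

y = np.array([10.0, 12.0, 9.0, 11.0])
tokens, scale = tokenize(y)
y_hat = detokenize(tokens, scale)   # close to y, up to quantization error
```

Once values are tokens, the forecasting model is literally a language model: sample next tokens autoregressively, detokenize, and the samples form a predictive distribution.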
Forecasting is not the only temporal task. Time-series classification assigns a label to a whole sequence: is this ECG normal or arrhythmic, is this gesture a swipe or a tap, is this audio clip music or speech. The benchmark of record is the UCR Time Series Archive (Dau, Bagnall, Kamgar, Yeh, Zhu, Gharghabi, Ratanamahatana, and Keogh 2018), which expanded from 85 to 128 datasets and has been cited in well over a thousand papers. The companion UEA archive (Bagnall et al. 2018) covers 30 multivariate datasets.
Classical methods for classification include 1-nearest-neighbor with dynamic time warping, BOSS (bag of SFA symbols), shapelets, and the HIVE-COTE ensemble. ROCKET (Dempster, Petitjean, and Webb 2020) hits state-of-the-art accuracy by transforming each series with thousands of random convolutional kernels and feeding the result to a linear classifier. The original ROCKET trains and tests on all 85 bake-off UCR datasets in under two hours on a single CPU. MiniROCKET, MultiROCKET, and HYDRA are faster successors from the same group.
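A toy version of the ROCKET idea fits in a dozen lines: convolve with random kernels and keep, per kernel, the maximum and the proportion of positive values (PPV). The real ROCKET also randomizes dilation, padding, and bias scale and uses on the order of 10,000 kernels feeding a ridge classifier; this sketch only shows the feature map.

```python
import numpy as np

def random_kernel_features(series, n_kernels=100, seed=0):
    """ROCKET-style features (toy version): convolve the series with random
    kernels and record, per kernel, the max activation and the PPV."""
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])
        weights = rng.standard_normal(length)
        weights -= weights.mean()                  # zero-mean kernel
        bias = rng.normal()
        conv = np.convolve(series, weights, mode="valid") + bias
        feats.append(conv.max())                   # strongest match anywhere
        feats.append((conv > 0).mean())            # proportion of positive values
    return np.array(feats)

x = np.sin(np.linspace(0, 20, 300))
f = random_kernel_features(x)   # 2 features per kernel
```

The striking part of the original result is that the kernels are never trained; all the learning happens in the cheap linear classifier on top.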
Anomaly detection on temporal data shows up in fraud monitoring, IT telemetry, manufacturing, and health alerts. The methods range from simple statistical procedures (Twitter's S-H-ESD, the Seasonal Hybrid ESD test), to standard industrial benchmarks (the Numenta Anomaly Benchmark, the Yahoo S5 dataset, and the SMD server-machine dataset), to modern deep models (Anomaly Transformer, USAD, TranAD). The honest version of this story is that no single algorithm dominates, and a 2022 paper by Wu and Keogh ("Current Time Series Anomaly Detection Benchmarks are Flawed") argued that several headline benchmarks have such severe label issues that reported accuracy gains are not trustworthy.
| dataset | domain | size | typical task |
|---|---|---|---|
| M3 (Makridakis 2000) | mixed business series | 3,003 series | forecasting, mostly short |
| M4 (Makridakis 2018) | mixed business series | 100,000 series | forecasting at six frequencies |
| M5 (Makridakis 2022) | Walmart hierarchical retail | 42,840 series | hierarchical forecasting and uncertainty |
| Monash forecasting archive (Godahewa 2021) | 25 source datasets, 58 variants | varies | unified benchmark for global models |
| ETT (Zhou 2021) | electricity transformer temperature | 2 years hourly and 15-min | long-horizon forecasting |
| Traffic (Caltrans PEMS) | freeway sensor occupancy | hourly, 862 sensors | long-horizon multivariate |
| Weather | meteorological station | 10-min, 21 variables | multivariate forecasting |
| Electricity (UCI ElectricityLoadDiagrams) | client power demand | 15-min, 370 clients | global forecasting |
| UCR Archive (Dau 2018) | mixed | 128 datasets | univariate classification |
| UEA Archive (Bagnall 2018) | mixed | 30 datasets | multivariate classification |
The Monash archive deserves special mention: before its release in 2021, the field had no single place to compare global forecasting models, and benchmark cherry-picking was rampant.
The single most common beginner mistake is using random k-fold cross-validation on a time series. Random folding leaks information from the future into the training set, which inflates accuracy and hides overfitting. The correct approaches keep time intact: a temporal holdout trains on [start, T] and tests on [T+1, T+h], and rolling-origin evaluation repeats that split while moving the cutoff T forward. The TimeSeriesSplit class in scikit-learn implements expanding-window splits; Hyndman's tsCV function in the forecast R package implements rolling-origin evaluation.
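The shape of a leakage-free split can be shown without any library. This minimal generator mirrors what an expanding-window splitter such as scikit-learn's TimeSeriesSplit produces when the test size is fixed:

```python
def expanding_window_splits(n, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs where every training window
    ends strictly before its test window begins."""
    for k in range(n_splits):
        test_start = n - (n_splits - k) * test_size
        yield (list(range(test_start)),
               list(range(test_start, test_start + test_size)))

# Every fold trains only on the past: no index in a test window ever
# appears in (or before) its own training window's future.
for train_idx, test_idx in expanding_window_splits(n=20, n_splits=3, test_size=4):
    assert max(train_idx) < min(test_idx)
```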
| metric | full name | use |
|---|---|---|
| MAE | mean absolute error | scale-dependent, robust to outliers |
| RMSE | root mean squared error | scale-dependent, penalizes large errors |
| MAPE | mean absolute percentage error | scale-free, undefined when actuals are zero, asymmetric |
| sMAPE | symmetric MAPE | bounded, used in M3 and M4 competitions |
| MASE | mean absolute scaled error | Hyndman and Koehler 2006, scale-free, well-behaved at zero |
| WAPE | weighted absolute percentage error | hierarchical reporting, used in M5 |
| pinball loss | quantile loss | training and evaluating quantile forecasts |
| CRPS | continuous ranked probability score | evaluating probabilistic forecasts against a true value |
| WIS | weighted interval score | evaluating quantile-based probabilistic forecasts |
MASE (Hyndman and Koehler 2006) is the metric Hyndman recommends as the all-purpose default because it works at any scale, handles zeros, and is interpretable as a ratio against a naive seasonal forecast.
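MASE is also easy to compute by hand: the forecast's MAE divided by the in-sample MAE of the seasonal naive forecast (predict y_{t-m}). A minimal sketch with illustrative numbers:

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean absolute scaled error (Hyndman and Koehler 2006). Values below 1
    mean the forecast beats the naive benchmark fitted in-sample."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

y_train = np.array([10.0, 12.0, 14.0, 16.0])
y_true = np.array([18.0, 20.0])
y_pred = np.array([17.0, 21.0])
# Naive in-sample MAE is 2.0; forecast MAE is 1.0, so MASE = 0.5.
score = mase(y_true, y_pred, y_train)
```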
| library | language | strengths |
|---|---|---|
| statsmodels | Python | full ARIMA, SARIMA, ETS, state-space, classical hypothesis tests |
| forecast / fable | R | Hyndman's reference implementations of ETS, ARIMA, TBATS |
| Prophet | Python and R | piecewise-linear trend plus Fourier seasonality, robust defaults |
| sktime | Python | scikit-learn-style API across forecasting, classification, regression |
| tslearn | Python | DTW, k-Shape, time-series k-means |
| GluonTS | Python | DeepAR, TFT, MQ-CNN, probabilistic models on MXNet and PyTorch |
| Darts | Python | unified API across statistical, ML, and deep models |
| NeuralForecast (Nixtla) | Python | N-BEATS, N-HiTS, TFT, PatchTST in a single package |
| StatsForecast (Nixtla) | Python | parallelized classical models, fast AutoARIMA |
| PyTorch Forecasting | Python | TFT and DeepAR with PyTorch Lightning |
| Kats (Meta) | Python | forecasting, anomaly detection, change-point detection |
| Chronos and Lag-Llama | Python | foundation models distributed via Hugging Face |
Prophet (Taylor and Letham 2018, The American Statistician) deserves a separate mention because it normalized the workflow of "fit a forecast in three lines and tweak the change points by hand" for non-specialists. It is not the most accurate model in any benchmark, but it is genuinely useful for analysts who want to add holiday effects without learning ARIMA.
Temporal data shows up almost everywhere there is a sensor, a transaction, or a clock.
Temporal data is data that has a time element, like how the temperature changes throughout the day or how many ice creams are sold each month. By looking at this data we can find patterns, like when it gets hot more ice creams are sold. Then we can use these patterns to guess how many ice creams will be sold next month or what the temperature will be like tomorrow. The hard part is that the future depends on the past, so we cannot mix up the order of our examples the way we might for pictures of cats and dogs. We have to keep yesterday before today, always.