See also: Machine learning terms
Temporal data is data where each observation is tagged with a timestamp, so the order in which observations arrive carries meaning. The most familiar form is a time series, a sequence of numerical values measured at successive points in time, but the term covers a wider family of structures including event streams, panel data, and irregularly sampled clinical records. Because adjacent observations are usually correlated rather than independent, temporal data breaks the i.i.d. assumption that underpins most of classical machine learning. That is why it has its own modeling traditions, stretching back to Yule's autoregressive work in the 1920s and crystallizing in 1970 with the Box and Jenkins textbook on ARIMA.
The defining property of temporal data is that every record carries a timestamp and the timestamp matters. Four common shapes show up in practice:
| shape | example | typical task |
|---|---|---|
| time series (regular) | hourly electricity load, daily stock close, monthly retail sales | forecasting, anomaly detection, classification |
| time series (irregular) | trades on an order book, hospital lab results, IoT telemetry with dropouts | imputation, intensity modeling, point processes |
| panel or longitudinal | the same set of patients measured monthly for two years | mixed-effects models, survival analysis |
| event stream | clickstream, log lines, server traces | sessionization, sequence prediction |
A regular time series has equally spaced timestamps, which is the easy case and the one most textbooks assume. Irregular time series, where the gap between observations varies, are the rule rather than the exception in healthcare and finance and need either resampling, masking, or models that ingest the time deltas directly. Panel data sits between cross-sectional and time-series data: many short series, one per entity, often handled with hierarchical or global models that share information across series.
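The resampling option can be sketched in pandas; the timestamps and values below are illustrative, and the choice of hourly bins with forward-fill is just one of several reasonable policies.

```python
import pandas as pd

# Irregularly timestamped readings (illustrative values).
raw = pd.DataFrame(
    {"value": [10.0, 12.0, 11.0, 15.0]},
    index=pd.to_datetime(
        ["2024-01-01 00:03", "2024-01-01 00:58",
         "2024-01-01 02:40", "2024-01-01 04:05"]
    ),
)

# Resample onto a regular hourly grid: average the readings that fall in
# each hour, then forward-fill the hours that received no observation.
regular = raw.resample("60min").mean().ffill()
```

Resampling trades fidelity for convenience: the forward-filled rows are fabricated, which is exactly why masking or delta-aware models are preferred when the gaps themselves carry information.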
A series is strictly stationary if its joint distribution is invariant under time shifts, and weakly stationary (the version actually used in practice) if its mean, variance, and autocovariance depend only on the lag, not on the absolute time. Most classical methods assume weak stationarity, which is why preprocessing such as differencing, detrending, or seasonal adjustment is so common. If you fit an ARMA model to a series with a strong upward trend, you will get nonsense, because the parameters cannot represent a moving mean.
The Augmented Dickey-Fuller (ADF) test (Said and Dickey 1984) is the standard hypothesis test for a unit root, the technical condition that rules out stationarity. The KPSS test inverts the null hypothesis and is often used alongside ADF as a sanity check. When a series is non-stationary, the typical fix is differencing: replace each value with the gap from the previous value. One pass of differencing removes a linear trend; seasonal differencing at lag m removes a fixed seasonal pattern of period m.
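Differencing itself is a one-liner. A small sketch on synthetic data shows both claims: one pass turns a linear trend into a constant, and seasonal differencing at lag m cancels a fixed period-m pattern.

```python
import numpy as np

t = np.arange(48, dtype=float)
trend = 2.0 + 0.5 * t                            # linear trend, slope 0.5
seasonal = np.tile([0.0, 3.0, -1.0, -2.0], 12)   # fixed pattern, period m = 4
y = trend + seasonal

# First difference y_t - y_{t-1}: a linear trend becomes a constant.
d1 = np.diff(trend)

# Seasonal difference at lag m, y_t - y_{t-m}: the period-m pattern cancels,
# leaving only the trend's contribution of slope * m = 2.0.
dm = y[4:] - y[:-4]
```

In practice the differenced series is what gets handed to an ARMA model, and the number of passes needed is the d in ARIMA(p, d, q).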
A classical view splits a time series into three components: trend, seasonality, and a residual. The split can be additive (y_t = T_t + S_t + R_t), which is appropriate when the seasonal swing has a roughly constant size, or multiplicative (y_t = T_t * S_t * R_t), which fits series where the seasonal effect grows with the level. STL (Seasonal-Trend decomposition using Loess, Cleveland et al. 1990) is the most widely used implementation and handles a single seasonal period gracefully. MSTL extends it to multiple seasonalities, which matters for hourly data that has both daily and weekly cycles.
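A naive additive decomposition, in the spirit of (but much cruder than) STL, can be written in a few lines: estimate the trend with a moving average, then average the detrended values position by position within the period.

```python
import numpy as np

m = 12                                   # seasonal period
t = np.arange(120, dtype=float)
season = 5.0 * np.sin(2 * np.pi * t / m)
y = 0.3 * t + season                     # additive: y_t = T_t + S_t, no noise

# Trend: moving average of length m. (Textbooks use a centered 2 x m MA when
# m is even; a plain length-m window keeps the sketch short.)
trend_hat = np.convolve(y, np.ones(m) / m, mode="valid")

# Seasonal: average the detrended values position by position within the
# period, then constrain the seasonal component to sum to zero over one cycle.
detrended = y[m // 2 : m // 2 + len(trend_hat)] - trend_hat
seasonal_hat = np.array([detrended[i::m].mean() for i in range(m)])
seasonal_hat -= seasonal_hat.mean()
```

STL replaces the crude moving average with Loess smoothing and iterates with robustness weights, which is what lets it tolerate outliers and slowly drifting seasonality.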
The autocorrelation function (ACF) measures the linear correlation between a series and a lagged copy of itself. The partial autocorrelation function (PACF) strips out the indirect effect of intermediate lags. Together they are the diagnostic backbone of the Box-Jenkins workflow: a PACF that cuts off after lag p and an ACF that decays exponentially suggests an AR(p); the mirror image suggests an MA(q).
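The sample ACF is simple enough to write by hand (the PACF is usually computed via the Durbin-Levinson recursion, omitted here). On a simulated AR(1) with coefficient 0.8, the ACF decays roughly like 0.8^k, which is the exponential-decay signature the Box-Jenkins workflow looks for.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation: correlation of the series with its lag-k copy."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

# Simulate an AR(1): x_t = 0.8 * x_{t-1} + e_t.
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
x = np.zeros_like(e)
for i in range(1, len(e)):
    x[i] = 0.8 * x[i - 1] + e[i]

acf = sample_acf(x, nlags=5)   # acf[1] should be near 0.8, decaying after
```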
Most of the classical toolkit was assembled between 1950 and 1985 and is still the right answer for many small datasets, especially when the series is short and the signal-to-noise ratio is poor.
| family | full name | typical use | reference |
|---|---|---|---|
| AR | autoregression | series whose value depends on its own past | Yule 1927 |
| MA | moving average | shocks that decay over a few steps | Slutsky 1937 |
| ARMA | autoregressive moving average | stationary series with both effects | Box and Jenkins 1970 |
| ARIMA | autoregressive integrated moving average | non-stationary series after differencing | Box and Jenkins 1970 |
| SARIMA | seasonal ARIMA | series with a fixed seasonal cycle | Box and Jenkins 1970 |
| SARIMAX | SARIMA with exogenous regressors | when external drivers are known | Box, Jenkins, Reinsel, Ljung 2015 |
| ETS / Holt-Winters | exponential smoothing state-space | level, trend, and seasonality with exponential weights | Hyndman et al. 2008 |
| state-space model / Kalman filter | linear Gaussian state-space | structural decomposition, online updates | Kalman 1960 |
| VAR | vector autoregression | small multivariate systems | Sims 1980 |
| GARCH | generalized autoregressive conditional heteroskedasticity | volatility clustering in returns | Bollerslev 1986 |
Box and Jenkins (1970) gave the field its first cohesive methodology: identify the model order using the ACF and PACF, estimate the parameters by maximum likelihood, then check the residuals for whiteness using the Ljung-Box test. The same recipe still appears in statsmodels, forecast (R), and pmdarima decades later.
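The residual-whiteness check at the end of that recipe can be sketched by hand. The Ljung-Box statistic is Q = n(n+2) Σ_{k=1}^{h} ρ̂_k² / (n−k), compared against a χ² distribution with h degrees of freedom; statsmodels exposes it as `acorr_ljungbox`. A minimal sketch:

```python
import numpy as np

def ljung_box_q(x, h):
    """Ljung-Box statistic Q = n(n+2) * sum_k acf_k^2 / (n - k), lags 1..h.
    Large Q means the residuals are autocorrelated, i.e. not white noise."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    denom = np.dot(x, x)
    acf = [np.dot(x[k:], x[:n - k]) / denom for k in range(1, h + 1)]
    return n * (n + 2) * sum(r * r / (n - k) for k, r in enumerate(acf, start=1))

rng = np.random.default_rng(1)
white = rng.standard_normal(1000)
# For white noise, Q is approximately chi-squared with h degrees of freedom
# (mean h); a random walk's residual-like series scores orders of magnitude higher.
q = ljung_box_q(white, h=10)
```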
Gradient-boosted trees changed the practical landscape of forecasting in the late 2010s. The trick is to convert the forecasting problem into supervised regression by creating lag features (the value at t-1, t-7, t-28, and so on), rolling statistics (rolling mean, rolling standard deviation), and calendar features (day of week, month, holiday flags). Once the data is in this tabular form, XGBoost, LightGBM, and CatBoost can be applied directly.
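The feature construction is straightforward in pandas; this sketch uses the lags and windows mentioned above on a toy daily series, shifting the rolling statistics by one step so that no feature peeks at the value being predicted.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=120, freq="D")
df = pd.DataFrame({"y": np.arange(120, dtype=float)}, index=idx)

# Lag features: the value 1, 7, and 28 steps back.
for lag in (1, 7, 28):
    df[f"lag_{lag}"] = df["y"].shift(lag)

# Rolling statistics, shifted by one step so they only use past values.
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["roll_std_7"] = df["y"].shift(1).rolling(7).std()

# Calendar features.
df["dayofweek"] = df.index.dayofweek
df["month"] = df.index.month

# Rows whose lags reach before the start of the series are dropped.
train = df.dropna()
```

The resulting frame can be passed directly to any tabular learner; for multi-step horizons one either forecasts recursively or trains one model per horizon step.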
This approach won the M5 competition (Makridakis, Spiliotis, and Assimakopoulos 2022), the largest forecasting competition to date, which used hierarchical Walmart sales data covering more than 42,000 series. LightGBM models featured prominently in the winning ensembles for both the accuracy and uncertainty tracks, and the M5 organizers explicitly noted that gradient boosting outperformed both classical statistical baselines and the pure deep learning entries on this dataset. The result was a useful corrective to the assumption, common around 2019, that deep learning had already taken over forecasting.
Random forests with the same lag-feature setup are a reasonable baseline and rarely embarrassing, though they tend to lose to boosted trees by a few percent on most datasets.
Deep models for temporal data fall into three rough generations: recurrent, then attention-based, then foundation models.
The vanilla recurrent neural network processes a sequence one step at a time, carrying a hidden state forward. The 1990s diagnosis of the vanishing gradient problem (Hochreiter 1991, Bengio et al. 1994) explained why plain RNNs fail to learn long dependencies, and the LSTM (Hochreiter and Schmidhuber 1997) fixed the issue with a gated cell that preserves information across long stretches. The GRU (Cho et al. 2014) is a lighter alternative with comparable accuracy in most benchmarks. RNN-style models still dominate when sequences are short and predictions are streamed in real time.
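A single GRU step can be written out directly from the Cho et al. (2014) equations. This numpy sketch uses small random weights and is only meant to show the gating mechanics, not to be trained.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, params):
    """One GRU step: gates decide how much of the old hidden state to keep
    and how much of the new candidate state to write in."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1.0 - z) * h + z * h_tilde         # interpolate old and new

rng = np.random.default_rng(0)
d_in, d_hid = 3, 4
params = [rng.normal(scale=0.1, size=s)
          for s in [(d_in, d_hid), (d_hid, d_hid)] * 3]

# Process a short sequence one step at a time, carrying the hidden state forward.
h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):
    h = gru_step(x, h, params)
```

The interpolation in the last line of `gru_step` is the fix for vanishing gradients: when z is near zero, the old state passes through almost unchanged.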
The most influential recurrent forecaster is DeepAR (Salinas, Flunkert, and Gasthaus 2017, published in the International Journal of Forecasting in 2020). DeepAR trains a single LSTM globally across many related series and outputs the parameters of a likelihood (Gaussian or negative binomial) at each step, giving probabilistic forecasts rather than point predictions. It is the workhorse model behind Amazon SageMaker's built-in forecaster.
Between 2020 and 2024 a wave of transformer variants targeted the long-horizon forecasting problem.
| model | year | venue | key idea |
|---|---|---|---|
| Informer | 2021 | AAAI (best paper) | ProbSparse attention with O(L log L) complexity for long sequences |
| Temporal Fusion Transformer | 2021 | International Journal of Forecasting | gated variable selection plus interpretable multi-head attention |
| Autoformer | 2021 | NeurIPS | Auto-Correlation block in place of self-attention, series decomposition inside the model |
| FEDformer | 2022 | ICML | frequency-domain attention using Fourier or wavelet bases, linear complexity |
| PatchTST | 2023 | ICLR | patch the series like ViT patches an image, channel-independent encoding |
| iTransformer | 2024 | ICLR (spotlight) | invert the axes, treat each variable as a token instead of each timestamp |
The field has not been entirely smooth. Zeng et al. (2023, AAAI) caused a stir with a paper titled "Are Transformers Effective for Time Series Forecasting?" showing that a simple linear model called DLinear matched or beat several published transformer baselines. PatchTST, iTransformer, and the foundation models below were partly responses to that critique.
N-BEATS (Oreshkin, Carpov, Chapados, and Bengio 2019, presented at ICLR 2020) is a stack of fully connected residual blocks that learn additive basis functions. It set the state of the art on the M3, M4, and TOURISM datasets without any time-series-specific machinery, and remains a strong baseline that runs without a GPU.
N-HiTS (Challu, Olivares, Oreshkin, Garza, Mergenthaler-Canseco, and Dubrawski 2023) extended N-BEATS with multi-rate signal pooling and hierarchical interpolation. The published numbers report a 25% improvement over Informer on long-horizon benchmarks while running about 50x faster.
The transformer wave eventually spawned an attempt to do for time series what GPT did for text: pretrain a single large model on a huge cross-domain corpus and use it zero-shot or with light fine-tuning.
| model | maker | year | size / training data | notes |
|---|---|---|---|---|
| TimeGPT | Nixtla | 2023 | 100B+ datapoints, closed source | first commercial foundation forecaster, served via API |
| Lag-Llama | ServiceNow / Morgan Stanley / Mila | Oct 2023 | decoder-only transformer using lags as covariates | first open-weights foundation forecaster |
| TimesFM | Google Research | Oct 2023, ICML 2024 | 200M params, 100B time-points including Google Trends and Wikipedia pageviews | decoder-only, patch-based, open weights |
| Chronos | Amazon | Mar 2024 | 20M to 710M params, T5 backbone | tokenizes values via scaling and quantization, trains with cross-entropy |
| Moirai | Salesforce | 2024 (ICML) | masked encoder, LOTSA dataset of 27B observations across 9 domains | any-variate attention, mixture output distribution |
Chronos is probably the most distinctive of the bunch. It treats forecasting as language modeling: scale the series, quantize values into a finite vocabulary, then train a T5 to predict the next token using cross-entropy. The Chronos-Bolt variant released in November 2024 reports up to 250x speedup over the original at slightly better accuracy.
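The scale-and-quantize step can be sketched in a few lines. This is a simplified illustration of the idea, not the exact Chronos recipe; the bin range and vocabulary size here are illustrative.

```python
import numpy as np

def tokenize(series, n_bins=4096, low=-15.0, high=15.0):
    """Scale by the mean absolute value, then quantize into a fixed vocabulary
    of bin indices -- a simplified sketch of Chronos-style tokenization."""
    series = np.asarray(series, dtype=float)
    scale = np.mean(np.abs(series))
    scale = scale if scale > 0 else 1.0
    edges = np.linspace(low, high, n_bins - 1)   # uniform bin edges
    tokens = np.digitize(series / scale, edges)  # token ids in [0, n_bins - 1]
    return tokens, scale

def detokenize(tokens, scale, n_bins=4096, low=-15.0, high=15.0):
    """Map each token back to its bin center, then undo the scaling."""
    edges = np.linspace(low, high, n_bins - 1)
    centers = np.concatenate([[low], (edges[:-1] + edges[1:]) / 2, [high]])
    return centers[tokens] * scale

y = np.array([10.0, 12.0, 9.0, 11.0])
tokens, scale = tokenize(y)
y_hat = detokenize(tokens, scale)   # close to y, up to quantization error
```

Once values are tokens, the forecasting model is literally a language model: sample next tokens autoregressively, detokenize, and the samples form a predictive distribution.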
Forecasting is not the only temporal task. Time-series classification assigns a label to a whole sequence: is this ECG normal or arrhythmic, is this gesture a swipe or a tap, is this audio clip music or speech. The benchmark of record is the UCR Time Series Archive (Dau, Bagnall, Kamgar, Yeh, Zhu, Gharghabi, Ratanamahatana, and Keogh 2018), which expanded from 85 to 128 datasets and has been cited in well over a thousand papers. The companion UEA archive (Bagnall et al. 2018) covers 30 multivariate datasets.
Classical methods for classification include 1-nearest-neighbor with dynamic time warping, BOSS (bag of SFA symbols), shapelets, and the HIVE-COTE ensemble. ROCKET (Dempster, Petitjean, and Webb 2020) hits state-of-the-art accuracy by transforming each series with thousands of random convolutional kernels and feeding the result to a linear classifier. The original ROCKET trains and tests on all 85 bake-off UCR datasets in under two hours on a single CPU. MiniROCKET, MultiROCKET, and HYDRA are faster successors from the same group.
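A toy version of the ROCKET idea fits in a dozen lines: convolve with random kernels and keep, per kernel, the maximum and the proportion of positive values (PPV). The real ROCKET also randomizes dilation, padding, and bias scale and uses on the order of 10,000 kernels feeding a ridge classifier; this sketch only shows the feature map.

```python
import numpy as np

def random_kernel_features(series, n_kernels=100, seed=0):
    """ROCKET-style features (toy version): convolve the series with random
    kernels and record, per kernel, the max activation and the PPV."""
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])
        weights = rng.standard_normal(length)
        weights -= weights.mean()                  # zero-mean kernel
        bias = rng.normal()
        conv = np.convolve(series, weights, mode="valid") + bias
        feats.append(conv.max())                   # strongest match anywhere
        feats.append((conv > 0).mean())            # proportion of positive values
    return np.array(feats)

x = np.sin(np.linspace(0, 20, 300))
f = random_kernel_features(x)   # 2 features per kernel
```

The striking part of the original result is that the kernels are never trained; all the learning happens in the cheap linear classifier on top.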
Anomaly detection on temporal data shows up in fraud monitoring, IT telemetry, manufacturing, and health alerts. The methods range from simple statistical procedures (Twitter's S-H-ESD, the Seasonal Hybrid ESD test), to standard industrial benchmarks (the Numenta Anomaly Benchmark, the Yahoo S5 dataset, and the SMD server-machine dataset), to modern deep models (Anomaly Transformer, USAD, TranAD). The honest version of this story is that no single algorithm dominates, and a 2022 paper by Wu and Keogh ("Current Time Series Anomaly Detection Benchmarks are Flawed") argued that several headline benchmarks have such severe label issues that reported accuracy gains are not trustworthy.
| dataset | domain | size | typical task |
|---|---|---|---|
| M3 (Makridakis 2000) | mixed business series | 3,003 series | forecasting, mostly short |
| M4 (Makridakis 2018) | mixed business series | 100,000 series | forecasting at six frequencies |
| M5 (Makridakis 2022) | Walmart hierarchical retail | 42,840 series | hierarchical forecasting and uncertainty |
| Monash forecasting archive (Godahewa 2021) | 25 source datasets, 58 variants | varies | unified benchmark for global models |
| ETT (Zhou 2021) | electricity transformer temperature | 2 years hourly and 15-min | long-horizon forecasting |
| Traffic (Caltrans PEMS) | freeway sensor occupancy | hourly, 862 sensors | long-horizon multivariate |
| Weather | meteorological station | 10-min, 21 variables | multivariate forecasting |
| Electricity (UCI ElectricityLoadDiagrams) | client power demand | 15-min, 370 clients | global forecasting |
| UCR Archive (Dau 2018) | mixed | 128 datasets | univariate classification |
| UEA Archive (Bagnall 2018) | mixed | 30 datasets | multivariate classification |
The Monash archive deserves special mention: before its release in 2021, the field had no single place to compare global forecasting models, and benchmark cherry-picking was rampant.
The single most common beginner mistake is using random k-fold cross-validation on a time series. Random folding leaks information from the future into the training set, which inflates accuracy and hides overfitting. The correct approaches keep time intact: a temporal holdout trains on [start, T] and tests on [T+1, T+h], and rolling-origin evaluation repeats that split while moving the cutoff T forward. The TimeSeriesSplit class in scikit-learn implements expanding-window splits; Hyndman's tsCV function in the forecast R package implements rolling-origin evaluation.
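The shape of a leakage-free split can be shown without any library. This minimal generator mirrors what an expanding-window splitter such as scikit-learn's TimeSeriesSplit produces when the test size is fixed:

```python
def expanding_window_splits(n, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs where every training window
    ends strictly before its test window begins."""
    for k in range(n_splits):
        test_start = n - (n_splits - k) * test_size
        yield (list(range(test_start)),
               list(range(test_start, test_start + test_size)))

# Every fold trains only on the past: no index in a test window ever
# appears in (or before) its own training window's future.
for train_idx, test_idx in expanding_window_splits(n=20, n_splits=3, test_size=4):
    assert max(train_idx) < min(test_idx)
```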
| metric | full name | use |
|---|---|---|
| MAE | mean absolute error | scale-dependent, robust to outliers |
| RMSE | root mean squared error | scale-dependent, penalizes large errors |
| MAPE | mean absolute percentage error | scale-free, undefined when actuals are zero, asymmetric |
| sMAPE | symmetric MAPE | bounded, used in M3 and M4 competitions |
| MASE | mean absolute scaled error | Hyndman and Koehler 2006, scale-free, well-behaved at zero |
| WAPE | weighted absolute percentage error | hierarchical reporting, used in M5 |
| pinball loss | quantile loss | training and evaluating quantile forecasts |
| CRPS | continuous ranked probability score | evaluating probabilistic forecasts against a true value |
| WIS | weighted interval score | evaluating quantile-based probabilistic forecasts |
MASE (Hyndman and Koehler 2006) is the metric Hyndman recommends as the all-purpose default because it works at any scale, handles zeros, and is interpretable as a ratio against a naive seasonal forecast.
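MASE is also easy to compute by hand: the forecast's MAE divided by the in-sample MAE of the seasonal naive forecast (predict y_{t-m}). A minimal sketch with illustrative numbers:

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean absolute scaled error (Hyndman and Koehler 2006). Values below 1
    mean the forecast beats the naive benchmark fitted in-sample."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

y_train = np.array([10.0, 12.0, 14.0, 16.0])
y_true = np.array([18.0, 20.0])
y_pred = np.array([17.0, 21.0])
# Naive in-sample MAE is 2.0; forecast MAE is 1.0, so MASE = 0.5.
score = mase(y_true, y_pred, y_train)
```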
| library | language | strengths |
|---|---|---|
| statsmodels | Python | full ARIMA, SARIMA, ETS, state-space, classical hypothesis tests |
| forecast / fable | R | Hyndman's reference implementations of ETS, ARIMA, TBATS |
| Prophet | Python and R | piecewise-linear trend plus Fourier seasonality, robust defaults |
| sktime | Python | scikit-learn-style API across forecasting, classification, regression |
| tslearn | Python | DTW, k-Shape, time-series k-means |
| GluonTS | Python | DeepAR, TFT, MQ-CNN, probabilistic models on MXNet and PyTorch |
| Darts | Python | unified API across statistical, ML, and deep models |
| NeuralForecast (Nixtla) | Python | N-BEATS, N-HiTS, TFT, PatchTST in a single package |
| StatsForecast (Nixtla) | Python | parallelized classical models, fast AutoARIMA |
| PyTorch Forecasting | Python | TFT and DeepAR with PyTorch Lightning |
| Kats (Meta) | Python | forecasting, anomaly detection, change-point detection |
| Chronos and Lag-Llama | Python | foundation models distributed via Hugging Face |
Prophet (Taylor and Letham 2018, The American Statistician) deserves a separate mention because it normalized the workflow of "fit a forecast in three lines and tweak the change points by hand" for non-specialists. It is not the most accurate model in any benchmark, but it is genuinely useful for analysts who want to add holiday effects without learning ARIMA.
Temporal data shows up almost everywhere there is a sensor, a transaction, or a clock.
Temporal data is data that has a time element, like how the temperature changes throughout the day or how many ice creams are sold each month. By looking at this data we can find patterns, like when it gets hot more ice creams are sold. Then we can use these patterns to guess how many ice creams will be sold next month or what the temperature will be like tomorrow. The hard part is that the future depends on the past, so we cannot mix up the order of our examples the way we might for pictures of cats and dogs. We have to keep yesterday before today, always.