See also: Machine learning terms
Time series analysis is a statistical and computational discipline focused on analyzing data points collected or recorded in chronological order. It plays a critical role in fields ranging from finance and economics to meteorology, healthcare, and industrial operations. In machine learning, time series analysis is used to build predictive models that forecast future events based on historical data, detect anomalies in streaming signals, classify temporal patterns, and impute missing observations. The primary goal is to extract meaningful insights from time-ordered data and utilize those insights to make better-informed decisions.
Over the past several decades, the toolkit for time series analysis has expanded considerably. Classical statistical methods such as ARIMA and exponential smoothing gave way to machine learning approaches built on gradient boosting and random forests, which were in turn complemented by deep learning architectures including LSTMs, temporal convolutional networks, and transformers. Most recently, foundation models pretrained on billions of time points have introduced zero-shot and few-shot forecasting capabilities that challenge traditional model-fitting workflows.
Imagine you are tracking how tall your sunflower grows every week. After a few months, you have a list of numbers: 2 inches, 5 inches, 9 inches, 12 inches, and so on. Time series analysis is like looking at that list to figure out a pattern. You might notice the sunflower grows about 3 inches per week (the trend), it grows faster in the warm summer weeks and slower as the weather cools each year (the seasonality), and some weeks it barely grows or shoots up for no obvious reason (the noise).
Once you understand those patterns, you can make a guess about how tall the sunflower will be next week or next month. That guess is called a forecast. Computers do the same thing with stock prices, weather temperatures, electricity usage, and almost anything else that changes over time. They look at the history, find the patterns, and then predict what comes next.
A time series is a sequence of data points recorded at regular intervals over time. The data points typically represent a single variable observed at successive time instances. Time series data can be found in many real-world applications, such as stock prices, weather data, and sensor readings.
Time series data is characterized by four key components:
Trend: A long-term increase or decrease in the data over time. For example, the global average temperature has exhibited an upward trend over the past century.
Seasonality: Regular patterns or fluctuations in the data that recur at specific time periods (for example daily, monthly, or yearly). Retail sales typically spike during holiday seasons and dip in January.
Cyclicality: Fluctuations in the data that are not periodic but occur over irregular intervals. Business cycles of expansion and recession are a classic example; they repeat, but the length of each cycle varies.
Noise: Random variations in the data that are not attributable to any specific trend, seasonality, or cyclicality. Noise is the residual variation left after accounting for all systematic components.
A time series is said to be stationary if its statistical properties (mean, variance, autocorrelation) do not change over time. Many classical forecasting methods assume stationarity, so practitioners often need to transform non-stationary data before modeling. Common transformations include:
Differencing: replacing each observation with its change from the previous observation (or from the observation one seasonal cycle earlier) to remove trend or seasonality.
Log and power transformations: applying a logarithm or a Box-Cox transformation to stabilize variance that grows with the level of the series.
Detrending: estimating and subtracting a trend component so that only the fluctuations around it remain.
The Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test are widely used to assess whether a time series is stationary. The ADF test has a null hypothesis of non-stationarity (a unit root), while the KPSS test has a null hypothesis of stationarity. Using both tests together provides a more robust assessment.
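The sketch below shows how the two tests are typically run together using the statsmodels library (its adfuller and kpss functions); the random-walk data is synthetic and purely illustrative:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))  # random walk: non-stationary by construction

adf_stat, adf_p, *_ = adfuller(series)                               # H0: unit root (non-stationary)
kpss_stat, kpss_p, *_ = kpss(series, regression="c", nlags="auto")   # H0: stationary

print(f"ADF p-value:  {adf_p:.3f}  (small p => reject non-stationarity)")
print(f"KPSS p-value: {kpss_p:.3f}  (small p => reject stationarity)")

diff = np.diff(series)  # first differencing often restores stationarity
```

Because the null hypotheses point in opposite directions, a series that fails to reject under the ADF test while rejecting under the KPSS test gives a consistent signal of non-stationarity, and differencing is the usual remedy.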
Several techniques are employed in time series analysis to extract useful information from the data. These techniques can be classified into two categories: time-domain methods and frequency-domain methods.
Time-domain methods analyze the data directly in the time domain and are primarily concerned with the temporal structure of the data. Some common time-domain methods include autocorrelation and partial autocorrelation analysis, cross-correlation analysis, and classical decomposition into trend, seasonal, and residual components.
Frequency-domain methods transform the time series data into the frequency domain using techniques such as the Fourier transform and analyze the data in terms of its frequency components. Examples of frequency-domain methods include spectral analysis via the periodogram, spectral density estimation, and wavelet analysis.
Classical statistical methods for time series forecasting were developed primarily between the 1950s and the 1980s and remain in widespread use today. These methods rely on explicit statistical assumptions about the data-generating process and often serve as strong baselines against which newer methods are compared.
ARIMA (Autoregressive Integrated Moving Average), formalized by Box and Jenkins in 1970, is one of the most established approaches to time series forecasting. The model has three components: the autoregressive (AR) part models the relationship between an observation and a number of lagged observations; the integrated (I) part uses differencing to make the series stationary; and the moving average (MA) part models the relationship between an observation and a residual error from a lagged moving average model. The model is specified by three hyperparameters (p, d, q), where p is the order of the AR term, d is the degree of differencing, and q is the order of the MA term.
SARIMA (Seasonal ARIMA) extends ARIMA by adding seasonal components. It introduces additional seasonal parameters (P, D, Q, m), where m is the number of time steps per seasonal cycle. SARIMA is effective for data with strong repeating seasonal patterns, such as monthly retail sales or quarterly earnings.
ARIMA models generally perform well for short-to-medium horizon forecasts on univariate series with linear patterns. However, practical implementation often requires manual tuning, stationarity checks, and diagnostic analysis.
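As a minimal sketch of the workflow, assuming statsmodels and a synthetic monthly series, an ARIMA(1,1,1) can be fit and used to forecast as follows; the order is illustrative and would normally be chosen through ACF/PACF inspection or information criteria:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly series: linear trend plus noise.
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
rng = np.random.default_rng(1)
y = pd.Series(0.5 * np.arange(48) + rng.normal(scale=2.0, size=48), index=idx)

# order=(p, d, q): one AR lag, first differencing, one MA lag.
# A seasonal (SARIMA) model would add seasonal_order=(P, D, Q, m).
model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.params)
print(model.forecast(steps=6))  # six months ahead
```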
Exponential smoothing methods produce forecasts by computing weighted averages of past observations, with the weights decaying exponentially as the observations get older. The ETS (Error, Trend, Seasonality) framework provides a unified approach to exponential smoothing. Each component can be modeled as additive (A), multiplicative (M), or absent (N), producing a taxonomy of 30 possible model variants.
Simple Exponential Smoothing (SES) handles series with no trend or seasonality. Holt's Linear Method adds a trend component, while Holt-Winters' Method adds both trend and seasonal components. ETS models are fast to fit, require minimal tuning, and often outperform more complex models on short horizons. ETS and ARIMA are the two most widely used approaches to time series forecasting in practice.
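A minimal Holt-Winters sketch with statsmodels, fitting the ETS(A,A,A) cell of the taxonomy above to a hypothetical monthly series with additive trend and yearly seasonality:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly series with trend and yearly seasonality.
idx = pd.date_range("2019-01-01", periods=60, freq="MS")
rng = np.random.default_rng(2)
y = pd.Series(
    10 + 0.3 * np.arange(60)                      # trend
    + 5 * np.sin(2 * np.pi * np.arange(60) / 12)  # seasonality
    + rng.normal(scale=1.0, size=60),
    index=idx,
)

# Additive trend and additive seasonality: Holt-Winters' Method.
fit = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
print(fit.forecast(12))  # one year ahead
```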
Prophet is an open-source forecasting tool developed by Facebook (now Meta) and released in 2017. It implements a decomposable additive regression model with three main components: a piecewise linear or logistic growth curve for trend, Fourier series for yearly and weekly seasonality, and a user-supplied list of holidays and special events. The model is expressed as:
y(t) = g(t) + s(t) + h(t) + e(t)
where g(t) represents the trend function, s(t) represents seasonal changes, h(t) captures holiday effects, and e(t) is the error term.
Prophet uses the Stan probabilistic programming language as its computational backend and is available in both Python and R. It is designed to handle missing data, shifts in trend, and outliers automatically. Prophet works best with daily or sub-daily data that has strong seasonal effects and several seasons of historical data. It has become popular for business forecasting because it requires minimal manual configuration while still producing reasonable results.
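A minimal usage sketch with the Python prophet package; the placeholder daily series is illustrative. Prophet expects a dataframe with two columns named ds (dates) and y (values):

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

# Placeholder daily data; replace with your own observations.
df = pd.DataFrame({
    "ds": pd.date_range("2022-01-01", periods=365, freq="D"),
    "y": range(365),
})

m = Prophet()  # yearly and weekly seasonality are configured automatically
m.fit(df)

future = m.make_future_dataframe(periods=30)  # extend 30 days beyond history
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```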
| Method | Year introduced | Key idea | Strengths | Limitations |
|---|---|---|---|---|
| ARIMA | 1970 | Autoregressive + differencing + moving average | Strong linear modeling; well-understood theory | Requires stationarity; manual tuning of (p,d,q) |
| SARIMA | 1970s | Seasonal extension of ARIMA | Captures seasonal patterns | More parameters to tune; limited to single seasonality |
| ETS | 1950s-2000s | Exponential weighting of past observations | Fast; minimal tuning; solid short-horizon performance | Cannot model complex nonlinear relationships |
| Prophet | 2017 | Decomposable additive model with trend + seasonality + holidays | Handles missing data and outliers; user-friendly | Less effective for short, irregular, or non-seasonal series |
Feature engineering is the process of transforming raw time series data into input variables that machine learning models can use effectively. Because tree-based models and many neural network architectures expect tabular or fixed-length inputs, thoughtful feature construction is often the single most important step in a time series machine learning pipeline.
Lag features (also called lagged variables) represent previous values of the target series. For example, a lag-1 feature is the value from one time step ago, a lag-7 feature is the value from seven steps ago, and so on. The choice of which lags to include is guided by the autocorrelation function (ACF) and partial autocorrelation function (PACF), which reveal the strength of correlation at each lag. For daily retail data, lags at 1, 7, 14, and 28 days are common choices because they capture both recent momentum and weekly patterns.
Rolling statistics calculate summary measures over a sliding window of recent observations. Common rolling features include the rolling mean, rolling standard deviation, rolling minimum and maximum, and rolling median.
Window sizes are typically chosen to match meaningful time spans. For hourly energy data, a 24-hour rolling mean captures the daily cycle, while a 168-hour (one-week) rolling mean captures the weekly cycle.
Date-based features encode the position within known cycles: hour of day, day of week, day of month, month of year, quarter, whether a day is a weekend or public holiday, and the number of days until or since a known event. These features allow tree-based models to learn seasonal patterns without requiring explicit seasonal decomposition.
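The following sketch combines the three feature families described above (lags, rolling statistics, and date-based features) into a single pandas function; the column name y and the specific lags and windows are illustrative choices for daily data:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Expects a DatetimeIndex and a target column 'y' (hypothetical layout)."""
    out = df.copy()
    # Lag features: recent momentum (1) plus weekly structure (7, 14, 28).
    for lag in (1, 7, 14, 28):
        out[f"lag_{lag}"] = out["y"].shift(lag)
    # Rolling statistics computed on shifted values so the current
    # observation never leaks into its own features.
    for window in (7, 28):
        roll = out["y"].shift(1).rolling(window)
        out[f"roll_mean_{window}"] = roll.mean()
        out[f"roll_std_{window}"] = roll.std()
    # Date-based features encoding position within known cycles.
    out["day_of_week"] = out.index.dayofweek
    out["month"] = out.index.month
    out["is_weekend"] = (out.index.dayofweek >= 5).astype(int)
    return out
```

Note the shift(1) before each rolling computation: without it, the window would include the current value and leak the target into its own features.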
Fourier features use sine and cosine functions at different frequencies to represent seasonal and cyclical patterns as continuous variables. For a seasonal period of length P, a pair of Fourier features at order k is defined as sin(2 * pi * k * t / P) and cos(2 * pi * k * t / P). Using multiple Fourier orders captures both broad and fine-grained seasonal shapes. Prophet uses this technique internally for its seasonality components, and practitioners frequently add Fourier features when using gradient boosting models for time series.
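A small sketch of Fourier feature construction with NumPy and pandas, following the sin/cos definition above; the period and order values are illustrative:

```python
import numpy as np
import pandas as pd

def fourier_features(index: pd.DatetimeIndex, period: float, orders: int) -> pd.DataFrame:
    """sin/cos pairs for a seasonal cycle of `period` time steps."""
    t = np.arange(len(index))
    cols = {}
    for k in range(1, orders + 1):
        cols[f"sin_{period}_{k}"] = np.sin(2 * np.pi * k * t / period)
        cols[f"cos_{period}_{k}"] = np.cos(2 * np.pi * k * t / period)
    return pd.DataFrame(cols, index=index)

# Example: weekly cycle in daily data, three Fourier orders.
idx = pd.date_range("2024-01-01", periods=90, freq="D")
X_fourier = fourier_features(idx, period=7, orders=3)
```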
First differences (the change from one time step to the next), percentage changes, and higher-order differences can serve as features that make it easier for models to learn from non-stationary data. These features capture momentum and acceleration in the series.
Starting in the 2010s, machine learning methods originally designed for tabular data were adapted for time series problems. These approaches typically reframe the forecasting problem as a supervised learning task, where lagged values and engineered features serve as input variables and future values serve as targets.
Gradient boosting builds an ensemble of weak learners (usually decision trees) sequentially, where each new tree corrects the errors of the previous ensemble. For time series applications, the data must first be transformed into a supervised learning format using lag features, rolling statistics (such as moving averages and rolling standard deviations), date-based features (day of week, month, quarter), and other domain-specific features.
XGBoost (Extreme Gradient Boosting), introduced by Tianqi Chen in 2016, is an efficient and regularized implementation of gradient boosting. It grows trees level-by-level (depth-wise) and has been a favorite among competition winners on platforms such as Kaggle. XGBoost is effective for time series forecasting when combined with careful feature engineering.
LightGBM (Light Gradient Boosting Machine), developed by Microsoft, grows trees leaf-wise rather than level-wise, always splitting the leaf with the highest error reduction. This approach often reaches the same accuracy with fewer splits and trains significantly faster on large datasets due to its histogram-based algorithm. LightGBM consumes less memory than XGBoost because it stores binned feature values rather than exact values. For datasets exceeding several million rows, LightGBM is generally the more practical choice.
CatBoost, developed by Yandex, provides native handling of categorical features and uses ordered boosting to reduce prediction shift. It is particularly useful for time series datasets with categorical covariates (such as product IDs or region codes).
An important limitation of all tree-based models is their inability to extrapolate beyond the range of target values observed during training. This becomes critical when forecasting series with trends, because the model cannot predict values higher or lower than anything it has seen in the training data. Practitioners address this by detrending the series before modeling or by combining tree-based models with explicit trend components.
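The sketch below illustrates the full supervised reframing on a hypothetical daily series: lag, rolling, and calendar features feed a LightGBM regressor, with a chronological (non-shuffled) train/test split. All names and hyperparameters are illustrative:

```python
import lightgbm as lgb  # pip install lightgbm
import numpy as np
import pandas as pd

# Hypothetical daily series with a weekly cycle.
idx = pd.date_range("2022-01-01", periods=730, freq="D")
rng = np.random.default_rng(3)
df = pd.DataFrame(
    {"y": 20 + 3 * np.sin(2 * np.pi * np.arange(730) / 7) + rng.normal(size=730)},
    index=idx,
)

# Reframe as supervised learning: lags, rolling mean, calendar features.
for lag in (1, 7, 14):
    df[f"lag_{lag}"] = df["y"].shift(lag)
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["day_of_week"] = df.index.dayofweek
df = df.dropna()

X, y = df.drop(columns="y"), df["y"]
split = int(len(df) * 0.8)  # chronological split: never shuffle time series

model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X.iloc[:split], y.iloc[:split])
pred = model.predict(X.iloc[split:])
```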
Random forests are ensemble methods that train multiple decision trees on random subsets of the data and aggregate their predictions. For time series forecasting, random forests are used with the same feature engineering approach as gradient boosting: lag features, rolling statistics, and calendar features. Random forests offer strong baseline performance, are resistant to overfitting, and provide feature importance rankings that help practitioners understand which lags and features drive predictions.
Compared to gradient boosting, random forests are easier to tune but typically achieve slightly lower accuracy on forecasting tasks. They share the same extrapolation limitation as other tree-based methods.
| Method | Developer | Tree growth | Training speed | Memory | Key advantage |
|---|---|---|---|---|---|
| XGBoost | Tianqi Chen (2016) | Level-wise (depth) | Moderate | Higher | Regularization; wide adoption |
| LightGBM | Microsoft | Leaf-wise | Fast | Lower | Histogram-based; scales to large data |
| CatBoost | Yandex | Symmetric trees | Moderate | Moderate | Native categorical feature handling |
| Random Forest | Breiman (2001) | Multiple random trees | Fast | Moderate | Robust baseline; feature importance |
Deep learning models have become increasingly prominent in time series analysis due to their ability to automatically learn complex temporal patterns without manual feature engineering.
Recurrent neural networks (RNNs) are a class of neural networks with loops that allow information to persist over time, making them well-suited for sequential data. At each time step, an RNN takes the current input and the hidden state from the previous time step to produce an output and an updated hidden state. However, vanilla RNNs suffer from the vanishing gradient problem, which makes it difficult for them to learn long-term dependencies.
Long short-term memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, address the vanishing gradient problem by incorporating a gating mechanism with three gates: an input gate, a forget gate, and an output gate. These gates control the flow of information through the network, allowing LSTMs to selectively remember or forget information over long sequences.
LSTMs have been applied extensively to time series forecasting, including stock price prediction, energy demand forecasting, and weather prediction. Variants such as the Gated Recurrent Unit (GRU), introduced by Cho et al. in 2014, simplify the gating mechanism while maintaining similar performance. Bidirectional LSTMs (BiLSTMs) process the sequence in both forward and reverse directions, which can improve performance on tasks where future context is available (such as classification and imputation, but not real-time forecasting).
Despite their effectiveness, LSTMs have limitations: they process sequences step by step (preventing parallelization during training), and their performance can degrade on very long sequences due to memory decay.
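A minimal PyTorch sketch of a one-step-ahead LSTM forecaster trained on sliding windows of a synthetic sine wave; the window length, hidden size, and training loop are illustrative choices rather than a reference implementation:

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Minimal one-step-ahead forecaster: sequence in, scalar out."""
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, seq_len, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # predict from the last hidden state

# Sliding windows of length 24, each predicting the next value.
series = torch.sin(torch.linspace(0, 50, 500))
X = series.unfold(0, 24, 1)[:-1].unsqueeze(-1)   # (n_windows, 24, 1)
y = series[24:].unsqueeze(-1)                    # next value after each window

model = LSTMForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
```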
DeepAR, developed by Amazon and published by Salinas et al. in 2020, is a probabilistic forecasting method built on autoregressive recurrent neural networks. Instead of producing a single point forecast, DeepAR outputs a full probability distribution at each time step by using the RNN to parameterize a chosen likelihood function (such as a Gaussian or negative binomial distribution). A key strength of DeepAR is its ability to train a single global model on many related time series simultaneously, learning shared patterns across series while still respecting series-specific behavior through covariates. In benchmarks, DeepAR demonstrated accuracy improvements of roughly 15% over state-of-the-art methods on several real-world datasets. Amazon integrated DeepAR into its Forecast service as DeepAR+.
Temporal Convolutional Networks (TCNs) are a specialized convolutional neural network architecture designed for sequence modeling. TCNs use causal convolutions (where the output at time t depends only on inputs at time t and earlier) combined with dilated convolutions that exponentially increase the receptive field without increasing the number of parameters.
A TCN consists of dilated, causal 1D convolutional layers with residual connections. The dilation factor doubles at each layer, allowing the network to capture long-range dependencies efficiently. For example, with kernel size 2 and dilation factors of 1, 2, 4, 8, and 16, a five-layer TCN has a receptive field of 32 time steps.
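The receptive-field arithmetic can be written out directly: for a stack of dilated causal convolutions with kernel size k and dilations d_1, ..., d_n, the receptive field is 1 + (k - 1) * (d_1 + ... + d_n). A tiny helper makes this concrete:

```python
def tcn_receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field of a stack of dilated causal convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

print(tcn_receptive_field(2, [1, 2, 4, 8, 16]))  # -> 32 time steps
```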
TCNs offer several advantages over RNN-based models: training parallelizes across time steps because convolutions carry no sequential state; gradients flow through the network depth rather than the sequence length, making them more stable; and the receptive field can be controlled explicitly through the kernel size and dilation schedule.
Comparative studies have shown that TCNs often outperform LSTMs on time series forecasting tasks, particularly on longer sequences.
WaveNet, introduced by van den Oord et al. at DeepMind in 2016, is a fully probabilistic generative model originally designed for raw audio synthesis. It uses stacks of dilated causal convolutions to model the conditional probability of each value given all previous values. The dilated convolution structure allows the receptive field to grow exponentially with depth, enabling the network to access hundreds or thousands of past time steps without a corresponding explosion in parameters.
Researchers subsequently adapted WaveNet for general time series forecasting. Borovykh et al. (2017) demonstrated that WaveNet-style architectures could produce competitive results on financial and other time series tasks. The key insight behind WaveNet's success is its ability to model complex, nonlinear conditional distributions while maintaining tractable training through parallelized convolutions.
N-BEATS (Neural Basis Expansion Analysis for Time Series), published by Oreshkin et al. at ICLR 2020, takes a different architectural approach by using a deep stack of fully connected layers organized into blocks and stacks rather than recurrent or convolutional layers. Each block produces a partial forecast (forward prediction) and a partial backcast (reconstruction of the input); the backcast residual is passed to the next block, allowing the model to iteratively refine its representation. An interpretable variant constrains the basis functions to produce explicit trend and seasonality decompositions. N-BEATS improved forecast accuracy by 11% over a statistical benchmark and by 3% over the ES-RNN winner of the M4 competition.
N-HiTS (Neural Hierarchical Interpolation for Time Series), introduced by Challu et al. in 2023, extends N-BEATS with multi-rate signal sampling and hierarchical interpolation. Each block operates at a different temporal resolution, allowing the model to efficiently capture patterns at multiple scales. N-HiTS achieved comparable or superior accuracy to N-BEATS while reducing computation by a factor of 5 to 50 on long-horizon benchmarks.
The transformer architecture, originally introduced by Vaswani et al. in 2017 for natural language processing, has been adapted for time series forecasting with several important modifications.
Informer, published by Zhou et al. at AAAI 2021 (where it received the Best Paper Award), addresses the limitations of applying standard Transformers to long sequence time series forecasting (LSTF). The key innovation is the ProbSparse self-attention mechanism, which achieves O(L log L) time complexity and memory usage instead of the standard O(L²). The model also introduces a self-attention distilling operation that halves the cascading layer input, and a generative-style decoder that predicts the entire output sequence in a single forward pass rather than step by step. Experiments on four large-scale datasets demonstrated that Informer significantly outperformed existing methods on the LSTF problem.
Autoformer, published at NeurIPS 2021, introduced a decomposition architecture that separates trend and seasonal components within the Transformer framework, along with an auto-correlation mechanism that replaces traditional self-attention. FEDformer (Frequency Enhanced Decomposed Transformer) applies attention in the frequency domain using Fourier and wavelet transforms, which can be more efficient for capturing periodic patterns.
PatchTST (Patch Time Series Transformer), published at ICLR 2023 by Nie et al., introduces two key design choices: (1) segmenting the time series into subseries-level patches that serve as input tokens to the Transformer, and (2) channel-independence, where each variable in a multivariate time series is processed independently. The patching approach provides three benefits: local semantic information is retained in each embedding, computation and memory usage of attention maps are quadratically reduced, and the model can attend to a longer history. Compared to the best previous Transformer-based results, PatchTST achieved a 21.0% reduction in MSE and 16.7% reduction in MAE on standard benchmarks including Weather, Traffic, and Electricity datasets.
| Model | Year | Venue | Key innovation | Complexity |
|---|---|---|---|---|
| Informer | 2021 | AAAI (Best Paper) | ProbSparse attention; generative decoder | O(L log L) |
| Autoformer | 2021 | NeurIPS | Decomposition architecture; auto-correlation | O(L log L) |
| FEDformer | 2022 | ICML | Frequency-domain attention (Fourier/wavelet) | O(L) |
| PatchTST | 2023 | ICLR | Patch-based tokenization; channel-independence | O((L/P)²) where P is patch size |
Inspired by the success of large language models in natural language processing, researchers have developed foundation models for time series that are pretrained on large corpora of time series data and can perform forecasting, classification, and anomaly detection with minimal or no task-specific training.
TimeGPT, developed by Nixtla, was one of the first foundation models built specifically for time series forecasting. It uses an encoder-decoder Transformer architecture (not based on any existing LLM) trained on the largest publicly available collection of time series data, encompassing over 100 billion data points. The training set includes data from finance, economics, demographics, healthcare, weather, IoT sensor data, energy, web traffic, sales, transport, and banking. TimeGPT supports both zero-shot forecasting (where the model generates predictions on unseen data without any fine-tuning) and anomaly detection. It is accessed primarily through Nixtla's API.
Chronos, developed by Amazon Science, is a framework that repurposes pretrained language models for time series forecasting. The original Chronos models are based on the T5 family of language models and treat time series values as tokens by quantizing them into discrete bins. This approach allows the model to leverage the sequence modeling capabilities of language models for forecasting.
Chronos-2, released in October 2025, extends the framework with zero-shot support for univariate, multivariate, and covariate-informed forecasting. It delivers state-of-the-art zero-shot forecasting performance that consistently beats tuned statistical models out of the box, processing over 300 forecasts per second on a single GPU. Chronos has accumulated millions of downloads on Hugging Face and integrates natively with AWS tools like SageMaker and AutoGluon.
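A zero-shot usage sketch based on the chronos-forecasting package's published interface; the model name, context values, and quantile levels are illustrative, and details may vary across package versions:

```python
import numpy as np
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

pipeline = ChronosPipeline.from_pretrained("amazon/chronos-t5-small")

# Zero-shot: no fitting step, just a context window of past values.
context = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0, 136.0, 119.0])
forecast = pipeline.predict(context, prediction_length=12, num_samples=20)

# Sample-based probabilistic forecast: take quantiles across samples.
low, median, high = np.quantile(forecast[0].numpy(), [0.1, 0.5, 0.9], axis=0)
```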
TimesFM (Time Series Foundation Model), developed by Google Research and accepted at ICML 2024, is a decoder-only foundation model with 200 million parameters pretrained on a corpus of 100 billion real-world time points. The architecture treats patches (groups of contiguous time points) as tokens. A multilayer perceptron block with residual connections converts each patch into a token embedding, which is then processed through 20 stacked Transformer layers with causal multi-head self-attention and feed-forward layers. The model uses an input patch length of 32 and an output patch length of 128. TimesFM 2.5 is available on Hugging Face and integrated into Google Cloud BigQuery for enterprise use.
Lag-Llama, published in 2024, is the first open-source foundation model for univariate probabilistic time series forecasting. Its architecture is a decoder-only transformer that uses lags as covariates rather than raw sequential values. Instead of processing the time series as a continuous sequence, Lag-Llama feeds lagged values at predefined intervals (for example, lags at 1, 2, 3, 7, 14, 28 steps) into the model as input features. The model was pretrained on a large corpus of diverse time series data spanning multiple domains and produces probabilistic forecasts (full prediction distributions rather than point estimates). Lag-Llama demonstrates strong zero-shot generalization, and when fine-tuned on small fractions of unseen datasets, it achieves state-of-the-art performance compared to prior deep learning approaches.
Moirai, developed by Salesforce Research, is a universal forecasting foundation model that can handle time series of varying frequencies and prediction lengths without task-specific fine-tuning. It uses a mixture distribution with four different output distributions, which allows it to produce more flexible prediction intervals than models that assume a single distribution family. Moirai was released alongside LOTSA (Large-scale Open Time Series Archive), an open collection of time series data containing 27 billion data points. Moirai-2 further improved multivariate forecasting capabilities.
Tiny Time Mixers (TTM), developed by IBM Research and accepted at NeurIPS 2024, takes a different approach by prioritizing efficiency. TTM is based on the lightweight TSMixer architecture that uses MLP-Mixer blocks interleaved with gated attention as alternatives to the quadratic self-attention blocks in Transformers. Starting from just 1 million parameters (compared to hundreds of millions in other foundation models), TTM incorporates adaptive patching, diverse resolution sampling, and resolution prefix tuning to handle datasets at varied temporal resolutions with minimal model capacity. TTM outperforms existing benchmarks in zero/few-shot forecasting by 4-40% while being lightweight enough to run on CPU-only machines, making it practical for resource-constrained environments.
MOMENT, developed by researchers at Carnegie Mellon University and the University of Pennsylvania and accepted at ICML 2024, is a family of open-source foundation models for general-purpose time series analysis. The architecture is based on a 385-million-parameter T5 model pretrained using a masked time series prediction task (similar to the masked language modeling approach used by BERT) on the Time Series Pile, a diverse collection of public time series data.
Unlike models focused solely on forecasting, MOMENT supports multiple downstream tasks: forecasting, classification, anomaly detection, and imputation. It can be adapted to specific tasks through either full fine-tuning or linear probing. MOMENT comes in three sizes (Small, Base, Large) and has demonstrated strong performance across all four task types.
| Model | Developer | Year | Architecture | Parameters | Training data | Key capability |
|---|---|---|---|---|---|---|
| TimeGPT | Nixtla | 2023 | Encoder-decoder Transformer | Undisclosed | 100B+ data points | Zero-shot forecasting; anomaly detection |
| Chronos / Chronos-2 | Amazon | 2024/2025 | T5-based (tokenized time series) | Multiple sizes | Public + synthetic data | Univariate, multivariate, covariate forecasting |
| TimesFM | Google | 2024 | Decoder-only Transformer | 200M | 100B time points | Patch-based zero-shot forecasting |
| Lag-Llama | Rasul et al. | 2024 | Decoder-only Transformer (lag covariates) | Multiple sizes | Multi-domain corpus | Probabilistic zero-shot forecasting |
| Moirai / Moirai-2 | Salesforce | 2024 | Universal Transformer | Multiple sizes | 27B data points (LOTSA) | Flexible distribution outputs; any frequency |
| TTM | IBM Research | 2024 | MLP-Mixer with gated attention | 1M+ | Public datasets | Efficient zero/few-shot; CPU-friendly |
| MOMENT | CMU / UPenn | 2024 | T5-based (masked prediction) | 385M | Time Series Pile | Forecasting, classification, anomaly detection, imputation |
Many real-world systems produce multiple interrelated time series simultaneously. Multivariate time series analysis studies the joint behavior of these variables, capturing cross-variable dependencies that univariate methods ignore.
Vector Autoregression (VAR) extends the univariate AR model to multiple variables. In a VAR model, each variable is modeled as a linear function of its own past values and the past values of every other variable in the system. A VAR(p) model with k variables has k equations, each containing p lags of all k variables. VAR is widely used in econometrics for studying the dynamic relationships among macroeconomic indicators such as GDP, inflation, and interest rates.
Granger causality is a statistical test used to determine whether one time series is useful for forecasting another. Variable X is said to Granger-cause variable Y if past values of X provide statistically significant information about future values of Y beyond what past values of Y alone provide. It is important to note that Granger causality measures predictive ability, not true causal influence. The test is commonly applied within the VAR framework and requires both series to be stationary.
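A sketch of both ideas with statsmodels, on a synthetic pair of stationary series in which x drives y with a one-step delay; the variable names and lag orders are illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic stationary system: y depends on its own past and lagged x.
rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal(scale=0.5)
data = pd.DataFrame({"y": y, "x": x})

# Fit a VAR with the lag order chosen by AIC.
results = VAR(data).fit(maxlags=8, ic="aic")
print(results.summary())

# Forecast 5 steps ahead from the last k_ar observations.
print(results.forecast(data.values[-results.k_ar:], steps=5))

# Granger test: does the second column (x) help predict the first (y)?
gc = grangercausalitytests(data[["y", "x"]], maxlag=4)  # reports F-test p-values per lag
```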
When two or more non-stationary time series share a common stochastic trend, they are said to be cointegrated. Cointegrated series drift together over time, and their linear combination is stationary even though the individual series are not. The Vector Error Correction Model (VECM) is a restricted VAR that incorporates cointegration relationships, allowing it to capture both short-term dynamics and long-term equilibrium among the variables. The Johansen test is the standard method for detecting cointegration in multivariate systems.
Deep learning approaches to multivariate time series include channel-dependent models (which explicitly model inter-variable relationships through shared attention or graph neural networks) and channel-independent models (which process each variable separately and rely on shared parameters to capture commonalities). PatchTST demonstrated that the channel-independent approach can be surprisingly effective, while models like Crossformer and iTransformer are designed to capture cross-variable dependencies explicitly.
Standard k-fold cross-validation is inappropriate for time series data because it randomly shuffles observations, breaking the temporal order and allowing the model to train on future data when predicting past events (a form of data leakage). Time series cross-validation techniques respect chronological order.
In expanding window (also called cumulative or anchored) cross-validation, the training set starts from the beginning of the series and grows with each fold. Fold 1 trains on observations 1 through T and tests on T+1 through T+h; fold 2 trains on observations 1 through T+h and tests on T+h+1 through T+2h; and so on. This approach uses all available historical data at each step and is appropriate when long-term patterns are important and the data-generating process is relatively stable.
In sliding window (also called rolling window) cross-validation, the training set has a fixed size and moves forward with each fold. Fold 1 trains on observations 1 through T; fold 2 trains on observations h+1 through T+h; and so forth. By discarding older observations, the sliding window adapts more quickly to changes in the data-generating process (regime shifts, concept drift). This approach is preferred when recent data is more informative than distant history.
Walk-forward validation is the most production-realistic approach. At each step, the model is retrained (or updated) on all data up to the current point, makes a forecast for the next h steps, and then the window advances. This mirrors the actual deployment scenario where the model is periodically retrained as new data arrives. Walk-forward validation also forces practitioners to test their entire pipeline (feature generation, scaling, lag construction) incrementally, revealing bugs that would surface in production.
| Strategy | Training set | Advantages | Best when |
|---|---|---|---|
| Expanding window | Grows from fixed start | Uses all history; stable estimates | Long-term patterns dominate; stable process |
| Sliding window | Fixed size; shifts forward | Adapts to regime changes; equal weighting of folds | Recent data more relevant; non-stationary process |
| Walk-forward | Retrain at each step | Most realistic; tests full pipeline | Production deployment; model retraining is feasible |
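Scikit-learn's TimeSeriesSplit implements the expanding window by default and approximates the sliding window when max_train_size is set; a small sketch on a stand-in series:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(20)  # stand-in series, already in chronological order

# Expanding window: training set grows from the start of the series.
expanding = TimeSeriesSplit(n_splits=4)
# Sliding window: cap the training size so the window rolls forward.
sliding = TimeSeriesSplit(n_splits=4, max_train_size=8)

for name, splitter in [("expanding", expanding), ("sliding", sliding)]:
    print(name)
    for train_idx, test_idx in splitter.split(y):
        print(f"  train {train_idx[0]}-{train_idx[-1]}, test {test_idx[0]}-{test_idx[-1]}")
```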
Time series anomaly detection identifies observations or subsequences that deviate significantly from expected behavior. Anomalies can be point anomalies (individual outliers), contextual anomalies (values that are unusual in a specific temporal context), or collective anomalies (subsequences that are abnormal as a group).
Anomaly detection is critical in cybersecurity (detecting intrusions), manufacturing (identifying equipment faults), healthcare (monitoring patient vital signs for dangerous changes), and finance (flagging fraudulent transactions).
Classical approaches include control charts (such as Shewhart charts and CUSUM), z-score thresholds, and the Generalized Extreme Studentized Deviate (GESD) test. These methods are simple, interpretable, and computationally lightweight, but they assume the underlying process is stationary or follows a known distribution.
Isolation Forest is an unsupervised algorithm that detects anomalies by randomly partitioning the feature space; anomalies, being few and distinct, require fewer partitions to isolate. One-Class SVM learns a boundary around normal data in feature space and flags points outside that boundary. Local Outlier Factor (LOF) measures how isolated a data point is relative to its neighbors. These methods work well on multivariate feature vectors derived from time series windows but do not directly model temporal dependencies.
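A sketch of Isolation Forest applied to windowed time series features with scikit-learn; the window length, contamination rate, and injected anomaly are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
series = np.sin(np.linspace(0, 20 * np.pi, 1000)) + rng.normal(scale=0.1, size=1000)
series[500] += 3.0  # inject a point anomaly

# Turn the series into overlapping windows so the detector sees local shape.
window = 16
X = np.lib.stride_tricks.sliding_window_view(series, window)

clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
scores = clf.decision_function(X)      # lower = more anomalous
flags = clf.predict(X)                 # -1 marks anomalous windows
print(np.where(flags == -1)[0][:10])   # window start indices near index 500
```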
Autoencoder-based approaches train a neural network to reconstruct normal time series patterns; anomalies are identified as inputs with high reconstruction error. LSTM autoencoders are particularly popular because they capture temporal dependencies in the encoding. Variational autoencoders (VAEs) add a probabilistic layer that can quantify uncertainty. Generative adversarial networks (GANs) have also been applied: the generator learns the distribution of normal data, and samples that deviate from the generator's output are flagged as anomalies. More recent hybrid frameworks combine Transformer-based autoencoders with Isolation Forest and XGBoost to handle complex spatiotemporal dependencies in IoT and industrial settings.
Real-world time series frequently contain missing values due to sensor failures, network outages, data collection errors, or intentional sparse sampling. Handling these gaps appropriately is essential for accurate modeling.
Forward fill (carrying the last observed value forward) and backward fill are the simplest strategies. Mean and median imputation replace missing values with a constant but destroy temporal structure. These methods are fast and serve as baselines but can introduce bias, particularly for long gaps.
Linear interpolation estimates missing values by drawing a straight line between neighboring observed points. It is computationally cheap and performs surprisingly well across diverse datasets; a 2025 benchmarking study found that linear interpolation outperformed more complex methods across multiple missingness mechanisms and percentages. Spline interpolation fits smooth curves through the data and captures nonlinear patterns more accurately but is more sensitive to noise. Cubic splines and Akima splines are common choices.
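The simple strategies map directly onto pandas one-liners; a sketch on a toy series with two gaps (the spline variant requires SciPy):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=10, freq="D")
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0, np.nan, 8.0, 9.0, 10.0], index=idx)

print(s.ffill())                                # forward fill: repeat last observation
print(s.interpolate(method="linear"))           # straight line between neighbors
print(s.interpolate(method="spline", order=3))  # cubic spline (needs scipy)
```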
More sophisticated approaches use the time series structure itself. Kalman smoothing estimates missing values using a state space model. Seasonal decomposition methods impute by reconstructing the missing segment from estimated trend and seasonal components. Deep learning models such as MOMENT can reconstruct masked segments based on learned temporal patterns, effectively performing imputation as a byproduct of their pretraining objective.
Some time series are inherently irregular, with observations arriving at uneven intervals (for example, electronic health records or event logs). Approaches to irregular time series include resampling to a regular grid (with interpolation to fill gaps), time-aware neural architectures (such as Time-LSTM, which incorporates elapsed time as an input gate modifier), and neural ordinary differential equations (Neural ODEs) that model continuous-time dynamics.
Time series analysis encompasses several distinct task types, each with different objectives and evaluation criteria.
Forecasting is the most common time series task. It involves predicting future values of one or more variables based on historical observations. Forecasting can be univariate (predicting a single variable) or multivariate (predicting multiple related variables simultaneously). It can also be single-step (predicting one time step ahead) or multi-step (predicting multiple time steps ahead). Applications include demand planning, financial prediction, weather forecasting, and capacity planning.
Time series classification assigns a categorical label to an entire time series or a segment of a time series. Rather than predicting future values, the goal is to identify which category a sequence belongs to. Examples include classifying heartbeat signals as normal or abnormal, identifying the type of activity from accelerometer data (walking, running, sitting), and categorizing industrial sensor readings by equipment state. Common approaches include distance-based methods (such as Dynamic Time Warping), shapelet-based methods, and deep learning classifiers.
Time series imputation fills in missing values within a time series. Missing data is common in real-world time series due to sensor failures, network outages, or data collection errors. Imputation methods range from simple interpolation to sophisticated models such as MOMENT that can reconstruct masked segments based on learned temporal patterns.
Selecting the right evaluation metric is essential for comparing forecasting methods and selecting models for production. Different metrics emphasize different aspects of forecast quality.
Mean Absolute Error (MAE) measures the average absolute difference between predicted and actual values. It is expressed in the same units as the data, making it easy to interpret. MAE is robust to outliers because it does not square the errors. Minimizing MAE produces forecasts of the median. MAE is a good default metric when a simple, robust summary of average error is needed.
Root Mean Squared Error (RMSE) is similar to MAE but squares each error before averaging and then takes the square root. This penalizes large errors more heavily, making RMSE a better choice when large forecast errors are especially costly (for example in financial risk or safety-critical applications). Minimizing RMSE produces forecasts of the mean.
Mean Squared Error (MSE) is RMSE without the final square root. It is commonly used as a training loss function for deep learning models because it is differentiable and convex.
Mean Absolute Percentage Error (MAPE) converts errors into percentages of the actual values, making it scale-independent. This is useful for comparing accuracy across datasets measured in different units. However, MAPE has two well-known drawbacks: it is undefined when actual values are zero, and it is asymmetric, penalizing over-forecasts more heavily than under-forecasts of the same magnitude.
Symmetric Mean Absolute Percentage Error (sMAPE) addresses MAPE's asymmetry by using the average of actual and forecast values in the denominator. This makes it more balanced between over- and under-forecasts. sMAPE was used as a primary metric in the M3 and M4 forecasting competitions.
Mean Absolute Scaled Error (MASE) divides the MAE by the MAE of a naive one-step-ahead forecast (which simply predicts the previous value). A MASE less than 1 means the model outperforms the naive baseline. MASE is well-defined for all series (including those with zero values) and is the recommended metric in the Monash Time Series Forecasting Archive.
| Metric | Formula summary | Scale-dependent | Handles zeros | Symmetric | Common use |
|---|---|---|---|---|---|
| MAE | Mean of \|actual - predicted\| | Yes | Yes | Yes | Robust default summary of error |
| RMSE | Square root of mean squared error | Yes | Yes | Yes | When large errors are costly |
| MAPE | Mean of \|error / actual\| × 100% | No | No (undefined at zero) | No | Comparing across scales and units |
| sMAPE | Mean of 2\|error\| / (\|actual\| + \|forecast\|) × 100% | No | Mostly (undefined when actual and forecast are both zero) | More balanced | M3 and M4 competitions |
| MASE | MAE / MAE of naive forecast | No | Yes | Yes | Monash archive; academic benchmarks |
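For reference, each of these metrics is a few lines of NumPy; the MASE helper below uses the in-sample MAE of a naive lag-m forecast as its scale, with m = 1 for non-seasonal data:

```python
import numpy as np

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    return np.mean(np.abs((y - yhat) / y)) * 100  # undefined when any y == 0

def smape(y, yhat):
    return np.mean(2 * np.abs(yhat - y) / (np.abs(y) + np.abs(yhat))) * 100

def mase(y, yhat, y_train, m=1):
    # Scale by the in-sample MAE of the (seasonal-)naive forecast at lag m.
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y - yhat)) / scale

y_train = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0])
y_true = np.array([13.0, 15.0])
y_pred = np.array([14.0, 14.0])
print(mae(y_true, y_pred), rmse(y_true, y_pred), mase(y_true, y_pred, y_train))
```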
Standardized benchmark datasets are essential for comparing time series methods objectively. Several datasets and repositories have become community standards.
The M competitions, organized by Spyros Makridakis and the International Institute of Forecasters, are the most influential forecasting competitions in the field. The M4 competition (2018) included 100,000 time series across six frequencies (yearly, quarterly, monthly, weekly, daily, and hourly). The winning method, ES-RNN by Slawek Smyl, combined exponential smoothing with recurrent neural networks, demonstrating the value of hybrid statistical-deep learning approaches. The M5 competition (2020) focused on hierarchical retail sales forecasting using Walmart data, and gradient boosting methods (particularly LightGBM) dominated the leaderboard.
The Monash Time Series Forecasting Archive, compiled by researchers at Monash University and published at NeurIPS 2021, is the first comprehensive benchmark repository for global and multivariate time series forecasting. It contains 30 datasets (with 58 dataset variations when accounting for different frequencies and missing value treatments) spanning domains such as tourism, electricity, traffic, weather, and healthcare. The archive provides baseline results for standard forecasting methods across ten error metrics, using MASE as the primary evaluation measure.
The ETT dataset, introduced alongside the Informer paper, has become one of the most widely used benchmarks for long-term time series forecasting. It consists of four sub-datasets: two at hourly resolution (ETTh1, ETTh2) and two at 15-minute resolution (ETTm1, ETTm2). Each dataset contains oil temperature and six power load features from electricity transformers, collected from July 2016 to July 2018 (roughly 17,500 observations per hourly dataset and 70,000 per 15-minute dataset). The data exhibits a mix of short-term periodic patterns, long-term periodic patterns, long-term trends, and irregular patterns, making it a challenging benchmark for evaluating model performance on real-world data.
Several other datasets are commonly used in time series research, including the Electricity, Traffic, and Weather datasets featured in long-horizon forecasting benchmarks, the Exchange Rate dataset, and the UCR and UEA archives for time series classification.
Time series analysis is applied across nearly every industry. The following sections describe major application domains.
Financial time series analysis encompasses stock price prediction, algorithmic trading, risk management, fraud detection, and portfolio optimization. Financial data is notoriously noisy, non-stationary, and influenced by external events (earnings reports, geopolitical developments, regulatory changes). Forecasting methods in finance must account for volatility clustering, where periods of high variance tend to follow other high-variance periods. Specialized models such as GARCH (Generalized Autoregressive Conditional Heteroskedasticity) are used to model this volatility. Deep learning methods including LSTMs and Transformers have been applied to predict stock returns, though consistently outperforming the market remains extremely difficult, an observation consistent with the efficient market hypothesis.
Weather forecasting is one of the oldest and most important applications of time series analysis. Numerical weather prediction (NWP) models solve physical equations governing atmospheric dynamics, but machine learning approaches have increasingly supplemented or replaced traditional methods. Google DeepMind's GraphCast and Huawei's Pangu-Weather have demonstrated that deep learning models can produce competitive medium-range weather forecasts at a fraction of the computational cost of NWP models. Time series methods are also used for climate analysis, including temperature trend detection, precipitation forecasting, and extreme weather event prediction.
Energy applications include electricity demand forecasting, renewable energy generation prediction (solar and wind), grid load balancing, and energy price forecasting. Accurate demand forecasts help utilities optimize power generation and avoid blackouts. Solar and wind power production depends on weather conditions, making it inherently variable and difficult to predict. Gradient boosting methods and LSTMs are widely used for short-term load forecasting, while foundation models like Chronos are being adopted for longer-horizon energy forecasting.
Retail and e-commerce companies use time series forecasting to predict product demand, optimize inventory levels, plan promotions, and manage supply chains. The M5 competition highlighted the effectiveness of gradient boosting methods for hierarchical demand forecasting at scale. Companies such as Amazon and Walmart process millions of individual time series for demand planning, often using global models that learn shared patterns across products while accounting for product-specific features.
Healthcare applications of time series analysis include patient monitoring (detecting dangerous changes in vital signs), epidemic forecasting (predicting disease spread), electrocardiogram (ECG) analysis, sleep stage classification, and hospital resource planning. Anomaly detection on physiological signals (heart rate, blood pressure, oxygen saturation) enables early intervention for deteriorating patients. During the COVID-19 pandemic, time series models were extensively used to forecast case counts, hospitalizations, and mortality, though the accuracy of these forecasts varied widely depending on the method and data quality.
IoT sensors generate massive volumes of time series data from manufacturing equipment, vehicles, buildings, and infrastructure. Predictive maintenance uses time series analysis to detect early signs of equipment failure before it occurs, reducing downtime and repair costs. Industrial anomaly detection identifies unusual sensor patterns that may indicate quality issues or safety hazards. The high dimensionality and volume of IoT data have driven interest in scalable methods such as foundation models and streaming anomaly detection algorithms.
The following table summarizes the evolution of time series methods across different eras.
| Era | Period | Representative methods | Key characteristics |
|---|---|---|---|
| Classical Statistics | 1950s-1990s | ARIMA, Exponential Smoothing (ETS), GARCH, State Space Models | Explicit statistical assumptions; interpretable; manual model selection |
| Early Machine Learning | 2000s-2010s | Support Vector Machines, Random Forests, Feature Engineering Pipelines | Supervised learning framework; lag features; cross-validation |
| Gradient Boosting | 2014-present | XGBoost, LightGBM, CatBoost | Ensemble of decision trees; strong tabular performance; competition winners |
| Deep Learning (Recurrent) | 2015-2020 | RNN, LSTM, GRU, DeepAR | Sequence modeling; automatic feature learning; vanishing gradient issues |
| Deep Learning (Convolutional) | 2016-present | TCN, WaveNet, TimesNet | Parallel training; dilated convolutions; stable gradients |
| Deep Learning (MLP) | 2020-present | N-BEATS, N-HiTS | Fully connected stacks; residual learning; interpretable decompositions |
| Transformers | 2020-present | Informer, Autoformer, FEDformer, PatchTST | Attention mechanisms; long-range dependencies; patch-based tokenization |
| Foundation Models | 2023-present | TimeGPT, Chronos, TimesFM, Moirai, MOMENT, Lag-Llama, TTM | Pretrained on billions of data points; zero-shot and few-shot capability |
| Hybrid/Decomposable | 2017-present | Prophet, ES-RNN, N-BEATS, N-HiTS | Combine statistical decomposition with neural networks |
A variety of open-source tools support time series analysis across different programming languages and frameworks.