See also: Machine learning terms
Time series analysis is a statistical and computational discipline focused on analyzing data points collected or recorded in chronological order. It plays a critical role in fields ranging from finance and economics to meteorology, healthcare, and industrial operations. In machine learning, time series analysis is used to build predictive models that forecast future events based on historical data, detect anomalies in streaming signals, classify temporal patterns, and impute missing observations. The primary goal is to extract meaningful insights from time-ordered data and utilize those insights to make better-informed decisions.
Over the past several decades, the toolkit for time series analysis has expanded considerably. Classical statistical methods such as ARIMA and exponential smoothing gave way to machine learning approaches built on gradient boosting and random forests, which were in turn complemented by deep learning architectures including LSTMs, temporal convolutional networks, and transformers. Most recently, foundation models pretrained on billions of time points have introduced zero-shot and few-shot forecasting capabilities that challenge traditional model-fitting workflows.
Imagine you are tracking how tall your sunflower grows every week. After a few months, you have a list of numbers: 2 inches, 5 inches, 9 inches, 12 inches, and so on. Time series analysis is like looking at that list to figure out a pattern. You might notice the sunflower grows about 3 inches per week (the trend), it grows faster in the warm summer weeks and slower as the weather cools each year (the seasonality), and some weeks it barely grows or shoots up for no obvious reason (the noise).
Once you understand those patterns, you can make a guess about how tall the sunflower will be next week or next month. That guess is called a forecast. Computers do the same thing with stock prices, weather temperatures, electricity usage, and almost anything else that changes over time. They look at the history, find the patterns, and then predict what comes next.
A time series is a sequence of data points recorded at regular intervals over time. The data points typically represent a single variable observed at successive time instances. Time series data can be found in many real-world applications, such as stock prices, weather data, and sensor readings.
Time series data is characterized by four key components:
Trend: A long-term increase or decrease in the data over time. For example, the global average temperature has exhibited an upward trend over the past century.
Seasonality: Regular patterns or fluctuations in the data that recur at specific time periods (for example daily, monthly, or yearly). Retail sales typically spike during holiday seasons and dip in January.
Cyclicality: Fluctuations in the data that are not periodic but occur over irregular intervals. Business cycles of expansion and recession are a classic example; they repeat, but the length of each cycle varies.
Noise: Random variations in the data that are not attributable to any specific trend, seasonality, or cyclicality. Noise is the residual variation left after accounting for all systematic components.
A time series is said to be stationary if its statistical properties (mean, variance, autocorrelation) do not change over time. Many classical forecasting methods assume stationarity, so practitioners often need to transform non-stationary data before modeling. Common transformations include:
Differencing: replacing each observation with its change from the previous observation (or from the observation one seasonal cycle earlier) to remove trend or seasonality.
Log and power transformations: applying a logarithm or a Box-Cox transformation to stabilize variance that grows with the level of the series.
Detrending: estimating and subtracting a trend component so that only the fluctuations around it remain.
The Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test are widely used to assess whether a time series is stationary. The ADF test has a null hypothesis of non-stationarity (a unit root), while the KPSS test has a null hypothesis of stationarity. Using both tests together provides a more robust assessment.
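The sketch below shows how the two tests are typically run together using the statsmodels library (its adfuller and kpss functions); the random-walk data is synthetic and purely illustrative:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))  # random walk: non-stationary by construction

adf_stat, adf_p, *_ = adfuller(series)                               # H0: unit root (non-stationary)
kpss_stat, kpss_p, *_ = kpss(series, regression="c", nlags="auto")   # H0: stationary

print(f"ADF p-value:  {adf_p:.3f}  (small p => reject non-stationarity)")
print(f"KPSS p-value: {kpss_p:.3f}  (small p => reject stationarity)")

diff = np.diff(series)  # first differencing often restores stationarity
```

Because the null hypotheses point in opposite directions, a series that fails to reject under the ADF test while rejecting under the KPSS test gives a consistent signal of non-stationarity, and differencing is the usual remedy.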
Several techniques are employed in time series analysis to extract useful information from the data. These techniques can be classified into two categories: time-domain methods and frequency-domain methods.
Time-domain methods analyze the data directly in the time domain and are primarily concerned with the temporal structure of the data. Some common time-domain methods include autocorrelation and partial autocorrelation analysis, cross-correlation analysis, and classical decomposition into trend, seasonal, and residual components.
Frequency-domain methods transform the time series data into the frequency domain using techniques such as the Fourier transform and analyze the data in terms of its frequency components. Examples of frequency-domain methods include spectral analysis via the periodogram, spectral density estimation, and wavelet analysis.
Classical statistical methods for time series forecasting were developed primarily between the 1950s and the 1980s and remain in widespread use today. These methods rely on explicit statistical assumptions about the data-generating process and often serve as strong baselines against which newer methods are compared.
ARIMA (Autoregressive Integrated Moving Average), formalized by Box and Jenkins in 1970, is one of the most established approaches to time series forecasting. The model has three components: the autoregressive (AR) part models the relationship between an observation and a number of lagged observations; the integrated (I) part uses differencing to make the series stationary; and the moving average (MA) part models the relationship between an observation and a residual error from a lagged moving average model. The model is specified by three hyperparameters (p, d, q), where p is the order of the AR term, d is the degree of differencing, and q is the order of the MA term.
SARIMA (Seasonal ARIMA) extends ARIMA by adding seasonal components. It introduces additional seasonal parameters (P, D, Q, m), where m is the number of time steps per seasonal cycle. SARIMA is effective for data with strong repeating seasonal patterns, such as monthly retail sales or quarterly earnings.
ARIMA models generally perform well for short-to-medium horizon forecasts on univariate series with linear patterns. However, practical implementation often requires manual tuning, stationarity checks, and diagnostic analysis.
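As a minimal sketch of the workflow, assuming statsmodels and a synthetic monthly series, an ARIMA(1,1,1) can be fit and used to forecast as follows; the order is illustrative and would normally be chosen through ACF/PACF inspection or information criteria:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly series: linear trend plus noise.
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
rng = np.random.default_rng(1)
y = pd.Series(0.5 * np.arange(48) + rng.normal(scale=2.0, size=48), index=idx)

# order=(p, d, q): one AR lag, first differencing, one MA lag.
# A seasonal (SARIMA) model would add seasonal_order=(P, D, Q, m).
model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.params)
print(model.forecast(steps=6))  # six months ahead
```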
Exponential smoothing methods produce forecasts by computing weighted averages of past observations, with the weights decaying exponentially as the observations get older. The ETS (Error, Trend, Seasonality) framework provides a unified approach to exponential smoothing. Each component can be modeled as additive (A), multiplicative (M), or absent (N), producing a taxonomy of 30 possible model variants.
Simple Exponential Smoothing (SES) handles series with no trend or seasonality. Holt's Linear Method adds a trend component, while Holt-Winters' Method adds both trend and seasonal components. ETS models are fast to fit, require minimal tuning, and often outperform more complex models on short horizons. ETS and ARIMA are the two most widely used approaches to time series forecasting in practice.
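A minimal Holt-Winters sketch with statsmodels, fitting the ETS(A,A,A) cell of the taxonomy above to a hypothetical monthly series with additive trend and yearly seasonality:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly series with trend and yearly seasonality.
idx = pd.date_range("2019-01-01", periods=60, freq="MS")
rng = np.random.default_rng(2)
y = pd.Series(
    10 + 0.3 * np.arange(60)                      # trend
    + 5 * np.sin(2 * np.pi * np.arange(60) / 12)  # seasonality
    + rng.normal(scale=1.0, size=60),
    index=idx,
)

# Additive trend and additive seasonality: Holt-Winters' Method.
fit = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
print(fit.forecast(12))  # one year ahead
```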
Prophet is an open-source forecasting tool developed by Facebook (now Meta) and released in 2017. It implements a decomposable additive regression model with three main components: a piecewise linear or logistic growth curve for trend, Fourier series for yearly and weekly seasonality, and a user-supplied list of holidays and special events. The model is expressed as:
y(t) = g(t) + s(t) + h(t) + e(t)
where g(t) represents the trend function, s(t) represents seasonal changes, h(t) captures holiday effects, and e(t) is the error term.
Prophet uses the Stan probabilistic programming language as its computational backend and is available in both Python and R. It is designed to handle missing data, shifts in trend, and outliers automatically. Prophet works best with daily or sub-daily data that has strong seasonal effects and several seasons of historical data. It has become popular for business forecasting because it requires minimal manual configuration while still producing reasonable results.
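A minimal usage sketch with the Python prophet package; the placeholder daily series is illustrative. Prophet expects a dataframe with two columns named ds (dates) and y (values):

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

# Placeholder daily data; replace with your own observations.
df = pd.DataFrame({
    "ds": pd.date_range("2022-01-01", periods=365, freq="D"),
    "y": range(365),
})

m = Prophet()  # yearly and weekly seasonality are configured automatically
m.fit(df)

future = m.make_future_dataframe(periods=30)  # extend 30 days beyond history
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```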
| Method | Year introduced | Key idea | Strengths | Limitations |
|---|---|---|---|---|
| ARIMA | 1970 | Autoregressive + differencing + moving average | Strong linear modeling; well-understood theory | Requires stationarity; manual tuning of (p,d,q) |
| SARIMA | 1970s | Seasonal extension of ARIMA | Captures seasonal patterns | More parameters to tune; limited to single seasonality |
| ETS | 1950s-2000s | Exponential weighting of past observations | Fast; minimal tuning; solid short-horizon performance | Cannot model complex nonlinear relationships |
| Prophet | 2017 | Decomposable additive model with trend + seasonality + holidays | Handles missing data and outliers; user-friendly | Less effective for short, irregular, or non-seasonal series |
Feature engineering is the process of transforming raw time series data into input variables that machine learning models can use effectively. Because tree-based models and many neural network architectures expect tabular or fixed-length inputs, thoughtful feature construction is often the single most important step in a time series machine learning pipeline.
Lag features (also called lagged variables) represent previous values of the target series. For example, a lag-1 feature is the value from one time step ago, a lag-7 feature is the value from seven steps ago, and so on. The choice of which lags to include is guided by the autocorrelation function (ACF) and partial autocorrelation function (PACF), which reveal the strength of correlation at each lag. For daily retail data, lags at 1, 7, 14, and 28 days are common choices because they capture both recent momentum and weekly patterns.
Rolling statistics calculate summary measures over a sliding window of recent observations. Common rolling features include the rolling mean, rolling standard deviation, rolling minimum and maximum, and rolling median.
Window sizes are typically chosen to match meaningful time spans. For hourly energy data, a 24-hour rolling mean captures the daily cycle, while a 168-hour (one-week) rolling mean captures the weekly cycle.
Date-based features encode the position within known cycles: hour of day, day of week, day of month, month of year, quarter, whether a day is a weekend or public holiday, and the number of days until or since a known event. These features allow tree-based models to learn seasonal patterns without requiring explicit seasonal decomposition.
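The following sketch combines the three feature families described above (lags, rolling statistics, and date-based features) into a single pandas function; the column name y and the specific lags and windows are illustrative choices for daily data:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Expects a DatetimeIndex and a target column 'y' (hypothetical layout)."""
    out = df.copy()
    # Lag features: recent momentum (1) plus weekly structure (7, 14, 28).
    for lag in (1, 7, 14, 28):
        out[f"lag_{lag}"] = out["y"].shift(lag)
    # Rolling statistics computed on shifted values so the current
    # observation never leaks into its own features.
    for window in (7, 28):
        roll = out["y"].shift(1).rolling(window)
        out[f"roll_mean_{window}"] = roll.mean()
        out[f"roll_std_{window}"] = roll.std()
    # Date-based features encoding position within known cycles.
    out["day_of_week"] = out.index.dayofweek
    out["month"] = out.index.month
    out["is_weekend"] = (out.index.dayofweek >= 5).astype(int)
    return out
```

Note the shift(1) before each rolling computation: without it, the window would include the current value and leak the target into its own features.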
Fourier features use sine and cosine functions at different frequencies to represent seasonal and cyclical patterns as continuous variables. For a seasonal period of length P, a pair of Fourier features at order k is defined as sin(2 * pi * k * t / P) and cos(2 * pi * k * t / P). Using multiple Fourier orders captures both broad and fine-grained seasonal shapes. Prophet uses this technique internally for its seasonality components, and practitioners frequently add Fourier features when using gradient boosting models for time series.
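A small sketch of Fourier feature construction with NumPy and pandas, following the sin/cos definition above; the period and order values are illustrative:

```python
import numpy as np
import pandas as pd

def fourier_features(index: pd.DatetimeIndex, period: float, orders: int) -> pd.DataFrame:
    """sin/cos pairs for a seasonal cycle of `period` time steps."""
    t = np.arange(len(index))
    cols = {}
    for k in range(1, orders + 1):
        cols[f"sin_{period}_{k}"] = np.sin(2 * np.pi * k * t / period)
        cols[f"cos_{period}_{k}"] = np.cos(2 * np.pi * k * t / period)
    return pd.DataFrame(cols, index=index)

# Example: weekly cycle in daily data, three Fourier orders.
idx = pd.date_range("2024-01-01", periods=90, freq="D")
X_fourier = fourier_features(idx, period=7, orders=3)
```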
First differences (the change from one time step to the next), percentage changes, and higher-order differences can serve as features that make it easier for models to learn from non-stationary data. These features capture momentum and acceleration in the series.
Starting in the 2010s, machine learning methods originally designed for tabular data were adapted for time series problems. These approaches typically reframe the forecasting problem as a supervised learning task, where lagged values and engineered features serve as input variables and future values serve as targets.
Gradient boosting builds an ensemble of weak learners (usually decision trees) sequentially, where each new tree corrects the errors of the previous ensemble. For time series applications, the data must first be transformed into a supervised learning format using lag features, rolling statistics (such as moving averages and rolling standard deviations), date-based features (day of week, month, quarter), and other domain-specific features.
XGBoost (Extreme Gradient Boosting), introduced by Tianqi Chen in 2016, is an efficient and regularized implementation of gradient boosting. It grows trees level-by-level (depth-wise) and has been a favorite among competition winners on platforms such as Kaggle. XGBoost is effective for time series forecasting when combined with careful feature engineering.
LightGBM (Light Gradient Boosting Machine), developed by Microsoft, grows trees leaf-wise rather than level-wise, always splitting the leaf with the highest error reduction. This approach often reaches the same accuracy with fewer splits and trains significantly faster on large datasets due to its histogram-based algorithm. LightGBM consumes less memory than XGBoost because it stores binned feature values rather than exact values. For datasets exceeding several million rows, LightGBM is generally the more practical choice.
CatBoost, developed by Yandex, provides native handling of categorical features and uses ordered boosting to reduce prediction shift. It is particularly useful for time series datasets with categorical covariates (such as product IDs or region codes).
An important limitation of all tree-based models is their inability to extrapolate beyond the range of target values observed during training. This becomes critical when forecasting series with trends, because the model cannot predict values higher or lower than anything it has seen in the training data. Practitioners address this by detrending the series before modeling or by combining tree-based models with explicit trend components.
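The sketch below illustrates the full supervised reframing on a hypothetical daily series: lag, rolling, and calendar features feed a LightGBM regressor, with a chronological (non-shuffled) train/test split. All names and hyperparameters are illustrative:

```python
import lightgbm as lgb  # pip install lightgbm
import numpy as np
import pandas as pd

# Hypothetical daily series with a weekly cycle.
idx = pd.date_range("2022-01-01", periods=730, freq="D")
rng = np.random.default_rng(3)
df = pd.DataFrame(
    {"y": 20 + 3 * np.sin(2 * np.pi * np.arange(730) / 7) + rng.normal(size=730)},
    index=idx,
)

# Reframe as supervised learning: lags, rolling mean, calendar features.
for lag in (1, 7, 14):
    df[f"lag_{lag}"] = df["y"].shift(lag)
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["day_of_week"] = df.index.dayofweek
df = df.dropna()

X, y = df.drop(columns="y"), df["y"]
split = int(len(df) * 0.8)  # chronological split: never shuffle time series

model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X.iloc[:split], y.iloc[:split])
pred = model.predict(X.iloc[split:])
```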
Random forests are ensemble methods that train multiple decision trees on random subsets of the data and aggregate their predictions. For time series forecasting, random forests are used with the same feature engineering approach as gradient boosting: lag features, rolling statistics, and calendar features. Random forests offer strong baseline performance, are resistant to overfitting, and provide feature importance rankings that help practitioners understand which lags and features drive predictions.
Compared to gradient boosting, random forests are easier to tune but typically achieve slightly lower accuracy on forecasting tasks. They share the same extrapolation limitation as other tree-based methods.
| Method | Developer | Tree growth | Training speed | Memory | Key advantage |
|---|---|---|---|---|---|
| XGBoost | Tianqi Chen (2016) | Level-wise (depth) | Moderate | Higher | Regularization; wide adoption |
| LightGBM | Microsoft | Leaf-wise | Fast | Lower | Histogram-based; scales to large data |
| CatBoost | Yandex | Symmetric trees | Moderate | Moderate | Native categorical feature handling |
| Random Forest | Breiman (2001) | Multiple random trees | Fast | Moderate | Robust baseline; feature importance |
Deep learning models have become increasingly prominent in time series analysis due to their ability to automatically learn complex temporal patterns without manual feature engineering.
Recurrent neural networks (RNNs) are a class of neural networks with loops that allow information to persist over time, making them well-suited for sequential data. At each time step, an RNN takes the current input and the hidden state from the previous time step to produce an output and an updated hidden state. However, vanilla RNNs suffer from the vanishing gradient problem, which makes it difficult for them to learn long-term dependencies.
Long short-term memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, address the vanishing gradient problem by incorporating a gating mechanism with three gates: an input gate, a forget gate, and an output gate. These gates control the flow of information through the network, allowing LSTMs to selectively remember or forget information over long sequences.
LSTMs have been applied extensively to time series forecasting, including stock price prediction, energy demand forecasting, and weather prediction. Variants such as the Gated Recurrent Unit (GRU), introduced by Cho et al. in 2014, simplify the gating mechanism while maintaining similar performance. Bidirectional LSTMs (BiLSTMs) process the sequence in both forward and reverse directions, which can improve performance on tasks where future context is available (such as classification and imputation, but not real-time forecasting).
Despite their effectiveness, LSTMs have limitations: they process sequences step by step (preventing parallelization during training), and their performance can degrade on very long sequences due to memory decay.
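A minimal PyTorch sketch of a one-step-ahead LSTM forecaster trained on sliding windows of a synthetic sine wave; the window length, hidden size, and training loop are illustrative choices rather than a reference implementation:

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Minimal one-step-ahead forecaster: sequence in, scalar out."""
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, seq_len, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # predict from the last hidden state

# Sliding windows of length 24, each predicting the next value.
series = torch.sin(torch.linspace(0, 50, 500))
X = series.unfold(0, 24, 1)[:-1].unsqueeze(-1)   # (n_windows, 24, 1)
y = series[24:].unsqueeze(-1)                    # next value after each window

model = LSTMForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
```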
DeepAR, developed by Amazon and published by Salinas et al. in 2020, is a probabilistic forecasting method built on autoregressive recurrent neural networks. Instead of producing a single point forecast, DeepAR outputs a full probability distribution at each time step by using the RNN to parameterize a chosen likelihood function (such as a Gaussian or negative binomial distribution). A key strength of DeepAR is its ability to train a single global model on many related time series simultaneously, learning shared patterns across series while still respecting series-specific behavior through covariates. In benchmarks, DeepAR demonstrated accuracy improvements of roughly 15% over state-of-the-art methods on several real-world datasets. Amazon integrated DeepAR into its Forecast service as DeepAR+.
Temporal Convolutional Networks (TCNs) are a specialized convolutional neural network architecture designed for sequence modeling. TCNs use causal convolutions (where the output at time t depends only on inputs at time t and earlier) combined with dilated convolutions that exponentially increase the receptive field without increasing the number of parameters.
A TCN consists of dilated, causal 1D convolutional layers with residual connections. The dilation factor doubles at each layer, allowing the network to capture long-range dependencies efficiently. For example, with kernel size 2 and dilation factors of 1, 2, 4, 8, and 16, a five-layer TCN has a receptive field of 32 time steps.
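The receptive-field arithmetic can be written out directly: for a stack of dilated causal convolutions with kernel size k and dilations d_1, ..., d_n, the receptive field is 1 + (k - 1) * (d_1 + ... + d_n). A tiny helper makes this concrete:

```python
def tcn_receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field of a stack of dilated causal convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

print(tcn_receptive_field(2, [1, 2, 4, 8, 16]))  # -> 32 time steps
```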
TCNs offer several advantages over RNN-based models: training parallelizes across time steps because convolutions carry no sequential state; gradients flow through the network depth rather than the sequence length, making them more stable; and the receptive field can be controlled explicitly through the kernel size and dilation schedule.
Comparative studies have shown that TCNs often outperform LSTMs on time series forecasting tasks, particularly on longer sequences.
WaveNet, introduced by van den Oord et al. at DeepMind in 2016, is a fully probabilistic generative model originally designed for raw audio synthesis. It uses stacks of dilated causal convolutions to model the conditional probability of each value given all previous values. The dilated convolution structure allows the receptive field to grow exponentially with depth, enabling the network to access hundreds or thousands of past time steps without a corresponding explosion in parameters.
Researchers subsequently adapted WaveNet for general time series forecasting. Borovykh et al. (2017) demonstrated that WaveNet-style architectures could produce competitive results on financial and other time series tasks. The key insight behind WaveNet's success is its ability to model complex, nonlinear conditional distributions while maintaining tractable training through parallelized convolutions.
N-BEATS (Neural Basis Expansion Analysis for Time Series), published by Oreshkin et al. at ICLR 2020, takes a different architectural approach by using a deep stack of fully connected layers organized into blocks and stacks rather than recurrent or convolutional layers. Each block produces a partial forecast (forward prediction) and a partial backcast (reconstruction of the input); the backcast residual is passed to the next block, allowing the model to iteratively refine its representation. An interpretable variant constrains the basis functions to produce explicit trend and seasonality decompositions. N-BEATS improved forecast accuracy by 11% over a statistical benchmark and by 3% over the ES-RNN winner of the M4 competition.
N-HiTS (Neural Hierarchical Interpolation for Time Series), introduced by Challu et al. in 2023, extends N-BEATS with multi-rate signal sampling and hierarchical interpolation. Each block operates at a different temporal resolution, allowing the model to efficiently capture patterns at multiple scales. N-HiTS achieved comparable or superior accuracy to N-BEATS while reducing computation by a factor of 5 to 50 on long-horizon benchmarks.
The transformer architecture, originally introduced by Vaswani et al. in 2017 for natural language processing, has been adapted for time series forecasting with several important modifications.
Informer, published by Zhou et al. at AAAI 2021 (where it received the Best Paper Award), addresses the limitations of applying standard Transformers to long sequence time series forecasting (LSTF). The key innovation is the ProbSparse self-attention mechanism, which achieves O(L log L) time complexity and memory usage instead of the standard O(L²). The model also introduces a self-attention distilling operation that halves the cascading layer input, and a generative-style decoder that predicts the entire output sequence in a single forward pass rather than step by step. Experiments on four large-scale datasets demonstrated that Informer significantly outperformed existing methods on the LSTF problem.
Autoformer, published at NeurIPS 2021, introduced a decomposition architecture that separates trend and seasonal components within the Transformer framework, along with an auto-correlation mechanism that replaces traditional self-attention. FEDformer (Frequency Enhanced Decomposed Transformer) applies attention in the frequency domain using Fourier and wavelet transforms, which can be more efficient for capturing periodic patterns.
PatchTST (Patch Time Series Transformer), published at ICLR 2023 by Nie et al., introduces two key design choices: (1) segmenting the time series into subseries-level patches that serve as input tokens to the Transformer, and (2) channel-independence, where each variable in a multivariate time series is processed independently. The patching approach provides three benefits: local semantic information is retained in each embedding, computation and memory usage of attention maps are quadratically reduced, and the model can attend to a longer history. Compared to the best previous Transformer-based results, PatchTST achieved a 21.0% reduction in MSE and 16.7% reduction in MAE on standard benchmarks including Weather, Traffic, and Electricity datasets.
| Model | Year | Venue | Key innovation | Complexity |
|---|---|---|---|---|
| Informer | 2021 | AAAI (Best Paper) | ProbSparse attention; generative decoder | O(L log L) |
| Autoformer | 2021 | NeurIPS | Decomposition architecture; auto-correlation | O(L log L) |
| FEDformer | 2022 | ICML | Frequency-domain attention (Fourier/wavelet) | O(L) |
| PatchTST | 2023 | ICLR | Patch-based tokenization; channel-independence | O((L/P)²) where P is patch size |
Inspired by the success of large language models in natural language processing, researchers have developed foundation models for time series that are pretrained on large corpora of time series data and can perform forecasting, classification, and anomaly detection with minimal or no task-specific training.
TimeGPT, developed by Nixtla, was one of the first foundation models built specifically for time series forecasting. It uses an encoder-decoder Transformer architecture (not based on any existing LLM) trained on the largest publicly available collection of time series data, encompassing over 100 billion data points. The training set includes data from finance, economics, demographics, healthcare, weather, IoT sensor data, energy, web traffic, sales, transport, and banking. TimeGPT supports both zero-shot forecasting (where the model generates predictions on unseen data without any fine-tuning) and anomaly detection. It is accessed primarily through Nixtla's API.
Chronos, developed by Amazon Science, is a framework that repurposes pretrained language models for time series forecasting. The original Chronos models are based on the T5 family of language models and treat time series values as tokens by quantizing them into discrete bins. This approach allows the model to leverage the sequence modeling capabilities of language models for forecasting.
Chronos-2, released in October 2025, extends the framework with zero-shot support for univariate, multivariate, and covariate-informed forecasting. It delivers state-of-the-art zero-shot forecasting performance that consistently beats tuned statistical models out of the box, processing over 300 forecasts per second on a single GPU. Chronos has accumulated millions of downloads on Hugging Face and integrates natively with AWS tools like SageMaker and AutoGluon.
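A zero-shot usage sketch based on the chronos-forecasting package's published interface; the model name, context values, and quantile levels are illustrative, and details may vary across package versions:

```python
import numpy as np
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

pipeline = ChronosPipeline.from_pretrained("amazon/chronos-t5-small")

# Zero-shot: no fitting step, just a context window of past values.
context = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0, 136.0, 119.0])
forecast = pipeline.predict(context, prediction_length=12, num_samples=20)

# Sample-based probabilistic forecast: take quantiles across samples.
low, median, high = np.quantile(forecast[0].numpy(), [0.1, 0.5, 0.9], axis=0)
```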
TimesFM (Time Series Foundation Model), developed by Google Research and accepted at ICML 2024, is a decoder-only foundation model with 200 million parameters pretrained on a corpus of 100 billion real-world time points. The architecture treats patches (groups of contiguous time points) as tokens. A multilayer perceptron block with residual connections converts each patch into a token embedding, which is then processed through 20 stacked Transformer layers with causal multi-head self-attention and feed-forward layers. The model uses an input patch length of 32 and an output patch length of 128. TimesFM 2.5 is available on Hugging Face and integrated into Google Cloud BigQuery for enterprise use.
Lag-Llama, published in 2024, is the first open-source foundation model for univariate probabilistic time series forecasting. Its architecture is a decoder-only transformer that uses lags as covariates rather than raw sequential values. Instead of processing the time series as a continuous sequence, Lag-Llama feeds lagged values at predefined intervals (for example, lags at 1, 2, 3, 7, 14, 28 steps) into the model as input features. The model was pretrained on a large corpus of diverse time series data spanning multiple domains and produces probabilistic forecasts (full prediction distributions rather than point estimates). Lag-Llama demonstrates strong zero-shot generalization, and when fine-tuned on small fractions of unseen datasets, it achieves state-of-the-art performance compared to prior deep learning approaches.
Moirai, developed by Salesforce Research, is a universal forecasting foundation model that can handle time series of varying frequencies and prediction lengths without task-specific fine-tuning. It uses a mixture distribution with four different output distributions, which allows it to produce more flexible prediction intervals than models that assume a single distribution family. Moirai was released alongside LOTSA (Large-scale Open Time Series Archive), an open collection of time series data containing 27 billion data points. Moirai-2 further improved multivariate forecasting capabilities.
Tiny Time Mixers (TTM), developed by IBM Research and accepted at NeurIPS 2024, takes a different approach by prioritizing efficiency. TTM is based on the lightweight TSMixer architecture that uses MLP-Mixer blocks interleaved with gated attention as alternatives to the quadratic self-attention blocks in Transformers. Starting from just 1 million parameters (compared to hundreds of millions in other foundation models), TTM incorporates adaptive patching, diverse resolution sampling, and resolution prefix tuning to handle datasets at varied temporal resolutions with minimal model capacity. TTM outperforms existing benchmarks in zero/few-shot forecasting by 4-40% while being lightweight enough to run on CPU-only machines, making it practical for resource-constrained environments.
MOMENT, developed by researchers at Carnegie Mellon University and the University of Pennsylvania and accepted at ICML 2024, is a family of open-source foundation models for general-purpose time series analysis. The architecture is based on a 385-million-parameter T5 model pretrained using a masked time series prediction task (similar to the masked language modeling approach used by BERT) on the Time Series Pile, a diverse collection of public time series data.
Unlike models focused solely on forecasting, MOMENT supports multiple downstream tasks: forecasting, classification, anomaly detection, and imputation. It can be adapted to specific tasks through either full fine-tuning or linear probing. MOMENT comes in three sizes (Small, Base, Large) and has demonstrated strong performance across all four task types.
| Model | Developer | Year | Architecture | Parameters | Training data | Key capability |
|---|---|---|---|---|---|---|
| TimeGPT | Nixtla | 2023 | Encoder-decoder Transformer | Undisclosed | 100B+ data points | Zero-shot forecasting; anomaly detection |
| Chronos / Chronos-2 | Amazon | 2024/2025 | T5-based (tokenized time series) | Multiple sizes | Public + synthetic data | Univariate, multivariate, covariate forecasting |
| TimesFM | Google | 2024 | Decoder-only Transformer | 200M | 100B time points | Patch-based zero-shot forecasting |
| Lag-Llama | Rasul et al. | 2024 | Decoder-only Transformer (lag covariates) | Multiple sizes | Multi-domain corpus | Probabilistic zero-shot forecasting |
| Moirai / Moirai-2 | Salesforce | 2024 | Universal Transformer | Multiple sizes | 27B data points (LOTSA) | Flexible distribution outputs; any frequency |
| TTM | IBM Research | 2024 | MLP-Mixer with gated attention | 1M+ | Public datasets | Efficient zero/few-shot; CPU-friendly |
| MOMENT | CMU / UPenn | 2024 | T5-based (masked prediction) | 385M | Time Series Pile | Forecasting, classification, anomaly detection, imputation |
Many real-world systems produce multiple interrelated time series simultaneously. Multivariate time series analysis studies the joint behavior of these variables, capturing cross-variable dependencies that univariate methods ignore.
Vector Autoregression (VAR) extends the univariate AR model to multiple variables. In a VAR model, each variable is modeled as a linear function of its own past values and the past values of every other variable in the system. A VAR(p) model with k variables has k equations, each containing p lags of all k variables. VAR is widely used in econometrics for studying the dynamic relationships among macroeconomic indicators such as GDP, inflation, and interest rates.
Granger causality is a statistical test used to determine whether one time series is useful for forecasting another. Variable X is said to Granger-cause variable Y if past values of X provide statistically significant information about future values of Y beyond what past values of Y alone provide. It is important to note that Granger causality measures predictive ability, not true causal influence. The test is commonly applied within the VAR framework and requires both series to be stationary.
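A sketch of both ideas with statsmodels, on a synthetic pair of stationary series in which x drives y with a one-step delay; the variable names and lag orders are illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic stationary system: y depends on its own past and lagged x.
rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal(scale=0.5)
data = pd.DataFrame({"y": y, "x": x})

# Fit a VAR with the lag order chosen by AIC.
results = VAR(data).fit(maxlags=8, ic="aic")
print(results.summary())

# Forecast 5 steps ahead from the last k_ar observations.
print(results.forecast(data.values[-results.k_ar:], steps=5))

# Granger test: does the second column (x) help predict the first (y)?
gc = grangercausalitytests(data[["y", "x"]], maxlag=4)  # reports F-test p-values per lag
```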
When two or more non-stationary time series share a common stochastic trend, they are said to be cointegrated. Cointegrated series drift together over time, and their linear combination is stationary even though the individual series are not. The Vector Error Correction Model (VECM) is a restricted VAR that incorporates cointegration relationships, allowing it to capture both short-term dynamics and long-term equilibrium among the variables. The Johansen test is the standard method for detecting cointegration in multivariate systems.
Deep learning approaches to multivariate time series include channel-dependent models (which explicitly model inter-variable relationships through shared attention or graph neural networks) and channel-independent models (which process each variable separately and rely on shared parameters to capture commonalities). PatchTST demonstrated that the channel-independent approach can be surprisingly effective, while models like Crossformer and iTransformer are designed to capture cross-variable dependencies explicitly.
Standard k-fold cross-validation is inappropriate for time series data because it randomly shuffles observations, breaking the temporal order and allowing the model to train on future data when predicting past events (a form of data leakage). Time series cross-validation techniques respect chronological order.
In expanding window (also called cumulative or anchored) cross-validation, the training set starts from the beginning of the series and grows with each fold. Fold 1 trains on observations 1 through T and tests on T+1 through T+h; fold 2 trains on observations 1 through T+h and tests on T+h+1 through T+2h; and so on. This approach uses all available historical data at each step and is appropriate when long-term patterns are important and the data-generating process is relatively stable.
In sliding window (also called rolling window) cross-validation, the training set has a fixed size and moves forward with each fold. Fold 1 trains on observations 1 through T; fold 2 trains on observations h+1 through T+h; and so forth. By discarding older observations, the sliding window adapts more quickly to changes in the data-generating process (regime shifts, concept drift). This approach is preferred when recent data is more informative than distant history.
Walk-forward validation is the most production-realistic approach. At each step, the model is retrained (or updated) on all data up to the current point, makes a forecast for the next h steps, and then the window advances. This mirrors the actual deployment scenario where the model is periodically retrained as new data arrives. Walk-forward validation also forces practitioners to test their entire pipeline (feature generation, scaling, lag construction) incrementally, revealing bugs that would surface in production.
| Strategy | Training set | Advantages | Best when |
|---|---|---|---|
| Expanding window | Grows from fixed start | Uses all history; stable estimates | Long-term patterns dominate; stable process |
| Sliding window | Fixed size; shifts forward | Adapts to regime changes; equal weighting of folds | Recent data more relevant; non-stationary process |
| Walk-forward | Retrain at each step | Most realistic; tests full pipeline | Production deployment; model retraining is feasible |
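Scikit-learn's TimeSeriesSplit implements the expanding window by default and approximates the sliding window when max_train_size is set; a small sketch on a stand-in series:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(20)  # stand-in series, already in chronological order

# Expanding window: training set grows from the start of the series.
expanding = TimeSeriesSplit(n_splits=4)
# Sliding window: cap the training size so the window rolls forward.
sliding = TimeSeriesSplit(n_splits=4, max_train_size=8)

for name, splitter in [("expanding", expanding), ("sliding", sliding)]:
    print(name)
    for train_idx, test_idx in splitter.split(y):
        print(f"  train {train_idx[0]}-{train_idx[-1]}, test {test_idx[0]}-{test_idx[-1]}")
```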
Time series anomaly detection identifies observations or subsequences that deviate significantly from expected behavior. Anomalies can be point anomalies (individual outliers), contextual anomalies (values that are unusual in a specific temporal context), or collective anomalies (subsequences that are abnormal as a group).
Anomaly detection is critical in cybersecurity (detecting intrusions), manufacturing (identifying equipment faults), healthcare (monitoring patient vital signs for dangerous changes), and finance (flagging fraudulent transactions).
Classical approaches include control charts (such as Shewhart charts and CUSUM), z-score thresholds, and the Generalized Extreme Studentized Deviate (GESD) test. These methods are simple, interpretable, and computationally lightweight, but they assume the underlying process is stationary or follows a known distribution.
Isolation Forest is an unsupervised algorithm that detects anomalies by randomly partitioning the feature space; anomalies, being few and distinct, require fewer partitions to isolate. One-Class SVM learns a boundary around normal data in feature space and flags points outside that boundary. Local Outlier Factor (LOF) measures how isolated a data point is relative to its neighbors. These methods work well on multivariate feature vectors derived from time series windows but do not directly model temporal dependencies.
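A sketch of Isolation Forest applied to windowed time series features with scikit-learn; the window length, contamination rate, and injected anomaly are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
series = np.sin(np.linspace(0, 20 * np.pi, 1000)) + rng.normal(scale=0.1, size=1000)
series[500] += 3.0  # inject a point anomaly

# Turn the series into overlapping windows so the detector sees local shape.
window = 16
X = np.lib.stride_tricks.sliding_window_view(series, window)

clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
scores = clf.decision_function(X)      # lower = more anomalous
flags = clf.predict(X)                 # -1 marks anomalous windows
print(np.where(flags == -1)[0][:10])   # window start indices near index 500
```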
Autoencoder-based approaches train a neural network to reconstruct normal time series patterns; anomalies are identified as inputs with high reconstruction error. LSTM autoencoders are particularly popular because they capture temporal dependencies in the encoding. Variational autoencoders (VAEs) add a probabilistic layer that can quantify uncertainty. Generative adversarial networks (GANs) have also been applied: the generator learns the distribution of normal data, and samples that deviate from the generator's output are flagged as anomalies. More recent hybrid frameworks combine Transformer-based autoencoders with Isolation Forest and XGBoost to handle complex spatiotemporal dependencies in IoT and industrial settings.
Real-world time series frequently contain missing values due to sensor failures, network outages, data collection errors, or intentional sparse sampling. Handling these gaps appropriately is essential for accurate modeling.
Forward fill (carrying the last observed value forward) and backward fill are the simplest strategies. Mean and median imputation replace missing values with a constant but destroy temporal structure. These methods are fast and serve as baselines but can introduce bias, particularly for long gaps.
Linear interpolation estimates missing values by drawing a straight line between neighboring observed points. It is computationally cheap and performs surprisingly well across diverse datasets; a 2025 benchmarking study found that linear interpolation outperformed more complex methods across multiple missingness mechanisms and percentages. Spline interpolation fits smooth curves through the data and captures nonlinear patterns more accurately but is more sensitive to noise. Cubic splines and Akima splines are common choices.
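The simple strategies map directly onto pandas one-liners; a sketch on a toy series with two gaps (the spline variant requires SciPy):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=10, freq="D")
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0, np.nan, 8.0, 9.0, 10.0], index=idx)

print(s.ffill())                                # forward fill: repeat last observation
print(s.interpolate(method="linear"))           # straight line between neighbors
print(s.interpolate(method="spline", order=3))  # cubic spline (needs scipy)
```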
More sophisticated approaches use the time series structure itself. Kalman smoothing estimates missing values using a state space model. Seasonal decomposition methods impute by reconstructing the missing segment from estimated trend and seasonal components. Deep learning models such as MOMENT can reconstruct masked segments based on learned temporal patterns, effectively performing imputation as a byproduct of their pretraining objective.
Some time series are inherently irregular, with observations arriving at uneven intervals (for example, electronic health records or event logs). Approaches to irregular time series include resampling to a regular grid (with interpolation to fill gaps), time-aware neural architectures (such as Time-LSTM, which incorporates elapsed time as an input gate modifier), and neural ordinary differential equations (Neural ODEs) that model continuous-time dynamics.
Time series analysis encompasses several distinct task types, each with different objectives and evaluation criteria.
Forecasting is the most common time series task. It involves predicting future values of one or more variables based on historical observations. Forecasting can be univariate (predicting a single variable) or multivariate (predicting multiple related variables simultaneously). It can also be single-step (predicting one time step ahead) or multi-step (predicting multiple time steps ahead). Applications include demand planning, financial prediction, weather forecasting, and capacity planning.
Time series classification assigns a categorical label to an entire time series or a segment of a time series. Rather than predicting future values, the goal is to identify which category a sequence belongs to. Examples include classifying heartbeat signals as normal or abnormal, identifying the type of activity from accelerometer data (walking, running, sitting), and categorizing industrial sensor readings by equipment state. Common approaches include distance-based methods (such as Dynamic Time Warping), shapelet-based methods, and deep learning classifiers.
Time series imputation fills in missing values within a time series. Missing data is common in real-world time series due to sensor failures, network outages, or data collection errors. Imputation methods range from simple interpolation to sophisticated models such as MOMENT that can reconstruct masked segments based on learned temporal patterns.
Selecting the right evaluation metric is essential for comparing forecasting methods and selecting models for production. Different metrics emphasize different aspects of forecast quality.
Mean Absolute Error (MAE) measures the average absolute difference between predicted and actual values. It is expressed in the same units as the data, making it easy to interpret. MAE is robust to outliers because it does not square the errors. Minimizing MAE produces forecasts of the median. MAE is a good default metric when a simple, robust summary of average error is needed.
Root Mean Squared Error (RMSE) is similar to MAE but squares each error before averaging and then takes the square root. This penalizes large errors more heavily, making RMSE a better choice when large forecast errors are especially costly (for example in financial risk or safety-critical applications). Minimizing RMSE produces forecasts of the mean.
Mean Squared Error (MSE) is RMSE without the final square root. It is commonly used as a training loss function for deep learning models because it is differentiable and convex.
Mean Absolute Percentage Error (MAPE) converts errors into percentages of the actual values, making it scale-independent. This is useful for comparing accuracy across datasets measured in different units. However, MAPE has two well-known drawbacks: it is undefined when actual values are zero, and it is asymmetric, penalizing over-forecasts more heavily than under-forecasts of the same magnitude.
Symmetric Mean Absolute Percentage Error (sMAPE) addresses MAPE's asymmetry by using the average of actual and forecast values in the denominator. This makes it more balanced between over- and under-forecasts. sMAPE was used as a primary metric in the M3 and M4 forecasting competitions.
Mean Absolute Scaled Error (MASE) divides the MAE by the MAE of a naive one-step-ahead forecast (which simply predicts the previous value). A MASE less than 1 means the model outperforms the naive baseline. MASE is well-defined for all series (including those with zero values) and is the recommended metric in the Monash Time Series Forecasting Archive.
| Metric | Formula summary | Scale-dependent | Handles zeros | Symmetric | Common use |
|---|---|---|---|---|---|
| MAE | Mean of \|actual - predicted\| | Yes | Yes | Yes | Robust default summary of error |
| RMSE | Square root of mean squared error | Yes | Yes | Yes | When large errors are costly |
| MAPE | Mean of \|error / actual\| × 100% | No | No (undefined at zero) | No | Comparing across scales and units |
| sMAPE | Mean of 2\|error\| / (\|actual\| + \|forecast\|) × 100% | No | Mostly (undefined when actual and forecast are both zero) | More balanced | M3 and M4 competitions |
| MASE | MAE / MAE of naive forecast | No | Yes | Yes | Monash archive; academic benchmarks |
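For reference, each of these metrics is a few lines of NumPy; the MASE helper below uses the in-sample MAE of a naive lag-m forecast as its scale, with m = 1 for non-seasonal data:

```python
import numpy as np

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    return np.mean(np.abs((y - yhat) / y)) * 100  # undefined when any y == 0

def smape(y, yhat):
    return np.mean(2 * np.abs(yhat - y) / (np.abs(y) + np.abs(yhat))) * 100

def mase(y, yhat, y_train, m=1):
    # Scale by the in-sample MAE of the (seasonal-)naive forecast at lag m.
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y - yhat)) / scale

y_train = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0])
y_true = np.array([13.0, 15.0])
y_pred = np.array([14.0, 14.0])
print(mae(y_true, y_pred), rmse(y_true, y_pred), mase(y_true, y_pred, y_train))
```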
Standardized benchmark datasets are essential for comparing time series methods objectively. Several datasets and repositories have become community standards.
The M competitions, organized by Spyros Makridakis and the International Institute of Forecasters, are the most influential forecasting competitions in the field. The M4 competition (2018) included 100,000 time series across six frequencies (yearly, quarterly, monthly, weekly, daily, and hourly). The winning method, ES-RNN by Slawek Smyl, combined exponential smoothing with recurrent neural networks, demonstrating the value of hybrid statistical-deep learning approaches. The M5 competition (2020) focused on hierarchical retail sales forecasting using Walmart data, and gradient boosting methods (particularly LightGBM) dominated the leaderboard.
The Monash Time Series Forecasting Archive, compiled by researchers at Monash University and published at NeurIPS 2021, is the first comprehensive benchmark repository for global and multivariate time series forecasting. It contains 30 datasets (with 58 dataset variations when accounting for different frequencies and missing value treatments) spanning domains such as tourism, electricity, traffic, weather, and healthcare. The archive provides baseline results for standard forecasting methods across ten error metrics, using MASE as the primary evaluation measure.
The ETT dataset, introduced alongside the Informer paper, has become one of the most widely used benchmarks for long-term time series forecasting. It consists of four sub-datasets: two at hourly resolution (ETTh1, ETTh2) and two at 15-minute resolution (ETTm1, ETTm2). Each dataset contains oil temperature and six power load features from electricity transformers, collected from July 2016 to July 2018 (roughly 17,500 observations per hourly dataset and 70,000 per 15-minute dataset). The data exhibits a mix of short-term periodic patterns, long-term periodic patterns, long-term trends, and irregular patterns, making it a challenging benchmark for evaluating model performance on real-world data.
Several other datasets are commonly used in time series research, including the Electricity, Traffic, and Weather datasets featured in long-horizon forecasting benchmarks, the Exchange Rate dataset, and the UCR and UEA archives for time series classification.
Time series analysis is applied across nearly every industry. The following sections describe major application domains.
Financial time series analysis encompasses stock price prediction, algorithmic trading, risk management, fraud detection, and portfolio optimization. Financial data is notoriously noisy, non-stationary, and influenced by external events (earnings reports, geopolitical developments, regulatory changes). Forecasting methods in finance must account for volatility clustering, where periods of high variance tend to follow other high-variance periods. Specialized models such as GARCH (Generalized Autoregressive Conditional Heteroskedasticity) are used to model this volatility. Deep learning methods including LSTMs and Transformers have been applied to predict stock returns, though consistently outperforming the market remains extremely difficult, an observation consistent with the efficient market hypothesis.
Weather forecasting is one of the oldest and most important applications of time series analysis. Numerical weather prediction (NWP) models solve physical equations governing atmospheric dynamics, but machine learning approaches have increasingly supplemented or replaced traditional methods. Google DeepMind's GraphCast and Huawei's Pangu-Weather have demonstrated that deep learning models can produce competitive medium-range weather forecasts at a fraction of the computational cost of NWP models. Time series methods are also used for climate analysis, including temperature trend detection, precipitation forecasting, and extreme weather event prediction.
Energy applications include electricity demand forecasting, renewable energy generation prediction (solar and wind), grid load balancing, and energy price forecasting. Accurate demand forecasts help utilities optimize power generation and avoid blackouts. Solar and wind power production depends on weather conditions, making it inherently variable and difficult to predict. Gradient boosting methods and LSTMs are widely used for short-term load forecasting, while foundation models like Chronos are being adopted for longer-horizon energy forecasting.
Retail and e-commerce companies use time series forecasting to predict product demand, optimize inventory levels, plan promotions, and manage supply chains. The M5 competition highlighted the effectiveness of gradient boosting methods for hierarchical demand forecasting at scale. Companies such as Amazon and Walmart process millions of individual time series for demand planning, often using global models that learn shared patterns across products while accounting for product-specific features.
Healthcare applications of time series analysis include patient monitoring (detecting dangerous changes in vital signs), epidemic forecasting (predicting disease spread), electrocardiogram (ECG) analysis, sleep stage classification, and hospital resource planning. Anomaly detection on physiological signals (heart rate, blood pressure, oxygen saturation) enables early intervention for deteriorating patients. During the COVID-19 pandemic, time series models were extensively used to forecast case counts, hospitalizations, and mortality, though the accuracy of these forecasts varied widely depending on the method and data quality.
IoT sensors generate massive volumes of time series data from manufacturing equipment, vehicles, buildings, and infrastructure. Predictive maintenance uses time series analysis to detect early signs of equipment failure before it occurs, reducing downtime and repair costs. Industrial anomaly detection identifies unusual sensor patterns that may indicate quality issues or safety hazards. The high dimensionality and volume of IoT data have driven interest in scalable methods such as foundation models and streaming anomaly detection algorithms.
The following table summarizes the evolution of time series methods across different eras.
| Era | Period | Representative methods | Key characteristics |
|---|---|---|---|
| Classical Statistics | 1950s-1990s | ARIMA, Exponential Smoothing (ETS), GARCH, State Space Models | Explicit statistical assumptions; interpretable; manual model selection |
| Early Machine Learning | 2000s-2010s | Support Vector Machines, Random Forests, Feature Engineering Pipelines | Supervised learning framework; lag features; cross-validation |
| Gradient Boosting | 2014-present | XGBoost, LightGBM, CatBoost | Ensemble of decision trees; strong tabular performance; competition winners |
| Deep Learning (Recurrent) | 2015-2020 | RNN, LSTM, GRU, DeepAR | Sequence modeling; automatic feature learning; vanishing gradient issues |
| Deep Learning (Convolutional) | 2016-present | TCN, WaveNet, TimesNet | Parallel training; dilated convolutions; stable gradients |
| Deep Learning (MLP) | 2020-present | N-BEATS, N-HiTS | Fully connected stacks; residual learning; interpretable decompositions |
| Transformers | 2020-present | Informer, Autoformer, FEDformer, PatchTST | Attention mechanisms; long-range dependencies; patch-based tokenization |
| Foundation Models | 2023-present | TimeGPT, Chronos, TimesFM, Moirai, MOMENT, Lag-Llama, TTM | Pretrained on billions of data points; zero-shot and few-shot capability |
| Hybrid/Decomposable | 2017-present | Prophet, ES-RNN, N-BEATS, N-HiTS | Combine statistical decomposition with neural networks |
A variety of open-source tools support time series analysis across different programming languages and frameworks.