See also: Machine learning terms
A timestep is a discrete unit of time progression in a sequential process. The term shows up in many corners of machine learning and applied math, and it does not always mean the same thing. In a time series, it is the gap between two consecutive observations. In a recurrent neural network, it is one pass through one element of an input sequence. In reinforcement learning, it is one tick of an agent-environment loop. In a diffusion model, it is an index into a noise schedule. In a numerical simulation, it is the integration step dt that advances the state of a differential equation.
The shared idea is that something continuous (or at least irregularly sampled) gets sliced into uniform chunks so a computer can process it. Choosing the size of those chunks is one of the more consequential design decisions a modeler makes. Pick dt too large and a physics simulator blows up. Pick a sampling interval too coarse and a speech model loses phonemes. Pick the diffusion T too small and image samples look smudged. The right answer is almost always domain-specific.
The table below summarizes how the word is used across the main areas where it appears. Each row links out to a fuller treatment further down the article or to a dedicated wiki page.
| Domain | Symbol | What a timestep is | Typical scale |
|---|---|---|---|
| Time series analysis | Δt | Interval between consecutive observations | Milliseconds to years |
| RNN, LSTM, GRU | t | One iteration over one element of the input sequence | One token, one frame, one tick |
| Sequence to sequence decoding | t | One generated output token or symbol | One token per step |
| Reinforcement learning and Markov decision process | t | One transition: state, action, reward, next state | Variable, set by environment |
| Diffusion model | t | Index into a noise schedule, from clean (t=0) to noise (t=T) | 1 to 1000 (DDPM); 10 to 50 (DDIM) |
| Simulation and physics-informed ML | dt | Integration step for a differential equation | Microseconds to seconds |
| Transformer | (position) | Strictly a position in the sequence, not a timestep | One token |
| Robotics control loop | Δt | One control update from sensors to actuators | 1 ms to 100 ms |
A few of these uses are formally identical and a few are only loosely related. The RNN and RL versions both refer to a step in a Markov-style sequential decision process. The diffusion and simulation versions both refer to a step in a numerical integrator. The time-series version is the most general and the most domain-agnostic.
A timestep in machine learning refers to a specific instant in time or, more commonly, to the unit of time progression used in time-dependent algorithms. The concept is most relevant when working with time-series data, sequential data, and models for natural language processing or reinforcement learning. In these settings, understanding the role of a timestep is essential for modeling temporal patterns and dependencies.
In the analysis of time series and other temporal data, a timestep is the chronological interval between consecutive observations or events. Time-series data is ordered in time, so the sequence in which the data points appear matters. The timestep is the basic unit for quantifying the temporal gap between points. It can be fixed (hourly, daily, yearly) or variable depending on the data source. The choice influences the structure of the dataset and constrains which models are appropriate. A daily timestep is fine for retail demand forecasting; it is useless for high-frequency trading, where ticks arrive in microseconds.
Sampling frequency is the inverse of the timestep. A 16 kHz audio signal has a timestep of 62.5 microseconds. A 60 fps video has a timestep of about 16.7 milliseconds. A weather station logging once per minute has a timestep of 60 seconds. Whenever a model ingests a sequence, the timestep determines both how much temporal detail it sees and how many timesteps a fixed-duration clip contains.
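In code, the relationship is just arithmetic. A minimal Python sketch using the figures from the paragraph above:

```python
# Timestep is the inverse of sampling frequency, and a clip's length in
# timesteps is its duration times the sampling frequency.
def timestep_s(sample_rate_hz: float) -> float:
    return 1.0 / sample_rate_hz

print(timestep_s(16_000))  # 6.25e-05 s = 62.5 microseconds per audio sample
print(timestep_s(60))      # ~0.0167 s = 16.7 ms per video frame
print(int(3.0 * 16_000))   # a 3-second clip at 16 kHz spans 48,000 timesteps
```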
Recurrent neural networks (RNNs) are designed to handle sequence data and have an inherent capacity to model temporal dependencies. In an RNN, a timestep refers to a single iteration through one element of the input sequence. The network maintains a hidden state h_t that is updated at each timestep based on the previous hidden state h_{t-1} and the current input x_t. This recurrence lets the network remember and use information from earlier in the sequence, which is what makes RNNs useful for natural language processing, speech recognition, and sequence prediction tasks.
The canonical update for a vanilla RNN is h_t = tanh(W_h h_{t-1} + W_x x_t + b), with an output y_t = W_y h_t + b_y if needed. Long short-term memory networks (LSTMs) and gated recurrent units (GRUs) use the same notion of a timestep but add gating mechanisms that decide what to keep in memory and what to forget at each step.
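A minimal NumPy sketch of that recurrence; the weight shapes and sizes here are illustrative, not taken from any particular paper:

```python
import numpy as np

# Vanilla RNN recurrence: h_t = tanh(W_h h_{t-1} + W_x x_t + b),
# applied once per timestep over a 100-step input sequence.
hidden, n_in = 32, 8
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_x = rng.normal(scale=0.1, size=(hidden, n_in))
b = np.zeros(hidden)

h = np.zeros(hidden)                      # initial hidden state h_0
for x_t in rng.normal(size=(100, n_in)):  # one update per timestep
    h = np.tanh(W_h @ h + W_x @ x_t + b)
```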
Training an RNN means computing gradients with respect to weights that are reused at every timestep. The standard algorithm is backpropagation through time (BPTT), formally described by Paul Werbos in his 1990 paper "Backpropagation Through Time: What It Does and How to Do It." BPTT unrolls the recurrent network into a feedforward network with one layer per timestep, applies ordinary backpropagation, and then sums the gradients for each shared weight across all timesteps where it appears.
Full BPTT becomes impractical for very long sequences because the unrolled graph grows linearly with sequence length and so does memory usage. Truncated BPTT (TBPTT) addresses this by splitting the input sequence into chunks of fixed length k (commonly tens to a few hundred timesteps) and only backpropagating through one chunk at a time, while the hidden state itself is carried forward across chunks. The cost is a biased gradient: dependencies longer than the truncation horizon cannot be learned by gradient descent.
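A sketch of the TBPTT loop, assuming a PyTorch nn.RNN- or nn.GRU-style model whose hidden state is a single tensor (an LSTM would need both tensors of its state tuple detached):

```python
import torch

def tbptt(model, loss_fn, optimizer, sequence, targets, k=64):
    """One pass of truncated BPTT with truncation horizon k."""
    h = None                                  # zero initial hidden state
    for start in range(0, sequence.size(0), k):
        out, h = model(sequence[start:start + k], h)
        loss = loss_fn(out, targets[start:start + k])
        optimizer.zero_grad()
        loss.backward()                       # gradients flow at most k timesteps
        optimizer.step()
        h = h.detach()                        # carry the state forward, cut the graph
```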
Because BPTT multiplies Jacobian matrices across many timesteps, gradients can shrink to zero or grow without bound as the sequence gets longer. Sepp Hochreiter identified this vanishing-gradient problem in his 1991 diploma thesis. The practical consequence is that vanilla RNNs struggle to learn dependencies that span more than ten or twenty timesteps. The 1997 LSTM paper by Hochreiter and Schmidhuber introduced gated cell states with an additive recurrence that lets gradients flow over hundreds or even thousands of timesteps without attenuation. GRUs, introduced by Cho and colleagues in 2014, use a simpler two-gate design that is cheaper to compute while retaining most of the long-range capability.
Gradient clipping (Pascanu, Mikolov, and Bengio, 2013) is the standard fix for the exploding side of the same problem: rescale the gradient norm down to a fixed threshold whenever it exceeds it. Together, gating and clipping make multi-timestep training tractable in practice.
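In PyTorch, clipping is a one-line utility called between backward() and step(); the model and loss below are toy stand-ins:

```python
import torch
from torch import nn

model = nn.RNN(input_size=8, hidden_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(100, 1, 8)        # 100 timesteps, batch of 1
out, _ = model(x)
loss = out.pow(2).mean()          # placeholder loss

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```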
In reinforcement learning, a timestep is one transition in a Markov decision process (MDP). At timestep t the agent observes state s_t, picks action a_t, receives reward r_{t+1}, and lands in next state s_{t+1}. An episode is a sequence of these transitions, possibly terminating in a special absorbing state.
The return from timestep t is the discounted sum of future rewards, written G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ..., with discount factor 0 ≤ γ ≤ 1. The recursive form G_t = R_{t+1} + γ G_{t+1} is what makes Bellman equations and temporal-difference learning work: every timestep links to the next through the same recursion. When γ = 0 the agent is myopic and only cares about the immediate reward; when γ approaches 1 the agent weighs far-future rewards almost as heavily as near-term ones. This formulation is laid out in detail in chapter 3 of Sutton and Barto's Reinforcement Learning: An Introduction.
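The recursion translates directly into a backward sweep over an episode's rewards. A small sketch (the reward list is made up):

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = R_{t+1} + gamma * G_{t+1}, computed from the last timestep back."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

# G_0 = 1 + 0.9 * (0 + 0.9 * (0 + 0.9 * 10)) = 8.29
print(discounted_returns([1.0, 0.0, 0.0, 10.0], gamma=0.9))
```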
The physical duration of a timestep is set by the environment, not by the agent. In Atari with frame-skip 4, one RL timestep covers four game frames, roughly 67 ms of game time. In MuJoCo locomotion benchmarks, the simulator runs an internal physics step (often 0.002 s) and an action repeat is wrapped on top, so the agent sees a control step around 50 Hz. In real robots, the timestep equals the control loop period, often between 1 ms and 100 ms.
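Action repeat is usually implemented as an environment wrapper. A sketch using the Gymnasium API (the five-tuple step signature is Gymnasium's; the class itself is illustrative):

```python
import gymnasium as gym

class ActionRepeat(gym.Wrapper):
    """One agent timestep spans `repeat` environment frames."""
    def __init__(self, env, repeat=4):
        super().__init__(env)
        self.repeat = repeat

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.repeat):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        return obs, total_reward, terminated, truncated, info
```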
In a diffusion model, the timestep t indexes how corrupted the data is. The forward process q(x_t | x_{t-1}) adds a small amount of Gaussian noise at each of T steps until x_T is approximately pure noise. The reverse process, learned by a neural network, predicts x_{t-1} from x_t and gradually denoises a sample back to clean data. The 2020 paper Denoising Diffusion Probabilistic Models by Jonathan Ho, Ajay Jain, and Pieter Abbeel introduced the modern formulation and used T = 1000 with a linear β schedule from 0.0001 to 0.02.
Because the network has to behave very differently at high noise (t near T) and low noise (t near 0), the timestep is fed to the model as a conditioning signal. The standard approach is to compute a sinusoidal embedding of t, the same construction used for transformer positional encodings, and inject it into the U-Net through small MLP projections at each residual block.
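A common implementation of that embedding, following the transformer construction (the base of 10000 is the usual convention; `dim` is whatever width the conditioning MLP expects):

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps to sines and cosines at geometrically spaced frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 250, 999]), dim=128)  # shape (3, 128)
```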
The choice of how the per-step noise variance β_t (or equivalently α_t = 1 - β_t) varies with t is called the noise schedule. Different schedules trade off training stability, sample quality, and the number of inference steps required; the common options are summarized in the table below, with a code sketch of two of them after it.
| Schedule | Paper | Notes |
|---|---|---|
| Linear β | Ho et al. 2020 (DDPM) | β_t increases linearly from 1e-4 to 2e-2 over T=1000 steps |
| Cosine | Nichol & Dhariwal 2021 | ᾱ_t ∝ cos²(((t/T + s)/(1+s)) · π/2). Avoids over-noising at low resolutions |
| Karras σ | Karras et al. 2022 (EDM) | σ^(1/ρ) spaced linearly between σ_max^(1/ρ) and σ_min^(1/ρ), with ρ=7, σ_min=0.002, σ_max=80 |
| Continuous time | Lipman et al. 2022 (Flow Matching) | No discrete schedule; t ∈ [0,1] is sampled and the model learns a vector field |
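The first two schedules in the table are a few lines of NumPy. A sketch, using DDPM's reported β endpoints and the s = 0.008 offset from Nichol and Dhariwal:

```python
import numpy as np

def linear_betas(T=1000, beta_1=1e-4, beta_T=2e-2):
    """DDPM's linear schedule: beta_t rises linearly over T steps."""
    return np.linspace(beta_1, beta_T, T)

def cosine_alpha_bar(T=1000, s=0.008):
    """Nichol & Dhariwal's cosine schedule, normalized so alpha_bar(0) = 1."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

betas = linear_betas()
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal retention per timestep
```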
Sampling schedules at inference time are partially decoupled from training. Denoising Diffusion Implicit Models (DDIM) by Song, Meng, and Ermon, 2020, showed that a model trained with T = 1000 can be sampled in 50 or even 10 steps using a deterministic, non-Markovian update rule, with 10x to 50x wall-clock speedup at modest quality cost. EDM samplers go further by using a second-order Heun integrator (a two-stage Runge-Kutta method) along a custom sigma schedule, reaching state-of-the-art FID with 35 network evaluations per image. Flow matching dispenses with discrete diffusion timesteps entirely and works with continuous t, then leans on off-the-shelf ODE solvers at inference.
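The decoupling is visible in how samplers pick their timesteps. A common approach is to take an evenly spaced subsequence of the training indices, as sketched below; the exact spacing rule varies between implementations:

```python
import numpy as np

T, num_inference_steps = 1000, 50
# 50 indices descending from t = 999 to t = 0; the sampler denoises only at these.
timesteps = np.linspace(0, T - 1, num_inference_steps).round().astype(int)[::-1]
```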
In numerical simulation, the timestep dt is the increment used by a time integrator to advance the state of an ordinary or partial differential equation. Forward Euler updates a state by x(t+dt) = x(t) + dt · f(x(t)). Higher-order schemes such as fourth-order Runge-Kutta (RK4) evaluate the derivative at multiple intermediate points and combine them for better accuracy at the same dt.
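Both integrators fit in a few lines. A sketch on the decay equation dx/dt = -x, whose exact solution at t = 1 is exp(-1):

```python
import math

def euler_step(f, x, dt):
    return x + dt * f(x)

def rk4_step(f, x, dt):
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

f = lambda x: -x
x = 1.0
for _ in range(100):              # 100 timesteps of dt = 0.01
    x = rk4_step(f, x, 0.01)
print(x, math.exp(-1.0))          # RK4 lands very close to the exact value
```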
For explicit time-integration schemes applied to convection or wave equations, dt cannot be chosen freely. The Courant-Friedrichs-Lewy (CFL) condition, described by Richard Courant, Kurt Friedrichs, and Hans Lewy in their 1928 paper, requires that the numerical domain of dependence contains the physical one. In one dimension this becomes C = u · dt / dx ≤ C_max, where u is the wave speed, dx is the grid spacing, and C_max is on the order of 1 for typical schemes. Violate it and errors grow exponentially. Many implicit schemes, such as backward Euler, are unconditionally stable and have no CFL constraint, at the cost of solving a linear (or nonlinear) system at every timestep.
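Rearranging the condition gives the largest stable timestep directly; for example, for an acoustic wave (u ≈ 340 m/s) on a 1 cm grid:

```python
def max_stable_dt(u, dx, c_max=1.0):
    """Largest dt satisfying the 1-D CFL condition u * dt / dx <= c_max."""
    return c_max * dx / u

print(max_stable_dt(u=340.0, dx=0.01))  # ~2.9e-05 s, about 29 microseconds
```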
This matters for physics-informed neural networks and learned simulators (GraphCast, GNS, ML force fields). Even if the network itself does not have a CFL condition, it has to be evaluated at a dt small enough to capture the dynamics it was trained on. In neural ODE models the integrator picks dt adaptively based on local error estimates.
In robotics and embodied AI, the control-loop timestep is the period at which sensors are read, a policy is queried, and actuator commands are sent. MuJoCo's mj_step integrates over a period set by mjModel.opt.timestep, typically 1 to 5 milliseconds for the physics. RL policies on top of MuJoCo are usually queried at a lower rate (around 50 Hz, so a 20 ms control step) with action repeat handling the difference.
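A minimal sketch with the official mujoco Python bindings, showing simulation time advancing by opt.timestep per step; the one-body XML model is a throwaway example:

```python
import mujoco

xml = """
<mujoco>
  <worldbody>
    <body pos="0 0 1"><freejoint/><geom size="0.1"/></body>
  </worldbody>
</mujoco>
"""
model = mujoco.MjModel.from_xml_string(xml)
model.opt.timestep = 0.002        # 2 ms physics step
data = mujoco.MjData(model)

for _ in range(25):               # 25 steps = 50 ms of simulated time
    mujoco.mj_step(model, data)
print(data.time)                  # ~0.05
```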
Transformers do not have timesteps in the recurrent sense. They process all positions of a sequence in parallel and have no hidden state that gets updated step by step. The right word for the index of a token in a transformer's input is position, encoded with sinusoidal or learned positional embeddings. People sometimes still say "timestep" when talking about autoregressive decoding, where the model emits one token per generation step, but the model architecture itself has no inherent notion of time.
The overlap with diffusion is interesting. Both architectures use sinusoidal embeddings for an integer index. In a transformer, the integer is a position in the input. In a diffusion U-Net, it is a noise level. The same trick (encode an integer as a vector of sines and cosines at different frequencies) does double duty.
Three separate clocks often co-exist in any system that involves a model and an environment.
| Clock | What it measures |
|---|---|
| Real-time | Time as experienced in the physical world, Δt between sensor reads |
| Simulation-time | Time inside a simulator, advanced one dt per mj_step call |
| Wall-clock | Time spent on compute, regardless of what the simulation thinks |
In simulated training, simulation-time can run far faster than real-time: a MuJoCo Ant environment can produce thousands of simulation seconds per wall-clock second on a single CPU core. In live deployment, real-time and wall-clock have to match, which is why on-policy RL approaches that need many environment steps per gradient update often run faster in sim than they could ever run in reality. The control-loop timestep is the bridge: it is the longest wall-clock period the policy is allowed to take per inference, and it bounds how complex the model can be on a given target.
The timestep also sets the receptive field of a sequential model. An RNN can only carry information forward in increments of one timestep, so to influence a prediction 100 steps in the future the network has to keep that information alive across 100 hidden-state updates. A transformer can attend across the full context window in a single layer, but the resolution of "how far back" it can look is still set by the timestep at which the input was discretized: a transformer over per-second audio frames cannot resolve details smaller than a second.
In diffusion models, the spacing of timesteps determines how smoothly the model has to bridge between adjacent noise levels. Too few steps and each denoising step becomes a hard problem; too many and inference is slow. The Karras paper argued that the right way to think about this is in terms of the noise level σ_t rather than the integer t, which is one reason flow matching and continuous-time formulations have become more popular.
Imagine you are reading a storybook, and you read one sentence at a time. Each time you read a new sentence, you remember the sentences that came before it, so you can understand the story. In machine learning, a timestep is like reading one sentence of the story. In some types of learning, like reading a book, the order of the sentences is really important, so the machine needs to keep track of the order using timesteps. This helps the machine understand how things change over time, like in stories or when trying to predict the weather.
Now imagine that instead of a story, you are watching a movie one frame at a time, or pushing a swing one little nudge at a time, or coloring in a noisy picture by removing a tiny bit of fuzz at every brushstroke. All of those are timesteps too: each one is a small step forward, and stacking enough of them together gets you the whole movie, the swinging motion, or a clean picture.