See also: Reinforcement Learning, Policy, Epsilon Greedy Policy, Q-Learning
A random policy (also called a uniform random policy) is a policy in reinforcement learning (RL) that selects actions with equal probability, regardless of the current state, history, or any learned value estimates. Formally, given a finite action space A, the random policy assigns probability 1/|A| to every action in every state. The agent following a random policy makes no attempt to maximize reward; it simply explores the environment by choosing actions uniformly at random.
Despite its simplicity, the random policy plays several foundational roles in RL research and practice. It serves as a lower-bound baseline for evaluating learned algorithms, a default rollout mechanism in Monte Carlo tree search, a data-collection strategy for off-policy learning, and the implicit starting point of many algorithms that initialize policy parameters randomly. Understanding the random policy, its mathematical properties, and its limitations provides a useful reference point for studying how more sophisticated policies improve upon undirected exploration.
In the standard RL framework, an agent interacts with a Markov decision process (MDP) defined by a tuple (S, A, P, R, γ), where S is the state space, A is the action space, P is the transition function, R is the reward function, and γ is the discount factor. A stochastic policy π maps each state to a probability distribution over actions: π(a|s) = Pr(A_t = a | S_t = s).
A random policy π_rand is the special case where the distribution is uniform across all available actions in every state:
π_rand(a|s) = 1 / |A(s)| for all a ∈ A(s), for all s ∈ S
Here |A(s)| denotes the number of available actions in state s. When the action space is the same in every state, this simplifies to π_rand(a|s) = 1 / |A| for all a and s.
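As a concrete illustration, a uniform random policy over a discrete action space can be written in a few lines. The following is a minimal sketch using NumPy; the `state` argument is accepted only to emphasize that it is ignored:

```python
import numpy as np

def random_policy(state, n_actions, rng=None):
    """Uniform random policy: the state is ignored entirely."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.full(n_actions, 1.0 / n_actions)   # pi_rand(a|s) = 1 / |A|
    action = int(rng.choice(n_actions, p=probs))
    return action, probs

# Every action has probability 1/4, regardless of the state passed in.
action, probs = random_policy(state=None, n_actions=4)
```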
Key properties of the random policy include:
State-independence: the action distribution is identical in every state; the policy conditions on nothing.
Stationarity: the policy never changes, no matter how much experience is gathered.
Maximum entropy: the uniform distribution has the highest possible entropy over a finite action set, so no policy spreads its choices more evenly.
No learned components: it requires no value estimates, parameters, or memory to execute.
Although the random policy does not attempt to maximize reward, it still induces a well-defined value function. The state-value function under the random policy is:
V^π_rand(s) = E_π_rand [ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s ]
This represents the expected discounted return when starting in state s and choosing all future actions uniformly at random. Similarly, the action-value function (Q-function) under the random policy is:
Q^π_rand(s, a) = E_π_rand [ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s, A_0 = a ]
The value function under the random policy is generally far from optimal. Since the policy ignores the reward signal entirely, the expected return reflects the average outcome over all possible action sequences rather than the best achievable outcome. For any optimal policy π*, we have V^π*(s) ≥ V^π_rand(s) for all states s, with equality only in trivial environments where all action sequences yield identical returns.
Computing V^π_rand analytically is possible for small, tabular MDPs using the Bellman equation:
V^π_rand(s) = Σ_a (1/|A|) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π_rand(s')]
This linear system can be solved directly, or iteratively through policy evaluation. The value function under the random policy is used as a starting point in the policy iteration algorithm, where the process begins with an arbitrary (often random) policy, evaluates it, and then improves it by acting greedily with respect to the current value function.
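For example, iterative policy evaluation for the random policy can be written compactly. The sketch below assumes the MDP is supplied as dense NumPy arrays, an illustrative convention rather than a standard API, and the example MDP at the end uses hypothetical numbers:

```python
import numpy as np

def evaluate_random_policy(P, R, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation for the uniform random policy.

    P: (S, A, S) array of transition probabilities P(s'|s,a)
    R: (S, A, S) array of rewards R(s, a, s')
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # One-step Bellman backup: Q(s,a) = sum_s' P(s'|s,a) [R + gamma V(s')]
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        V_new = Q.mean(axis=1)        # uniform weight 1/|A| on every action
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Hypothetical 2-state, 2-action MDP with reward 1 on every transition:
# the discounted return is 1/(1 - gamma) = 10 in every state, under any policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.ones_like(P)
print(evaluate_random_policy(P, R))   # ~[10. 10.]
```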
The most common use of the random policy is as a performance baseline. If a trained RL algorithm cannot consistently outperform a random policy, this is strong evidence that the algorithm is not learning effectively, that the environment has no exploitable structure, or that there is a bug in the implementation.
This practice is widespread in the RL literature. In their landmark 2015 paper on Deep Q-Networks (DQN), Mnih et al. used random agent scores to establish the lower bound of their evaluation metric. The human-normalized score (HNS) used across Atari 2600 benchmarks is defined as:
HNS = 100 × (agent score - random score) / (human score - random score)
Under this formula, a random agent scores 0% and a human player scores 100%. The random agent in the DQN evaluation selected a random action every 6 frames (10 Hz), which avoids inflated baseline scores in certain games. DQN achieved above 75% of human performance in 29 out of 49 games tested, measured against this random baseline.
The following table shows example Atari game scores for random, DQN, and human players, illustrating how the random baseline anchors evaluation:
| Game | Random Agent Score | DQN Score | Human Score | DQN HNS (%) |
|---|---|---|---|---|
| Breakout | 1.7 | 401.2 | 31.8 | ~1,327 |
| Pong | -20.7 | 18.9 | 9.3 | ~132 |
| Seaquest | 68.4 | 5,286 | 20,182 | ~26 |
| Montezuma's Revenge | 0.0 | 0.0 | 4,753 | 0 |
The table highlights that DQN vastly exceeded human performance in some games (Breakout), matched humans in others (Pong), and completely failed in sparse-reward games like Montezuma's Revenge, where it scored no better than the random agent.
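The normalization itself is a one-line computation; a minimal sketch using values from the table above:

```python
def human_normalized_score(agent_score, random_score, human_score):
    """HNS = 100 * (agent - random) / (human - random)."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Breakout and Montezuma's Revenge rows from the table above.
print(human_normalized_score(401.2, 1.7, 31.8))      # ~1327
print(human_normalized_score(0.0, 0.0, 4753.0))      # 0.0
```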
Before an agent has any knowledge of the environment, random action selection is a natural starting strategy. Many RL algorithms begin training with a policy that is effectively random, either by initializing neural network weights randomly (which typically yields near-uniform action probabilities) or by setting the exploration parameter to its maximum value.
In Deep Q-Networks, for example, the epsilon-greedy policy starts with ε = 1.0, meaning the agent selects every action uniformly at random during the first phase of training. This initial random exploration fills the experience replay buffer with diverse transitions, providing a broad sampling of the state-action space that stabilizes subsequent learning. The epsilon value is then annealed over the course of training (typically to 0.1 or 0.01), gradually shifting from random exploration to informed exploitation.
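A minimal sketch of this schedule, using a linear anneal with illustrative step counts rather than the exact values from any particular implementation:

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from eps_start down to eps_end."""
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def epsilon_greedy_action(q_values, step, n_actions):
    """Uniformly random with probability epsilon, otherwise greedy."""
    if random.random() < epsilon_by_step(step):
        return random.randrange(n_actions)                    # random branch
    return max(range(n_actions), key=lambda a: q_values[a])   # greedy branch
```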
The Deep Deterministic Policy Gradient (DDPG) algorithm uses a similar approach for continuous action spaces: for a fixed number of initial steps, the agent samples actions from a uniform random distribution over the valid action range before switching to its learned policy augmented with noise.
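In code, the warm-up amounts to a single branch. The sketch below assumes a Gymnasium environment with a Box action space; `policy`, `exploration_noise`, and the step threshold are hypothetical placeholders, not values from the original algorithm:

```python
# Hypothetical warm-up phase: `policy` and `exploration_noise` are placeholders,
# and START_STEPS is an illustrative value rather than one from the paper.
START_STEPS = 10_000

def select_action(env, obs, step, policy, exploration_noise):
    if step < START_STEPS:
        return env.action_space.sample()        # uniform over the valid range
    return policy(obs) + exploration_noise()    # learned policy plus noise
```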
Classical dynamic programming algorithms for solving MDPs typically begin with a random or arbitrary policy. In policy iteration, the algorithm alternates between two steps: policy evaluation, which computes the value function of the current policy, and policy improvement, which constructs a new policy that acts greedily with respect to that value function.
Starting from the random policy, the first improvement step produces a policy that selects the action maximizing expected immediate reward plus the discounted value of successor states, based on the value function of the random policy. Each subsequent iteration produces a strictly better policy (unless the current policy is already optimal). Because the number of deterministic policies is finite (|A|^|S| for a tabular MDP), this process is guaranteed to converge to the optimal policy π* in a finite number of iterations.
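Continuing the evaluation sketch above, the improvement step that acts greedily with respect to V^π_rand can be written as:

```python
import numpy as np

def greedy_improvement(P, R, V, gamma=0.9):
    """One policy-improvement step: act greedily with respect to V."""
    Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
    return Q.argmax(axis=1)     # deterministic policy: one action per state
```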
Value iteration, a closely related method, can also be understood as starting from a value function initialized to zero (or arbitrarily), which encodes no information about which actions are better, much as the random policy expresses no preference among them.
Monte Carlo tree search (MCTS) is a planning algorithm widely used in game-playing systems, including the AlphaGo program that defeated the world champion in Go. MCTS builds a search tree incrementally through four repeated steps: selection, expansion, simulation, and backpropagation.
During the simulation step (also called a "rollout" or "playout"), the algorithm plays out the remainder of the game from a leaf node to estimate the value of that position. In the simplest form, this rollout uses a random policy: actions are chosen uniformly at random until the game reaches a terminal state. The outcome (win, loss, or draw) is then propagated back up the tree to update value estimates.
This type of rollout is called a light playout. It requires no domain knowledge beyond the rules of the game, making it easy to implement. A heavy playout, by contrast, uses heuristics or learned policies to bias move selection during simulation, often improving the quality of value estimates at the cost of additional computation.
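A light playout is only a few lines of code. The sketch below assumes a hypothetical game-state object exposing `copy()`, `is_terminal()`, `legal_moves()`, `play()`, and `result()` methods; the interface is an assumption for illustration, not a standard API:

```python
import random

def light_playout(game_state):
    """Play uniformly random moves to the end of the game; return the outcome."""
    state = game_state.copy()
    while not state.is_terminal():
        move = random.choice(state.legal_moves())   # random rollout policy
        state.play(move)
    return state.result()   # e.g. +1 win, 0 draw, -1 loss, backed up the tree
```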
An interesting empirical finding in the MCTS literature is that random rollout policies sometimes perform surprisingly well, and in certain games, deliberately playing slightly suboptimal moves during simulation can actually strengthen the overall tree search. This occurs because the value of a position under perfect play and under random play can be correlated, even if the absolute scores differ.
In off-policy reinforcement learning, the agent learns about a target policy while collecting data using a different behavior policy. The random policy is sometimes used as the behavior policy because it guarantees broad coverage of the state-action space, ensuring that no action is systematically excluded from the training data.
When a random behavior policy is used with off-policy methods like Q-learning, the agent can still learn the optimal policy because Q-learning's update rule uses the maximum Q-value of the next state (the greedy target policy) regardless of which action was actually taken. The behavior policy only needs to visit all state-action pairs with sufficient frequency, a condition that the random policy satisfies in ergodic or communicating MDPs.
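As a sketch, tabular Q-learning driven entirely by a random behavior policy might look like the following, assuming a discrete Gymnasium environment such as FrozenLake-v1; the hyperparameters are illustrative:

```python
import numpy as np
import gymnasium as gym

def q_learning_random_behavior(env_name="FrozenLake-v1", episodes=5000,
                               alpha=0.1, gamma=0.99):
    """Off-policy Q-learning in which every action is chosen uniformly at random."""
    env = gym.make(env_name)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = env.action_space.sample()                      # random behavior policy
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Greedy target policy: bootstrap from max_a' Q(s', a')
            target = r + gamma * (0.0 if terminated else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    env.close()
    return Q
```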
For Monte Carlo off-policy methods, learning from random behavior policy data requires importance sampling to correct for the mismatch between the behavior policy and the target policy. The importance sampling ratio for a trajectory under the random policy versus a target policy π is:
ρ = Π_{t=0}^{T-1} π(A_t|S_t) / π_rand(A_t|S_t) = |A|^T × Π_{t=0}^{T-1} π(A_t|S_t)
Because the random policy assigns equal probability to all actions, these ratios can become very large for trajectories that the target policy strongly prefers, leading to high variance in the importance-weighted estimates. This variance issue is a practical limitation of using random behavior policies with importance-sampling-based methods, and weighted importance sampling is often used to mitigate it.
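A small sketch makes the growth of these ratios concrete; the numbers are illustrative:

```python
def importance_ratio(target_probs, n_actions):
    """Product of pi(A_t|S_t) / pi_rand(A_t|S_t), with pi_rand = 1/|A|."""
    rho = 1.0
    for p in target_probs:              # pi(A_t|S_t) for the actions taken
        rho *= p * n_actions
    return rho

# A deterministic target policy (probability 1 at every step) with |A| = 4:
# after 10 steps the ratio is 4**10 = 1,048,576.
print(importance_ratio([1.0] * 10, n_actions=4))
```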
The multi-armed bandit problem is a simplified RL setting with a single state and k actions (arms). Each arm produces rewards from an unknown distribution, and the goal is to maximize total reward over T rounds.
A random (uniform) policy in the bandit setting selects each arm with probability 1/k at every round. This strategy is sometimes called "explore-first" when used over an initial phase, or "uniform exploration" when applied throughout.
The performance of a bandit algorithm is typically measured by its regret, defined as the difference between the total reward of the optimal arm and the total reward actually received. For a uniform random policy over T rounds:
Expected regret ≈ T × (μ* - μ_avg)
where μ* is the mean reward of the best arm and μ_avg is the average mean reward across all arms. This regret grows linearly with T, which is far worse than the logarithmic regret O(log T) achieved by algorithms like UCB (Upper Confidence Bound) or Thompson Sampling.
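The linear growth is easy to check numerically; a minimal sketch with hypothetical arm means:

```python
def uniform_policy_regret(arm_means, T):
    """Expected regret of uniform random play: T * (mu_star - mu_avg)."""
    mu_star = max(arm_means)
    mu_avg = sum(arm_means) / len(arm_means)
    return T * (mu_star - mu_avg)

# Hypothetical 3-armed bandit: doubling the horizon doubles the regret.
print(uniform_policy_regret([0.2, 0.5, 0.8], T=1000))   # 300.0
print(uniform_policy_regret([0.2, 0.5, 0.8], T=2000))   # 600.0
```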
The following table compares regret bounds for common bandit strategies:
| Strategy | Regret Growth | Description |
|---|---|---|
| Uniform random | O(T) | Selects each arm with equal probability; linear regret |
| Epsilon-greedy (fixed ε) | O(T) | Greedy with probability 1-ε, random with probability ε |
| Epsilon-greedy (decaying ε) | O(log T) under conditions | Reduces exploration over time |
| UCB1 | O(log T) | Optimism in the face of uncertainty |
| Thompson Sampling | O(log T) | Bayesian posterior sampling |
| Exp3 (adversarial) | O(√(KT log K)) | Mixes exponential weights with uniform exploration |
The linear regret of the uniform random policy demonstrates that undirected exploration is fundamentally inefficient. Algorithms that direct exploration toward uncertain or promising actions (like UCB) achieve exponentially better scaling.
An important line of research uses random policies to assess the inherent difficulty of RL benchmark tasks. If a random agent achieves reasonable performance on a benchmark, this suggests the benchmark may not adequately test learning ability.
Raghu et al. (2020) proposed a random weight guessing (RWG) method that generates policy networks with randomly sampled parameters and evaluates them directly on standard benchmarks, without any training. By studying the score distribution of these untrained networks, they assessed the intrinsic complexity of each task. Their analysis revealed that several popular OpenAI Gym environments, particularly classic control tasks, are simpler than commonly assumed, since random networks can occasionally solve them.
Mania, Guy, and Recht (2018) demonstrated in their NeurIPS paper that a simple random search over static linear policies is competitive with state-of-the-art RL algorithms on MuJoCo locomotion benchmarks. Their method, called Augmented Random Search (ARS), perturbs policy parameters with random noise and uses finite-difference estimates to update the policy. ARS matched or exceeded the sample efficiency of methods like TRPO, PPO, and A2C while being at least 15 times faster. This result challenged the assumption that sophisticated gradient-based policy optimization is necessary for standard continuous control tasks and highlighted that random search in policy space can be surprisingly effective.
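The core update behind such methods can be sketched in a few lines. The following is a simplified basic-random-search step in the spirit of ARS, not the authors' implementation; it omits the state normalization, top-performing-direction selection, and reward scaling of the full algorithm, and the `rollout` function and hyperparameters are illustrative placeholders:

```python
import numpy as np

def random_search_step(theta, rollout, step_size=0.02, noise_std=0.03, n_dirs=8):
    """One basic random-search update over linear policy parameters theta.

    `rollout(theta)` is assumed to return the episode return of the linear
    policy a = theta @ s; the function and all hyperparameters here are
    illustrative placeholders, not values from the ARS paper.
    """
    grad = np.zeros_like(theta)
    for _ in range(n_dirs):
        delta = noise_std * np.random.randn(*theta.shape)    # random perturbation
        reward_plus = rollout(theta + delta)
        reward_minus = rollout(theta - delta)
        grad += (reward_plus - reward_minus) * delta          # finite-difference estimate
    return theta + step_size * grad / n_dirs
```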
Recent work has explored using random policy data for in-context reinforcement learning (ICRL), where a pretrained transformer model learns to perform RL by conditioning on interaction histories provided in its context window.
Prior ICRL methods like Algorithm Distillation and Decision Pretrained Transformer required high-quality behavior policies (often near-optimal) in the pretraining data. Chen et al. (2024) introduced State-Action Distillation (SAD), which generates effective pretraining datasets using only random policies. SAD selects informative state-action pairs from random trajectories within a bounded "trust horizon" and distills them into training examples. In experiments, SAD outperformed the best baseline by 236.3% in offline evaluation and 135.2% in online evaluation across multiple benchmark environments.
This result is complemented by earlier work on the Decision Transformer (Chen et al., 2021), which showed that a transformer trained on random walk trajectories can learn to stitch together subsequences from different training episodes to produce near-optimal behavior at test time, even when no single training trajectory was close to optimal. The combination of hindsight return conditioning and the transformer's ability to attend over the full trajectory enables effective policy extraction from random data.
The following table summarizes how the random policy compares to other common policy types in reinforcement learning:
| Policy Type | Action Selection | Adapts Over Time | Exploration | Typical Use |
|---|---|---|---|---|
| Random (uniform) | Equal probability for all actions | No | Maximum (uniform) | Baselines, initial exploration, MCTS rollouts |
| Greedy | Always selects highest-value action | No (given fixed Q) | None | Evaluation, deployment |
| Epsilon-greedy | Greedy with probability 1-ε, random with probability ε | ε may decay | Controlled by ε | Q-learning, DQN training |
| Softmax (Boltzmann) | Actions weighted by Q-values via temperature | Temperature may decay | Value-sensitive | Policy gradient warm-up, bandit problems |
| Learned stochastic | Neural network outputs action distribution | Yes (via gradient updates) | Learned | Policy gradient methods (PPO, A2C, SAC) |
| Deterministic learned | Neural network outputs single action | Yes (via gradient updates) | Requires external noise | DDPG, TD3 |
The key distinction is that the random policy is the only type that makes no use of state information or feedback. Every other policy type leverages value estimates, learned parameters, or both to make informed action choices.
The random policy has several fundamental limitations that prevent it from being useful as anything other than a baseline or temporary strategy:
No reward maximization: The random policy ignores the reward signal entirely. It cannot distinguish between high-reward and low-reward actions, so its expected return reflects the average outcome rather than the best achievable outcome.
No convergence to optimality: Because it does not learn or update, the random policy cannot converge to the optimal policy. No matter how many episodes are run, the policy remains unchanged.
Inefficient exploration in large spaces: While the random policy does explore, its exploration is undirected. In environments with large state or action spaces, the probability of randomly discovering reward-rich regions decreases rapidly. Sparse reward environments are particularly problematic; the random agent may never encounter a nonzero reward signal.
High variance importance sampling: When random policy data is used for off-policy evaluation or learning via importance sampling, the resulting estimates have high variance because the random policy assigns equal probability to actions that the target policy may strongly prefer or avoid.
No temporal coherence: Random action selection produces erratic behavior with no consistency across time steps. In environments where achieving goals requires coordinated sequences of actions (such as navigating a maze or manipulating objects), random policies fail because each action is chosen independently.
Poor scaling with action space size: As |A| grows, the probability of selecting any particular good action falls as 1/|A|. In continuous action spaces, a uniform random policy over the full range is unlikely to produce any action close to optimal, and the probability mass near optimal actions approaches zero.
A random policy is straightforward to implement. The following Python example shows a random agent interacting with a Gymnasium (formerly OpenAI Gym) environment:
```python
import gymnasium as gym

def run_random_policy(env_name, num_episodes=100):
    """Run a random policy and return average episode reward."""
    env = gym.make(env_name)
    total_rewards = []
    for episode in range(num_episodes):
        obs, info = env.reset()
        episode_reward = 0
        terminated = truncated = False
        while not (terminated or truncated):
            # Random policy: sample uniformly from action space
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
        total_rewards.append(episode_reward)
    avg_reward = sum(total_rewards) / len(total_rewards)
    print(f"Average reward over {num_episodes} episodes: {avg_reward:.2f}")
    env.close()
    return avg_reward

# Example usage
run_random_policy("CartPole-v1")
```
For environments with continuous action spaces, env.action_space.sample() draws from a uniform distribution over the valid action range. For discrete action spaces, it selects an integer uniformly at random from {0, 1, ..., |A|-1}.
Imagine you are playing a board game, but instead of thinking about your moves, you just roll a die to decide what to do each turn. You do not look at the board, you do not remember what happened before, and you do not try to win. You just pick a random move every single time.
That is what a random policy does. It is like a robot playing a game with its eyes closed, choosing buttons to press completely at random. It will not play well, but scientists use it as a way to measure the very worst you could do. If a smart robot cannot beat the random one, something is wrong with the smart robot, not the game.