See also: Reinforcement Learning, Policy, Epsilon Greedy Policy, Q-Learning
A random policy (also called a uniform random policy) is a policy in reinforcement learning (RL) that selects actions with equal probability, regardless of the current state, history, or any learned value estimates. Formally, given a finite action space A, the random policy assigns probability 1/|A| to every action in every state. The agent following a random policy makes no attempt to maximize reward; it simply explores the environment by choosing actions uniformly at random.
Despite its simplicity, the random policy plays several foundational roles in RL research and practice. It serves as a lower-bound baseline for evaluating learned algorithms, a default rollout mechanism in Monte Carlo tree search, a data-collection strategy for off-policy learning, and the implicit starting point of many algorithms that initialize policy parameters randomly. Understanding the random policy, its mathematical properties, and its limitations provides a useful reference point for studying how more sophisticated policies improve upon undirected exploration.
In the standard RL framework, an agent interacts with a Markov decision process (MDP) defined by a tuple (S, A, P, R, γ), where S is the state space, A is the action space, P is the transition function, R is the reward function, and γ is the discount factor. A stochastic policy π maps each state to a probability distribution over actions: π(a|s) = Pr(A_t = a | S_t = s).
A random policy π_rand is the special case where the distribution is uniform across all available actions in every state:
π_rand(a|s) = 1 / |A(s)| for all a ∈ A(s), for all s ∈ S
Here |A(s)| denotes the number of available actions in state s. When the action space is the same in every state, this simplifies to π_rand(a|s) = 1 / |A| for all a and s.
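As a concrete illustration, a uniform random policy over a discrete action space can be written in a few lines. The following is a minimal sketch using NumPy; the `state` argument is accepted only to emphasize that it is ignored:

```python
import numpy as np

def random_policy(state, n_actions, rng=None):
    """Uniform random policy: the state is ignored entirely."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.full(n_actions, 1.0 / n_actions)   # pi_rand(a|s) = 1 / |A|
    action = int(rng.choice(n_actions, p=probs))
    return action, probs

# Every action has probability 1/4, regardless of the state passed in.
action, probs = random_policy(state=None, n_actions=4)
```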
Key properties of the random policy include:
State-independence: the action distribution is identical in every state; the policy conditions on nothing.
Stationarity: the policy never changes, no matter how much experience is gathered.
Maximum entropy: the uniform distribution has the highest possible entropy over a finite action set, so no policy spreads its choices more evenly.
No learned components: it requires no value estimates, parameters, or memory to execute.
Although the random policy does not attempt to maximize reward, it still induces a well-defined value function. The state-value function under the random policy is:
V^π_rand(s) = E_π_rand [ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s ]
This represents the expected discounted return when starting in state s and choosing all future actions uniformly at random. Similarly, the action-value function (Q-function) under the random policy is:
Q^π_rand(s, a) = E_π_rand [ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s, A_0 = a ]
The value function under the random policy is generally far from optimal. Since the policy ignores the reward signal entirely, the expected return reflects the average outcome over all possible action sequences rather than the best achievable outcome. For any optimal policy π*, we have V^π*(s) ≥ V^π_rand(s) for all states s, with equality only in trivial environments where all action sequences yield identical returns.
Computing V^π_rand analytically is possible for small, tabular MDPs using the Bellman equation:
V^π_rand(s) = Σ_a (1/|A|) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π_rand(s')]
This linear system can be solved directly, or iteratively through policy evaluation. The value function under the random policy is used as a starting point in the policy iteration algorithm, where the process begins with an arbitrary (often random) policy, evaluates it, and then improves it by acting greedily with respect to the current value function.
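For example, iterative policy evaluation for the random policy can be written compactly. The sketch below assumes the MDP is supplied as dense NumPy arrays, an illustrative convention rather than a standard API, and the example MDP at the end uses hypothetical numbers:

```python
import numpy as np

def evaluate_random_policy(P, R, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation for the uniform random policy.

    P: (S, A, S) array of transition probabilities P(s'|s,a)
    R: (S, A, S) array of rewards R(s, a, s')
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # One-step Bellman backup: Q(s,a) = sum_s' P(s'|s,a) [R + gamma V(s')]
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        V_new = Q.mean(axis=1)        # uniform weight 1/|A| on every action
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Hypothetical 2-state, 2-action MDP with reward 1 on every transition:
# the discounted return is 1/(1 - gamma) = 10 in every state, under any policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.ones_like(P)
print(evaluate_random_policy(P, R))   # ~[10. 10.]
```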
The most common use of the random policy is as a performance baseline. If a trained RL algorithm cannot consistently outperform a random policy, this is strong evidence that the algorithm is not learning effectively, that the environment has no exploitable structure, or that there is a bug in the implementation.
This practice is widespread in the RL literature. In their landmark 2015 paper on Deep Q-Networks (DQN), Mnih et al. used random agent scores to establish the lower bound of their evaluation metric. The human-normalized score (HNS) used across Atari 2600 benchmarks is defined as:
HNS = 100 × (agent score - random score) / (human score - random score)
Under this formula, a random agent scores 0% and a human player scores 100%. The random agent in the DQN evaluation selected a random action every 6 frames (10 Hz), which avoids inflated baseline scores in certain games. DQN achieved above 75% of human performance in 29 out of 49 games tested, measured against this random baseline.
The following table shows example Atari game scores for random, DQN, and human players, illustrating how the random baseline anchors evaluation:
| Game | Random Agent Score | DQN Score | Human Score | DQN HNS (%) |
|---|---|---|---|---|
| Breakout | 1.7 | 401.2 | 31.8 | ~1,327 |
| Pong | -20.7 | 18.9 | 9.3 | ~132 |
| Seaquest | 68.4 | 5,286 | 20,182 | ~26 |
| Montezuma's Revenge | 0.0 | 0.0 | 4,753 | 0 |
The table highlights that DQN vastly exceeded human performance in some games (Breakout), matched humans in others (Pong), and completely failed in sparse-reward games like Montezuma's Revenge, where it scored no better than the random agent.
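The normalization itself is a one-line computation; a minimal sketch using values from the table above:

```python
def human_normalized_score(agent_score, random_score, human_score):
    """HNS = 100 * (agent - random) / (human - random)."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Breakout and Montezuma's Revenge rows from the table above.
print(human_normalized_score(401.2, 1.7, 31.8))      # ~1327
print(human_normalized_score(0.0, 0.0, 4753.0))      # 0.0
```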
Before an agent has any knowledge of the environment, random action selection is a natural starting strategy. Many RL algorithms begin training with a policy that is effectively random, either by initializing neural network weights randomly (which typically yields near-uniform action probabilities) or by setting the exploration parameter to its maximum value.
In Deep Q-Networks, for example, the epsilon-greedy policy starts with ε = 1.0, meaning the agent selects every action uniformly at random during the first phase of training. This initial random exploration fills the experience replay buffer with diverse transitions, providing a broad sampling of the state-action space that stabilizes subsequent learning. The epsilon value is then annealed over the course of training (typically to 0.1 or 0.01), gradually shifting from random exploration to informed exploitation.
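A minimal sketch of this schedule, using a linear anneal with illustrative step counts rather than the exact values from any particular implementation:

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from eps_start down to eps_end."""
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def epsilon_greedy_action(q_values, step, n_actions):
    """Uniformly random with probability epsilon, otherwise greedy."""
    if random.random() < epsilon_by_step(step):
        return random.randrange(n_actions)                    # random branch
    return max(range(n_actions), key=lambda a: q_values[a])   # greedy branch
```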
The Deep Deterministic Policy Gradient (DDPG) algorithm uses a similar approach for continuous action spaces: for a fixed number of initial steps, the agent samples actions from a uniform random distribution over the valid action range before switching to its learned policy augmented with noise.
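In code, the warm-up amounts to a single branch. The sketch below assumes a Gymnasium environment with a Box action space; `policy`, `exploration_noise`, and the step threshold are hypothetical placeholders, not values from the original algorithm:

```python
# Hypothetical warm-up phase: `policy` and `exploration_noise` are placeholders,
# and START_STEPS is an illustrative value rather than one from the paper.
START_STEPS = 10_000

def select_action(env, obs, step, policy, exploration_noise):
    if step < START_STEPS:
        return env.action_space.sample()        # uniform over the valid range
    return policy(obs) + exploration_noise()    # learned policy plus noise
```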
Classical dynamic programming algorithms for solving MDPs typically begin with a random or arbitrary policy. In policy iteration, the algorithm alternates between two steps: policy evaluation, which computes the value function of the current policy, and policy improvement, which constructs a new policy that acts greedily with respect to that value function.
Starting from the random policy, the first improvement step produces a policy that selects the action maximizing expected immediate reward plus the discounted value of successor states, based on the value function of the random policy. Each subsequent iteration produces a strictly better policy (unless the current policy is already optimal). Because the number of deterministic policies is finite (|A|^|S| for a tabular MDP), this process is guaranteed to converge to the optimal policy π* in a finite number of iterations.
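Continuing the evaluation sketch above, the improvement step that acts greedily with respect to V^π_rand can be written as:

```python
import numpy as np

def greedy_improvement(P, R, V, gamma=0.9):
    """One policy-improvement step: act greedily with respect to V."""
    Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
    return Q.argmax(axis=1)     # deterministic policy: one action per state
```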
Value iteration, a closely related method, can also be understood as starting from a value function initialized to zero (or arbitrarily), which encodes no information about which actions are better, much as the random policy expresses no preference among them.
Monte Carlo tree search (MCTS) is a planning algorithm widely used in game-playing systems, including the AlphaGo program that defeated the world champion in Go. MCTS builds a search tree incrementally through four repeated steps: selection, expansion, simulation, and backpropagation.
During the simulation step (also called a "rollout" or "playout"), the algorithm plays out the remainder of the game from a leaf node to estimate the value of that position. In the simplest form, this rollout uses a random policy: actions are chosen uniformly at random until the game reaches a terminal state. The outcome (win, loss, or draw) is then propagated back up the tree to update value estimates.
This type of rollout is called a light playout. It requires no domain knowledge beyond the rules of the game, making it easy to implement. A heavy playout, by contrast, uses heuristics or learned policies to bias move selection during simulation, often improving the quality of value estimates at the cost of additional computation.
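A light playout is only a few lines of code. The sketch below assumes a hypothetical game-state object exposing `copy()`, `is_terminal()`, `legal_moves()`, `play()`, and `result()` methods; the interface is an assumption for illustration, not a standard API:

```python
import random

def light_playout(game_state):
    """Play uniformly random moves to the end of the game; return the outcome."""
    state = game_state.copy()
    while not state.is_terminal():
        move = random.choice(state.legal_moves())   # random rollout policy
        state.play(move)
    return state.result()   # e.g. +1 win, 0 draw, -1 loss, backed up the tree
```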
An interesting empirical finding in the MCTS literature is that random rollout policies sometimes perform surprisingly well, and in certain games, deliberately playing slightly suboptimal moves during simulation can actually strengthen the overall tree search. This occurs because the value of a position under perfect play and under random play can be correlated, even if the absolute scores differ.
In off-policy reinforcement learning, the agent learns about a target policy while collecting data using a different behavior policy. The random policy is sometimes used as the behavior policy because it guarantees broad coverage of the state-action space, ensuring that no action is systematically excluded from the training data.
When a random behavior policy is used with off-policy methods like Q-learning, the agent can still learn the optimal policy because Q-learning's update rule uses the maximum Q-value of the next state (the greedy target policy) regardless of which action was actually taken. The behavior policy only needs to visit all state-action pairs with sufficient frequency, a condition that the random policy satisfies in ergodic or communicating MDPs.
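As a sketch, tabular Q-learning driven entirely by a random behavior policy might look like the following, assuming a discrete Gymnasium environment such as FrozenLake-v1; the hyperparameters are illustrative:

```python
import numpy as np
import gymnasium as gym

def q_learning_random_behavior(env_name="FrozenLake-v1", episodes=5000,
                               alpha=0.1, gamma=0.99):
    """Off-policy Q-learning in which every action is chosen uniformly at random."""
    env = gym.make(env_name)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = env.action_space.sample()                      # random behavior policy
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Greedy target policy: bootstrap from max_a' Q(s', a')
            target = r + gamma * (0.0 if terminated else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    env.close()
    return Q
```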
For Monte Carlo off-policy methods, learning from random behavior policy data requires importance sampling to correct for the mismatch between the behavior policy and the target policy. The importance sampling ratio for a trajectory under the random policy versus a target policy π is:
ρ = Π_{t=0}^{T-1} π(A_t|S_t) / π_rand(A_t|S_t) = |A|^T × Π_{t=0}^{T-1} π(A_t|S_t)
Because the random policy assigns equal probability to all actions, these ratios can become very large for trajectories that the target policy strongly prefers, leading to high variance in the importance-weighted estimates. This variance issue is a practical limitation of using random behavior policies with importance-sampling-based methods, and weighted importance sampling is often used to mitigate it.
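A small sketch makes the growth of these ratios concrete; the numbers are illustrative:

```python
def importance_ratio(target_probs, n_actions):
    """Product of pi(A_t|S_t) / pi_rand(A_t|S_t), with pi_rand = 1/|A|."""
    rho = 1.0
    for p in target_probs:              # pi(A_t|S_t) for the actions taken
        rho *= p * n_actions
    return rho

# A deterministic target policy (probability 1 at every step) with |A| = 4:
# after 10 steps the ratio is 4**10 = 1,048,576.
print(importance_ratio([1.0] * 10, n_actions=4))
```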
The multi-armed bandit problem is a simplified RL setting with a single state and k actions (arms). Each arm produces rewards from an unknown distribution, and the goal is to maximize total reward over T rounds.
A random (uniform) policy in the bandit setting selects each arm with probability 1/k at every round. This strategy is sometimes called "explore-first" when used over an initial phase, or "uniform exploration" when applied throughout.
The performance of a bandit algorithm is typically measured by its regret, defined as the difference between the total reward of the optimal arm and the total reward actually received. For a uniform random policy over T rounds:
Expected regret ≈ T × (μ* - μ_avg)
where μ* is the mean reward of the best arm and μ_avg is the average mean reward across all arms. This regret grows linearly with T, which is far worse than the logarithmic regret O(log T) achieved by algorithms like UCB (Upper Confidence Bound) or Thompson Sampling.
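The linear growth is easy to check numerically; a minimal sketch with hypothetical arm means:

```python
def uniform_policy_regret(arm_means, T):
    """Expected regret of uniform random play: T * (mu_star - mu_avg)."""
    mu_star = max(arm_means)
    mu_avg = sum(arm_means) / len(arm_means)
    return T * (mu_star - mu_avg)

# Hypothetical 3-armed bandit: doubling the horizon doubles the regret.
print(uniform_policy_regret([0.2, 0.5, 0.8], T=1000))   # 300.0
print(uniform_policy_regret([0.2, 0.5, 0.8], T=2000))   # 600.0
```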
The following table compares regret bounds for common bandit strategies:
| Strategy | Regret Growth | Description |
|---|---|---|
| Uniform random | O(T) | Selects each arm with equal probability; linear regret |
| Epsilon-greedy (fixed ε) | O(T) | Greedy with probability 1-ε, random with probability ε |
| Epsilon-greedy (decaying ε) | O(log T) under conditions | Reduces exploration over time |
| UCB1 | O(log T) | Optimism in the face of uncertainty |
| Thompson Sampling | O(log T) | Bayesian posterior sampling |
| Exp3 (adversarial) | O(√(KT log K)) | Mixes exponential weights with uniform exploration |
The linear regret of the uniform random policy demonstrates that undirected exploration is fundamentally inefficient. Algorithms that direct exploration toward uncertain or promising actions (like UCB) achieve exponentially better scaling.
An important line of research uses random policies to assess the inherent difficulty of RL benchmark tasks. If a random agent achieves reasonable performance on a benchmark, this suggests the benchmark may not adequately test learning ability.
Raghu et al. (2020) proposed a random weight guessing (RWG) method that generates policy networks with randomly sampled parameters and evaluates them directly on standard benchmarks, without any training. By studying the score distribution of these untrained networks, they assessed the intrinsic complexity of each task. Their analysis revealed that several popular OpenAI Gym environments, particularly classic control tasks, are simpler than commonly assumed, since random networks can occasionally solve them.
Mania, Guy, and Recht (2018) demonstrated in their NeurIPS paper that a simple random search over static linear policies is competitive with state-of-the-art RL algorithms on MuJoCo locomotion benchmarks. Their method, called Augmented Random Search (ARS), perturbs policy parameters with random noise and uses finite-difference estimates to update the policy. ARS matched or exceeded the sample efficiency of methods like TRPO, PPO, and A2C while being at least 15 times faster. This result challenged the assumption that sophisticated gradient-based policy optimization is necessary for standard continuous control tasks and highlighted that random search in policy space can be surprisingly effective.
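The core update behind such methods can be sketched in a few lines. The following is a simplified basic-random-search step in the spirit of ARS, not the authors' implementation; it omits the state normalization, top-performing-direction selection, and reward scaling of the full algorithm, and the `rollout` function and hyperparameters are illustrative placeholders:

```python
import numpy as np

def random_search_step(theta, rollout, step_size=0.02, noise_std=0.03, n_dirs=8):
    """One basic random-search update over linear policy parameters theta.

    `rollout(theta)` is assumed to return the episode return of the linear
    policy a = theta @ s; the function and all hyperparameters here are
    illustrative placeholders, not values from the ARS paper.
    """
    grad = np.zeros_like(theta)
    for _ in range(n_dirs):
        delta = noise_std * np.random.randn(*theta.shape)    # random perturbation
        reward_plus = rollout(theta + delta)
        reward_minus = rollout(theta - delta)
        grad += (reward_plus - reward_minus) * delta          # finite-difference estimate
    return theta + step_size * grad / n_dirs
```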
Recent work has explored using random policy data for in-context reinforcement learning (ICRL), where a pretrained transformer model learns to perform RL by conditioning on interaction histories provided in its context window.
Prior ICRL methods like Algorithm Distillation and Decision Pretrained Transformer required high-quality behavior policies (often near-optimal) in the pretraining data. Chen et al. (2024) introduced State-Action Distillation (SAD), which generates effective pretraining datasets using only random policies. SAD selects informative state-action pairs from random trajectories within a bounded "trust horizon" and distills them into training examples. In experiments, SAD outperformed the best baseline by 236.3% in offline evaluation and 135.2% in online evaluation across multiple benchmark environments.
This result is complemented by earlier work on the Decision Transformer (Chen et al., 2021), which showed that a transformer trained on random walk trajectories can learn to stitch together subsequences from different training episodes to produce near-optimal behavior at test time, even when no single training trajectory was close to optimal. The combination of hindsight return conditioning and the transformer's ability to attend over the full trajectory enables effective policy extraction from random data.
The following table summarizes how the random policy compares to other common policy types in reinforcement learning:
| Policy Type | Action Selection | Adapts Over Time | Exploration | Typical Use |
|---|---|---|---|---|
| Random (uniform) | Equal probability for all actions | No | Maximum (uniform) | Baselines, initial exploration, MCTS rollouts |
| Greedy | Always selects highest-value action | No (given fixed Q) | None | Evaluation, deployment |
| Epsilon-greedy | Greedy with probability 1-ε, random with probability ε | ε may decay | Controlled by ε | Q-learning, DQN training |
| Softmax (Boltzmann) | Actions weighted by Q-values via temperature | Temperature may decay | Value-sensitive | Policy gradient warm-up, bandit problems |
| Learned stochastic | Neural network outputs action distribution | Yes (via gradient updates) | Learned | Policy gradient methods (PPO, A2C, SAC) |
| Deterministic learned | Neural network outputs single action | Yes (via gradient updates) | Requires external noise | DDPG, TD3 |
The key distinction is that the random policy is the only type that makes no use of state information or feedback. Every other policy type leverages value estimates, learned parameters, or both to make informed action choices.
The random policy has several fundamental limitations that prevent it from being useful as anything other than a baseline or temporary strategy:
No reward maximization: The random policy ignores the reward signal entirely. It cannot distinguish between high-reward and low-reward actions, so its expected return reflects the average outcome rather than the best achievable outcome.
No convergence to optimality: Because it does not learn or update, the random policy cannot converge to the optimal policy. No matter how many episodes are run, the policy remains unchanged.
Inefficient exploration in large spaces: While the random policy does explore, its exploration is undirected. In environments with large state or action spaces, the probability of randomly discovering reward-rich regions decreases rapidly. Sparse reward environments are particularly problematic; the random agent may never encounter a nonzero reward signal.
High variance importance sampling: When random policy data is used for off-policy evaluation or learning via importance sampling, the resulting estimates have high variance because the random policy assigns equal probability to actions that the target policy may strongly prefer or avoid.
No temporal coherence: Random action selection produces erratic behavior with no consistency across time steps. In environments where achieving goals requires coordinated sequences of actions (such as navigating a maze or manipulating objects), random policies fail because each action is chosen independently.
Poor scaling with action space size: As |A| grows, the probability of selecting any particular good action falls as 1/|A|. In continuous action spaces, a uniform random policy over the full range is unlikely to produce any action close to optimal, and the probability mass near optimal actions approaches zero.
A random policy is straightforward to implement. The following Python example shows a random agent interacting with a Gymnasium (formerly OpenAI Gym) environment:
```python
import gymnasium as gym

def run_random_policy(env_name, num_episodes=100):
    """Run a random policy and return average episode reward."""
    env = gym.make(env_name)
    total_rewards = []
    for episode in range(num_episodes):
        obs, info = env.reset()
        episode_reward = 0
        terminated = truncated = False
        while not (terminated or truncated):
            # Random policy: sample uniformly from action space
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
        total_rewards.append(episode_reward)
    avg_reward = sum(total_rewards) / len(total_rewards)
    print(f"Average reward over {num_episodes} episodes: {avg_reward:.2f}")
    env.close()
    return avg_reward

# Example usage
run_random_policy("CartPole-v1")
```
For environments with continuous action spaces, env.action_space.sample() draws from a uniform distribution over the valid action range. For discrete action spaces, it selects an integer uniformly at random from {0, 1, ..., |A|-1}.
Imagine you are playing a board game, but instead of thinking about your moves, you just roll a die to decide what to do each turn. You do not look at the board, you do not remember what happened before, and you do not try to win. You just pick a random move every single time.
That is what a random policy does. It is like a robot playing a game with its eyes closed, choosing buttons to press completely at random. It will not play well, but scientists use it as a way to measure the very worst you could do. If a smart robot cannot beat the random one, something is wrong with the smart robot, not the game.