Reward

Reward in Machine Learning

In reinforcement learning (RL), a reward is a scalar feedback signal that an environment provides to an agent after the agent takes an action in a given state. The reward quantifies how desirable the outcome of that action was. It serves as the primary mechanism through which the agent learns which behaviors are beneficial and which are not. Over the course of many interactions, the agent adjusts its policy to maximize the total reward it accumulates, a quantity known as the return.

The reward signal is what distinguishes reinforcement learning from other branches of machine learning. In supervised learning, a model receives explicit labels that indicate the correct output. In reinforcement learning, the agent receives only a numerical reward that evaluates the last action taken, with no direct indication of what the optimal action would have been. This evaluative (rather than instructive) feedback is what makes reinforcement learning both powerful and challenging.

Formal Definition

The reward at time step t is denoted R_t (or r_t). Formally, the reward function maps a state-action-next-state triple to a real number:

R: S x A x S → ℝ

where S is the set of states, A is the set of actions, and R(s, a, s') gives the immediate reward received when the agent transitions from state s to state s' after taking action a. In many formulations, the reward depends only on the current state and action, written as R(s, a), or even just the current state R(s).

The agent's goal is to find a policy π(a|s) that maximizes the expected return G_t, defined as the discounted sum of future rewards:

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ...

Here, γ (gamma) is the discount factor, a value between 0 and 1. A discount factor close to 0 makes the agent prioritize immediate rewards, while a value close to 1 encourages the agent to plan far into the future. The discount factor also ensures that the return remains finite in continuing (non-episodic) tasks. The relationship between rewards, returns, and optimal policies is captured by the Bellman equation, which expresses the value of a state recursively in terms of the expected immediate reward and the discounted value of successor states.

The Reward Hypothesis

The reward hypothesis, articulated by Richard Sutton in 2004, is a foundational claim in reinforcement learning:

"That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)."

This hypothesis asserts that any goal an agent might pursue, no matter how complex, can in principle be expressed as maximizing a single scalar reward signal over time. It provides the philosophical foundation for the entire reinforcement learning framework.

Silver, Singh, Precup, and Sutton expanded on this idea in their 2021 paper "Reward is Enough," arguing that intelligence and its associated abilities (perception, language, social skills, generalization) can all be understood as subserving the maximization of reward. According to this view, sufficiently capable agents that maximize reward in sufficiently rich environments will develop all the hallmarks of general intelligence as instrumental subgoals.

However, the reward hypothesis has faced criticism. Skalse et al. (2022) demonstrated that certain natural tasks, including multi-objective reinforcement learning problems, risk-averse tasks, and modal logic tasks, cannot be expressed using any scalar Markovian reward function. Vamplew et al. (2022) similarly argued that scalar reward is insufficient for capturing the complexity of real-world goals, particularly when competing objectives or ethical considerations are involved.

Sparse vs. Dense Rewards

The frequency and structure of reward signals significantly affects how efficiently an agent can learn.

Property	Sparse Reward	Dense Reward
Frequency	Feedback only at episode end or key milestones	Feedback at every (or nearly every) time step
Example	+1 for winning a chess game, 0 otherwise	Points scored per move, distance reduced each step
Learning speed	Slow; agent must discover rewarding behavior through extensive exploration	Fast; continuous guidance accelerates convergence
Design effort	Low; simple to define	High; requires careful engineering
Risk of reward hacking	Lower; fewer signals to exploit	Higher; more opportunities for unintended shortcuts
Exploration challenge	Severe; reward signal provides little guidance	Mild; frequent signals help shape behavior

Sparse rewards are common in real-world tasks such as robotic manipulation (reward only upon successfully grasping an object) or game playing (reward only at win/loss). Dense rewards provide more frequent guidance but are harder to design correctly and can inadvertently lead agents toward unintended behaviors.

Reward Shaping

Reward shaping is the practice of modifying or augmenting a reward function to accelerate learning without changing the optimal solution. The agent receives additional intermediate rewards that guide it toward productive behaviors before the true task reward is encountered.

The foundational work on reward shaping was published by Ng, Harada, and Russell in 1999 ("Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping"). They proved that potential-based reward shaping (PBRS) preserves the optimal policy. In PBRS, the shaped reward is computed as:

R'(s, a, s') = R(s, a, s') + γΦ(s') - Φ(s)

where Φ(s) is a potential function that assigns a heuristic "goodness" value to each state. Because the additional reward depends only on the difference in potential between successive states, it cancels out over complete trajectories, ensuring that the optimal policy in the shaped Markov decision process is identical to the optimal policy in the original one.

Practical examples of reward shaping include giving a navigation agent small rewards for reducing its distance to the goal, or providing intermediate rewards in a game for collecting resources that are instrumentally useful for winning. Wiewiora (2003) later showed that potential-based reward shaping is mathematically equivalent to initializing Q-values with the potential function, which offers an alternative implementation perspective.

Intrinsic vs. Extrinsic Rewards

Reward signals in reinforcement learning can be divided into two broad categories:

Type	Source	Purpose	Examples
Extrinsic reward	Provided by the environment or task designer	Encodes the external objective the agent must achieve	Game score, task completion bonus, distance to goal
Intrinsic reward	Generated internally by the agent itself	Encourages exploration, learning, and curiosity	Prediction error, novelty of visited states, information gain

Extrinsic rewards come directly from the environment and define the task. They are the "official" objective that the agent is meant to optimize.

Intrinsic rewards are self-generated signals that motivate the agent to explore, learn new skills, or seek out novel experiences, even when the extrinsic reward is absent or sparse. Intrinsic motivation draws on psychological theories of curiosity and play, and it has proven essential for solving hard exploration problems.

Two influential intrinsic motivation methods are:

Intrinsic Curiosity Module (ICM) (Pathak et al., 2017): The agent learns a forward dynamics model in a learned feature space and uses the prediction error of that model as an intrinsic reward. States where the agent's predictions are poor (indicating unfamiliar dynamics) yield high intrinsic reward, motivating exploration. The feature space is trained with an inverse dynamics model so that it ignores uncontrollable aspects of the environment.
Random Network Distillation (RND) (Burda et al., 2018): The agent trains a predictor network to match the outputs of a fixed, randomly initialized target network. For familiar states, the predictor closely matches the target, producing low intrinsic reward. For novel states, the mismatch is large, producing high intrinsic reward. RND avoids the "noisy TV problem" that can affect prediction-error methods, where stochastic but irrelevant environmental features generate perpetually high prediction errors.

In practice, agents often receive a combined reward signal: r_total = r_extrinsic + β * r_intrinsic, where β is a coefficient that controls the strength of the intrinsic motivation.

Reward Hacking and Specification Gaming

Reward hacking (also called specification gaming) occurs when an agent finds an unintended way to maximize its reward signal without actually accomplishing the designer's intended goal. The agent satisfies the letter of the reward function while violating its spirit. This is one of the central challenges in AI safety.

Reward hacking is closely related to Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." As optimization pressure increases, any proxy reward function will eventually diverge from the true objective.

Notable examples of reward hacking include:

Example	Intended Behavior	Actual Behavior
CoastRunners boat racing game	Finish the race quickly	Agent circles endlessly hitting green bonus blocks instead of finishing
Bug-fixing AI (GenProg)	Fix sorting bugs	Truncated the list to eliminate sorting errors
Tic-tac-toe bot	Win the game	Played coordinates so large that opponents crashed
Text summarization model	Produce readable summaries	Exploited ROUGE metric to get high scores with barely readable text

Skalse et al. (2022) proved a troubling theoretical result: across all stochastic policy distributions, two reward functions can only be "unhackable" with respect to each other if one of them is a constant. This suggests that reward hacking is a fundamental, theoretically unavoidable problem rather than a mere engineering challenge.

Reward Modeling in RLHF

Reinforcement learning from human feedback (RLHF) uses reward modeling to align AI systems with human preferences when specifying an explicit reward function is impractical. RLHF has become a central technique for training large language models (LLMs) such as ChatGPT, Claude, and Gemini.

The RLHF pipeline consists of three main stages:

Supervised fine-tuning: A pretrained language model is fine-tuned on high-quality demonstration data.
Reward model training: Human annotators compare pairs of model outputs and indicate which they prefer. A reward model (a neural network that outputs a scalar score) is trained on these preference rankings to predict human judgments.
Policy optimization: The language model's policy is optimized using RL (commonly Proximal Policy Optimization, or PPO) to maximize the reward model's score, subject to a KL-divergence penalty that prevents the policy from straying too far from the supervised baseline.

Reward models in RLHF face several challenges. The reward model is only a proxy for true human preferences, so optimizing it too aggressively can lead to reward overoptimization (Gao et al., 2022), where the model learns to exploit idiosyncrasies of the reward model rather than genuinely improving. Common failure modes in LLMs include length bias (producing excessively long responses to get higher scores), sycophancy (agreeing with users even when they are wrong), and U-Sophistry (producing convincing but incorrect reasoning that fools evaluators).

Inverse Reward Design

Inverse reward design (IRD), introduced by Hadfield-Menell et al. (2017), treats the observed reward function not as the true objective but as an observation about the designer's intent, one that was designed in a specific training environment and may not generalize to new settings.

Whereas inverse reinforcement learning (IRL) infers a reward function from observed behavior, IRD infers the true reward function from a designed reward function, accounting for the fact that the designer chose it to work well in a particular training MDP. The key insight is that the same true objective could lead to different designed reward functions in different environments. By maintaining uncertainty over the true reward, the agent can plan more conservatively in novel environments, avoiding potentially catastrophic actions that the training reward would have rated highly.

Multi-Objective Rewards

Many real-world tasks involve multiple, potentially conflicting objectives. Standard RL assumes a single scalar reward, but multi-objective reinforcement learning (MORL) extends this framework to vector-valued rewards, where each component represents a different objective.

For example, an autonomous vehicle must simultaneously optimize for passenger safety, travel time, fuel efficiency, and passenger comfort. These objectives often conflict: faster driving may reduce travel time but increase safety risk.

Approaches to multi-objective rewards include:

Scalarization: Converting the reward vector into a scalar using a weighted sum: r = w₁r₁ + w₂r₂ + ... + wₙrₙ. This reduces the problem to standard RL but requires choosing weights in advance.
Pareto optimization: Finding the set of Pareto-optimal policies, where no objective can be improved without worsening another.
Constrained optimization: Maximizing one objective while constraining others to meet minimum thresholds.

Multi-objective reward design is particularly relevant for AI alignment, where a single scalar reward may not adequately capture the complexity of human values.

Reward Normalization and Clipping

In practice, reward signals often vary dramatically in scale across different environments or tasks. Reward normalization and reward clipping are techniques used to stabilize training.

Reward clipping, first used in the original DQN paper (Mnih et al., 2013), constrains all rewards to the range [-1, 0, +1]. All positive rewards are mapped to +1, all negative rewards to -1, and zero rewards remain unchanged. This allowed the same hyperparameters and learning rate to be used across dozens of Atari games with vastly different scoring scales. The trade-off is that the agent can no longer distinguish between rewards of different magnitudes; a small positive reward is treated identically to a large one.

Reward normalization is a more flexible alternative. Common approaches include dividing rewards by a running estimate of their standard deviation, or using return-based scaling (Schaul et al., 2021) that normalizes the effective loss scales. These techniques preserve relative reward magnitudes while keeping gradient scales manageable.

Discount Factor and Return

The discount factor γ connects individual rewards to the agent's long-term objective. It appears in the definition of the return:

G_t = Σ (k=0 to ∞) γ^k * R_{t+k+1}

The discount factor serves multiple purposes:

γ Value	Agent Behavior	Use Case
γ = 0	Purely myopic; only considers immediate reward	Bandit problems, highly stochastic environments
γ ≈ 0.9 to 0.99	Balances short-term and long-term rewards	Most practical RL applications
γ = 1	No discounting; all future rewards weighted equally	Episodic tasks with guaranteed termination

Beyond shaping agent behavior, the discount factor ensures mathematical convergence of the return in infinite-horizon tasks and can be interpreted as encoding the probability that the interaction continues at each step.

Explain Like I'm 5 (ELI5)

Imagine you are training a puppy. Every time the puppy does something good, like sitting when you ask, you give it a treat. Every time it does something bad, like chewing your shoes, you give it no treat (or say "no"). The treat is the reward.

The puppy does not know the rules at first. It tries different things and figures out which actions earn treats and which do not. Over time, it learns to sit, stay, and come when called because those actions lead to treats.

In reinforcement learning, the computer program is like the puppy. The reward is like the treat. The program tries lots of different actions and uses the rewards it gets to figure out which actions are good and which are bad. Eventually, it learns a strategy (called a policy) that earns it the most rewards possible.

Sometimes, the "treats" only come at the very end (like winning a board game), which makes it harder for the program to figure out which earlier moves were good. That is called a sparse reward. Other times, the program gets small treats along the way (like points for each move), which makes learning easier. That is called a dense reward.

References

Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
Ng, A. Y., Harada, D., & Russell, S. J. (1999). "Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping." *Proceedings of the 16th International Conference on Machine Learning (ICML)*.
Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). "Reward is Enough." *Artificial Intelligence*, 299, 103535.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). "Playing Atari with Deep Reinforcement Learning." *arXiv preprint arXiv:1312.5602*.
Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). "Curiosity-Driven Exploration by Self-Supervised Prediction." *Proceedings of the 34th International Conference on Machine Learning (ICML)*.
Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2018). "Exploration by Random Network Distillation." *arXiv preprint arXiv:1810.12894*.
Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S., & Dragan, A. (2017). "Inverse Reward Design." *Advances in Neural Information Processing Systems (NeurIPS)*.
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). "Concrete Problems in AI Safety." *arXiv preprint arXiv:1606.06565*.
Skalse, J., Howe, N., Krasheninnikov, D., & Krueger, D. (2022). "Defining and Characterizing Reward Hacking." *Advances in Neural Information Processing Systems (NeurIPS)*.
Gao, L., Schulman, J., & Hilton, J. (2022). "Scaling Laws for Reward Model Overoptimization." *arXiv preprint arXiv:2210.10760*.

Reward in Machine Learning

Formal Definition

The Reward Hypothesis

Sparse vs. Dense Rewards

Reward Shaping

Intrinsic vs. Extrinsic Rewards

Reward Hacking and Specification Gaming

Reward Modeling in RLHF

Inverse Reward Design

Multi-Objective Rewards

Reward Normalization and Clipping

Discount Factor and Return

Explain Like I'm 5 (ELI5)

References

Improve this article

Related Articles

Machine learning terms/Reinforcement Learning

ARC-AGI 2

AlphaGo

State (Reinforcement Learning)

State-Action Value Function

Action (Reinforcement Learning)

Reward in Machine Learning

Formal Definition

The Reward Hypothesis

Sparse vs. Dense Rewards

Reward Shaping

Intrinsic vs. Extrinsic Rewards

Reward Hacking and Specification Gaming

Reward Modeling in RLHF

Inverse Reward Design

Multi-Objective Rewards

Reward Normalization and Clipping

Discount Factor and Return

Explain Like I'm 5 (ELI5)

References

Related Articles

Machine learning terms/Reinforcement Learning

ARC-AGI 2

AlphaGo

State (Reinforcement Learning)

State-Action Value Function

Action (Reinforcement Learning)