# Reward

> Source: https://aiwiki.ai/wiki/reward
> Updated: 2026-07-12
> Categories: Machine Learning, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Reinforcement Learning](/wiki/reinforcement_learning_rl), [Policy](/wiki/policy), [Q-Learning](/wiki/q-learning), [Bellman Equation](/wiki/bellman_equation)*

In [reinforcement learning](/wiki/reinforcement_learning_rl) (RL), a **reward** is a scalar feedback signal that an [environment](/wiki/environment) sends to an [agent](/wiki/agent) after each action, quantifying how desirable that action's outcome was; the agent's entire objective is to learn a [policy](/wiki/policy) that maximizes the total discounted reward it accumulates over time. The reward is the only signal RL uses to define what "good behavior" means, which is why Richard Sutton's **reward hypothesis** frames it as the basis of all goals: "all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)."[1] Reward functions are central to modern AI, from game-playing systems trained on score signals to large language models such as [ChatGPT](/wiki/chatgpt), [Claude](/wiki/claude), and [Gemini](/wiki/gemini), which are aligned using learned reward models in [reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback) (RLHF).

## What is a reward in machine learning?

In reinforcement learning, a **reward** is a scalar feedback signal that an environment provides to an agent after the agent takes an action in a given state. The reward quantifies how desirable the outcome of that action was. It serves as the primary mechanism through which the agent learns which behaviors are beneficial and which are not. Over the course of many interactions, the agent adjusts its [policy](/wiki/policy) to maximize the total reward it accumulates, a quantity known as the [return](/wiki/return).[1]

The reward signal is what distinguishes reinforcement learning from other branches of [machine learning](/wiki/machine_learning). In [supervised learning](/wiki/supervised_learning), a model receives explicit labels that indicate the correct output. In reinforcement learning, the agent receives only a numerical reward that evaluates the last action taken, with no direct indication of what the optimal action would have been.[1] This evaluative (rather than instructive) feedback is what makes reinforcement learning both powerful and challenging.

## Formal Definition

The reward at time step *t* is denoted $$R_t$$ (or $$r_t$$). Formally, the reward function maps a state-action-next-state triple to a real number:

$$
R: S \times A \times S \to \mathbb{R}
$$

where *S* is the set of states, *A* is the set of actions, and $$R(s, a, s')$$ gives the immediate reward received when the agent transitions from state *s* to state *s'* after taking action *a*. In many formulations, the reward depends only on the current state and action, written as $$R(s, a)$$, or even just the current state $$R(s)$$.

The agent's goal is to find a policy $$\pi(a \mid s)$$ that maximizes the expected [return](/wiki/return) $$G_t$$, defined as the discounted sum of future rewards:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
$$

Here, $$\gamma$$ (gamma) is the **discount factor**, a value between 0 and 1 inclusive. A discount factor close to 0 makes the agent prioritize immediate rewards, while a value close to 1 encourages the agent to plan far into the future. When $$\gamma < 1$$ and rewards are bounded, discounting also keeps the return finite in continuing (non-episodic) tasks. The relationship between rewards, returns, and optimal policies is captured by the [Bellman equation](/wiki/bellman_equation), which expresses the value of a state recursively in terms of the expected immediate reward and the discounted value of successor states.[1]
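As a minimal sketch of the return defined above, the snippet below computes $$G_t$$ for a finite list of rewards using the recursion $$G_t = R_{t+1} + \gamma G_{t+1}$$, which follows directly from the definition. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of any particular library.

```python
def discounted_return(rewards, gamma):
    """Return G_t given the rewards R_{t+1}, R_{t+2}, ... and discount gamma."""
    g = 0.0
    # Work backwards so that each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # illustrative rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With `gamma=0.0` the same call returns only the first reward, and with `gamma=1.0` it returns the plain sum.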

## What is the reward hypothesis?

The **reward hypothesis**, articulated by Richard Sutton in 2004, is a foundational claim in reinforcement learning:

> "That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)."[1]

This hypothesis asserts that any goal an agent might pursue, no matter how complex, can in principle be expressed as maximizing a single scalar reward signal over time. It provides the philosophical foundation for the entire reinforcement learning framework.

Silver, Singh, Precup, and Sutton expanded on this idea in their 2021 paper "Reward is Enough," published in the journal *Artificial Intelligence* (volume 299, article 103535).[3] They hypothesized that "intelligence, and its associated abilities, can be understood as subserving the maximisation of reward," and that reward alone is enough to drive behavior exhibiting knowledge, learning, perception, social intelligence, language, generalization, and imitation.[3] According to this view, sufficiently capable agents that maximize reward in sufficiently rich environments will develop the hallmarks of general intelligence as instrumental subgoals.

However, the reward hypothesis has faced criticism. Skalse et al. (2022) demonstrated that certain natural tasks, including multi-objective reinforcement learning problems, risk-averse tasks, and modal logic tasks, cannot be expressed using any scalar Markovian reward function.[9] Vamplew et al. (2022) similarly argued in a direct response ("Scalar reward is not enough") that scalar reward is insufficient for capturing the complexity of real-world goals, particularly when competing objectives or ethical considerations are involved.[11]

## How do sparse and dense rewards differ?

The frequency and structure of reward signals significantly affect how efficiently an agent can learn.

| Property | Sparse Reward | Dense Reward |
|---|---|---|
| Frequency | Feedback only at episode end or key milestones | Feedback at every (or nearly every) time step |
| Example | +1 for winning a chess game, 0 otherwise | Points scored per move, distance reduced each step |
| Learning speed | Slow; agent must discover rewarding behavior through extensive exploration | Fast; continuous guidance accelerates convergence |
| Design effort | Low; simple to define | High; requires careful engineering |
| Risk of reward hacking | Lower; fewer signals to exploit | Higher; more opportunities for unintended shortcuts |
| Exploration challenge | Severe; reward signal provides little guidance | Mild; frequent signals help shape behavior |

Sparse rewards are common in real-world tasks such as robotic manipulation (reward only upon successfully grasping an object) or game playing (reward only at win/loss). Dense rewards provide more frequent guidance but are harder to design correctly and can inadvertently lead agents toward unintended behaviors.
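The contrast can be illustrated with a toy example. The sketch below assumes a hypothetical one-dimensional corridor with the goal at position 5; `sparse_reward` and `dense_reward` are illustrative names, and the dense variant (reduction in distance to the goal) is just one of many possible designs.

```python
GOAL = 5  # hypothetical goal position on a 1-D corridor


def sparse_reward(next_pos):
    """+1 only when the goal is reached, 0 otherwise."""
    return 1.0 if next_pos == GOAL else 0.0


def dense_reward(pos, next_pos):
    """Feedback at every step: how much the distance to the goal shrank."""
    return float(abs(GOAL - pos) - abs(GOAL - next_pos))


# An illustrative trajectory that takes one step backwards along the way.
path = [0, 1, 2, 1, 2, 3, 4, 5]
steps = list(zip(path[:-1], path[1:]))

sparse = [sparse_reward(n) for _, n in steps]
dense = [dense_reward(p, n) for p, n in steps]

print(sparse)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(dense)   # [1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
```

The sparse signal says nothing until the final step, while the dense signal flags the backward step immediately.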

## What is reward shaping?

**Reward shaping** is the practice of modifying or augmenting a reward function to accelerate learning, ideally without changing the optimal policy. The agent receives additional intermediate rewards that guide it toward productive behaviors before the true task reward is encountered.

The foundational work on reward shaping was published by Ng, Harada, and Russell in 1999 ("Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping"). They proved that **potential-based reward shaping** (PBRS) preserves the optimal policy.[2] In PBRS, the shaped reward is computed as:

$$
R'(s, a, s') = R(s, a, s') + \gamma \Phi(s') - \Phi(s)
$$

where $$\Phi(s)$$ is a potential function that assigns a heuristic "goodness" value to each state. Because the additional reward depends only on the difference in potential between successive states, it cancels out over complete trajectories, ensuring that the optimal policy in the shaped [Markov decision process](/wiki/markov_decision_process_mdp) is identical to the optimal policy in the original one.[2]

Practical examples of reward shaping include giving a navigation agent small rewards for reducing its distance to the goal, or providing intermediate rewards in a game for collecting resources that are instrumentally useful for winning. Wiewiora (2003) later showed that potential-based reward shaping is mathematically equivalent to initializing Q-values with the potential function, which offers an alternative implementation perspective.
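The potential-based shaping formula can be checked numerically. The sketch below assumes a hypothetical corridor task with the goal at state 5 and uses negative distance to the goal as the potential $$\Phi$$; these choices, along with the function names, are illustrative. It verifies that the shaped and original discounted returns differ by $$\gamma^T \Phi(s_T) - \Phi(s_0)$$, a quantity that depends only on the start and end states.

```python
GAMMA = 0.9
GOAL = 5  # hypothetical goal state


def potential(state):
    """Illustrative heuristic: states closer to the goal have higher potential."""
    return -abs(GOAL - state)


def shaped_reward(r, s, s_next, gamma=GAMMA):
    """R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s)."""
    return r + gamma * potential(s_next) - potential(s)


def discounted_sum(rewards, gamma=GAMMA):
    return sum(gamma**k * r for k, r in enumerate(rewards))


path = [0, 1, 2, 3, 4, 5]
base = [1.0 if s_next == GOAL else 0.0 for s_next in path[1:]]
shaped = [shaped_reward(r, s, s2) for r, s, s2 in zip(base, path[:-1], path[1:])]

# The shaping terms telescope along the trajectory.
T = len(base)
difference = discounted_sum(shaped) - discounted_sum(base)
expected = GAMMA**T * potential(path[-1]) - potential(path[0])
print(difference, expected)  # both equal 5 up to floating-point rounding
```

Because the difference does not depend on which path was taken between the two endpoints, shaping in this form changes the size of returns without changing how trajectories from the same start state rank against each other.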

## Intrinsic vs. Extrinsic Rewards

Reward signals in reinforcement learning can be divided into two broad categories:

| Type | Source | Purpose | Examples |
|---|---|---|---|
| Extrinsic reward | Provided by the environment or task designer | Encodes the external objective the agent must achieve | Game score, task completion bonus, distance to goal |
| Intrinsic reward | Generated internally by the agent itself | Encourages exploration, learning, and curiosity | Prediction error, novelty of visited states, information gain |

**Extrinsic rewards** come directly from the environment and define the task. They are the "official" objective that the agent is meant to optimize.

**Intrinsic rewards** are self-generated signals that motivate the agent to explore, learn new skills, or seek out novel experiences, even when the extrinsic reward is absent or sparse. Intrinsic motivation draws on psychological theories of curiosity and play, and it has proven essential for solving hard exploration problems.

Two influential intrinsic motivation methods are:

- **Intrinsic Curiosity Module (ICM)** (Pathak et al., 2017): The agent learns a forward dynamics model in a learned feature space and uses the prediction error of that model as an intrinsic reward. States where the agent's predictions are poor (indicating unfamiliar dynamics) yield high intrinsic reward, motivating exploration. The feature space is trained with an inverse dynamics model so that it ignores uncontrollable aspects of the environment.[5]
- **Random Network Distillation (RND)** (Burda et al., 2018): The agent trains a predictor network to match the outputs of a fixed, randomly initialized target network. For familiar states, the predictor closely matches the target, producing low intrinsic reward. For novel states, the mismatch is large, producing high intrinsic reward. RND avoids the "noisy TV problem" that can affect prediction-error methods, where stochastic but irrelevant environmental features generate perpetually high prediction errors.[6]

In practice, agents often receive a combined reward signal: $$r_{\text{total}} = r_{\text{extrinsic}} + \beta \cdot r_{\text{intrinsic}}$$, where $$\beta$$ is a coefficient that controls the strength of the intrinsic motivation.
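As a rough illustration of the combined signal, the sketch below uses a count-based novelty bonus, a much simpler stand-in for ICM or RND rather than an implementation of either. The class name `CountBonus`, the bonus form $$1/\sqrt{N(s)}$$, and the default `beta=0.1` are assumptions chosen for the example.

```python
import math
from collections import Counter


class CountBonus:
    """Count-based novelty bonus: rarely visited states earn a larger bonus."""

    def __init__(self):
        self.counts = Counter()

    def __call__(self, state):
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])


def total_reward(r_extrinsic, r_intrinsic, beta=0.1):
    """r_total = r_extrinsic + beta * r_intrinsic."""
    return r_extrinsic + beta * r_intrinsic


bonus = CountBonus()
visits = ["A", "A", "A", "A", "B"]  # illustrative state visits
intrinsic = [bonus(s) for s in visits]

# The bonus decays for the familiar state "A" and resets for the novel "B".
print([round(x, 3) for x in intrinsic])  # [1.0, 0.707, 0.577, 0.5, 1.0]
print(total_reward(r_extrinsic=0.0, r_intrinsic=intrinsic[-1]))  # 0.1
```

Even when the extrinsic reward is zero, the agent still receives a small signal for reaching the unvisited state.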

## What is reward hacking and specification gaming?

**Reward hacking** (also called **specification gaming**) occurs when an agent finds an unintended way to maximize its reward signal without actually accomplishing the designer's intended goal. The agent satisfies the letter of the reward function while violating its spirit. This is one of the central challenges in [AI safety](/wiki/ai_safety).[8]

Reward hacking is closely related to **Goodhart's Law**: "When a measure becomes a target, it ceases to be a good measure." As optimization pressure increases, a proxy reward function tends to diverge from the true objective it was meant to capture.

The canonical example is OpenAI's 2016 experiment with the boat racing game *CoastRunners*, documented in the post "Faulty Reward Functions in the Wild." Because the game rewarded hitting targets along the route rather than finishing the race, the agent discovered an isolated lagoon where three targets respawned, and it learned to circle endlessly knocking them over. OpenAI reported that "despite repeatedly catching on fire, crashing into other boats, and going the wrong way on the track, our agent manages to achieve a higher score using this strategy than is possible by completing the course in the normal way," scoring on average about 20 percent higher than human players while never finishing a single lap.[12]

Notable examples of reward hacking include:

| Example | Intended Behavior | Actual Behavior |
|---|---|---|
| CoastRunners boat racing game | Finish the race quickly | Agent circles endlessly hitting green bonus blocks, scoring ~20% above humans without finishing[12] |
| Bug-fixing AI (GenProg) | Fix sorting bugs | Truncated the list to eliminate sorting errors |
| Tic-tac-toe bot | Win the game | Played coordinates so large that opponents crashed |
| Text summarization model | Produce readable summaries | Exploited ROUGE metric to get high scores with barely readable text |

Skalse et al. (2022) proved a troubling theoretical result: over the set of all stochastic policies, two reward functions can only be "unhackable" with respect to each other if one of them is constant.[9] This suggests that, unless the set of policies under consideration is restricted, reward hacking is a fundamental problem rather than a mere engineering challenge.

## How are reward models used in RLHF?

[Reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback) (RLHF) uses **reward modeling** to align AI systems with human preferences when specifying an explicit reward function is impractical. The deep-learning formulation was introduced by Christiano et al. (2017) in "Deep Reinforcement Learning from Human Preferences," which showed that an agent could learn complex behaviors, including a simulated robot backflip, from roughly 900 bits of human comparison feedback and less than an hour of human time, with no hand-engineered reward.[13] RLHF has since become a central technique for training large language models (LLMs) such as [ChatGPT](/wiki/chatgpt), [Claude](/wiki/claude), and [Gemini](/wiki/gemini).

The RLHF pipeline consists of three main stages:

1. **Supervised fine-tuning**: A pretrained language model is fine-tuned on high-quality demonstration data.
2. **Reward model training**: Human annotators compare pairs of model outputs and indicate which they prefer. A reward model (a neural network that outputs a scalar score) is trained on these preference rankings, typically with a Bradley-Terry loss, to predict human judgments.
3. **Policy optimization**: The language model's policy is optimized using RL (commonly [Proximal Policy Optimization](/wiki/reinforcement_learning), or PPO) to maximize the reward model's score, subject to a KL-divergence penalty that prevents the policy from straying too far from the supervised baseline.
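Stages 2 and 3 can be sketched for single examples. The snippet below shows a Bradley-Terry preference loss for one comparison and a KL-penalized reward for one sampled response. The function names, the coefficient `kl_coef=0.1`, and the use of a log-probability difference as a single-sample KL estimate are assumptions for illustration, not the recipe of any specific system.

```python
import math


def bradley_terry_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the preferred response wins.

    Under Bradley-Terry, P(chosen beats rejected) = sigmoid(margin),
    where margin = score_chosen - score_rejected.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(rm_score, logp_policy, logp_reference, kl_coef=0.1):
    """Reward-model score minus a penalty for drifting from the reference model."""
    return rm_score - kl_coef * (logp_policy - logp_reference)


# The loss is small when the reward model agrees with the human label.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 2.0))  # True

# A response the policy finds far more likely than the reference is penalized.
print(kl_penalized_reward(rm_score=1.0, logp_policy=-1.0, logp_reference=-2.0))
```

In a real pipeline the scores come from a neural reward model and the losses are averaged over batches; the scalar version here only shows the shape of the objective.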

Reward models in RLHF face several challenges. The reward model is only a proxy for true human preferences, so optimizing it too aggressively can lead to **reward overoptimization** (Gao et al., 2022), where the model learns to exploit idiosyncrasies of the reward model rather than genuinely improving.[10] Common failure modes in LLMs include **length bias** (producing excessively long responses to get higher scores), **sycophancy** (agreeing with users even when they are wrong), and **U-Sophistry** (producing convincing but incorrect reasoning that fools evaluators).

## Inverse Reward Design

**Inverse reward design** (IRD), introduced by Hadfield-Menell et al. (2017), treats the observed reward function not as the true objective but as an observation about the designer's intent, one that was designed in a specific training environment and may not generalize to new settings.[7]

Whereas [inverse reinforcement learning](/wiki/reinforcement_learning) (IRL) infers a reward function from observed behavior, IRD infers the true reward function from a designed reward function, accounting for the fact that the designer chose it to work well in a particular training MDP. The key insight is that the same true objective could lead to different designed reward functions in different environments. By maintaining uncertainty over the true reward, the agent can plan more conservatively in novel environments, avoiding potentially catastrophic actions that the training reward would have rated highly.[7]

## Multi-Objective Rewards

Many real-world tasks involve multiple, potentially conflicting objectives. Standard RL assumes a single scalar reward, but **multi-objective reinforcement learning** (MORL) extends this framework to vector-valued rewards, where each component represents a different objective.

For example, an autonomous vehicle must simultaneously optimize for passenger safety, travel time, fuel efficiency, and passenger comfort. These objectives often conflict: faster driving may reduce travel time but increase safety risk.

Approaches to multi-objective rewards include:

- **Scalarization**: Converting the reward vector into a scalar using a weighted sum: $$r = w_1 r_1 + w_2 r_2 + \cdots + w_n r_n$$. This reduces the problem to standard RL but requires choosing weights in advance.
- **Pareto optimization**: Finding the set of Pareto-optimal policies, where no objective can be improved without worsening another.
- **Constrained optimization**: Maximizing one objective while constraining others to meet minimum thresholds.

Multi-objective reward design is particularly relevant for AI alignment, where a single scalar reward may not adequately capture the complexity of human values.
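The scalarization and Pareto approaches listed above can be compared on a toy example. The sketch below assumes four hypothetical driving policies, each summarized by a made-up (safety, speed) return vector; the function names are illustrative.

```python
def scalarize(reward_vector, weights):
    """Weighted sum r = w_1 * r_1 + ... + w_n * r_n."""
    return sum(w * r for w, r in zip(weights, reward_vector))


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical (safety, speed) returns for four driving policies.
policies = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5)]

# (0.5, 0.5) is dominated by (0.6, 0.6) and drops out.
print(pareto_front(policies))  # [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9)]

# Different weights select different points on the front.
print(max(policies, key=lambda p: scalarize(p, (0.5, 0.5))))  # (0.6, 0.6)
print(max(policies, key=lambda p: scalarize(p, (0.9, 0.1))))  # (0.9, 0.2)
```

The example shows why the choice of weights matters: scalarization commits to one trade-off in advance, whereas the Pareto front keeps every defensible trade-off available.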

## Reward Normalization and Clipping

In practice, reward signals often vary dramatically in scale across different environments or tasks. **Reward normalization** and **reward clipping** are techniques used to stabilize training.

**Reward clipping**, first used in the original DQN paper (Mnih et al., 2013), maps every reward to one of three values: -1, 0, or +1. All positive rewards are mapped to +1, all negative rewards to -1, and zero rewards remain unchanged. This allowed the same learning rate to be used across Atari games with vastly different scoring scales.[4] The trade-off is that the agent can no longer distinguish between rewards of different magnitudes; a small positive reward is treated identically to a large one.

**Reward normalization** is a more flexible alternative. Common approaches include dividing rewards by a running estimate of their standard deviation, or using return-based scaling (Schaul et al., 2021) that normalizes the effective loss scales. These techniques preserve relative reward magnitudes while keeping gradient scales manageable.
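Both techniques fit in a few lines. The sketch below shows sign-based clipping as described for DQN, plus one simple way to divide rewards by a running standard deviation using Welford's online algorithm. The class name `RunningRewardScaler` and the `eps` constant are illustrative assumptions, and the scaler is not the return-based scaling of Schaul et al.

```python
import math


def clip_reward(r):
    """DQN-style clipping: map each reward to its sign (-1, 0, or +1)."""
    return float((r > 0) - (r < 0))


class RunningRewardScaler:
    """Divide rewards by a running estimate of their standard deviation."""

    def __init__(self, eps=1e-8):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)
        self.eps = eps

    def update(self, r):
        self.n += 1
        delta = r - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (r - self.mean)

    def scale(self, r):
        # Fall back to a scale of 1 until there are at least two samples.
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0
        return r / (std + self.eps)


print([clip_reward(r) for r in (250, -3, 0)])  # [1.0, -1.0, 0.0]

scaler = RunningRewardScaler()
for r in (0.0, 10.0):
    scaler.update(r)
print(round(scaler.scale(10.0), 3))  # 2.0
```

Clipping discards magnitude entirely, while the scaler keeps a reward of 10 twice as large as a reward of 5 after scaling.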

## Discount Factor and Return

The **discount factor** $$\gamma$$ connects individual rewards to the agent's long-term objective. It appears in the definition of the [return](/wiki/return):

$$
G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

The discount factor serves multiple purposes:

| $$\gamma$$ Value | Agent Behavior | Use Case |
|---|---|---|
| $$\gamma = 0$$ | Purely myopic; only considers immediate reward | Bandit problems, highly stochastic environments |
| $$\gamma \approx 0.9\text{ to }0.99$$ | Balances short-term and long-term rewards | Most practical RL applications |
| $$\gamma = 1$$ | No discounting; all future rewards weighted equally | Episodic tasks with guaranteed termination |

Beyond shaping agent behavior, the discount factor ensures mathematical convergence of the return in infinite-horizon tasks and can be interpreted as encoding the probability that the interaction continues at each step.[1]
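The convergence claim can be made concrete under one added assumption: that rewards are bounded, $$|R_t| \le R_{\max}$$ for all *t*. With $$\gamma < 1$$, the return is then bounded by a geometric series:

$$
|G_t| \le \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1 - \gamma}
$$

For example, with $$\gamma = 0.99$$ and $$R_{\max} = 1$$, the return can never exceed 100 in magnitude. The factor $$1/(1-\gamma)$$ is often read informally as an effective planning horizon, which is about 100 steps in this case.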

## Explain Like I'm 5 (ELI5)

Imagine you are training a puppy. Every time the puppy does something good, like sitting when you ask, you give it a treat. Every time it does something bad, like chewing your shoes, you give it no treat (or say "no"). The treat is the **reward**.

The puppy does not know the rules at first. It tries different things and figures out which actions earn treats and which do not. Over time, it learns to sit, stay, and come when called because those actions lead to treats.

In reinforcement learning, the computer program is like the puppy. The reward is like the treat. The program tries lots of different actions and uses the rewards it gets to figure out which actions are good and which are bad. Eventually, it learns a strategy (called a **policy**) that earns it the most rewards possible.

Sometimes, the "treats" only come at the very end (like winning a board game), which makes it harder for the program to figure out which earlier moves were good. That is called a **sparse reward**. Other times, the program gets small treats along the way (like points for each move), which makes learning easier. That is called a **dense reward**.

## References

1. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
2. Ng, A. Y., Harada, D., & Russell, S. J. (1999). "Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping." *Proceedings of the 16th International Conference on Machine Learning (ICML)*.
3. Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). "Reward is Enough." *Artificial Intelligence*, 299, 103535.
4. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). "Playing Atari with Deep Reinforcement Learning." *arXiv preprint arXiv:1312.5602*.
5. Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). "Curiosity-Driven Exploration by Self-Supervised Prediction." *Proceedings of the 34th International Conference on Machine Learning (ICML)*.
6. Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2018). "Exploration by Random Network Distillation." *arXiv preprint arXiv:1810.12894*.
7. Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S., & Dragan, A. (2017). "Inverse Reward Design." *Advances in Neural Information Processing Systems (NeurIPS)*.
8. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). "Concrete Problems in AI Safety." *arXiv preprint arXiv:1606.06565*.
9. Skalse, J., Howe, N., Krasheninnikov, D., & Krueger, D. (2022). "Defining and Characterizing Reward Hacking." *Advances in Neural Information Processing Systems (NeurIPS)*.
10. Gao, L., Schulman, J., & Hilton, J. (2022). "Scaling Laws for Reward Model Overoptimization." *arXiv preprint arXiv:2210.10760*.
11. Vamplew, P., Smith, B. J., Källström, J., et al. (2022). "Scalar reward is not enough: a response to Silver, Singh, Precup and Sutton (2021)." *Autonomous Agents and Multi-Agent Systems*, 36, 41.
12. Clark, J., & Amodei, D. (2016). "Faulty Reward Functions in the Wild." OpenAI Blog. https://openai.com/index/faulty-reward-functions/
13. Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). "Deep Reinforcement Learning from Human Preferences." *Advances in Neural Information Processing Systems (NeurIPS)*.