| Markov Decision Process | |
| --- | --- |
| **Overview** | |
| Purpose | Mathematical framework for decision-making under uncertainty |
| Also known as | Markov decision process (MDP) |
| Key assumptions | Markov property and stochastic state transitions |
A Markov decision process (MDP) is a mathematical framework used to model decision-making in situations where outcomes are partly random and partly under the control of a decision maker. It describes a system that moves between states according to probabilistic rules and allows an agent to choose actions to influence future outcomes. MDPs are widely used in reinforcement learning, control theory, and operations research.
An MDP is typically defined as a tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\), where \(\mathcal{S}\) is a set of states, \(\mathcal{A}\) is a set of actions available to the agent, and \(P(s' \mid s, a)\) is a transition probability that specifies the likelihood of moving to state \(s'\) after taking action \(a\) in state \(s\). The function \(R(s,a)\) (or sometimes \(R(s,a,s')\)) specifies the reward received for a state–action pair (or for a transition). The discount factor \(\gamma \in [0,1)\) determines how much future rewards are valued compared with immediate rewards.
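This tuple can be written down directly as data. The following minimal sketch encodes a hypothetical two-state "machine maintenance" MDP (the states, actions, probabilities, and rewards are illustrative, not from the text) using plain Python dictionaries:

```python
# Hypothetical two-state MDP: a machine is "healthy" or "broken".
states = ["healthy", "broken"]
actions = ["wait", "repair"]

# P[s][a] maps next state s' -> probability, i.e. P(s' | s, a)
P = {
    "healthy": {
        "wait":   {"healthy": 0.9, "broken": 0.1},
        "repair": {"healthy": 1.0, "broken": 0.0},
    },
    "broken": {
        "wait":   {"healthy": 0.0, "broken": 1.0},
        "repair": {"healthy": 0.8, "broken": 0.2},
    },
}

# R[s][a] gives the immediate reward R(s, a)
R = {
    "healthy": {"wait": 1.0,  "repair": -0.5},
    "broken":  {"wait": -1.0, "repair": -2.0},
}

gamma = 0.95  # discount factor in [0, 1)

# Sanity check: each P(. | s, a) is a probability distribution
for s in states:
    for a in actions:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```

Any finite MDP can be written in this nested-dictionary form; larger problems typically use arrays indexed by state and action instead.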
The defining assumption is the Markov property, meaning the probability of future states depends only on the current state and action, not on the full history. This property is central to dynamic programming methods and is closely related to Markov chains. When the state space and action sets are finite, the MDP can be analyzed using algorithms from dynamic programming.
A policy defines how the agent chooses actions. In general, a policy may be deterministic or stochastic; in stochastic policies, action selection follows a probability distribution conditioned on the current state. Common policy representations include the mapping \(\pi(a \mid s)\) for stochastic policies or \(\pi(s)\) for deterministic ones.
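Both policy forms are easy to represent directly. The sketch below shows a hypothetical deterministic policy as a function and a hypothetical stochastic policy as a per-state action distribution (state and action names are illustrative assumptions):

```python
import random

# Deterministic policy pi(s): a plain mapping from state to action
def pi_det(s):
    return "repair" if s == "broken" else "wait"

# Stochastic policy pi(a | s): a probability distribution over actions per state
pi_stoch = {
    "healthy": {"wait": 0.9, "repair": 0.1},
    "broken":  {"wait": 0.2, "repair": 0.8},
}

def sample_action(s, rng=random.Random(0)):
    # Draw an action according to pi(a | s)
    acts, probs = zip(*pi_stoch[s].items())
    return rng.choices(acts, weights=probs, k=1)[0]
```

A deterministic policy is the special case where each state's distribution puts probability 1 on a single action.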
To evaluate how good a policy is, MDPs use value functions. The state-value function \(V^\pi(s)\) gives the expected discounted return starting from state \(s\) and following policy \(\pi\), while the action-value function \(Q^\pi(s,a)\) gives the expected discounted return after taking action \(a\) in state \(s\) and thereafter following policy \(\pi\). These functions satisfy recursive relationships often called Bellman equations, linking the concept to Bellman equation and the broader literature on optimal control.
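Written out for a reward function of the form \(R(s,a)\), the Bellman expectation equations for a policy \(\pi\) are:

\[
V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a)\, V^\pi(s') \Big],
\]
\[
Q^\pi(s,a) = R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a) \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, Q^\pi(s',a').
\]

Each equation expresses the value of a state (or state–action pair) as the immediate reward plus the discounted expected value of what follows.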
The goal in standard MDP settings is to find an optimal policy that maximizes expected return. The optimal value functions \(V^{*}(s)\) and \(Q^{*}(s,a)\) give the maximum expected discounted return over all policies. They satisfy the Bellman optimality equations, which provide a set of consistency conditions involving the immediate reward and the expected value of successor states.
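Concretely, with a reward function of the form \(R(s,a)\), the Bellman optimality equations read:

\[
V^{*}(s) = \max_{a \in \mathcal{A}} \Big[ R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a)\, V^{*}(s') \Big],
\]
\[
Q^{*}(s,a) = R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a) \max_{a' \in \mathcal{A}} Q^{*}(s',a').
\]

Unlike the expectation equations for a fixed policy, these replace the average over a policy's actions with a maximum.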
These relationships underpin many solution approaches. For example, value iteration repeatedly updates estimates of value functions until convergence, while policy improvement strategies can iteratively refine the chosen policy. When MDPs have finite state and action spaces and appropriate conditions hold, these methods can converge to the optimal value functions and an optimal policy.
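Value iteration can be sketched in a few lines. The code below applies the Bellman optimality update until the value estimates stop changing, then reads off a greedy policy; the tiny two-state MDP it operates on is a hypothetical example, not one from the text:

```python
# A hypothetical two-state MDP for illustration.
P = {  # P[s][a][s'] = P(s' | s, a)
    "healthy": {"wait":   {"healthy": 0.9, "broken": 0.1},
                "repair": {"healthy": 1.0, "broken": 0.0}},
    "broken":  {"wait":   {"healthy": 0.0, "broken": 1.0},
                "repair": {"healthy": 0.8, "broken": 0.2}},
}
R = {  # R[s][a]
    "healthy": {"wait": 1.0,  "repair": -0.5},
    "broken":  {"wait": -1.0, "repair": -2.0},
}
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        # Bellman optimality update for every state
        V_new = {
            s: max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in P[s]
            )
            for s in P
        }
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new

V_star = value_iteration(P, R, gamma)

# Greedy policy extracted from the (near-)optimal values
policy = {
    s: max(P[s], key=lambda a: R[s][a]
           + gamma * sum(p * V_star[s2] for s2, p in P[s][a].items()))
    for s in P
}
```

For this example the greedy policy waits while the machine is healthy and repairs it when broken, which matches the intuition built into the rewards.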
MDPs are a core modeling tool in reinforcement learning, where an agent interacts with an environment and learns to maximize cumulative reward. Many reinforcement learning algorithms can be interpreted as solving or approximating the Bellman equations of an underlying MDP, using experience gathered from interaction rather than complete knowledge of transition probabilities.
Model-based methods assume or learn a model of the transition dynamics \(P\) and rewards \(R\), then use planning to compute policies. Model-free methods, by contrast, learn value functions or policies directly from data. For instance, Q-learning is a model-free algorithm that updates an estimate of \(Q(s,a)\) using sampled transitions; likewise, SARSA updates action values based on the action actually taken under the current behavior policy. These methods connect MDP theory to practical learning systems used in robotics, game-playing, and other sequential decision tasks.
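The Q-learning update can be sketched as follows. Here the transition model is used only to *simulate* the environment; the learner itself sees just sampled transitions. The two-state MDP and all numbers are illustrative assumptions:

```python
import random

# Hypothetical two-state MDP used only as a simulator.
P = {"healthy": {"wait":   {"healthy": 0.9, "broken": 0.1},
                 "repair": {"healthy": 1.0, "broken": 0.0}},
     "broken":  {"wait":   {"healthy": 0.0, "broken": 1.0},
                 "repair": {"healthy": 0.8, "broken": 0.2}}}
R = {"healthy": {"wait": 1.0,  "repair": -0.5},
     "broken":  {"wait": -1.0, "repair": -2.0}}
gamma, alpha, eps = 0.9, 0.1, 0.1
rng = random.Random(0)

def step(s, a):
    # Sample a successor state from P(. | s, a); return (reward, next state)
    nexts, probs = zip(*P[s][a].items())
    return R[s][a], rng.choices(nexts, weights=probs)[0]

Q = {s: {a: 0.0 for a in P[s]} for s in P}
s = "healthy"
for _ in range(50_000):
    # Epsilon-greedy behavior policy
    a = rng.choice(list(Q[s])) if rng.random() < eps else max(Q[s], key=Q[s].get)
    r, s2 = step(s, a)
    # Off-policy Q-learning update: bootstrap from the best next action
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
    s = s2
```

SARSA differs only in the bootstrap term: it uses \(Q(s', a')\) for the action \(a'\) actually selected by the behavior policy instead of \(\max_{a'} Q(s', a')\).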
While the classical MDP framework assumes the Markov property, many extensions relax or modify assumptions to match real-world problems. Examples include scenarios with continuous state spaces (leading to function approximation), partially observable environments (leading to partially observable Markov decision process and belief-state approaches), and multi-agent settings (leading to game-theoretic formulations). The structure of the MDP also interacts with concepts such as occupancy measures and optimal control formulations found in control theory.
MDPs have also been studied under different computational viewpoints, such as algorithmic complexity and approximation in large-scale settings. Research in this area links MDPs to topics such as stochastic optimization and planning in uncertain environments.
Categories: Markov decision processes, Reinforcement learning, Stochastic processes
This article was generated by AI using GPT Wiki. Content may contain inaccuracies. Generated on March 26, 2026. Made by Lattice Partners.