Markov Decision Processes (MDPs) are a mathematical framework used for decision-making in environments with stochastic outcomes. They are widely applied in artificial intelligence (AI), reinforcement learning (RL), robotics, economics, and operations research. MDPs provide a structured approach to modeling problems where outcomes are partly random and partly under the control of a decision-maker.
Key Components of an MDP
An MDP is formally defined as a tuple (S, A, P, R, γ), where:
- S (State Space) – A finite or infinite set of possible states in which an agent can exist.
- A (Action Space) – A set of possible actions an agent can take.
- P (Transition Probability Function) – P(s' | s, a) gives the probability of moving from state s to state s' when taking action a.
- R (Reward Function) – A function R(s, a, s') that specifies the immediate reward received when transitioning from s to s' after taking action a.
- γ (Discount Factor) – A value in the range [0,1] that determines the importance of future rewards, with values closer to 1 prioritizing long-term rewards over immediate ones.
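To make the tuple concrete, here is a minimal sketch in Python of a hypothetical two-state MDP. The state names ("low", "high"), actions ("wait", "work"), probabilities, and rewards are invented purely for illustration.

```python
# A hypothetical two-state MDP; all names and numbers are illustrative only.
states = ["low", "high"]
actions = ["wait", "work"]

# P[s][a] maps each next state s' to the transition probability P(s' | s, a).
P = {
    "low":  {"wait": {"low": 0.9, "high": 0.1}, "work": {"low": 0.4, "high": 0.6}},
    "high": {"wait": {"low": 0.3, "high": 0.7}, "work": {"low": 0.1, "high": 0.9}},
}

def R(s, a, s_next):
    """Illustrative reward R(s, a, s'): +1 for landing in "high", small cost for "work"."""
    return (1.0 if s_next == "high" else 0.0) - (0.1 if a == "work" else 0.0)

gamma = 0.9  # discount factor
```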
The Markov Property
MDPs rely on the Markov property, which states that the future state depends only on the present state and action, not on past states or actions. Mathematically, this is expressed as:

$$P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, \dots, S_0 = s_0, A_0 = a_0) = P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)$$
This memoryless property simplifies decision-making since the optimal strategy only depends on the current state, not the history.
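As a small illustration of this memorylessness, a transition sampler for a tabular MDP only ever needs the current state and action; the history never enters the computation. This sketch assumes a nested-dict transition table of the form P[s][a] = {s': probability}, like the hypothetical one above.

```python
import random

def sample_next_state(P, s, a):
    """Sample s' from P(. | s, a); only the current state and action are used."""
    next_states = list(P[s][a].keys())
    probs = list(P[s][a].values())
    return random.choices(next_states, weights=probs, k=1)[0]
```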
Policy in MDPs
A policy π defines the agent's behavior in an MDP. It maps each state to an action (or to a probability distribution over actions).
- Deterministic Policy: π(s) = a specifies a single action to take in state s.
- Stochastic Policy: π(a | s) gives the probability of choosing action a in state s.
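As a sketch, both kinds of policy can be represented with plain dictionaries for a tabular MDP; the state and action names here are hypothetical, matching the toy MDP above.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"low": "work", "high": "wait"}

# Stochastic policy: each state maps to a distribution over actions, pi(a | s).
stochastic_policy = {
    "low":  {"wait": 0.2, "work": 0.8},
    "high": {"wait": 0.7, "work": 0.3},
}

def act(policy, s):
    """Select an action in state s under either representation."""
    choice = policy[s]
    if isinstance(choice, dict):  # stochastic: sample from pi(. | s)
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs, k=1)[0]
    return choice  # deterministic: return the mapped action directly
```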
Value Functions
Value functions help evaluate how good a given policy is.
- State-Value Function $V^\pi(s)$: The expected return when starting in state s and following policy π thereafter: $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s\right]$
- Action-Value Function $Q^\pi(s, a)$: The expected return when taking action a in state s and then following policy π: $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s, A_0 = a\right]$
These functions are used to evaluate policies and determine optimal strategies.
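The "expected return" in both definitions is the discounted sum of future rewards, $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. A minimal helper that computes this for one sampled reward sequence (the numbers are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```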
Bellman Equations
Bellman equations express the recursive relationship between the value of a state (or state-action pair) under a policy and the values of the states that follow it.
State-Value Function Bellman Equation

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$

Action-Value Function Bellman Equation

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') \right]$$
These equations help compute values iteratively in Dynamic Programming (DP) techniques such as Value Iteration and Policy Iteration.
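To show how the Bellman expectation equation is used in practice, here is a sketch of iterative policy evaluation for a tabular MDP. It assumes the same hypothetical conventions used earlier (a nested-dict transition table P[s][a] = {s': prob}, a reward function R(s, a, s'), and a stochastic policy pi[s] = {a: prob}); it repeatedly replaces V(s) with the right-hand side of the Bellman equation until the values stop changing.

```python
def evaluate_policy(states, P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: repeatedly apply the Bellman expectation backup."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a] * sum(
                    prob * (R(s, a, s_next) + gamma * V[s_next])
                    for s_next, prob in P[s][a].items()
                )
                for a in pi[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```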
Solving an MDP
There are several methods for solving MDPs:
- Dynamic Programming (DP)
  - Value Iteration: Iteratively updates V(s) using the Bellman optimality equation (sketched after this list).
  - Policy Iteration: Alternates between policy evaluation (calculating $V^\pi(s)$) and policy improvement (updating the policy π).
- Monte Carlo Methods
  - Used when the transition dynamics are unknown.
  - Estimate value functions by sampling episodes from experience.
- Temporal Difference (TD) Learning
  - Mixes DP and Monte Carlo by updating value estimates using sampled transitions.
  - Q-Learning and SARSA are popular TD methods used in reinforcement learning (a tabular Q-Learning sketch follows below).
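To ground the list above, here is a sketch of Value Iteration for a tabular MDP, again assuming the hypothetical nested-dict conventions used earlier (P[s][a] = {s': prob}, reward function R(s, a, s')). It repeatedly applies the Bellman optimality backup and then reads off a greedy policy.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Value Iteration: apply the Bellman optimality backup until convergence."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(prob * (R(s, a, s_next) + gamma * V[s_next])
                    for s_next, prob in P[s][a].items())
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Extract a greedy policy from the converged values.
    policy = {
        s: max(actions, key=lambda a: sum(
            prob * (R(s, a, s_next) + gamma * V[s_next])
            for s_next, prob in P[s][a].items()))
        for s in states
    }
    return V, policy
```

Calling value_iteration(states, actions, P, R) on the toy MDP sketched earlier would return its optimal state values and a greedy policy.

And here is a sketch of tabular Q-Learning, which needs only sampled transitions rather than the full model. The environment interface assumed here (env.reset() returning a state, env.step(a) returning (next_state, reward, done)) is a hypothetical, Gym-like convention, not a specific library's API. A Monte Carlo variant would instead wait until the end of each episode and update toward the full sampled return rather than the bootstrapped target.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)  # Q[(s, a)], defaults to 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # TD update toward the Bellman optimality target.
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```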
Applications of MDPs
MDPs are widely used in real-world applications, including:
- Reinforcement Learning: Foundations for algorithms like Deep Q-Networks (DQNs) and Policy Gradient methods.
- Robotics: Planning and control for autonomous systems.
- Finance: Portfolio optimization and risk assessment.
- Healthcare: Treatment planning and decision support.
- Operations Research: Inventory management, logistics, and supply chain optimization.
Conclusion
Markov Decision Processes provide a robust mathematical framework for modeling sequential decision-making under uncertainty. With applications spanning AI, business, and robotics, MDPs form the backbone of many reinforcement learning algorithms. Understanding MDP components, Bellman equations, and solution methods is essential for anyone working in AI and decision science.