Road to ML Engineer #37 - Markov Decision Process

Last Edited: 1/27/2025

This blog post introduces the Markov decision process in reinforcement learning.


So far, we have been dealing with supervised learning (where we make models learn to predict the labels we provide) and unsupervised learning (where we do not provide labels and make models learn patterns, such as in PCA). In this article, we introduce a new paradigm of machine learning: the reinforcement learning paradigm.

Note: The content of this article is largely inspired by Reinforcement Learning: An Introduction by Sutton, R. S. & Barto, A. G. (2015). I highly recommend checking it out.

What Is Reinforcement Learning?

Reinforcement learning is a unique paradigm of machine learning that studies reward-maximizing agents interacting with an environment. It is different from supervised machine learning in that there is no supervisor, and the agent has to explore the world on its own and shape its behavior. It is also different from unsupervised learning in that agents need to perform a series of actions to maximize rewards.

The uniqueness of this paradigm brings a unique set of challenges, the major one being the tradeoff between exploration and exploitation. While the agent needs to exploit the knowledge acquired through experience by performing the actions that led to rewards, it also needs to explore actions it has never performed, which might lead to better rewards. This tradeoff is irrelevant to supervised or unsupervised learning, which only focus on an isolated subproblem instead of the whole environment.

As reinforcement learning agents consider their optimal actions within the whole environment, they can be more robust and more widely applicable in real life than the other two approaches, which require strict control of the environment. Modern reinforcement learning has also benefited from findings in many disciplines, boosting its robustness and applicability. This is why reinforcement learning is a particularly interesting field of research in AI and ML, and why we see various applications of reinforcement learning in many fields, including LLMs.

Problem Definition

Reinforcement learning tasks can be simply expressed by the diagram below, where an environment sends the current state and reward to an agent, and the agent chooses among the available actions based on the current state and reward.

RL Diagram

At each discrete time step t = 1, 2, 3, ..., the agent receives the current state of the environment S_t = s, collects the reward from the previous action R_t = r, and chooses the action A_t = a based on the current state and the policy \pi_t(a|S_t = s), which maps a state to the probability of taking each action. Then, one time step later, the environment transitions to the new state S_{t+1} = s' based on the action a at state s, which has an associated reward R_{t+1} = r'. This process continues until the maximum time step is reached or the agent reaches a goal state. The agent's job is to learn the best policy \pi that maximizes the sum of rewards over a full cycle, or episode, by experiencing multiple episodes.
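
To make this interaction loop concrete, here is a minimal sketch in Python. The env and policy objects are hypothetical stand-ins, not part of any specific library; their reset, step, and sample methods are interfaces assumed purely for illustration.

```python
# A minimal sketch of the agent-environment loop described above.
# `env` and `policy` are hypothetical objects assumed to expose:
#   env.reset() -> initial state S_0
#   env.step(action) -> (next_state, reward, done)
#   policy.sample(state) -> an action drawn from pi(a | s)

def run_episode(env, policy, max_steps=1000):
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy.sample(state)                  # A_t ~ pi(. | S_t = s)
        next_state, reward, done = env.step(action)    # S_{t+1}, R_{t+1}
        total_reward += reward
        state = next_state
        if done:                                       # reached a goal state
            break
    return total_reward                                # sum of rewards for the episode
```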

When the probabilities of transitioning to a next state s' depend only on the current state s and action a, we say that the state transition has the Markov property. This property is desirable as it allows us to easily compute the expected rewards for each action and choose the one with the highest expected reward. When all state transitions in a process satisfy the Markov property, we say that the process is a Markov decision process (MDP), and the field of reinforcement learning predominantly deals with MDPs under the assumption that most reinforcement learning tasks can be defined as MDPs. (The plausibility of this assumption is discussed further at the bottom of the article.)
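
Formally, the Markov property states that the next state and reward depend only on the current state and action, not on the entire history of states and actions that came before:

\text{Pr}\{S_{t+1} = s', R_{t+1} = r | S_t, A_t\} = \text{Pr}\{S_{t+1} = s', R_{t+1} = r | S_0, A_0, R_1, ..., S_t, A_t\}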

Finite Markov Decision Process

The sets of all possible states and actions are called the state space and the action space, and reinforcement learning focuses mainly on MDPs with finite state and action spaces, called finite Markov decision processes (finite MDPs). A finite MDP can be defined by 4 elements: a finite state space S, a finite action space A, the probability P_{ss'}^{a} of transitioning to the next state s' at time step t+1 from s given action a, and the expected immediate reward R_{ss'}^a from that transition, for all possible states and actions. The objective is to learn the policy that maximizes the expected sum of rewards, or return, of an episode. We often express MDPs with a graph like the one below.

MDP Example

The above is an example diagram of a finite MDP. The state space is all of the cities, and the action space is all the vehicles one can use. Each state transition from one city to another involves an action, which has an associated probability (reflecting the possibility of different congestion levels, accidents, etc.) and reward (hours spent, as a negative reward). The task is to find the optimal policy for traveling from City A to City D with the highest return, that is, the least time spent. Given the Markov property, the probability of each possible pair of next state and reward can be denoted as follows.

p(s', r | s, a) = \text{Pr}\{S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a\}
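
As an illustration, here is one way such a table of p(s', r | s, a) could be encoded in Python for the city-travel example. The cities, vehicles, rewards, and probabilities below are made-up values chosen only to show the structure, not read off the diagram.

```python
# A finite MDP stored as a table: P[(state, action)] is a list of
# (next_state, reward, probability) triples, i.e. the non-zero entries
# of p(s', r | s, a). Rewards are negative hours spent.
# All numbers here are hypothetical, for illustration only.
P = {
    ("City A", "car"):   [("City B", -2.0, 0.8), ("City B", -4.0, 0.2)],  # light vs. heavy traffic
    ("City A", "train"): [("City C", -3.0, 1.0)],
    ("City B", "car"):   [("City D", -1.0, 0.9), ("City D", -3.0, 0.1)],
    ("City C", "bus"):   [("City D", -2.0, 1.0)],
}

# Sanity check: for each (state, action) pair, the probabilities sum to 1.
for (s, a), outcomes in P.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9
```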

These quantities capture the whole dynamics of a finite MDP and allow us to compute the quantities we are interested in, such as the expected rewards for state-action pairs,

r(s, a) = \text{E}[R_{t+1} | S_t = s, A_t = a] = \sum_{r \in R} r \sum_{s' \in S} p(s', r | s, a),

the state-transition probabilities (by marginalizing over rewards),

p(s' | s, a) = \text{Pr}\{S_{t+1} = s' | S_t = s, A_t = a\} = \sum_{r \in R} p(s', r | s, a),

and expected rewards for the combinations of state, action, and next state,

r(s' | s, a) = \text{E}[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s'] = \frac{\sum_{r \in R} r p(s', r | s, a)}{p(s' | s, a)}.

Classically, the state-transition probabilities and expected rewards are written as P_{ss'}^a and R_{ss'}^a, respectively. However, that notation for expected rewards only gives expectations and does not fully capture the dynamics of the reward. It also uses many subscripts and superscripts, which motivates us to use p(s', r | s, a), p(s' | s, a), and r(s' | s, a) when discussing the mathematics (though we might still use P_{ss'}^a and R_{ss'}^a in the problem definition).
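
As a sketch, the three quantities above can be computed directly from a table like the hypothetical P defined earlier, assuming the same P[(s, a)] = [(next_state, reward, probability), ...] layout.

```python
# Derived quantities from the joint dynamics p(s', r | s, a), assuming
# P[(s, a)] = [(next_state, reward, probability), ...] as in the earlier sketch.

def expected_reward(P, s, a):
    """r(s, a): expected immediate reward for taking action a in state s."""
    return sum(r * prob for _, r, prob in P[(s, a)])

def transition_prob(P, s, a, s_next):
    """p(s' | s, a): probability of landing in s_next, marginalized over rewards."""
    return sum(prob for s2, _, prob in P[(s, a)] if s2 == s_next)

def expected_reward_given_next(P, s, a, s_next):
    """r(s' | s, a): expected reward given that the transition ends in s_next."""
    denom = transition_prob(P, s, a, s_next)
    numer = sum(r * prob for s2, r, prob in P[(s, a)] if s2 == s_next)
    return numer / denom if denom > 0 else 0.0
```

For instance, with the hypothetical table above, expected_reward(P, "City A", "car") evaluates to 0.8 * (-2.0) + 0.2 * (-4.0) = -2.4 hours.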

Bellman Equation

The agent's goal is to choose each action based on how good the state that action leads to is, in other words, based on the expected future reward. Using the above quantities, we can define a function that computes the expected future reward of a state under a policy, called the value function.

v_{\pi}(s) = \text{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t = s]

Here, \gamma is the discount rate for future rewards. Similarly, we can define the action-value function, which computes the value of taking action a at state s under a policy.

q_{\pi}(s, a) = \text{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t = s, A_t = a]
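
The two functions are closely related: averaging the action values over the policy's action probabilities recovers the state value.

v_{\pi}(s) = \sum_{a} \pi(a|s) q_{\pi}(s, a)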

As the value function sums future rewards, it inherently has a recursive nature. We can expand the value function to express the recursive relationship between the values of the current and next states as follows:

v_{\pi}(s) = \text{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t = s] \\ = \text{E}_{\pi}[R_{t+1} + \gamma\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} | S_t = s] \\ = \sum_a \pi(a|s) \sum_{s'}\sum_r p(s', r | s, a) [r + \gamma\text{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} | S_{t+1} = s']] \\ = \sum_a \pi(a|s) \sum_{s'}\sum_r p(s', r | s, a) [r + \gamma v_{\pi}(s')]

As we expand the equation, we sum over all possible next states and rewards to compute the expected immediate reward of taking an action in the current state plus the discounted value of the state it leads to. Then, we take another sum over the actions, weighted by the policy, to obtain the value of the current state. The above is the Bellman equation for v_{\pi}, which expresses the value of a state under some policy \pi recursively, in terms of the values of its possible successor states.
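
This recursion also suggests a practical way to compute v_{\pi} when the dynamics are known: repeatedly apply the right-hand side as an update until the values stop changing, a method known as iterative policy evaluation. The sketch below assumes the same hypothetical P table as before and a policy given as policy[s] = {action: pi(a|s)}.

```python
# Iterative policy evaluation: apply the Bellman equation for v_pi as an
# update rule until the value estimates converge. Assumes the hypothetical
# table P[(s, a)] = [(next_state, reward, probability), ...] and a policy
# given as policy[s] = {action: pi(a | s)}. `states` must include every
# reachable state; states with no entry in `policy` are treated as terminal
# and keep a value of 0.

def evaluate_policy(P, policy, states, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s not in policy:
                continue  # terminal state
            v_new = sum(
                pi_a * sum(prob * (r + gamma * V[s2]) for s2, r, prob in P[(s, a)])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```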

Bellman Optimality Equation

Under the optimal policy, which maximizes the expected future reward, the value of a state must equal the expected return for the best action from that state. It is quite intuitive to see why, as values calculated inconsistently, or actions chosen without maximizing rewards, would result in a suboptimal policy. Given this, we can define the Bellman optimality equation for v_*, whose values the optimal policy uses to choose the best next state to move to, as follows.

v_{*}(s) = \max_{a \in A(s)} q_{\pi_*}(s, a) \\ = \max_{a} \text{E}_{\pi_*}[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t=s, A_t=a] \\ = \max_{a} \text{E}_{\pi_*}[R_{t+1}+\gamma\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} | S_t=s, A_t=a] \\ = \max_{a} \sum_{s'}\sum_r p(s', r | s, a) [r + \gamma v_{*}(s')]

Similarly, we can define the Bellman optimality equation for q_* as follows.

q_{*}(s, a) = \text{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t=s, A_t=a] \\ = \sum_{s'}\sum_r p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')]

The Bellman optimality equations are, in fact, a system of equations when we know all the quantities p(s', r | s, a). The system can be solved to find v_*(s) for all states s and used to derive the optimal policy, which assigns non-zero probabilities only to the actions leading to the successor states with the highest values. However, it is almost always implausible to apply this solution, since it assumes perfect knowledge of the dynamics, a perfect Markov property, and enough computational resources to solve the system of equations in a reasonable amount of time, all of which are unrealistic to obtain.
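
For a toy problem where the dynamics really are fully known, one standard way to solve the optimality equation is value iteration, one of the dynamic programming methods discussed in Sutton & Barto, which simply turns the equation above into an update rule. The sketch below again assumes the hypothetical P table and an actions[s] dict listing the actions available in each state.

```python
# Value iteration: turn the Bellman optimality equation for v_* into an
# update rule, then read off a greedy policy. Assumes the hypothetical
# P[(s, a)] table from before and actions[s] = list of actions available
# in state s (empty or missing for terminal states).

def value_iteration(P, actions, states, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions.get(s):
                continue  # terminal state keeps value 0
            v_new = max(
                sum(prob * (r + gamma * V[s2]) for s2, r, prob in P[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # The optimal policy is greedy with respect to v_*.
    policy = {
        s: max(actions[s],
               key=lambda a: sum(prob * (r + gamma * V[s2]) for s2, r, prob in P[(s, a)]))
        for s in states if actions.get(s)
    }
    return V, policy
```

Even this only works because the toy MDP is tiny and its dynamics are handed to us; for realistic problems, the assumptions above quickly break down.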

Chess, for example, is estimated to have a game tree with 10^{111} to 10^{123} possible move sequences, which is impossible to keep track of. We simply do not know the state-transition probabilities and expected rewards, and the board configuration is not the only observable factor that can influence the next state; others, such as the opponent's demeanor and remaining time, make the process not perfectly Markov. Simply solving the system of equations for such an enormous problem is unrealistic. Hence, reinforcement learning concerns itself with learning to produce an approximation of the optimal policy within the limited resources available.

Conclusion

In this article, we introduced the reinforcement learning paradigm, the Markov decision process, the Bellman equation, and the Bellman optimality equations to understand the nature of reinforcement learning tasks both intuitively and mathematically. In the next article, we will look at how reinforcement learning tasks are typically expressed programmatically before diving into the algorithms for approximating the optimal policy.

[Optional] Plausibility of Assuming Markov Property

In reality, it is hard to find a closed system whose state transitions all satisfy the Markov property perfectly. Or rather, it is hard to define a state that contains all the relevant information that might influence the next state, so that all the state transitions can satisfy the Markov property. For example, when modeling a game of poker, we can define a state to include the agent's hand, the bets made by the players, and the number of cards revealed. However, the state transitions can technically depend on other observable factors, such as the players' faces, the loudness of their voices, the time of day, and so on. It is virtually impossible to quantify or discretize some of these and include them all in the state, but does this mean that using an MDP to train an agent for poker is implausible?

It is plausible, because the other factors mentioned above only influence the probabilities to a small extent. Even though the state transitions might not perfectly satisfy the Markov property, the agent's hand, the bets made by the players, and the number of cards revealed should predominantly determine the probabilities of the next states. For example, if other players' bets are high, the agent can read this as confidence in their hands and fold without looking at their faces to check whether they seem confident. It is almost never possible to define a problem whose states are all perfectly Markov, but we only need the states to be largely Markov to assume an MDP. Hence, we can comfortably assume that many tasks can be modeled as MDPs. (Though we are often limited to learning the best approximation of the optimal solution to those MDPs.)

Resources