Road to ML Engineer #40 - Monte Carlo Methods

Last Edited: 2/11/2025

This blog post introduces Monte Carlo methods in reinforcement learning.


In the last article, we discussed the dynamic programming approach to solving Markov Decision Processes (MDPs), which required the unrealistic assumptions of a perfect model and unlimited computational resources. In this article, we introduce another approach for approximating the optimal policy: Monte Carlo methods. Unlike dynamic programming, Monte Carlo methods require no model of the environment and do not use bootstrapping.

Monte Carlo Estimation

Dynamic programming requires a perfect model because it needs the transition probabilities to back up state values and to choose the action that leads to the most valuable successor state. Instead, we can estimate state-action values purely from interaction with the environment, by averaging the returns the agent obtains across episodes. We can then follow a greedy policy that maximizes the estimated state-action value at each time step.
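As a minimal sketch of this idea (assuming episodes are already available as lists of (state, action, reward) tuples; the helper below is illustrative and not part of this article's environment code), first-visit Monte Carlo estimation of state-action values can look like this:

from collections import defaultdict

def mc_estimate_q(episodes, gamma=1.0):
    # First-visit Monte Carlo estimate of Q(s, a).
    # `episodes` is assumed to be a list of episodes, each a list of
    # (state, action, reward) tuples collected under some policy
    # (an illustrative format, not the article's environment API).
    returns_sum = defaultdict(float)   # total return observed per (s, a)
    returns_count = defaultdict(int)   # number of first visits per (s, a)
    Q = defaultdict(float)

    for episode in episodes:
        # Walk backwards to accumulate returns; overwriting as we move toward
        # the start leaves the return of the *first* visit for each pair.
        G = 0.0
        first_visit_return = {}
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            first_visit_return[(state, action)] = G

        for (state, action), g in first_visit_return.items():
            returns_sum[(state, action)] += g
            returns_count[(state, action)] += 1
            Q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]

    return Q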

This approach of simulating episodes not only avoids the assumption of a perfect model but also drops the assumption of unlimited computational resources: it concentrates updates on states that are actually visited with high probability instead of sweeping over all state values. However, the method has a fundamental challenge: maintaining exploration. If the policy deterministically chooses a particular action in a state, the agent never tries the alternatives and can get stuck in a suboptimal policy.

This problem is closely related to the exploration-exploitation trade-off and is a common issue for all approaches that approximate state-action values from sampled interactions with the environment. A naive way of addressing it is to assume "exploring starts," where every episode begins from a randomly chosen state-action pair, guaranteeing that every pair is visited infinitely often as the number of episodes approaches infinity.
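To make the idea concrete, here is a rough sketch of generating an episode with an exploring start. It assumes a hypothetical env.reset_to(state) method for jumping to an arbitrary state, which standard environments (including FrozenLake) do not provide, which is exactly why the assumption is unrealistic.

import numpy as np

def generate_episode_with_exploring_start(env, policy, n_states, n_actions):
    # Exploring start: a uniformly random state-action pair begins the episode.
    state = np.random.randint(n_states)
    action = np.random.randint(n_actions)
    env.reset_to(state)  # hypothetical API, for illustration only
    episode = []
    done = False
    while not done:
        next_state, reward, done, _, _ = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if not done:
            # After the first step, follow the current (possibly deterministic) policy.
            action = np.random.choice(n_actions, p=policy[state])
    return episode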

Although the assumption of exploring starts is sometimes helpful, it is often not reasonable to rely on such an unrealistic assumption. Hence, we typically resort to two practical approaches that utilize stochastic policies to allow exploration: on-policy methods and off-policy methods.

On-Policy Monte Carlo Methods

The on-policy control method uses ϵ-greedy policies, which assign a non-zero probability ϵ/|A(s)| to each non-greedy action and the remaining 1 - ϵ + ϵ/|A(s)| to the greedy action. Hence, the greedy action with the highest state-action value estimate is chosen most of the time, while every other action keeps a chance of being explored. The method aims to approach the best ϵ-soft policy (one that assigns π(a|s) ≥ ϵ/|A(s)| to every action) through policy iteration: policy evaluation by averaging the returns observed in episodes, and policy improvement by assigning the largest possible probability to the greedy action.
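Concretely, the action probabilities of an ϵ-greedy policy can be written down in a few lines (a small sketch assuming a tabular Q array of shape [n_states, n_actions]; the helper names are mine, not from the article):

import numpy as np

def epsilon_greedy_probs(Q, state, epsilon):
    # Every action gets epsilon / |A(s)|; the greedy action gets the remaining 1 - epsilon on top.
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[state])] += 1.0 - epsilon
    return probs  # sums to 1: the greedy action has probability 1 - epsilon + epsilon / |A(s)|

def epsilon_greedy_action(Q, state, epsilon):
    # Sample an action from the epsilon-greedy distribution.
    return np.random.choice(Q.shape[1], p=epsilon_greedy_probs(Q, state, epsilon))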

By the policy improvement theorem, the ϵ-greedy policy constructed from q_π(s, a) is guaranteed to be at least as good as π, so alternating evaluation and improvement converges to the best among the ϵ-soft policies. In other words, because the policy improvement phase always keeps at least probability ϵ/|A(s)| on every non-greedy action, we retain a chance of exploring the other options during the policy evaluation phase. Note that the result is the best ϵ-soft (ϵ-greedy) policy, which is not necessarily the same as the optimal deterministic policy. The following is a code implementation of the on-policy control method for FrozenLakeEnvironment:

def on_policy_mc_control(env, num_episodes, epsilon):
    Q = np.zeros([16, 4])       # state-action value estimates
    n_s_a = np.zeros([16, 4])   # visit counts for each state-action pair
    stats = 0.0                 # sum of episode returns (number of successes here)

    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False
        results_list = []   # tracks all the state-action pairs visited
        result_sum = 0.0    # episode return: 0 if the goal is not reached, else 1 in this case

        # Generate an episode by following the epsilon-greedy policy
        while not done:
            # Epsilon-greedy policy (implemented as if-else)
            if np.random.rand() > epsilon:
                action = np.argmax(Q[coord_to_num(state), :])
            else:
                action = env.action_space.sample()
            new_state, reward, done, _, _ = env.step(action)
            results_list.append((coord_to_num(state), action))
            result_sum += reward
            state = new_state

        # Policy evaluation: fold the observed return into Q as a running mean.
        # Policy improvement happens implicitly, since the next episode acts
        # epsilon-greedily with respect to the updated Q.
        for (state, action) in results_list:
            n_s_a[state, action] += 1.0
            alpha = 1.0 / n_s_a[state, action]
            Q[state, action] += alpha * (result_sum - Q[state, action])  # running mean

        stats += result_sum
        if episode % 500 == 0 and episode != 0:
            print(f"episode: {episode}, success: {stats/episode}, epsilon: {epsilon}")

    print(f"episode: {num_episodes}, success: {stats/num_episodes}, epsilon: {epsilon}")

    env.close()
    return Q

num_episodes = 100000
epsilon = 0.4
Q = on_policy_mc_control(env, num_episodes, epsilon)

The function updates the approximation of the state-action values and therefore the ϵ-greedy policy. Instead of storing every return and averaging at the end, we maintain a running mean for simplicity. Each visited state-action pair is updated with the full episode return, which coincides with the return from that time step here because the only reward arrives at the end of the episode and no discounting is used.
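To see why the running mean gives the same result as averaging all of the returns, let Q_n denote the average of the first n returns observed for a state-action pair; the sample mean can be rewritten incrementally as

Q_n = \frac{1}{n}\sum_{i=1}^{n} G_i = Q_{n-1} + \frac{1}{n}\left(G_n - Q_{n-1}\right)

which matches the update in the code, with α = 1 / n(s, a). With the state-action values in hand, we can read off the ϵ-greedy policy as follows.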

def epsilon_greedy_policy_from_Q(Q, epsilon):
    # Start every action at the exploration probability epsilon / 4, then
    # give the greedy action of each state the remaining probability mass.
    policy = np.ones((16, 4), dtype=float) * epsilon / 4
    for state in range(16):
        best_action = np.argmax(Q[state, :])
        policy[state, best_action] = 1 - epsilon + epsilon / 4
    return policy
 
policy = epsilon_greedy_policy_from_Q(Q, epsilon)

The on-policy control method is relatively simple to understand, as it follows the same evaluation-improvement scheme as the DP approach while relying on sampled experience to estimate returns and assigning non-zero probabilities to non-greedy actions for exploration. In practice, we can gradually decrease the value of ϵ to perform less exploration and more exploitation over time, which is called annealing.
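For example, a simple annealing schedule might look like the following (a sketch; the constants are illustrative choices, not values used in this article's runs):

def annealed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.9995):
    # Exponentially decay epsilon toward a small floor so that exploration
    # never disappears entirely.
    return max(eps_min, eps_start * (decay ** episode))

# Inside the training loop one would call, e.g., epsilon = annealed_epsilon(episode)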

Off-Policy Monte Carlo Methods

On-policy means that we use the same policy for policy evaluation and policy improvement. Because of this, we are forced to end up with an ϵ-greedy policy, which might not be exactly optimal. For example, the code above produces a policy that still assigns probability ϵ/|A(s)| to actions that lead into a hole. The off-policy control method instead uses different policies for the two roles, together with importance sampling: a stochastic behavior policy μ generates episodes and provides exploration, while the target policy π that we improve can end up deterministic.

We can achieve off-policy learning by weighting returns according to the importance sampling ratio, which corresponds to the relative probability of observing the same series of states and actions under the target and behavior policies. Given the start state S_t, the following is the importance sampling ratio:

\rho_t^T = \frac{\prod_{k=t}^{T-1} \pi(A_k | S_k)\,p(S_{k+1}| S_k, A_k)}{\prod_{k=t}^{T-1} \mu(A_k | S_k)\,p(S_{k+1}| S_k, A_k)} = \prod_{k=t}^{T-1}\frac{\pi(A_k | S_k)}{\mu(A_k | S_k)}
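For intuition, here is a toy computation of the ratio with made-up numbers (not taken from the FrozenLake runs): with a deterministic target policy and an ϵ-greedy behavior policy with ϵ = 0.2 over 4 actions, each greedy action has probability 0.85 under the behavior policy.

import numpy as np

# Probabilities of the three actions actually taken in one short episode.
pi_probs = np.array([1.0, 1.0, 1.0])     # deterministic target policy chose these actions
mu_probs = np.array([0.85, 0.85, 0.85])  # epsilon-greedy behavior policy, epsilon = 0.2, 4 actions
rho = np.prod(pi_probs / mu_probs)       # importance sampling ratio for the episode, ~1.63
# If any taken action had been non-greedy, its target probability would be 0 and rho would be 0.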

We can then scale the returns by this ratio and take either a simple average or a weighted average, which are called ordinary importance sampling and weighted importance sampling respectively. Ordinary importance sampling is unbiased but its variance can be unbounded (even infinite), because the ratios themselves can grow arbitrarily large, whereas weighted importance sampling is biased but has bounded variance; for this reason, weighted importance sampling is largely preferred in practice.
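The two estimators can be sketched as follows, assuming we have already collected the returns and importance sampling ratios of several episodes starting from the state of interest (an illustrative setup, not part of this article's code):

import numpy as np

def ordinary_importance_sampling(returns, ratios):
    # Simple average of ratio-scaled returns: unbiased, but the variance can blow up.
    returns, ratios = np.asarray(returns, dtype=float), np.asarray(ratios, dtype=float)
    return np.sum(ratios * returns) / len(returns)

def weighted_importance_sampling(returns, ratios):
    # Weighted average of returns using the ratios as weights: biased, but far lower variance.
    returns, ratios = np.asarray(returns, dtype=float), np.asarray(ratios, dtype=float)
    total_weight = np.sum(ratios)
    return np.sum(ratios * returns) / total_weight if total_weight > 0 else 0.0

In the control algorithm below, weighted importance sampling is implemented incrementally: working backwards from the tail of the episode, we compute weights W_t and keep a cumulative sum of weights C_t as follows.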

W_t = W_{t-1} \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \\ C_t = C_{t-1} + W_t

The state-action value can then be updated by substituting α with W/C, which applies the appropriate weighted-importance-sampling weight. The following is a code implementation of the off-policy method with weighted importance sampling, using an ϵ-greedy policy as the behavior policy and a deterministic greedy policy as the target policy.

def off_policy_mc_control(env, num_episodes, epsilon):
    Q = np.zeros([16, 4])
    C = np.zeros([16, 4])  # cumulative weights for weighted importance sampling
    stats = 0.0

    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False
        result_list = []
        result_sum = 0.0

        # Generate an episode with the behavior policy (epsilon-greedy)
        while not done:
            if np.random.rand() > epsilon:
                action = np.argmax(Q[coord_to_num(state), :])
            else:
                action = env.action_space.sample()

            new_state, reward, done, _, _ = env.step(action)
            result_list.append((coord_to_num(state), action))
            result_sum += reward
            state = new_state

        # Update Q with weighted importance sampling, processing the episode
        # backwards from its tail
        W = 1.0  # importance sampling weight

        for t in range(len(result_list) - 1, -1, -1):
            state, action = result_list[t]
            C[state, action] += W  # update the cumulative weight
            Q[state, action] += (W / C[state, action]) * (result_sum - Q[state, action])

            # Target policy is greedy: probability 1 for the best action, 0 for the others
            best_action = np.argmax(Q[state, :])
            if action != best_action:
                break  # target probability is 0, so all earlier steps get zero weight

            # The action taken was greedy, so its probability under the epsilon-greedy
            # behavior policy is 1 - epsilon + epsilon / 4 (and 1 under the target policy)
            behavior_prob = 1 - epsilon + epsilon / 4
            W *= 1.0 / behavior_prob

        stats += result_sum
        if episode % 500 == 0 and episode != 0:
            print(f"episode: {episode}, success: {stats/episode}, epsilon: {epsilon}")

    print(f"episode: {num_episodes}, success: {stats/num_episodes}, epsilon: {epsilon}")

    env.close()
    return Q
 
Q = off_policy_mc_control(env, num_episodes, epsilon)
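Since the target policy here is the deterministic greedy policy, it can be read directly off the learned Q values (a one-line sketch, assuming the Q array returned above):

target_policy = np.argmax(Q, axis=1)  # best action index for each of the 16 states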

The off-policy Monte Carlo control method tends to converge much more slowly than the on-policy one, since each episode only contributes learning from its tail, up to the first step (working backwards) at which the behavior policy took a non-greedy action (the if action != best_action: break). In the runs above, the off-policy algorithm reaches only about one tenth of the success rate that the on-policy one achieves in the same number of episodes.

Conclusion

In this article, we covered Monte Carlo control methods, which differ from the previously discussed DP approach in that they are model-free and do not use bootstrapping. The article also discussed on-policy and off-policy control methods for maintaining sufficient exploration. In the next article, we will discuss an approach that learns from experience like Monte Carlo methods while also using bootstrapping like the DP approach.

Resources