Road to ML Engineer #41 - Temporal-Difference Learning

Last Edited: 2/16/2025

This blog post introduces temporal-difference learning in reinforcement learning.


So far, we've covered two approaches to solving and approximating MDPs: dynamic programming and Monte Carlo methods. In this article, we will discuss a method that combines the strengths of these previous approaches, temporal-difference learning.

Temporal-Difference

Monte Carlo methods solved the biggest problem of DP by requiring no assumption of a perfect model. However, they can only learn after an episode completes, since they average the returns observed over episodes to update the state-action values. In large environments, completing an episode often takes a long time, making the method slow to iterate. Hence, instead of using actual returns, we can bootstrap and use the estimated state-action value of the next state to update the current state-action value, similar to what is done in DP. By doing so, we arrive at temporal-difference learning, which no longer requires completing a full episode and can update values at every time step.

Q(s, a) = Q(s, a) + \alpha[R + \gamma Q(s', a') - Q(s, a)]

The above shows the update rule of temporal-difference learning. It is very similar to the one used in Monte Carlo methods, where α plays the role of the divisor used for taking simple or weighted averages, and the sampled return G takes the place of the bootstrapped target R + γQ(s', a'). (The γ was 1 in the previous implementation.) The next state and action s' and a' are chosen by the behavior policy, which may or may not be the same as the target policy. The term inside the square brackets is the temporal-difference error, which corrects the current state-action value based on experience, scaled by the learning rate α. Temporal-difference learning is guaranteed to converge to the optimal action values in the limit, provided the step sizes are decayed appropriately and all state-action pairs continue to be visited.
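To make the rule concrete, here is a small numerical sketch of a single update, using made-up values for the current estimates, the reward, and the hyperparameters:

# A single TD update with hypothetical values (unrelated to the FrozenLake code below).
alpha, gamma = 0.1, 0.9    # learning rate and discount factor (arbitrary choices)
Q_sa, Q_next = 0.5, 0.8    # current estimates of Q(s, a) and Q(s', a')
reward = 1.0               # reward R observed after taking action a in state s

td_error = reward + gamma * Q_next - Q_sa    # the term inside the square brackets
Q_sa += alpha * td_error                     # move Q(s, a) toward the TD target

print(td_error)    # 1.0 + 0.9 * 0.8 - 0.5 ≈ 1.22
print(Q_sa)        # 0.5 + 0.1 * 1.22 ≈ 0.622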

SARSA: On-Policy TD Control

SARSA (State-Action-Reward-State-Action) is an on-policy TD control method that uses the same policy as both the behavior and the target policy. To allow for sufficient exploration, we use ε-greedy policies, just like in on-policy Monte Carlo methods. The following code implementation shows SARSA on the FrozenLakeEnvironment.

def SARSA(env, num_episodes, epsilon, alpha=0.001, gamma=1):
    # Assumes `env` and `coord_to_num` are defined as in the previous articles.
    Q = np.zeros([16, 4])

    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False
        # epsilon-greedy for the first action
        if np.random.rand() > epsilon:
            action = np.argmax(Q[coord_to_num(state), :])
        else:
            action = env.action_space.sample()

        while not done:
            new_state, reward, done, _, _ = env.step(action)
            # epsilon-greedy for next action
            if np.random.rand() > epsilon:
                next_action = np.argmax(Q[coord_to_num(new_state), :])
            else:
                next_action = env.action_space.sample()
            # TD error uses the action actually chosen by the behavior policy (on-policy)
            TD = reward + gamma * Q[coord_to_num(new_state), next_action] - Q[coord_to_num(state), action]
            Q[coord_to_num(state), action] += alpha * TD
            state = new_state
            action = next_action

        if episode % 1000 == 0 and episode != 0:
            print(f"episode: {episode}/{num_episodes}")

    print(f"episode: {episode}/{num_episodes}")

    env.close()

    return Q

Q = SARSA(env, 100000, 0.4)

The algorithm is extremely similar to the one used in Monte Carlo methods, except that it updates with the temporal-difference error computed from the next state-action pair. Just like value iteration, we update the state-action value at every time step instead of waiting for an episode to complete. Since it is an on-policy algorithm, the next action is chosen by the same ε-greedy policy. You can run the above code to confirm that the policy informed by the updated Q ends up being the optimal ε-greedy policy.
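As an additional sanity check, you can extract the greedy policy from the learned Q table and print it as a grid. The sketch below assumes the 16-state layout from the code above and an action ordering of left, down, right, up, which may differ in your environment:

# A minimal sketch for inspecting the learned policy (the action ordering is an assumption).
action_symbols = ["<", "v", ">", "^"]    # assumed ordering: left, down, right, up
greedy_policy = np.argmax(Q, axis=1)     # greedy action for each of the 16 states

for row in range(4):
    print(" ".join(action_symbols[a] for a in greedy_policy[row * 4 : (row + 1) * 4]))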

Q-Learning: Off-Policy TD Control

The problem with off-policy Monte Carlo control methods is that they can only learn from the tails of episodes in which all remaining actions are greedy, because the importance sampling ratio becomes zero at the first non-greedy action. Q-learning, the off-policy TD control algorithm, does not have this issue: it learns from every step, using a fixed learning rate α and bootstrapping from the greedy target rather than weighting whole returns by importance sampling ratios. The following code implementation shows Q-learning for the FrozenLakeEnvironment.

def Q_Learning(env, num_episodes, epsilon, alpha=0.001, gamma=1):
    Q = np.zeros([16, 4])

    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False

        while not done:
            # epsilon-greedy behavior policy
            if np.random.rand() > epsilon:
                action = np.argmax(Q[coord_to_num(state), :])
            else:
                action = env.action_space.sample()
            new_state, reward, done, _, _ = env.step(action)
            next_action = np.argmax(Q[coord_to_num(new_state), :])  # greedy target policy
            # TD error bootstraps from the greedy action's value (off-policy)
            TD = reward + gamma * Q[coord_to_num(new_state), next_action] - Q[coord_to_num(state), action]
            Q[coord_to_num(state), action] += alpha * TD
            state = new_state

        if episode % 1000 == 0 and episode != 0:
            print(f"episode: {episode}/{num_episodes}")

    print(f"episode: {episode}/{num_episodes}")

    env.close()

    return Q

Q = Q_Learning(env, 100000, 0.4)

The only change from SARSA is the use of a greedy target policy when choosing next_action to compute the TD error. You can run the above code to confirm that the policy informed by the updated Q ends up being the optimal greedy policy. Because Q-learning directly learns the values of the greedy target policy regardless of the ε-greedy behavior policy being followed, and does not require the target policy to be ε-soft, it is often regarded as one of the most important breakthroughs in reinforcement learning.
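Beyond eyeballing the greedy policy, a rough way to compare the Q tables learned by SARSA and Q-learning is to roll out the greedy policy for many episodes and measure the average return. The sketch below reuses the env and coord_to_num interface from the training code above and is only meant as a quick check:

def evaluate_greedy(env, Q, num_episodes=1000):
    # Rough evaluation sketch: act greedily w.r.t. Q and average the (undiscounted) returns.
    total = 0.0
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = np.argmax(Q[coord_to_num(state), :])    # always act greedily
            state, reward, done, _, _ = env.step(action)
            total += reward
    return total / num_episodes

# e.g., evaluate_greedy(env, Q) after training with SARSA or Q-learning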

Eligibility Traces

The temporal-difference learning described above is a special case called one-step TD or TD(0), where we only look one step ahead. Instead of looking at just one step, we can wait for several steps, store relevant information such as the states, actions, and rewards in an eligibility trace, and update the values based on that trace. TD learning augmented with eligibility traces is referred to as TD(λ), which can use more steps per update and, depending on the scenario, converge to the optimal policy faster.

We can interpret TD(λ) in two ways. From a theoretical perspective, TD(λ) sits on the spectrum between TD(0) and Monte Carlo methods, in that each update incorporates more steps than TD(0) but fewer than a full episode. This is typically referred to as the forward view, since we think of the approach as looking further ahead than TD(0) but not as far as Monte Carlo methods. From a more mechanical point of view, TD(λ) looks back at the trace and updates the values of the state-action pairs recorded in it, which is referred to as the backward view. As practice, you can try implementing SARSA and Q-learning with TD(λ) for the FrozenLakeEnvironment based on the backward view, starting from the sketch below.
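As a starting point for that exercise, here is a minimal backward-view SARSA(λ) sketch with accumulating eligibility traces, built on the same env and coord_to_num interface as the SARSA code above; the trace-decay parameter lambda_ and its default value are assumptions you can tune:

def SARSA_lambda(env, num_episodes, epsilon, alpha=0.001, gamma=1, lambda_=0.9):
    Q = np.zeros([16, 4])

    for episode in range(num_episodes):
        E = np.zeros([16, 4])    # eligibility traces, reset at the start of each episode
        state, _ = env.reset()
        done = False
        if np.random.rand() > epsilon:
            action = np.argmax(Q[coord_to_num(state), :])
        else:
            action = env.action_space.sample()

        while not done:
            new_state, reward, done, _, _ = env.step(action)
            if np.random.rand() > epsilon:
                next_action = np.argmax(Q[coord_to_num(new_state), :])
            else:
                next_action = env.action_space.sample()

            TD = reward + gamma * Q[coord_to_num(new_state), next_action] - Q[coord_to_num(state), action]
            E[coord_to_num(state), action] += 1    # accumulate the trace for the visited pair
            Q += alpha * TD * E                    # credit every pair in proportion to its trace
            E *= gamma * lambda_                   # decay all traces
            state, action = new_state, next_action

    env.close()
    return Q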

Conclusion

In this article, we introduced temporal-difference learning, which combines the bootstrapping of DP with learning from interaction with the environment as in Monte Carlo methods. We also discussed on-policy TD control (SARSA) and off-policy TD control (Q-learning), along with their implementations. In the next article, we will finally delve into how we can harness the power of deep learning in the context of reinforcement learning.

Resources