Explaining the principles behind the MC and TD(λ) RL algorithms.

## Introduction

### Taxonomy of RL algorithms

The figure below provides a general, though not all-encompassing, classification of deep RL algorithm designs.
![notion image](https://s2.loli.net/2024/03/12/Sx7VTJ9hmjpsa1W.png?t=d9c863a5-d8ec-4433-9d59-f3d4ca34b2b4)
The Monte-Carlo and Temporal-Difference algorithms we will introduce in this article are both typical model-free approaches.
## Monte-Carlo Algorithm

For an episode $S_1, A_1, R_2, \dots, S_T$ under policy $\pi$, the return is defined as the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$. Our objective is to learn the value function, which is the expected return: $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$.

Monte-Carlo approaches approximate this expectation with the empirical mean of sampled episode returns.

To evaluate state $s$, we maintain the arrays $N(s)$ and $S(s)$:

- The first/every time-step $t$ that state $s$ is visited in an episode, increment the counter $N(s) \leftarrow N(s) + 1$
- Increment the total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by the mean return $V(s) = S(s) / N(s)$
- By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2F6b1892cd-f2ce-4376-9599-aed84e87373d%2FUntitled.png?table=block&id=d09dbdbe-fd0a-4691-b366-565cba23d127&t=d09dbdbe-fd0a-4691-b366-565cba23d127&width=384&cache=v2)
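The procedure above can be sketched in Python. This is a minimal first-visit variant; the episode format (a list of `(state, reward)` pairs, with the reward received on leaving the state) and the function name are our assumptions, not part of the original algorithm statement:

```python
from collections import defaultdict

def mc_first_visit(episodes, gamma=0.9):
    """First-visit Monte-Carlo evaluation over sampled episodes."""
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)
    for episode in episodes:
        # compute G_t for every step by scanning the episode backwards
        G, returns = 0.0, []
        for _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        seen = set()
        for (s, _), G_t in zip(episode, returns):
            if s in seen:   # first-visit: count each state once per episode
                continue
            seen.add(s)
            N[s] += 1       # N(s) <- N(s) + 1
            S[s] += G_t     # S(s) <- S(s) + G_t
    return {s: S[s] / N[s] for s in N}  # V(s) = S(s) / N(s)
```

Dropping the `seen` check gives the every-visit variant.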
This process can also be done incrementally. For each state $S_t$ with return $G_t$:

$$N(S_t) \leftarrow N(S_t) + 1$$

$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$$

Often it is useful to replace $\frac{1}{N(S_t)}$ with a constant step size $\alpha$ for non-stationary problems, giving $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$.
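One incremental update step might look like this sketch (the helper name and dict-based tables are our choices; passing a constant `alpha` gives the non-stationary variant):

```python
def mc_incremental_update(V, N, s, G, alpha=None):
    """One incremental MC update for state s with observed return G.

    alpha=None uses the running-mean step 1/N(s); a constant alpha
    gives the non-stationary variant V <- V + alpha * (G - V).
    """
    N[s] = N.get(s, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[s]
    v = V.get(s, 0.0)
    V[s] = v + step * (G - v)
```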
## TD Algorithm

Instead of the actual return $G_t$, TD(0) updates the value toward the (biased) estimated return $R_{t+1} + \gamma V(S_{t+1})$:

$$V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$$

- $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2F2d70dbad-f0d1-4ece-866b-b43752ff4ae1%2FUntitled.png?table=block&id=5a046cfe-3071-4c87-a8e8-48a3db4b3c8d&t=5a046cfe-3071-4c87-a8e8-48a3db4b3c8d&width=708&cache=v2)
Unlike the MC return, the TD target $R_{t+1} + \gamma V(S_{t+1})$ is not dependent on the final outcome, so TD can learn online from incomplete episodes.
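A single TD(0) backup can be sketched as follows (the dict-based value table and helper name are our assumptions):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """One TD(0) backup: V(S_t) <- V(S_t) + a * (R + g V(S_{t+1}) - V(S_t))."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    target = r + gamma * v_next      # the TD target
    delta = target - V.get(s, 0.0)   # the TD error delta_t
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```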
### Dynamic Programming Backup

DP performs a full-width backup using the known model, updating over all possible successors instead of a sampled transition: $V(S_t) \leftarrow \mathbb{E}_\pi\left[R_{t+1} + \gamma V(S_{t+1})\right]$.
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2Fcd827fe6-143e-4f26-b9f7-bf823b8cca1a%2FUntitled.png?table=block&id=0dae33c2-54fe-4919-8a80-d0b96352e700&t=0dae33c2-54fe-4919-8a80-d0b96352e700&width=2209&cache=v2)
### Certainty Equivalence

MC converges to the solution with minimum mean-squared error, i.e. the best fit to the returns for the observed episodes:

$$\sum_{k=1}^{K}\sum_{t=1}^{T_k}\left(G_t^k - V(s_t^k)\right)^2$$

TD(0) converges to the solution of the maximum-likelihood Markov model $\langle \mathcal{S}, \mathcal{A}, \hat{\mathcal{P}}, \hat{\mathcal{R}} \rangle$ that best fits the data. Its estimators are:

$$\hat{\mathcal{P}}_{s,s'}^a = \frac{1}{N(s,a)}\sum_{k=1}^{K}\sum_{t=1}^{T_k}\mathbf{1}\left(s_t^k, a_t^k, s_{t+1}^k = s, a, s'\right)$$

$$\hat{\mathcal{R}}_s^a = \frac{1}{N(s,a)}\sum_{k=1}^{K}\sum_{t=1}^{T_k}\mathbf{1}\left(s_t^k, a_t^k = s, a\right) r_t^k$$

TD exploits the Markov property and is usually more efficient in Markov environments; MC does not, and is usually more effective in non-Markov environments.
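The maximum-likelihood model can be estimated by simple counting. Here is a sketch for the state-only (MRP) case, assuming episodes stored as `(s, r, s_next)` transitions with `s_next = None` at termination (this format and the function name are our assumptions):

```python
from collections import defaultdict

def fit_mrp(episodes):
    """Maximum-likelihood MRP <P_hat, R_hat> from observed transitions."""
    count = defaultdict(int)   # N(s)
    trans = defaultdict(int)   # transition counts (s, s')
    rew = defaultdict(float)   # summed rewards out of s
    for episode in episodes:
        for s, r, s_next in episode:
            count[s] += 1
            rew[s] += r
            if s_next is not None:
                trans[(s, s_next)] += 1
    P = {k: c / count[k[0]] for k, c in trans.items()}  # P_hat(s' | s)
    R = {s: rew[s] / count[s] for s in count}           # R_hat(s)
    return P, R
```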
### Bootstrapping and Sampling

Bootstrapping means the update involves an estimate of the value function; sampling means the update is based on sampled experience rather than a full-width expectation:

| | Bootstrapping | Sampling |
| --- | --- | --- |
| DP | ✓ | ✗ |
| MC | ✗ | ✓ |
| TD | ✓ | ✓ |
## n-Step TD

Define the n-step return:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

n-step TD then updates toward it:

$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)$$
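A sketch of the n-step return, under an assumed episode layout (`states[0..T]`, `rewards[0..T-1]`, with `rewards[i]` received on the transition out of `states[i]`; the terminal value is taken as 0):

```python
def n_step_return(states, rewards, V, t, n, gamma=0.9):
    """n-step return G_t^(n) = R_{t+1} + ... + g^{n-1} R_{t+n} + g^n V(S_{t+n})."""
    T = len(rewards)
    # discounted rewards up to n steps ahead (truncated at episode end)
    G = sum(gamma ** k * rewards[t + k] for k in range(min(n, T - t)))
    if t + n <= T:  # bootstrap from V(S_{t+n}); absent/terminal states give 0
        G += gamma ** n * V.get(states[t + n], 0.0)
    return G
```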
## TD(λ)

### Lambda return and forward view

We combine all n-step returns $G_t^{(n)}$ using the weights $(1-\lambda)\lambda^{n-1}$ to compute the λ-return, and formulate the forward-view of TD(λ) accordingly:

$$G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1} G_t^{(n)}$$

$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^\lambda - V(S_t)\right)$$

It is worth noting that the weights decay by a factor of $\lambda$, and $(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1} = 1$.
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2Fe368c37c-7f7e-4494-a2b3-d589ef99c76c%2FUntitled.png?table=block&id=7c190cfd-5ee7-424a-8e79-85aee5447d3d&t=7c190cfd-5ee7-424a-8e79-85aee5447d3d&width=392&cache=v2)
The weighting function can be illustrated below:
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2Ff2db270e-1035-4503-b40b-900808cbabce%2FUntitled.png?table=block&id=710b6773-0a17-4662-af2c-fd2aa8f7cb4a&t=710b6773-0a17-4662-af2c-fd2aa8f7cb4a&width=658&cache=v2)
The weight assigned to the terminal point is the tail sum $(1-\lambda)\sum_{n=T-t}^{\infty}\lambda^{n-1} = \lambda^{T-t-1}$.
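The λ-return can then be computed by weighting the finite n-step returns and placing the remaining weight $\lambda^{T-t-1}$ on the terminal (MC) return. A sketch under the same assumed episode layout as before (`states[0..T]`, `rewards[0..T-1]`):

```python
def lambda_return(states, rewards, V, t, lam=0.8, gamma=0.9):
    """Forward-view lambda-return G_t^lambda for one episodic trajectory."""
    T = len(rewards)

    def g_n(n):  # n-step return G_t^(n), bootstrapping from V
        G = sum(gamma ** k * rewards[t + k] for k in range(min(n, T - t)))
        if t + n <= T:
            G += gamma ** n * V.get(states[t + n], 0.0)
        return G

    # finite n-step returns with weights (1 - lam) * lam^(n-1)
    G_lam = sum((1 - lam) * lam ** (n - 1) * g_n(n) for n in range(1, T - t))
    # the terminal (MC) return receives all remaining weight lam^(T-t-1)
    G_lam += lam ** (T - t - 1) * g_n(T - t)
    return G_lam
```

With `lam=1` this reduces to the full MC return, and with `lam=0` to the one-step TD target.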
### Eligibility Traces

To assign a state's eligibility with respect to the outcome, we combine its frequency and recency heuristics by writing:

$$E_0(s) = 0, \qquad E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$$
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2Fa637a43d-2b67-458a-b7d4-1743d7d022d1%2FUntitled.png?table=block&id=c30f528b-3068-48bb-a666-bb11674cb699&t=c30f528b-3068-48bb-a666-bb11674cb699&width=534&cache=v2)
e.g. for an episode in which a state is visited at several time-steps, its trace spikes at each visit and decays by $\gamma\lambda$ in between, as in the figure above.
### Backward view

We keep an eligibility trace $E_t(s)$ for every state $s$. At each time step, we compute the TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ and broadcast it backwards:

$$V(s) \leftarrow V(s) + \alpha\delta_t E_t(s)$$
![notion image](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F9d951ede-b583-43ae-b13b-ac8e13442a35%2F1d4928e5-4803-4a7c-bb11-21a1f9072224%2FUntitled.png?table=block&id=b5ee7446-8e98-4689-be8f-91c6e1ffd7e0&t=b5ee7446-8e98-4689-be8f-91c6e1ffd7e0&width=632&cache=v2)
- When $\lambda = 0$, only the current state is updated, exactly as in TD(0).
- When $\lambda = 1$, the total update for a state is the same as the MC update. Therefore, TD(1) is roughly equivalent to every-visit MC.
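A backward-view TD(λ) sweep over one episode might look like this sketch (not a tuned implementation; the `(s, r, s_next)` transition format with `s_next = None` at termination is our assumption):

```python
def td_lambda_episode(episode, V, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda) over one episode of (s, r, s_next) transitions."""
    E = {}  # eligibility traces E(s)
    for s, r, s_next in episode:
        v_next = V.get(s_next, 0.0) if s_next is not None else 0.0
        delta = r + gamma * v_next - V.get(s, 0.0)  # TD error delta_t
        for state in list(E):                       # decay every trace by gamma * lam
            E[state] *= gamma * lam
        E[s] = E.get(s, 0.0) + 1.0                  # bump the visited state
        for state, e in E.items():                  # broadcast delta backwards
            V[state] = V.get(state, 0.0) + alpha * delta * e
    return V
```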
**Theorem.** The sum of offline updates is identical for the forward-view and backward-view TD(λ):

$$\sum_{t=1}^{T}\alpha\delta_t E_t(s) = \sum_{t=1}^{T}\alpha\left(G_t^\lambda - V(S_t)\right)\mathbf{1}(S_t = s)$$
### Error telescoping

When $\lambda = 1$, the sum of TD errors telescopes into the MC error:

$$\delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \dots + \gamma^{T-1-t}\delta_{T-1} = G_t - V(S_t)$$

For general $\lambda$, TD errors also telescope to the λ-error, $G_t^\lambda - V(S_t)$:

$$G_t^\lambda - V(S_t) = \sum_{k=t}^{T-1}(\gamma\lambda)^{k-t}\delta_k$$

Consider an episode where $s$ is visited once at time-step $k$: its trace decays as $(\gamma\lambda)^{t-k}$ after the visit, so the backward-view updates accumulate exactly this discounted sum of TD errors.
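The λ = 1 telescoping identity can be checked numerically by holding V fixed over an episode (the offline setting). The helper below is purely illustrative, using the same assumed `states`/`rewards` layout as earlier:

```python
def check_telescoping(states, rewards, V, gamma=0.9):
    """Return (discounted sum of TD errors from t=0, MC error G_0 - V(S_0))."""
    T = len(rewards)
    # full discounted return G_0
    G0 = sum(gamma ** k * rewards[k] for k in range(T))
    # discounted sum of TD errors, with the terminal value taken as 0
    td_sum = 0.0
    for t in range(T):
        v_next = V.get(states[t + 1], 0.0) if t + 1 < T else 0.0
        delta = rewards[t] + gamma * v_next - V.get(states[t], 0.0)
        td_sum += gamma ** t * delta
    return td_sum, G0 - V.get(states[0], 0.0)
```

The two returned quantities agree for any fixed V, which is exactly the λ = 1 identity.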
### Online and offline updates

**Offline updates**

- Updates are accumulated within the episode
- but applied in batch at the end of the episode

**Online updates**

- Updates are applied online at each step within the episode
- Forward and backward-view TD(λ) are slightly different
- Exact online TD(λ) achieves perfect equivalence

Equivalence here means equality of the total update at the end of the episode.
## References

- Berkeley CS285
- Author:VernonWu
- URL:https://vernonwu.com/article/TD_RL
- Copyright: All articles in this blog, except where otherwise stated, are licensed under the BY-NC-SA agreement. Please indicate the source!