
**Explaining the principles behind the Monte-Carlo and TD(λ) RL algorithms.**

Introduction

Taxonomy of RL algorithms

The figure below provides a general, though not all-encompassing, classification of deep RL algorithm designs.

The Monte-Carlo and Temporal-Difference algorithms we will introduce in this article are both typical model-free approaches.

Monte-Carlo Algorithm

For an episode $S_1, A_1, R_2, \dots, S_T$ generated under policy $\pi$, the return from time-step $t$ is defined as the total discounted reward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$$

Our objective is to learn the value function, which is the expected return:

$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$

Monte-Carlo approaches estimate $v_\pi(s)$ with the empirical mean of sampled episode returns.

To evaluate state $s$, we maintain a visit counter $N(s)$ and a total return $S(s)$:

- The **first/every** time-step $t$ that state $s$ is visited in an episode, increment the counter: $N(s) \leftarrow N(s) + 1$
- Increment the total return: $S(s) \leftarrow S(s) + G_t$
- Value is estimated by the mean return: $V(s) = S(s) / N(s)$
- By the **law of large numbers**, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$


This process can also be done incrementally:

For each state $S_t$ with return $G_t$:

$$N(S_t) \leftarrow N(S_t) + 1$$

$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$$

Often it is useful to replace $\frac{1}{N(S_t)}$ with a fixed step size $\alpha$ for non-stationary problems:

$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$$
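As a concrete illustration, here is a minimal sketch of every-visit incremental Monte-Carlo evaluation. The `mc_evaluate` function, the simplified `env.reset()`/`env.step()` interface returning `(next_state, reward, done)`, and the `policy(state)` callable are all hypothetical names assumed for this example, not part of the original article.

```python
from collections import defaultdict

def mc_evaluate(env, policy, num_episodes=1000, gamma=0.99, alpha=None):
    """Every-visit incremental Monte-Carlo policy evaluation (illustrative sketch)."""
    V = defaultdict(float)   # value estimates V(s)
    N = defaultdict(int)     # visit counts N(s)

    for _ in range(num_episodes):
        # Roll out one full episode, recording (state, reward) pairs.
        episode = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)   # assumed simplified interface
            episode.append((state, reward))
            state = next_state

        # Walk backwards to compute each return G_t, then update every visited state.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            step = alpha if alpha is not None else 1.0 / N[state]  # 1/N(s) or fixed alpha
            V[state] += step * (G - V[state])
    return V
```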

TD Algorithm

Instead of the **actual** return $G_t$, TD(0) updates the value $V(S_t)$ toward the (biased) **estimated** return $R_{t+1} + \gamma V(S_{t+1})$:

$$V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$$

- $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
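Under the same illustrative assumptions as before, a minimal sketch of tabular TD(0) prediction implementing this update; `td0_evaluate` and the `env`/`policy` interface are placeholders, not from the article.

```python
from collections import defaultdict

def td0_evaluate(env, policy, num_episodes=1000, gamma=0.99, alpha=0.1):
    """Tabular TD(0) policy evaluation (illustrative sketch)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target bootstraps from the current estimate of the next state;
            # terminal states contribute no future value.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])   # V <- V + alpha * delta
            state = next_state
    return V
```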


The TD target $R_{t+1} + \gamma V(S_{t+1})$ is not dependent on the final **outcome** of the episode, so TD can learn from incomplete episodes.

Dynamic Programming Backup

The dynamic programming backup prioritizes **exploration instead of** exploitation.

Certainty Equivalence

For the observed episodes, TD(0) converges to the solution of the maximum-likelihood MDP that best fits the data.

Its estimators are:

$$\hat{\mathcal{P}}^{a}_{s,s'} = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} \mathbf{1}\left(S_t^k = s,\, A_t^k = a,\, S_{t+1}^k = s'\right), \qquad \hat{\mathcal{R}}^{a}_{s} = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} \mathbf{1}\left(S_t^k = s,\, A_t^k = a\right) R_{t+1}^k$$

TD methods thereby **exploit** the Markov property, and are usually more efficient in Markov environments.
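As a sketch of how these certainty-equivalence estimators might be computed from data, the function below assumes episodes are stored as lists of `(state, action, reward, next_state)` tuples; this layout and the function name are illustrative assumptions, not anything specified in the article.

```python
from collections import defaultdict

def fit_mdp_model(episodes):
    """Maximum-likelihood MDP model from a batch of episodes (illustrative sketch)."""
    counts = defaultdict(lambda: defaultdict(int))   # N(s, a, s')
    reward_sum = defaultdict(float)                  # sum of rewards observed for (s, a)
    visits = defaultdict(int)                        # N(s, a)

    for episode in episodes:
        for state, action, reward, next_state in episode:
            visits[(state, action)] += 1
            counts[(state, action)][next_state] += 1
            reward_sum[(state, action)] += reward

    # P_hat[(s, a)][s'] = N(s, a, s') / N(s, a);  R_hat[(s, a)] = mean observed reward
    P_hat = {sa: {s2: n / visits[sa] for s2, n in nexts.items()}
             for sa, nexts in counts.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat
```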

Bootstrapping and Sampling

|  | Bootstrapping | Sampling |
| --- | --- | --- |
| Dynamic programming | Yes | No |
| Monte-Carlo | No | Yes |
| TD | Yes | Yes |

n-Step TD

Define the *n*-step return:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$$

The *n*-step TD update is then:

$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)$$
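A small sketch of computing $G_t^{(n)}$ from a recorded trajectory, assuming rewards and states are stored as parallel lists with `rewards[k]` holding $R_{k+1}$; the layout and helper name are assumptions for illustration.

```python
def n_step_return(rewards, states, V, t, n, gamma=0.99):
    """Compute G_t^(n) from a recorded trajectory (illustrative sketch)."""
    T = len(rewards)                        # episode length
    horizon = min(t + n, T)
    G = 0.0
    for k in range(t, horizon):             # discounted sum of up to n rewards
        G += (gamma ** (k - t)) * rewards[k]
    if t + n < T:                           # bootstrap unless the episode ended first
        G += (gamma ** n) * V[states[t + n]]
    return G
```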

TD(λ)

Lambda return and forward view

We combine all *n*-step returns $G_t^{(n)}$ using weights $(1-\lambda)\lambda^{n-1}$ to compute the λ-return, and formulate the forward-view of TD(λ) accordingly:

$$G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}, \qquad V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{\lambda} - V(S_t)\right)$$

It is worth noting that the weights decay by a factor of $\lambda$, and sum to one: $(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1} = 1$.

The weighting function can be illustrated below:

The weight assigned to the terminal return $G_t$ is the sum of all remaining weights, $\sum_{n \ge T-t}(1-\lambda)\lambda^{n-1} = \lambda^{T-t-1}$.
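Continuing the same illustrative setup, the forward-view λ-return for a finished episode could be computed as below, reusing the hypothetical `n_step_return` helper from the earlier sketch; the final term carries the terminal weight $\lambda^{T-t-1}$.

```python
def lambda_return(rewards, states, V, t, lam=0.9, gamma=0.99):
    """Forward-view lambda-return G_t^lambda for a finished episode (illustrative sketch)."""
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):               # truncated n-step returns, weight (1-lam)*lam^(n-1)
        G_lambda += (1 - lam) * (lam ** (n - 1)) * n_step_return(rewards, states, V, t, n, gamma)
    # The n >= T - t terms all equal the full Monte-Carlo return; their weights sum to lam^(T-t-1).
    G_lambda += (lam ** (T - t - 1)) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lambda
```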

Eligibility Traces

To assign a state's eligibility with respect to the eventual outcome, we combine the frequency and recency heuristics by writing:

$$E_0(s) = 0, \qquad E_t(s) = \gamma\lambda\, E_{t-1}(s) + \mathbf{1}(S_t = s)$$

**e.g.** for a sample episode, the eligibility traces evolve as illustrated below:

Backward view

We keep an eligibility trace $E_t(s)$ for every state $s$. At each time step, we compute the TD error $\delta_t$ and broadcast it backwards to all states in proportion to their traces:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \qquad V(s) \leftarrow V(s) + \alpha\,\delta_t\, E_t(s) \quad \text{for all } s$$

- When $\lambda = 0$, only the current state $S_t$ is updated, recovering TD(0).
- When $\lambda = 1$, the total update for a state is the same as the Monte-Carlo update. Therefore, TD(1) is roughly equivalent to every-visit Monte-Carlo.
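As shown below, a minimal sketch of backward-view TD(λ) with accumulating eligibility traces, under the same hypothetical `env`/`policy` interface as the earlier sketches:

```python
from collections import defaultdict

def td_lambda_backward(env, policy, num_episodes=1000, gamma=0.99, lam=0.9, alpha=0.1):
    """Backward-view TD(lambda) with accumulating traces (illustrative sketch)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                    # eligibility traces, reset each episode
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            E[state] += 1.0                       # accumulate trace for the visited state
            for s in list(E):                     # broadcast delta to all traced states
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam               # decay traces
            state = next_state
    return V
```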

**Theorem** The sum of offline updates is identical for forward-view and backward-view TD(λ):

$$\sum_{t=1}^{T} \alpha\,\delta_t\, E_t(s) = \sum_{t=1}^{T} \alpha\left(G_t^{\lambda} - V(S_t)\right) \mathbf{1}(S_t = s)$$

Error telescoping

When $\lambda = 1$, the sum of TD errors telescopes into the MC error:

$$\delta_t + \gamma\,\delta_{t+1} + \gamma^2\,\delta_{t+2} + \dots + \gamma^{T-1-t}\,\delta_{T-1} = G_t - V(S_t)$$

For general $\lambda$, the TD errors also telescope into the λ-error, $G_t^{\lambda} - V(S_t)$:

$$G_t^{\lambda} - V(S_t) = \delta_t + (\gamma\lambda)\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \dots$$

Consider an episode where $s$ is visited once at time-step $k$: its eligibility trace then decays as $(\gamma\lambda)^{t-k}$ for $t \ge k$, so the accumulated backward-view updates for $s$ sum to exactly this telescoped error.
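For completeness, here is the telescoping worked out for the $\lambda = 1$ case, under the standard assumption that $V$ is held fixed during the episode and $V(S_T) = 0$ at the terminal state:

$$
\begin{aligned}
\sum_{k=t}^{T-1} \gamma^{k-t}\,\delta_k
&= \sum_{k=t}^{T-1} \gamma^{k-t}\bigl(R_{k+1} + \gamma V(S_{k+1}) - V(S_k)\bigr) \\
&= \underbrace{\sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1}}_{=\,G_t}
\;+\; \sum_{k=t}^{T-1} \bigl(\gamma^{k-t+1} V(S_{k+1}) - \gamma^{k-t} V(S_k)\bigr) \\
&= G_t + \gamma^{T-t} V(S_T) - V(S_t) \;=\; G_t - V(S_t).
\end{aligned}
$$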

Online and offline updates

**Offline** updates

- Updates are accumulated within the episode
- but applied in batch at the end of the episode

**Online** updates

- Updates are applied online at each step within the episode
- Forward-view and backward-view TD(λ) are slightly different
- Exact online TD(λ) achieves perfect equivalence

Here, "equivalence" refers to **equivalence** in the total update at the end of the episode.



