Policy Gradient

Deep Reinforcement Learning Course Notes (Part 3)

Posted by Xiaoye's Blog on April 2, 2021

TOC

  1. Introduction
  2. From the Goal of RL to Policy Gradient
    1. Loss Function
  3. Variance Reduction Methods
    1. Reward to Go
    2. Discounting
    3. Baseline
  4. On-policy vs. Off-policy
  5. Reference

Introduction

This is a course-note summary of UC Berkeley's CS285 Deep Reinforcement Learning. Starting from the goal of reinforcement learning, we can derive the policy gradient method. However, the resulting estimator suffers from high variance; to address this, causality (reward-to-go) and baselines are used. Policy gradient methods are on-policy methods, and on-policy learning can be viewed as a special case of off-policy learning.

From the Goal of RL to Policy Gradient
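
A minimal sketch of the derivation, following standard CS285 notation (trajectory $\tau = (s_1, a_1, \dots, s_T, a_T)$, policy $\pi_\theta$, total reward $r(\tau)$). The goal of RL is to maximize expected total reward:

$$
\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[r(\tau)\big] =: \arg\max_\theta J(\theta),
\qquad
p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
$$

Using the log-derivative trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, and noting that the initial-state and dynamics terms of $\log p_\theta(\tau)$ do not depend on $\theta$:

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]
\approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)
$$

where the approximation averages over $N$ sampled trajectories. This Monte Carlo estimator is the REINFORCE algorithm: sample trajectories, estimate the gradient, and update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.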

Loss Function
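
In practice the gradient is not assembled by hand. Instead one implements a surrogate ("pseudo") loss whose automatic-differentiation gradient equals the policy gradient:

$$
\tilde{J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}
$$

where the weight $\hat{Q}_{i,t}$ (the total reward, or the reward-to-go defined below) is treated as a constant. A minimal PyTorch-style sketch; the function and tensor names are illustrative assumptions, not from the original notes:

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient equals the negated policy gradient.

    log_probs: (N, T) log pi_theta(a_t | s_t) of the sampled actions.
    weights:   (N, T) return weights Q_hat, e.g. reward-to-go.
    """
    # detach() keeps gradients from flowing into the return weights;
    # the minus sign turns gradient ascent on J into a loss to minimize.
    return -(log_probs * weights.detach()).sum(dim=1).mean()
```

Minimizing this loss with any stochastic optimizer performs gradient ascent on $J(\theta)$.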

Variance Reduction Methods
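
The Monte Carlo estimator above is unbiased but noisy: every log-probability term is multiplied by the full sampled return, so the gradient estimate can swing wildly from batch to batch. The three tricks below shrink the weights multiplying $\nabla_\theta \log \pi_\theta$ while keeping the estimator (almost) unbiased.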

Reward to Go
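
Causality: the action taken at time $t$ cannot affect rewards received before $t$, so each log-probability term only needs to be weighted by the rewards that follow it:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \underbrace{\left(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right)}_{\hat{Q}_{i,t},\ \text{the reward to go}}
$$

Dropping the earlier rewards leaves the estimator unbiased while shrinking the magnitude of the weights, and hence the variance. A small sketch for one trajectory (the helper name is an assumption):

```python
def reward_to_go(rewards: list[float]) -> list[float]:
    """Sum of future rewards from each time step, for one trajectory."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]  # accumulate r_t + r_{t+1} + ... + r_T
        rtg[t] = running
    return rtg
```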

Discounting
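
A discount factor $\gamma \in (0, 1]$ further down-weights rewards that are far in the future:

$$
\hat{Q}_{i,t} = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r(s_{i,t'}, a_{i,t'})
$$

This trades a small bias (it effectively optimizes a discounted objective) for another variance reduction. In the `reward_to_go` sketch above, it amounts to replacing the accumulation line with `running = rewards[t] + gamma * running`.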

Baseline
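
Subtracting a constant baseline $b$ from the weights leaves the gradient unbiased, because

$$
\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, b\big]
= b \int \nabla_\theta p_\theta(\tau)\, d\tau
= b\, \nabla_\theta \int p_\theta(\tau)\, d\tau
= b\, \nabla_\theta 1 = 0
$$

so the centered estimator

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left(\hat{Q}_{i,t} - b\right)
$$

is still unbiased but has lower variance when $b$ is chosen well. A simple choice is the average return $b = \frac{1}{N} \sum_i r(\tau_i)$; the lecture also derives the variance-minimizing constant baseline $b = \mathbb{E}[g(\tau)^2 r(\tau)] / \mathbb{E}[g(\tau)^2]$ with $g(\tau) = \nabla_\theta \log p_\theta(\tau)$, though the average return is what is typically used in practice.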

On-policy vs. Off-policy

All learning control methods face a dilemma: they seek to learn action values conditional on subsequent optimal behavior, but they need to behave non-optimally in order to explore all actions. How can we learn about the optimal policy while behaving according to an exploratory policy?

  1. On-policy learning is a compromise: it learns not the optimal policy itself, but a near-optimal policy that still explores.
    1. simpler and usually considered first
  2. Off-policy learning uses two policies: the target policy, which is the policy being learned, and the behavior policy, which generates the data (see the importance-sampling sketch after this list).
    1. requires additional concepts and notation
    2. often has greater variance and is slower to converge, because the data comes from a different policy
    3. more powerful and general
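
In the policy gradient setting, the off-policy objective is obtained with importance sampling: when trajectories are sampled from a different (e.g. older) policy $\pi_{\theta'}$, the expectation is re-weighted by the ratio of trajectory probabilities, in which the unknown dynamics terms cancel:

$$
J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, r(\tau)\right],
\qquad
\frac{p_\theta(\tau)}{p_{\theta'}(\tau)} = \prod_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}
$$

Setting $\theta' = \theta$ recovers the on-policy objective, which is the precise sense in which on-policy learning is a special case of off-policy learning.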

Reference

  1. UC Berkeley CS285: Deep Reinforcement Learning, Decision Making, and Control (lecture slides and homework explanations)
  2. 莫烦Python: Policy Gradient