TOC
- TOC
- Introduction
- From the Goal of RL to Policy Gradient
- Variance Reduction Methods
- On-policy vs. Off-policy
- Reference
Introduction
This post summarizes the policy gradient lectures of UC Berkeley's CS285 (Deep Reinforcement Learning). Starting from the goal of reinforcement learning, we can derive policy gradient methods. However, the resulting estimator suffers from high variance, so two variance reduction techniques are introduced: causality (reward-to-go) and baselines. Policy gradient is an on-policy method, and on-policy learning can be viewed as a special case of off-policy learning.
From the Goal of RL to Policy Gradient
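In the standard notation used throughout the lectures, the goal of reinforcement learning is to find the policy parameters that maximize the expected total reward over trajectories:

$$
\theta^{\star} = \arg\max_{\theta} J(\theta), \qquad
J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]
$$

where the trajectory distribution factorizes as $p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$.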
Loss Function
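$J(\theta)$ cannot be differentiated directly because the trajectory distribution is unknown, but the log-derivative trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$ rewrites the gradient as an expectation we can sample. The initial-state and transition terms do not depend on $\theta$, so only the policy terms survive:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]
$$

In practice this is estimated with $N$ sampled trajectories (the REINFORCE estimator), and implemented as a surrogate loss whose gradient matches the estimate:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)
$$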
Variance Reduction Methods
Reward to Go
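Causality means the policy at time $t$ cannot affect rewards already received at earlier times $t' < t$, so those terms can be dropped from each trajectory's sum. Removing terms that are zero in expectation leaves the estimator unbiased while reducing its variance:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \underbrace{\left(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right)}_{\hat{Q}_{i,t},\ \text{the reward to go}}
$$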
Discounting
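Discounting down-weights rewards far in the future by a factor $\gamma \in (0, 1)$, which keeps the reward-to-go bounded and further reduces variance, at the cost of biasing the objective toward near-term reward:

$$
\hat{Q}_{i,t} = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r(s_{i,t'}, a_{i,t'})
$$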
Baseline
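Subtracting a baseline $b$, such as the average return, centers the weights so that better-than-average trajectories are made more likely and worse-than-average ones less likely. The estimator stays unbiased because $\mathbb{E}_{\tau \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left(\hat{Q}_{i,t} - b\right), \qquad b = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} r(s_{i,t}, a_{i,t})
$$

As a minimal NumPy sketch (the helper name and reward values are illustrative, not from the course code), discounted reward-to-go with a batch-mean baseline looks like:

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    # q[t] = sum over t' >= t of gamma^(t'-t) * rewards[t'], computed backwards in O(T)
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

# Hypothetical rewards from one rollout, for illustration only
rewards = np.array([1.0, 0.0, 2.0, 1.0])
q_hat = reward_to_go(rewards)
advantage = q_hat - q_hat.mean()  # subtract a constant (batch-mean) baseline
```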
On-policy vs. Off-policy
All learning control methods face a dilemma: they seek to learn action values conditional on subsequent optimal behavior, yet they must behave non-optimally in order to explore all actions. How can we learn the optimal policy while acting according to an exploratory policy?
- On-policy learning is a compromise: it learns action values not for the optimal policy, but for a near-optimal policy that still explores.
    - simpler, and usually considered first
- Off-policy learning uses two policies: the target policy, which is the policy being learned, and the behavior policy, which is used to generate the data (see the importance sampling sketch after this list).
    - requires additional concepts and notation
    - often of greater variance and slower to converge, because the data comes from a different policy
    - more powerful and general
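The two settings are connected by importance sampling. A sketch in standard notation (with $\theta'$ denoting the behavior policy's parameters): samples from a behavior policy $\pi_{\theta'}$ can be reweighted to estimate the objective under the target policy $\pi_\theta$, and the initial-state and transition probabilities cancel in the ratio, leaving only policy terms:

$$
J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)} \sum_{t=1}^{T} r(s_t, a_t)\right],
\qquad
\frac{p_\theta(\tau)}{p_{\theta'}(\tau)} = \prod_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}
$$

On-policy learning is the special case $\theta' = \theta$, where every importance weight equals 1; in this sense it is a special case of off-policy learning.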