Introduction
These are course notes for UC Berkeley CS285: Deep Reinforcement Learning.
Intuition, Definition and Notation
Instead of learning from sparse rewards or a manually specified reward function, the agent learns a policy by imitating expert demonstrations.
The process is illustrated below:
Notations:
- $s_t$: state of the environment at time $t$
- $o_t$: observation of the agent at time $t$
- $a_t$: action of the agent at time $t$
- $\pi_{\theta}(a_t | o_t)$: policy (conditioned on observations)
- $\pi_{\theta}(a_t | s_t)$: policy (if the environment is fully observed)
- $p(s_{t+1} | s_t, a_t)$: transition probability of $s_{t+1}$ given $s_t$ and $a_t$
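To make the notation concrete, here is a minimal trajectory-sampling sketch, assuming a gym-style environment with the classic `reset()`/`step()` four-tuple API; `env` and `policy` are placeholders, not part of the course code:

```python
import numpy as np

def sample_trajectory(env, policy, max_steps=200):
    """Roll out pi_theta(a_t | o_t); the environment's internal dynamics
    play the role of p(s_{t+1} | s_t, a_t). Illustrative interface only."""
    obs = env.reset()                      # o_0
    observations, actions = [], []
    for t in range(max_steps):
        a_t = policy(obs)                  # a_t ~ pi_theta(a_t | o_t)
        observations.append(obs)
        actions.append(a_t)
        obs, reward, done, info = env.step(a_t)   # s_{t+1} ~ p(. | s_t, a_t)
        if done:
            break
    return np.array(observations), np.array(actions)
```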
Algorithms
Type | Advantage | Disadvantage | Use When |
---|---|---|---|
Behavior Cloning | simple | no long-term planning; errors can compound | the application is simple |
Direct Policy Learning | long-term planning | needs an interactive expert to provide demonstrations | the application is complex and interactive demonstrations are available |
Inverse Reinforcement Learning | long-term planning; no need for interactive demonstrations | can be hard to train | interactive demonstrations are not available; the reward function is easier to learn than the expert policy |
Behavior Cloning
Definition: learn the expert's policy directly from demonstrated (observation, action) pairs with supervised learning.
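A minimal behavior-cloning sketch in PyTorch, assuming the expert data is already available as arrays of observation-action pairs with continuous actions; `expert_obs` and `expert_acts` are placeholder names, not from the course:

```python
import torch
import torch.nn as nn

def behavior_cloning(expert_obs, expert_acts, obs_dim, act_dim,
                     epochs=100, lr=1e-3):
    """Fit pi_theta(a|o) to expert (o, a) pairs by plain supervised
    regression (MSE loss for continuous actions)."""
    policy = nn.Sequential(
        nn.Linear(obs_dim, 64), nn.Tanh(),
        nn.Linear(64, act_dim),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    obs = torch.as_tensor(expert_obs, dtype=torch.float32)
    acts = torch.as_tensor(expert_acts, dtype=torch.float32)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(obs), acts)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```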
Direct Policy Learning (by interactive demonstrator)
Definition: behavior cloning is a special case of direct policy learning. We first train a policy with supervised learning, then roll out that policy and collect the states it visits. The expert then demonstrates the correct actions for those states, giving us more data, and we go through the loop again.
During learning, the agent should 'remember' all the mistakes it has made. Two approaches are used:
- Data aggregation (DAgger): the agent is trained on all training data collected so far (the visited states are relabeled by the expert, and these relabeled data are used to train the agent; refer to the homework code). See the sketch after this list.
- Policy aggregation: the agent is trained only on data from the last iteration, and the resulting policy is combined with the policies from previous rounds by geometric blending.
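A rough DAgger sketch reusing the `behavior_cloning` and `sample_trajectory` helpers above; `query_expert` stands in for the interactive expert labeler and is hypothetical:

```python
import numpy as np
import torch

def dagger(env, query_expert, expert_obs, expert_acts, obs_dim, act_dim,
           n_iters=10):
    """DAgger: train on all aggregated data, roll out the current policy,
    have the expert relabel the visited observations, and repeat."""
    all_obs, all_acts = list(expert_obs), list(expert_acts)
    for _ in range(n_iters):
        # 1. Train on the aggregated dataset (all iterations so far).
        policy = behavior_cloning(np.array(all_obs), np.array(all_acts),
                                  obs_dim, act_dim)

        # 2. Roll out the *current* policy to see which states it visits.
        def act(o):
            with torch.no_grad():
                return policy(torch.as_tensor(o, dtype=torch.float32)).numpy()
        visited_obs, _ = sample_trajectory(env, act)

        # 3. Ask the expert for the correct action at each visited
        #    observation, then aggregate the new data and loop.
        all_obs.extend(visited_obs)
        all_acts.extend(query_expert(o) for o in visited_obs)
    return policy
```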
Inverse Reinforcement Learning
Definition: learn the reward function from expert demonstrations, then train the agent with standard RL algorithms under that learned reward.
The procedure is as follows:
There are two types of IRL: model-free and model-given (i.e., the dynamics model is known).
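A heavily simplified IRL sketch of the "learn a reward, then run RL under it" loop: it fits a linear reward by pushing the expert's feature expectations above the current policy's, a crude feature-matching rule rather than a specific published algorithm; `run_rl` and `sample_policy_feats` are hypothetical placeholders for the inner RL step and policy rollouts:

```python
import numpy as np

def simple_irl(expert_feats, sample_policy_feats, run_rl,
               n_iters=20, lr=0.1):
    """Learn a linear reward r(s) = w . phi(s) whose induced policy
    matches the expert's feature expectations, then hand it to RL."""
    feat_dim = expert_feats.shape[1]
    w = np.zeros(feat_dim)
    mu_expert = expert_feats.mean(axis=0)      # expert feature expectations
    for _ in range(n_iters):
        reward_fn = lambda phi: phi @ w        # current reward estimate
        policy = run_rl(reward_fn)             # inner RL step (placeholder)
        mu_policy = sample_policy_feats(policy).mean(axis=0)
        # Move the reward so expert features score higher than the policy's.
        w += lr * (mu_expert - mu_policy)
    return w, policy
```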
Applications
- Game AI
- Structured prediction
- etc.
Refer to the imitation learning tutorial video from ICML 2018 for more application examples.