TOC
Introduction
These are course notes for UC Berkeley CS285: Deep Reinforcement Learning.
Key Concepts and Terminology
Below is the agent-environment interaction loop:
States, Observations, Action Spaces and Trajectories
- State $s$: a complete description of the state of the world
- Observation $o$: a possibly partial description of a state; an environment can be fully observed or partially observed
- Action space: the set of all valid actions $a$ in a given environment. There are two types of action spaces:
- discrete action spaces, such as Atari and Go
- continuous action spaces, such as when the agent controls a robot in the physical world
- A trajectory $\tau$ is a sequence of states and actions in the world. It is also called an episode or a rollout. The very first state of the world, $s_0$, is randomly sampled from the start-state distribution, denoted $\rho_0$: $s_0\sim\rho_0(\cdot)$. A minimal sketch of collecting a trajectory follows this list.
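Below is a small sketch of this, assuming NumPy; `rho_0`, `policy`, and `step_fn` are hypothetical callables standing in for a concrete start-state distribution, policy, and environment dynamics:

```python
import numpy as np

def sample_trajectory(rho_0, policy, step_fn, horizon):
    """Collect tau = ((s_0, a_0), (s_1, a_1), ...) with s_0 ~ rho_0(.)."""
    s = rho_0()                        # s_0 ~ rho_0(.)
    tau = []
    for _ in range(horizon):
        a = policy(s)                  # a_t ~ pi(. | s_t)
        tau.append((s, a))
        s = step_fn(s, a)              # s_{t+1} ~ P(. | s_t, a_t)
    return tau

# Toy 1-D random-walk "environment" just to make the sketch runnable
rho_0 = lambda: 0.0
policy = lambda s: np.random.choice([-1.0, 1.0])
step_fn = lambda s, a: s + a + 0.1 * np.random.randn()
print(sample_trajectory(rho_0, policy, step_fn, horizon=5))
```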
Markov Decision Process (MDP)
Markov property: the transition to the next state depends only on the current state and action, not on any prior history.
Markov chain:
Fully observed Markov decision process:
Partially observed Markov decision process:
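In the standard tuple notation, these can be summarized as:
- Markov chain: $\mathcal{M}=\{\mathcal{S},\mathcal{T}\}$, with a state space $\mathcal{S}$ and a transition operator $\mathcal{T}$ representing $p(s_{t+1}|s_t)$
- Fully observed MDP: $\mathcal{M}=\{\mathcal{S},\mathcal{A},\mathcal{T},r\}$, which adds an action space $\mathcal{A}$, transitions $p(s_{t+1}|s_t,a_t)$, and a reward function $r(s_t,a_t)$
- Partially observed MDP: $\mathcal{M}=\{\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{E},r\}$, which further adds an observation space $\mathcal{O}$ and an emission probability $\mathcal{E}$ representing $p(o_t|s_t)$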
Policies
Policy: the rule the agent uses to decide which action to take. Usually, $\theta$ denotes the parameters of the policy.
- Deterministic policies: $a_t=\mu_{\theta}(s_t)$
- Stochastic policies: $a_t\sim\pi_{\theta}(\cdot|s_t)$
For stochastic policies, there are two common kinds (both sketched after this list):
- Categorical policies, which act like a classifier over discrete actions
- Diagonal Gaussian policies, which output a Gaussian distribution over continuous actions
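A minimal PyTorch sketch of both kinds is below; the network sizes and the state-independent log standard deviation are illustrative choices, not prescriptions from the course:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class CategoricalPolicy(nn.Module):
    """Stochastic policy over a discrete action space: a classifier over actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),          # outputs logits
        )

    def forward(self, obs):
        return Categorical(logits=self.net(obs))   # pi_theta(. | s)

class DiagonalGaussianPolicy(nn.Module):
    """Stochastic policy over a continuous action space."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),            # mean of the Gaussian
        )
        # state-independent log standard deviation (a common simplification)
        self.log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

    def forward(self, obs):
        return Normal(self.mu_net(obs), self.log_std.exp())

# Sampling a_t ~ pi_theta(. | s_t) from the categorical policy:
obs = torch.randn(1, 4)                            # dummy observation
pi = CategoricalPolicy(obs_dim=4, n_actions=2)
dist = pi(obs)
a = dist.sample()
logp = dist.log_prob(a)                            # log pi_theta(a_t | s_t)
```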
Reward and Return
The reward function $R$ depends on the current state of the world, the action just taken, and the next state of the world: $r_t=R(s_t, a_t, s_{t+1})$
The goal of the agent is to maximize some notion of cumulative reward over a trajectory.
- For the finite-horizon undiscounted return: $R(\tau)=\sum_{t=0}^{T}r_t$
- For the infinite-horizon discounted return: $R(\tau)=\sum_{t=0}^{\infty}\gamma^{t}r_t$, where $\gamma\in(0,1)$ is the discount factor (a numerical sketch follows below)
Return vs. reward:
- The reward is the immediate payoff received right after executing an action.
- The return is the cumulative reward accumulated along a state-action sequence up to a given state-action pair (i.e., the sum of the immediate rewards received before reaching that state).
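As a small numerical sketch (assuming NumPy), the discounted return of a recorded reward sequence can be computed as follows:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for one recorded trajectory."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Immediate rewards r_0, r_1, r_2 of a three-step trajectory
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```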
The Goal of RL
The goal of RL is to select a policy which maximizes expected return when the agent acts according to it.
In the case where both the environment transitions and the policy are stochastic, the probability of a T-step trajectory is:
$P(\tau| \pi) = \rho_0(s_0)\prod_{t=0}^{T-1}P({s_{t+1}|s_t,a_t})\pi(a_t|s_t)$
The expected return is:
$J(\pi)=\int_{\tau}P(\tau|\pi)R(\tau)\,d\tau=E_{\tau\sim\pi}[R(\tau)]$
The central optimization problem in RL can be expressed by:
$\pi^{*}=\arg\max_{\pi}J(\pi)$
with $\pi^{*}$ being the optimal policy.
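In practice $J(\pi)$ is usually estimated with Monte Carlo rollouts. A minimal sketch, assuming a Gymnasium-style environment interface and a `policy(obs) -> action` callable (both placeholders here):

```python
import numpy as np

def estimate_expected_return(env, policy, n_episodes=100, gamma=0.99):
    """Monte Carlo estimate of J(pi) = E_{tau~pi}[R(tau)]."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, t, ret = False, 0, 0.0
        while not done:
            action = policy(obs)                                   # a_t ~ pi(. | s_t)
            obs, reward, terminated, truncated, _ = env.step(action)
            ret += (gamma ** t) * reward                           # accumulate gamma^t * r_t
            done = terminated or truncated
            t += 1
        returns.append(ret)
    return float(np.mean(returns))                                 # sample mean approximates J(pi)
```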
Bellman Equations, Value Function, Q Function and Advantage Function
The basic idea behind Bellman equation is this:
The value of your current starting point is the reward you expect to get from being there, plus the value of wherever you land next.
The on-policy value function, $V^{\pi}(s)$, gives the expected return if you start in state $s$ and always act according to policy $\pi$: $V^{\pi}(s)=E_{\tau\sim\pi}[R(\tau)|s_0=s]$
The on-policy action-value function, $Q^{\pi}(s, a)$, gives the expected return if you start in state $s$, take an arbitrary action $a$, and then forever after act according to policy $\pi$: $Q^{\pi}(s, a)=E_{\tau\sim\pi}[R(\tau)|s_0=s, a_0=a]$
From these definitions, the relationship between the value function and the action-value function follows directly: $V^{\pi}(s)=E_{a\sim\pi}[Q^{\pi}(s, a)]$
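Written out for the infinite-horizon discounted setting, the corresponding Bellman equations are:
$V^{\pi}(s)=E_{a\sim\pi,\ s'\sim P}[r(s,a)+\gamma V^{\pi}(s')]$
$Q^{\pi}(s,a)=E_{s'\sim P}[r(s,a)+\gamma E_{a'\sim\pi}[Q^{\pi}(s',a')]]$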
If we have the optimal action-value function, $Q^{*}$, we can directly obtain the optimal action, $a^{*}(s)$, via $a^{*}(s) = \arg\max_a Q^{*}(s, a)$
The advantage function $A^{\pi}(s,a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi(\cdot|s)$, assuming you act according to $\pi$ forever after: $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$
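A toy tabular illustration (the numbers are made up) tying these quantities together:

```python
import numpy as np

# Toy tabular example: 3 states x 2 actions
Q = np.array([[1.0, 2.0],
              [0.5, 0.0],
              [3.0, 3.5]])                 # Q^pi(s, a)
pi = np.array([[0.5, 0.5],
               [0.9, 0.1],
               [0.2, 0.8]])                # pi(a | s)

V = (pi * Q).sum(axis=1)                   # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
A = Q - V[:, None]                         # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
greedy = Q.argmax(axis=1)                  # greedy action argmax_a Q(s, a); equals a*(s) if Q were Q*
print(V, A, greedy, sep="\n")
```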
RL Algorithms
Anatomy of RL and What to Learn
What to learn in RL algorithms:
- policy
- action-value function
- value function
- environment model
Taxonomy of RL