Reinforcement Learning Basics - Concepts and Taxonomy

Deep Reinforcement Learning Course Note (Part 2)

Posted by Xiaoye's Blog on May 23, 2020

TOC

  1. Introduction
  2. Key Concepts and Terminology
    1. States, Observations, Action Spaces and Trajectories
    2. Markov Decision Process (MDP)
    3. Policies
    4. Reward and Return
    5. The Goal of RL
    6. Bellman Equations, Value Function, Q Function and Advantage Function
  3. RL Algorithms
    1. Anatomy of RL and What to Learn
    2. Taxonomy of RL
  4. Reference

Introduction

These are course notes for UC Berkeley's CS285: Deep Reinforcement Learning.

Key Concepts and Terminology

The core of RL is the agent-environment interaction loop: at each time step the agent observes the current state of the world, chooses an action, and the environment responds with a reward and a new state.
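A minimal sketch of this loop in code, assuming the `gymnasium` package and its `CartPole-v1` environment (neither is named in the original notes), with a random policy standing in for the agent:

```python
import gymnasium as gym

# One episode of the agent-environment loop, with a random policy as the "agent".
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)           # initial observation o_0

episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()  # agent chooses an action a_t
    obs, reward, terminated, truncated, info = env.step(action)  # environment returns o_{t+1} and r_t
    episode_return += reward
    done = terminated or truncated

print("return of this episode:", episode_return)
env.close()
```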

States, Observations, Action Spaces and Trajectories

  1. State $s$: a complete description of the state of the world.
  2. Observation $o$: a partial description of a state; an environment can be fully observed or partially observed.
  3. Action space: the set of all valid actions $a$ in a given environment (a small sketch of both types follows this list). There are two types of action spaces:
    1. discrete action spaces, such as Atari and Go
    2. continuous action spaces, such as controlling a robot in the physical world
  4. A trajectory $\tau$ is a sequence of states and actions in the world, $\tau=(s_0, a_0, s_1, a_1, \ldots)$. It is also called an episode or rollout. The very first state of the world, $s_0$, is randomly sampled from the start-state distribution, denoted $\rho_0$: $s_0\sim\rho_0(\cdot)$
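To make the discrete/continuous distinction concrete, a small sketch using `gymnasium` environments (an assumption; the notes do not name a specific library):

```python
import gymnasium as gym

# Discrete action space: a finite set of choices, e.g. push-left / push-right.
print(gym.make("CartPole-v1").action_space)   # Discrete(2)

# Continuous action space: real-valued torques inside a bounded Box.
print(gym.make("Pendulum-v1").action_space)   # e.g. Box(-2.0, 2.0, (1,), float32)
```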

Markov Decision Process (MDP)

Markov property: the transition depends only on the current state and action, not on any earlier history.

Markov chain: $\mathcal{M}=\{\mathcal{S}, \mathcal{T}\}$, a state space $\mathcal{S}$ together with a transition operator $\mathcal{T}$ defined by $p(s_{t+1}|s_t)$.

Fully observed Markov decision process (MDP): $\mathcal{M}=\{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}$, which adds an action space $\mathcal{A}$ and a reward function $r(s_t, a_t)$; transitions become $p(s_{t+1}|s_t, a_t)$.

Partially observed Markov decision process (POMDP): $\mathcal{M}=\{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r\}$, which further adds an observation space $\mathcal{O}$ and an emission probability $\mathcal{E}$ defined by $p(o_t|s_t)$; the agent conditions on $o_t$ rather than the full state $s_t$.
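A minimal NumPy sketch of the Markov property, simulating a 3-state Markov chain whose transition matrix is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# T[s, s'] = p(s_{t+1} = s' | s_t = s); each row sums to 1.
T = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

s = 0                              # start in state 0
for t in range(10):
    s = rng.choice(3, p=T[s])      # next state depends only on the current state
    print(t, s)
```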

Policies

Policy: the rule the agent uses to decide which action to take. The parameters of the policy are usually denoted by $\theta$.

  1. Deterministic policies: $a_t=\mu_{\theta}(s_t)$
  2. Stochastic policies: $a_t\sim\pi_{\theta}(\cdot|s_t)$

For stochastic policies, there are two common kinds (a minimal sketch of both follows this list):

  1. Categorical policies, which act like a classifier over discrete actions
  2. Diagonal Gaussian policies, which output the mean (and log standard deviation) of a Gaussian over continuous actions
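A minimal sketch of both kinds with `torch.distributions`; the layer sizes are arbitrary placeholders rather than anything from the course:

```python
import torch
from torch.distributions import Categorical, Normal

obs = torch.randn(4)                       # a dummy 4-dimensional observation

# Categorical policy: a classifier-like head over 3 discrete actions.
logits_net = torch.nn.Linear(4, 3)
dist = Categorical(logits=logits_net(obs))
a = dist.sample()                          # a_t ~ pi_theta(.|s_t)
logp = dist.log_prob(a)

# Diagonal Gaussian policy: network mean + state-independent log-std
# over a 2-dimensional continuous action.
mu_net = torch.nn.Linear(4, 2)
log_std = torch.zeros(2, requires_grad=True)
dist = Normal(mu_net(obs), log_std.exp())
a = dist.sample()
logp = dist.log_prob(a).sum()              # diagonal covariance => sum per-dimension log-probs
```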

Reward and Return

The reward function $R$ depends on the current state of the world, the action just taken, and the next state of the world: $r_t=R(s_t, a_t, s_{t+1})$

The goal of the agent is to maximize some notion of cumulative reward over a trajectory.

  1. Finite-horizon undiscounted return: $R(\tau)=\sum_{t=0}^{T}r_t$
  2. Infinite-horizon discounted return: $R(\tau)=\sum_{t=0}^{\infty}\gamma^{t}r_t$, where $\gamma\in(0,1)$ is the discount factor (a tiny numeric sketch of both follows this list)
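A tiny numeric sketch of both returns, using a made-up reward sequence:

```python
# Rewards from one (made-up) trajectory.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.99

undiscounted = sum(rewards)                                     # finite-horizon undiscounted return
discounted = sum(gamma**t * r for t, r in enumerate(rewards))   # discounted return

print(undiscounted, discounted)
```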

Return vs. reward:

  1. The reward is the immediate payoff received right after executing an action.
  2. The return is the cumulative reward along a state-action sequence up to a given point (it can be understood as the sum of the immediate rewards received along the way).

The Goal of RL

The goal of RL is to select a policy that maximizes the expected return when the agent acts according to it.

When both the environment transitions and the policy are stochastic, the probability of a T-step trajectory is:

$P(\tau| \pi) = \rho_0(s_0)\prod_{t=0}^{T-1}P({s_{t+1}|s_t,a_t})\pi(a_t|s_t)$

The expected return is:

$J(\pi)=\int_{\tau}P(\tau|\pi)R(\tau)\,d\tau=E_{\tau\sim\pi}[R(\tau)]$
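Since the integral over trajectories is usually intractable, $J(\pi)$ is in practice estimated by averaging the returns of sampled trajectories. A rough sketch, assuming `gymnasium` and a random policy on `CartPole-v1` (both assumptions, not from the notes):

```python
import gymnasium as gym
import numpy as np

def estimate_expected_return(env, policy, num_episodes=100, gamma=0.99):
    """Monte Carlo estimate of J(pi) = E_{tau ~ pi}[R(tau)]:
    roll out the policy many times and average the sampled returns."""
    returns = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done, ret, t = False, 0.0, 0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            ret += gamma**t * reward     # accumulate the discounted return R(tau)
            t += 1
            done = terminated or truncated
        returns.append(ret)
    return np.mean(returns)

# Example: a uniformly random policy on CartPole.
env = gym.make("CartPole-v1")
random_policy = lambda obs: env.action_space.sample()
print(estimate_expected_return(env, random_policy))
```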

The central optimization problem in RL can be expressed by:

$\pi^{*}=\arg\max_{\pi}J(\pi)$

with $\pi^{*}$ being the optimal policy.

Bellman Equations, Value Function, Q Function and Advantage Function

The basic idea behind the Bellman equations is this:

The value of your current starting point is the reward you expect to get from being there, plus the value of wherever you land next.

The on-policy value function, $V^{\pi}(s)$, gives the expected return if you start in state $s$ and always act according to policy $\pi$: $V^{\pi}(s)=E_{\tau\sim\pi}[R(\tau)|s_0=s]$

The on-policy action-value function, $Q^{\pi}(s, a)$, gives the expected return if you start in state $s$, take an arbitrary action $a$, and then forever after act according to policy $\pi$: $Q^{\pi}(s, a)=E_{\tau\sim\pi}[R(\tau)|s_0=s, a_0=a]$

From these definitions, the relationship between the value function and the action-value function is: $V^{\pi}(s)=E_{a\sim\pi}[Q^{\pi}(s, a)]$
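Written out for the infinite-horizon discounted setting (following the Spinning Up notation, with $s'\sim P(\cdot|s,a)$ the next state and $r(s,a)$ the reward), the Bellman equations for these two functions are:

$V^{\pi}(s)=E_{a\sim\pi,\,s'\sim P}[r(s,a)+\gamma V^{\pi}(s')]$

$Q^{\pi}(s,a)=E_{s'\sim P}[r(s,a)+\gamma E_{a'\sim\pi}[Q^{\pi}(s',a')]]$

Each says exactly what the quote above says: the value now equals the expected immediate reward plus the discounted value of wherever you land next.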

If we have the optimal action-value function, $Q^{*}$, we can directly obtain the optimal action, $a^{*}(s)$, via $a^{*}(s) = \arg\max_a Q^{*}(s, a)$

The advantage function $A^{\pi}(s,a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action a in state s, over randomly selecting an action according to $\pi(\cdot|s)$, assuming you act according to $\pi$ forever after: $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$
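A tiny tabular sketch tying $Q^{\pi}$, $V^{\pi}$, $A^{\pi}$ and the greedy action together, with made-up numbers for a single state:

```python
import numpy as np

# Made-up numbers for one state s with 3 available actions.
Q = np.array([1.0, 2.5, 0.5])      # Q^pi(s, a) for a = 0, 1, 2
pi = np.array([0.2, 0.5, 0.3])     # pi(a|s)

V = np.dot(pi, Q)                  # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
A = Q - V                          # advantage A^pi(s, a) of each action
a_greedy = np.argmax(Q)            # greedy action, as in a*(s) = argmax_a Q*(s, a)

print(V, A, a_greedy)
```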

RL Algorithms

Anatomy of RL and What to Learn

The generic anatomy of an RL algorithm is a loop with three parts: (1) generate samples by running the policy in the environment, (2) fit a model or estimate the return from those samples, and (3) improve the policy; then repeat.

What to learn in RL algorithms (a rough sketch of each follows the list):

  1. policy
  2. action-value function
  3. value function
  4. environment model
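As a rough sketch of what each of these objects looks like when parameterized by a neural network (the linear layers and dimensions are placeholder assumptions, not anything from the course):

```python
import torch.nn as nn

obs_dim, act_dim = 8, 2

policy   = nn.Linear(obs_dim, act_dim)            # pi_theta(a|s): state -> action (or distribution parameters)
q_func   = nn.Linear(obs_dim + act_dim, 1)        # Q(s, a): state-action pair -> expected return
value_fn = nn.Linear(obs_dim, 1)                  # V(s): state -> expected return
dynamics = nn.Linear(obs_dim + act_dim, obs_dim)  # environment model: predicts next state from (s, a)
```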

Taxonomy of RL

Following the Spinning Up taxonomy, RL algorithms split at the top level into model-free and model-based methods, depending on whether the agent learns (or is given) a model of the environment's dynamics. Within model-free RL, the two main families are policy optimization and Q-learning, with some methods interpolating between the two.

Reference

  1. UC Berkeley CS285: Deep Reinforcement Learning, Decision Making, and Control
  2. OpenAI Spinning Up in Deep RL