Reinforcement Learning - Lec1

Policy-based Algorithm

My goal for January this year is to review the RL I learned in the second semester of junior year. The main references are Hung-yi Lee's 8-lecture DRL series and Sergey Levine's CS294 at Berkeley.

Let's start with Policy Gradient!

Terminology

This assumes the reader is already familiar with the basic concepts of RL (the interaction between env/action/state and the related terminology).

Policy

A policy $\pi$ is the probability of taking action $a$ given state $s$, and we can describe it with a set of parameters $\theta$.

$$\pi_\theta(a,s) = \mathbb{P}[a|s]$$

Given an agent (also called an actor) with a fixed policy, we let it interact with the environment. Each play-through (one episode) collects a sequence $\{s_1, a_1, \cdots, s_T, a_T\}$, called a trajectory $\tau$, and yields a final reward $r(\tau)$.
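
To make the notation concrete, below is a minimal sketch of a parameterized policy and of collecting one trajectory. Everything here (the toy tabular MDP, `n_states`, `n_actions`, the linear-softmax parameterization of $\theta$) is a hypothetical setup for illustration, not something from the lectures; $r(\tau)$ is taken as the sum of per-step rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 4 states, 2 actions, random transition and reward tables.
n_states, n_actions, T = 4, 2, 10
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = distribution over s'
R = rng.normal(size=(n_states, n_actions))                        # r(s, a)

# theta parameterizes a tabular softmax policy: pi_theta(a | s) = softmax(theta[s])[a]
theta = np.zeros((n_states, n_actions))

def pi(theta, s):
    """Action distribution of pi_theta at state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def rollout(theta, s0=0, horizon=T):
    """Play one episode from s0; return the trajectory {(s_t, a_t)} and r(tau)."""
    s, traj, ret = s0, [], 0.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi(theta, s))
        traj.append((s, a))
        ret += R[s, a]                       # accumulate r(s_t, a_t) into r(tau)
        s = rng.choice(n_states, p=P[s, a])  # environment transition
    return traj, ret

traj, r_tau = rollout(theta)
print(traj, r_tau)
```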

Value Function

Given a policy, the expected return starting from state $s$:

$$V^\pi(s) = \mathbb{E}_\pi[r(\tau) | s_t = s]$$

Note: the subscript $\pi$ denotes the probability of a particular trajectory being sampled under the given policy.

Action-Value Function

Given a policy, the expected return when taking $a$ at $s$; hereafter called the Q function:

$$Q^\pi(s,a) = \mathbb{E}_\pi[r(\tau)|s_t = s, a_t = a]$$
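
Both $V^\pi$ and $Q^\pi$ can be estimated by Monte Carlo rollouts. A minimal sketch, continuing the hypothetical toy setup and `rollout` helper from the sketch above:

```python
def V_hat(theta, s, n=2000):
    """Monte Carlo estimate of V^pi(s): average r(tau) over episodes that start at s."""
    return np.mean([rollout(theta, s0=s)[1] for _ in range(n)])

def Q_hat(theta, s, a, n=2000):
    """Monte Carlo estimate of Q^pi(s, a): force the first action to be a, then follow pi."""
    total = 0.0
    for _ in range(n):
        s_next = rng.choice(n_states, p=P[s, a])
        total += R[s, a] + rollout(theta, s0=s_next, horizon=T - 1)[1]
    return total / n

print(V_hat(theta, s=0), Q_hat(theta, s=0, a=1))
```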

Goal

We want to find the best policy that maximizes the expected final reward (this is the optimization objective).

Optimal value function

$$\begin{aligned} V_{\ast}(s) &= \max_\pi V_\pi(s) = \max_a Q_{\ast}(s,a)\\ Q_{\ast}(s,a) &= \max_\pi Q_\pi(s,a) = r(s,a) + \sum_{s^\prime} \mathbb{P}[s^\prime|a,s]\, V_{\ast}(s^\prime) \end{aligned}$$

Note: exploration is only done during training. Once the whole environment has been figured out (i.e. the optimal policy is obtained), no randomness needs to be added when making decisions (so that's why I directly write down $\max$).

Decoding Problem (how to construct such a policy)

Only considered in non-policy-based methods; more discussion in → Lec3

Learning Problem

The algorithms discussed in what follows all revolve around how to find a policy that satisfies this optimal value equation.

  • DP-based Planning: value iteration, policy iteration (a minimal value-iteration sketch follows this list)
    • Discrete action/state and $|\mathcal{A}|, |\mathcal{S}|$ are small (i.e. tabular)
    • MDP is known
    • The Contraction Mapping Theorem proves the existence of the optimal solution → see Lec3
  • Value-based Learning: Monte-Carlo RL, Temporal-Difference (bootstrapping) RL
    • In general, discrete actions (since we need to choose the argmax from $\mathcal{A}$)
    • MDP is unknown
    • Tradeoff between bias (biased or not) and variance (low or high)
  • Policy-based Learning: REINFORCE
    • Continuous action/state (since we usually use a NN as the hypothesis set)
    • MDP is unknown
    • Variance reduction tricks (see → Lec2)
    • Combined with value-based algorithms: Actor-Critic (see → Lec5)
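
As referenced in the DP-based Planning item, here is a minimal value-iteration sketch that iterates the optimal value equation on the hypothetical `P`/`R` tables from the earlier toy setup. A discount factor $\gamma$ (an assumption of this sketch, not part of the equation above) is added so the iteration converges:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a [ r(s, a) + gamma * sum_{s'} P[s'|a, s] V(s') ] to a fixed point."""
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V            # Q[s, a]; P @ V sums over the next-state axis
        V_new = Q.max(axis=1)            # greedy over actions, as in the optimal value equation
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # V_* and a greedy (deterministic) policy
        V = V_new

V_star, greedy_policy = value_iteration(P, R)
print(V_star, greedy_policy)
```

For $\gamma < 1$, convergence to a unique fixed point is exactly what the Contraction Mapping Theorem mentioned above guarantees.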

How to update $\theta$ in Policy Gradient

The $\theta$ corresponding to the optimal policy satisfies the following:

$$\theta^{\ast} = \arg \max_\theta \mathbb{E}_{\tau \sim \pi_\theta}[r(\tau)] = \arg \max_\theta \sum_t \mathbb{E}_{(s_t,a_t) \sim \pi_\theta(s,a)} [r(s_t,a_t)]$$

Using gradient ascent (i.e. descent on $-J(\theta)$), where $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[r(\tau)]$:

$$ \begin{aligned} \nabla_\theta J(\theta) &= \int \color{blue}{\nabla_\theta[\pi_\theta(\tau)]}\, r(\tau)\, d\tau\\ &= \int \color{blue}{\pi_\theta(\tau) \nabla_\theta[\log \pi_\theta(\tau)]}\, r(\tau)\, d\tau \\ &= \mathbb{E}_{\tau \; \sim \; \pi_\theta(\tau)} \nabla_\theta[\log\pi_\theta(\tau)]\, r(\tau) \end{aligned} $$
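
The step highlighted in blue is the log-derivative (likelihood-ratio) identity, written out:

$$ \nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\,\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau) $$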

Intuitively, this is the same as the traditional supervised learning setting, except that the size of each update step is linear in the reward (i.e. trial & error: whichever direction works well, step further in it, weighted by the reward).

However, the $\nabla_\theta J(\theta)$ above cannot be applied to policy updates in a very intuitive way, so we make the following changes:

  • The Policy Gradient Theorem tells us

For any differentiable policy $\pi_{\theta}(a,s)$, for any of the policy objective functions $J = J_1, J_{avR}, \ldots$,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]$$
  • We update on each $(a,s)$ pair (all that $\pi$ controls is the probability of taking $a$ given $s$); since $\pi_\theta(\tau) = p(s_1)\prod_t \pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$ and the dynamics terms do not depend on $\theta$,
$$\nabla_\theta [\log \pi_\theta (\tau)] = \sum_t \nabla_\theta [\log \pi_\theta (a_t , s_t)]$$
  • In most cases we cannot enumerate every possible $\tau$ to compute the expectation, so we can only estimate it by sampling
$$\boxed{ \nabla_\theta J(\theta) \approx \sum_n \sum_t \nabla_\theta[\log\pi_\theta(a_t,s_t)]\underbrace{\,r(\tau)\,}_{ Q^{\pi_\theta}(s,a)}}$$

Note: in the most basic policy gradient, we use $r(\tau)$ as an unbiased estimate of $Q^{\pi_\theta}(s,a)$.
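
Putting the boxed estimator into code: a minimal NumPy sketch of REINFORCE on the hypothetical tabular-softmax setup from the earlier sketches (reusing the `pi` and `rollout` helpers). The gradient is averaged over the sampled episodes for readability; the variance-reduction tricks of Lec2 are deliberately left out.

```python
def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a | s) for the tabular softmax policy: one_hot(a) - pi(. | s)."""
    g = np.zeros_like(theta)
    g[s] = -pi(theta, s)
    g[s, a] += 1.0
    return g

def reinforce_step(theta, n_episodes=32, lr=0.01):
    """One update: theta <- theta + lr * (1/N) sum_n sum_t grad log pi(a_t | s_t) * r(tau)."""
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        traj, r_tau = rollout(theta)                   # on-policy: sampled with the current theta
        for s, a in traj:
            grad += grad_log_pi(theta, s, a) * r_tau   # r(tau) stands in for Q^pi(s, a)
    return theta + lr * grad / n_episodes

for _ in range(300):
    theta = reinforce_step(theta)                      # episodes must be re-collected every step
print(np.mean([rollout(theta)[1] for _ in range(500)]))  # average return under the updated policy
```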

Drawbacks

  • On-policy: we need to resample whenever $\theta$ is updated ⇒ data collection is time-consuming
  • Samples are finite ⇒ high variance, hard to train

Reference