AI

MDP

AI Class Notes (2)

Posted by AH on April 18, 2020

MDP: Markov Decision Processes

MDP

  • a set of states $s \in S$
  • a set of actions $a \in A$ (like successors in search problems)
  • a transition function $T(s, a, s')$ (i.e. $P(s' \mid s, a)$)
  • a reward function $R(s, a, s’)$ (and discount factor $\gamma$)
  • a start state $s_0$
  • maybe a terminal state
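
As a running example for the sections below, here is a minimal sketch of an MDP in code. The toy states, actions, transition probabilities, and rewards are made up for illustration and are not from the lecture; only the structure $(S, A, T, R, \gamma)$ mirrors the definition above.

```python
# A tiny MDP as plain Python data (toy example, not from the lecture).
# T[(s, a)] is a list of (next_state, probability) pairs, i.e. T(s, a, s');
# R(s, a, s') is the reward function; GAMMA is the discount factor.

GAMMA = 0.9

STATES = ["cool", "warm", "off"]   # "off" is a terminal state
ACTIONS = ["slow", "fast"]

T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("off", 1.0)],
}

def R(s, a, s_next):
    # -10 for ending in the terminal state, otherwise +1 (slow) or +2 (fast)
    if s_next == "off":
        return -10.0
    return 1.0 if a == "slow" else 2.0
```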

MDP Quantities

  • Policy $\pi$: choice of action for each state

  • Discounted utility: sum of (discounted) rewards

    $U([r_0, r_1, r_2, \dots]) = \sum_{i=0}^{\infty} \gamma^{i}r_i = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$, where $0<\gamma<1$
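
    For example, with $\gamma = 0.5$ and reward sequence $[1, 2, 3]$:

    \[U([1, 2, 3]) = 1 + 0.5 \cdot 2 + 0.5^2 \cdot 3 = 2.75\]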

Optimal Quantities

  • the value (utility) of a state $s$

    $V^*(s)$: expected utility starting in $s$ and acting optimally

  • the value (utility) of a q-state $(s, a)$

    $Q^*(s,a)$: expected utility starting out having taken action $a$ from state $s$ and (thereafter) acting optimally

  • the optimal policy

    $\pi^*(s)$: optimal action from state $s$

Recursive Definition of Value (the Bellman Equations)

  • Q:

    \[Q^\ast(s, a) = \sum_{s'} T(s, a, s')[R(s, a, s')+\gamma V^\ast(s')]\]
  • V:

    \[V^\ast(s) = \max_a Q^*(s, a) = \max_a \sum_{s'}T(s, a, s')[R(s, a, s')+\gamma V^\ast(s')]\]
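
These two equations translate almost directly into code. The sketch below reuses the toy `T`, `R`, `GAMMA`, and `ACTIONS` defined earlier; `V` is a plain dict mapping states to current value estimates.

```python
def q_value(s, a, V):
    # Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]
    return sum(p * (R(s, a, s_next) + GAMMA * V[s_next])
               for s_next, p in T.get((s, a), []))

def state_value(s, V):
    # V(s) = max over a of Q(s, a); states with no available action are worth 0
    qs = [q_value(s, a, V) for a in ACTIONS if (s, a) in T]
    return max(qs) if qs else 0.0
```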

Value Iteration

  • $V_0(s) = 0$
  • $V_{k+1}(s) \leftarrow \max_a \sum_{s'}T(s, a, s')[R(s, a, s')+\gamma V_k(s')]$

Repeat until convergence.

Complexity of each iteration: $O(|S|^2 |A|)$
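
A minimal value iteration loop over the toy MDP above, using the `state_value` helper from the Bellman-equation sketch; the tolerance-based stopping rule is one common convergence test.

```python
def value_iteration(tol=1e-6):
    V = {s: 0.0 for s in STATES}                        # V_0(s) = 0
    while True:
        V_new = {s: state_value(s, V) for s in STATES}  # Bellman update
        if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
            return V_new
        V = V_new
```

Each sweep touches every $(s, a, s')$ triple once, which is where the per-iteration bound above comes from.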

Policy Extraction

\[\pi^\ast(s) = \arg\max_a \sum_{s'}T(s, a, s')[R(s, a, s')+\gamma V^\ast(s')] = \arg\max_a Q^\ast(s, a)\]

It is easier to extract a policy from q-values than from values.
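
In code, extraction is a single one-step look-ahead over the converged values, reusing `q_value` from the sketch above:

```python
def extract_policy(V):
    # pi(s) = argmax over a of Q(s, a), computed from the given values V
    policy = {}
    for s in STATES:
        available = [a for a in ACTIONS if (s, a) in T]
        if available:
            policy[s] = max(available, key=lambda a: q_value(s, a, V))
    return policy
```

For instance, `extract_policy(value_iteration())` gives the greedy policy for the toy MDP.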

Problems with Value Iteration

  1. slow: $O(|S|^2 |A|)$ per iteration

  2. the “max” at each state rarely changes

  3. the policy often converges long before the values

Policy Iteration

Steps:

  1. Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence.

    \[V_{k+1}^{\pi_i}(s) \leftarrow \sum_{s'}T(s, \pi_i(s), s')[R(s, \pi_i(s), s')+\gamma V^{\pi_i}_k(s')]\]
  2. Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values.

    \[\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^{\pi_i}(s')]\]

Repeat steps until policy converges.

  • Policy Iteration is still optimal!

  • Can converge (much) faster under some conditions
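
A sketch of the full loop over the toy MDP, reusing `q_value` and `extract_policy` from above; the fixed-policy evaluation step is itself an iterative, tolerance-based inner loop.

```python
def policy_iteration(tol=1e-6):
    # Start from an arbitrary policy (here: "slow" wherever it is available).
    policy = {s: "slow" for s in STATES if (s, "slow") in T}
    while True:
        # 1. Policy evaluation: fixed-policy Bellman updates until convergence.
        V = {s: 0.0 for s in STATES}
        while True:
            V_new = {s: q_value(s, policy[s], V) if s in policy else 0.0
                     for s in STATES}
            if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
                V = V_new
                break
            V = V_new
        # 2. Policy improvement: one-step look-ahead on the evaluated values.
        new_policy = extract_policy(V)
        if new_policy == policy:   # policy unchanged -> converged
            return policy, V
        policy = new_policy
```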

Summary

  • Value iteration or policy iteration?

    compute optimal values: either

    compute values for a particular policy: policy evaluation

    turn your values into a policy: policy extraction

  • Value iteration and policy iteration are both variations on Bellman updates