MDP: Markov Decision Processes
MDP
- a set of states $s \in S$
- a set of actions $a \in A$ (like successors in search problems)
- a transition function $T(s, a, s')$ (i.e. $P(s' \mid s, a)$)
- a reward function $R(s, a, s')$ (and a discount factor $\gamma$)
- a start state $s_0$
- maybe a terminal state
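A minimal sketch of how these pieces might be written down in Python, assuming a hypothetical two-state MDP; the state names, actions, probabilities, and rewards below are illustrative, not part of the notes.

```python
# Hypothetical two-state MDP encoded as plain Python data structures.
STATES = ["A", "B"]          # set of states S
ACTIONS = ["stay", "move"]   # set of actions A
GAMMA = 0.9                  # discount factor

# Transition function T(s, a, s') = P(s' | s, a), as nested dicts:
# T[(s, a)] maps each successor s' to its probability.
T = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"B": 0.8, "A": 0.2},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 0.8, "B": 0.2},
}

# Reward function R(s, a, s').
R = {
    ("A", "stay", "A"): 0.0,
    ("A", "move", "B"): 1.0,
    ("A", "move", "A"): 0.0,
    ("B", "stay", "B"): 2.0,
    ("B", "move", "A"): 0.0,
    ("B", "move", "B"): 2.0,
}

START_STATE = "A"            # start state s_0
```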
MDP Quantities
- Policy $\pi$: a choice of action for each state
- Discounted utility: the sum of (discounted) rewards (computed numerically in the snippet below)
  $U([r_0, r_1, r_2, \dots]) = \sum_{i=0}^{\infty} \gamma^{i}r_i = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$, where $0<\gamma<1$
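A quick numeric check of this formula, as a sketch; the reward sequence and $\gamma$ below are made-up values.

```python
# Discounted utility U([r_0, r_1, ...]) = sum_i gamma^i * r_i for a finite reward sequence.
def discounted_utility(rewards, gamma):
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Example: rewards [1, 2, 3] with gamma = 0.5 gives 1 + 0.5*2 + 0.25*3 = 2.75.
print(discounted_utility([1, 2, 3], gamma=0.5))  # 2.75
```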
Optimal Quantities
- The value (utility) of a state $s$: $V^*(s)$ is the expected utility starting in $s$ and acting optimally
- The value (utility) of a q-state $(s, a)$: $Q^*(s,a)$ is the expected utility starting out having taken action $a$ from state $s$ and (thereafter) acting optimally
- The optimal policy: $\pi^*(s)$ is the optimal action from state $s$
Recursive Definition of Value (the Bellman Equations)
- Q:
  \[Q^\ast(s, a) = \sum_{s'} T(s, a, s')[R(s, a, s')+\gamma V^\ast(s')]\]
- V:
  \[V^\ast(s) = \max_a Q^\ast(s, a) = \max_a \sum_{s'}T(s, a, s')[R(s, a, s')+\gamma V^\ast(s')]\]
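These two backups translate almost line-for-line into code. A minimal sketch, assuming the dict-based `T`/`R` encoding from the earlier snippet; `q_value` and `v_value` are hypothetical helper names.

```python
# One Bellman backup, written directly from the equations above.
def q_value(s, a, V, T, R, gamma):
    # Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]
    return sum(p * (R[(s, a, s2)] + gamma * V[s2])
               for s2, p in T[(s, a)].items())

def v_value(s, actions, V, T, R, gamma):
    # V(s) = max over a of Q(s, a)
    return max(q_value(s, a, V, T, R, gamma) for a in actions)
```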
Value Iteration
- $V_0(s) = 0$
- $V_{k+1}(s) \leftarrow \max_a \sum_{s'}T(s, a, s')[R(s, a, s')+\gamma V_k(s')]$
Repeat until convergence.
Complexity of each iteration: $O(|S|^2 |A|)$
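A minimal value-iteration sketch under the same assumed encoding; the convergence threshold `tol` is an illustrative choice.

```python
# Value iteration: start from V_0 = 0 and apply the Bellman backup until the values stop changing.
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    while True:
        V_new = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
                for a in actions
            )
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new                              # approximately converged values
        V = V_new
```

With the hypothetical MDP sketched earlier, `value_iteration(STATES, ACTIONS, T, R, GAMMA)` returns a dict of (approximately) optimal state values.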
Policy Extraction
\[\pi^\ast(s) = \arg\max_a \sum_{s'}T(s, a, s')[R(s, a, s')+\gamma V^\ast(s')] = \arg\max_a Q^\ast(s, a)\]
It is easier to extract the policy from q-values than from values.
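A one-step-lookahead extraction sketch matching the formula above, again assuming the dict-based encoding and a converged value table `V` (e.g. the output of `value_iteration`).

```python
# Policy extraction: pick, in each state, the action with the best one-step lookahead value.
def extract_policy(states, actions, T, R, gamma, V):
    pi = {}
    for s in states:
        pi[s] = max(
            actions,
            key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                              for s2, p in T[(s, a)].items()),
        )
    return pi
```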
Problems with Value Iteration
- Slow: $O(|S|^2 |A|)$ per iteration
- The “max” at each state rarely changes
- The policy often converges long before the values
Policy Iteration
Steps:
- Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence.
  \[V_{k+1}^{\pi_i}(s) \leftarrow \sum_{s'}T(s, \pi_i(s), s')[R(s, \pi_i(s), s')+\gamma V^{\pi_i}_k(s')]\]
- Policy improvement: update the policy using one-step look-ahead, with the resulting converged (but not optimal!) utilities as future values.
  \[\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^{\pi_i}(s')]\]
Repeat these two steps until the policy converges.
- Policy Iteration is still optimal!
- It can converge (much) faster under some conditions (a code sketch follows below)
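A policy-iteration sketch under the same assumed encoding: iterative policy evaluation for the current (fixed) policy, then a greedy improvement step, repeated until the policy stops changing.

```python
# Policy iteration: alternate policy evaluation and policy improvement.
def policy_iteration(states, actions, T, R, gamma, tol=1e-6):
    pi = {s: actions[0] for s in states}              # arbitrary initial policy
    while True:
        # Policy evaluation: fixed-policy Bellman updates until convergence (no max over actions).
        V = {s: 0.0 for s in states}
        while True:
            V_new = {
                s: sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                       for s2, p in T[(s, pi[s])].items())
                for s in states
            }
            converged = max(abs(V_new[s] - V[s]) for s in states) < tol
            V = V_new
            if converged:
                break
        # Policy improvement: greedy one-step lookahead on the evaluated values.
        pi_new = {
            s: max(actions,
                   key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                     for s2, p in T[(s, a)].items()))
            for s in states
        }
        if pi_new == pi:
            return pi, V                              # the policy (and its values) has converged
        pi = pi_new
```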
Summary
- Value iteration or policy iteration?
  - To compute optimal values: either one
  - To compute the values of a particular (fixed) policy: policy evaluation
  - To turn values into a policy: policy extraction (one-step look-ahead)
- Value iteration and policy iteration are both variations of Bellman updates