RL4:Dynamic Programming

Refer from Reinforcement Learning An Introduction (2nd) and Reinforcement Learning specialization on Coursera by University of Alberta & Alberta Machine Intelligence Institute

The key idea of DP, and reinforcement learning generally, is the use of value functions to orgnize and structure the search for good policy. The limitation of DP is that we must know the dynamic programing of the environment, which impossible to know in most of the time.

Policy Evaluation is the tasks of determining the state-value function \(v_\pi\) for a particular policy \(\pi\)

Policy Improvement is the task of improving an existed policy on given state values.

Policy Control is the iteration of combining evaluation and improvement to get the optimal policy.

Policy Evaluation(Prediction)

The aim of the policy evaluation is to calculate value functions. Its idea is to change the bellman equation into an update rule.

\[ \begin{eqnarray*} v_{k+1}(s) &\doteq&\sum_a\pi(a|s) q_k(s,a)\\ &=&\sum_a\pi(a|s) \sum_{s',r}p(s',r|s,a)(r+\gamma v_k(s'))\tag{4.1} \end{eqnarray*} \]

We arbitrarily choose inital value for each state and the terminal state, if any, must be given value 0 and never be updated. Indeed, the sequence \(\{vk\}\) can be shown in general to converge to \(v_\pi\) as \(k\rightarrow\infty\) under the same conditions that guarantee the existence of \(v_\pi\). This algorithm is called iterative policy evaluation.

We call this kind of operation an expected update. All the updates done in DP algorithms are called expected updates because they are based on an expectation over all possible next states rather than on a sample next state.

Let's consider the \(4\times4\) gride world below.

  • Action: up, left, down, right.
  • State: shadow blocks for terminal states and white blocks with numbers for non-terminal states.
  • Reward: all transitions for a reward -1.
  • Policy: random move policy(four actions have equal probabilities to choose at each state)

With \(\theta = 0.001\), we finally get this.

Policy Improvement

It is obvious that after we calculate state-value functions, the aim of the policy improvement is to improve the current policy \(\pi\). Its idea is to change \(\pi\) towards the optimal policy a little step. We call this policy improvement theorem\((4.2)\). \(\pi'\) is the next new policy and both \(\pi\) and \(\pi'\) are deterministic policy.

\[ q_\pi(s, \pi'(s)) \ge v_\pi(s),\ \forall s\in \mathbf{S}\tag{4.2}\\ \Rightarrow v_\pi'(s)) \ge v_\pi(s) \]

The premise of improvement is

\[ q_\pi(s, \pi'(s)) > q_\pi(s, \pi(s)), \ \exists s\in\mathbf{S} \]

AS well as this condition is avaliable, the new policy can always be a strictly improvement over \(\pi\) unless \(\pi\) is already optimal. Obviously we can get \(\pi'\) by greedy selection of action values at each state.

\[ \begin{eqnarray*} \pi'(s) &=&\mathop{argmax}_a q_\pi(s,a)\\ &=&\mathop{argmax}_a \sum_{s',r}p(s',r|s,a)(r+\gamma v_\pi(s'))\tag{4.3} \end{eqnarray*} \]

The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.

Notice that from \((4.3)\) we actually get an "optimal policy" based on current state values (formal optimal policy establish on \(v_*\)). We can imagine if all our value function can hold for optimal bellman equation, then the current policy must be the optimal policy \(\pi_*\).

The gride world example is so simple that we just get the optimal policy after once improvement.

we can notice even though we already get optimal policy but state values can still change (not converge).

Policy Iteration(Control)

Once a policy, \(\pi\), has been improved using \(v_\pi\) to yield a better policy, \(\pi'\), we can then compute \(v_\pi'\) and improve it again to yield an even better \(\pi''\). Through alternating evaluation and improvement, we can thus obtain a sequence of monotonically improving policies and value functions

\[ \pi_1 \mathop{\rightarrow}^E v_{\pi_1} \mathop{\rightarrow}^I \pi_2 \mathop{\rightarrow}^E v_{\pi_2} \mathop{\rightarrow}^I \pi_3 \mathop{\rightarrow}^E ... \mathop{\rightarrow}^I \pi_* \mathop{\rightarrow}^E v_{\pi_*} \mathop{\rightarrow}^I \pi_* \]

This way of finding an optimal policy is called policy iteration. It follows a sequcence of better and better policies and value functions until it reaches the optimal policy and accosiated optimal value function.

Here is a simple algorithnm for policy iteration with iterative policy evaluation

Generalized Policy Iteration

One big problem of above policy iteration algorithnm is that we first have to wait for the convergence of policy evaluation and then improve policy. This waste so much time and its efficience is low.

The generalized policy iteration(GPI) flex the process and still gaurantee convergence of policy iteration. In each step, we don't have to reach the boundary but just incomplete steps towards each boundary.

Value Iteration

Value iteration is a special case of GPI as well as the policy iteration. In this case, we don't have to wait for completely finishing of policy evaluation. We stop after update evaluation one sweep (one update of each state). It combines policy evaluation and improvement by changing bellman optimality equation into an update rule.

\[ \begin{eqnarray*} v_{k+1}(s) &\doteq&\mathop{max}_a q_k(s,a)\\ &=&\mathop{max}_a \sum_{s',r}p(s',r|s,a)(r+\gamma v_k(s'))\tag{4.4} \end{eqnarray*} \]

Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement. Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement sweep.

Asynchronous DP

  • synchronous DP: in each iteration, sweep all states and update them all.
  • asynchronous DP: in each iteration, just update some of states and update in any order

When state space is too large, sometimes it's impossible to use synchronous method to update all states.

Efficiency of Dynamic Programming

DP may not be practical for very large problems, but compared with other methods for solving MDPs (like linear programing or brute search), DP methods are actually quite efficient.

If \(n\) and \(k\) denote the number of states and actions, this means that a DP method takes a number of computational operations that is less than some polynomial function of \(n\) and \(k\). A DP method is guaranteed to find an optimal policy in polynomial time even though the total number of (deterministic) policies is \(k^n\).

Here comes to the curse of demensionality:

  • The size of state space grows exponentially as the number of relevant features increase.
  • This is not an issue with DP, but an inherent complexity of the problem.

One limitation of DP is that DP requires a complete and accurate model of the environment.