Refer from Reinforcement Learning An Introduction (2nd) and Reinforcement Learning specialization on Coursera by University of Alberta & Alberta Machine Intelligence Institute

## Policies and Value Functions

### Policy

Formally, a policy is a mapping from states to probabilities of selecting each possible action.

• deterministic policy, denote $\pi(s)=a$: one state correspond to one action
• stochastic policy, $\pi(a|s)$: a probability distribution over all visable actions

It's important that policies depend only on the current state, not on other things like time or previous states.

### State-value Function

Almost all reinforcement learning algorithms involve estimating value functions -- functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state), which means value function predicts reward into future.

The value function of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_\pi$ formally by

$v_\pi(s)\doteq \mathbb{E}_\pi[G_t|S_t=s]=\mathbb{E}_\pi[\sum_{k=0}^\infty \gamma^kR_{t+k+1}|S_t=s],\ for\ all\ s\in \mathbf{S}$

We call the function $v_\pi$ the state-value function for policy $\pi$

### Action-value Function

Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s,a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

$q_\pi(s,a)\doteq \mathbb{E}_\pi[G_t|S_t=s,A_t=a]=\mathbb{E}_\pi[\sum_{k=0}^\infty \gamma^kR_{t+k+1}|S_t=s,A_t=a]$

We call the function $q_\pi$ the **action-value function for policy *,

## Bellman Equations

### Transform Between $v_\pi$ And $q_\pi$

Through the given diagram, we can derivate an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

$\begin{eqnarray*} v_\pi(s) &=&\mathbb{E}_\pi[G_t|S_t=s]\\ &=&\sum_a\pi(a|s)\mathbb{E}[G_t|S_t=s,A_t=a]\\ &=&\sum_a\pi(a|s)q(s,a)\tag{3.1} \end{eqnarray*}$ Recall that $G_t=R_{t+1}+\gamma G_{t+1}$. With given diagram below, we can derivate an equation for $q_\pi$ in terms of $v_\pi$ and $p(s',r|s,a)$. (use markov property to simplify original equation)

$\begin{eqnarray*} q_\pi(s,a) &=&\mathbb{E}_\pi[G_t|S_t=s,A_t=a]\\ &=&\mathbb{E}_\pi[R_{t+1}+\gamma G_{t+1}]\\ &=&\sum_{s',r}p(s',r|s,a)(r+\gamma \mathbb{E}_\pi[G_{t+1}|S_{t+1}=s'])\\ &=&\sum_{s',r}p(s',r|s,a)(r+\gamma v_\pi(s'))\tag{3.2} \end{eqnarray*}$

### Bellman Equations for $v_\pi$

Bellamn equation allows us to relate the value of the current state to the value of future states without waiting to observe all the future rewards.
With $(3.1),(3.2)$, we can easiy get bellamn equation for $v_\pi$

$\begin{eqnarray*} v_\pi(s) &=&\sum_a\pi(a|s)q(s,a)\\ &=&\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)(r+\gamma v_\pi(s'))\tag{3.3} \end{eqnarray*}$

$s'$ here stands for next state. Bellamn equation expresses a relationship between the value of a state and the values of its successor states (next state). You can get a more intiutive understanding about how this equation forms though its backup diagram below: we first choose actions through policy and then transites to a successor state with corresponding reward under environment dynamics. The open circles represent state nodes and solid circles represent action nodes.

### Bellman Equations For $q_\pi$

With $(3.1),(3.2)$, we can easiy get bellamn equation for $q_\pi$

$\begin{eqnarray*} q_\pi(s,a) &=&\sum_{s',r}p(s',r|s,a)(r+\gamma v_\pi(s'))\\ &=&\sum_{s',r}p(s',r|s,a)(r+\gamma\sum_{a'} q_\pi(s',a'))\tag{3.4} \end{eqnarray*}$

$a'$ here stands for actions under possible state $s'$. Similarly, this equation express a relationship betweeen the value of an action and the values of actions under possible successor state $s'$

### An Example of Value Function: Grideworld

• State: each grid (totally 25 states)
• Action: north, south, east, and west
• Reward: actions that would take the agent off the grid leave its location unchanged, but also result in a reward of -1. Other actions results in a reward 0. Every time the agent moves any directions at position A would be transited to A' with reward +10. This is the same to B and B' except for reward +5.
• Policy: random policy, each direction has equal probability, 25%.

$\gamma=0.9$

Now, just for curious, we can verify the $v_\pi(s)$ bellman equation in this example. Let's say we choose center state.

$(0.7\times0.9+0)\times0.25+(2.3\times0.9+0)\times0.25+(0.4\times0.9+0)\times0.25\\+(-0.4\times0.9+0)\times0.25=0.69\approx0.7$

### Why Bellman Equations?

We use bellman equations to calculate $v_\pi(s)$ for all $s\in \mathbf{S}$ by solving the system of linear equations.

Still the previous example, gride world. It has 25 states and we get 25 unknown parameters. For each state we can get a bellman equation, therefore we get 25 equations. Through 25 equations, we sovle them to obtain 25 state values.

But bellman equation is not pragmatic for many reasons. One of them is that sometimes the number of states is too large and it's impossible to calculate results in a short time. It can only solve small MDPs.

## Optimality (Optimal Policies & Value Functions)

### Optimal Policy

A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words,

$\pi>\pi' \ \ iff\ v_\pi(s)\ge v_{\pi'}(s)\ for\ all\ s \in \mathbf{S}$

There is always at least one deterministic policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by $\pi_*$. They share the same state-value function, called the optimal state-value function, denoted $v_*$, and the same optimal action-value function, denoted $q_*$

$v_*(s)\doteq \mathop{max}_\pi\ v_\pi(s)\\ q_*(s,a)\doteq \mathop{max}_\pi\ q_\pi(s,a)\\ for\ all\ s\in \mathbf{S},a\in \mathbf{A}$

### Optimal Value Functions

Because the optimal policy $\pi_*$ is a deterministic policy, the optimal action $a_*$ with $\pi(a_*|s)$ equals to 1 and all other actions' probability equal to 0. We can re-write $v_*$ with bellman equation as below, called bellman optimality equation for $v_*$.

$\begin{eqnarray*} v_*(s) &=&\mathop{max}_\pi\ v_\pi(s)\\ &=&\mathop{max}_a \mathbb{E}[R_{t+1}+\gamma v_*(S_{t+1})|S_t=s,A_t=a]\\ &=&\mathop{max}_a\ \sum_{s',r}p(s',r|s,a)(r+\gamma v_*(s'))\tag{3.5} \end{eqnarray*}$

Similarly, bellman optimality equation for $q_*$ is

$\begin{eqnarray*} q_*(s,a) &=&\mathop{max}_\pi\ q_\pi(s,a)\\ &=&\mathbb{E}[R_{t+1}+\gamma \mathop{max}_{a'}\ q_*(S_{t+1},a')|S_t=s,A_t=a]\\ &=&\sum_{s',r}p(s',r|s,a)(r+\gamma \mathop{max}_{a'}\ q_*(s',a'))\tag{3.6} \end{eqnarray*}$

The bellman optimality equations relate the value of a state or state-action pair, to it's possible successors under any optimal policy.

Unfortunately, the $max$ funtion here is not linear, we cannot through linear system solver to directly solve bellman optimality equation to obtain all $v_*$ in MDPs. Although, in principle, one can use a variety of methods for solving systems of nonlinear equations, it is not recommended due to its extreme computational cost.

### Get Optimal Policy

Once one has $v_*$, it is relatively easy to determine an optimal policy. Because, for each state $s$, there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. That is to say any policy that is greedy with respect to the optimal evaluation function $v_*$ is an optimal policy.

Having $q_*$ makes choosing optimal actions even easier. The action-value function effectively caches the results of all one-step-ahead searches.