Refer from Reinforcement Learning An Introduction (2nd) and Reinforcement Learning specialization on Coursera by University of Alberta & Alberta Machine Intelligence Institute

## Introduction to Markov Decision Processes

### Definition of MDP

MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. Here are the key element in discribing a MDP

• agent: the learner and the decision maker
• environment: the thing a agent interacts with, comprising everything outside the agent
• action: selected by the agent to interact with the environment
• reward: special numerical values that the agent seeks to maximize over time through its choice of actions.

the whole interacting process is precisely simplified as follow

More specifically, the agent and the environment interact at each of a sequence of discrete time steps, $t=0,1,2,3,...$ At each time step $t$, the agent recives some representation of the environment's state, $S_t \in \mathbf{S}$, and on that basis selects an action, $A_t \in \mathbf{A}(s)$.

One time step later, in part as a consequence of its action, the agent receives a numerical reward, $R_{t+1} \in \mathbf{R}\subset \mathbb{R}$, and finds itself in a new state, $S_{t+1}$. The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:

In a finite MDP, the sets of states, actions, and rewards ($\mathbf{S}$, $\mathbf{A}$, and $\mathbf{R}$) all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well defined discrete probability distributions dependent only on the preceding state and action. That is, for particular values of these random variables, $s' \in \mathbf{S}$ and $r \in \mathbf{R}$, there is a probability of those values occurring at time t, given particular values of the preceding state and action: $p(s',r|s,a)\doteq P(S_t=s', R_t=r\ | \ S_{t-1}=s, A_{t-1}=a)\tag{2.1}$ for all $s',s \in \mathbf{S}$, $r \in \mathbf{R}$, and $a \in \mathbf{A}(s)$. The function $p$ defines the dynamics of the MDP. $p:\ \mathbf{S}\times \mathbf{R}\times \mathbf{S}\times \mathbf{A}\rightarrow [0,1]$ With $p(s',r|s,a)$, note that future state and reward only depends on the current state and action. This is called the Markov property. It means that the present state is sufficient and remembering earlier states would not improve predictions about the future.

The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. It can be used to formalize a wide variety of sequential decision-making problem. This also means that any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment: actions, states, rewards.

### An Example of MDP

Here is an example of finite MDP, Recycling Robot. We can write down the transition probabilities and the expected rewards, with dynamics as indicated below.

We can show it in another useful way called trasition graph

There are two kinds of nodes: state nodes and action nodes. The key point here is that the agent fisrtly chooses an action and then transits to a state with corresponding probability.

## Goal of Reinforcement Learning

### The Goal of RL

In reinforcement learning, reward captures the notion of short-term gain. The objective however, is to learn a policy that achieves the most reward in the long run. With this premise, we denote the return at time step $t$ as $G_t$

$G_t\doteq R_{t+1}+R_{t+2}+R_{t+3}+...\tag{2.2}$

We intuitively want to maxmize $G_t$ at each time step. However, $G_t$ is a random variable because the dynamics of the MDP can be stochastic. That's why we define the goal of an agent is to maxmize the expected return

$\mathbb{E}[G_t] = \mathbb{E}[R_{t+1}+R_{t+2}+R_{t+3}+...]\tag{2.3}$

### The Reward Hypothesis

The Reinforcement Learning Hypothesis can be expressed as this:

Intelligent behavior araises from the actions of an individual seeking to maximize its received reward signals in a complex and challenging world.

There are two research programs:

• Identify where reward signals come from. (what reward the agent should optimize)
• Develop algorithms that search the space of action to maximize reward signals.

Informally, the agent’s goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative reward in the long run. We can clearly state this informal idea as the reward hypothesis:

That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

But how can we define reward? This is a trick problem. To use goals as rewards, we have some representations:

• goal-reward representation: 1 for achieving goal, 0 otherwise
• action-penalty representation: -1 for not achieving goal, 0 once goal reached

Both of representations have its own limitation. They lack of some middel information to tell the agent how to achieve the goal as our expectation.

In an episodic task, the agent-environment interaction breaks up into episodes. Episodes are indepedent and each episode ends in a terminal state at time step $T$. In this situation, time step is finite and we change (2.2) as

$G_t \doteq R_{t+1}+R_{t+2}+R_{t+1}+...+R_T\tag{2.4}$

The interaction between agent-environment goes on continually and there is no terminal state. In this situation, time step is infinite and $G_t$ is the same in $(2.2)$

$G_t \doteq R_{t+1}+R_{t+2}+R_{t+3}+...$

But here comes to the problem: $G_t$ could be $\infty$ and impossible to calculate. Therefore, we should make $G_t$ finite, which by using discounting method.

### Discounting

Discount the rewards in the future by $\gamma$ called discount rate, where $\gamma \in [0,1]$

$\begin{eqnarray*} G_t &\doteq& R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+...+\gamma^{k-1}R_{t+k}+...\\ &=&\sum_{k=0}^\infty \gamma^kR_{t+k+1}\tag{2.5} \end{eqnarray*}$

We can also re-write $G_t$ in the recursive version

$G_t = R_{t+1}+\gamma G_{t+1}\tag{2.6}$

To prove $G_t$ is finite, let's assume $R_{max}$ is the maximum reward the agent can receive

$\begin{eqnarray*} G_t=\sum_{k=0}^\infty \gamma^kR_{t+k+1}&\le& \sum_{k=0}^\infty \gamma^kR_{max}\\ &=&R_{max}\sum_{k=0}^\infty \gamma^k\\ &=&R_{max}\times \frac{1}{1-\gamma} \end{eqnarray*}$

It proves that $G_t$ now is finite. To Specify, let's talk about $\gamma=0$ and $\gamma=1$.

• $\gamma=0\Rightarrow G_t=R_t$ : agent only cares about immediate reward. (short-sighted agent)
• $\gamma\rightarrow1$ : agent takes future rewards into account more strongly. (far-sighted agent)

### Unified Episodic and Continuing Tasks

we can consider episode termination to be the entering of a special absorbing state that transitions only to itself and that generates only rewards of zero. For example, consider the state transition diagram: