RL1:K-armed Bandits Problem

Refer from Reinforcement Learning An Introduction (2nd) and Reinforcement Learning specialization on Coursera by University of Alberta & Alberta Machine Intelligence Institute

The K-Armed Bandit Problem

In the k-armed bandit problem, we have an agent who choooses between "k" action and recieves a reward based on the action it chooses. The goal of the agent is to maximize the expected total reward over some time period, or time steps.

In a word, decision making under uncertainty can be formalized by the k-armed bandit problem

Action Values

the value of an action is the expected reward when that action is taken \[ \begin{eqnarray*} q_*(a) &\doteq& \mathbb{E}[R_t\ | \ A_t = a],\ \forall a \in \mathbf{A}\\ &=&\sum_rp(r|a)r\tag{1.1} \end{eqnarray*} \] To clearly explain this, the reward of an action may not be fixed so the reward may comfirm to a uniform distribution or normal distribution. Here is an example to illustrate this process

Action Selection

Remember that the agent's goal is to maximize the expected total amount of reward it recieves. We can call this procedure the argmax (the argument "a" which maximizes function \(q_*\)) \[\mathop{argmax}_a \ q_*(a)\] It is clear that if you know the value of each action, then it would be trivial to solve the k-armed bandit problem: you would always select the action with highest value.

Thus, the problem here is that the agent do not know the action values with certainty. It only have to estimate the action values. But how to estimate action-values?

What to Learn? Estimating Action Values


One way to estimate \(q_*\) is to compute a sample-average. Here we denote the estimated value of action \(a\) at time step \(t\) as \(Q_t(a)\). We would like \(Q_t(a)\) to be close to \(q_*(a)\).

\[Q_t(a) \doteq \frac{\sum_{i=1}^{t-1}R_i}{t-1}\tag{1.2}\]

Notice that we use \(t-1\) because we are calculating \(Q_t(a)\) with history prior to \(t\). And to be clear, \(t\) has separate corresponding value for each action.
Here is an example of medical trail.
And we can simplify \(Q_t\) by extending it in incremental update rule

\[ \begin{eqnarray*} Q_{t+1} &=&\frac{1}{t}\sum_{i=1}^tR_i\\ &=&\frac{1}{t}(R_t+(t-1)\frac{1}{t-1}\sum_{i=1}^{t-1}R_i)\\ &=&\frac{1}{t}(R_t+(t-1)Q_t)\\ &=&Q_t+\frac{1}{t}(R_t-Q_t)\tag{1.3} \end{eqnarray*} \]

To generalize it, we replace \(\frac{1}{t}\) with \(\alpha_t\) to represent stepsize and it's easy to know \(\alpha_t \in [0,1]\)


And it can also express as below

\[NewEstimate \leftarrow OldEstimate\ +\ StepSize(Target\ -\ OldEstimate)\]

The expression \((Target-OldEstimate)\) is an error in the estimate. It is reduced by taking a step toward the “Target.” The target is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the nth reward.

Nonstationary Problem

What if the reward distribution of each action is changing over time? In this situation, what we learn form history may not correctly represent current state due to changing of reward distribution. In such cases it makes sense to give more weight to recent rewards than to long-past rewards because recent information reflect changing more accurate. To do so, we can use a fixed stepsize,like 0.5, instead of \(\frac{1}{t}\). This can lead to decaying exponently past reward.


As \(\alpha_t\) increases, the more \(R_t\) contributes to \(Q_{t+1}\) and the less for \(Q_t\).

Greedy Action

At each time step, the agent could alway choose the action with the highest action value estimated by history. And this is called greedy action.

\[a_g = \mathop{argmax}_a\ \ Q(a)\]

Seemingly we can through greedy action to archieve our goal, maxmizing expected total reward. But there is still a problem exists: how does the agent know its estimation about each action is correctly approximating to the real distribution? Greedy action is greatly impact by initial action value. Why does the agent know other actions are worse than greedy action even trying them only a few times?

I think the agent should spend some time trying other actions instead of greedy action. And this come to the exploration and exploitation tradeoff

Exploration vs. Exploitation Tradeoff

Exploration and Exploitation Dilemma

  • Exploration: improve knowledge for long-term benefit
  • Exploitation: exploit knowledge for short-term benefit

The dilemma means: How do we choose when to exploit and when to explore? When we explore, we get more accurate estimates of our values. When we exploit, we might get more reward. We cannot however choose to do both simultaneously.

Epsilon-Greedy Action Selection

One very simple method for choosing between exploration and exploitation is to choose randomly. We could choose to exploit most of the time with a small chance of exploring. For instance, we could roll a dice.

This is called \(\epsilon\)-greedy action selection. We formulize it as below, where \(\epsilon \in [0,1]\)

\[ A_t\leftarrow \begin{cases} \mathop{argmax}_a Q_t(a)\ \ \ \ \ \ \ \ 1-\epsilon\\ \\ a\sim Uniform(\mathbf{A})\ \ \ \ \ \ \epsilon \end{cases} \]

Let's evaluate on 10-armed testbed

We run each agent with different \(\epsilon\) for 2000 independent runs, each run for 1000 time steps. Then we average 2000 runs to get the learning algorithm’s average behavior.

The advantage of \(\epsilon\)-greedy over greedy methods depends on the task. If the reward variance is large, like 10, then more explorations to detect best action is important. More detections give the agent more confidence to pick up optimal action.

On the other hand, if the reward variance is zero, then greedy method would know the true value of each action after trying it once. Seemingly, in this case, greedy method would soon find the greedy action and it might perform best. But, this can not be guaranteed under nonstationary situations. Nonstation means the true values of the actions changed over time. In this case, exploration is needed to ensure that one of the nongreedy actions has not changed to become better than the greedy one.

Optimistic Initial Values

Optimistic initial values encourage early exploration. In real programing, don't intialize all acion values defaut to 0. Here is an example to show algorithm's performance by using this method.

Limitation of optimistic inital values:

  • Optimistic intial values only drive early exploration
  • They are not will suited for nonstationary problems
  • We may not know what the optimistic values should be

Upper-Confidence Bound (UCB) Action Selection

\(\epsilon\)-greedy action selection forces the non-greedy actions to be tried, but indiscriminately and uniformly, with no preference for those that are nearly greedy or particularly uncertain. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates.

In upper-confidence bound(UCB) action selection, we follow the principle of optimism in the face of uncertainty. This simply means that if we are uncertain about something, we should optimistically assume that it is good.

For instance, say we have these three actions with associated uncertainties. The larger confidence interval means bigger uncertainies. So the agent optimistically picks the action that has the highest upper bound, action 1.
After that, this time, we should select action 2.

The key idea in UCB is that it uses uncertainty or variance (sqrt root) in the value estimation for balancing exploration and exploitation.

\[A_t \doteq \mathop{argmax}_a [Q_t(a)+c\sqrt{\frac{\ln{t}}{N_t(a)}}\ ]\]

We can clearly see here how UCB combines exploration and exploitation. The first term in the sum represents the exploitation part, and the second term represents the exploration part. The constant parameter \(c\) is used to scale the extent of exploration. \(N_t(a)\) denotes the number of times that action \(a\) has been selected prior to time \(t\).

The limitation of UCB is not practical for very huge state spaces.


The gap between k-armed bandit problem and the full reinforcement learning problem:

  1. there is no association between actions and situations(or states). Each selection of actions is non-contextual or independent.
  2. thers is no policy, which designate actions at each situation(or state).

we summarize a complete learning curve by its average value over the 1000 steps; this value is proportional to the area under the learning curve. This kind of graph is called a parameter study.