Reinforcement learning
How can an intelligent agent learn from experience how to make decisions
that maximise its utility in the face of uncertainty?
Unlike unsupervised learning, we do have feedback, but it comes in the form of reward. This feedback is, however, not as strong as the labels in supervised learning.
Definition and intuition
In RL an agent tries to solve a control problem by directly interacting with an unfamiliar environment. The agent must learn by trial and error, trying out actions to learn about their consequences.
Difference to supervised learning
- Agent has partial control over what data it collects in the future;
- No right and wrong, just rewards for actions;
- The agent must learn on-line: must maximise performance during learning, not afterwards.
The reward must be quantifiable -- hence the reward design problem: what is the definition of "good" or "bad" action?
When the reward design is poor, the agent can have undesirable behaviour.
K-armed bandit problem
Setting: you're an octopus, sitting before a slot machine (bandit) with many arms, where each arm has an unknown stochastic payoff. The goal is to maximise cumulative payoff over some period.

Formalising:
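A sketch of the usual formalisation (standard bandit notation; my own reconstruction, since the original formula is not in the text): each arm has an unknown expected payoff, and the agent maximises the expected cumulative (possibly discounted) payoff.

```latex
q_*(a) = \mathbb{E}[R_t \mid A_t = a]
\qquad\text{and the agent maximises}\qquad
\mathbb{E}\!\left[\sum_{t=0}^{T} \gamma^t R_t\right], \quad 0 \le \gamma \le 1
```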

An infinite-horizon problem is, for example, Google's search engine. The discount factor \gamma tells you how much you care about the future, i.e. how you trade off instant gratification against long-term reward.
Exploration and exploitation
Explore: explore the arms in order to learn about them and improve its chances of getting future reward;
Exploit: focus on the most profitable arm and get the largest reward.
How to balance between explore and exploit mode?
- Horizon is finite: exploration should decrease as the horizon gets closer
- Horizon is infinite but \gamma < 1: exploration should decrease as the agent's uncertainty about the expected rewards goes down
- Horizon is infinite and \gamma = 1: the "infinitely delayed splurge" -- you have an infinite future ahead of you, so you can always keep exploring and delay gratification indefinitely
To address this trade-off we have the following action-value methods.
Action value methods
A few different methods (a small code sketch of these selection rules follows the list):
- Epsilon-greedy: with probability \epsilon pick a random arm, otherwise pick the arm with the highest estimated value.
- Softmax exploration: concentrate exploration on the most promising arms.
- Upper confidence bound (UCB): neither epsilon-greedy nor softmax considers the uncertainty in the action-value estimates, even though the goal of exploration is to reduce uncertainty. So focus exploration on the most uncertain actions, using the principle of optimism in the face of uncertainty: explore arms that are both promising and uncertain.
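A minimal Python sketch of these three selection rules, assuming we maintain sample-mean estimates Q[a] and pull counts N[a] for each arm (the function names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def softmax(Q, tau=1.0):
    """Sample an arm with probability proportional to exp(Q / tau)."""
    prefs = np.exp((Q - np.max(Q)) / tau)   # subtract max for numerical stability
    return int(rng.choice(len(Q), p=prefs / prefs.sum()))

def ucb(Q, N, t, c=2.0):
    """Optimism in the face of uncertainty: add an exploration bonus that is
    large for rarely tried arms (small N[a]) and shrinks as they are pulled."""
    bonus = c * np.sqrt(np.log(t + 1) / (N + 1e-9))
    return int(np.argmax(Q + bonus))
```

Each of these would sit inside a loop that pulls the chosen arm, observes the reward, and updates Q and N incrementally.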
Contextual bandit problem

So it's a traditional bandit problem, but instead of Q being conditioned only on the action, it is conditioned on the state and action together.
Our action doesn't just affect reward, it also affects the state of the world:

The credit-assignment problem: suppose an agent takes a long sequence of actions, at the end of which it receives a large reward. How does it determine to what degree each action in that sequence is responsible for the resulting reward?
Markov decision processes: formalising the RL problem

stationary: the dynamics (the "rules of physics") do not change over time
stochastic: the outcome of an action is random, governed by a transition probability distribution (a formal definition is sketched below)
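For reference, the usual formal definition (standard notation, matching the recap later where P is the transition function and R the reward function):

```latex
\text{MDP} = (S, A, P, R, \gamma), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a), \qquad
R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]
```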
MDP example: recycling robot
The Markov property
The current state is a sufficient statistic of the agent's history: conditioning actions on the rest of the history cannot possibly help. This restricts the search to reactive policies:
Is it Markov?
- Robot in a maze, state is wall or no wall on each of the 4 sides, action is up, down, left, right: no (the local wall pattern does not uniquely identify where the robot is, so the history still carries useful information)
- Chess, state is board position, action is legal move: yes (minus the special history-dependent rules such as castling and repetition)
Return: value function and Bellman equation
We now need to reason about long-term consequences, and we can do so by maximising the expected return, which is the (discounted) sum of the rewards received.
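In standard notation, the discounted return from time t is:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```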

Value function: value functions are the primary tool for reasoning about future reward. The value of a state is the expected return obtained by following the policy from that state.
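In standard notation, the state-value and action-value functions of a policy \pi are:

```latex
V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s], \qquad
Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
```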

Note that in the action-value function the first action can deviate from the policy (it does not have to be an action the policy would choose); only the subsequent actions follow the policy.
Bellman equation: we are not sure what the next state will be, so we take the expectation over actions and next states, expressing a value in terms of the values of successor states. Defining one value estimate in terms of others is commonly known as bootstrapping:
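The two Bellman expectation equations being referred to, in standard form:

```latex
V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\big[R(s, a) + \gamma V^\pi(s')\big]
```

```latex
Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\Big[R(s, a) + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\Big]
```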

The difference between the two equations is whether we bootstrap over state or state-action pair.
Look at the left-hand side of the backup tree (backup diagram), corresponding to the first equation:
- In the first extension of the tree, we look at all the actions -- the first summation, weighted by the policy
- In the second extension of the tree, we look at all the stochastic outcomes -- the second summation, weighted by the transition probabilities
Optimal value functions
The Bellman optimality equations express this recursively: we replace the Bellman equation's expectation over actions with a maximisation wrt action:
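In standard form:

```latex
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\big[R(s, a) + \gamma V^*(s')\big], \qquad
Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\big[R(s, a) + \gamma \max_{a'} Q^*(s', a')\big]
```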
A recap -- P: transition function, R: reward function
Planning with MDP
MDPs give us a formal model of sequential decision making. Given the optimal value function, computing an optimal policy is straightforward. But how can we find V* or Q*?
Algorithms for MDP planning compute the optimal value function given a complete model of the MDP. Given a model, V* is usually sufficient.
Dynamic programming approach
We start off with an arbitrary policy \pi, then we compute the true value function V^\pi for that policy (policy evaluation). Then we use the value function to work out an incremental improvement to our policy (policy improvement). Repeat the process until you reach the optimal policy.
We can use the Bellman equation to exploit the relationship between states (instead of estimating each state independently). The initial value function is chosen arbitrarily.
Policy evaluation update rule
We update the value of each state by taking the expectation over actions and successor states, using the current estimates for the successors. Apply this to every state in each sweep of the state space. Repeated over many sweeps, it eventually converges to the fixed point where V_k = V^\pi (we get the true value of the arbitrary policy).
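In standard form, the iterative policy evaluation update is:

```latex
V_{k+1}(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\big[R(s, a) + \gamma V_k(s')\big]
```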
We start from a random point and first perform policy evaluation, which gets us to where the value is exactly that of the policy (V = V^\pi). We then perform policy improvement, at which point the value function no longer matches the value of the (new) policy. Keep iterating and you converge to the optimal value.
Guaranteed convergence: how? (i.e. how do you know these two curves keep intersecting so that you always get to an optimal point?) Here we are doing a closed-loop update, which means the action always has the opportunity to condition on the state.
"A counter example": say you take two actions, left and right and you have two timesteps. if these are the true values corresponding to action:
- LL yields return of 5
- LR yields return of 0
- RL yields return of 0
- RR yields return of 10
So if you start from LL, will you be stuck in a local maximum?
This is not the case, precisely because of the closed-loop update -- the Bellman backup considers every action in every state, so even when we take the first step to the left, it also considers the state reached if we had stepped right. When we then try to find the optimal action, it will figure out that taking two steps to the right gives the most value.
Value iteration
We do not always have to wait for policy evaluation to complete before doing policy improvement -- in the extreme, we can perform just a single evaluation sweep between improvements.
Here we take the Bellman optimality equation and turn it into an update rule.
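A minimal sketch of tabular value iteration, assuming the model is given as NumPy arrays P[s, a, s'] (transition probabilities) and R[s, a] (expected rewards); the names and tolerances are illustrative:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Turn the Bellman optimality equation into an update rule.

    P: transition probabilities, shape (S, A, S)
    R: expected rewards,        shape (S, A)
    Returns the optimal value function and a greedy policy w.r.t. it.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                      # arbitrary initial value function
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * P @ V                   # shape (S, A)
        V_new = Q.max(axis=1)                   # max over actions: Bellman optimality backup
        if np.max(np.abs(V_new - V)) < theta:   # stop when a sweep barely changes anything
            return V_new, Q.argmax(axis=1)
        V = V_new
```

Once the value function has converged, the greedy policy with respect to it is an optimal policy.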
MC methods
Monte Carlo (MC) methods provide one way to perform reinforcement learning: finding optimal policies without an a priori model of the MDP.
MC for RL learns from complete sample returns in episodic tasks: it uses value functions but not Bellman equations.
So this is not a tree backup anymore: we just keep performing rollouts and averaging the resulting returns (this is like a depth-wise search, whereas the Bellman backup is a breadth-wise search).
We just carry out the entire policy in the real world.
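The Monte Carlo estimate is just an average of sampled returns, e.g. in the first-visit form:

```latex
V(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)
```

where G_i(s) is the return observed after the i-th episode's first visit to s.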
On-policy MC control

Caveat: converges to the best \epsilon-soft policy rather than the best overall policy.
Off-policy MC control
To avoid this caveat, we can do off-policy MC, which allows the estimation (target) policy to be different from the behaviour policy. This is done using importance sampling.
The variance depends on the difference between the target policy and the behaviour policy: if the two policies differ too much, the quality of the estimate is very limited.
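The importance-sampling correction, in standard form with target policy \pi and behaviour policy b, weights each return by the likelihood ratio of the trajectory:

```latex
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
```

The more \pi and b differ, the larger the variance of these weights.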
Temporal-difference methods
TD(0): estimation of V
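The TD(0) update, in standard form (\alpha is the step size; the bracketed term is the TD error):

```latex
V(S_t) \leftarrow V(S_t) + \alpha\big[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big]
```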
Difference between TD(0), MC and DP
Advantages
- TD methods require only experience, not a model
- TD, but not MC, methods can be fully incremental (we don't have to wait until the end of the episode to do an update)
- Learn before final outcome: less memory and peak computation
- Learn without the final outcome: from incomplete sequences
- Both MC and TD converge but TD tends to be faster
Sarsa: estimation of Q
Bootstrapping off state-action pair rather than just the next state.
It is used as a control algorithm, not just for policy evaluation: coupled with \epsilon-greedy policy improvement, it aims at Q* rather than merely evaluating a fixed Q^\pi:
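The Sarsa update, in standard form (it bootstraps from the next action A_{t+1} actually taken by the behaviour policy):

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big]
```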
Expected Sarsa
Compute the expectation over the next action explicitly rather than just taking one sample of it, to reduce the variance of the updates:
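In standard form:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Big[R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1})\, Q(S_{t+1}, a') - Q(S_t, A_t)\Big]
```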

Q-learning: off-policy TD control
In MC we can do both on-policy and off-policy learning, and we can do the same in temporal-difference learning.
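The Q-learning update, in standard form:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\big]
```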

When we take the maximum over all actions, we are no longer taking an expectation over (or a sample from) the behaviour policy, so this is off-policy.
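A minimal tabular Q-learning sketch, assuming a toy environment interface where env.reset() returns a state index and env.step(a) returns (next_state, reward, done); the interface, names, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Off-policy TD control: behave epsilon-greedily, but bootstrap from
    max_a' Q(s', a') regardless of the action actually taken next."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy target: maximise over next actions
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```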
































