
Reinforcement learning (Shimon Whiteson) #4


Reinforcement learning

slides

How can an intelligent agent learn from experience how to make decisions
that maximise its utility in the face of uncertainty?
image
Unlike unsupervised learning, we do get feedback, but it comes in the form of reward. This feedback is, however, weaker than the supervision signal in supervised learning.

Definition and intuition

In RL an agent tries to solve a control problem by directly interacting with an unfamiliar environment. The agent must learn by trial and error, trying out actions to learn about their consequences.

Difference to supervised learning

  1. Agent has partial control over what data it collects in the future;
  2. No right and wrong, just rewards for actions;
  3. The agent must learn on-line: must maximise performance during learning, not afterwards.

The reward must be quantifiable -- hence the reward design problem: how do we define what counts as a "good" or "bad" action?

When the reward design is poor, the agent can exhibit undesirable behaviour.

K-armed bandit problem

Setting: you're an octopus, sitting before a slot machine (bandit) with many arms, where each arm has an unknown stochastic payoff. The goal is to maximise cumulative payoff over some period.

image
Formalising:
image
An example of an infinite-horizon problem is Google's search engine. \gamma tells you how much you care about the future, i.e. how you trade off instant gratification against long-term reward.
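For reference, the standard way of formalising this objective (assumed here to match the slide): at each step t the agent pulls an arm and receives reward r_t, and it maximises the expected discounted return

E\left[ \sum_{t=0}^{T} \gamma^t r_t \right], \qquad 0 \le \gamma \le 1,

with a finite horizon T in the finite-horizon case and T = \infty (requiring \gamma < 1 for the sum to stay bounded) in the infinite-horizon case.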

Exploration and exploitation

Explore: try out the arms in order to learn about them and improve the chances of getting future reward;

Exploit: focus on the arm that currently looks most profitable and collect the largest immediate reward.

How should the agent balance exploration and exploitation?

  • Horizon is finite: exploration should decrease as the horizon gets closer
  • Horizon is infinite but \gamma < 1: exploration should decrease as the agent's uncertainty about the expected rewards goes down
  • Horizon is infinite and \gamma = 1: the infinitely delayed splurge -- with an infinite future ahead, the agent can keep exploring forever and delay gratification indefinitely.

To address this we have:

Action value methods

image

A few different methods:

  1. epsilon-greedy:
    image

  2. softmax exploration: concentrate exploration on most promising arms.
    image

  3. Upper confidence bound:
    Neither epsilon-greedy nor softmax considers the uncertainty in the action-value estimates, even though the goal of exploration is to reduce that uncertainty. So focus exploration on the most uncertain actions. This uses the principle of optimism in the face of uncertainty.

Focus exploration on arms that are both promising and uncertain.

image
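As a rough illustration of these three selection rules (not taken from the slides; epsilon, the temperature tau and the exploration constant c are hypothetical hyperparameters), here is a minimal sketch in Python:

```python
import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    # With probability epsilon pick a random arm, otherwise the greedy one.
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))
    return int(np.argmax(Q))

def softmax_action(Q, tau=1.0):
    # Softmax/Boltzmann exploration: sample arms in proportion to exp(Q/tau).
    prefs = np.exp((Q - np.max(Q)) / tau)   # subtract max for numerical stability
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(Q), p=probs))

def ucb(Q, counts, t, c=2.0):
    # Optimism in the face of uncertainty: add an exploration bonus that
    # grows with total time t and shrinks with how often each arm was pulled.
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(Q + bonus))
```

In practice a UCB agent usually pulls each arm once before applying the bonus formula so that counts is never zero; the 1e-8 here is just a crude guard for that case.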

Contextual bandit problem

image
So it is the traditional bandit problem, but instead of conditioning Q on the action alone, we condition it on the state and action together.

Going beyond bandits, our action doesn't just affect the reward, it also affects the state of the world:
image

The credit-assignment problem: suppose an agent takes a long sequence of actions, at the end of which it receives a large reward. How does it determine to what degree each action in that sequence is responsible for the resulting reward?

Markov decision processes: formalising the RL problem

image
stationary: the rules (the "physics" of the environment) do not change over time
stochastic: transitions and rewards are a bit random (not deterministic)

image

MDP example: recycling robot

image

The Markov property

image

The current state is a sufficient statistic of the agent's history, so conditioning actions on the rest of the history cannot possibly help. This restricts the search to reactive policies:

image

Is it Markov?

  1. Robot in a maze, state is wall/no wall on each of the 4 sides, action is up, down, left, right: no
  2. Chess, state is board position, action is legal move: yes (apart from history-dependent special rules such as castling and repetition)

Return: value function and Bellman equation

We now need to reason about long-term consequences, and we can do so by maximising the expected return, which is the (discounted) sum of the rewards received.
image

Value function: value functions are the primary tool for reasoning about future reward. The value of a state is the expected return obtained by following the policy from that state.

image
Note that in the action-value function the first action can deviate from the policy (it doesn't have to be the action the policy would choose).

Bellman equation: we are not sure which action will be taken or what the next state will be, so we take an expectation over actions and over next states, expressing a value in terms of the values of successor states. Updating an estimate from other estimates like this is commonly known as bootstrapping:
image

The difference between the two equations is whether we bootstrap on the next state or on the next state-action pair.
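For reference, the standard Bellman expectation equations that the figures show, written with the transition function P and reward function R from the recap below (assumed notation):

V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \right]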

To gain some intuitions:
image

Look at the LHS of the tree, corresponding to the first equation:

  • At the first extension of the tree, we sum over all the actions -- the first summation
  • At the second extension of the tree, we sum over all the stochastic outcomes (next states) -- the second summation

Optimal value functions

image

The Bellman optimality equations express this recursively: we replace the Bellman equation's expectation over actions with a maximisation over actions:

image
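In the same assumed notation, the Bellman optimality equations:

V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]

Q^*(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]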

A recap -- P: transition function, R: reward function

image

Planning with MDP

MDPs give us a formal model of sequential decision making. Given the optimal value function, computing an optimal policy is straightforward. But how can we find V* or Q*?

Algorithms for MDP planning compute the optimal value function given a complete model of the MDP. Given a model, V* is usually sufficient.

Dynamic programming approach

image

We start off with an arbitrary policy (pi), then compute the true value function of that policy (V). We then use the value function to derive an incrementally improved policy. Repeating this process converges to an optimal policy.

We can use the Bellman equation to exploit the relationship between states (instead of estimating each state independently). The initial value function is chosen arbitrarily.

Policy evaluation update rule

image

Each state's new estimate is computed by summing over the actions and successor states reachable from it. Apply this update to every state in each sweep of the state space. Repeated over many sweeps, it eventually converges to the fixed point where V_k = V^pi, i.e. we obtain the true value of the arbitrary policy.

image

image
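A minimal Python sketch of this sweep, assuming a hypothetical tabular model where P[s][a] is a list of (probability, next_state, reward) triples and policy[s][a] gives the probability of taking action a in state s (none of these names come from the slides):

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: sweep every state until V stops changing."""
    n_states = len(P)
    V = np.zeros(n_states)                   # arbitrary initial value function
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman expectation backup for state s under the given policy.
            v_new = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                           for p, s2, r in P[s][a])
                        for a in range(len(P[s])))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                     # in-place (Gauss-Seidel style) update
        if delta < theta:                    # fixed point reached: V ~= V^pi
            return V
```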

More intuitively:
image

We start from a random point and first perform policy evaluation, which gets us to the point where the value function is exactly that of the policy (V = V^pi). We then perform policy improvement, at which point the value function no longer matches the (new, improved) policy. Keep iterating and you converge to the optimal value function and an optimal policy.

Guaranteed convergence: how? (i.e. how do we know these two lines keep intersecting and that we can always reach an optimal point?) Because we are doing a closed-loop update, which means the action always has the opportunity to be conditioned on the state.

"A counter example": say you take two actions, left and right and you have two timesteps. if these are the true values corresponding to action:

  • LL yields return of 5
  • LR yields return of 0
  • RL yields return of 0
  • RR yields return of 10
    So if you start from LL, will you be stuck in a local maximum?

This does not happen, precisely because of the closed-loop update -- we consider the value of every action in every state, so even when the current policy takes the first step to the left, the Bellman equation also considers the state we would have reached by stepping right. When we then improve the policy, it figures out that taking two steps to the right gives the most value.

Value iteration

We do not always have to wait for policy evaluation to fully converge before doing policy improvement -- we can, for example, improve the policy after only a few (or even a single) evaluation sweeps.

image

Here we take the Bellman optimality equation and turn it into an update rule.
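Under the same hypothetical P[s][a] model as the policy evaluation sketch above, a sketch of value iteration, which applies the Bellman optimality backup directly:

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-6):
    """Value iteration: V(s) <- max_a sum_s' P(s'|s,a) [r + gamma V(s')]."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in range(len(P[s])))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract a greedy policy from the converged values.
    policy = [max(range(len(P[s])),
                  key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in range(n_states)]
    return V, policy
```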

MC methods

MC provides one way to perform reinforcement learning: finding optimal policies without an a priori model of the MDP. MC for RL learns from complete sample returns in episodic tasks: it uses value functions but not Bellman equations.

image

image

So this is not a tree backup anymore: we just keep performing rollouts and average the resulting returns (this is like a depth-first search, whereas the Bellman backup is more like a breadth-first search).

image

image

We just carry out the entire policy in the real world and observe the returns.
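A minimal sketch of every-visit Monte Carlo prediction from complete episodes; sample_episode is a hypothetical function that runs the policy to termination and returns a list of (state, reward) pairs, with reward being the reward received after leaving that state:

```python
from collections import defaultdict

def mc_prediction(sample_episode, policy, gamma=1.0, n_episodes=1000):
    """Every-visit MC: average complete sample returns, no bootstrapping."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode(policy)     # [(state, reward), ...] until termination
        G = 0.0
        # Walk the episode backwards so G is the discounted return from each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```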

On-policy MC control

image
Caveat: converges to the best epsilon-soft policy rather than the overall best policy.

Off-policy MC control

To avoid this caveat, we can do off-policy MC, which allows the estimation (target) policy to differ from the behaviour policy. This is done using importance sampling.

image
image
image

The variance depends on how different the target policy is from the behaviour policy; if the two policies differ too much, the quality of the estimate is very limited.
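For reference, the standard per-episode importance weight for target policy \pi and behaviour policy b, which reweights a return generated by b so that it is an unbiased sample for \pi (ordinary importance sampling shown; the weighted variant normalises by the sum of the weights instead of N):

\rho = \prod_{t=0}^{T-1} \frac{\pi(a_t|s_t)}{b(a_t|s_t)}, \qquad V^\pi(s) \approx \frac{1}{N} \sum_{i=1}^{N} \rho^{(i)} G^{(i)}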

Temporal-difference methods

TD(0): estimation of V

image
image

Pseudo code:
image
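A minimal sketch of the TD(0) update loop. The env interface (reset/step returning a 3-tuple) and the step size alpha are assumptions for illustration, not the slides' notation:

```python
import numpy as np

def td0(env, policy, n_states, gamma=0.99, alpha=0.1, n_episodes=500):
    """TD(0): after every step, nudge V(s) towards the bootstrapped target r + gamma V(s')."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)              # hypothetical interface
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])            # TD error scaled by step size
            s = s_next
    return V
```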

Difference between TD(0), MC and DP

image

Advantages

  • TD methods require only experience, not a model
  • TD, but not MC, methods can be fully incremental (we don't have to wait until the end of the episode to do an update)
  • Learn before final outcome: less memory and peak computation
  • Learn without the final outcome: from incomplete sequences
  • Both MC and TD converge but TD tends to be faster

Sarsa: estimation of Q

image

Bootstrapping off the next state-action pair rather than just the next state.

Sarsa is not just a policy evaluation algorithm but a control (optimisation) algorithm: it doesn't merely estimate Q^pi for a fixed policy, it keeps improving the policy so as to find Q*:

image
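A sketch of Sarsa with an epsilon-greedy behaviour policy, using the same hypothetical env interface as the TD(0) sketch above:

```python
import numpy as np

def sarsa(env, n_states, n_actions, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=500):
    """Sarsa: bootstrap off the next state-action pair that is actually taken."""
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)       # on-policy: next action from the same policy
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```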

Expected Sarsa

Compute the expectation over next actions explicitly rather than taking a single sample of the next action, which reduces the variance of the updates:
image
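The standard Expected Sarsa update (assuming step size \alpha), which replaces the sampled Q(s_{t+1}, a_{t+1}) with its expectation under the policy:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \sum_{a'} \pi(a'|s_{t+1}) Q(s_{t+1}, a') - Q(s_t, a_t) \right]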

Q-learning: off-policy TD control

In MC we can do both on-policy and off-policy learning, and we can do the same in temporal-difference learning.
image

When we maximise over all actions, we are no longer taking an expectation under (or a sample from) the behaviour policy, so this is off-policy.
image
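A corresponding Q-learning sketch, identical in shape to the Sarsa sketch above except that the target maximises over next actions rather than using the action the behaviour policy actually takes:

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=500):
    """Q-learning: behave epsilon-greedily, but bootstrap off max_a' Q(s', a') (off-policy)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```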

Summary: a unified view

image
