COMP0089 - Reinforcement Learning
- easygpaser
- Jun 9, 2022
- 9 min read

1. Consider a (stateless) bandit with m different actions {a1, ..., am}. When we select an action At at time t, a random reward Rt is drawn i.i.d. from a stationary distribution p(Rt | At) that depends on the chosen action. These action-dependent reward distributions are unknown to the agent.
As examples, consider the pseudo code for a random and greedy policy to interact with this bandit on a sequence of discrete steps.

a. Using the format of the algorithms above, write down the pseudo code for 1) an ε-greedy policy and 2) a policy generated with the standard UCB algorithm (by which we mean the algorithm derived from a general Hoeffding concentration bound). [5 marks]
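For concreteness, a minimal Python sketch of the two requested action-selection rules (the array names `values` and `counts`, the exploration constant `c`, and the random generator are my own choices; the paper only asks for pseudo code in the style of its examples):

```python
import numpy as np

def epsilon_greedy_action(values, epsilon, rng):
    # With probability epsilon pick a uniformly random action,
    # otherwise pick a greedy action (ties broken uniformly).
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))
    best = np.flatnonzero(values == values.max())
    return int(rng.choice(best))

def ucb_action(values, counts, t, c=np.sqrt(2)):
    # UCB from a Hoeffding bound: argmax_a Q_t(a) + c * sqrt(log t / N_t(a)).
    # Actions that have never been tried get an infinite bonus.
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = c * np.sqrt(np.log(max(t, 1)) / counts)
    bonus[counts == 0] = np.inf
    return int(np.argmax(values + bonus))
```

In both sketches, `values[a]` would be the running average reward for action a and `counts[a]` the number of times it has been selected, updated incrementally after each step.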
b. Consider a bandit with two arms, a and b. So far, we have seen the following actions and rewards, on time steps t = 1 and t = 2:
t = 1: A1 = a, R1 = 0
t = 2: A2 = b, R2 = 1
The rewards are known to be Bernoulli random variables (so Rt ∈ {0, 1}) with unknown means. Consider a Thompson sampling algorithm to select actions, with a uniform Beta prior at time t = 0 such that the probability density functions for the expected reward for both actions before seeing any data (at time t = 0) are defined by p(E[R] = x | a) = 1, for all x ∈ [0, 1], and all a.
What is the probability under Thompson sampling of A3 = a? Show your calculations, but keep it concise. [5 marks]
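For a quick check of 1.b: with a uniform Beta(1, 1) prior, Beta-Bernoulli conjugacy gives posteriors Beta(1, 2) for arm a (one observed 0) and Beta(2, 1) for arm b (one observed 1), and Thompson sampling picks the arm whose posterior sample is larger. A small Monte Carlo sketch of P(A3 = a) (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
theta_a = rng.beta(1, 2, size=n)   # posterior for arm a after A1 = a, R1 = 0
theta_b = rng.beta(2, 1, size=n)   # posterior for arm b after A2 = b, R2 = 1
print((theta_a > theta_b).mean())  # Monte Carlo estimate of P(A3 = a)
```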
c. Consider the UCB algorithm from question 1.a. Suppose we have two actions, a and b. Consider the initial exploration bonus for each to be infinite, as long as we have not selected the corresponding action, so that the algorithm first selects each action at least once.
Suppose action a yields a Bernoulli random reward with p(R = 1 | a) = 1/3 and p(R = 0 | a) = 2/3. Action b also yields a Bernoulli random reward, but with p(R = 1 | b) = 2/3 and p(R = 0 | b) = 1/3.
What is the probability (before seeing any data) of selecting action a on the third time step (at which point we will have selected both a and b exactly once)? (Break ties uniformly, if relevant.) [5 marks]
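As a sanity check for 1.c: under the UCB rule from 1.a, after one pull of each arm both actions have the same count and hence the same exploration bonus, so the choice at t = 3 reduces to comparing the two observed rewards, with ties broken uniformly. A brute-force enumeration (purely illustrative):

```python
from itertools import product

p_a = {1: 1 / 3, 0: 2 / 3}  # reward distribution of arm a
p_b = {1: 2 / 3, 0: 1 / 3}  # reward distribution of arm b

prob_select_a = 0.0
for ra, rb in product([0, 1], repeat=2):
    joint = p_a[ra] * p_b[rb]
    if ra > rb:
        prob_select_a += joint        # a looks strictly better
    elif ra == rb:
        prob_select_a += 0.5 * joint  # tie, broken uniformly
print(prob_select_a)
```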
d. Consider a tabular ergodic MDP. Consider the following algorithm: on each time step t we use UCB in the state we are currently in to determine the action to take. Specifically, use and update the required statistics for UCB locally, for that state. In other words, in each state we update the UCB statistics as if that state is a bandit problem. But instead of averaging the immediate rewards to use as action values, we use one-step Q-learning with a step size αt to predict action values with some discount γ ∈ [0, 1].
Is this algorithm guaranteed to converge to the optimal value function q∗ (for any finite MDP with well-defined values, e.g., γ < 1) for an appropriately chosen step-size schedule {αt}, t = 0, 1, 2, ...? Prove your answer, but be concise. (E.g., this proof should not include lengthy calculations, you don't have to specify a concrete step-size schedule, etc.) [7 marks]
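To make the algorithm in 1.d concrete, a rough sketch (the environment interface env.reset()/env.step() and the exploration constant c are my own assumptions, not part of the question):

```python
import numpy as np
from collections import defaultdict

def per_state_ucb_q_learning(env, num_actions, num_steps, gamma=0.99, c=np.sqrt(2)):
    q = defaultdict(lambda: np.zeros(num_actions))       # Q-learning action values
    counts = defaultdict(lambda: np.zeros(num_actions))  # per-state UCB statistics
    s = env.reset()
    for _ in range(num_steps):
        n = counts[s]
        if (n == 0).any():
            a = int(np.argmin(n))        # infinite bonus: try untried actions first
        else:
            bonus = c * np.sqrt(np.log(n.sum()) / n)
            a = int(np.argmax(q[s] + bonus))
        r, s_next = env.step(a)
        counts[s][a] += 1
        alpha = 1.0 / counts[s][a]       # one possible step-size schedule
        q[s][a] += alpha * (r + gamma * q[s_next].max() - q[s][a])
        s = s_next
    return q
```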
[Total for Question 1: 22 marks]

2. a. Write down
1. the Bellman equation for state values vπ(s) for a finite MDP, using explicit summations over states and actions (no expectation notation),
2. the Bellman optimality equation for action values q∗(s, a) for an MDP with continuous states and actions, using explicit integration (no expectation notation), and
3. the synchronous (across all states and actions) value iteration update rule for state values vk, for a finite MDP (finite state and action space), using explicit summations (no expectation notation). [5 marks]
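As an illustration of item 3, a minimal sketch of synchronous value iteration over a tabular model (the transition tensor P[s, a, s'] and expected-reward matrix R[s, a] are assumed inputs, not something given in the paper):

```python
import numpy as np

def value_iteration(P, R, gamma, num_iterations):
    num_states = P.shape[0]
    v = np.zeros(num_states)
    for _ in range(num_iterations):
        # v_{k+1}(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) v_k(s') ]
        q = R + gamma * (P @ v)   # shape (num_states, num_actions)
        v = q.max(axis=1)
    return v
```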
b. Consider the MDP depicted in Figure 1.
The agent starts at the cell marked by S. Whenever it enters a cell with a positive number, the agent receives that reward and the episode terminates. Consider X = 3.
1. Draw the graph of the optimal value v∗(S) of the starting state S, as a function of γ (on the x-axis, from 0 to 1) with v∗(S) on the y-axis.
2. When X = 3, what is the optimal policy π∗ in state S, as a function of the discount? [5 marks]
c. Suppose a behavioural scientist was doing an experiment where they gave rewards to an animal. Suppose the setting was a grid just like the MDP in Figure 1, with X = 5 (e.g., 5 food pellets at that location). It turns out that, after repeatedly exploring the grid, the animal seems to prefer going up to the reward of +2.
1. In this setting (with X = 5), prove that no scalar discount γ ∈ [0, 1] exists for which the optimal policy is to go to +2.
2. Consider a Monte Carlo return Gt = Rt+1 + f(Rt+2 + f(Rt+3 + f(...))).
Standard discounting can be seen as applying a linear transformation f(x) = γx, multiplying the remaining return after each step by a factor γ. Consider the following alternative: instead of multiplying by a factor γ, we raise the value to the power γ, i.e., f(x) = x^γ. Does this mathematical model better explain the observed behaviour, in the sense that a γ exists for which the optimal policy goes to +2? If so, give such a value for γ. If not, prove why not. [7 marks]
d. Consider the true action values qπ(s, a), where these values are defined, as usual, as the expectation of the (standard) discounted cumulative rewards under a policy π. Consider the optimal values to be defined, as usual, by q∗(s, a) = maxπ qπ(s, a).
i. Is the greedy policy defined as π′(a|s) = argmax_a qπ(s, a), (1) always a policy improvement, in the sense that qπ′(s, a) ≥ qπ(s, a) for all s, a? Prove your answer. [4 marks]
ii. When does the equality qπ′(s, a) = qπ(s, a) hold? [2 marks]
e. Consider the following dynamic-programming operator T: ∀s, a: (Tq)(s, a) = r(s, a) + γ max_b E_{St+1∼p(·|s,a)}[q(St+1, b)], where r(s, a) = E[Rt+1 | St = s, At = a] is the expected reward (or deterministic reward) after taking action a in state s. When we apply this operator synchronously to all state-action pairs for some tabular initial action value function q, do the action values converge?
• If so, to which values does this procedure converge? Does the operator have a unique fixed point? Does it converge to the same fixed point as the standard Bellman optimality operator?
• If not, prove that it does not converge. [7 marks]
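To pin down the operator in 2.e, a tabular sketch (P[s, a, s'] and R[s, a] are assumed inputs). Note that the max over the next action sits outside the expectation over the next state, whereas in the standard Bellman optimality operator it sits inside:

```python
import numpy as np

def apply_T(q, P, R, gamma):
    # This operator:               (Tq)(s, a) = R(s, a) + gamma * max_b sum_{s'} P(s'|s, a) q(s', b)
    # Standard Bellman optimality:              R(s, a) + gamma * sum_{s'} P(s'|s, a) max_b q(s', b)
    expected_q = P @ q                     # E_{s'}[q(s', b)] for every (s, a, b)
    return R + gamma * expected_q.max(axis=-1)
```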
[Total for Question 2: 30 marks]
3. a.
i. Write down a one-step temporal-difference update for the action-value estimates qt(s, a) to approximate the value qπ(s, a) of a policy π, from data generated by an arbitrary policy π′. You can assume the generating policy at least occasionally picks every action in every state. [2 marks]
ii. Write down the forward-view target G^λ_t used in the TD(λ) update for the value of state s:
• in recursive notation (using only quantities defined at time t or t+1, relating G^λ_t to G^λ_{t+1}), and
• as an explicit summation G^λ_t = Σ_{n=0}^{∞} ..., where the right-hand side contains rewards, discounts, λ's and state values, and no reference to partial returns (i.e., no G^λ_{t+1}). [3 marks]
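For reference, a small sketch that computes the forward-view λ-return of a terminated episode via the recursive form G^λ_t = R_{t+1} + γ[(1 − λ) v(S_{t+1}) + λ G^λ_{t+1}] (the array layout, with rewards[t] = R_{t+1} and values[t] = v(S_{t+1}), is my own choice):

```python
import numpy as np

def lambda_returns(rewards, values, gamma, lam):
    T = len(rewards)
    g = np.zeros(T)
    g_next = 0.0                                   # return after the terminal state
    for t in reversed(range(T)):
        v_next = values[t] if t < T - 1 else 0.0   # terminal state has value zero
        g_next = rewards[t] + gamma * ((1 - lam) * v_next + lam * g_next)
        g[t] = g_next
    return g
```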
b. Consider a Markov reward process (MRP) with two states. The reward is zero everywhere. Each episode starts in s0. From state s0 we always transition to s1. From state s1, with probability p the episode terminates, and with probability 1 − p we transition back to s1. The discount is γ = 1 on all non-terminal steps.
In each state, there is a single (scalar) feature φ(s), with values φ(s0) = 1 and φ(s1) = 5. We consider linear function approximation with a scalar parameter w, such that vw(s) = w × φ(s) for each s.
i. Consider starting an episode at time t = 0 in state s0 with current parameter w0 = 1. We update the parameter w online on each time step according to TD(0) with a step size of α = 0.1. The episode terminates at some (random) time T > 0.
• What is the expected value E[wT ] at termination, where the expectation is over the dynamics of the MRP?
• What is the asymptotic behaviour of the weights if we repeat this process indefinitely (with the same settings, so α = 0.1)? Do the weights converge almost surely to the optimal weights w∗ = 0? [6 marks]
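To make 3.b.i concrete, a simulation sketch of online TD(0) with the linear values vw(s) = w × φ(s) on this MRP (the value p = 0.5, the seed, and the number of runs are arbitrary choices of this sketch, not values given in the question):

```python
import numpy as np

def td0_episode(p, alpha=0.1, w=1.0, gamma=1.0, rng=None):
    phi = {"s0": 1.0, "s1": 5.0}
    s = "s0"
    while True:
        if s == "s0":
            terminal, s_next = False, "s1"   # s0 always transitions to s1
        else:
            terminal = rng.random() < p      # s1: terminate w.p. p, else stay in s1
            s_next = "s1"
        r = 0.0                              # all rewards are zero
        v_next = 0.0 if terminal else w * phi[s_next]
        delta = r + gamma * v_next - w * phi[s]
        w += alpha * delta * phi[s]          # TD(0) with linear features
        if terminal:
            return w
        s = s_next

rng = np.random.default_rng(0)
print(np.mean([td0_episode(p=0.5, rng=rng) for _ in range(100_000)]))
```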
ii. Suppose we do not generate experience according to the dynamics of the MRP, but instead sample transitions directly, irrespective of where the previously sampled transition ended up. For instance, we are now allowed to sample multiple transitions from state s0 in a row, even though each of these transitions ends up in state s1. Suppose wt = 1; what is then the value of E[wt+n] after n updates (again with α = 0.1), where each transition is generated by sampling a transition from s0 with probability β or a transition from s1 with probability 1 − β? [5 marks]
iii. For which values of p and β does w converge to the optimal parameter for this MRP when n → ∞, considering appropriately decreasing step sizes? [3 marks]
iv. Consider p = 0.5 (so, from state s1 there is a 50% chance of terminating). Suppose we collect n = 4 transitions in total in this environment, starting from s0. These samples are generated according to the transition dynamics of the problem (so, e.g., if we sampled a transition to s1, then the next transition must start in s1, and if we sampled a terminating transition, then the next transition must start in s0).
After collecting four samples, we repeatedly sample uniformly from these four samples and use these to learn, via TD(0). How likely is it that this process converges?
If we increase n, does that change the answer? For n = 4 we require an exact answer. To describe what happens if we would increase the sampling budget n that we are allowed to learn from, it is sufficient to qualitatively describe what will happen without concrete calculations. [4 marks]
c. Consider a finite MDP and a behaviour policy µ. The behaviour will cover the MDP (eventually), in the sense that the MDP is ergodic and µ(a|s) > 0, ∀s ∈ S, a ∈ A. Consider a learning update with temporal-difference error: δt = Rt+1 + γ Rt+2 + γ² max_a q(St+2, a) − q(St, At)   (2)
Does updating a tabular action value function with this target converge to the optimal value function q∗?
If so, prove this convergence under an infinite number of interactions with this MDP, under behaviour policy µ. If not, show what it converges to instead, or show why it diverges. [4 marks]
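For clarity, one way to write the tabular update driven by the error in Eq. (2), applied to transitions generated by the behaviour policy µ (here q maps each state to an array of action values; α is a step size; the names are my own):

```python
def two_step_max_update(q, s_t, a_t, r_tp1, r_tp2, s_tp2, alpha, gamma):
    # delta_t = R_{t+1} + gamma R_{t+2} + gamma^2 max_a q(S_{t+2}, a) - q(S_t, A_t)
    delta = r_tp1 + gamma * r_tp2 + gamma**2 * q[s_tp2].max() - q[s_t][a_t]
    q[s_t][a_t] += alpha * delta
    return delta
```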
[Total for Question 3: 27 marks]

4. a. We consider optimising a policy πθ with parameters θ, such that πθ(a | s) gives the probability of selecting action a in state s.
Write down the average-reward policy optimisation objective (the thing we want to optimize), using expectation notation, as a function of the parameters θ. [2 marks]
b. Consider a simple MDP with states s1 and s2, as depicted in Figure 2. The agent starts each episode in one of the two states, where the start state is sampled uniformly at random. In each state there are three actions: left, right, and down. The ‘down’ action always terminates the episode with the depicted reward (−1 or +1). All other rewards are zero. There is a fixed discount γ = 0.9.
i. Give the optimal action values q∗(s, a) for each state and each action. [2 marks]
ii. For the MDP in Figure 2, suppose you cannot tell in which state you are: both s1 and s2 map to the exact same input s. Consider the following two algorithms. The first algorithm is a Q-learning algorithm that uses ε-greedy exploration with a slowly decaying ε, and just uses the observed state s as input (so no memory). The second algorithm also just uses s as input, but uses a standard policy gradient algorithm, without a baseline, and regularises its policy towards the uniform policy. The weight of the regularisation is high at first (high enough that the policy keeps exploring), and then also slowly decays over time until, asymptotically, only the policy-gradient part of the update remains. We run both these algorithms until convergence.
Which algorithm would you prefer in this problem, and why? [5 marks]
iii. Consider the set of policies that 1) do not look at the state (and therefore are the same in all states) and 2) never pick ‘left’. These policies can be parametrized with a single parameter p, such that with probability p we pick ‘right’ and with probability 1 − p we pick ‘down’ in each state. Give the episodic value of such policies J(p) = E[vπp(S0) | S0 ∼ d0], where d0 is the uniform start-state distribution. Write this value as an explicit function of p: J(p) = ... [4 marks]
c. Consider a continuous univariate Gaussian policy π(A) = (1/(σ√(2π))) exp[−(A − µ)²/(2σ²)], parametrised by its mean µ and standard deviation σ. After sampling a (random) action A from policy π, you observe a return G, and the episode terminates immediately after receiving the reward.
i. What is the value of the parameter µ′ (as a function of A, G, µ, σ) after applying one update of the canonical policy-gradient algorithm, without a baseline? [4 marks]
ii. Suppose that G > 0. For which values of the chosen action A does the canonical policy gradient (without a baseline) result in an updated policy π′ that has higher entropy than the original policy π? Express the answer as a function of A, G, µ, σ. [4 marks]
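As a reference point for 4.c, the canonical (REINFORCE-style) update for a univariate Gaussian policy uses the score functions ∇µ log π(A) = (A − µ)/σ² and ∇σ log π(A) = ((A − µ)² − σ²)/σ³. A minimal sketch, with an explicit step size α that the question itself does not fix:

```python
def reinforce_update(mu, sigma, action, ret, alpha):
    grad_log_pi_mu = (action - mu) / sigma**2
    grad_log_pi_sigma = ((action - mu) ** 2 - sigma**2) / sigma**3
    mu_new = mu + alpha * ret * grad_log_pi_mu            # update for the mean
    sigma_new = sigma + alpha * ret * grad_log_pi_sigma   # update for the std. dev.
    return mu_new, sigma_new
```

Since the entropy of a Gaussian depends only on σ, part ii comes down to comparing the updated σ with the original one.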
[Total for Question 4: 21 marks]