(Parametrized) stochastic policy:
$$ \pi_\theta(s, a) = p(a \mid s; \theta) $$
Goal: maximize the expected reward of the policy by learning the parameters $\theta$.
The expectation of a function $f(X)$ is defined as: $$ \mathbb E_{X \sim p(X\mid \theta)} [f(X)] = \int_\mathcal X f(X) p(X\mid \theta) dX $$
To compute the gradient of this expectation with respect to $\theta$, we can rearrange the formula (the log-derivative trick): $$ \begin{aligned} \nabla_\theta \mathbb E_{X\sim p(X\mid \theta)} [f(X)] &= \nabla_\theta \int_\mathcal X f(X)\, p(X\mid \theta)\, dX \\ &= \int_\mathcal X f(X)\, \nabla_\theta p(X\mid \theta)\, dX \\ &= \int_\mathcal X f(X)\, p(X\mid \theta)\, \frac{\nabla_\theta p(X\mid \theta)}{p(X\mid \theta)}\, dX \\ &= \int_\mathcal X f(X)\, p(X\mid \theta)\, \nabla_\theta \log p(X\mid \theta)\, dX \\ &= \mathbb E_{X\sim p(X\mid \theta)} \left[f(X)\, \nabla_\theta \log p(X\mid \theta) \right] \end{aligned} $$
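As a quick sanity check of this identity, here is a minimal sketch (not part of the original notebook, toy example made up for illustration): we estimate $\nabla_\mu \mathbb E_{X \sim \mathcal N(\mu, 1)}[X^2]$ with the score-function estimator and compare it with the analytic value $2\mu$.
import numpy as np

def score_function_gradient_estimate(mu, n_samples=200000, seed=0):
    # Monte-Carlo estimate of d/dmu E[X^2] via E[f(X) * d/dmu log p(X|mu)]
    rng = np.random.RandomState(seed)
    x = rng.normal(loc=mu, scale=1.0, size=n_samples)
    score = x - mu          # d/dmu log N(x | mu, 1)
    return np.mean(x**2 * score)

# analytic: E[X^2] = mu^2 + 1, so the gradient is 2 * mu = 3.0
print(score_function_gradient_estimate(1.5))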
The score function is: $\nabla_\theta \log \pi_\theta(s,a)$
The score function describes how sensitive the stochastic policy $\pi_\theta$ is to changes in the parameters $\theta$, evaluated at the current $\theta$ for a fixed state $s$ and action $a$.
Linear model for the unnormalized log-probability: $\phi(s,a)^T \theta$,
i.e. weighting of the actions using a linear combination of features $\phi(s,a)$.
Exponentiation and normalization gives the probability for an action $a$ in state $s$:
$$ \pi_\theta (s, a) = \frac{ e^{\phi(s,a)^T \theta }}{\sum_{a' \in \mathcal A} e^{\phi(s,a')^T \theta} } $$
The score function of the softmax policy is:
$$ \nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb E_{\pi_\theta}[\phi(s,\cdot )] $$
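A small numerical check of this score function, as a sketch (not part of the original notebook; the feature matrix and parameters are random, made up for illustration): the analytic expression matches a finite-difference approximation of $\nabla_\theta \log \pi_\theta(s,a)$.
import numpy as np

def softmax_policy(phi, theta):
    # phi: one feature row per action; returns pi_theta(s, .) as a probability vector
    logits = np.dot(phi, theta)
    p = np.exp(logits - logits.max())
    return p / p.sum()

rng = np.random.RandomState(0)
phi = rng.randn(3, 4)          # 3 actions, 4 features (hypothetical)
theta_check = rng.randn(4)
a = 1                          # action whose log-probability we differentiate

analytic = phi[a] - np.dot(softmax_policy(phi, theta_check), phi)   # phi(s,a) - E_pi[phi(s,.)]

eps = 1e-6
numeric = np.array([(np.log(softmax_policy(phi, theta_check + eps*np.eye(4)[k])[a]) -
                     np.log(softmax_policy(phi, theta_check - eps*np.eye(4)[k])[a])) / (2*eps)
                    for k in range(4)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True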
For any differentiable policy $\pi_\theta(s,a)$ and for any of the policy objective functions $J = J_1$, $J_{avgR}$, or $\frac{1}{1-\gamma}J_{avgV}$, the policy gradient is (for the proof see the appendix of [Su00]): $$ \nabla_\theta J(\theta) = \mathbb E_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s,a)\; Q^{\pi_\theta}(s,a) \right] $$
Practical value of the theorem: the computation of the policy gradient reduces to a simple expectation.
Estimate the state-action value $Q^{\pi_\theta}(s_t, a_t)$ by the discounted sum of rewards (return) from time step $t$: $G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots + \gamma^{T-t} R_{T}$
The estimated gradient for one episode is $$ \nabla_\theta J(\theta) \approx \sum_{t=1}^T G_t \nabla_\theta \log \pi_\theta(s_t, a_t) $$
import numpy as np
import numpy.testing as npt
import matplotlib.pyplot as plt
%matplotlib inline
import gym
env = gym.make('CartPole-v0')
[2017-01-05 11:24:37,138] Making new env: CartPole-v0
Here we have just two actions, "left" ($a=0$) and "right" ($a=1$), so we can use the sigmoid (instead of the softmax) for the policy:
$$ \pi_\theta (s, a=1) = \text{sigmoid}(\vec \phi(\vec s)^T \vec \theta + b) $$
Here we do not use (special) features. We just use the states as features: $$ \vec \phi(\vec s) = \vec s $$
def sigmoid(z):
    return 1/(1 + np.exp(-z))

def augmented_states(states):
    # prepend a column for the bias term ("bias trick");
    # note: np.zeros is used here, so the bias weight theta[0] has no effect
    aug = np.zeros
    if len(states.shape)==2:
        return np.concatenate((aug(states.shape[0]).reshape(-1,1), states), axis=1)
    else:
        return np.concatenate((aug(1), states))
def get_p(theta, states):
    # probability pi_theta(s, a=1)
    if len(theta)==5:
        states_ = augmented_states(states)  # for bias b (bias trick)
    else:
        states_ = states
    return sigmoid(np.dot(states_, theta))

def get_action(theta, observation):
    # sample action a=1 with probability pi_theta(s, a=1)
    p = get_p(theta, observation)
    return np.random.binomial(n=1, p=p, size=1)[0]
def roll_out(theta, max_episode_length=10000):
    actions = []
    states = []
    observation = env.reset()
    rewards = []
    done = False
    i = 0
    while not done:  # the break below enforces max_episode_length
        action = get_action(theta, observation)
        states.append(observation)
        actions.append(action)
        observation, reward, done, info = env.step(action)
        rewards.append(reward)
        i += 1
        if i > max_episode_length:
            break
    return np.array(rewards), np.array(states), np.array(actions)
#roll_out(theta)
n=5
theta = np.random.randn(n)
rewards, states, actions = roll_out(theta)
#rewards, states, actions
def get_discounted_sum_of_rewards(rewards, gamma=.99):
    # discounted return G_t for each time step, accumulated backwards over the episode
    adv = np.zeros_like(rewards)
    c = 0.
    for i, r in enumerate(rewards[::-1]):
        c = r + gamma * c
        adv[i] = c
    return adv[::-1]
get_discounted_sum_of_rewards([0., 0., 0., 1.])
array([ 0.970299, 0.9801 , 0.99 , 1. ])
For the score function $\nabla_\theta \log \pi_\theta(\vec s, a)$ we need the derivative of the sigmoid:
$$ \frac{d}{dx}\text{sigmoid}(x)=\text{sigmoid}(x)(1-\text{sigmoid}(x)) $$
Here:
$$ \pi_\theta(\vec s, a=1) = \text{sigmoid}(\vec s^T \vec \theta) $$
$$ \pi_\theta(\vec s, a=0) = 1- \text{sigmoid}(\vec s^T \vec \theta) $$
$$ \nabla_\theta \log \pi_\theta(\vec s, a=1) = \frac{1}{\text{sigmoid}(\vec s^T\vec \theta)}\text{sigmoid}(\vec s^T\vec \theta)(1-\text{sigmoid}(\vec s^T \vec \theta))\, \vec s = (1-\pi_\theta(\vec s, a=1))\, \vec s $$
$$ \nabla_\theta \log \pi_\theta(\vec s, a=0) = - \frac{1}{1- \text{sigmoid}(\vec s^T\vec \theta)} \text{sigmoid}(\vec s^T\vec \theta)(1-\text{sigmoid}(\vec s^T\vec \theta))\, \vec s = - \pi_\theta(\vec s, a=1)\, \vec s $$
def get_delta_theta(actions, theta, states, advantages, alpha):
    p = get_p(theta, states)
    states_ = augmented_states(states)
    # per-step score: (1 - p) * s for a=1 and -p * s for a=0, weighted by the advantage
    grad = advantages * (-(1. - actions) * p + (actions * (1. - p) )) * states_.T
    # optional gradient clipping (disabled):
    # clip = 25.
    # if np.any(np.abs(grad) > clip):
    #     grad = clip * grad / np.max(grad)
    return alpha * np.sum(grad, axis=1)
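A quick finite-difference check of the two analytic gradients used above, as a sketch (not part of the original notebook; it reuses the sigmoid helper and numpy import from above, and random values made up for illustration):
rng = np.random.RandomState(1)
s = rng.randn(4)
theta_check = rng.randn(4)

def log_pi(theta, s, a):
    # log pi_theta(s, a) for the plain (un-augmented) 4-dimensional theta
    p1 = sigmoid(np.dot(s, theta))
    return np.log(p1) if a == 1 else np.log(1. - p1)

eps = 1e-6
for a in (0, 1):
    p1 = sigmoid(np.dot(s, theta_check))
    analytic = (1. - p1) * s if a == 1 else -p1 * s
    numeric = np.array([(log_pi(theta_check + eps*np.eye(4)[k], s, a) -
                         log_pi(theta_check - eps*np.eye(4)[k], s, a)) / (2*eps)
                        for k in range(4)])
    print(a, np.allclose(analytic, numeric, atol=1e-5))   # True for both actions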
theta = np.random.randn(n)
nb_rollouts = 10001
b_rewards = 0.
ii=0
alpha_ = alpha = 0.001
for i in range(nb_rollouts):
    rewards, states, actions = roll_out(theta)
    adv = get_discounted_sum_of_rewards(rewards, gamma=0.99)
    if i%10==0:
        # note: this sets alpha to 0.9 * alpha_ once every 10 rollouts; it does not decay further
        alpha = alpha_ * 0.9
    delta_theta = get_delta_theta(actions, theta, states, adv, alpha)
    theta += delta_theta
    # just for progress printing
    b_rewards = 1/(ii+1) * (rewards.sum() + ii * b_rewards)
    ii+=1
    if i%250==0:
        print (i, theta, b_rewards)
        b_rewards = 0.
        ii=0
0 [-0.17344432 0.64746428 0.70026023 1.22031105 0.34207346] 18.0
250 [-0.17344432 -0.13518532 2.09379432 3.41064048 4.84379737] 124.816
500 [-0.17344432 -3.01573871 2.98143799 5.86882375 6.95513493] 298.428
750 [ -0.17344432 1.70650257 2.99633254 10.31686178 12.2067736 ] 728.516
1000 [ -0.17344432 -0.77927747 1.34002176 11.08589519 12.72645203] 296.244
1250 [ -0.17344432 -1.39062293 4.61079084 13.38255253 15.50965289] 2276.736
1500 [ -0.17344432 2.29439042 1.02429169 13.92594869 17.82305295] 620.688
1750 [ -0.17344432 -0.99215483 0.8042833 14.27091883 18.84466158] 313.4
2000 [ -0.17344432 1.86845107 -0.49193008 14.37816835 18.93449913] 358.656
2250 [ -0.17344432 -2.64385232 2.22809242 14.89598477 18.34140148] 315.512
2500 [ -0.17344432 -3.63304082 2.40439079 14.8589223 18.36400163] 232.54
2750 [ -0.17344432 1.19635986 7.71113422 15.54115281 19.31382173] 1786.332
3000 [ -0.17344432 0.40614209 6.66550218 18.37557547 20.83997478] 2913.464
3250 [ -0.17344432 -6.79415416 4.4958923 18.32622474 20.90250196] 349.152
3500 [ -0.17344432 -5.64888397 6.21307541 18.38975317 20.8351785 ] 255.148
3750 [ -0.17344432 -3.83104739 7.28885398 18.75303348 22.29447309] 450.168
4000 [ -0.17344432 1.56367768 11.59714601 18.82763179 20.53774896] 1609.012
4250 [ -0.17344432 -3.82586597 15.20375561 20.24393617 21.76375826] 7400.452
4500 [ -0.17344432 1.50381509 8.64879678 20.89032646 27.73195129] 4161.032
4750 [ -0.17344432 4.14080306 11.87169708 20.81760842 28.41168812] 1280.944
5000 [ -0.17344432 3.13668488 14.04795733 21.11576592 33.14025838] 6680.832
5250 [ -0.17344432 1.80384902 16.65673573 22.24396779 32.96400492] 8052.008
5500 [ -0.17344432 4.28514843 21.36812508 22.5697242 32.85923548] 9307.632
5750 [ -0.17344432 1.85217412 17.554174 24.13065606 39.51369869] 7623.588
6000 [ -0.17344432 1.99825464 23.45130849 24.60408862 37.42754604] 9850.584
6250 [ -0.17344432 0.59196073 35.33622866 26.41746464 28.19440256] 5888.216
6500 [ -0.17344432 0.66762881 25.02187502 26.30635793 42.16254313] 8902.704
6750 [ -0.17344432 -6.68060115 24.34457221 28.14226668 40.63795847] 4903.96
7000 [ -0.17344432 1.10764232 28.63616497 28.7667267 37.68566725] 4599.712
7250 [ -0.17344432 2.50274902 28.00571915 28.343935 39.24390679] 10001.0
7500 [ -0.17344432 4.3386294 31.28127639 28.61524992 39.02199538] 9650.452
7750 [ -0.17344432 4.36962477 27.82249491 28.84423521 40.98967973] 10001.0
8000 [ -0.17344432 -0.20095448 32.69302076 32.58432046 36.21854967] 7287.032
8250 [ -0.17344432 3.67461367 34.04059165 30.59776953 37.98667424] 10001.0
8500 [ -0.17344432 2.26556557 29.52227394 32.85573795 41.6915911 ] 9988.384
8750 [ -0.17344432 5.50582341 26.75623069 33.40331956 43.65194207] 10001.0
9000 [ -0.17344432 6.84509867 32.38648962 33.4425614 43.27093226] 8263.116
9250 [ -0.17344432 1.80252043 18.32346 35.26901269 52.02101994] 10001.0
9500 [ -0.17344432 2.15276574 17.90183915 35.88422216 51.82852184] 9741.276
9750 [ -0.17344432 2.81284184 10.58769449 36.5159464 53.94835712] 8047.124
10000 [ -0.17344432 2.69117313 2.45619903 36.86371001 54.41119968] 2177.88
Even after improvement, the average accumulated reward sometimes drops suddenly, probably due to the high variance of the gradient estimate.
def render(theta, max_episode_length=5000):
    observation = env.reset()
    for t in range(max_episode_length):
        env.render()
        action = get_action(theta, observation)
        #print ("random action:", action)
        observation, reward, done, info = env.step(action)
        # observation =
        # [position of cart, velocity of cart, angle of pole, angular velocity of pole]
        #print (observation, reward, done, info)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
    env.render(close=True)
render(theta)
Episode finished after 362 timesteps
REINFORCE has very high variance.
One recipe for reducing the variance is to subtract a baseline (also called a control variate) from the return. The baseline must not depend on the action (e.g. it can be a function of the state only); then it does not affect the expectation of the gradient.
As baseline, the state-value function $v(s_t)$ is used (see [Su00]).
The estimated gradient for one episode is $$ \nabla_\theta J(\theta) \approx \sum_{t=1}^T (G_t - v(s_t)) \nabla_\theta \log \pi_\theta(s_t, a_t) $$
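A toy illustration of the effect, as a sketch (not part of the original notebook; the two-action policy and rewards are made up): for a single state with a fixed policy, subtracting the expected reward as baseline leaves the mean of the score-weighted estimate unchanged but clearly reduces its variance.
rng = np.random.RandomState(2)
p1 = 0.3                                   # fixed pi(a=1) for this toy example
reward_of_action = {0: 1.0, 1: 5.0}

a_samples = rng.binomial(1, p1, size=100000)
rs = np.where(a_samples == 1, reward_of_action[1], reward_of_action[0])
scores = np.where(a_samples == 1, 1. - p1, -p1)   # d/dtheta log pi for a scalar feature s = 1

baseline = p1 * reward_of_action[1] + (1 - p1) * reward_of_action[0]   # expected reward
g_plain = rs * scores
g_base = (rs - baseline) * scores
print(g_plain.mean(), g_base.mean())       # approximately equal: the baseline is unbiased
print(g_plain.var(), g_base.var())         # the variance with baseline is much smaller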
This can be plugged into an optimization algorithm (e.g. stochastic gradient ascent) for optimizing the policy.
As baseline we use the value function of the state $s_t$, so we must estimate the baseline, too. We use a linear approximation $v_w(\vec s) = w_0 + w_1 s_1 + \dots + w_n s_n$,
or, with $\vec s' = (1, s_1, s_2, \dots , s_n)$ and $\vec w^T = (w_0, w_{1}, \dots, w_n)$:
$$ v_w(\vec s) = \vec s' \cdot \vec w $$
Note: The cartpole example also works without the bias term $w_0$, which would simplify the code.
def augmented_states_(states, aug = np.ones):
    # features for the value function: absolute values of the state,
    # with a leading column of ones for the bias weight w_0
    states = np.abs(states)
    if len(states.shape)==2:
        return np.concatenate((aug(states.shape[0]).reshape(-1,1), states), axis=1)
    else:
        return np.concatenate((aug(1), states))

def get_values(s, w):
    s_ = augmented_states_(s)
    return np.dot(s_, w)
The baseline weights $\vec w$ are fitted by Monte-Carlo regression of the return (see [Su00]), i.e. with learning rate $\beta$ (lr in the code) the update is
$$ \Delta \vec w = \beta \sum_{t=1}^T \left(G_t - v_w(\vec s_t)\right) \nabla_w v_w(\vec s_t) $$
with the linear approximation $v_w(\vec s)$:
$$ \nabla_w v_w(\vec s) = \nabla_w ( \vec s' \cdot \vec w) = \vec s' $$
def get_delta_w(w, rewards, states, gamma, lr):
    states_ = augmented_states_(states)
    approximated_values = get_values(states, w)
    unbiased_estimate_of_values = get_discounted_sum_of_rewards(rewards, gamma)
    grad = ((unbiased_estimate_of_values - approximated_values) * states_.T).sum(axis=1)
    return lr * grad
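A small sanity check, as a sketch (not part of the original notebook; the synthetic states are made up): a single get_delta_w step should reduce the squared regression error of the linear value function.
rng = np.random.RandomState(3)
rewards_test = np.ones(50)                 # constant reward of 1 per step, as in CartPole
states_test = rng.randn(50, 4)
w_test = np.zeros(5)

targets = get_discounted_sum_of_rewards(rewards_test, gamma=0.99)
err_before = np.mean((targets - get_values(states_test, w_test))**2)
w_test = w_test + get_delta_w(w_test, rewards_test, states_test, gamma=0.99, lr=1e-4)
err_after = np.mean((targets - get_values(states_test, w_test))**2)
print(err_before > err_after)              # True for a sufficiently small learning rate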
nb_rollouts = 3001
n=5
theta = np.random.randn(n)
w = np.zeros(n)  # weights of the linear value-function baseline
b_rewards = 0.
ii=0
gamma=0.99
lr = lr_ = 1.e-5
alpha = alpha_ = 0.001
for i in range(nb_rollouts):
    rewards, states, actions = roll_out(theta)
    # baseline (critic) update by Monte-Carlo regression of the return
    delta_w = get_delta_w(w, rewards, states, gamma, lr)
    w += delta_w
    # advantage: return minus the estimated state value
    adv = get_discounted_sum_of_rewards(rewards, gamma) - get_values(states, w)
    if i%10==0:
        # note: this sets alpha to 0.95 * alpha_ once every 10 rollouts; it does not decay further
        alpha = alpha_ * 0.95
    delta_theta = get_delta_theta(actions, theta, states, adv, alpha)
    theta += delta_theta
    b_rewards = 1/(ii+1) * (rewards.sum() + ii * b_rewards)
    ii+=1
    if i%250==0:
        print (i, theta, b_rewards)
        b_rewards = 0.
        ii=0
        #print ("values", get_values(states, w)[:10])
0 [ 0.01842218 -0.1359281 0.49786343 -0.12111486 1.06499871] 26.0
250 [ 0.01842218 -1.44532046 0.92049717 2.43303793 4.95868241] 137.128
500 [ 0.01842218 -2.85044215 0.99612818 4.41423014 7.44668203] 276.976
750 [ 0.01842218 1.63759757 0.43418764 9.67250223 10.62859679] 1736.328
1000 [ 0.01842218 -0.30857537 4.64979793 10.90679252 10.00355492] 6843.608
1250 [ 0.01842218 0.51386849 6.06029376 12.53461807 9.64501004] 6610.196
1500 [ 0.01842218 0.4720908 5.78730549 12.93569815 9.31714145] 9904.092
1750 [ 0.01842218 0.40232301 4.22548648 12.96596296 10.49275065] 9974.996
2000 [ 0.01842218 0.67517062 5.37792374 12.63457478 9.84750427] 10000.572
2250 [ 0.01842218 0.70275676 3.93617859 12.85015721 10.92324032] 9978.772
2500 [ 0.01842218 0.53282516 2.41369364 13.29403836 11.73746162] 9913.228
2750 [ 0.01842218 1.15124911 2.83149324 13.99078576 11.6775455 ] 9272.64
3000 [ 0.01842218 0.86806383 3.97548125 14.40496683 11.36183965] 9866.676
Note: The optimization is now faster and much more stable due to reduced variance.
render(theta)
Solution:
From $$ \nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, A_t $$
follows the optimization objective (maximization):
$$ J(\theta) \approx \sum_{t=1}^{T} \log \pi(a_t \mid s_t, \theta)\, A_t $$
with the advantage $A_t = G_t - v(s_t)$, which is treated as a constant with respect to $\theta$.
import tensorflow as tf
def sigmoid_tf(z):
    return 1./(1. + tf.exp(-z))
sess = tf.InteractiveSession()
states_tf = tf.placeholder("float64", shape=[None, 4]) # for one episode
roll_out_values_tf = tf.placeholder("float64", shape=[None])
actions_tf = tf.placeholder("float64", shape=[None])
n = 4  # dimension of the CartPole observation (the bias is the separate variable w_bias)
w = np.zeros(n)
w_tf = tf.Variable(w.reshape(n, -1))
w_bias = tf.Variable(np.array([0.]))
values_tf = tf.matmul(states_tf, w_tf) + w_bias
loss_w = tf.reduce_mean(tf.square(roll_out_values_tf - tf.reshape(values_tf, [-1])))  # reshape to avoid broadcasting to a matrix
global_step_w = tf.Variable(0, trainable=False)
starter_learning_rate_w = 0.00001
# note: learning_rate_w is defined here, but the optimizer below uses the constant alpha
learning_rate_w = tf.train.exponential_decay(starter_learning_rate_w, global_step_w,
                                             10, 0.9, staircase=True)
alpha = 1.e-2
optimizer = tf.train.GradientDescentOptimizer(learning_rate=alpha)
#optimizer = tf.train.AdamOptimizer(alpha)
#optimizer_w = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.5, use_nesterov=False)
train_step_w = optimizer.minimize(loss_w, var_list=[w_tf, w_bias])#, global_step=global_step)
n=4
theta = np.zeros(n)
theta_tf = tf.Variable(theta.reshape(n, -1))
advantage_tf = roll_out_values_tf - tf.reshape(values_tf, [-1])
p_1 = sigmoid_tf(tf.matmul(states_tf, theta_tf))
p_0 = 1. - p_1
low = 1.e-40
log_pi_theta = (actions_tf) * tf.reshape(tf.log(p_1+low), [-1]) + ((1. - actions_tf) * tf.reshape(tf.log(p_0+low), [-1]))
#log_pi_theta = (actions_tf) * tf.reshape(tf.log(p_1), [-1]) + ((1. - actions_tf) * tf.reshape(tf.log(p_0), [-1]))
gain = tf.reduce_sum(advantage_tf * log_pi_theta)
loss = - gain # maximization!
global_step_theta = tf.Variable(0, trainable=False)
starter_learning_rate_theta = 0.001
learning_rate_theta = tf.train.exponential_decay(starter_learning_rate_theta, global_step_theta,
10, 0.9, staircase=True)
#alpha = 0.1e-6
#optimizer_theta = tf.train.GradientDescentOptimizer(learning_rate=alpha)
#optimizer = tf.train.AdamOptimizer(alpha)
optimizer_theta = tf.train.MomentumOptimizer(learning_rate=learning_rate_theta, momentum=0.9, use_nesterov=False)
train_step_theta = optimizer_theta.minimize(loss, var_list=[theta_tf])#, global_step=global_step)
#init_op = tf.initialize_all_variables()
init_op = tf.global_variables_initializer()
sess.run(init_op)
nb_rollouts = 2001
b_rewards = 0.
ii=0
for i in range(nb_rollouts):
    theta = theta_tf.eval().reshape(n)
    rewards, states, actions = roll_out(theta)
    roll_out_values = get_discounted_sum_of_rewards(rewards)
    # critic update
    train_step_w.run(feed_dict={states_tf: states, roll_out_values_tf: roll_out_values})
    # actor update
    train_step_theta.run(feed_dict={states_tf: states, roll_out_values_tf: roll_out_values, actions_tf: actions.astype('float64')})
    b_rewards = 1/(ii+1) * (rewards.sum() + ii * b_rewards)
    ii+=1
    if i%100==0:
        print (i, theta, w_bias.eval(), b_rewards)
        b_rewards = 0.
        ii=0
0 [ 0. 0. 0. 0.] [ 0.11608857] 11.0
100 [ -1.56049969 2.04243144 3.57670503 10.86164029] [ 40.24051115] 121.71
200 [ 6.40160546 4.53159559 8.75059618 18.13448152] [ 50.92948672] 201.22
300 [ -6.53755688 12.2942023 13.30752632 20.52017335] [ 66.09958111] 568.5
400 [ 0.12475472 18.92140922 23.02470601 18.58555769] [ 81.80337178] 2696.77
500 [ 0.47523261 13.68496541 23.92864933 27.20355766] [ 96.52455321] 9720.4
600 [ 4.42803294 16.05978826 24.02920465 28.12144214] [ 98.68147345] 10001.0
700 [ 2.80347074 19.52179063 24.84072479 26.07660755] [ 98.9644464] 9962.47
800 [ 2.84854492 13.92065274 24.608145 29.4804424 ] [ 99.00206231] 10001.0
900 [ 4.55903587 16.56951481 24.12132129 28.15532598] [ 99.00965304] 10001.0
1000 [ 2.77052225 7.7353129 26.73923236 33.19656078] [ 98.84525669] 9674.24
1100 [ 2.35017984 7.97580121 29.17981888 34.42012505] [ 98.83918708] 9554.62
1200 [ 3.53064081 11.30330201 28.89434916 34.10424319] [ 98.97820397] 9890.4
1300 [ 3.71952565 14.26994971 28.43304329 33.18542673] [ 99.00557453] 10001.0
1400 [ 0.77805241 13.3472082 28.90803097 33.85287792] [ 99.00646988] 10001.0
1500 [ 2.72427873 16.97220724 29.46873887 31.62260521] [ 99.0076811] 9984.29
1600 [ 1.32072646 21.49520033 29.79434063 28.02529668] [ 99.00990881] 10001.0