Here we consider only TD(0).
Read [SuBa] 6.1 TD Prediction
General Rule: We move the estimate towards the target:
$$ \text{newEstimate} \leftarrow \text{oldEstimate} + \alpha \left(\text{Target} - \text{oldEstimate}\right) $$
We write $s_t$ as an abbreviation for $S_t = s$ and $s_{t+1}$ for $S_{t+1} = s'$.
Monte Carlo prediction updates the value of $s_t$ towards the observed return:
$$ V(s_t) \leftarrow V(s_t) + \alpha \left(\mathcal R_{t} - V(s_t)\right) $$
with $\mathcal R_{t}$: the complete return following time $t$, which is only known at the end of an episode.
TD(0) instead updates the value $V(s_t)$ towards the estimated return $R_{t+1} + \gamma V(s_{t+1})$:
$$ V(s_t) \leftarrow V(s_t) + \alpha \left(R_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right) $$
with
$V(s_t)$: value estimate of the state at time $t$
$R_{t+1} + \gamma V(s_{t+1})$ is called the TD target
Bootstrapping: update is based partially on an existing estimate.
TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.
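A tiny worked example of one TD(0) update, with made-up numbers (not from the notebook): take $\alpha = 0.1$, $\gamma = 1$, current estimates $V(s_t) = 0.5$ and $V(s_{t+1}) = 0.8$, and an observed reward $R_{t+1} = 1$. Then
$$ V(s_t) \leftarrow 0.5 + 0.1 \left(\underbrace{1 + 1 \cdot 0.8}_{\text{TD target}} - 0.5\right) = 0.5 + 0.1 \cdot 1.3 = 0.63 . $$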
import gym
env = gym.make('FrozenLake-v0')
[2016-11-25 10:59:55,816] Making new env: FrozenLake-v0
env.action_space
Discrete(4)
env.observation_space
Discrete(16)
env.step(0)
(0, 0.0, False, {'prob': 0.3333333333333333})
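With the environment in place, the TD(0) prediction update above can be written as a short tabular loop. The following is a minimal sketch (not part of the original notebook) that evaluates a fixed, uniformly random policy on FrozenLake; the function name and hyperparameters are chosen only for illustration.
import numpy as np

def td0_prediction(env, nb_episodes=5000, alpha=0.05, gamma=1.0):
    # Tabular TD(0) prediction for a fixed (here: uniformly random) policy.
    V = np.zeros(env.observation_space.n)
    for _ in range(nb_episodes):
        s = env.reset()
        done = False
        while not done:
            a = env.action_space.sample()              # fixed random policy
            new_s, reward, done, info = env.step(a)
            td_target = reward + gamma * V[new_s]      # R_{t+1} + gamma * V(s_{t+1})
            V[s] = V[s] + alpha * (td_target - V[s])   # move V(s_t) towards the TD target
            s = new_s
    return V

V = td0_prediction(env)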
Read [SuBa] 6.4 Sarsa: On-policy TD Control
Update rule:
$$ q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \left(R_{t+1} + \gamma \, q(s_{t+1}, a_{t+1}) - q(s_t, a_t)\right) $$
with $a_{t+1}$ chosen by the same $\epsilon$-greedy policy that selected $a_t$ (on-policy).
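The implementation below relies on the helpers `action_epsilon_greedy`, `get_action_epsilon_greedy` and `average_performance` as well as the globals `nb_episodes` and `STEPS`, which are defined in cells not included in this excerpt. A minimal sketch of what they might look like (the concrete values and bodies are assumptions, only the names and call signatures are taken from the code below):
import numpy as np

nb_episodes = 50000   # assumed value, for illustration only
STEPS = 500           # evaluate the policy every STEPS episodes (assumed value)

def action_epsilon_greedy(q, s, epsilon=0.05):
    # With probability epsilon take a random action, otherwise the greedy one.
    if np.random.rand() > epsilon:
        return np.argmax(q[s])
    return np.random.randint(env.action_space.n)

def get_action_epsilon_greedy(epsilon):
    # Returns an epsilon-greedy policy function with epsilon fixed.
    return lambda q, s: action_epsilon_greedy(q, s, epsilon=epsilon)

def average_performance(policy_fct, q):
    # Average reward per episode of the given policy over a number of rollouts.
    acc_returns = 0.
    n = 500
    for _ in range(n):
        done = False
        s = env.reset()
        while not done:
            a = policy_fct(q, s)
            s, reward, done, info = env.step(a)
            acc_returns += reward
    return acc_returns / n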
# SARSA: On-policy TD control algorithm
def sarsa(alpha=0.02,
          gamma=1.,
          epsilon=0.05,
          q=None,
          progress=None,
          env=env):
    if q is None:
        q = np.ones((16, 4))
    for i in range(nb_episodes):
        done = False
        s = env.reset()
        a = action_epsilon_greedy(q, s, epsilon=epsilon)
        while not done:
            new_s, reward, done, info = env.step(a)
            new_a = action_epsilon_greedy(q, new_s, epsilon=epsilon)
            # SARSA update: the TD target uses the action actually selected next
            q[s, a] = q[s, a] + alpha * (reward + gamma * q[new_s, new_a] - q[s, a])
            s = new_s
            a = new_a
        # only for plotting the performance, not part of the algorithm
        if progress is not None and i % STEPS == 0:
            progress[i // STEPS] = average_performance(get_action_epsilon_greedy(epsilon), q=q)
    return q, progress
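The comparison plotted in the next cell sweeps over several values of $\epsilon$; the cells building `param_performance` and the helper `plot_performance` are not included in this excerpt. A rough, hypothetical sketch of such a sweep (the epsilon values and the dictionary layout are assumptions):
epsilons = [0.001, 0.01, 0.05, 0.1]   # assumed values for illustration
param_performance = {}
for eps in epsilons:
    _, perf = sarsa(epsilon=eps, progress=np.zeros(nb_episodes // STEPS))
    param_performance[eps] = perf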
plot_performance(param_performance)
As you can see, the larger $\epsilon$ is, the faster the learning. But the larger $\epsilon$ is, the lower the average reward per episode.
A hack could be to switch to the greedy policy at the end. But keep in mind that we have not estimated the q-values of the greedy policy, so this can be problematic in more elaborate environments.
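The run that produces `q` and `sarsa_performance`, plotted below, is not shown in this excerpt; it presumably looks something like the following (the arguments are assumptions):
q, sarsa_performance = sarsa(epsilon=0.05, progress=np.zeros(nb_episodes // STEPS))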
import matplotlib.pyplot as plt

plt.plot(STEPS * np.arange(nb_episodes // STEPS), sarsa_performance)
plt.xlabel("episodes")
plt.title("Learning progress for SARSA")
plt.ylabel("average reward per episode")
def greedy_policy(q, s):
    return np.argmax(q[s])
print(average_performance(greedy_policy, q=q))
0.838
Read [SuBa] 6.5 Q-Learning: Off-Policy TD Control
Update rule:
$$ q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha \left(R_{t+1} + \gamma \max_{a'} q(s_{t+1}, a') - q(s_t, a_t)\right) $$
Here the behaviour policy that selects the actions is $\epsilon$-greedy, while the TD target bootstraps from the greedy (estimation) policy via $\max_{a'} q(s_{t+1}, a')$.
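The Q-table and the bookkeeping variables used by the loop below are initialized in cells not included in this excerpt; presumably something along these lines (the values are assumptions, mirroring the SARSA defaults):
q = np.ones((16, 4))                            # fresh Q-table
q_performance = np.zeros(nb_episodes // STEPS)  # greedy performance, sampled every STEPS episodes
alpha = 0.02                                    # assumed, matching the SARSA defaults
gamma = 1.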
# Q-Learning: Off-policy TD control algorithm
for i in range(nb_episodes):
    done = False
    s = env.reset()
    while not done:
        a = action_epsilon_greedy(q, s)  # behaviour policy
        new_s, reward, done, info = env.step(a)
        a_max = np.argmax(q[new_s])  # estimation policy
        q[s, a] = q[s, a] + alpha * (reward + gamma * q[new_s, a_max] - q[s, a])
        s = new_s
    # for plotting the performance
    if i % STEPS == 0:
        q_performance[i // STEPS] = average_performance(greedy_policy, q)
average_performance(greedy_policy, q)
0.828