Draft status
Monte Carlo policy evaluation estimates state values with the empirical mean return instead of the expected return (which would require a model of the environment).
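As a toy illustration (not part of the original notebook; the win probability below is made up), the empirical mean of sampled returns approximates the expected return:
import numpy as np
# hypothetical game: return +1 with probability 0.4 and -1 with probability 0.6
sampled_returns = np.random.choice([1.0, -1.0], size=100000, p=[0.4, 0.6])
print(sampled_returns.mean())      # empirical mean return
print(0.4 * 1.0 + 0.6 * (-1.0))    # expected return: -0.2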
import numpy as np
Table of environments https://github.com/openai/gym/wiki/Table-of-environments
import gym
env = gym.make('Blackjack-v0')
[2017-05-22 15:01:13,988] Making new env: Blackjack-v0
env.action_space
Discrete(2)
There are two actions: stick (0) and hit (1).
#random action
env.action_space.sample()
1
help(env)
Help on BlackjackEnv in module gym.envs.toy_text.blackjack object:

class BlackjackEnv(gym.core.Env)
 |  Simple blackjack environment
 |
 |  Blackjack is a card game where the goal is to obtain cards that sum to as
 |  near as possible to 21 without going over. They're playing against a fixed
 |  dealer.
 |  Face cards (Jack, Queen, King) have point value 10.
 |  Aces can either count as 11 or 1, and it's called 'usable' at 11.
 |  This game is placed with an infinite deck (or with replacement).
 |  The game starts with each (player and dealer) having one face up and one
 |  face down card.
 |
 |  The player can request additional cards (hit=1) until they decide to stop
 |  (stick=0) or exceed 21 (bust).
 |
 |  After the player sticks, the dealer reveals their facedown card, and draws
 |  until their sum is 17 or greater. If the dealer goes bust the player wins.
 |
 |  If neither player nor dealer busts, the outcome (win, lose, draw) is
 |  decided by whose sum is closer to 21. The reward for winning is +1,
 |  drawing is 0, and losing is -1.
 |
 |  The observation of a 3-tuple of: the players current sum,
 |  the dealer's one showing card (1-10 where 1 is ace),
 |  and whether or not the player holds a usable ace (0 or 1).
 |
 |  This environment corresponds to the version of the blackjack problem
 |  described in Example 5.1 in Reinforcement Learning: An Introduction
 |  by Sutton and Barto (1998).
 |  https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
 |
 |  Method resolution order:
 |      BlackjackEnv
 |      gym.core.Env
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __init__(self, natural=False)
 |      Initialize self. See help(type(self)) for accurate signature.
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from gym.core.Env:
 |
 |  __del__(self)
 |
 |  __str__(self)
 |      Return str(self).
 |
 |  close(self)
 |      Override _close in your subclass to perform any necessary cleanup.
 |
 |      Environments will automatically close() themselves when
 |      garbage collected or when the program exits.
 |
 |  configure(self, *args, **kwargs)
 |
 |  render(self, mode='human', close=False)
 |      Renders the environment.
 |
 |      The set of supported modes varies per environment. (And some
 |      environments do not support rendering at all.) By convention,
 |      if mode is:
 |
 |      - human: render to the current display or terminal and
 |        return nothing. Usually for human consumption.
 |      - rgb_array: Return an numpy.ndarray with shape (x, y, 3),
 |        representing RGB values for an x-by-y pixel image, suitable
 |        for turning into a video.
 |      - ansi: Return a string (str) or StringIO.StringIO containing a
 |        terminal-style text representation. The text can include newlines
 |        and ANSI escape sequences (e.g. for colors).
 |
 |      Note:
 |          Make sure that your class's metadata 'render.modes' key includes
 |          the list of supported modes. It's recommended to call super()
 |          in implementations to use the functionality of this method.
 |
 |      Args:
 |          mode (str): the mode to render with
 |          close (bool): close all open renderings
 |
 |      Example:
 |
 |      class MyEnv(Env):
 |          metadata = {'render.modes': ['human', 'rgb_array']}
 |
 |          def render(self, mode='human'):
 |              if mode == 'rgb_array':
 |                  return np.array(...) # return RGB frame suitable for video
 |              elif mode is 'human':
 |                  ... # pop up a window and render
 |              else:
 |                  super(MyEnv, self).render(mode=mode) # just raise an exception
 |
 |  reset(self)
 |      Resets the state of the environment and returns an initial observation.
 |
 |      Returns: observation (object): the initial observation of the
 |          space.
 |
 |  seed(self, seed=None)
 |      Sets the seed for this env's random number generator(s).
 |
 |      Note:
 |          Some environments use multiple pseudorandom number generators.
 |          We want to capture all such seeds used in order to ensure that
 |          there aren't accidental correlations between multiple generators.
 |
 |      Returns:
 |          list<bigint>: Returns the list of seeds used in this env's random
 |            number generators. The first value in the list should be the
 |            "main" seed, or the value which a reproducer should pass to
 |            'seed'. Often, the main seed equals the provided 'seed', but
 |            this won't be true if seed=None, for example.
 |
 |  step(self, action)
 |      Run one timestep of the environment's dynamics. When end of
 |      episode is reached, you are responsible for calling `reset()`
 |      to reset this environment's state.
 |
 |      Accepts an action and returns a tuple (observation, reward, done, info).
 |
 |      Args:
 |          action (object): an action provided by the environment
 |
 |      Returns:
 |          observation (object): agent's observation of the current environment
 |          reward (float) : amount of reward returned after previous action
 |          done (boolean): whether the episode has ended, in which case further step() calls will return undefined results
 |          info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
 |
 |  ----------------------------------------------------------------------
 |  Static methods inherited from gym.core.Env:
 |
 |  __new__(cls, *args, **kwargs)
 |      Create and return a new object. See help(type) for accurate signature.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from gym.core.Env:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  monitor
 |
 |  unwrapped
 |      Completely unwrap this env.
 |
 |      Returns:
 |          gym.Env: The base non-wrapped gym.Env instance
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from gym.core.Env:
 |
 |  action_space = None
 |
 |  metadata = {'render.modes': []}
 |
 |  observation_space = None
 |
 |  reward_range = (-inf, inf)
observation_space = env.observation_space.spaces
observation_space
(Discrete(32), Discrete(11), Discrete(2))
Observation tuple for Blackjack: the player's current sum, the dealer's one showing card (1-10, where 1 is an ace), and whether or not the player holds a usable ace.
# Resets the state of the environment and returns an initial observation.
observation = env.reset()
observation
(14, 6, False)
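For readability, the observation can be unpacked into named variables (the names below are only illustrative, they are not part of the Gym API):
player_sum, dealer_showing, usable_ace = observation
print(player_sum, dealer_showing, usable_ace)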
# env.step:
# Run one timestep of the environment's dynamics. When end of
# episode is reached, you are responsible for calling `reset()`
# to reset this environment's state.
action = 0  # stick
observation, reward, done, info = env.step(action)
observation, reward, done, info
((14, 6, False), 1.0, True, {})
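To see reset() and step() together, here is a small illustrative rollout (not in the original notebook) that samples random actions until the episode ends:
observation = env.reset()
done = False
while not done:
    action = env.action_space.sample()   # random action: 0 = stick, 1 = hit
    observation, reward, done, info = env.step(action)
    print(observation, reward, done)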
A policy assigns an action to each observation, so we store it as an integer array indexed by the observation tuple:
# some index combinations are unreachable states, but for convenience we do not exclude them
state_space_size = (33, 12, 2)
# simple policy: always stick (action 0)
policy = np.zeros(state_space_size, dtype=int)
#policy[:9,:,:]=1
def observation_clean(observation):
    # convert the usable-ace flag from bool to int so the tuple can index numpy arrays
    return (observation[0], observation[1], int(observation[2]))
observation = observation_clean(observation)
policy[observation]
0
np.random.rand()  # uniform random number in [0, 1), used later for epsilon-greedy exploration
0.6780772481806671
Read: [SuBa] 5.1 Monte Carlo Policy Evaluation
Note that the states are encoded differently in the OpenAI Gym environment (used here) and in [SuBa].
def run_episode(policy, env=env):
    steps = []
    observation = observation_clean(env.reset())
    done = False
    steps.append((None, None) + (observation, 0))
    while not done:
        action = policy[observation]
        observation_action = (observation, action)
        observation, reward, done, info = env.step(action)
        observation = observation_clean(observation)
        steps.append(observation_action + (observation, int(reward)))
    return steps  # list of tuples: (s, a, s', R)
run_episode(policy)
[(None, None, (13, 10, 1), 0), ((13, 10, 1), 0, (13, 10, 1), 1)]
gamma = 0.99  # discount factor
N = np.zeros(state_space_size, dtype=int)  # visit counts per state
S = np.zeros(state_space_size)             # sums of returns per state
# every-visit Monte Carlo policy evaluation:
nb_of_episodes = 100000
for e in range(nb_of_episodes):
    observations_reward = run_episode(policy)
    G = 0.
    #print(observations_reward)
    for o0, a, o, r in reversed(observations_reward):
        G = r + gamma * G       # return following the visit to state o0
        if o0 is not None:      # skip the dummy start entry
            N[o0] += 1
            S[o0] += G
        #print(o0, r, G)
Gs = np.zeros(state_space_size)
Gs[N != 0] = S[N != 0] / N[N != 0]  # empirical mean return for every visited state
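As a quick sanity check (not in the original notebook), we can count how many table entries were actually visited:
print(np.count_nonzero(N), "of", N.size, "table entries were visited")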
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
%matplotlib inline
Gs[10, 3, 1]  # a player sum of 10 with a usable ace is unreachable, so this state was never visited
0.0
A = np.arange(1, 11)
B = np.arange(4, 22)
A, B = np.meshgrid(A, B)
V = Gs[4:22, 1:11, 0]  # value estimates for states without a usable ace
fig = plt.figure(figsize=(10,8))
ax = fig.gca(projection='3d')
surf = ax.plot_surface(A, B, V, rstride=1, cstride=1, cmap=cm.coolwarm,
linewidth=0, antialiased=False)
ax.set_ylabel("Player sum")
ax.set_xlabel("Dealer showing")
ax.set_title("No useable Ace")
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
A = np.arange(1, 11)
B = np.arange(12, 22)
A, B = np.meshgrid(A, B)
V = Gs[12:22, 1:11, 1]  # value estimates for states with a usable ace
fig = plt.figure(figsize=(10,8))
ax = fig.gca(projection='3d')
surf = ax.plot_surface(A, B, V, rstride=1, cstride=1, cmap=cm.coolwarm,
linewidth=0, antialiased=False)
ax.set_ylabel("Player sum")
ax.set_xlabel("Dealer showing")
ax.set_title("Useable Ace")
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
Read: [SuBa] 5.3 Monte Carlo Control
Two assumptions are made for finding the optimal policy in [SuBa] 5.3: exploring starts (every state-action pair has a nonzero probability of being the start of an episode) and policy evaluation with an infinite number of episodes.
For Blackjack the exploring-starts condition can be satisfied if we choose the first action randomly (the initial state is already random, since the cards are dealt at random). So let's modify the run_episode function:
def run_episode_exploring_start(policy, env=env):
    steps = []
    observation = observation_clean(env.reset())
    done = False
    steps.append((None, None) + (observation, 0))
    start = True
    while not done:
        if start:
            # exploring start: choose the first action uniformly at random
            action = np.random.binomial(n=1, p=0.5)
            start = False
        else:
            action = policy[observation]
        observation_action = (observation, action)
        observation, reward, done, info = env.step(action)
        observation = observation_clean(observation)
        steps.append(observation_action + (observation, int(reward)))
    return steps  # list of tuples: (s, a, s', R)
size = list(state_space_size) + [2]  # one entry per (state, action) pair
Q = np.zeros(size)                   # action-value estimates
Q.shape
(33, 12, 2, 2)
from collections import defaultdict
returns = defaultdict(list)  # observed returns for each (state, action) pair
nb_of_episodes = 500000
def monte_carlo_optimal_policy(nb_of_episodes,
                               policy=np.random.binomial(n=1, p=0.5, size=state_space_size),
                               run_episode=run_episode_exploring_start):
    # note: the default policy array is created only once, at function definition time
    for i in range(nb_of_episodes):
        # (a) generate an episode using exploring starts and pi
        observations_reward = run_episode(policy)
        G = 0.  # current return
        # use a dict for the first-visit condition:
        o_a = {}  # map from states to (action, return) tuples
        for o0, a, o, r in reversed(observations_reward):
            G = r + gamma * G
            o_a[o0] = a, G  # iterating backwards, the last write wins, i.e. the first visit
        # (b) for each pair (s, a) appearing in the episode
        for o, (a, G) in o_a.items():
            if o is not None:  # skip the dummy start entry
                returns[(o, a)].append(G)
                re_mean = np.array(returns[(o, a)]).mean()
                Q[o + (a,)] = re_mean
                # for each s in the episode: optimize the policy (greedy w.r.t. Q)
                policy[o] = np.argmax(Q[o])
    return policy
policy = monte_carlo_optimal_policy(nb_of_episodes)
Q[5, 8, 0, 1]  # estimated Q-value for player sum 5, dealer showing 8, no usable ace, action hit
-0.15549859009756098
policy[5, 5, 0]  # greedy action for player sum 5, dealer showing 5, no usable ace
1
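As an additional check (not in the original notebook), the average return of the learned greedy policy can be estimated by rolling out episodes with run_episode:
final_rewards = []
for _ in range(10000):
    episode = run_episode(policy)          # follow the learned policy greedily
    final_rewards.append(episode[-1][-1])  # reward of the last step (+1, 0 or -1)
print(np.mean(final_rewards))              # empirical average return per game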
import scipy.ndimage
def plot_policy_no_usable_ace(policy):
    B = np.arange(4-0.5, 22-0.5, 0.2)
    A = np.arange(1-0.5, 11-0.5, 0.2)
    A, B = np.meshgrid(A, B)
    Po = scipy.ndimage.zoom(policy[4:22, 1:11, 0], 5)
    levels = range(-1, 2)
    plt.figure(figsize=(7, 6))
    CS = plt.contourf(A, B, Po, levels)
    cbar = plt.colorbar(CS)
    cbar.ax.set_ylabel('actions')
    #plt.clabel(CS, inline=1, fontsize=10)
    plt.title('Policy for no usable ace')
    plt.xlabel("Dealer showing")
    plt.ylabel("Player sum")
    _ = plt.xticks(range(1, 11))
    _ = plt.yticks(range(4, 22))
# TODO improve plots
plot_policy_no_usable_ace(policy)
def plot_policy_usable_ace(policy):
    B = np.arange(12-0.5, 22-0.5, 0.2)
    A = np.arange(1-0.5, 11-0.5, 0.2)
    A, B = np.meshgrid(A, B)
    Po = scipy.ndimage.zoom(policy[12:22, 1:11, 1], 5, mode='nearest')
    levels = range(-1, 2)
    plt.figure(figsize=(7, 6))
    CS = plt.contourf(A, B, Po, levels)
    cbar = plt.colorbar(CS)
    cbar.ax.set_ylabel('actions')
    #plt.clabel(CS, inline=1, fontsize=10)
    plt.title('Policy for usable ace')
    plt.xlabel("Dealer showing")
    plt.ylabel("Player sum")
    _ = plt.xticks(range(1, 11))
    _ = plt.yticks(range(12, 22))
plot_policy_usable_ace(policy)
If the exploring-starts condition cannot be satisfied, because not every state $s \in \mathcal S$ can be a start state, we need another strategy for exploration.
Read [SuBa] 5.4 On-Policy Monte Carlo Control
A simple method is $\epsilon$-greedy, i.e. with probability $\epsilon$ we choose a random action. The policy we are following is therefore stochastic: $\pi(a \mid s)$.
In the implementation, the action stored in the array policy is taken with probability $1-\epsilon/2$ and the alternative action with probability $\epsilon/2$: a random action is drawn with probability $\epsilon$, and half of those random draws coincide with the stored action. Note that for each state $s \in \mathcal S$ we have $|\mathcal A(s)| = 2$.
Note also that the final policy is not the deterministic policy stored in policy. It is a soft $\epsilon$-greedy policy: actions should be taken from policy with probability $1-\epsilon/2$ and the alternative action with probability $\epsilon/2$.
def run_episode_epsilon_greedy(policy, epsilon=0.05, env=env):
    steps = []
    observation = observation_clean(env.reset())
    done = False
    steps.append((None, None) + (observation, 0))
    while not done:
        if np.random.rand() > epsilon:
            action = policy[observation]
        else:
            # explore: with probability epsilon choose an action uniformly at random
            action = np.random.binomial(n=1, p=0.5)
        observation_action = (observation, action)
        observation, reward, done, info = env.step(action)
        observation = observation_clean(observation)
        steps.append(observation_action + (observation, int(reward)))
    return steps  # list of tuples: (s, a, s', R)
policy = monte_carlo_optimal_policy(nb_of_episodes, run_episode=run_episode_epsilon_greedy)
# greedy action taken with probability 1 - epsilon/2
plot_policy_no_usable_ace(policy)
# greedy action taken with probability 1 - epsilon/2
plot_policy_usable_ace(policy)
Goal: evaluate and improve one policy (the estimation policy) while following another policy (the behavior policy).
Require: every action taken under the estimation policy must also be taken, with nonzero probability, under the behavior policy.
Idea: "importance sampling of whole episodes", see [SuBa] 5.5 Evaluating One Policy While Following Another.
So the greedy (estimation) policy can be evaluated and optimized while following the soft $\epsilon$-greedy behavior policy, see [SuBa] 5.6 Off-Policy Monte Carlo Control.
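A minimal sketch of the per-episode importance weight from [SuBa] 5.5, assuming the deterministic estimation policy stored in the array policy and the $\epsilon$-greedy behavior policy used above (the helper episode_importance_weight is not part of the original notebook):
def episode_importance_weight(episode, policy, epsilon=0.05):
    # product over the episode of pi(a|s) / pi'(a|s);
    # the deterministic estimation policy pi puts probability 1 on its own action,
    # the behavior policy pi' puts 1 - epsilon/2 on that action and epsilon/2 on the other
    weight = 1.0
    for s, a, s_next, r in episode:
        if s is None:  # skip the dummy start entry
            continue
        pi = 1.0 if a == policy[s] else 0.0
        behavior = 1 - epsilon / 2 if a == policy[s] else epsilon / 2
        weight *= pi / behavior
    return weight

episode = run_episode_epsilon_greedy(policy)
episode_importance_weight(episode, policy)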