semi_gradient_sarsa

Module semi_gradient_sarsa. Implements episodic semi-gradient SARSA for estimating the state-action value function. The implementation follows the algorithm on page 244 of Sutton and Barto, Reinforcement Learning: An Introduction, second edition, 2020.

class semi_gradient_sarsa.SemiGradSARSAConfig(gamma: float = 1.0, alpha: float = 0.1, n_itrs_per_episode: int = 100, policy: Optional[Policy] = None)

Configuration class for semi-gradient SARSA algorithm
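
A minimal construction sketch (the concrete Policy implementation is whatever the surrounding package provides; my_epsilon_greedy_policy below is a placeholder assumption, not a name defined by this module):

   config = SemiGradSARSAConfig(
       gamma=0.99,               # discount factor
       alpha=0.05,               # step size used in the semi-gradient updates
       n_itrs_per_episode=200,   # maximum number of steps per training episode
       policy=my_epsilon_greedy_policy,  # placeholder: any Policy instance
   )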

class semi_gradient_sarsa.SemiGradSARSA(config: SemiGradSARSAConfig)

SemiGradSARSA class. Implements the semi-gradient SARSA algorithm as described

__init__(config: SemiGradSARSAConfig) → None
_do_train(env: Env, episode_idx: int, **options) → EpisodeInfo

Train the algorithm on the episode

Parameters
  • env (The environment to train on) –

  • episode_idx (The index of the training episode) –

  • options (Any keyword based options passed by the client code) –

Return type

An instance of EpisodeInfo
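
As a rough sketch, the per-episode loop on page 244 of Sutton and Barto has the shape below, expressed against the documented methods. The env.reset()/env.step() calls assume a classic Gym-style environment, and config.policy is assumed to be callable as policy(state) -> action; both are assumptions about the surrounding package, not guarantees of this module:

   def run_episode(agent, env, episode_idx):
       state = env.reset()
       action = agent.config.policy(state)
       for _ in range(agent.config.n_itrs_per_episode):
           next_state, reward, done, _ = env.step(action)
           if done:
               # terminal transition: no next state-action pair to bootstrap from
               agent._weights_update_episode_done(env, state, action, reward)
               break
           next_action = agent.config.policy(next_state)
           # semi-gradient SARSA update on the (S, A, R, S', A') transition
           agent._weights_update(env, state, action, reward, next_state, next_action)
           state, action = next_state, next_action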

_init() → None

Any initializations needed before starting the training

Return type

None

_validate() → None

Validate the state of the agent. This is called before any training begins to check that the starting state is sane

Return type

None

_weights_update(env: Env, state: State, action: Action, reward: float, next_state: State, next_action: Action, t: float = 1.0) → None

Update the weights of the underlying Q-estimator for a non-terminal transition, using the observed reward and the bootstrapped value of (next_state, next_action)

Parameters
  • env (The environment instance that the training takes place) –

  • state (The current state) –

  • action (The action we took at state) –

  • reward (The reward observed when taking the given action when at the given state) –

  • next_state (The observed new state) –

  • next_action (The action to be executed in next_state) –

Return type

None
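
For a non-terminal transition, the update on page 244 is w <- w + alpha * [R + gamma * q_hat(S', A', w) - q_hat(S, A, w)] * grad_w q_hat(S, A, w). With a linear Q-estimator q_hat(s, a, w) = w . x(s, a) the gradient is simply the feature vector, so a minimal sketch looks like the following (the feature extractor x and the explicit weight vector are assumptions about the estimator, not this module's exact internals):

   import numpy as np

   def sarsa_weights_update(w, x, alpha, gamma, state, action, reward, next_state, next_action):
       td_target = reward + gamma * np.dot(w, x(next_state, next_action))
       td_error = td_target - np.dot(w, x(state, action))
       # for a linear estimator, grad_w q_hat(s, a, w) is just x(s, a)
       return w + alpha * td_error * x(state, action)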

_weights_update_episode_done(env: Env, state: State, action: Action, reward: float, t: float = 1.0) → None

Update the weights of the underlying Q-estimator when the episode has finished, i.e. when there is no next state-action pair to bootstrap from

Parameters
  • state (The current state; it is assumed to be a raw state) –

  • reward (The reward observed when taking the given action when at the given state) –

  • action (The action we took at the state) –

Return type

None
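
When the episode has finished the target collapses to the observed reward: w <- w + alpha * [R - q_hat(S, A, w)] * grad_w q_hat(S, A, w). A linear-estimator sketch under the same assumptions as above:

   import numpy as np

   def sarsa_weights_update_terminal(w, x, alpha, state, action, reward):
       td_error = reward - np.dot(w, x(state, action))
       return w + alpha * td_error * x(state, action)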

actions_after_episode_ends(env: Env, episode_idx: int, **options) → None

Any actions after the training episode ends

Parameters
  • env (The training environment) –

  • episode_idx (The training episode index) –

  • options (Any options passed by the client code) –

Return type

None

actions_before_episode_begins(env: Env, episode_idx: int, **options) → None

Any actions to perform before the episode begins

Parameters
  • env (The instance of the training environment) –

  • episode_idx (The training episode index) –

  • options (Any keyword options passed by the client code) –

Return type

None

actions_before_training(env: Env, **options) → None

Specify any actions necessary before training begins

Parameters
  • env (The environment to train on) –

  • options (Any key-value options passed by the client) –

Return type

None

on_episode(env: Env, episode_idx: int, **options) → EpisodeInfo

Train the algorithm on the episode

Parameters
  • env (The environment to train on) –

  • episode_idx (The index of the training episode) –

  • options (Any keyword based options passed by the client code) –

Return type

An instance of EpisodeInfo

play(env: Env, stop_criterion: Criterion) → None

Play the agent on the environment. This should produce a distorted dataset

Parameters
  • env (The environment to play the agent on) –

  • stop_criterion (The criteria to use to stop) –

Return type

None
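
An end-to-end usage sketch, assuming a Gym-style environment; my_policy and my_stop_criterion are placeholders for a Policy and a Criterion instance provided by the surrounding package, and the surrounding library may also provide a driver that runs this loop for you:

   import gym
   from semi_gradient_sarsa import SemiGradSARSA, SemiGradSARSAConfig

   env = gym.make("MountainCar-v0")    # any Gym-style environment with discrete actions
   config = SemiGradSARSAConfig(gamma=1.0, alpha=0.1,
                                n_itrs_per_episode=500,
                                policy=my_policy)        # placeholder Policy instance
   agent = SemiGradSARSA(config)

   agent.actions_before_training(env)
   for episode_idx in range(100):
       agent.actions_before_episode_begins(env, episode_idx)
       episode_info = agent.on_episode(env, episode_idx)
       agent.actions_after_episode_ends(env, episode_idx)

   agent.play(env, stop_criterion=my_stop_criterion)     # placeholder Criterion instance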