semi_gradient_sarsa

Module semi_gradient_sarsa. Implements episodic semi-gradient SARSA for estimating the state-action value function. The implementation follows the algorithm on page 244 of Sutton and Barto, Reinforcement Learning: An Introduction, second edition, 2020.

class semi_gradient_sarsa.SemiGradSARSAConfig(gamma: float = 1.0, alpha: float = 0.1, n_itrs_per_episode: int = 100, policy: Optional[Policy] = None)

Configuration class for semi-gradient SARSA algorithm
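
A minimal construction sketch (the concrete Policy implementation is whatever the surrounding package provides; my_epsilon_greedy_policy below is a placeholder assumption, not a name defined by this module):

   config = SemiGradSARSAConfig(
       gamma=0.99,               # discount factor
       alpha=0.05,               # step size used in the semi-gradient updates
       n_itrs_per_episode=200,   # maximum number of steps per training episode
       policy=my_epsilon_greedy_policy,  # placeholder: any Policy instance
   )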

class semi_gradient_sarsa.SemiGradSARSA(config: SemiGradSARSAConfig)

SemiGradSARSA class. Implements the semi-gradient SARSA algorithm as described

__init__(config: SemiGradSARSAConfig) → None
_do_train(env: Env, episode_idx: int, **options) → EpisodeInfo

Train the algorithm on the episode

Parameters
  • env (The environment to train on) –

  • episode_idx (The index of the training episode) –

  • options (Any keyword based options passed by the client code) –

Return type

An instance of EpisodeInfo
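
As a rough sketch, the per-episode loop on page 244 of Sutton and Barto has the shape below, expressed against the documented methods. The env.reset()/env.step() calls assume a classic Gym-style environment, and config.policy is assumed to be callable as policy(state) -> action; both are assumptions about the surrounding package, not guarantees of this module:

   def run_episode(agent, env, episode_idx):
       state = env.reset()
       action = agent.config.policy(state)
       for _ in range(agent.config.n_itrs_per_episode):
           next_state, reward, done, _ = env.step(action)
           if done:
               # terminal transition: no next state-action pair to bootstrap from
               agent._weights_update_episode_done(env, state, action, reward)
               break
           next_action = agent.config.policy(next_state)
           # semi-gradient SARSA update on the (S, A, R, S', A') transition
           agent._weights_update(env, state, action, reward, next_state, next_action)
           state, action = next_state, next_action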

_init() → None

Any initializations needed before starting the training

Return type

None

_validate() → None

Validate the state of the agent. This is called before any training begins to check that the starting state is sane

Return type

None

_weights_update(env: Env, state: State, action: Action, reward: float, next_state: State, next_action: Action, t: float = 1.0) → None

Update the weights of the underlying Q-estimator for a non-terminal transition, using the observed reward and the bootstrapped value of (next_state, next_action)

Parameters
  • env (The environment instance that the training takes place) –

  • state (The current state) –

  • action (The action we took at state) –

  • reward (The reward observed when taking the given action when at the given state) –

  • next_state (The observed new state) –

  • next_action (The action to be executed in next_state) –

Return type

None
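
For a non-terminal transition, the update on page 244 is w <- w + alpha * [R + gamma * q_hat(S', A', w) - q_hat(S, A, w)] * grad_w q_hat(S, A, w). With a linear Q-estimator q_hat(s, a, w) = w . x(s, a) the gradient is simply the feature vector, so a minimal sketch looks like the following (the feature extractor x and the explicit weight vector are assumptions about the estimator, not this module's exact internals):

   import numpy as np

   def sarsa_weights_update(w, x, alpha, gamma, state, action, reward, next_state, next_action):
       td_target = reward + gamma * np.dot(w, x(next_state, next_action))
       td_error = td_target - np.dot(w, x(state, action))
       # for a linear estimator, grad_w q_hat(s, a, w) is just x(s, a)
       return w + alpha * td_error * x(state, action)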

_weights_update_episode_done(env: Env, state: State, action: Action, reward: float, t: float = 1.0) → None

Update the weights of the underlying Q-estimator when the episode has finished, i.e. when there is no next state-action pair to bootstrap from

Parameters
  • state (The current state; it is assumed to be a raw state) –

  • reward (The reward observed when taking the given action when at the given state) –

  • action (The action we took at the state) –

Return type

None
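
When the episode has finished the target collapses to the observed reward: w <- w + alpha * [R - q_hat(S, A, w)] * grad_w q_hat(S, A, w). A linear-estimator sketch under the same assumptions as above:

   import numpy as np

   def sarsa_weights_update_terminal(w, x, alpha, state, action, reward):
       td_error = reward - np.dot(w, x(state, action))
       return w + alpha * td_error * x(state, action)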

actions_after_episode_ends(env: Env, episode_idx: int, **options) → None

Any actions after the training episode ends

Parameters
  • env (The training environment) –

  • episode_idx (The training episode index) –

  • options (Any options passed by the client code) –

Return type

None

actions_before_episode_begins(env: Env, episode_idx: int, **options) → None

Any actions to perform before the episode begins

Parameters
  • env (The instance of the training environment) –

  • episode_idx (The training episode index) –

  • options (Any keyword options passed by the client code) –

Return type

None

actions_before_training(env: Env, **options) → None

Specify any actions necessary before training begins

Parameters
  • env (The environment to train on) –

  • options (Any key-value options passed by the client) –

Return type

None

on_episode(env: Env, episode_idx: int, **options) → EpisodeInfo

Train the algorithm on the episode

Parameters
  • env (The environment to train on) –

  • episode_idx (The index of the training episode) –

  • options (Any keyword based options passed by the client code) –

Return type

An instance of EpisodeInfo

play(env: Env, stop_criterion: Criterion) → None

Play the agent on the environment. This should produce a distorted dataset

Parameters
  • env (The environment to play the agent on) –

  • stop_criterion (The criteria to use to stop) –

Return type

None
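
An end-to-end usage sketch, assuming a Gym-style environment; my_policy and my_stop_criterion are placeholders for a Policy and a Criterion instance provided by the surrounding package, and the surrounding library may also provide a driver that runs this loop for you:

   import gym
   from semi_gradient_sarsa import SemiGradSARSA, SemiGradSARSAConfig

   env = gym.make("MountainCar-v0")    # any Gym-style environment with discrete actions
   config = SemiGradSARSAConfig(gamma=1.0, alpha=0.1,
                                n_itrs_per_episode=500,
                                policy=my_policy)        # placeholder Policy instance
   agent = SemiGradSARSA(config)

   agent.actions_before_training(env)
   for episode_idx in range(100):
       agent.actions_before_episode_begins(env, episode_idx)
       episode_info = agent.on_episode(env, episode_idx)
       agent.actions_after_episode_ends(env, episode_idx)

   agent.play(env, stop_criterion=my_stop_criterion)     # placeholder Criterion instance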