Exploration Policies

Exploration policies allow the agent to trade off exploration and exploitation according to a predefined policy. This is one of the most important aspects of a reinforcement learning agent, and can require some tuning to get right. Coach supports several pre-defined exploration policies, and it can be easily extended with custom policies. Note that not all exploration policies are expected to work for both discrete and continuous action spaces.

Exploration Policy    Discrete Action Space    Box Action Space
AdditiveNoise         X                        V
Boltzmann             V                        X
Bootstrapped          V                        X
Categorical           V                        X
ContinuousEntropy     X                        V
EGreedy               V                        V
Greedy                V                        V
OUProcess             X                        V
ParameterNoise        V                        V
TruncatedNormal       X                        V
UCB                   V                        X
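
An exploration policy is typically attached to an agent through its parameters class in a preset. The following is a minimal sketch of that wiring, assuming the EGreedyParameters and LinearSchedule helpers and the agent's exploration attribute keep the names and signatures shown here:

```python
from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.exploration_policies.e_greedy import EGreedyParameters
from rl_coach.schedules import LinearSchedule

# Assumed names: EGreedyParameters and LinearSchedule(initial, final, decay_steps)
agent_params = DQNAgentParameters()
agent_params.exploration = EGreedyParameters()
# Anneal epsilon from 1.0 to 0.01 over 10,000 steps (illustrative values)
agent_params.exploration.epsilon_schedule = LinearSchedule(1.0, 0.01, 10000)
agent_params.exploration.evaluation_epsilon = 0.001
```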

ExplorationPolicy

class rl_coach.exploration_policies.exploration_policy.ExplorationPolicy(action_space: rl_coach.spaces.ActionSpace)[source]

An exploration policy takes the predicted actions or action values from the agent, and selects the action to actually apply to the environment using some predefined algorithm.

Parameters

action_space – the action space used by the environment

change_phase(phase)[source]

Change between the running phases of the algorithm.

Parameters

phase – either Heatup or Train

Returns

None

get_action(action_values: List[Union[int, float, numpy.ndarray, List]]) → Union[int, float, numpy.ndarray, List][source]

Given a list of values corresponding to each action, choose one action according to the exploration policy.

Parameters

action_values – a list of action values

Returns

The chosen action, and the probability of the action (if available, otherwise 1 for absolute certainty in the action)

requires_action_values() → bool[source]

Allows exploration policies to define whether they require the action values for the current step. This can save a lot of computation. For example, in e-greedy, if the random value generated is smaller than epsilon, the action is completely random and the action values don't need to be calculated.

Returns

True if the action values are required, False otherwise

reset()[source]

Resets the exploration policy parameters when needed.

Returns

None
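
A custom exploration policy can be defined by subclassing ExplorationPolicy and implementing the interface above. The policy below is a hypothetical illustration (not part of Coach) that acts greedily but breaks ties between equal action values at random:

```python
import numpy as np
from rl_coach.exploration_policies.exploration_policy import ExplorationPolicy

class RandomTieBreakGreedy(ExplorationPolicy):
    """Hypothetical policy: greedy, with ties between equal action values broken at random."""

    def get_action(self, action_values):
        values = np.asarray(action_values).squeeze()
        best = np.flatnonzero(values == values.max())   # indices of all maximal actions
        action = int(np.random.choice(best))
        return action, 1.0                              # chosen action and its probability

    def requires_action_values(self):
        return True                                     # the greedy choice needs the values
```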

AdditiveNoise

class rl_coach.exploration_policies.additive_noise.AdditiveNoise(action_space: rl_coach.spaces.ActionSpace, noise_schedule: rl_coach.schedules.Schedule, evaluation_noise: float, noise_as_percentage_from_action_space: bool = True)[source]

AdditiveNoise is an exploration policy intended for continuous action spaces. It takes the action from the agent and adds Gaussian noise to it. The amount of noise added to the action can be given in two different ways:
1. Specified by the user as a noise schedule, taken as a percentage of the action space size
2. Specified by the agent's action. In case the agent's action is a list with 2 values, the 1st is assumed to be the mean of the action, and the 2nd is assumed to be its standard deviation.

Parameters
  • action_space – the action space used by the environment

  • noise_schedule – the schedule for the noise

  • evaluation_noise – the noise variance that will be used during evaluation phases

  • noise_as_percentage_from_action_space – a bool deciding whether the noise is absolute or as a percentage from the action space
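
The core of the policy amounts to the following sketch (illustrative NumPy only, with made-up values standing in for the schedule and the action space):

```python
import numpy as np

action = np.array([0.2, -0.5])              # mean action predicted by the agent
action_space_range = np.array([2.0, 2.0])   # high - low per action dimension (assumed)
noise_percentage = 0.1                      # current value of the noise schedule (assumed)

# With noise_as_percentage_from_action_space=True, the stddev scales with the action range
stddev = noise_percentage * action_space_range
noisy_action = action + np.random.normal(0, stddev)
```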

Boltzmann

class rl_coach.exploration_policies.boltzmann.Boltzmann(action_space: rl_coach.spaces.ActionSpace, temperature_schedule: rl_coach.schedules.Schedule)[source]

The Boltzmann exploration policy is intended for discrete action spaces. It assumes that each of the possible actions has some value assigned to it (such as the Q value), and uses a softmax function to convert these values into a distribution over the actions. It then samples the action to play from the calculated distribution. An additional temperature schedule can be given by the user to control the steepness of the softmax function.

Parameters
  • action_space – the action space used by the environment

  • temperature_schedule – the schedule for the temperature parameter of the softmax
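
In essence, the policy computes a temperature-scaled softmax over the action values and samples from it, as in this illustrative sketch:

```python
import numpy as np

def boltzmann_sample(q_values, temperature):
    # Higher temperature -> flatter distribution (more exploration);
    # lower temperature -> sharper distribution (closer to greedy).
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return np.random.choice(len(probs), p=probs)
```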

Bootstrapped

class rl_coach.exploration_policies.bootstrapped.Bootstrapped(action_space: rl_coach.spaces.ActionSpace, epsilon_schedule: rl_coach.schedules.Schedule, evaluation_epsilon: float, architecture_num_q_heads: int, continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = <rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object>)[source]

Bootstrapped exploration policy is currently only used for discrete action spaces, along with the Bootstrapped DQN agent. It assumes that there is an ensemble of network heads, where each one predicts the values for all the possible actions. For each episode, a single head is selected to lead the agent, according to its value predictions. In evaluation, the action is selected using a majority vote over all the heads' predictions.

Note

This exploration policy will only work for Discrete action spaces with Bootstrapped DQN style agents, since it requires the agent to have a network with multiple heads.

Parameters
  • action_space – the action space used by the environment

  • epsilon_schedule – a schedule for the epsilon values

  • evaluation_epsilon – the epsilon value to use for evaluation phases

  • continuous_exploration_policy_parameters – the parameters of the continuous exploration policy to use if the e-greedy is used for a continuous policy

  • architecture_num_q_heads – the number of q heads to select from
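
The head selection logic can be summarized by the following illustrative sketch (shapes and values are made up):

```python
import numpy as np

num_heads, num_actions = 10, 4
head_q_values = np.random.randn(num_heads, num_actions)    # per-head Q-value predictions

# Training: a single head, drawn once at the start of the episode, leads action selection
leading_head = np.random.randint(num_heads)
train_action = int(np.argmax(head_q_values[leading_head]))

# Evaluation: majority vote over the greedy choices of all heads
votes = np.argmax(head_q_values, axis=1)
eval_action = int(np.bincount(votes).argmax())
```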

Categorical

class rl_coach.exploration_policies.categorical.Categorical(action_space: rl_coach.spaces.ActionSpace)[source]

Categorical exploration policy is intended for discrete action spaces. It expects the action values to represent a probability distribution over the actions, from which a single action will be sampled. In evaluation, the action with the highest probability will be selected. This is particularly useful for actor-critic schemes, where the actor's output is a probability distribution over the actions.

Parameters

action_space – the action space used by the environment
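
The sampling itself reduces to drawing from the distribution given by the actor, for example:

```python
import numpy as np

action_probs = np.array([0.7, 0.2, 0.1])   # hypothetical actor output over 3 actions

train_action = int(np.random.choice(len(action_probs), p=action_probs))  # sample while training
eval_action = int(np.argmax(action_probs))                               # most probable action in evaluation
```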

ContinuousEntropy

class rl_coach.exploration_policies.continuous_entropy.ContinuousEntropy(action_space: rl_coach.spaces.ActionSpace, noise_schedule: rl_coach.schedules.Schedule, evaluation_noise: float, noise_as_percentage_from_action_space: bool = True)[source]

Continuous entropy is an exploration policy that is actually implemented as part of the network. The exploration policy class is only a placeholder for choosing this policy. The exploration policy is implemented by adding a regularization factor to the network loss, which regularizes the entropy of the action. This exploration policy is only intended for continuous action spaces, and assumes that the entire calculation is implemented as part of the head.

Warning

This exploration policy expects the agent or the network to implement the exploration functionality. Only a few heads are relevant and actually implement the entropy regularization factor.

Parameters
  • action_space – the action space used by the environment

  • noise_schedule – the schedule for the noise

  • evaluation_noise – the noise variance that will be used during evaluation phases

  • noise_as_percentage_from_action_space – a bool deciding whether the noise is absolute or as a percentage from the action space
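
The underlying idea, independent of which head implements it, is to subtract an entropy bonus from the loss so that the policy's standard deviation does not collapse. A rough sketch, assuming a Gaussian policy head and made-up values for the regularization coefficient and loss:

```python
import numpy as np

sigma = np.array([0.3, 0.5])   # hypothetical learned stddev per action dimension
beta = 0.01                    # hypothetical entropy regularization coefficient
policy_loss = 1.25             # placeholder for the loss computed by the policy head

# Entropy of N(mu, sigma^2) per dimension is 0.5 * log(2 * pi * e * sigma^2)
entropy = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2).sum()
loss = policy_loss - beta * entropy   # higher entropy is rewarded, keeping exploration alive
```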

EGreedy

class rl_coach.exploration_policies.e_greedy.EGreedy(action_space: rl_coach.spaces.ActionSpace, epsilon_schedule: rl_coach.schedules.Schedule, evaluation_epsilon: float, continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = <rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object>)[source]

e-greedy is an exploration policy that is intended for both discrete and continuous action spaces.

For discrete action spaces, it assumes that each action is assigned a value, and it selects the action with the highest value with probability 1 - epsilon. Otherwise, it selects an action sampled uniformly out of all the possible actions. The epsilon value is given by the user and can be given as a schedule. In evaluation, a different epsilon value can be specified.

For continuous action spaces, it assumes that the mean action is given by the agent. With probability epsilon, it samples a random action out of the action space bounds. Otherwise, it selects the action according to a given continuous exploration policy, which is set to AdditiveNoise by default. In evaluation, the action is always selected according to the given continuous exploration policy (where its phase is set to evaluation as well).

Parameters
  • action_space – the action space used by the environment

  • epsilon_schedule – a schedule for the epsilon values

  • evaluation_epsilon – the epsilon value to use for evaluation phases

  • continuous_exploration_policy_parameters – the parameters of the continuous exploration policy to use if the e-greedy is used for a continuous policy
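
For the discrete case, the selection rule boils down to the following sketch:

```python
import numpy as np

def e_greedy_action(action_values, epsilon):
    # With probability epsilon take a uniformly random action,
    # otherwise act greedily with respect to the predicted values.
    if np.random.rand() < epsilon:
        return np.random.randint(len(action_values))
    return int(np.argmax(action_values))
```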

Greedy

class rl_coach.exploration_policies.greedy.Greedy(action_space: rl_coach.spaces.ActionSpace)[source]

The Greedy exploration policy is intended for both discrete and continuous action spaces. For discrete action spaces, it always selects the action with the maximum value, as given by the agent. For continuous action spaces, it always returns the exact action, as it was given by the agent.

Parameters

action_space – the action space used by the environment

OUProcess

class rl_coach.exploration_policies.ou_process.OUProcess(action_space: rl_coach.spaces.ActionSpace, mu: float = 0, theta: float = 0.15, sigma: float = 0.2, dt: float = 0.01)[source]

OUProcess exploration policy is intended for continuous action spaces, and selects the action according to an Ornstein-Uhlenbeck process. The Ornstein-Uhlenbeck process implements the action as a Gaussian process, where the samples are correlated between consecutive time steps.

Parameters

action_space – the action space used by the environment
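
One Euler step of the process, using the default parameters from the signature above, looks roughly like this:

```python
import numpy as np

def ou_step(x, mu=0.0, theta=0.15, sigma=0.2, dt=0.01):
    # The noise is pulled back towards mu at rate theta, which makes
    # consecutive samples temporally correlated (unlike i.i.d. Gaussian noise).
    return x + theta * (mu - x) * dt + sigma * np.sqrt(dt) * np.random.normal(size=np.shape(x))

noise = np.zeros(2)            # hypothetical 2-dimensional action space
for _ in range(3):
    noise = ou_step(noise)     # the noise is typically added to the agent's action each step
```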

ParameterNoise

class rl_coach.exploration_policies.parameter_noise.ParameterNoise(network_params: Dict[str, rl_coach.base_parameters.NetworkParameters], action_space: rl_coach.spaces.ActionSpace)[source]

The ParameterNoise exploration policy is intended for both discrete and continuous action spaces. It applies exploration by replacing all the dense network layers with noisy layers. The noisy layers learn both a mean and a standard deviation for each weight, and on each forward pass of the network the weights are sampled from a normal distribution that follows the learned mean and standard deviation values.

Warning

This exploration policy is currently supported only by DQN variants.

Parameters

action_space – the action space used by the environment
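
Conceptually, each noisy dense layer behaves like the following sketch (a plain NumPy illustration, not the library's layer implementation):

```python
import numpy as np

in_dim, out_dim = 4, 2
weight_mean = np.random.randn(in_dim, out_dim) * 0.1   # learned in practice
weight_stddev = np.full((in_dim, out_dim), 0.05)       # learned in practice

x = np.random.randn(1, in_dim)                                   # hypothetical layer input
sampled_weights = np.random.normal(weight_mean, weight_stddev)   # fresh sample per forward pass
y = x @ sampled_weights                                          # the sampling itself drives exploration
```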

TruncatedNormal

class rl_coach.exploration_policies.truncated_normal.TruncatedNormal(action_space: rl_coach.spaces.ActionSpace, noise_schedule: rl_coach.schedules.Schedule, evaluation_noise: float, clip_low: float, clip_high: float, noise_as_percentage_from_action_space: bool = True)[source]

The TruncatedNormal exploration policy is intended for continuous action spaces. It samples the action from a normal distribution, where the mean action is given by the agent, and the standard deviation can be given in two different ways:
1. Specified by the user as a noise schedule, taken as a percentage of the action space size
2. Specified by the agent's action. In case the agent's action is a list with 2 values, the 1st is assumed to be the mean of the action, and the 2nd is assumed to be its standard deviation.
When the sampled action falls outside the action bounds given by the user, it is resampled until it is within the bounds.

Parameters
  • action_space – the action space used by the environment

  • noise_schedule – the schedule for the noise variance

  • evaluation_noise – the noise variance that will be used during evaluation phases

  • noise_as_percentage_from_action_space – whether to consider the noise as a percentage of the action space or absolute value
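
The resampling loop can be sketched as follows (the bounded retry count and the clipping fallback are illustrative additions, not necessarily what the library does):

```python
import numpy as np

def truncated_normal_action(mean, stddev, low, high, max_tries=100):
    # Resample until the action falls inside the action space bounds
    for _ in range(max_tries):
        action = np.random.normal(mean, stddev)
        if np.all(action >= low) and np.all(action <= high):
            return action
    return np.clip(np.random.normal(mean, stddev), low, high)   # illustrative fallback
```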

UCB

class rl_coach.exploration_policies.ucb.UCB(action_space: rl_coach.spaces.ActionSpace, epsilon_schedule: rl_coach.schedules.Schedule, evaluation_epsilon: float, architecture_num_q_heads: int, lamb: int, continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = <rl_coach.exploration_policies.additive_noise.AdditiveNoiseParameters object>)[source]

The UCB exploration policy follows the upper confidence bound heuristic to sample actions in discrete action spaces. It assumes that there are multiple network heads predicting action values, and that the standard deviation of the heads' predictions represents the uncertainty of the agent in each of the actions. It then updates the action value estimates to be mean(actions) + lambda * stdev(actions), where lambda is given by the user. This exploration policy aims to take advantage of the uncertainty of the agent in its predictions, and selects actions according to the tradeoff between how uncertain the agent is and how large it predicts the outcome of those actions to be.

Parameters
  • action_space – the action space used by the environment

  • epsilon_schedule – a schedule for the epsilon values

  • evaluation_epsilon – the epsilon value to use for evaluation phases

  • architecture_num_q_heads – the number of q heads to select from

  • lamb – lambda coefficient for taking the standard deviation into account

  • continuous_exploration_policy_parameters – the parameters of the continuous exploration policy to use if the e-greedy is used for a continuous policy
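
The value update can be sketched as follows (shapes and values are made up):

```python
import numpy as np

num_heads, num_actions = 10, 4
head_q_values = np.random.randn(num_heads, num_actions)   # hypothetical per-head predictions
lamb = 0.1                                                # the lamb coefficient described above

# Upper confidence bound per action: mean over heads plus lambda times the stddev over heads
ucb_values = head_q_values.mean(axis=0) + lamb * head_q_values.std(axis=0)
action = int(np.argmax(ucb_values))
```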