Proximal Policy Optimization

Action space: Discrete | Continuous

References: Proximal Policy Optimization Algorithms

Network Structure

../../../_images/ppo.png

Algorithm Description

Choosing an action - Continuous actions

Run the observation through the policy network to get the mean and standard deviation vectors for this observation. During training, sample the action from a multi-dimensional Gaussian distribution with these mean and standard deviation values. When testing, just take the mean values predicted by the network.
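The sampling rule above can be sketched as follows. This is an illustrative snippet, not rl_coach code: `mean` and `std` stand in for the vectors the policy network would predict for the current observation.

```python
import numpy as np

def choose_action(mean, std, training=True, rng=None):
    """Continuous action selection for a Gaussian policy.

    mean, std: vectors assumed to come from the policy network head
    (hypothetical inputs here). During training we sample from the
    Gaussian; at test time we act greedily by returning the mean.
    """
    mean = np.asarray(mean, dtype=np.float64)
    std = np.asarray(std, dtype=np.float64)
    if not training:
        return mean
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(loc=mean, scale=std)
```

Acting greedily at test time removes the exploration noise, so evaluation reflects the mode of the learned policy.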

Training the network

  1. Collect a large batch of experience (on the order of thousands of transitions, sampled from multiple episodes).

  2. Calculate the advantages for each transition, using the Generalized Advantage Estimation method (Schulman et al., 2015).

  3. Run a single training iteration of the value network using an L-BFGS optimizer. Unlike first order optimizers, the L-BFGS optimizer runs on the entire dataset at once, without batching. It continues running until some low loss threshold is reached. To prevent overfitting to the current dataset, the value targets are updated in a soft manner, using an Exponentially Weighted Moving Average, based on the total discounted returns of each state in each episode.

  4. Run several training iterations of the policy network. This is done by using the previously calculated advantages as targets. The loss function penalizes policies that deviate too far from the old policy (the policy that was used before starting to run the current set of training iterations) using a regularization term.

  5. After training is done, the last sampled KL divergence value is compared with the target KL divergence value in order to adapt the penalty coefficient used in the policy loss. If the KL divergence went too high, increase the penalty; if it went too low, reduce it; otherwise, leave it unchanged.
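Steps 2 and 5 above can be sketched in isolation. The GAE recursion follows the standard formulation; the KL-coefficient update uses illustrative thresholds and factors (a common scheme, not taken from the rl_coach source).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2015).

    values has one extra entry: the value estimate of the state
    following the last transition (0 for terminal states).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    # One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(rewards)
    running = 0.0
    # A_t = delta_t + gamma * lambda * A_{t+1}, accumulated backwards
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

def adapt_kl_coefficient(kl_coef, sampled_kl, target_kl, max_coef=1000.0):
    """Step 5: adapt the KL penalty coefficient after training.

    Thresholds (1.5x) and factors (2x) are illustrative assumptions.
    """
    if sampled_kl > 1.5 * target_kl:
        kl_coef = min(kl_coef * 2.0, max_coef)   # policy moved too far
    elif sampled_kl < target_kl / 1.5:
        kl_coef = kl_coef / 2.0                  # policy moved too little
    return kl_coef
```

With `lam=1` the recursion reduces to full Monte-Carlo advantage estimates; with `lam=0` it reduces to one-step TD errors, which is the bias/variance trade-off `gae_lambda` controls.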

class rl_coach.agents.ppo_agent.PPOAlgorithmParameters
Parameters
  • policy_gradient_rescaler – (PolicyGradientRescaler) This represents how the critic will be used to update the actor. The critic value function is typically used to rescale the gradients calculated by the actor. There are several ways for doing this, such as using the advantage of the action, or the generalized advantage estimation (GAE) value.

  • gae_lambda – (float) The \(\lambda\) value is used within the GAE function in order to weight different bootstrap length estimations. Typical values are in the range 0.9-1, and define an exponential decay over the different n-step estimations.

  • target_kl_divergence – (float) The target KL divergence between the current policy distribution and the new policy. PPO uses a heuristic to bring the KL divergence to this value, by adding a penalty if the KL divergence is higher.

  • initial_kl_coefficient – (float) The initial weight that will be given to the KL divergence between the current and the new policy in the regularization factor.

  • high_kl_penalty_coefficient – (float) The penalty that will be given for KL divergence values which are higher than what was defined as the target.

  • clip_likelihood_ratio_using_epsilon – (float) If not None, the likelihood ratio between the current and new policy in the PPO loss function will be clipped to the range [1-clip_likelihood_ratio_using_epsilon, 1+clip_likelihood_ratio_using_epsilon]. This is typically used in the Clipped PPO version of PPO, and should be set to None in regular PPO implementations.

  • value_targets_mix_fraction – (float) The targets for the value network are an exponentially weighted moving average which uses this mix fraction to define how much of the new targets will be taken into account when calculating the loss. This value should be set to the range (0,1], where 1 means that only the new targets will be taken into account.

  • estimate_state_value_using_gae – (bool) If set to True, the state value will be estimated using the GAE technique.

  • use_kl_regularization – (bool) If set to True, the loss function will be regularized using the KL divergence between the current and new policy, to bound the change of the policy during the network update.

  • beta_entropy – (float) An entropy regularization term can be added to the loss function in order to control exploration. This term is weighted using the \(\beta\) value defined by beta_entropy.
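The soft value-target update controlled by value_targets_mix_fraction can be sketched as a plain EWMA. This is an illustrative helper, not the rl_coach implementation:

```python
def mix_value_targets(old_targets, new_returns, mix_fraction):
    """Soft (EWMA) update of the value-network targets.

    mix_fraction in (0, 1]: the weight given to the new total
    discounted returns; 1.0 means the old targets are discarded
    entirely, smaller values dampen overfitting to the current batch.
    """
    return [(1.0 - mix_fraction) * old + mix_fraction * new
            for old, new in zip(old_targets, new_returns)]
```

A small mix fraction keeps the value targets close to their previous values, which is the overfitting safeguard described in step 3 of the training procedure.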