Clipped Proximal Policy Optimization
Action space: Discrete | Continuous
References: Proximal Policy Optimization Algorithms
Network Structure
Algorithm Description
Choosing an action - Continuous action
Same as in PPO.
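In PPO, choosing a continuous action amounts to sampling from the Gaussian whose mean and standard deviation the policy network outputs. A minimal sketch (the helper name and argument shapes are our own, not rl_coach's API):

```python
import numpy as np

def choose_action(mean, std, rng=None):
    # As in PPO: the policy head outputs the mean and standard deviation
    # of a Gaussian over the action dimensions, and the continuous action
    # is sampled from that distribution.
    rng = rng or np.random.default_rng()
    return rng.normal(mean, std)
```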
Training the network
Very similar to PPO, with several small but simplifying changes:
Train both the value and policy networks simultaneously, by defining a single loss function that is the sum of the two networks' loss functions. Then, backpropagate gradients only once from this unified loss function.
The unified network’s optimizer is set to Adam (instead of L-BFGS for the value network as in PPO).
Value targets are now also calculated based on the GAE advantages. In this method, the \(V\) values are predicted by the critic network and then added to the GAE-based advantages, in order to get a \(Q\) value for each action. Since the critic network predicts a \(V\) value for each state, setting the calculated \(Q\) action-values as targets will, on average, serve as a \(V\) state-value target.
Instead of adapting the penalizing KL divergence coefficient used in PPO, the likelihood ratio \(r_t(\theta) =\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}\) is clipped, to achieve a similar effect. This is done by defining the policy’s loss function to be the minimum between the standard surrogate loss and an epsilon clipped surrogate loss:
\(L^{CLIP}(\theta)=E_{t}\left[\min\left(r_t(\theta)\cdot \hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t\right)\right]\)
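The clipped objective above can be written out directly. A minimal numpy sketch (the function name and the negated-loss convention are our choices for illustration, not rl_coach's API):

```python
import numpy as np

def clipped_surrogate_loss(ratio, advantages, epsilon=0.2):
    # Unclipped surrogate: r_t(theta) * A_hat_t
    unclipped = ratio * advantages
    # Clipped surrogate: clip(r_t(theta), 1-eps, 1+eps) * A_hat_t
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the elementwise minimum and negate it, so that minimizing
    # this loss maximizes the clipped objective L^CLIP.
    return -np.mean(np.minimum(unclipped, clipped))
```

For a positive advantage the objective is capped once the ratio exceeds \(1+\epsilon\), so the update has no incentive to move the policy further in that direction; this bounds the policy change similarly to the adaptive KL penalty in regular PPO.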
class rl_coach.agents.clipped_ppo_agent.ClippedPPOAlgorithmParameters

Parameters:
policy_gradient_rescaler – (PolicyGradientRescaler) This represents how the critic will be used to update the actor. The critic value function is typically used to rescale the gradients calculated by the actor. There are several ways for doing this, such as using the advantage of the action, or the generalized advantage estimation (GAE) value.
gae_lambda – (float) The \(\lambda\) value is used within the GAE function in order to weight different bootstrap length estimations. Typical values are in the range 0.9-1, and define an exponential decay over the different n-step estimations.
clip_likelihood_ratio_using_epsilon – (float) If not None, the likelihood ratio between the current and new policy in the PPO loss function will be clipped to the range [1-clip_likelihood_ratio_using_epsilon, 1+clip_likelihood_ratio_using_epsilon]. This is typically used in the Clipped PPO version of PPO, and should be set to None in regular PPO implementations.
value_targets_mix_fraction – (float) The targets for the value network are an exponential weighted moving average which uses this mix fraction to define how much of the new targets will be taken into account when calculating the loss. This value should be set to the range (0,1], where 1 means that only the new targets will be taken into account.
estimate_state_value_using_gae – (bool) If set to True, the state value will be estimated using the GAE technique.
use_kl_regularization – (bool) If set to True, the loss function will be regularized using the KL divergence between the current and new policy, to bound the change of the policy during the network update.
beta_entropy – (float) An entropy regularization term can be added to the loss function in order to control exploration. This term is weighted using the \(\beta\) value defined by beta_entropy.
optimization_epochs – (int) For each training phase, the collected dataset will be used for multiple epochs, which are defined by the optimization_epochs value.
clipping_decay_schedule – (Schedule) Can be used to define a schedule over the clipping of the likelihood ratio.
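The interaction between gae_lambda, the GAE-based value targets, and value_targets_mix_fraction can be sketched as follows. This is a simplified illustration under our own naming and conventions, not the rl_coach implementation:

```python
import numpy as np

def gae_value_targets(rewards, values, gamma=0.99, lam=0.95):
    # values holds T+1 critic predictions (a bootstrap value is appended).
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual for step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # lam exponentially decays the weight of longer n-step estimates
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Q-like targets: critic V predictions plus the GAE advantages,
    # which on average serve as V state-value targets.
    return advantages, values[:-1] + advantages

def mix_value_targets(old_targets, new_targets, mix_fraction):
    # Exponentially weighted moving average of value targets;
    # mix_fraction=1 means only the new targets are used.
    return (1.0 - mix_fraction) * old_targets + mix_fraction * new_targets
```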