REINFORCE++
1. Introduction
REINFORCE++ is a lightweight method for RL post-training. It starts from the classical REINFORCE algorithm and borrows two key ideas from PPO — per-token KL penalties and advantage normalization — while avoiding the extra policy-clipping loss that can bias gradient estimates.
Core design choices include:
Single-response training (group_size = 1): one sampled answer per prompt.
Per-token KL penalty (default k₂): KL is subtracted from the scalar reward instead of being added as an extra loss term.
Global advantage normalization across the whole batch.
REINFORCE++ baseline: when group_size > 1, the mean reward within each prompt group is used as a baseline before global normalization.
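The REINFORCE++ baseline step can be illustrated with a minimal sketch. It assumes rewards arrive as a flat list in which the `group_size` responses to the same prompt are adjacent; the function name `subtract_group_baseline` is hypothetical and not part of any library.

```python
from statistics import mean


def subtract_group_baseline(rewards, group_size):
    """Subtract each prompt group's mean reward from its members.

    Assumes `rewards` is ordered so that the `group_size` responses
    to one prompt sit next to each other (an assumption of this sketch).
    """
    if len(rewards) % group_size != 0:
        raise ValueError("number of rewards must be a multiple of group_size")
    centered = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        baseline = mean(group)  # per-prompt baseline
        centered.extend(r - baseline for r in group)
    return centered


# Two prompts with two responses each.
centered = subtract_group_baseline([1.0, 0.0, 0.5, 0.5], group_size=2)
print(centered)  # [0.5, -0.5, 0.0, 0.0]
```

With group_size = 1 every centered reward would be zero, which is why the baseline only makes sense when several responses are sampled per prompt.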
2. Objective Function
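Let \(q\) be the prompt, \(o_{1:T}\) the generated tokens, and \(\pi_{\theta}^{\text{RL}}\) the current policy. The per-token advantage at time step \(t\) is
\[
A_{q,o_t} = r(o_{1:T}, q) - \beta \sum_{i=t}^{T} \text{KL}(i),
\]
where
\[
\text{KL}(i) = \log\frac{\pi_{\theta}^{\text{RL}}(o_i \mid q, o_{<i})}{\pi^{\text{SFT}}(o_i \mid q, o_{<i})},
\]
\(r(o_{1:T}, q)\) is the scalar reward of the full response, \(\pi^{\text{SFT}}\) is the frozen supervised-finetuned reference policy and \(\beta\) controls the strength of the KL penalty.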
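To stabilise training, the advantages are normalised across the global batch:
\[
A^{\text{norm}}_{q,o_t} = \frac{A_{q,o_t} - \text{mean}(A_{q,o_t})}{\text{std}(A_{q,o_t})},
\]
where the mean and standard deviation are taken over all tokens in the batch.
The policy is then updated with the standard REINFORCE gradient \(\nabla_{\theta}\,\log\pi_{\theta}(o_t\!\mid\!q,o_{<t})\,A^{\text{norm}}_{q,o_t}\).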
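The advantage computation described in this section can be sketched in plain Python. The sketch assumes the per-token advantage is the sequence-level reward minus \(\beta\) times the summed log-ratio between the current and reference policies from step \(t\) onward. The function names, the list-of-lists layout, and the `eps` stabiliser are assumptions made for illustration, not the framework's API.

```python
import math


def token_advantages(reward, logp_rl, logp_sft, beta):
    """A_t = reward - beta * sum_{i >= t} (logp_rl[i] - logp_sft[i])."""
    advantages = [0.0] * len(logp_rl)
    kl_to_go = 0.0
    # Walk backwards so the KL sum from t to T accumulates in one pass.
    for t in reversed(range(len(logp_rl))):
        kl_to_go += logp_rl[t] - logp_sft[t]
        advantages[t] = reward - beta * kl_to_go
    return advantages


def normalize_globally(per_response_advantages, eps=1e-8):
    """Standardise advantages over every token of every response."""
    flat = [a for adv in per_response_advantages for a in adv]
    mu = sum(flat) / len(flat)
    sigma = math.sqrt(sum((a - mu) ** 2 for a in flat) / len(flat))
    # eps (an assumption of this sketch) guards against division by zero.
    return [[(a - mu) / (sigma + eps) for a in adv]
            for adv in per_response_advantages]


# Toy batch: one two-token response with reward 1, one one-token response with reward 0.
batch = [
    token_advantages(1.0, [-1.0, -2.0], [-1.5, -2.0], beta=0.1),
    token_advantages(0.0, [-0.5], [-0.5], beta=0.1),
]
normalized = normalize_globally(batch)
print(normalized)
```

Each normalised value would then weight the log-probability gradient of its token in the REINFORCE update.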
3. Configuration
REINFORCE++
algorithm:
  adv_type: "reinpp"          # use REINFORCE++
  reinpp_kl_beta: 0.001       # KL penalty coefficient in REINFORCE++, distinct from the one used in loss computation
  use_reinpp_baseline: False  # no baseline
  group_size: 1               # one response per prompt
  kl_beta: 0.0001
data:
  rollout_batch_size: 8192
REINFORCE++ baseline
algorithm:
  adv_type: "reinpp"
  group_size: 16              # multiple responses per prompt
  kl_beta: 0.0001
data:
  rollout_batch_size: 512
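As a quick sanity check on the two configurations, the sketch below computes how many responses each would sample per rollout step. It assumes that rollout_batch_size counts prompts rather than responses, which is not stated in this section; the helper name is hypothetical.

```python
def responses_per_step(rollout_batch_size, group_size):
    """Responses sampled per rollout step, assuming the batch size counts prompts."""
    return rollout_batch_size * group_size


print(responses_per_step(8192, 1))   # 8192
print(responses_per_step(512, 16))   # 8192
```

Under that assumption both settings sample the same number of responses per step, so the global normalization sees batches of comparable size.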
4. Notes
REINFORCE++ adopts the so-called \(k_1\) KL. The GRPO algorithm uses a \(k_3\) form that mixes on-policy and reference probabilities, but that estimator is biased. Using \(k_1\) KL keeps the update unbiased while still discouraging large policy shifts.
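For reference, the per-sample estimators commonly labelled \(k_1\), \(k_2\) and \(k_3\) (the naming follows John Schulman's note on approximating KL divergence) can be written out for a single token sampled from the current policy. This is a minimal sketch; the function name and scalar interface are assumptions for illustration.

```python
import math


def kl_estimators(logp_policy, logp_ref):
    """Per-sample KL(policy || ref) estimators for a token drawn from the policy."""
    log_ratio = logp_policy - logp_ref          # log(pi_theta / pi_ref)
    k1 = log_ratio                               # plain log-ratio, can be negative
    k2 = 0.5 * log_ratio ** 2                    # half squared log-ratio, never negative
    k3 = math.exp(-log_ratio) - 1.0 + log_ratio  # (r - 1) - log r with r = pi_ref / pi_theta
    return k1, k2, k3


# A token the policy finds less likely than the reference does.
k1, k2, k3 = kl_estimators(logp_policy=-1.5, logp_ref=-1.0)
print(k1, k2, k3)
```

A single \(k_1\) sample can be negative, as in this example, whereas \(k_2\) and \(k_3\) are non-negative for every sample.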