Training Metrics

RLinf reports metrics through the MetricLogger under a few namespaces — train/, rollout/, env/, and time/. This page defines them once; example pages link here instead of repeating the definitions.

Tip

For embodied tasks the single most useful signal is ``env/success_once`` — the unnormalized episodic success rate. Most other env/* values are hard to read directly under sparse rewards (see below).

Training metrics — train/

Policy- and value-optimization statistics, logged every actor update.

================================  ======================================================================
Metric                            Meaning
================================  ======================================================================
train/actor/approx_kl             Approximate KL divergence between the old and new policies.
train/actor/clip_fraction         Fraction of samples whose probability ratio was clipped.
train/actor/clipped_ratio         Mean of the clipped probability ratios.
train/actor/grad_norm             Gradient norm of the actor.
train/actor/lr                    Current learning rate.
train/actor/policy_loss           PPO / GRPO policy loss.
train/critic/value_loss           Value-function loss.
train/critic/value_clip_ratio     Fraction of value targets whose update was clipped.
train/critic/explained_variance   Explained variance of the value predictions (closer to 1 is better).
train/entropy_loss                Policy entropy.
train/loss                        Total training loss (actor + critic + entropy regularization).
================================  ======================================================================
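
To make the definitions above concrete, here is a minimal sketch of how the ratio-based actor statistics and the critic's explained variance are commonly computed in PPO-style training. It is not taken from RLinf's source: the function names, the ``clip_eps`` default, and the choice of KL estimator are assumptions for illustration, and RLinf may compute these values differently (for example per token, or with a different estimator).

```python
import math


def ppo_ratio_stats(old_logprobs, new_logprobs, clip_eps=0.2):
    """Illustrative PPO diagnostics from per-sample log-probabilities.

    Hypothetical helper, not RLinf's implementation.
    """
    n = len(old_logprobs)
    # Log of the probability ratio pi_new / pi_old for each sample.
    log_ratios = [new - old for old, new in zip(old_logprobs, new_logprobs)]
    ratios = [math.exp(lr) for lr in log_ratios]
    # One common non-negative KL estimator: mean of (r - 1) - log r.
    approx_kl = sum((r - 1.0) - lr for r, lr in zip(ratios, log_ratios)) / n
    # Share of samples whose ratio left the interval [1 - eps, 1 + eps].
    clip_fraction = sum(abs(r - 1.0) > clip_eps for r in ratios) / n
    # Mean of the ratios after clamping them to that interval.
    clipped = [min(max(r, 1.0 - clip_eps), 1.0 + clip_eps) for r in ratios]
    return {
        "train/actor/approx_kl": approx_kl,
        "train/actor/clip_fraction": clip_fraction,
        "train/actor/clipped_ratio": sum(clipped) / n,
    }


def explained_variance(values, returns):
    """1 - Var(returns - values) / Var(returns); 1.0 is a perfect critic."""
    n = len(returns)
    mean_ret = sum(returns) / n
    var_ret = sum((g - mean_ret) ** 2 for g in returns) / n
    if var_ret == 0.0:
        # Undefined when the returns do not vary at all.
        return float("nan")
    residuals = [g - v for g, v in zip(returns, values)]
    mean_res = sum(residuals) / n
    var_res = sum((e - mean_res) ** 2 for e in residuals) / n
    return 1.0 - var_res / var_ret


stats = ppo_ratio_stats([0.0, 0.0], [math.log(1.5), 0.0])
critic_quality = explained_variance([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])
```

In this toy batch one of the two samples has a ratio of about 1.5, which lies outside the assumed clip range, so half the batch is clipped. A critic that predicts the same value everywhere explains none of the variance in the returns, so its explained variance is 0.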

Rollout metrics — rollout/

Statistics of the advantages and rewards collected during rollout.

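=========================  ================================
Metric                     Meaning
=========================  ================================
rollout/advantages_max     Maximum advantage in the batch.
rollout/advantages_mean    Mean advantage in the batch.
rollout/advantages_min     Minimum advantage in the batch.
rollout/rewards            Reward of a rollout chunk.
=========================  ================================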
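
As a rough illustration of the advantage statistics above, the sketch below reduces a batch of advantages to its maximum, mean, and minimum. The helper names are hypothetical, and the standardization step is an assumption (a common practice in PPO/GRPO pipelines), not a statement about what RLinf does before logging.

```python
def standardize(xs, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation (assumed step)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    std = var ** 0.5
    return [(x - mean) / (std + eps) for x in xs]


def advantage_stats(advantages):
    """Reduce a batch of advantages to the three logged summary values."""
    return {
        "rollout/advantages_max": max(advantages),
        "rollout/advantages_mean": sum(advantages) / len(advantages),
        "rollout/advantages_min": min(advantages),
    }


raw = [-1.0, 0.0, 2.0]
raw_stats = advantage_stats(raw)
normalized_stats = advantage_stats(standardize(raw))
```

If advantages are standardized like this, the mean sits near 0 by construction, so the maximum and minimum are the values to watch for outliers.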

Environment metrics — env/

Task-level signals from the simulator.

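==================  ======================================================================
Metric              Meaning
==================  ======================================================================
env/success_once    Recommended. Unnormalized episodic success rate; the most direct
                    measure of task performance.
env/episode_len     Number of environment steps elapsed in the episode.
env/return          Episode return. Under sparse rewards this is near-zero until the
                    terminal success step, so it is not very informative during training.
env/reward          Step-level reward (0 on intermediate steps, 1 on success). The logged
                    value is normalized by episode length, which makes it hard to read as
                    real performance.
==================  ======================================================================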
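
The following sketch shows why ``env/success_once`` is easier to read than ``env/reward`` under a sparse 0/1 reward. It is an illustration only: the helper name is hypothetical, and it assumes that success is a per-step flag, that ``env/success_once`` records whether any step in the episode succeeded, and that the length normalization divides the return by the number of steps. RLinf's actual aggregation may differ.

```python
def episode_metrics(step_rewards, step_success):
    """Summarize one episode from per-step rewards and success flags.

    Hypothetical helper under the assumptions stated above.
    """
    episode_len = len(step_rewards)
    episode_return = sum(step_rewards)
    return {
        # 1.0 if the task was solved at any step, else 0.0.
        "env/success_once": float(any(step_success)),
        "env/episode_len": episode_len,
        "env/return": episode_return,
        # Return divided by length: shrinks as episodes get longer.
        "env/reward": episode_return / episode_len,
    }


# Two successful episodes: one solved in 5 steps, one in 50 steps.
short = episode_metrics([0.0] * 4 + [1.0], [False] * 4 + [True])
long = episode_metrics([0.0] * 49 + [1.0], [False] * 49 + [True])
```

Both episodes succeed, so both report a success value of 1.0, yet the length-normalized reward differs by a factor of ten (0.2 versus 0.02). Under these assumptions a change in ``env/reward`` can reflect episode length rather than task performance.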

See also the Logger tutorial for choosing backends (TensorBoard, Weights & Biases, SwanLab) and configuring runner.logger.