STEAM: Ensemble Advantage Modeling for Offline Policy Optimization#
Run the STEAM pipeline in RLinf. STEAM is an offline policy-optimization recipe that scores existing data with a pair-classification progress critic and a deep ensemble, turning the conservative worst-of-N ensemble estimate into per-frame advantage labels. Those labels then drive the same Classifier-Free Guidance (CFG) training used by RECAP.
Like RECAP, STEAM needs no online environment interaction, so it suits real-robot settings where large-scale online sampling is impractical. The difference is the value signal: instead of regressing discounted returns, STEAM learns a temporal-progress critic from frame pairs and aggregates an ensemble of critics to suppress the advantage over-estimation a single predictor would assign to out-of-distribution rollouts.
Overview#
Improve a policy offline (no new rollouts) by scoring existing data with an ensemble progress critic and steering with classifier-free guidance.
STEAM (worst-of-N ensemble)
SigLIP + Gemma3 critic · π₀.₅
LeRobot datasets
Offline · 3 stages
Pipeline#
A STEAM run is two STEAM-specific stages followed by a CFG training stage:
┌────────────────────────┐     ┌────────────────────────┐     ┌──────────────────────┐
│ Step 1                 │     │ Step 2                 │     │ Step 3               │
│ STEAM Value Model SFT  │────▶│ Compute Ensemble       │────▶│ CFG Training         │
│                        │     │ Advantages             │     │                      │
│ Train an ensemble of   │     │ Worst-of-N ensemble    │     │ Train the policy     │
│ pair-classification    │     │ signed score -> bool   │     │ with classifier-     │
│ progress critics       │     │ advantage labels       │     │ free guidance        │
└────────────────────────┘     └────────────────────────┘     └──────────────────────┘
Core Idea
Value Model SFT: Train an ensemble of progress critics (SigLIP + Gemma3 backbone + classifier head). Each member sees a frame pair \((o_t, o_{t+k})\) and classifies the signed frame stride into bins, so the head predicts temporal progress rather than a regressed return.
Compute Ensemble Advantages: For every frame, run all ensemble members on the pair \((o_t, o_{t+k})\) and aggregate with the worst-of-N rule (\(A = \min_m A_m\)), yielding a signed score advantage_continuous \(\in [-1, 1]\). Frames are then labelled positive/negative under a threshold or quantile rule.
CFG Training: Hand the advantage labels to the CFG stage, where positive (high-advantage) samples are conditional inputs and negative samples are unconditional inputs, enabling classifier-free guidance for policy optimization.
How STEAM Works#
STEAM Core Components
Advantage modeling
STEAM (Self-supervised Temporal Ensemble Advantage Modeling) learns advantages from the temporal order of expert demonstrations alone, with no rewards, human labels, or external value model. For a frame pair \((f_i, f_j)\) from an expert episode, the temporal offset is the signed frame stride \(j - i\): pairing a frame with a future frame supervises forward progress, while feeding the pair in reverse gives a negative offset, exposing regressive motion from successful demos alone. Offsets are normalized by trajectory length (\(\propto L_{\max}/L_\tau\)) so the target measures temporal efficiency rather than raw step count: shorter, more efficient executions score higher, and slower or suboptimal ones score lower.
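As a rough illustration of the target construction described above, the sketch below turns a frame pair into a bin index. It is not the RLinf implementation: the function name `offset_to_bin`, the clipping to \([-K, K]\), and the uniform bin layout are assumptions made for this example. Only the signed stride \(j - i\) and the \(L_{\max}/L_\tau\) scaling come from the text.

```python
def offset_to_bin(i, j, episode_len, max_episode_len, k=32, num_bins=32):
    """Map a frame pair (i, j) to a temporal-offset bin (hypothetical layout)."""
    # Signed frame stride: positive for forward pairs, negative for reversed pairs.
    stride = j - i
    # Length normalization: pairs from shorter episodes get larger offsets.
    scaled = stride * max_episode_len / episode_len
    # Assumption: clip to [-k, k] so the bin range is bounded.
    clipped = max(-k, min(k, scaled))
    # Assumption: uniform bins over [-k, k]; bin 0 is the most regressive offset.
    frac = (clipped + k) / (2 * k)
    return min(num_bins - 1, int(frac * num_bins))


# The same two frames, fed forward and in reverse, from an episode half as
# long as the longest one.
forward = offset_to_bin(10, 26, episode_len=200, max_episode_len=400)
backward = offset_to_bin(26, 10, episode_len=200, max_episode_len=400)
print(forward, backward)
```

Under these assumptions the forward pair lands in the top bin and the reversed pair in the bottom bin, which is the asymmetry the critic is trained to recognize.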
Each predictor (a SigLIP vision encoder + Gemma3 language model + a task-specific head) maps the frame pair and language instruction to a categorical distribution over \(N\) (num_bins) temporal-offset bins, trained with a cross-entropy loss against the binned offset target. The per-member advantage subtracts a fixed baseline offset from the predicted expected bin, so it scores progress relative to the expected pace:
\[A_m = \frac{2}{N}\left( E_{b \sim p_{\theta_m}}[b] - b_{\mathrm{ref}} \right) \in [-1, 1]\]
where \(E_{b \sim p_{\theta_m}}[b]\) is the expected bin index of predictor \(m\)'s distribution and \(b_{\mathrm{ref}}\) is the deterministic reference: the length-normalized ground-truth offset for a fixed lookahead \(H\) on the longest episode. \(A_m\) is high near efficient progress and low (or negative) near stalls and regressions. (num_bins == 2 reduces to a binary progress classifier.)
Advantage estimation
A single predictor can over-estimate on out-of-distribution rollout states. Members agree in-distribution but diverge in unfamiliar states, so STEAM aggregates the \(M\) predictors with the conservative worst-of-N rule, which penalizes high variance to suppress false positives:
\[A_{\text{STEAM}} = \min_{m \in \{1, \dots, M\}} A_m\]
\(A_{\text{STEAM}}\) is written to advantage_continuous; per-member mean / min / variance are recorded for diagnostics. Because different data sources have different advantage distributions, advantage_continuous is turned into the boolean advantage per source under one of two label_mode rules:
threshold: advantage = advantage_continuous > positive_threshold for rollout frames (a signed-score threshold in \([-1, 1]\)); sft frames are always True (success demos by construction).
quantile: label the top rollout_quantile fraction of rollout frames True and, when expert_quantile is set, the top expert_quantile fraction of sft frames True; the two pools are scored independently.
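The two formulas above can be checked on a toy case. The sketch below is only an illustration in plain Python, not the RLinf code: the helper names, the zero-based bin indexing, and the value of `b_ref` are assumptions chosen for the example.

```python
def member_advantage(probs, b_ref):
    """A_m = (2 / N) * (E[b] - b_ref) for one predictor's bin distribution."""
    n = len(probs)
    # Expected bin index under the predicted distribution (bins indexed from 0).
    expected_bin = sum(b * p for b, p in enumerate(probs))
    return 2.0 / n * (expected_bin - b_ref)


def worst_of_n(member_probs, b_ref):
    """A_STEAM = min over ensemble members of A_m."""
    return min(member_advantage(p, b_ref) for p in member_probs)


B_REF = 1.5  # hypothetical reference bin for N = 4

# In-distribution: both members put all mass on the top bin.
agree = [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
# Out-of-distribution: one member is optimistic, the other is not.
disagree = [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]]

print(worst_of_n(agree, B_REF))     # both members score 0.75
print(worst_of_n(disagree, B_REF))  # the pessimistic member wins: -0.75
```

A mean over the disagreeing members would give 0.0, so the minimum is what keeps a single optimistic predictor from marking the frame as positive.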
Classifier-Free Guidance (CFG) Training
STEAM advantage labels drive the CFG stage on the OpenPI (π₀.₅) policy: positive (high-advantage) samples serve as conditional inputs and negative samples as unconditional inputs, enabling classifier-free guidance for policy optimization. See the CFG training stage for the full CFG mechanism (positive_only_conditional, unconditional_prob, cfgrl_guidance_scale).
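For readers new to classifier-free guidance, the sketch below shows the standard way a conditional and an unconditional prediction are blended at inference time. It is the textbook formulation written with plain Python lists; whether cfgrl_guidance_scale enters the RLinf CFG stage in exactly this form is an assumption, and `cfg_combine` is a name invented for this example.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Standard CFG blend: uncond + scale * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


cond_pred = [1.0, 2.0]    # prediction conditioned on "positive advantage"
uncond_pred = [0.0, 1.0]  # prediction without the advantage condition

print(cfg_combine(cond_pred, uncond_pred, 0.0))  # pure unconditional
print(cfg_combine(cond_pred, uncond_pred, 1.0))  # pure conditional
print(cfg_combine(cond_pred, uncond_pred, 2.0))  # extrapolated past conditional
```

A scale above 1 pushes the output further in the direction that separates high-advantage behaviour from the unconditional average.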
Installation#
1. Clone RLinf Repository#
# For mainland China users, you can use the following for better download speed:
# git clone https://ghfast.top/github.com/RLinf/RLinf.git
git clone https://github.com/RLinf/RLinf.git
cd RLinf
2. Install Dependencies#
STEAM shares the OpenPI environment with RECAP.
Option 1: Docker Image
docker run -it --rm --gpus all \
--shm-size 20g \
--network host \
--name rlinf \
-v .:/workspace/RLinf \
rlinf/rlinf:agentic-rlinf0.2-maniskill_libero
# For mainland China users, you can use the following for better download speed:
# docker.1ms.run/rlinf/rlinf:agentic-rlinf0.2-maniskill_libero
Please switch to the OpenPI virtual environment via the built-in switch_env utility:
source switch_env openpi
Option 2: Custom Environment
# For mainland China users, you can add the `--use-mirror` flag to the install.sh command for better download speed.
bash requirements/install.sh embodied --model openpi --env maniskill_libero
source .venv/bin/activate
Download the Model#
The STEAM value model is built from two pretrained backbones:
SigLIP-so400m (google/siglip-so400m-patch14-384): vision encoder
Gemma3-270M (google/gemma-3-270m): language model and tokenizer
# Download models (choose either method)
# Method 1: Using git clone
git lfs install
git clone https://huggingface.co/google/siglip-so400m-patch14-384
git clone https://huggingface.co/google/gemma-3-270m
# Method 2: Using huggingface-hub
# For mainland China users, you can use the following for better download speed:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download google/siglip-so400m-patch14-384 --local-dir siglip-so400m-patch14-384
hf download google/gemma-3-270m --local-dir gemma-3-270m
Set the paths in the model config (examples/offline_rl/config/model/steam_value_model.yaml):
actor:
  model:
    vision_repo_id: /path/to/siglip-so400m-patch14-384
    language_repo_id: /path/to/gemma-3-270m
    tokenizer_path: /path/to/gemma-3-270m
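A quick sanity check before launching can catch a placeholder path that was never edited. The sketch below is a hypothetical helper, not part of RLinf: it takes the actor.model section as a plain dict (however you load the YAML) and reports which of the three path fields do not point at an existing directory.

```python
from pathlib import Path

# The three path fields shown in the config snippet above.
REQUIRED_KEYS = ("vision_repo_id", "language_repo_id", "tokenizer_path")


def missing_model_paths(model_cfg):
    """Return the keys that are unset or do not point at an existing directory."""
    missing = []
    for key in REQUIRED_KEYS:
        value = model_cfg.get(key)
        if not value or not Path(value).is_dir():
            missing.append(key)
    return missing


model_cfg = {
    "vision_repo_id": "/path/to/siglip-so400m-patch14-384",
    "language_repo_id": "/path/to/gemma-3-270m",
    "tokenizer_path": "/path/to/gemma-3-270m",
}
print(missing_model_paths(model_cfg))  # lists every placeholder still in place
```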
Data Preparation#
STEAM uses datasets in the LeRobot format, categorized into two types:
SFT datasets: Expert-level demonstrations (successful expert trajectories).
Rollout datasets: Trajectories collected from online interaction (containing both successes and failures), plus human-intervention data.
Example dataset configuration:
data:
  train_data_paths:
    - dataset_path: /path/to/sft_dataset
      type: sft
    - dataset_path: /path/to/rollout_dataset
      type: rollout
Note
Keep train_data_paths and data.k consistent between Step 1 and Step 2:
the advantage computation must score pairs at the same temporal stride the
critic was trained on.
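A mismatch between the two steps is easy to introduce with an inline override such as data.k=8. The sketch below is a hypothetical pre-flight check, not an RLinf utility: it compares the data sections of the two configs as plain dicts and covers only k and camera_keys; comparing dataset paths is left out for brevity.

```python
def check_step_consistency(step1_data, step2_data):
    """Return mismatch descriptions between the Step 1 and Step 2 data sections."""
    problems = []
    for key in ("k", "camera_keys"):
        if step1_data.get(key) != step2_data.get(key):
            problems.append(
                f"data.{key}: Step 1 has {step1_data.get(key)!r}, "
                f"Step 2 has {step2_data.get(key)!r}"
            )
    return problems


views = ["face_view", "left_wrist_view", "right_wrist_view"]
step1 = {"k": 32, "camera_keys": views}
step2 = {"k": 8, "camera_keys": views}  # stride changed for Step 2 only

for problem in check_step_consistency(step1, step2):
    print(problem)
```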
Pipeline Tag System#
STEAM uses an advantage tag for data passing across steps. Unlike RECAP,
STEAM has no compute-returns step, so there is no returns_tag; the only tag
is the advantage_tag, written by Step 2 and read by Step 3. Ensure that
Step 2's advantage.tag and Step 3's data.advantage_tag are consistent so
CFG reads meta/advantages_{tag}.parquet.
| Step | Config Field | Description |
|---|---|---|
| 2 | advantage.tag | Writes meta/advantages_{tag}.parquet |
| 3 | data.advantage_tag | Reads meta/advantages_{tag}.parquet |
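The tag is just a filename component, so a mistyped tag shows up as a missing file. The sketch below builds the path from the pattern given above and lists datasets that lack the parquet for a given tag; both helper names are hypothetical and not part of RLinf.

```python
from pathlib import Path


def advantages_path(dataset_path, tag):
    """Path of the advantages parquet for one dataset and tag."""
    return Path(dataset_path) / "meta" / f"advantages_{tag}.parquet"


def datasets_missing_tag(dataset_paths, tag):
    """Datasets for which Step 2 has not (yet) written this tag."""
    return [p for p in dataset_paths if not advantages_path(p, tag).is_file()]


tag = "steam_k32_ensemble3_q30"
print(advantages_path("/path/to/rollout_dataset", tag).as_posix())
print(datasets_missing_tag(["/path/to/sft_dataset", "/path/to/rollout_dataset"], tag))
```

Running the second helper before Step 3 with the value of data.advantage_tag gives an early warning if Step 2 used a different advantage.tag.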
Step 1: Value Model SFT#
Train the ensemble progress critic. Each member is a SigLIP + Gemma3 backbone with a classifier head; members are cloned from a shared backbone and their value heads are re-seeded so ensemble variance is a meaningful epistemic signal.
Configuration
The config is examples/offline_rl/config/steam_value_model_sft.yaml; the model
defaults live in examples/offline_rl/config/model/steam_value_model.yaml. Key fields:
data:
  train_data_paths:
    - dataset_path: /path/to/sft_dataset
      type: sft
  k: 32                        # max signed stride K (pair temporal scale)
  # Image (view) names the critic loads per frame; must match the views the
  # checkpoint was trained on. Missing views become zero-placeholders.
  camera_keys: [face_view, left_wrist_view, right_wrist_view]
actor:
  micro_batch_size: 32
  global_batch_size: 512
  model:
    num_bins: 32               # 2 = binary progress; >2 = multi-bin (even)
    ensemble_size: 3           # number of critics in the ensemble
    fusion_hidden_dim: 512
    freeze_vision_encoder: false
    freeze_language_model: false
    use_gradient_checkpointing: true
  optim:
    lr: 5.0e-5
    value_lr: 5.0e-5
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| data.k | | Max signed stride \(K\). In multi-bin mode |
| actor.model.num_bins | | Bin count. |
| actor.model.ensemble_size | | Number of ensemble members. |
Launch Command
bash examples/offline_rl/advantage_labeling/steam/run_steam_sft.sh steam_value_model_sft
# Override config fields inline:
bash examples/offline_rl/advantage_labeling/steam/run_steam_sft.sh steam_value_model_sft data.k=8
Output
Checkpoints under logs/steam_sft/{config_name}-{timestamp}/.../checkpoints/global_step_{N}/actor
TensorBoard logs
Key Metrics
train/actor/loss: cross-entropy over the signed-stride bins
train/actor/accuracy: best-bin classification accuracy
train/actor/grad_norm: gradient norm
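To make the first two metrics concrete, the sketch below computes a mean cross-entropy and an argmax accuracy for a tiny batch of bin distributions. It assumes that "best-bin accuracy" means the highest-probability bin matches the target bin; the function name and the numbers are made up for illustration and do not come from a real run.

```python
import math


def batch_metrics(batch_probs, targets):
    """Mean cross-entropy and argmax accuracy over a batch of bin distributions."""
    losses = [-math.log(p[t]) for p, t in zip(batch_probs, targets)]
    hits = [
        max(range(len(p)), key=p.__getitem__) == t
        for p, t in zip(batch_probs, targets)
    ]
    return sum(losses) / len(losses), sum(hits) / len(hits)


# Two samples over three bins; the target bin is 1 in both cases.
probs = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
targets = [1, 1]
loss, accuracy = batch_metrics(probs, targets)
print(f"loss={loss:.4f} accuracy={accuracy:.2f}")
```

The second sample is a miss (its top bin is 0) and also contributes most of the loss, which is why the two curves usually move together during training.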
Step 2: Compute Ensemble Advantages#
Run the trained ensemble over every frame and write per-frame advantage labels.
Configuration
The config is examples/offline_rl/config/steam_compute_advantages_ensemble.yaml:
advantage:
  value_checkpoint: /path/to/steam_value_ensemble/checkpoints/global_step_N/actor
  batch_size: 256
  label_mode: quantile           # required: "threshold" or "quantile"
  rollout_quantile: 0.3          # top 30% of rollout frames labelled True
  expert_quantile: 0.8           # optional: top 80% of sft frames labelled True
  tag: steam_k32_ensemble3_q30
data:
  k: 32                          # must match Step 1 data.k
  camera_keys: [face_view, left_wrist_view, right_wrist_view]
  train_data_paths:
    - dataset_path: /path/to/sft_dataset
      type: sft
    - dataset_path: /path/to/rollout_dataset
      type: rollout
Key Parameters
label_mode decides which knobs are active. In threshold mode only
advantage.positive_threshold applies β a signed-score cut in \([-1, 1]\);
rollout frames scoring above it are positive and sft frames are always positive.
In quantile mode positive_threshold is ignored and the
rollout_quantile / expert_quantile fractions select the top-scoring frames
in each pool independently (omit expert_quantile to mark every sft frame
positive).
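| Parameter | Default | Description |
|---|---|---|
| advantage.value_checkpoint | | Path to the Step 1 ensemble checkpoint |
| advantage.label_mode | | "threshold" or "quantile" |
| advantage.positive_threshold | | Signed-score threshold in \([-1, 1]\) (threshold mode) |
| advantage.rollout_quantile | | Top fraction of rollout frames labelled True (quantile mode) |
| advantage.expert_quantile | | Top fraction of sft frames labelled True (quantile mode, optional) |
| advantage.tag | | Output tag; writes meta/advantages_{tag}.parquet |
| data.k | | Pair stride; must match the Step 1 training data.k |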
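The two label_mode rules can be sketched in a few lines. This is an illustration of the rules as described in this section, not the code in rlinf/data/process/advantage.py: the function names are invented, and the quantile rule here uses an inclusive cutoff at the n-th best score, which may differ from how the real implementation handles ties and rounding.

```python
def label_threshold(scores, sources, positive_threshold):
    """threshold mode: rollout frames need score > threshold; sft frames are always True."""
    return [src == "sft" or s > positive_threshold for s, src in zip(scores, sources)]


def label_quantile(scores, top_fraction):
    """quantile mode for one pool: mark roughly the top `top_fraction` of scores True."""
    if not scores or top_fraction <= 0:
        return [False] * len(scores)
    ranked = sorted(scores, reverse=True)
    n_pos = max(1, round(top_fraction * len(ranked)))
    cutoff = ranked[min(n_pos, len(ranked)) - 1]  # assumption: inclusive cutoff
    return [s >= cutoff for s in scores]


# Ten made-up rollout scores labelled with rollout_quantile = 0.3.
rollout_scores = [0.9, -0.2, 0.4, 0.1, -0.6, 0.3, 0.7, -0.1, 0.0, 0.5]
rollout_labels = label_quantile(rollout_scores, top_fraction=0.3)
print(rollout_labels)

# threshold mode: the sft frame is positive even with a negative score.
mixed_labels = label_threshold([0.4, -0.2, -0.5], ["rollout", "rollout", "sft"], 0.0)
print(mixed_labels)
```

In quantile mode the sft pool would be passed through `label_quantile` separately with expert_quantile, matching the statement that the two pools are scored independently.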
Launch Command
# Auto-detects #GPUs; single-GPU or torchrun multi-GPU both supported.
bash examples/offline_rl/advantage_labeling/steam/process/run_compute_advantages_ensemble.sh steam_compute_advantages_ensemble
# Force a GPU count:
bash examples/offline_rl/advantage_labeling/steam/process/run_compute_advantages_ensemble.sh steam_compute_advantages_ensemble --nproc 4
Output Files
meta/advantages_{tag}.parquet: per-frame advantage (bool), advantage_continuous (signed score), ensemble_signed_score, per-member values, and ensemble entropy / variance diagnostics.
meta/mixture_config.yaml: a per-tag entry recording label_mode, the applied threshold, ensemble_size, num_bins, and positive counts.
Step 3: CFG Training#
Policy optimization runs the shared CFG stage directly on the STEAM advantage
parquets. Point the CFG config's data.advantage_tag at the Step 2
advantage.tag and launch:
bash examples/offline_rl/policy_optimization/cfg_rl/run_cfg_rl.sh cfg_rl_openpi \
data.advantage_tag=steam_k32_ensemble3_q30
See the CFG training stage for the full CFG configuration and parameters.
STEAM Results#
We evaluate STEAM against behavior cloning (BC), HG-DAgger, and RECAP on four real-robot manipulation tasks. STEAM markedly raises the task success rate over the BC baseline on every task (absolute gain over BC shown as ↑):
Success rate (%):
| Task | BC | HG-DAgger | RECAP | STEAM |
|---|---|---|---|---|
| Towel Folding | 33.3 | 40 | 55.6 | 92.3 (↑59) |
| Chips Checkout | 39.5 | 53.3 | 53.3 | 93.8 (↑54.3) |
| Pick-and-Place | 63.8 | - | 53.8 | 80 (↑16.2) |
| Cola Restocking | 52 | - | 52.9 | 75 (↑23) |
Throughput:
| Task | BC | HG-DAgger | RECAP | STEAM |
|---|---|---|---|---|
| Towel Folding | 42 | 48 | 39 | 58 |
| Chips Checkout | 16.3 | 22.0 | 23.9 | 47.5 |
| Pick-and-Place | 230 | - | 161 | 254 |
| Cola Restocking | 71 | - | 46 | 90 |
Across the four tasks STEAM raises success rates to between 75% and 93.8% and delivers the highest throughput, with the largest success-rate gains on Towel Folding (↑59) and Chips Checkout (↑54.3). (↑ marks the absolute gain over the BC baseline.)
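The quoted gains are plain differences between the STEAM and BC success rates. The short check below recomputes them from the table values, with nothing added beyond the arithmetic.

```python
# (BC, STEAM) success rates in percent, copied from the success-rate table.
success = {
    "Towel Folding": (33.3, 92.3),
    "Chips Checkout": (39.5, 93.8),
    "Pick-and-Place": (63.8, 80.0),
    "Cola Restocking": (52.0, 75.0),
}

# Absolute gain over BC in percentage points, rounded to one decimal.
gains = {task: round(steam - bc, 1) for task, (bc, steam) in success.items()}

for task, gain in gains.items():
    print(f"{task}: +{gain} points")
```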
Advanced Usage#
Merge Ensemble Checkpoints#
Members trained as separate single-model runs (or extracted from existing
ensembles) can be fused into one ensemble inference checkpoint. Each --member
is a checkpoint path, or PATH:idx to pull member idx from an ensemble:
python examples/offline_rl/advantage_labeling/steam/process/merge_steam_ensemble.py \
--member /path/to/seed1/checkpoints/global_step_5000/actor \
--member /path/to/seed2/checkpoints/global_step_5000/actor \
--member /path/to/ensemble/checkpoints/global_step_6000/actor:2 \
--output /path/to/merged/actor
The merge logic lives in
rlinf.models.embodiment.value_model.steam.checkpoint_merge.merge_ensemble_checkpoints.
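The PATH:idx convention is simple to parse by hand if you want to build the --member list programmatically. The helper below is a hypothetical sketch of that convention, not the parser used by merge_steam_ensemble.py: it treats a trailing colon followed by digits as the member index and leaves everything else as a plain path.

```python
def parse_member_spec(spec):
    """Split 'PATH' or 'PATH:idx' into (path, idx), with idx None when absent."""
    path, sep, suffix = spec.rpartition(":")
    if sep and suffix.isdigit():
        return path, int(suffix)
    return spec, None


print(parse_member_spec("/path/to/seed1/checkpoints/global_step_5000/actor"))
print(parse_member_spec("/path/to/ensemble/checkpoints/global_step_6000/actor:2"))
```

Splitting on the last colon keeps paths that contain a colon elsewhere intact.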
Threshold / Quantile Relabeling#
To change the labelling threshold without rerunning GPU inference, relabel an
existing advantages parquet (pure CPU; advantage_continuous is reused):
python examples/offline_rl/advantage_labeling/steam/process/relabel_advantages.py \
--dataset_paths /path/to/sft_ds /path/to/rollout_ds \
--source_tag steam_k32_ensemble3_q30 \
--new_tag steam_k32_ensemble3_q20 \
--mode quantile --rollout_quantile 0.2
The relabel logic lives in
examples/offline_rl/advantage_labeling/steam/process/relabel_advantages.py.
Visualize Advantages#
Render distribution, per-member, uncertainty, per-episode, and episode-timeline diagnostics from an advantages parquet:
python examples/offline_rl/advantage_labeling/steam/process/visualize_advantage.py \
--dataset /path/to/dataset \
--tag steam_k32_ensemble3_q30 \
--output outputs/steam_viz
Visualization and Results#
For metric definitions, see Training metrics.
tensorboard --logdir ./logs --port 6006
File Structure#
Like RECAP, STEAM keeps its pipeline scripts self-contained under examples/
(the inference + labelling strategy that is bound to the model), the model /
dataset code under rlinf/models and rlinf/data/datasets, and shares the
model-agnostic post-processing with RECAP via rlinf/data/process/:
examples/offline_rl/
├── config/                                       # shared production configs
│   ├── steam_value_model_sft.yaml                # Step 1
│   ├── steam_compute_advantages_ensemble.yaml    # Step 2
│   ├── cfg_rl_openpi.yaml                        # Step 3 (CFG, shared with RECAP)
│   └── model/
│       └── steam_value_model.yaml                # value model architecture defaults
├── advantage_labeling/
│   └── steam/
│       ├── train_steam.py                        # Step 1: value model SFT entry
│       ├── run_steam_sft.sh                      # Step 1 launch script
│       └── process/                              # Step 2: self-contained entries (like recap)
│           ├── compute_advantages_ensemble.py    # Step 2: ensemble inference + labelling (Hydra)
│           ├── relabel_advantages.py             # CLI: relabel advantages (CPU)
│           ├── merge_steam_ensemble.py           # CLI: merge ensemble checkpoints
│           ├── visualize_advantage.py            # advantage visualization
│           └── run_compute_advantages_ensemble.sh  # Step 2 launch script
└── policy_optimization/
    └── cfg_rl/
        ├── train_cfg.py                          # Step 3: CFG policy training
        └── run_cfg_rl.sh                         # Step 3 launch script
rlinf/
├── models/embodiment/value_model/steam/          # critic, ensemble, config, merge
│   ├── modeling_steam.py / modeling_critic.py
│   ├── ensemble_modeling_critic.py               # worst-of-N + coerce_to_ensemble
│   └── checkpoint_merge.py                       # ensemble checkpoint merge
├── data/datasets/steam/                          # pair_dataset.py, mixture.py, binning.py
└── data/process/                                 # shared, model-agnostic (RECAP + STEAM)
    ├── advantage.py                              # quantile threshold + boolean label
    ├── distributed.py                            # sharded-inference helpers
    └── mixture_config.py                         # meta/mixture_config.yaml tag I/O