DSRL: Diffusion Steering via Reinforcement Learning#
This document provides a guide for training a pre-trained Pi0 diffusion policy using DSRL (Diffusion Steering via Reinforcement Learning) in the RLinf framework. DSRL steers a frozen Pi0 policy by training a lightweight SAC agent in the latent noise space, achieving RL fine-tuning with minimal trainable parameters (~500K).
Paper: Steering Your Diffusion Policy with Latent Space Reinforcement Learning (CoRL 2025, Wagenmaker et al.)
Reference implementation: dsrl_pi0
The key idea is:
Lightweight SAC Agent: A small SAC agent (~500K params) with compact CNN/MLP encoders processes observations and generates noise in the latent space.
Noise Injection: The generated noise is fed into Pi0’s diffusion denoiser as the initial noise, replacing random sampling.
Frozen VLM Backbone: The pre-trained Pi0 VLM and diffusion expert remain frozen, preserving generalization capabilities.
SAC Training in Noise Space: The SAC agent is trained on the noise space using environment rewards, with a 10-Q-head ensemble critic for stable value estimation.
Environment#
LIBERO Spatial Environment
Environment: LIBERO Spatial benchmark
Task: Tabletop manipulation tasks with spatial reasoning
Observation: Robot proprioception (8-dim) + RGB images
Action Space: Continuous actions generated by Pi0 diffusion denoiser (steered by SAC noise)
Algorithm#
DSRL Pipeline
Observation Encoding: Lightweight CNN (64×64 → 64-dim) and state encoder (8-dim → 64-dim) process the observation.
Noise Generation: A
GaussianPolicy(SquashedNormal) generates 32-dim noise actions for each action horizon step.Diffusion Denoising: The noise is injected into Pi0’s
sample_actions()as the initial noise. The frozen diffusion denoiser converts noise into real actions.SAC Training: Standard SAC with automatic entropy tuning trains the noise generator:
Actor:
GaussianPolicywith 3-layer MLP (128-dim hidden)Critic:
CompactMultiQHead— 10 Q-network ensemble (~500K total params)Target Network: Float32 EMA shadow buffer for bfloat16 precision
Installation#
DSRL uses the same environment and model dependencies as Pi0. Please refer to RL on π0 and π0.5 Models for the full installation guide, including Docker image setup, dependency installation, and model download.
Running Scripts#
1. Configuration File
DSRL Training:
examples/embodiment/config/libero_spatial_dsrl_openpi.yaml
2. Key Parameter Configuration
2.1 DSRL Model Parameters
actor:
model:
openpi:
use_dsrl: True # Enable DSRL mode
dsrl_state_dim: 8 # Robot proprioception dimension
dsrl_action_noise_dim: 32 # Noise action dimension per step
dsrl_num_q_heads: 10 # Number of Q-heads in ensemble critic
dsrl_image_latent_dim: 64 # Image encoder output dimension
dsrl_state_latent_dim: 64 # State encoder output dimension
dsrl_hidden_dims: [128, 128, 128] # MLP hidden layer dimensions
2.2 Algorithm Parameters
algorithm:
adv_type: embodied_sac
loss_type: embodied_sac
gamma: 0.999 # Discount factor
tau: 0.005 # Target network soft update coefficient
update_epoch: 200 # Training steps per interaction
train_actor_steps: 10 # Delay actor training for this many critic updates
entropy_tuning:
alpha_type: softplus
initial_alpha: 1.0
target_entropy: -16
optim:
lr: 3.0e-4
2.3 Environment Parameters
env:
train:
total_num_envs: 16
use_step_penalty: True # Use -1/0 reward style (step penalty + termination bonus)
max_episode_steps: 240
eval:
total_num_envs: 500
use_step_penalty: True
3. Launch Command
bash examples/embodiment/run_embodiment.sh libero_spatial_dsrl_openpi
Visualization and Results#
1. TensorBoard Logs
# Start TensorBoard
tensorboard --logdir ./logs
2. Key Monitoring Metrics
Environment Metrics:
env/episode_len: The actual number of environment steps in the episodeenv/return: Total return of the episodeenv/reward: Step-level reward from the environmentenv/success_once: Flag indicating at least one success in the episode (0 or 1)
Training Metrics:
train/sac/critic_loss: Loss of the Q-function ensembletrain/critic/grad_norm: Gradient norm of the Q-functiontrain/sac/actor_loss: Policy loss (GaussianPolicy in noise space)train/actor/entropy: Policy entropytrain/actor/grad_norm: Gradient norm of the policytrain/sac/alpha_loss: Loss of the temperature parametertrain/sac/alpha: Value of the temperature parametertrain/replay_buffer/size: Current size of the replay buffertrain/replay_buffer/utilization: Utilization of the replay buffer