DSRL: Diffusion Steering via Reinforcement Learning

This guide explains how to adapt a pre-trained Pi0 diffusion policy with DSRL (Diffusion Steering via Reinforcement Learning) in the RLinf framework. DSRL keeps the Pi0 policy frozen and trains a lightweight SAC agent in the latent noise space, so RL fine-tuning requires only about 500K trainable parameters.

Paper: Steering Your Diffusion Policy with Latent Space Reinforcement Learning (CoRL 2025, Wagenmaker et al.)

Reference implementation: dsrl_pi0

The key ideas are:

  1. Lightweight SAC Agent: A small SAC agent (~500K params) with compact CNN/MLP encoders processes observations and generates noise in the latent space.

  2. Noise Injection: The generated noise is fed into Pi0’s diffusion denoiser as the initial noise, replacing random sampling.

  3. Frozen VLM Backbone: The pre-trained Pi0 VLM and diffusion expert remain frozen, preserving generalization capabilities.

  4. SAC Training in Noise Space: The SAC agent is trained in the noise space using environment rewards. Its critic is an ensemble of 10 Q-heads, which stabilizes value estimation.
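
The steering idea in items 2 and 3 can be illustrated with a toy example. This is a minimal sketch, not Pi0 or RLinf code: `toy_frozen_denoiser` and `toy_sample_actions` are hypothetical stand-ins, and the "denoiser" is just a fixed Euler integration. The point it shows is that a frozen, deterministic denoiser maps different initial noise to different actions, so an agent that chooses the noise can change the behavior without touching the denoiser's weights.

```python
import random


def toy_frozen_denoiser(obs, noise, num_steps=10):
    """Hypothetical stand-in for a frozen diffusion denoiser.

    Deterministically integrates from the initial noise towards an
    observation-dependent target, so the output depends on both inputs.
    """
    dt = 1.0 / num_steps
    x = list(noise)
    for _ in range(num_steps):
        # Euler step along a fixed (never trained) velocity field.
        x = [xi + dt * (oi - xi) for xi, oi in zip(x, obs)]
    return x


def toy_sample_actions(obs, noise=None, rng=None):
    """Without `noise`, draw it at random; with `noise`, use it as given."""
    if noise is None:
        rng = rng or random.Random()
        noise = [rng.gauss(0.0, 1.0) for _ in obs]
    return toy_frozen_denoiser(obs, noise)


obs = [0.5, -0.2]
baseline = toy_sample_actions(obs, rng=random.Random(0))  # random initial noise
steered = toy_sample_actions(obs, noise=[1.0, -1.0])      # noise chosen by an agent
print("baseline:", baseline)
print("steered: ", steered)
```

In DSRL the chosen noise would come from the SAC actor rather than being hard-coded as it is here.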

Environment

LIBERO Spatial Environment

  • Environment: LIBERO Spatial benchmark

  • Task: Tabletop manipulation tasks with spatial reasoning

  • Observation: Robot proprioception (8-dim) + RGB images

  • Action Space: Continuous actions generated by Pi0 diffusion denoiser (steered by SAC noise)

Algorithm

DSRL Pipeline

  1. Observation Encoding: Lightweight CNN (64×64 → 64-dim) and state encoder (8-dim → 64-dim) process the observation.

  2. Noise Generation: A GaussianPolicy (SquashedNormal) generates 32-dim noise actions for each action horizon step.

  3. Diffusion Denoising: The noise is injected into Pi0’s sample_actions() as the initial noise. The frozen diffusion denoiser converts noise into real actions.

  4. SAC Training: Standard SAC with automatic entropy tuning trains the noise generator:

    • Actor: GaussianPolicy with 3-layer MLP (128-dim hidden)

    • Critic: CompactMultiQHead, an ensemble of 10 Q-networks (~500K total params)

    • Target Network: EMA update accumulated in a float32 shadow buffer, so that small soft updates are not lost to bfloat16 rounding
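
The float32 shadow buffer for the target network is easier to motivate with numbers. The sketch below is an illustration under stated assumptions, not the RLinf implementation: it emulates bfloat16 rounding with the standard library (`to_bfloat16` is a hypothetical helper) and applies a soft update with `tau = 0.005` to a single scalar weight. Near 1.0 the bfloat16 grid spacing is 2^-7, so an update of size 0.0025 rounds back to the old value and the target never moves, whereas a float32 shadow accumulates the updates and is cast to bfloat16 only at the end.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bfloat16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Rounding bias, then drop the low 16 mantissa bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def ema_in_bf16(target, online, tau, steps):
    """Soft update where the target is stored in bfloat16 at every step."""
    for _ in range(steps):
        target = to_bfloat16((1.0 - tau) * target + tau * online)
    return target


def ema_with_fp32_shadow(target, online, tau, steps):
    """Soft update accumulated in float32, cast to bfloat16 at the end."""
    shadow = to_float32(target)
    for _ in range(steps):
        shadow = to_float32((1.0 - tau) * shadow + tau * online)
    return to_bfloat16(shadow)


print("bf16 only:  ", ema_in_bf16(1.0, 1.5, tau=0.005, steps=1000))
print("fp32 shadow:", ema_with_fp32_shadow(1.0, 1.5, tau=0.005, steps=1000))
```

With these values the bfloat16-only target stays at 1.0, while the shadowed target ends up close to the online weight of 1.5.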

Installation

DSRL uses the same environment and model dependencies as Pi0. See the "RL on π0 and π0.5 Models" guide for full installation instructions, including Docker image setup, dependency installation, and model download.

Running Scripts

1. Configuration File

  • DSRL Training: examples/embodiment/config/libero_spatial_dsrl_openpi.yaml

2. Key Parameter Configuration

2.1 DSRL Model Parameters

actor:
  model:
    openpi:
      use_dsrl: True              # Enable DSRL mode
      dsrl_state_dim: 8           # Robot proprioception dimension
      dsrl_action_noise_dim: 32   # Noise action dimension per step
      dsrl_num_q_heads: 10        # Number of Q-heads in ensemble critic
      dsrl_image_latent_dim: 64   # Image encoder output dimension
      dsrl_state_latent_dim: 64   # State encoder output dimension
      dsrl_hidden_dims: [128, 128, 128]  # MLP hidden layer dimensions
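
To make `dsrl_action_noise_dim: 32` concrete, the following sketch draws one 32-dimensional noise vector from a tanh-squashed Gaussian, the kind of distribution the SquashedNormal policy name suggests, and computes its log-probability. It is a standard-library illustration under assumptions: the function names are hypothetical, the mean and log-std would normally come from the actor MLP, and any rescaling RLinf applies after squashing is not shown.

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_squashed_gaussian(mean, log_std, rng):
    """Sample a = tanh(u) with u ~ N(mean, std^2); return (a, log_prob(a))."""
    action, log_prob = [], 0.0
    for mu, ls in zip(mean, log_std):
        eps = rng.gauss(0.0, 1.0)
        u = mu + math.exp(ls) * eps
        # Gaussian log-density of the pre-squash sample u.
        log_prob += -0.5 * eps * eps - ls - 0.5 * math.log(2.0 * math.pi)
        # Change of variables for tanh: subtract log(1 - tanh(u)^2),
        # written in the stable form 2 * (log 2 - u - softplus(-2u)).
        log_prob -= 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
        action.append(math.tanh(u))
    return action, log_prob


noise_dim = 32  # matches dsrl_action_noise_dim
rng = random.Random(0)
noise, log_prob = sample_squashed_gaussian([0.0] * noise_dim, [0.0] * noise_dim, rng)
print(len(noise), min(noise), max(noise), log_prob)
```

The log-probability is what SAC uses for the entropy term in the actor and temperature losses.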

2.2 Algorithm Parameters

algorithm:
  adv_type: embodied_sac
  loss_type: embodied_sac
  gamma: 0.999             # Discount factor
  tau: 0.005               # Target network soft update coefficient
  update_epoch: 200        # Training steps per interaction
  train_actor_steps: 10    # Delay actor training for this many critic updates
  entropy_tuning:
    alpha_type: softplus
    initial_alpha: 1.0
    target_entropy: -16
    optim:
      lr: 3.0e-4

2.3 Environment Parameters

env:
  train:
    total_num_envs: 16
    use_step_penalty: True  # Use -1/0 reward style (step penalty + termination bonus)
    max_episode_steps: 240
  eval:
    total_num_envs: 500
    use_step_penalty: True
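
The effect of `use_step_penalty` on returns can be sketched with a few lines of Python. The convention below is an assumption based on the config comment: a reward of -1 on every step, 0 on the step where the task succeeds, and the episode ending at success or after `max_episode_steps`. The helper names are hypothetical. Under this convention, faster successes get strictly higher returns, and a timeout gets the lowest.

```python
def step_penalty_rewards(success_step, max_episode_steps=240):
    """Rewards for one episode: -1 per step, 0 on the success step (assumed)."""
    rewards = []
    for t in range(1, max_episode_steps + 1):
        if success_step is not None and t == success_step:
            rewards.append(0.0)
            break  # assume the episode terminates on success
        rewards.append(-1.0)
    return rewards


def discounted_return(rewards, gamma=0.999):
    """Discounted sum of rewards, accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


fast = discounted_return(step_penalty_rewards(success_step=50))
slow = discounted_return(step_penalty_rewards(success_step=200))
timeout = discounted_return(step_penalty_rewards(success_step=None))
print(fast, slow, timeout)
```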

3. Launch Command

bash examples/embodiment/run_embodiment.sh libero_spatial_dsrl_openpi

Visualization and Results

1. TensorBoard Logs

# Start TensorBoard
tensorboard --logdir ./logs

2. Key Monitoring Metrics

  • Environment Metrics:

    • env/episode_len: The actual number of environment steps in the episode

    • env/return: Total return of the episode

    • env/reward: Step-level reward from the environment

    • env/success_once: Flag indicating at least one success in the episode (0 or 1)

  • Training Metrics:

    • train/sac/critic_loss: Loss of the Q-function ensemble

    • train/critic/grad_norm: Gradient norm of the Q-function

    • train/sac/actor_loss: Policy loss (GaussianPolicy in noise space)

    • train/actor/entropy: Policy entropy

    • train/actor/grad_norm: Gradient norm of the policy

    • train/sac/alpha_loss: Loss of the temperature parameter

    • train/sac/alpha: Value of the temperature parameter

    • train/replay_buffer/size: Current size of the replay buffer

    • train/replay_buffer/utilization: Fraction of the replay buffer capacity currently filled