RL on π0 and π0.5 Models#
This example provides a complete guide to fine-tuning the π0 and π0.5 models with reinforcement learning using the RLinf framework. It covers the entire process, from environment inputs and core algorithms through training script configuration to evaluation and visualization, along with reproducible commands and configuration snippets.
For the detailed technical report, please refer to the paper: πRL: ONLINE RL FINE-TUNING FOR FLOW-BASED VISION-LANGUAGE-ACTION MODELS.
The primary objective is to develop a model capable of performing robotic manipulation by:
Visual Understanding: Processing RGB images from the robot’s camera.
Language Comprehension: Interpreting natural-language task descriptions.
Action Generation: Producing precise robotic actions (position, rotation, gripper control).
Reinforcement Learning: Optimizing the policy via PPO with environment feedback.
Environment#
LIBERO Environment
Environment: LIBERO simulation benchmark built on top of robosuite (MuJoCo).
Task: Command a 7-DoF robotic arm to perform a variety of household manipulation skills (pick-and-place, stacking, opening drawers, spatial rearrangement).
Observation: RGB images (typical resolutions 128 × 128 or 224 × 224) captured by off-screen cameras placed around the workspace.
Action Space: 7-dimensional continuous actions
- 3D end-effector position control (x, y, z)
- 3D rotation control (roll, pitch, yaw)
- Gripper control (open / close)
ManiSkill3 Environment
Environment: ManiSkill3 simulation platform
Task: Control a robotic arm to grasp various objects
Observation: RGB images (224 × 224) from third-person camera
Action Space: 7-dimensional continuous actions
- 3D position control (x, y, z)
- 3D rotation control (roll, pitch, yaw)
- Gripper control (open / close)
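The 7-dimensional action layout shared by both environments can be sketched as a small Python helper. The class name `ArmAction`, the function `split_action`, and the assumption that the components appear in the order listed above (position, rotation, gripper) are illustrative only and are not part of RLinf or either simulator.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ArmAction:
    """Hypothetical container for one 7-D action (names are illustrative)."""
    position: Tuple[float, float, float]  # (x, y, z)
    rotation: Tuple[float, float, float]  # (roll, pitch, yaw)
    gripper: float                        # open / close command


def split_action(action: Sequence[float]) -> ArmAction:
    """Split a flat 7-D action into position, rotation, and gripper parts.

    Assumes the component order follows the listing in the text.
    """
    if len(action) != 7:
        raise ValueError(f"expected 7 values, got {len(action)}")
    x, y, z, roll, pitch, yaw, gripper = (float(v) for v in action)
    return ArmAction((x, y, z), (roll, pitch, yaw), gripper)
```

For instance, `split_action([0.1, 0.2, 0.3, 0.0, 0.0, 1.0, -1.0]).gripper` returns the last component of the vector.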
Task Description Format
π0 and π0.5 directly use the environment-provided natural-language task description as the language model input.
Data Structure
Images: Main-view and wrist-view RGB tensors, each of shape [batch_size, 224, 224, 3]
States: In LIBERO, states include end-effector pose (position + orientation) and gripper state. In ManiSkill3, states are robot joint angles.
Task Descriptions: Natural-language instructions
Rewards: Sparse success/failure rewards
Algorithm#
Core Algorithm Components
PPO (Proximal Policy Optimization)
Advantage estimation using GAE (Generalized Advantage Estimation)
Policy clipping with ratio limits
Value function clipping
Entropy regularization
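The PPO components listed above can be illustrated with a minimal pure-Python sketch. It uses the textbook formulas for GAE, the clipped policy objective, and the clipped value loss. The function names and the default values of `gamma`, `lam`, and `clip_eps` are assumptions chosen for illustration, not RLinf's actual implementation or defaults, and entropy regularization is omitted for brevity.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one trajectory with no early termination."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def clipped_policy_loss(ratio, advantage, clip_eps=0.2):
    """PPO surrogate loss for one sample; ratio = pi_new(a|s) / pi_old(a|s)."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return -min(unclipped, clipped)


def clipped_value_loss(value, old_value, target, clip_eps=0.2):
    """Value loss that limits how far the new value may move from the old one."""
    clipped_value = old_value + max(-clip_eps, min(clip_eps, value - old_value))
    return 0.5 * max((value - target) ** 2, (clipped_value - target) ** 2)
```

With a sparse reward that arrives only at the final step, `compute_gae` propagates that signal back to earlier steps, which is why GAE pairs naturally with the success/failure rewards used here.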
GRPO (Group Relative Policy Optimization)
For every state / prompt, the policy generates G independent actions.
The advantage of each action is computed by subtracting the group’s mean reward.
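The group-relative advantage described above can be sketched in a few lines. The function name `group_relative_advantages` is hypothetical. The optional division by the group's standard deviation is a common GRPO variant and is included here as an assumption; the text above only states the mean subtraction.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=False, eps=1e-6):
    """Advantages for G rollouts of the same state / prompt.

    Each reward is centered on the group mean. Setting normalize_std=True
    additionally divides by the group's standard deviation (an assumed,
    commonly used variant that the text above does not describe).
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if normalize_std:
        scale = pstdev(rewards) + eps
        centered = [c / scale for c in centered]
    return centered
```

Because the baseline is the group mean, the advantages of a group always sum to zero, so no separate value network is needed to estimate them.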
Dependency Installation#
1. Clone RLinf Repository#
# For mainland China users, you can use the following for better download speed:
# git clone https://ghfast.top/github.com/RLinf/RLinf.git
git clone https://github.com/RLinf/RLinf.git
cd RLinf
2. Install Dependencies#
Option 1: Docker Image
Use the Docker image to run the experiment.
docker run -it --rm --gpus all \
--shm-size 20g \
--network host \
--name rlinf \
-v .:/workspace/RLinf \
rlinf/rlinf:agentic-rlinf0.2-maniskill_libero
# For mainland China users, you can use the following for better download speed:
# docker.1ms.run/rlinf/rlinf:agentic-rlinf0.2-maniskill_libero
Please switch to the corresponding virtual environment via the built-in switch_env utility in the image:
source switch_env openpi
Option 2: Custom Environment
Install dependencies directly in your environment by running the following command:
# For mainland China users, you can add the `--use-mirror` flag to the install.sh command for better download speed.
bash requirements/install.sh embodied --model openpi --env maniskill_libero
source .venv/bin/activate
Model Download#
Before starting training, you need to download the corresponding pretrained models. For example, for Spatial, Object, Goal task types in the LIBERO environment, you can download them as follows:
# Download the Spatial-Object-Goal model (choose either method)
# Method 1: Using git clone
git lfs install
git clone https://huggingface.co/RLinf/RLinf-Pi0-LIBERO-Spatial-Object-Goal-SFT
# Method 2: Using huggingface-hub
# For mainland China users, you can use the following for better download speed:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download RLinf/RLinf-Pi0-LIBERO-Spatial-Object-Goal-SFT --local-dir RLinf-Pi0-LIBERO-Spatial-Object-Goal-SFT
Alternatively, you can download the model from ModelScope: https://www.modelscope.cn/models/RLinf/RLinf-Pi0-SFT-Spatial-Object-Goal.
Of course, RLinf also provides pretrained models for other environments. The model list is as follows:
| Environment | Task Description | SFT Model | Flow-SDE | Flow-Noise |
|---|---|---|---|---|
| LIBERO | Spatial, Object, Goal | | | |
| LIBERO | Long | | | |
| ManiSkill3 | Multi-task | | | |
| MetaWorld | MT50 | | | |
| CALVIN | ABC-D | | | |

| Environment | Task Description | SFT Model | Flow-SDE | Flow-Noise |
|---|---|---|---|---|
| LIBERO | Spatial, Object, Goal, Long | | | |
| ManiSkill3 | Multi-task | | | |
| MetaWorld | MT50 | | | |
| CALVIN | ABC-D | | | |
After downloading, please make sure to specify the model path correctly in your configuration file.
Running Scripts#
1. Key Cluster Configuration
cluster:
  num_nodes: 1
  component_placement:
    env: 0-3
    rollout: 4-7
    actor: 0-7
rollout:
  pipeline_stage_num: 2
Here you can flexibly configure the GPU count for env, rollout, and actor components. Additionally, by setting pipeline_stage_num = 2 in the configuration, you can achieve pipeline overlap between rollout and env, improving rollout efficiency.
cluster:
  num_nodes: 1
  component_placement:
    env,rollout,actor: all
You can also reconfigure the placement to achieve complete sharing, where env, rollout, and actor components all share all GPUs.
cluster:
  num_nodes: 1
  component_placement:
    env: 0-1
    rollout: 2-5
    actor: 6-7
You can also reconfigure the placement to achieve complete separation, where env, rollout, and actor components each use their own GPUs without interference, eliminating the need for offload functionality.
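To make the difference between the three placements concrete, the following sketch expands placement strings into GPU ids and reports which components share a GPU. It is a standalone illustration, not RLinf's placement parser: the helpers `parse_gpu_range` and `shared_gpus`, the per-component dictionary input, and the 8-GPU node size are all assumptions.

```python
def parse_gpu_range(spec, total_gpus):
    """Expand a placement string such as "0-3", "5", or "all" into a set of GPU ids."""
    if spec == "all":
        return set(range(total_gpus))
    ids = set()
    for part in str(spec).split(","):
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids


def shared_gpus(placement, total_gpus=8):
    """Map each pair of components to the sorted GPU ids they both occupy."""
    expanded = {name: parse_gpu_range(spec, total_gpus)
                for name, spec in placement.items()}
    names = sorted(expanded)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            common = expanded[first] & expanded[second]
            if common:
                overlaps[(first, second)] = sorted(common)
    return overlaps


# First placement above: actor shares GPUs with both env and rollout.
hybrid = shared_gpus({"env": "0-3", "rollout": "4-7", "actor": "0-7"})
# Fully separated placement: no pair of components shares a GPU.
separated = shared_gpus({"env": "0-1", "rollout": "2-5", "actor": "6-7"})
```

An empty result, as for the separated placement, corresponds to the case described above in which offloading is not needed.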
2. Model Key Parameter Configuration
2.1 Model Parameters
openpi:
  noise_level: 0.5 # default noise intensity for flow_sde
  noise_logvar_range: [0.08, 0.16] # default learnable noise range for flow_noise
  action_chunk: ${actor.model.num_action_chunks}
  num_steps: ${actor.model.num_steps}
  train_expert_only: True
  action_env_dim: ${actor.model.action_dim}
  noise_method: "flow_sde" # flow_sde, flow_noise
  add_value_head: False
  pi05: False
  value_after_vlm: False
- Set different flow-matching steps via num_steps.
- Use different noise injection methods by modifying noise_method. We provide two options: flow_sde and flow_noise.
- noise_level controls the noise intensity for flow_sde, and noise_logvar_range controls the learnable noise range for flow_noise.
- Enable the π0.5 model by setting pi05: True.
- Control the critic position via value_after_vlm: when True, the critic is connected after the VLM module output; when False, the critic input is from the action expert module output.
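As a rough intuition for how the number of flow-matching steps and an injected noise scale interact, the sketch below integrates a toy velocity field with Euler steps and optionally adds Gaussian noise at each step. It is a deliberately simplified illustration and is not the Flow-SDE or Flow-Noise formulation from the paper: the integration direction, the `sqrt(dt)` noise scaling, the toy velocity field, and all function names are assumptions. The only point it makes is that noisy transitions are Gaussian, so their log-probability can be computed, which a policy-gradient method requires.

```python
import math
import random


def sample_action(velocity_fn, x0, num_steps, noise_level=0.0, rng=None):
    """Euler-integrate from t=0 to t=1; optionally inject Gaussian noise per step.

    Returns (final sample, summed log-probability of the noisy transitions).
    The log-probability is 0.0 when noise_level == 0 (deterministic sampling).
    """
    rng = rng or random.Random(0)
    dt = 1.0 / num_steps
    x = list(x0)
    logprob = 0.0
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        step_mean = [xi + vi * dt for xi, vi in zip(x, v)]
        if noise_level > 0.0:
            std = noise_level * math.sqrt(dt)  # assumed scaling, illustration only
            x = [rng.gauss(m, std) for m in step_mean]
            logprob += sum(
                -0.5 * ((xi - m) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
                for xi, m in zip(x, step_mean)
            )
        else:
            x = step_mean
    return x, logprob


def toward(target):
    """Toy velocity field that moves straight toward a fixed target."""
    return lambda x, t: [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]
```

With `noise_level=0.0` the sampler is deterministic and carries no usable likelihood; with a positive value, each call yields a slightly different action together with a finite log-probability.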
2.2 Algorithm Configuration
In the paper, we provide two technical approaches, flow-noise and flow-sde, to fine-tune π0 and π0.5 models. Specifically, you can choose different technical approaches by switching the following configuration:
algorithm:
  entropy_bonus: 0.0 # entropy regularization coefficient, set to 0.0 for flow-sde, 0.005 for flow-noise
openpi:
  noise_method: "flow_sde" # [flow_sde,flow_noise] noise injection method, flow-sde introduces noise through ode-sde transformation, flow-noise introduces noise through noise network
  noise_level: 0.5 # noise intensity for flow-sde
  noise_logvar_range: [0.08, 0.16] # learnable noise range for flow-noise
  joint_logprob: False # whether to optimize joint probability density function. For flow-sde, please set to False. For flow-noise, please set to True.
For example, for complete parameter settings of flow-sde, please refer to libero_spatial_ppo_openpi.yaml; for complete parameter settings of flow-noise, please refer to maniskill_ppo_openpi.yaml.
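Since the two approaches require matching values for `entropy_bonus` and `joint_logprob`, a small consistency check can catch a half-switched configuration. The sketch below encodes only the pairings stated in the comments above. The helper `check_flow_settings` and the table `RECOMMENDED` are hypothetical and not part of RLinf.

```python
# Pairings taken from the configuration comments above.
RECOMMENDED = {
    "flow_sde": {"entropy_bonus": 0.0, "joint_logprob": False},
    "flow_noise": {"entropy_bonus": 0.005, "joint_logprob": True},
}


def check_flow_settings(noise_method, entropy_bonus, joint_logprob):
    """Return a list of mismatches against the recommended values (empty if consistent)."""
    if noise_method not in RECOMMENDED:
        raise ValueError(f"unknown noise_method: {noise_method!r}")
    expected = RECOMMENDED[noise_method]
    actual = {"entropy_bonus": entropy_bonus, "joint_logprob": joint_logprob}
    return [
        f"{key}: expected {expected[key]!r} for {noise_method}, got {actual[key]!r}"
        for key in expected
        if actual[key] != expected[key]
    ]
```

For example, switching `noise_method` to `flow_noise` while leaving the other two values at their flow-sde settings produces two mismatch messages.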
2.3 LoRA Settings
model:
  is_lora: True
  lora_rank: 8
  gradient_checkpointing: False
If you want to use LoRA (Low-Rank Adaptation) to fine-tune the VLM part, set is_lora: True and configure the lora_rank parameter. Gradient checkpointing is currently not supported, so keep gradient_checkpointing: False.
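To give a sense of what `lora_rank` trades off, the sketch below counts trainable parameters for one linear layer under the standard LoRA decomposition, in which a frozen weight of shape `d_out × d_in` is adapted by two low-rank factors of shapes `d_out × rank` and `rank × d_in`. The layer size used in the example is hypothetical and does not describe the actual π0 or π0.5 layers.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full weight params, LoRA params, LoRA / full) for one linear layer."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # factor A: rank x d_in, factor B: d_out x rank
    return full, lora, lora / full


# Hypothetical 2048 x 2048 projection adapted with lora_rank = 8.
full, lora, fraction = lora_param_counts(2048, 2048, 8)
print(f"full={full}, lora={lora}, fraction={fraction:.4%}")
```

Doubling the rank doubles the number of trainable adapter parameters, so `lora_rank` is the main knob for balancing adapter capacity against memory.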
⭐ 2.4 Minimum Test Case ⭐
If you encounter OOM errors or want to implement a minimum test case with as few resources as possible, you can refer to libero_spatial_ppo_openpi_quickstart.yaml.
Compared to the standard task configuration, we have made the following modifications:
rollout_epoch: 8 -> 2
total_num_envs: 64 -> 32
micro_batch_size: 128 -> 64
global_batch_size: 2048 -> 256
lr: 5e-6 -> 1e-6
actor.enable_offload: False -> True
rollout.enable_offload: False -> True
On 4 H100 GPUs, we compared the standard parameters with the minimum test parameters and found that they reach nearly the same performance in the same amount of wall-clock time (the minimum test parameters run each round faster but need more rounds to converge).
If you still encounter OOM issues under the minimum parameter configuration, we provide the following solutions:
If OOM occurs during the rollout stage:
- Try replacing the rendering engine from egl to osmesa
- Further reduce total_num_envs from 32 to 16, but increase rollout_epoch from 2 to 4 to ensure the total number of environments per rollout round remains consistent
- Check if actor’s enable_offload is enabled, and set it to True if it is False
If OOM occurs during the actor stage:
- Try reducing micro_batch_size from 64 to 32, keeping global_batch_size at 256
- Check if rollout’s enable_offload is enabled, and set it to True if it is False
Note
If you encounter a mismatch between micro_batch_size and global_batch_size, ensure that global_batch_size is an integer multiple of micro_batch_size × number of GPUs.
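The two arithmetic constraints in this section can be checked before launching a run. The sketch below verifies that trading `total_num_envs` for `rollout_epoch` keeps the per-round environment count unchanged, and that `global_batch_size` is an integer multiple of `micro_batch_size × number of GPUs`. The function names are hypothetical, and interpreting the quotient as the number of gradient-accumulation steps is an assumption.

```python
def envs_per_round(total_num_envs, rollout_epoch):
    """Total number of environments rolled out in one round."""
    return total_num_envs * rollout_epoch


def accumulation_steps(global_batch_size, micro_batch_size, num_gpus):
    """Micro-batches each GPU processes per global batch.

    Raises ValueError when global_batch_size is not an integer multiple
    of micro_batch_size * num_gpus.
    """
    per_step = micro_batch_size * num_gpus
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of "
            f"micro_batch_size * num_gpus = {per_step}"
        )
    return global_batch_size // per_step


# Halving the environments while doubling the epochs keeps the round size fixed.
assert envs_per_round(32, 2) == envs_per_round(16, 4)

# Minimum test case on 4 GPUs, before and after reducing micro_batch_size.
print(accumulation_steps(256, 64, 4))
print(accumulation_steps(256, 32, 4))
```

Running the same check with 8 GPUs and `micro_batch_size = 64` raises an error, because 64 × 8 = 512 exceeds the global batch size of 256.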
2.5 Model Evaluation
For models after SFT or RL training, we provide two evaluation methods:
- Use RLinf’s unified evaluation script; refer to the VLA Evaluation Documentation. This method evaluates environments in parallel, so it is fast, but it only reports the success rate over the entire task.
Note
MetaWorld currently does not support the evaluation mode with env.eval.auto_reset=True. It is recommended to use individual script files for model evaluation.
- Use individual script files for model evaluation; refer to the example README.md. These evaluation scripts are consistent with the official evaluation scripts provided by openpi and report the success rate of each subtask, but they are slower.
3. Configuration Files
Using libero-10 as an example, the configuration files for π0 and π0.5 are:
- π0 + PPO: examples/embodiment/config/libero_10_ppo_openpi.yaml
- π0 + GRPO: examples/embodiment/config/libero_10_grpo_openpi.yaml
- π0.5 + PPO: examples/embodiment/config/libero_10_ppo_openpi_pi05.yaml
- π0.5 + GRPO: examples/embodiment/config/libero_10_grpo_openpi_pi05.yaml
4. Launch Command
To start training with a chosen configuration, run the following command:
bash examples/embodiment/run_embodiment.sh CHOSEN_CONFIG
For example, to train the π0 model using the PPO algorithm in the LIBERO environment, run:
bash examples/embodiment/run_embodiment.sh libero_spatial_ppo_openpi_quickstart
Visualization and Results#
1. TensorBoard Logging
# Launch TensorBoard
tensorboard --logdir ./logs --port 6006
2. Key Monitoring Metrics
Training Metrics
- actor/loss: Policy loss
- actor/value_loss: Value function loss (PPO)
- actor/grad_norm: Gradient norm
- actor/approx_kl: KL divergence between old and new policies
- actor/pg_clipfrac: Policy clipping ratio
- actor/value_clip_ratio: Value loss clipping ratio (PPO)
Rollout Metrics
- rollout/returns_mean: Average episode return
- rollout/advantages_mean: Mean advantage value
Environment Metrics
- env/episode_len: Average episode length
- env/success_once: Task success rate
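To help interpret `actor/approx_kl` and `actor/pg_clipfrac`, the sketch below computes both from per-sample log-probabilities using estimators that are common in PPO implementations. Whether RLinf uses exactly these formulas is an assumption, and the function name and the `clip_eps` default are illustrative.

```python
import math


def ppo_diagnostics(old_logprobs, new_logprobs, clip_eps=0.2):
    """Return (approx_kl, clipfrac) for a batch of samples.

    approx_kl: mean of (old_logprob - new_logprob), a simple sample estimate
               of KL(old || new) that can be slightly negative on small batches.
    clipfrac:  fraction of samples whose probability ratio falls outside
               [1 - clip_eps, 1 + clip_eps].
    """
    n = len(old_logprobs)
    ratios = [math.exp(new - old) for old, new in zip(old_logprobs, new_logprobs)]
    approx_kl = sum(old - new for old, new in zip(old_logprobs, new_logprobs)) / n
    clipfrac = sum(1 for r in ratios if abs(r - 1.0) > clip_eps) / n
    return approx_kl, clipfrac
```

A clip fraction that stays near zero suggests the policy barely moves between updates, while a persistently high value together with a growing KL estimate usually indicates that the learning rate or the number of update epochs is too aggressive.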
3. Video Generation
video_cfg:
  save_video: True
  info_on_video: True
  video_base_dir: ${runner.logger.log_path}/video/train
4. WandB Integration
runner:
  task_type: embodied
  logger:
    log_path: "../results"
    project_name: rlinf
    experiment_name: "libero_10_ppo_openpi"
    logger_backends: ["tensorboard", "wandb"] # tensorboard, wandb, swanlab
LIBERO Results#
We trained π0 and π0.5 with PPO and GRPO in the LIBERO environment. The results achieved through RL training are shown below:
| Model | Spatial | Object | Goal | Long | Average | Δ Avg. |
|---|---|---|---|---|---|---|
| π0 (few-shot) | 65.3% | 64.4% | 49.8% | 51.2% | 57.6% | - |
| +GRPO | 97.8% | 97.8% | 83.2% | 81.4% | 90.0% | +32.4 |
| +PPO | 98.4% | 99.4% | 96.2% | 90.2% | 96.0% | +38.4 |

| Model | Spatial | Object | Goal | Long | Average | Δ Avg. |
|---|---|---|---|---|---|---|
| π0.5 (few-shot) | 84.6% | 95.4% | 84.6% | 43.9% | 77.1% | - |
| +GRPO | 97.4% | 99.8% | 91.2% | 77.6% | 91.5% | +14.4 |
| +PPO | 99.6% | 100% | 98.8% | 93.0% | 97.9% | +20.8 |
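The Average and Δ Avg. columns can be reproduced from the per-suite scores. The sketch below copies the numbers from the two tables and assumes that Δ Avg. is the difference between a method's reported average and the few-shot average of the same model. The dictionary layout and function names are illustrative.

```python
# (Spatial, Object, Goal, Long) success rates in percent, plus the reported average.
RESULTS = {
    "pi0": {
        "few-shot": ((65.3, 64.4, 49.8, 51.2), 57.6),
        "+GRPO": ((97.8, 97.8, 83.2, 81.4), 90.0),
        "+PPO": ((98.4, 99.4, 96.2, 90.2), 96.0),
    },
    "pi0.5": {
        "few-shot": ((84.6, 95.4, 84.6, 43.9), 77.1),
        "+GRPO": ((97.4, 99.8, 91.2, 77.6), 91.5),
        "+PPO": ((99.6, 100.0, 98.8, 93.0), 97.9),
    },
}


def suite_mean(model, method):
    """Unweighted mean of the four suite scores."""
    scores = RESULTS[model][method][0]
    return sum(scores) / len(scores)


def delta_vs_few_shot(model, method):
    """Reported average minus the few-shot average of the same model."""
    return round(RESULTS[model][method][1] - RESULTS[model]["few-shot"][1], 1)


for model, rows in RESULTS.items():
    for method in rows:
        print(model, method, f"{suite_mean(model, method):.3f}", delta_vs_few_shot(model, method))
```

The recomputed means agree with the reported averages to within 0.1 percentage points; the small differences are consistent with the table having been computed from unrounded success rates.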
MetaWorld Results#
For MetaWorld results, see the MetaWorld page.
CALVIN Results#
For CALVIN results, see the CALVIN page.