RL on π₀ and π_0.5 Models#

Fine-tune the π₀ and π_0.5 flow-based VLA models with reinforcement learning (PPO / GRPO) across several simulators using RLinf. For the full method, see the paper πRL: Online RL Fine-Tuning for Flow-Based Vision-Language-Action Models.

Overview#

RL-fine-tune π₀ / π_0.5 on LIBERO, ManiSkill, MetaWorld, and CALVIN with PPO or GRPO.

Environments: LIBERO · ManiSkill · MetaWorld · CALVIN

Algorithms: PPO · GRPO

Tasks: Spatial · Object · Goal · Long

Hardware: 1 node · GPUs

You’ll do: install → download an SFT checkpoint → pick a config → launch run_embodiment.sh → watch env/success_once.

Prerequisites: Installation · a π₀ / π_0.5 SFT checkpoint (steps below).

Tasks#

Select the model page by matching the environment, task family, and config or checkpoint artifact.

Environment	Task / Suite	Config / Weights	Focus
LIBERO	Spatial · Object · Goal · Long	`libero_spatial_ppo_openpi_pi05` / `libero_10_grpo_openpi_pi05`	Fine-tune π₀ / π_0.5 on LIBERO manipulation suites.
ManiSkill3	PickCube and related tasks	`maniskill_ppo_openpi_pi05`	Fine-tune π_0.5 on ManiSkill3 robot-control tasks.
MetaWorld	MT50	`metaworld_50_ppo_openpi_pi05`	Evaluate generalization across MetaWorld manipulation tasks.
CALVIN	ABC-D	`calvin_abc_d_ppo_openpi_pi05`	Train on long-horizon language-conditioned manipulation.

Observation and Action#

Field	Description
Observation	Main-view and wrist-view RGB plus robot state from LIBERO, ManiSkill3, MetaWorld, or CALVIN.
Action	7-D continuous control for end-effector position, rotation, and gripper state.
Reward	Environment success or shaped reward used by PPO / GRPO.
Prompt	Environment-provided natural-language task description consumed by the VLA processor.

π₀ / π_0.5 train with PPO (actor-critic; GAE, ratio clipping, value clipping, entropy regularization) or GRPO (group-relative advantages over G sampled actions).
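The two ingredients named above can be sketched in a few lines. This is a minimal illustration, not RLinf's implementation: the function names are hypothetical, dividing by the group standard deviation is one common GRPO formulation (an assumption here), and the clip range of 0.2 is only an illustrative default.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each of the G samples against its own group (hypothetical helper).

    Assumption: advantages are (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample: min(r * A, clip(r) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four samples for the same task: two successes, two failures.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A ratio far above 1 earns no extra credit once it leaves the clip range.
capped = ppo_clipped_term(ratio=1.5, advantage=1.0)
```

Successful samples receive positive advantages and failed ones negative, and a group in which every sample gets the same reward yields all-zero advantages, so it contributes no gradient signal.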

Installation#

First, clone the RLinf repository:

# Mainland China users can use a mirror for faster cloning:
# git clone https://ghfast.top/github.com/RLinf/RLinf.git
git clone https://github.com/RLinf/RLinf.git
cd RLinf

Then set up the dependencies with one of the two methods below: a prebuilt Docker image (recommended) or a custom environment. The general setup (prerequisites, GPU drivers, the in-image switch_env helper, mirrors, and troubleshooting) is documented once in Installation; the commands in this recipe differ only in the Docker image tag and the --env value.

Option 1: Docker image — image tag agentic-rlinf0.3-maniskill_libero:

docker run -it --rm --gpus all \
   --shm-size 20g \
   --network host \
   --name rlinf \
   -v .:/workspace/RLinf \
   rlinf/rlinf:agentic-rlinf0.3-maniskill_libero
   # For mainland China users, you can use the following for better download speed:
   # docker.1ms.run/rlinf/rlinf:agentic-rlinf0.3-maniskill_libero

Inside the container, switch to the openpi virtual environment with the image's built-in switch_env utility:

source switch_env openpi

Option 2: Custom Environment

Install dependencies directly in your environment by running the following command:

# For mainland China users, you can add the `--use-mirror` flag to the install.sh command for better download speed.

bash requirements/install.sh embodied --model openpi --env maniskill_libero
source .venv/bin/activate

Download the Model#

Before training, download the pretrained SFT model for your environment. For example, for the Spatial, Object, and Goal task types in the LIBERO environment:

# Download the Spatial-Object-Goal model (choose either method)
# Method 1: Using git clone
git lfs install
git clone https://huggingface.co/RLinf/RLinf-Pi0-LIBERO-Spatial-Object-Goal-SFT

# Method 2: Using huggingface-hub
# For mainland China users, you can use the following for better download speed:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download RLinf/RLinf-Pi0-LIBERO-Spatial-Object-Goal-SFT --local-dir RLinf-Pi0-LIBERO-Spatial-Object-Goal-SFT

Alternatively, you can download the model from ModelScope: https://www.modelscope.cn/models/RLinf/RLinf-Pi0-SFT-Spatial-Object-Goal.

RLinf also provides pretrained models for the other environments:

π₀ Model List#
Environment	Task Description	SFT Model	Flow-SDE	Flow-Noise
LIBERO	Spatial, Object, Goal	SFT Model
LIBERO	Long	SFT Model
ManiSkill3	Multi-task	38.4%	78.8%	77.8%
MetaWorld	MT50	50.8%	78.1%	85.8%
CALVIN	ABC-D	57.5%	61.7%	59.9%

π_0.5 Model List#
Environment	Task Description	SFT Model	Flow-SDE	Flow-Noise
LIBERO	Spatial, Object, Goal, Long	SFT Model
ManiSkill3	Multi-task	40.1%	90.9%	89.7%
MetaWorld	MT50	43.8%	70.7%	66.1%
CALVIN	ABC-D	61.3%	87.0%	84.5%

After downloading, please make sure to specify the model path correctly in your configuration file.

Run It#

1. Key Cluster Configuration

cluster:
   num_nodes: 1
   component_placement:
      env: 0-3
      rollout: 4-7
      actor: 0-7

rollout:
   pipeline_stage_num: 2

Here you can choose which GPUs the env, rollout, and actor components run on. Setting pipeline_stage_num: 2 under rollout additionally overlaps rollout and env execution in a pipeline, which improves rollout efficiency.

cluster:
   num_nodes: 1
   component_placement:
      env,rollout,actor: all

You can also reconfigure the placement to achieve complete sharing, where env, rollout, and actor components all share all GPUs.

cluster:
   num_nodes: 1
   component_placement:
      env: 0-1
      rollout: 2-5
      actor: 6-7

You can also reconfigure the placement to achieve complete separation, where env, rollout, and actor components each use their own GPUs without interference, eliminating the need for offload functionality.
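To see which components share GPUs under each of the three placements above, the specs can be expanded into GPU index sets. This is only a sketch under stated assumptions: the helpers are hypothetical (they are not RLinf's parser), ranges such as 0-3 are assumed to be inclusive, and the node is assumed to have 8 GPUs, as the indices 0-7 suggest.

```python
def parse_gpus(spec, num_gpus):
    """Expand 'all', '3', or an inclusive range like '0-3' into GPU indices."""
    if spec == "all":
        return set(range(num_gpus))
    if "-" in spec:
        start, end = (int(part) for part in spec.split("-"))
        return set(range(start, end + 1))
    return {int(spec)}


def expand_placement(placement, num_gpus):
    """Map each component name to its GPU set; keys may be comma-joined."""
    expanded = {}
    for names, spec in placement.items():
        gpus = parse_gpus(str(spec), num_gpus)
        for name in names.split(","):
            expanded[name.strip()] = gpus
    return expanded


def shared_gpus(expanded):
    """Return the GPUs shared by each pair of components, if any."""
    names = sorted(expanded)
    shared = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = expanded[first] & expanded[second]
            if overlap:
                shared[(first, second)] = sorted(overlap)
    return shared


hybrid = expand_placement({"env": "0-3", "rollout": "4-7", "actor": "0-7"}, 8)
colocated = expand_placement({"env,rollout,actor": "all"}, 8)
separated = expand_placement({"env": "0-1", "rollout": "2-5", "actor": "6-7"}, 8)

print(shared_gpus(hybrid))
# {('actor', 'env'): [0, 1, 2, 3], ('actor', 'rollout'): [4, 5, 6, 7]}
print(shared_gpus(separated))
# {}
```

Under these assumptions, only the fully separated placement produces no overlap, which matches the statement that it does not need offload.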

2. Model Key Parameter Configuration

2.1 Model Parameters

openpi:
  noise_level: 0.5 # default noise intensity for flow_sde
  noise_logvar_range: [0.08, 0.16] # default learnable noise range for flow_noise
  action_chunk: ${actor.model.num_action_chunks}
  num_steps: ${actor.model.num_steps}
  train_expert_only: True
  action_env_dim: ${actor.model.action_dim}
  noise_method: "flow_sde" # flow_sde, flow_noise
  add_value_head: False
  pi05: False
  value_after_vlm: False

Set the number of flow-matching steps via num_steps.
Choose the noise injection method via noise_method: flow_sde or flow_noise. noise_level controls the noise intensity for flow_sde, and noise_logvar_range controls the learnable noise range for flow_noise.
Enable the π_0.5 model by setting pi05: True.
Control the critic position via value_after_vlm: when True, the critic takes the VLM module output as its input; when False, it takes the action expert module output.

2.2 Algorithm Configuration

In the paper, we provide two technical approaches, flow-noise and flow-sde, to fine-tune π₀ and π_0.5 models. Specifically, you can choose different technical approaches by switching the following configuration:

algorithm:
   entropy_bonus: 0.0 # entropy regularization coefficient, set to 0.0 for flow-sde, 0.005 for flow-noise
openpi:
  noise_method: "flow_sde" # [flow_sde,flow_noise] noise injection method, flow-sde introduces noise through ode-sde transformation, flow-noise introduces noise through noise network
  noise_level: 0.5 # noise intensity for flow-sde
  noise_logvar_range: [0.08, 0.16] # learnable noise range for flow-noise
  joint_logprob: False # whether to optimize joint probability density function. For flow-sde, please set to False. For flow-noise, please set to True.

For a complete flow-sde configuration, see libero_spatial_ppo_openpi.yaml; for a complete flow-noise configuration, see maniskill_ppo_openpi.yaml.
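The settings that change together when switching between the two approaches can be collected in one place. This is a hedged sketch: the helper and its dotted key names are hypothetical and only mirror the snippet above (the full path of each key in a real config may differ), while the values are the ones given in the comments of that snippet.

```python
def noise_method_overrides(noise_method):
    """Return the settings that the snippet above pairs with each method.

    Hypothetical helper; the dotted keys are illustrative, not RLinf paths.
    """
    if noise_method == "flow_sde":
        return {
            "algorithm.entropy_bonus": 0.0,   # 0.0 for flow-sde
            "openpi.noise_method": "flow_sde",
            "openpi.noise_level": 0.5,        # noise intensity for flow-sde
            "openpi.joint_logprob": False,    # False for flow-sde
        }
    if noise_method == "flow_noise":
        return {
            "algorithm.entropy_bonus": 0.005,  # 0.005 for flow-noise
            "openpi.noise_method": "flow_noise",
            "openpi.noise_logvar_range": [0.08, 0.16],  # learnable noise range
            "openpi.joint_logprob": True,      # True for flow-noise
        }
    raise ValueError(f"unknown noise_method: {noise_method!r}")


for key, value in noise_method_overrides("flow_noise").items():
    print(f"{key} = {value}")
```

Keeping the three coupled values (noise_method, entropy_bonus, joint_logprob) together makes it harder to switch one and forget the others.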

2.3 LoRA Settings

model:
  is_lora: True
  lora_rank: 8
  gradient_checkpointing: False

To fine-tune the VLM part with LoRA (Low-Rank Adaptation), set is_lora: True and configure lora_rank. Gradient checkpointing is currently not supported, so keep gradient_checkpointing: False.

⭐ 2.4 Minimum Test Case ⭐

If you hit out-of-memory (OOM) errors or want a minimum test case that uses as few resources as possible, see libero_spatial_ppo_openpi_quickstart.yaml. Compared with the standard task configuration, it makes the following changes:

env.train.rollout_epoch: 8 -> 2
env.train.total_num_envs: 64 -> 32
actor.micro_batch_size: 128 -> 64
actor.global_batch_size: 2048 -> 256
actor.optim.lr: 5e-6 -> 1e-6
actor.enable_offload: False -> True
rollout.enable_offload: False -> True

On 4 H100 GPUs, we compared the standard parameters with the minimum test parameters and found that they reach almost the same performance after the same amount of training time: the minimum test parameters run each round faster but take more rounds to converge.

If you still encounter OOM issues under the minimum parameter configuration, we provide the following solutions:

If OOM occurs during the rollout stage:

Try replacing the rendering engine from egl to osmesa
Further reduce env.train.total_num_envs from 32 to 16, but increase env.train.rollout_epoch from 2 to 4 to ensure the total number of environments per rollout round remains consistent
Check if actor’s enable_offload is enabled, and set it to True if it is False

If OOM occurs during the actor stage:

Try reducing micro_batch_size from 64 to 32, keeping global_batch_size at 256
Check if rollout’s enable_offload is enabled, and set it to True if it is False

Note

If you encounter a mismatch between micro_batch_size and global_batch_size, ensure that global_batch_size is an integer multiple of micro_batch_size × number of GPUs.
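The sizing rules in this subsection can be checked before launching. The sketch below is illustrative only: both helpers are hypothetical, and reading "total number of environments per rollout round" as total_num_envs × rollout_epoch is an assumption. The GPU counts are examples (4 matches the H100 comparison above).

```python
def envs_per_rollout_round(total_num_envs, rollout_epoch):
    """Assumed meaning of 'environments per rollout round': the product."""
    return total_num_envs * rollout_epoch


def batch_sizes_compatible(global_batch_size, micro_batch_size, num_gpus):
    """Rule from the note: global must be a multiple of micro x num_gpus."""
    return global_batch_size % (micro_batch_size * num_gpus) == 0


# Halving the environments while doubling the epochs keeps the product fixed.
quickstart_envs = envs_per_rollout_round(total_num_envs=32, rollout_epoch=2)
reduced_envs = envs_per_rollout_round(total_num_envs=16, rollout_epoch=4)

# Quickstart batch sizes on 4 GPUs, before and after reducing micro_batch_size.
ok_micro_64 = batch_sizes_compatible(256, 64, num_gpus=4)
ok_micro_32 = batch_sizes_compatible(256, 32, num_gpus=4)

# The same quickstart values on 8 GPUs would violate the rule: 64 x 8 = 512.
ok_on_8_gpus = batch_sizes_compatible(256, 64, num_gpus=8)

print(quickstart_envs, reduced_envs, ok_micro_64, ok_micro_32, ok_on_8_gpus)
```

Applying the rule this way shows why a configuration that works on one GPU count can report a mismatch on another without any other change.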

2.5 Model Evaluation

For models after SFT or RL training, we provide two evaluation methods:

Use RLinf’s unified evaluation script; see the evaluation documentation for details. This method evaluates environments in parallel, so it is fast, but it reports only the success rate for the task as a whole.

Note

MetaWorld currently does not support the evaluation mode with env.eval.auto_reset=True. We recommend using the individual script files for model evaluation.

Use the individual evaluation script files; see the example README.md. These scripts are consistent with the official evaluation scripts provided by openpi and report the success rate for each subtask, but they run more slowly.

3. Configuration Files

Using libero-10 as an example, the configuration files for π₀ and π_0.5 are:

π₀ + PPO:
examples/embodiment/config/libero_10_ppo_openpi.yaml
π₀ + GRPO:
examples/embodiment/config/libero_10_grpo_openpi.yaml
π_0.5 + PPO:
examples/embodiment/config/libero_10_ppo_openpi_pi05.yaml
π_0.5 + GRPO:
examples/embodiment/config/libero_10_grpo_openpi_pi05.yaml
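The four paths above follow one naming pattern, which can be sketched as a small helper. The function is hypothetical and covers only the libero-10 files listed here; it does not claim that every environment has all four variants.

```python
from pathlib import PurePosixPath


def libero10_config_path(algorithm, pi05):
    """Build one of the four libero-10 config paths listed above."""
    if algorithm not in ("ppo", "grpo"):
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    suffix = "_pi05" if pi05 else ""
    return f"examples/embodiment/config/libero_10_{algorithm}_openpi{suffix}.yaml"


path = libero10_config_path("grpo", pi05=True)

# The file name without its directory and extension.
config_name = PurePosixPath(path).stem

print(path)
print(config_name)
```

Here config_name is simply the file name stripped of its directory and the .yaml extension.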

4. Launch Command

To start training with a chosen configuration, run the following command:

bash examples/embodiment/run_embodiment.sh CHOSEN_CONFIG

For example, to train the π₀ model using the PPO algorithm in the LIBERO environment, run:

bash examples/embodiment/run_embodiment.sh libero_spatial_ppo_openpi_quickstart

Visualization and Results#

1. TensorBoard Logging

# Launch TensorBoard
tensorboard --logdir ./logs --port 6006

2. Key Metrics

Watch env/success_once for the task success rate. For every logged metric, see Training metrics.

3. Video Generation

video_cfg:
  save_video: True
  info_on_video: True
  video_base_dir: ${runner.logger.log_path}/video/train

4. WandB Integration

runner:
  task_type: embodied
  logger:
    log_path: "../results"
    project_name: rlinf
    experiment_name: "libero_10_ppo_openpi"
    logger_backends: ["tensorboard", "wandb"] # tensorboard, wandb, swanlab

LIBERO Results#

We trained π₀ and π_0.5 with PPO and GRPO in the LIBERO environment. The results achieved through RL training are shown below:

π₀ model results on LIBERO#
Model	Spatial	Object	Goal	Long	Average	Δ Avg.
π₀(few-shot)	65.3%	64.4%	49.8%	51.2%	57.6%	—
+GRPO	97.8%	97.8%	83.2%	81.4%	90.0%	+32.4
+PPO	98.4%	99.4%	96.2%	90.2%	96.0%	+38.4

π_0.5 model results on LIBERO#
Model	Spatial	Object	Goal	Long	Average	Δ Avg.
π_0.5(few-shot)	84.6%	95.4%	84.6%	43.9%	77.1%	—
+GRPO	97.4%	99.8%	91.2%	77.6%	91.5%	+14.4
+PPO	99.6%	100%	98.8%	93.0%	97.9%	+20.8

MetaWorld Results#

For MetaWorld results, see the MetaWorld page.

CALVIN Results#

For CALVIN results, see the CALVIN page.
