πRL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Overview

πRL provides online reinforcement learning (RL) fine-tuning for the flow-based vision-language-action (VLA) models π₀ and π₀.₅ within the RLinf framework. By combining PPO or GRPO with flow-matching policies, it lets models initialized by few-shot supervised fine-tuning (SFT) reach strong manipulation performance through environment feedback. It supports the LIBERO, ManiSkill3, MetaWorld, and CALVIN benchmarks.

Results

π₀ Model

Evaluation results of the π₀ model
Environment	Task	SFT	Flow-SDE	Flow-Noise
LIBERO	Spatial, Object, Goal	SFT	—	—
LIBERO	Long	SFT	—	—
ManiSkill3	Multi-task	38.4%	78.8%	77.8%
MetaWorld	MT50	50.8%	78.1%	85.8%
CALVIN	ABC-D	57.5%	61.7%	59.9%

π₀.₅ Model

Evaluation results of the π₀.₅ model
Environment	Task	SFT	Flow-SDE	Flow-Noise
LIBERO	Spatial, Object, Goal, Long	SFT	—	—
ManiSkill3	Multi-task	40.1%	90.9%	89.7%
MetaWorld	MT50	43.8%	70.7%	66.1%
CALVIN	ABC-D	61.3%	87.0%	84.5%

Quick Start

Full guide: RL on π0 and π0.5 Models

Run: bash examples/embodiment/run_embodiment.sh <CONFIG_NAME> (configs in examples/embodiment/config/)

Model Selection:

π₀: Configs without _pi05 in the name
π₀.₅: Configs with _pi05 in the name (e.g. *_openpi_pi05.yaml)
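The naming rule above can be checked mechanically before launching a run. The following is a minimal sketch, assuming it is run from the repository root and that the config is one of the πRL configs in examples/embodiment/config/. `model_for_config` is a hypothetical helper written for illustration and is not part of RLinf; `<CONFIG_NAME>` is left as a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RLinf): infer which model a config targets,
# using only the naming rule stated above.
model_for_config() {
  case "$1" in
    *_pi05*) echo "pi0.5" ;;  # name contains _pi05
    *)       echo "pi0" ;;    # name does not contain _pi05
  esac
}

# Usage (commented out; substitute a real config from examples/embodiment/config/):
# config="<CONFIG_NAME>"
# echo "Launching a $(model_for_config "$config") run with config: $config"
# bash examples/embodiment/run_embodiment.sh "$config"
```

The check relies only on the `_pi05` substring, so it says nothing about whether the config actually exists; listing examples/embodiment/config/ first is the safer way to pick a name.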

Benchmarks:

LIBERO: RL with LIBERO Benchmarks
ManiSkill3: RL with ManiSkill Benchmark
MetaWorld: RL with MetaWorld Benchmark
CALVIN: RL with CALVIN Benchmark
Real2Sim2Real (GSEnv): RL with Real2Sim2Real GSEnv

Citation

@article{chen2025pi_rl,
  title={$\pi_{\texttt{RL}}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models},
  author={Chen, Kang and Liu, Zhihao and Zhang, Tonghe and Guo, Zhen and Xu, Si and Lin, Hao and Zang, Hongzhi and Li, Xiang and Zhang, Quanlu and Yu, Zhaofei and others},
  journal={arXiv preprint arXiv:2510.25889},
  year={2025}
}