πRL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Overview

πRL provides online reinforcement learning (RL) fine-tuning for the flow-based vision-language-action (VLA) models π₀ and π₀.₅ within the RLinf framework. By combining PPO or GRPO with flow-matching policies, it enables few-shot SFT models to achieve strong manipulation performance through environment feedback. It supports the LIBERO, ManiSkill3, MetaWorld, and CALVIN benchmarks and jointly optimizes visual understanding, language comprehension, and continuous action generation via RL.

Results

π₀ Model

Evaluation results of the π₀ model
Environment	Task	SFT	Flow-SDE	Flow-Noise
LIBERO	Spatial, Object, Goal	SFT	—	—
LIBERO	Long	SFT	—	—
ManiSkill3	Multi-task	38.4%	78.8%	77.8%
MetaWorld	MT50	50.8%	78.1%	85.8%
CALVIN	ABC-D	57.5%	61.7%	59.9%

π₀.₅ Model

Evaluation results of the π₀.₅ model
Environment	Task	SFT	Flow-SDE	Flow-Noise
LIBERO	Spatial, Object, Goal, Long	SFT	—	—
ManiSkill3	Multi-task	40.1%	90.9%	89.7%
MetaWorld	MT50	43.8%	70.7%	66.1%
CALVIN	ABC-D	61.3%	87.0%	84.5%

Quickstart

Full guide: RL on π0 and π0.5 Models

Run: bash examples/embodiment/run_embodiment.sh <CONFIG_NAME> (configs in examples/embodiment/config/)

Model Selection:

π₀: Configs without _pi05 in the name
π₀.₅: Configs with _pi05 in the name (e.g. *_openpi_pi05.yaml)
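
The naming rule above can be applied mechanically when browsing the config directory. The following is a minimal sketch, assuming it is run from the repository root; `model_for_config` is a hypothetical helper written for this illustration and is not part of RLinf.

```shell
#!/bin/sh
# Hypothetical helper (not part of RLinf): map a config file name to the
# model it targets, using the naming rule "_pi05 in the name means pi0.5".
model_for_config() {
  case "$1" in
    *_pi05*) echo "pi0.5" ;;
    *)       echo "pi0" ;;
  esac
}

# List every config under examples/embodiment/config/ with its model.
for cfg in examples/embodiment/config/*.yaml; do
  [ -e "$cfg" ] || continue   # the glob matched nothing; skip quietly
  name=$(basename "$cfg")
  printf '%s\t%s\n' "$(model_for_config "$name")" "$name"
done
```

Each output line pairs a model label with a config file name, which makes it easier to pick the `<CONFIG_NAME>` to pass to `run_embodiment.sh`.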

Benchmarks:

LIBERO: RL with LIBERO Benchmark
ManiSkill3: RL with ManiSkill Benchmark
MetaWorld: RL with MetaWorld Benchmark
CALVIN: RL with CALVIN Benchmark
Real2Sim2Real (GSEnv): RL with Real2Sim2Real GSEnv

Citation

@article{chen2025pi_rl,
  title={$\pi_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models},
  author={Chen, Kang and Liu, Zhihao and Zhang, Tonghe and Guo, Zhen and Xu, Si and Lin, Hao and Zang, Hongzhi and Li, Xiang and Zhang, Quanlu and Yu, Zhaofei and others},
  journal={arXiv preprint arXiv:2510.25889},
  year={2025}
}