RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training#

Paper: arXiv:2510.06710 | Models: RLinf-OpenVLA | RLinf-OpenVLAOFT

Overview#

RLinf-VLA is a unified and efficient framework for scalable RL training of VLA models. It provides a unified interface that standardizes the integration of diverse VLA architectures, multiple RL algorithms, and heterogeneous simulators. The system uses a flexible resource allocation architecture for rendering, inference, and training; for GPU-parallelized simulators it introduces a hybrid fine-grained pipeline allocation strategy, yielding a 1.61×–1.88× training speedup. Models trained with RLinf-VLA show consistent improvements of approximately 20–85% across LIBERO, ManiSkill, and RoboTwin.
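
To put the reported speedup range in terms of time saved, the arithmetic below converts a speedup factor into a fractional reduction in wall-clock time. It is a minimal sketch that assumes the 1.61×–1.88× figures are ratios of baseline to optimized wall-clock time for the same workload; the helper name `time_reduction` is hypothetical and not part of RLinf.

```python
def time_reduction(speedup: float) -> float:
    """Fraction of wall-clock time saved if the same work runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speedup range quoted in the overview (assumed here to be a wall-clock ratio).
low, high = 1.61, 1.88
print(f"{low}x -> {time_reduction(low):.1%} less time")
print(f"{high}x -> {time_reduction(high):.1%} less time")
```

Under that assumption, the quoted range corresponds to roughly 38–47% less training time.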

Results#

Training curves (ManiSkill)#

Figure: training curves on ManiSkill “PutOnPlateInScene25Mani-v3” for OpenVLA and OpenVLA-OFT (one panel each), trained with PPO and GRPO. PPO consistently outperforms GRPO and is more stable.
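
For readers unfamiliar with the two algorithms, the sketch below contrasts how each typically estimates advantages: GRPO standardizes returns within a group of rollouts and needs no value function, while PPO is commonly paired with a learned critic and generalized advantage estimation (GAE). These are the textbook formulations, not RLinf's implementation; the function names, the default `gamma` and `lam`, and the toy inputs are assumptions for illustration. The source does not say which of these differences, if any, explains the gap in the curves.

```python
from statistics import mean, pstdev


def grpo_advantages(group_returns, eps=1e-8):
    """Group-relative advantages: standardize each rollout's return within its group."""
    mu = mean(group_returns)
    sigma = pstdev(group_returns)
    return [(r - mu) / (sigma + eps) for r in group_returns]


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory; `values` has len(rewards) + 1 entries (bootstrap value last)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then exponentially weighted sum of future TD errors.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy data: four rollouts of the same task with binary success returns.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
# Toy data: one three-step trajectory with a sparse terminal reward and a zero critic.
print(gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```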

ManiSkill evaluation#

Evaluation results on ManiSkill (success rates %)#
Model	In-Dist.	Vision	Semantic	Execution	Avg.
OpenVLA (Base)	53.91	38.75	35.94	42.11	39.10
RL4VLA (PPO)	93.75	80.47	75.00	81.77	79.15
OpenVLA (RLinf-GRPO)	84.38	74.69	72.99	77.86	75.15
OpenVLA (RLinf-PPO)	96.09	82.03	78.35	85.42	81.93

OpenVLA-OFT (Base)	28.13	27.73	12.95	11.72	18.29
OpenVLA-OFT (RLinf-GRPO)	94.14	84.69	45.54	44.66	60.64
OpenVLA-OFT (RLinf-PPO)	97.66	92.11	64.84	73.57	77.05

LIBERO (unified model, five task groups)#

Evaluation results of the unified model on the five LIBERO task groups (%)#
Model	Spatial	Object	Goal	Long	90	Avg.
OpenVLA-OFT (Base)	72.18	71.48	64.06	48.44	70.97	65.43
OpenVLA-OFT (RLinf-GRPO)	99.40	99.80	98.79	93.95	98.59	98.11
Δ Improvement	+27.22	+28.32	+34.73	+45.51	+27.62	+32.68
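
The Avg. and Δ entries can be reproduced from the per-group numbers. The check below is a small sketch using only the values in the table above; it shows that the reported averages are consistent with an unweighted mean over the five task groups (the variable names are illustrative).

```python
# Success rates (%) copied from the LIBERO table above.
base = {"Spatial": 72.18, "Object": 71.48, "Goal": 64.06, "Long": 48.44, "90": 70.97}
rl = {"Spatial": 99.40, "Object": 99.80, "Goal": 98.79, "Long": 93.95, "90": 98.59}

# Per-group improvement in percentage points.
delta = {group: round(rl[group] - base[group], 2) for group in base}

# Unweighted mean over the five task groups.
base_avg = round(sum(base.values()) / len(base), 2)
rl_avg = round(sum(rl.values()) / len(rl), 2)

print(delta)
print(base_avg, rl_avg, round(rl_avg - base_avg, 2))
```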

RoboTwin (seven tasks)#

Evaluation results of OpenVLA-OFT on seven RoboTwin tasks (%)#
Task	OpenVLA-OFT (SFT)	OpenVLA-OFT (RLinf-GRPO)
beat_block_hammer	10.15%	96.09%
pick_dual_bottles	20.31%	92.96%
place_empty_cup	75.78%	94.53%
place_container_plate	54.69%	95.31%
move_can_pot	9.37%	83.59%
lift_pot	3.13%	70.31%
handover_block	28.13%	70.31%
Average	28.79%	86.16%
Δ Avg.	—	+57.37%

“Base” and “SFT” refer to supervised fine-tuned models before RL training.
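
The RoboTwin table does not list per-task gains, so the sketch below derives them from the table's values (in percentage points) and recomputes the averages. It uses only numbers from the table; the variable names are illustrative. One detail it surfaces: the reported Δ Avg. of +57.37 equals the difference of the two rounded averages, while averaging the unrounded per-task gains gives 57.36.

```python
# (SFT, RLinf-GRPO) success rates (%) copied from the RoboTwin table above.
results = {
    "beat_block_hammer": (10.15, 96.09),
    "pick_dual_bottles": (20.31, 92.96),
    "place_empty_cup": (75.78, 94.53),
    "place_container_plate": (54.69, 95.31),
    "move_can_pot": (9.37, 83.59),
    "lift_pot": (3.13, 70.31),
    "handover_block": (28.13, 70.31),
}

# Per-task gain in percentage points.
gains = {task: round(rl - sft, 2) for task, (sft, rl) in results.items()}

sft_avg = round(sum(sft for sft, _ in results.values()) / len(results), 2)
rl_avg = round(sum(rl for _, rl in results.values()) / len(results), 2)
mean_gain = round(sum(rl - sft for sft, rl in results.values()) / len(results), 2)

print(gains)
print(sft_avg, rl_avg)                        # averages as reported in the table
print(round(rl_avg - sft_avg, 2), mean_gain)  # rounded-average difference vs. mean gain
```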

Quickstart#

ManiSkill: RL with ManiSkill Benchmark
LIBERO: RL with LIBERO Benchmark
RoboTwin: RL with RoboTwin Benchmark
More examples: Embodied Scenarios

Citation#

@article{zang2025rlinf,
  title={RLinf-VLA: A unified and efficient framework for VLA+RL training},
  author={Zang, Hongzhi and Wei, Mingjie and Xu, Si and Wu, Yongji and Guo, Zhen and Wang, Yuanqing and Lin, Hao and Shi, Liangzhi and Xie, Yuqing and Xu, Zhexuan and others},
  journal={arXiv preprint arXiv:2510.06710},
  year={2025}
}