RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training#

Paper: arXiv:2510.06710 | Models: RLinf-OpenVLA | RLinf-OpenVLAOFT

Overview#

[Figure: RLinf-VLA overview]

RLinf-VLA is a unified and efficient framework for scalable RL training of VLA models. It provides a unified interface that standardizes the integration of diverse VLA architectures, multiple RL algorithms, and heterogeneous simulators. The system uses a flexible resource allocation architecture for rendering, inference, and training; for GPU-parallelized simulators it introduces a hybrid fine-grained pipeline allocation strategy, yielding a 1.61×–1.88× training speedup. Models trained with RLinf-VLA show consistent success-rate improvements of approximately 20–85% across LIBERO, ManiSkill, and RoboTwin.
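
To build intuition for why pipelining rollout stages can shorten training, here is a toy timing model in Python. It is a sketch under stated assumptions, not RLinf-VLA's scheduler: the function names are hypothetical, the rollout is reduced to two stages (simulation and inference), each stage's time is assumed to scale linearly with the number of environments, the two stages are assumed to run on separate resources, and all overheads are ignored. The numbers are illustrative, not measurements.

```python
def sequential_step_time(sim_time: float, infer_time: float) -> float:
    """Time for one rollout step when simulation and inference run back to back."""
    if sim_time <= 0 or infer_time <= 0:
        raise ValueError("stage times must be positive")
    return sim_time + infer_time


def pipelined_step_time(sim_time: float, infer_time: float) -> float:
    """Steady-state time per rollout step in the toy two-stage pipeline.

    Assumption: the environments are split into two halves. While one half
    is being simulated, the other half is running inference, so the slower
    stage is always busy and sets the pace.
    """
    if sim_time <= 0 or infer_time <= 0:
        raise ValueError("stage times must be positive")
    return max(sim_time, infer_time)


def toy_speedup(sim_time: float, infer_time: float) -> float:
    """Ratio of sequential to pipelined step time in the toy model."""
    return sequential_step_time(sim_time, infer_time) / pipelined_step_time(
        sim_time, infer_time
    )


if __name__ == "__main__":
    # Hypothetical stage times in arbitrary units.
    for sim, infer in [(1.0, 1.0), (1.0, 0.5), (2.0, 0.5)]:
        print(f"sim={sim}, infer={infer}: speedup {toy_speedup(sim, infer):.2f}x")
```

In this toy model the speedup can never exceed 2× and approaches that bound only when both stages take about the same time. Real gains also depend on the training stage, communication, and scaling effects that the model leaves out.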

Results#

Training curves (ManiSkill)#

[Figure: two panels of training curves, one for OpenVLA and one for OpenVLA-OFT]

Training curves on ManiSkill “PutOnPlateInScene25Mani-v3” with OpenVLA and OpenVLA-OFT, using PPO and GRPO. PPO consistently outperforms GRPO and is more stable.
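
For readers unfamiliar with the two algorithms, a key difference is how advantages are estimated. The sketch below uses the standard textbook formulations: group-normalized rewards for GRPO, and generalized advantage estimation (GAE) with a value baseline, as commonly paired with PPO. It is not taken from the RLinf-VLA code base. The function names, the `eps` constant, the use of the population standard deviation, and the example rewards and values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages: each rollout's reward, standardized within its group.

    No value network is needed; the group mean acts as the baseline.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for a single finished episode.

    `values[t]` is the critic's estimate at step t; the value after the
    final step is taken to be 0 because the episode has terminated.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Four rollouts of the same task with hypothetical binary success rewards.
    print(grpo_advantages([1, 0, 0, 1]))
    # A group where every rollout succeeds carries no relative signal.
    print(grpo_advantages([1, 1, 1, 1]))
    # One three-step episode with a sparse terminal reward and a flat critic.
    print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0))
```

Note that a group in which every rollout receives the same reward yields all-zero GRPO advantages, whereas a value baseline can still produce nonzero per-step advantages whenever the critic's estimates differ from the observed returns.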

ManiSkill evaluation#

Evaluation results on ManiSkill (success rates %)#

| Model | In-Dist. | Vision | Semantic | Execution | Avg. |
|---|---|---|---|---|---|
| OpenVLA (Base) | 53.91 | 38.75 | 35.94 | 42.11 | 39.10 |
| RL4VLA (PPO) | 93.75 | 80.47 | 75.00 | 81.77 | 79.15 |
| OpenVLA (RLinf-GRPO) | 84.38 | 74.69 | 72.99 | 77.86 | 75.15 |
| OpenVLA (RLinf-PPO) | 96.09 | 82.03 | 78.35 | 85.42 | 81.93 |
| OpenVLA-OFT (Base) | 28.13 | 27.73 | 12.95 | 11.72 | 18.29 |
| OpenVLA-OFT (RLinf-GRPO) | 94.14 | 84.69 | 45.54 | 44.66 | 60.64 |
| OpenVLA-OFT (RLinf-PPO) | 97.66 | 92.11 | 64.84 | 73.57 | 77.05 |

LIBERO (unified model, five task groups)#

Evaluation results of the unified model on the five LIBERO task groups (%)#

| Model | Spatial | Object | Goal | Long | 90 | Avg. |
|---|---|---|---|---|---|---|
| OpenVLA-OFT (Base) | 72.18 | 71.48 | 64.06 | 48.44 | 70.97 | 65.43 |
| OpenVLA-OFT (RLinf-GRPO) | 99.40 | 99.80 | 98.79 | 93.95 | 98.59 | 98.11 |
| Δ Improvement | +27.22 | +28.32 | +34.73 | +45.51 | +27.62 | +32.68 |
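
The "Avg." column and the "Δ Improvement" row of the LIBERO table can be re-derived from the per-group success rates. The short Python check below does so, assuming that "Avg." is the unweighted mean over the five task groups and that Δ is the difference in percentage points; both assumptions reproduce the reported figures. The variable and function names are illustrative only.

```python
GROUPS = ["Spatial", "Object", "Goal", "Long", "90"]

# Success rates (%) copied from the LIBERO table.
BASE = [72.18, 71.48, 64.06, 48.44, 70.97]  # OpenVLA-OFT (Base)
GRPO = [99.40, 99.80, 98.79, 93.95, 98.59]  # OpenVLA-OFT (RLinf-GRPO)


def summarize(base, tuned):
    """Return (base average, tuned average, per-group deltas, average delta).

    All values are rounded to two decimals, matching the table.
    """
    base_avg = round(sum(base) / len(base), 2)
    tuned_avg = round(sum(tuned) / len(tuned), 2)
    deltas = [round(t - b, 2) for t, b in zip(tuned, base)]
    avg_delta = round(tuned_avg - base_avg, 2)
    return base_avg, tuned_avg, deltas, avg_delta


base_avg, grpo_avg, deltas, avg_delta = summarize(BASE, GRPO)

for name, delta in zip(GROUPS, deltas):
    print(f"{name:>8}: {delta:+.2f}")
print(f"    Avg.: {base_avg:.2f} -> {grpo_avg:.2f} ({avg_delta:+.2f})")
```

The largest gain falls on the Long group, which also has the lowest base success rate.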

RoboTwin (seven tasks)#

Evaluation results of OpenVLA-OFT on seven RoboTwin tasks (%)#

| Task | OpenVLA-OFT (SFT) | OpenVLA-OFT (RLinf-GRPO) |
|---|---|---|
| beat_block_hammer | 10.15 | 96.09 |
| pick_dual_bottles | 20.31 | 92.96 |
| place_empty_cup | 75.78 | 94.53 |
| place_container_plate | 54.69 | 95.31 |
| move_can_pot | 9.37 | 83.59 |
| lift_pot | 3.13 | 70.31 |
| handover_block | 28.13 | 70.31 |
| Average | 28.79 | 86.16 |
| Δ Avg. | n/a | +57.37 |

“Base” and “SFT” refer to supervised fine-tuned models before RL training.

Quickstart#

Citation#

@article{zang2025rlinf,
  title={RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training},
  author={Zang, Hongzhi and Wei, Mingjie and Xu, Si and Wu, Yongji and Guo, Zhen and Wang, Yuanqing and Lin, Hao and Shi, Liangzhi and Xie, Yuqing and Xu, Zhexuan and others},
  journal={arXiv preprint arXiv:2510.06710},
  year={2025}
}