Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models#

Overview#

Overview of the proposed two-stage sim-real co-training framework. We establish a digital-twin setup where \(T_{\text{sim}}\) serves as a digital cousin to \(T_{\text{real}}\) despite visual discrepancies. In Stage I, we initialize the VLA policy by supervising it on a mixture of real and simulated data (ratio \(\alpha\)). This rapidly injects real-world knowledge and prepares the policy for simulation interaction. In Stage II, we perform RL fine-tuning in the simulator to explore and improve performance, simultaneously employing a real-world SFT loss as a regularizer to prevent the forgetting of real-world behaviors.

Results#

Main Results#

Comparison of real-world success rates under different training paradigms#
VLA Model	Experiment Setting	Pick and Place	Push Cube	Open Drawer	Close Drawer	Avg.
OpenVLA	Real-Only Training	6.3 ± 0.0	20.0 ± 13.3	0.0 ± 0.0	10.0 ± 10.0	16.5 ± 13.3
	SFT Co-Training	23.4 ± 4.7	51.7 ± 5.0	0.0 ± 0.0	85.0 ± 5.0	40.0 ± 3.7
	RL-Co (Ours)	58.8 ± 10.0	68.3 ± 11.7	35.0 ± 15.0	95.0 ± 5.0	64.0 ± 0.7
π₀.₅	Real-Only Training	71.9 ± 9.4	0.0 ± 0.0	0.0 ± 0.0	35.0 ± 15.0	26.7 ± 1.4
	SFT Co-Training	68.8 ± 9.4	10.0 ± 3.3	10.0 ± 0.0	95.0 ± 5.0	45.9 ± 4.4
	RL-Co (Ours)	81.3 ± 9.4	18.4 ± 1.7	65.0 ± 5.0	100.0 ± 0.0	66.2 ± 4.0

Ablation Study#

Ablation study on simulation SFT initialization. We report the simulation success rate during RL training for models trained with and without simulation SFT initialization. Each RL training run uses three independent random seeds, and results are presented as mean success rate with shaded regions indicating standard deviation.

Data Efficiency#

Effect of the number of real-world demonstrations. We vary the number of real-world demonstrations for the Open Drawer task and evaluate all training paradigms using the \(\pi_{0.5}\) model. Performance is reported as success rate, with shaded regions indicating standard deviation.

Quickstart#

Instruction: RL-based Sim-Real Co-Training

Citation#

@article{shi2026rlinf,
  title={Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models},
  author={Shi, Liangzhi and Chen, Shuaihang and Gao, Feng and Chen, Yinuo and Chen, Kang and Zhang, Tonghe and Zhang, Hongzhi and Zhang, Weinan and Yu, Chao and Wang, Yu},
  journal={arXiv preprint arXiv:2602.12628},
  year={2026}
}