Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models
Paper: arXiv:2602.12628
Overview
Overview of the proposed two-stage sim-real co-training framework. We set up a digital-twin-style simulation in which \(T_{\text{sim}}\) serves as a digital cousin of \(T_{\text{real}}\): it mirrors the real task despite visual discrepancies. In Stage I, we initialize the VLA policy with supervised fine-tuning (SFT) on a mixture of real and simulated data (mixing ratio \(\alpha\)). This quickly injects real-world knowledge and prepares the policy for interaction in simulation. In Stage II, we fine-tune the policy with RL in the simulator to explore and improve performance, while using a real-world SFT loss as a regularizer to prevent forgetting of real-world behaviors.
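The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes \(\alpha\) is the fraction of each Stage I batch drawn from real-world demonstrations, and that Stage II combines the simulated RL loss and the real-world SFT loss as a weighted sum. The function names and the `sft_weight` parameter are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_cotrain_batch(
    real_data: Sequence[T],
    sim_data: Sequence[T],
    alpha: float,
    batch_size: int,
    rng: random.Random,
) -> List[T]:
    """Stage I (sketch): draw one mixed SFT batch.

    Assumption: alpha is the fraction of the batch taken from real-world
    demonstrations; the remainder comes from simulated demonstrations.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_real = round(alpha * batch_size)
    n_sim = batch_size - n_real
    # Sample with replacement so a small real-world dataset can still fill
    # its share of every batch.
    batch = rng.choices(real_data, k=n_real) + rng.choices(sim_data, k=n_sim)
    rng.shuffle(batch)  # interleave real and simulated samples
    return batch


def stage2_loss(rl_loss: float, real_sft_loss: float, sft_weight: float) -> float:
    """Stage II (sketch): RL loss in simulation plus a real-world SFT regularizer.

    Assumption: a plain weighted sum; sft_weight is a hypothetical name.
    """
    return rl_loss + sft_weight * real_sft_loss


rng = random.Random(0)
real = [("real", i) for i in range(10)]   # few real-world demos
sim = [("sim", i) for i in range(1000)]   # many simulated demos
batch = sample_cotrain_batch(real, sim, alpha=0.25, batch_size=8, rng=rng)
print(sum(1 for source, _ in batch if source == "real"))  # 2
```

Sampling with replacement is a design choice for the sketch: real-world demonstrations are usually scarce, so this keeps the real share of each batch fixed at \(\alpha\) regardless of dataset size. In an actual training loop the two loss terms would be differentiable tensors rather than floats.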
Results
Main Results
Success rate (%) on each task.

| VLA Model | Experiment Setting | Pick and Place | Push Cube | Open Drawer | Close Drawer | Avg. |
|---|---|---|---|---|---|---|
| OpenVLA | Real-Only Training | 6.3 ± 0.0 | 20.0 ± 13.3 | 0.0 ± 0.0 | 10.0 ± 10.0 | 16.5 ± 13.3 |
| OpenVLA | SFT Co-Training | 23.4 ± 4.7 | 51.7 ± 5.0 | 0.0 ± 0.0 | 85.0 ± 5.0 | 40.0 ± 3.7 |
| OpenVLA | RL-Co (Ours) | 58.8 ± 10.0 | 68.3 ± 11.7 | 35.0 ± 15.0 | 95.0 ± 5.0 | 64.0 ± 0.7 |
| π₀.₅ | Real-Only Training | 71.9 ± 9.4 | 0.0 ± 0.0 | 0.0 ± 0.0 | 35.0 ± 15.0 | 26.7 ± 1.4 |
| π₀.₅ | SFT Co-Training | 68.8 ± 9.4 | 10.0 ± 3.3 | 10.0 ± 0.0 | 95.0 ± 5.0 | 45.9 ± 4.4 |
| π₀.₅ | RL-Co (Ours) | 81.3 ± 9.4 | 18.4 ± 1.7 | 65.0 ± 5.0 | 100.0 ± 0.0 | 66.2 ± 4.0 |
Ablation Study
Ablation study on simulation SFT initialization. We report the simulation success rate during RL training for models trained with and without simulation SFT initialization. Each configuration is trained with three independent random seeds; curves show the mean success rate, and shaded regions indicate the standard deviation.
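The mean curve and shaded band can be computed per training step across seeds, as in the sketch below. It assumes every seed is evaluated at the same training steps and uses the population standard deviation (`statistics.pstdev`); the paper may use the sample estimate instead. The function name `aggregate_seed_curves` is hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def aggregate_seed_curves(
    curves: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-step mean and standard deviation of success rate across seeds.

    Each inner sequence is one seed's success-rate curve, all sampled at the
    same training steps. Assumption: population std (pstdev), not sample std.
    """
    if len({len(c) for c in curves}) != 1:
        raise ValueError("need at least one curve, all of the same length")
    means: List[float] = []
    stds: List[float] = []
    # zip(*curves) groups the values of all seeds at the same training step.
    for values_at_step in zip(*curves):
        means.append(statistics.fmean(values_at_step))
        stds.append(statistics.pstdev(values_at_step))
    return means, stds


# Three hypothetical seeds, each evaluated at three training steps.
curves = [[0.1, 0.4, 0.7], [0.2, 0.5, 0.8], [0.3, 0.6, 0.9]]
means, stds = aggregate_seed_curves(curves)
# The shaded band spans [mean - std, mean + std] at each step.
band = [(m - s, m + s) for m, s in zip(means, stds)]
```

The curves in this example are made-up numbers for illustration only, not results from the paper.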
Data Efficiency
Effect of the number of real-world demonstrations. We vary the number of real-world demonstrations for the Open Drawer task and evaluate all training paradigms using the \(\pi_{0.5}\) model. Performance is reported as success rate, with shaded regions indicating standard deviation.
Quickstart
Instructions: see the RL-based Sim-Real Co-Training guide.
Citation
@article{shi2026rlinf,
title={Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models},
author={Shi, Liangzhi and Chen, Shuaihang and Gao, Feng and Chen, Yinuo and Chen, Kang and Zhang, Tonghe and Zhang, Hongzhi and Zhang, Weinan and Yu, Chao and Wang, Yu},
journal={arXiv preprint arXiv:2602.12628},
year={2026}
}