RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation#
Paper: arXiv:2509.15965
Overview#
RLinf is a flexible and scalable open-source infrastructure for post-training foundation models via reinforcement learning. It supports reasoning RL (e.g., math with GRPO), embodied RL (e.g., VLAs in simulators), and other scenarios. Built on the macro-to-micro flow transformation (M2Flow) paradigm, RLinf decouples logical workflow programming from execution planning and uses elastic pipelining, context switching, and profiling-guided scheduling to maximize throughput. Evaluations show 1.07×–2.43× end-to-end training speedup over state-of-the-art systems: up to 1.7× in reasoning RL and up to 2.43× in embodied RL.
Results#
We evaluate RLinf extensively on math-reasoning and embodied RL workloads, covering four models of different sizes (Qwen2.5, Qwen3-MoE, OpenVLA, and OpenVLA-OFT), two RL algorithms (GRPO and PPO), and multiple cluster scales.
Math training performance#
RLinf consistently outperforms the state-of-the-art RL systems veRL and Slime by 1.07× to 1.70× across a variety of math-reasoning RL settings. The results also show that different RL settings favor different execution modes.
Dense models#
Figure: Throughput (GRPO)
Figure: Latency breakdown (GRPO on 7B)
The following figures show the performance with the PPO algorithm.
Figure: Throughput (PPO)
Figure: Latency breakdown (PPO on 7B, 32 GPUs)
MoE models#
For MoE models, we evaluate Qwen3-30B-A3B on 32, 64, and 128 GPUs with a rollout batch size of 1536 and a sequence length of 20480. The following figures show the throughput and latency breakdown with the GRPO algorithm.
Figure: Throughput
Figure: Latency breakdown (32 GPUs)
Embodied training performance#
ManiSkill and LIBERO#
We evaluate OpenVLA on ManiSkill and OpenVLA-OFT on LIBERO. On LIBERO, we compare RLinf with SimpleVLA-RL (commit d001d), which is built on veRL. On ManiSkill, no distributed RL baseline exists, so we compare different execution modes of RLinf. Training speed is reported in steps/sec, computed as the total number of environment steps divided by the iteration time.
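The steps/sec metric described above can be sketched in a few lines of Python. This is only an illustration, not RLinf's measurement code: the function name, its parameters, and the assumption that the total step count equals the number of parallel environments times the steps each one takes per iteration are all hypothetical.

```python
def env_steps_per_second(num_envs: int, steps_per_env: int, iteration_time_s: float) -> float:
    """Throughput of one training iteration in environment steps per second.

    Assumption (not from the source): every parallel environment takes the
    same number of steps, so total steps = num_envs * steps_per_env.
    """
    if iteration_time_s <= 0:
        raise ValueError("iteration_time_s must be positive")
    total_env_steps = num_envs * steps_per_env
    return total_env_steps / iteration_time_s


# Made-up numbers for illustration only: 256 environments, 80 steps each,
# and a 40-second iteration give 20480 steps / 40 s.
throughput = env_steps_per_second(num_envs=256, steps_per_env=80, iteration_time_s=40.0)
print(f"{throughput:.1f} steps/sec")
```

Because the denominator is the full iteration time, the metric reflects the end-to-end cost of an iteration and not the simulator's stepping speed alone.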
Figure: Throughput
Figure: Latency breakdown
Model evaluation performance#
The following tables report evaluation performance of models trained with RLinf (and baselines) on math benchmarks. RLinf-math models are trained with RLinf and evaluated on AIME 24, AIME 25, and GPQA-diamond.
1.5B model results#
| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
|---|---|---|---|---|
| | 28.33 | 24.90 | 27.45 | 26.89 |
| | 37.80 | 30.42 | 32.11 | 33.44 |
| | 40.41 | 30.93 | 27.54 | 32.96 |
| | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| | 43.65 | 32.49 | 35.00 | 37.05 |
| RLinf-math-1.5B (HuggingFace) | 48.44 | 35.63 | 38.46 | 40.84 |
* Retrained using default settings for 600 steps.
7B model results#
| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
|---|---|---|---|---|
| | 54.90 | 40.20 | 45.48 | 46.86 |
| | 61.66 | 49.38 | 46.93 | 52.66 |
| | 66.87 | 52.49 | 44.43 | 54.60 |
| | 68.55 | 51.24 | 43.88 | 54.56 |
| | 67.30 | 55.00 | 45.57 | 55.96 |
| RLinf-math-7B (HuggingFace) | 68.33 | 52.19 | 48.18 | 56.23 |
Models trained with RLinf achieve the highest average score at both model sizes. RLinf-math-1.5B leads on all three benchmarks (AIME 24, AIME 25, and GPQA-diamond). RLinf-math-7B leads on GPQA-diamond and on the average, and is within 3 points of the best score on AIME 24 and AIME 25.
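The Average column in the tables above is consistent with the unweighted mean of the three benchmark scores, rounded to two decimals. The source does not state this formula, so treat it as an inference from the reported numbers. The short check below uses the two RLinf-math rows. The helper name `benchmark_average` is hypothetical.

```python
def benchmark_average(aime24: float, aime25: float, gpqa_diamond: float) -> float:
    """Unweighted mean of the three benchmark scores, rounded to two decimals."""
    return round((aime24 + aime25 + gpqa_diamond) / 3, 2)


# (AIME 24, AIME 25, GPQA-diamond, reported Average), copied from the tables.
reported_rows = {
    "RLinf-math-1.5B": (48.44, 35.63, 38.46, 40.84),
    "RLinf-math-7B": (68.33, 52.19, 48.18, 56.23),
}

for model, (aime24, aime25, gpqa, reported) in reported_rows.items():
    computed = benchmark_average(aime24, aime25, gpqa)
    # Compare with a small tolerance to avoid floating-point surprises.
    matches = abs(computed - reported) < 1e-9
    print(f"{model}: computed={computed:.2f} reported={reported:.2f} match={matches}")
```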
Quickstart#
Installation: Installation
Math (reasoning) training: Quickstart 2: GRPO Training of LLMs on MATH
Embodied training: Quickstart 1: PPO Training of VLAs on ManiSkill3
Citation#
@article{yu2025rlinf,
title={RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation},
author={Yu, Chao and Wang, Yuanqing and Guo, Zhen and Lin, Hao and Xu, Si and Zang, Hongzhi and Zhang, Quanlu and Wu, Yongji and Zhu, Chunyang and Hu, Junhao and others},
journal={arXiv preprint arXiv:2509.15965},
year={2025}
}