RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation#

Overview#

RLinf is a flexible and scalable open-source infrastructure for post-training foundation models via reinforcement learning. It supports reasoning RL (e.g., math with GRPO), embodied RL (e.g., VLAs in simulators), and other scenarios. Built on the macro-to-micro flow transformation (M2Flow) paradigm, RLinf decouples logical workflow programming from execution planning and uses elastic pipelining, context switching, and profiling-guided scheduling to maximize throughput. Evaluations show 1.07×–2.43× end-to-end training speedup over state-of-the-art systems: up to 1.7× in reasoning RL and up to 2.43× in embodied RL.

Results#

We evaluate RLinf on math-reasoning and embodied RL workloads, covering four models of different sizes (Qwen2.5, Qwen3-MoE, OpenVLA, and OpenVLA-OFT), two RL algorithms (GRPO and PPO), and multiple cluster scales.

Math training performance#

RLinf consistently outperforms the state-of-the-art RL systems veRL and Slime by 1.07×–1.70× across a variety of math-reasoning RL settings. The results also show that different RL settings favor different execution modes.
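A speedup range such as the one above is typically obtained by taking the ratio of the system's throughput to the baseline's throughput in each setting, then reporting the minimum and maximum ratio. The sketch below assumes that definition, which this page does not state explicitly. The setting names and throughput values are hypothetical placeholders, not measured results.

```python
def speedup(baseline_throughput: float, system_throughput: float) -> float:
    """Speedup as a throughput ratio (assumed definition)."""
    if baseline_throughput <= 0:
        raise ValueError("baseline_throughput must be positive")
    return system_throughput / baseline_throughput


def speedup_range(settings: dict) -> tuple:
    """Return (min, max) speedup over all settings.

    `settings` maps a setting name to (baseline_throughput, system_throughput).
    """
    ratios = [speedup(base, sys_) for base, sys_ in settings.values()]
    return min(ratios), max(ratios)


# Hypothetical throughputs (arbitrary units), for illustration only.
example_settings = {
    "setting_a": (100.0, 120.0),
    "setting_b": (50.0, 75.0),
}
low, high = speedup_range(example_settings)
print(f"{low:.2f}x to {high:.2f}x")
```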

Dense models#

Throughput (GRPO)

Latency breakdown (GRPO on 7B)

The following figures show performance with the PPO algorithm.

Throughput (PPO)

Latency breakdown (PPO on 7B, 32 GPUs)

MoE models#

For MoE models, we evaluate Qwen3-30B-A3B on 32, 64, and 128 GPUs with a rollout batch size of 1536 and a sequence length of 20480. The following figures show throughput and latency breakdown with the GRPO algorithm.
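To give a sense of the scale of this configuration, the sketch below does the arithmetic on the stated numbers. It rests on two assumptions that this page does not confirm: the rollout batch size counts sequences, and 20480 is the maximum length of each sequence. The even split across GPUs is only a rough illustration, since actual placement depends on the execution mode.

```python
ROLLOUT_BATCH_SIZE = 1536      # from the text; assumed to count sequences
SEQ_LEN = 20480                # from the text; assumed to be the max tokens per sequence
GPU_COUNTS = (32, 64, 128)     # cluster scales from the text


def workload_summary(batch_size: int, seq_len: int, gpu_counts) -> dict:
    """Upper bound on tokens per rollout batch and an even per-GPU split."""
    return {
        "max_tokens": batch_size * seq_len,
        "seqs_per_gpu": {n: batch_size // n for n in gpu_counts},
    }


summary = workload_summary(ROLLOUT_BATCH_SIZE, SEQ_LEN, GPU_COUNTS)
print(summary["max_tokens"])    # 31457280
print(summary["seqs_per_gpu"])  # {32: 48, 64: 24, 128: 12}
```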

Throughput

Latency breakdown (32 GPUs)

Embodied training performance#

ManiSkill and LIBERO#

We evaluate OpenVLA on ManiSkill and OpenVLA-OFT on LIBERO. On LIBERO, we compare RLinf with SimpleVLA-RL (commit d001d), which is built on veRL. On ManiSkill, no distributed RL baseline exists, so we compare RLinf's own execution modes against each other. Training speed is reported in steps/sec, computed as the total number of environment steps in an iteration divided by the iteration time.
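The steps/sec metric described above can be sketched as follows. The decomposition of total environment steps into `num_envs * steps_per_env` is an assumption for illustration, and the function name and example numbers are hypothetical.

```python
def steps_per_second(num_envs: int, steps_per_env: int, iteration_time_s: float) -> float:
    """Training speed: total environment steps divided by iteration time."""
    if iteration_time_s <= 0:
        raise ValueError("iteration_time_s must be positive")
    # Assumed: every parallel environment advances the same number of steps.
    total_env_steps = num_envs * steps_per_env
    return total_env_steps / iteration_time_s


# Hypothetical numbers, for illustration only.
speed = steps_per_second(num_envs=256, steps_per_env=80, iteration_time_s=64.0)
print(f"{speed:.1f} steps/sec")  # 320.0 steps/sec
```

Because the numerator counts environment steps rather than model tokens, this metric can be compared across systems that run the same simulator workload.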

Throughput

Latency breakdown

Model evaluation performance#

The following tables compare models trained with RLinf (RLinf-math) against baseline models on three benchmarks: AIME 24, AIME 25, and GPQA-diamond. The Average column is the mean of the three benchmark scores.

1.5B model results#

Model	AIME 24	AIME 25	GPQA-diamond	Average
DeepSeek-R1-Distill-Qwen-1.5B (base)	28.33	24.90	27.45	26.89
DeepMath-1.5B	37.80	30.42	32.11	33.44
DeepScaleR-1.5B-Preview	40.41	30.93	27.54	32.96
AReaL-1.5B-Preview-Stage-3	40.73	31.56	28.10	33.46
AReaL-1.5B-retrain*	44.42	34.27	33.81	37.50
FastCuRL-1.5B-V3	43.65	32.49	35.00	37.05
RLinf-math-1.5B (HuggingFace)	48.44	35.63	38.46	40.84

* Retrained using default settings for 600 steps.
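As a consistency check on the 1.5B table, the sketch below recomputes each Average as the mean of the three benchmark scores and compares it with the reported value. The scores are copied from the table. The helper names and the rounding tolerance are choices made for this example.

```python
# (AIME 24, AIME 25, GPQA-diamond, reported Average), copied from the 1.5B table.
ROWS_1_5B = {
    "DeepSeek-R1-Distill-Qwen-1.5B (base)": (28.33, 24.90, 27.45, 26.89),
    "DeepMath-1.5B": (37.80, 30.42, 32.11, 33.44),
    "DeepScaleR-1.5B-Preview": (40.41, 30.93, 27.54, 32.96),
    "AReaL-1.5B-Preview-Stage-3": (40.73, 31.56, 28.10, 33.46),
    "AReaL-1.5B-retrain": (44.42, 34.27, 33.81, 37.50),
    "FastCuRL-1.5B-V3": (43.65, 32.49, 35.00, 37.05),
    "RLinf-math-1.5B": (48.44, 35.63, 38.46, 40.84),
}


def mean_score(scores) -> float:
    """Arithmetic mean of the benchmark scores."""
    return sum(scores) / len(scores)


def check_averages(rows: dict, tol: float = 0.005 + 1e-9) -> list:
    """Return the models whose reported average differs from the recomputed mean."""
    mismatches = []
    for model, (*scores, reported) in rows.items():
        if abs(mean_score(scores) - reported) > tol:
            mismatches.append(model)
    return mismatches


print(check_averages(ROWS_1_5B))  # [] (every reported average matches)
```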

7B model results#

Model	AIME 24	AIME 25	GPQA-diamond	Average
DeepSeek-R1-Distill-Qwen-7B (base)	54.90	40.20	45.48	46.86
AReaL-boba-RL-7B	61.66	49.38	46.93	52.66
Skywork-OR1-7B	66.87	52.49	44.43	54.60
Polaris-7B-Preview	68.55	51.24	43.88	54.56
AceMath-RL-Nemotron-7B	67.30	55.00	45.57	55.96
RLinf-math-7B (HuggingFace)	68.33	52.19	48.18	56.23
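The per-benchmark leaders in the 7B table can be read off programmatically. The sketch below copies the scores from the table and picks the top model in each column. The function and variable names are hypothetical.

```python
BENCHMARKS = ("AIME 24", "AIME 25", "GPQA-diamond", "Average")

# Scores copied from the 7B table, in the column order of BENCHMARKS.
ROWS_7B = {
    "DeepSeek-R1-Distill-Qwen-7B (base)": (54.90, 40.20, 45.48, 46.86),
    "AReaL-boba-RL-7B": (61.66, 49.38, 46.93, 52.66),
    "Skywork-OR1-7B": (66.87, 52.49, 44.43, 54.60),
    "Polaris-7B-Preview": (68.55, 51.24, 43.88, 54.56),
    "AceMath-RL-Nemotron-7B": (67.30, 55.00, 45.57, 55.96),
    "RLinf-math-7B": (68.33, 52.19, 48.18, 56.23),
}


def leaders(rows: dict) -> dict:
    """Map each benchmark column to the model with the highest score in it."""
    return {
        bench: max(rows, key=lambda model: rows[model][col])
        for col, bench in enumerate(BENCHMARKS)
    }


for bench, model in leaders(ROWS_7B).items():
    print(f"{bench}: {model}")
# AIME 24: Polaris-7B-Preview
# AIME 25: AceMath-RL-Nemotron-7B
# GPQA-diamond: RLinf-math-7B
# Average: RLinf-math-7B
```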

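RLinf-math achieves the highest average score at both the 1.5B and 7B scales. The 1.5B model leads on all three benchmarks. The 7B model leads on GPQA-diamond and is competitive on AIME 24 and AIME 25, where Polaris-7B-Preview (68.55 vs. 68.33) and AceMath-RL-Nemotron-7B (55.00 vs. 52.19) score highest, respectively.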

Quickstart#

Installation: Installation
Math (reasoning) training: Quickstart 2: GRPO Training of LLMs on MATH
Embodied training: Quickstart 1: PPO Training of VLAs on Maniskill3

Citation#

@article{yu2025rlinf,
  title={RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation},
  author={Yu, Chao and Wang, Yuanqing and Guo, Zhen and Lin, Hao and Xu, Si and Zang, Hongzhi and Zhang, Quanlu and Wu, Yongji and Zhu, Chunyang and Hu, Junhao and others},
  journal={arXiv preprint arXiv:2509.15965},
  year={2025}
}