Release Notes

RLinf v0.2 Release

🎉 Introducing RLinf v0.2.

RLinf v0.2 focuses on two major directions: Real-World RL and Multi-agent RL systems. For the first, RLinf now supports real-world platforms including XSquare Turtle2 Arms and the Franka Arm, offers a richer set of embodied benchmarks, simulators, models, and algorithms, and adds native asynchronous training designed for high-throughput workloads. This release also strengthens real-world deployment, sim-to-real, and co-training workflows, alongside more robust data and replay infrastructure and improved training stability. For multi-agent training, RLinf introduces native multi-agent support with extensible multi-agent RL algorithms and unified data interfaces, lowering the barrier to developing and scaling multi-agent workloads while enabling rapid reproduction of advanced training solutions such as WideSeek-R1.

Embodied Intelligence

  1. Core Capability Upgrades, highlighting Real-World Robotics RL and World Models

  • Supported Real-World RL with XSquare Turtle2.

  • Supported World Models as simulators for RL training, including OpenSora, Wan, and WoVR.

  • Added supervised fine-tuning (SFT) for vision-language models (VLMs), supporting efficient fine-tuning on custom datasets. Verified on the Robo2VLM dataset, with approximately 95% reproduction accuracy for the models covered by PR 708 and PR 781. See SFT VLM.

  • Supported Real2Sim2Real RL training based on GSEnv (ManiSkill-GS).

  • Supported RL-based Sim-Real Co-Training of the π₀.₅ model with Co-training.

  2. Model and Algorithm Ecosystem Expansion

  • Supported Dexbotic models and RL training with Dexbotic.

  • Improved support for IsaacLab, especially for GR00T+IsaacLab.

  • Supported RL training of the openpi model family on RoboTwin 2.0.

  • Supported RL with the CALVIN benchmark.

  • Supported RL with the RoboCasa benchmark.

  • Supported DSRL (Diffusion Steering via Reinforcement Learning) for pi0 with DSRL.

  • Supported SAC training for flow matching policy with SAC-Flow.

  3. Training Infrastructure Enhancements

  • Added a new wrapper in the data layer to support replay buffer collection of real-robot and simulator data as a standard module. Refer to Data collection.

  • Async Training Support introduces asynchronous training as a first-class capability for embodied models, providing asynchronous PPO workflows and usability improvements to boost training efficiency in high-throughput scenarios. Refer to Async PPO.

  • Data and Replay Pipeline Upgrade enhances data collection and replay pipelines, strengthens buffer preloading, updating, and checkpoint handling, and improves overall dataflow robustness. Refer to Replay buffer API.

  • Runtime Performance Optimizations add runtime features such as CUDA Graph, torch.compile, environment offloading, and FSDP path optimizations to improve execution efficiency for embodied training. Refer to YAML configuration.

  4. Stability Improvements and Usability

  • Applied multiple fixes to PPO/GRPO behavior and real-world configuration handling, enhancing training stability and configuration correctness.

  • Added Hugging Face model links to the YAML configuration files for easy downloading.
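The asynchronous training item above decouples rollout collection from policy updates. The following is a minimal sketch of that producer/consumer pattern using only Python's standard `asyncio` module. The names `rollout_worker`, `trainer`, and `run_async_training` are hypothetical and are not RLinf APIs; the sleep call and random reward are stand-ins for environment stepping.

```python
import asyncio
import random


async def rollout_worker(worker_id, queue, num_episodes):
    """Hypothetical rollout worker: produces trajectories without waiting for updates."""
    for episode in range(num_episodes):
        await asyncio.sleep(random.uniform(0.001, 0.005))  # stand-in for environment stepping
        await queue.put({"worker": worker_id, "episode": episode, "reward": random.random()})


async def trainer(queue, batch_size, num_batches):
    """Hypothetical trainer: consumes trajectories as soon as a full batch is available."""
    batches = []
    for _ in range(num_batches):
        batch = [await queue.get() for _ in range(batch_size)]
        batches.append(batch)  # a real trainer would run a policy update here
    return batches


async def run_async_training(num_workers=4, episodes_per_worker=6, batch_size=8):
    total = num_workers * episodes_per_worker
    assert total % batch_size == 0, "sketch assumes episodes divide evenly into batches"
    # A bounded queue applies backpressure, which limits how stale queued data can become.
    queue = asyncio.Queue(maxsize=2 * batch_size)
    workers = [
        asyncio.create_task(rollout_worker(i, queue, episodes_per_worker))
        for i in range(num_workers)
    ]
    batches = await trainer(queue, batch_size, total // batch_size)
    await asyncio.gather(*workers)
    return batches


batches = asyncio.run(run_async_training())
print(f"trained on {len(batches)} batches of {len(batches[0])} trajectories")
```

The bounded queue is one possible design choice: workers block once the trainer falls too far behind, trading some rollout throughput for fresher data.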

Agentic RL

  1. Core Capability Upgrades, highlighting Multi-Agent RL

  • Native Multi-Agent Training Support introduces extensible multi-agent reinforcement learning algorithms and unified data interfaces, significantly reducing the entry barrier for multi-agent tasks. Enables rapid reproduction of complex training solutions such as WideSeek-R1 (PR 824).

  • PPO Support for Reasoning Tasks: PR 771. Extends PPO algorithm support to reasoning tasks, further broadening RLinf’s applicability in complex reasoning and decision‑making scenarios with Reasoning PPO.

  • The Megatron-LM backend now supports the FUSCO communication library: PR 783. Delivers significant performance and scalability improvements for All-to-All communication during MoE model training and inference with FUSCO.

  • Supported agentic reinforcement learning on rStar2 (PR 522) and Search-R1 (PR 639).

Other improvements and bug fixes

  • RLinf refactored init_worker and weight synchronization to improve performance, and added support for agents to compute rewards within the agent loop, eliminating the need for a separate reward worker: PR 524

  • RLinf updated the FSDP backend to support dynamic batch size, data‑parallel load balancing, gradient scaling with fp16 via an unscale_patch, and multi‑bucket weight synchronization: PR 553

  • Resolved dataset reading issues and added batch encoding during tokenizer loading to increase throughput: PR 653

  • Supported using Qdrant as the Wiki server, enabling efficient vector search and storage for wiki documents: PR 673

  • Refactored FSDP precision handling to support only fp16 and bf16, replacing the ambiguous AMP structure while preserving backward compatibility: PR 715

  • Fixed an issue in reasoning tasks where batch counts across ranks became inconsistent due to improper splitting of rollout outputs: PR 775

  • Fixed the epsilon configuration bug in pi models’ training: PR 623

  • Fixed an issue in the openpi model where gradient checkpointing previously had to be disabled manually before training: PR 843

  • Fixed the Docker image for Franka Arm: PR 862

  • Fixed send_num to use env world size instead of rollout world size in SAC actor worker: PR 882

  • Fixed an issue where the second round of rollout would receive random reset_state_ids: PR 886

  • Fixed a bug that caused environment offloading after initialization and ensured actor reserved memory is properly released during rollout: PR 897
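The FSDP item above mentions data-parallel load balancing. As an illustration of the general idea only, the sketch below assigns variable-length sequences to ranks with a greedy longest-first heuristic so that per-rank token totals stay close. The function name `balance_by_tokens` and the sample lengths are hypothetical; this is not RLinf's implementation.

```python
import heapq


def balance_by_tokens(seq_lengths, num_ranks):
    """Greedy longest-first assignment of sequences to data-parallel ranks.

    Returns one list of sequence indices per rank (illustrative heuristic only).
    """
    heap = [(0, rank) for rank in range(num_ranks)]  # (current token load, rank)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    # Placing the longest sequences first keeps the final loads close together.
    order = sorted(range(len(seq_lengths)), key=lambda i: seq_lengths[i], reverse=True)
    for idx in order:
        load, rank = heapq.heappop(heap)  # rank with the smallest load so far
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + seq_lengths[idx], rank))
    return assignment


lengths = [512, 128, 1024, 256, 768, 64, 896, 320]  # hypothetical token counts
assignment = balance_by_tokens(lengths, num_ranks=2)
loads = [sum(lengths[i] for i in rank_indices) for rank_indices in assignment]
print(assignment, loads)
```

This heuristic balances token totals only; the number of sequences per rank can differ, so a system that also needs equal batch counts across ranks would have to enforce that separately.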

Documentation

  • Reorganized the structure of the examples index, classifying examples into embodied scenarios, agentic scenarios, and system-level optimizations.

  • Added the FAQ document for breakpoint debugging.

  • Added an awesome work and adoption section to the README.

Contributors

@Hao Lin @qurakchin @guozhen1997 @zanghz21 @Bo Dai @FxxxxU @Elessar123 @LiuYiwei @Xzxuan @ysa @jzndd @zlock @xusi @Louis-J @WinstonWmj @shiletong @xzxuan @liyanghao @Iron_Wph @chenkang455 @shengyz @yimingzhou2002 @Florielle @xuxin @Yinuo Chen @nufukim @Lin-xs @zhangruize @Iron-Wph @Hongyi Zhu @red0orange @chenkang @hongzhi @thereAreDemonsNearby @Zoran Zhu @Tziy @Yimingzhou2002 @Nan Yang @AIhuaYuan @AIhuayuan @MacBook-M3-Pro @wangxiangyuan @slzhta @fy2462 @Ning Xu @weimingjie @zlockewtg @smallcracker @gongyue teng @cc @Xin Xu @xiebin @yuyingyinya @Yun Liu @Tao Liu @renqian @Wheels Wu @Wheeeeeeeeels @Felix Zhang @pyy233 @LiuZhihao2022

RLinf v0.2 test results

We tested most configuration files in this release to verify the correctness of the provided examples.

| Configuration file | Model name | Result curve |
| --- | --- | --- |
| maniskill_ppo_openpi.yaml | RLinf-Pi0-ManiSkill-25Main-SFT | maniskill_ppo_openpi.yaml result curve |
| maniskill_ppo_openpi_pi05.yaml | RLinf-Pi05-ManiSkill-25Main-SFT | maniskill_ppo_openpi_pi05.yaml result curve |
| maniskill_ppo_openvla.yaml | openvla-7b | maniskill_ppo_openvla.yaml result curve |
| maniskill_ppo_openvlaoft.yaml | Openvla-oft-SFT-libero10-trajall (LORA: RLinf-OpenVLAOFT-ManiSkill-Base-Lora) | maniskill_ppo_openvlaoft.yaml result curve |
| maniskill_ppo_mlp.yaml | None | maniskill_ppo_mlp.yaml result curve |
| maniskill_grpo_openvla.yaml | openvla-7b | maniskill_grpo_openvla.yaml result curve |
| maniskill_grpo_openvlaoft.yaml | Openvla-oft-SFT-libero10-trajall (LORA: RLinf-OpenVLAOFT-ManiSkill-Base-Lora) | maniskill_grpo_openvlaoft.yaml result curve |
| libero_goal_ppo_openpi.yaml | RLinf-Pi0-LIBERO-130-fullshot-SFT | libero_goal_ppo_openpi.yaml result curve |
| libero_goal_ppo_openpi_pi05.yaml | RLinf-Pi05-SFT | libero_goal_ppo_openpi_pi05.yaml result curve |
| calvin_abcd_d_ppo_openpi_pi05.yaml | RLinf-Pi05-CALVIN-ABC-D-SFT | calvin_abcd_d_ppo_openpi_pi05.yaml result curve |
| robotwin_place_empty_cup_ppo_openvlaoft.yaml | RLinf-OpenVLAOFT-RoboTwin-SFT-place_empty_cup | robotwin_place_empty_cup_ppo_openvlaoft.yaml result curve |
| robotwin_beat_block_hammer_grpo_openvlaoft.yaml | RLinf-OpenVLAOFT-RoboTwin-SFT-beat_block | robotwin_beat_block_hammer_grpo_openvlaoft.yaml result curve |
| isaaclab_franka_stack_cube_ppo_gr00t.yaml | RLinf-Gr00t-SFT-Stack-cube | isaaclab_franka_stack_cube_ppo_gr00t.yaml result curve |
| gsenv_ppo_openpi_pi05.yaml | RLinf-Pi05-GSEnv-PutCubeOnPlate-V0-SFT | gsenv_ppo_openpi_pi05.yaml result curve |
| frankasim_ppo_mlp.yaml | RLinf-ResNet10-pretrained | frankasim_ppo_mlp.yaml result curve |
| frankasim_sac_cnn_async.yaml | RLinf-ResNet10-pretrained | frankasim_sac_cnn_async.yaml result curve |
| maniskill_async_ppo_openpi.yaml | RLinf-Pi0-ManiSkill-25Main-SFT | maniskill_async_ppo_openpi.yaml result curve |
| maniskill_async_ppo_openpi_pi05.yaml | RLinf-Pi05-ManiSkill-25Main-SFT | maniskill_async_ppo_openpi_pi05.yaml result curve |
| maniskill_async_ppo_openvla.yaml | openvla-7b | maniskill_async_ppo_openvla.yaml result curve |
| maniskill_async_ppo_openvlaoft.yaml | Openvla-oft-SFT-libero10-trajall | maniskill_async_ppo_openvlaoft.yaml result curve |
| maniskill_sac_mlp.yaml | None | maniskill_sac_mlp.yaml result curve |
| libero_spatial_async_ppo_openpi.yaml | RLinf-Pi0-LIBERO-Spatial-Object-Goal-SFT | libero_spatial_async_ppo_openpi.yaml result curve |
| libero_object_async_ppo_openpi_pi05.yaml | RLinf-Pi05-LIBERO-SFT | libero_object_async_ppo_openpi_pi05.yaml result curve |
| libero_spatial_grpo_openpi_pi05.yaml | RLinf-Pi05-SFT | libero_spatial_grpo_openpi_pi05.yaml result curve |
| libero_10_grpo_openvlaoft.yaml | Openvla-oft-SFT-libero10-traj1 | libero_10_grpo_openvlaoft.yaml result curve |
| opensora_libero_spatial_grpo_openvlaoft.yaml | Openvla-oft-SFT-libero-spatial | opensora_libero_spatial_grpo_openvlaoft.yaml result curve |
| wan_libero_spatial_grpo_openvlaoft.yaml | Openvla-oft-SFT-libero-spatial | wan_libero_spatial_grpo_openvlaoft.yaml result curve |
| examples/sft/config/qwen2_5_sft_vlm.yaml | Qwen/Qwen2.5-VL-3b-Instruct | examples/sft/config/qwen2_5_sft_vlm.yaml result curve |
| examples/sft/config/qwen3_sft_vlm.yaml | Qwen/Qwen3-VL-4b-Instruct | examples/sft/config/qwen3_sft_vlm.yaml result curve |
| examples/reasoning/config/math/qwen2.5-1.5b-ppo-megatron.yaml | Qwen/Qwen2.5-1.5B-Instruct | examples/reasoning/config/math/qwen2.5-1.5b-ppo-megatron.yaml result curve |

RLinf v0.1 Release

🎉 Introducing RLinf v0.1.

Built on robust system-level scheduling and communication components, RLinf is a scalable and flexible framework for post-training via reinforcement learning in embodiment, reasoning, and agent scenarios. The framework has been validated on popular models and tasks and achieves state-of-the-art model performance and training throughput, showcasing its extensibility, versatility, and efficiency in diverse scenarios.

Embodied Intelligence

  • Supported end-to-end embodied RL training on multiple mainstream simulators (e.g., ManiSkill, LIBERO, MetaWorld, CALVIN), achieving state-of-the-art performance with multiple VLA models (e.g., OpenVLA, OpenVLA-OFT, π₀, π₀.₅, GR00T) and algorithms (GRPO, PPO), reaching a success rate of up to 99%.

  • Up to 143.4% faster training (2.434× throughput) compared to existing frameworks, with flexibly allocated, decoupled, and hybrid execution modes, scaling effortlessly to thousands of GPUs.

  • The effectiveness of RL training has been verified in ManiSkill, LIBERO, MetaWorld, and CALVIN, and reproducible best-practice scripts are provided in our embodiment examples.
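The two throughput figures in the list above describe the same measurement: a 2.434× throughput ratio corresponds to training that is 143.4% faster than the baseline. A small sketch of the conversion follows; the helper name `percent_faster` is hypothetical.

```python
def percent_faster(throughput_ratio):
    """Convert a throughput ratio (new / baseline) into a 'percent faster' figure."""
    return (throughput_ratio - 1.0) * 100.0


print(round(percent_faster(2.434), 1))  # 143.4
```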

Agent & Reasoning RL

  • With 1.5B and 7B models, RLinf achieves state-of-the-art results on AIME 24, AIME 25, and GPQA-Diamond benchmarks.

  • Pipeline parallelism (20%+) and automatic scheduling (30%+) deliver significant efficiency improvements while RLinf maintains strong reasoning capabilities.

  • Supported auto scheduling and scaling of Megatron-based training. SGLang/vLLM and Megatron can automatically scale down/up during training to achieve maximum throughput, delivering 40%+ speedup compared to static placement.

  • Introduced the first open-source 1.5B online RL agent, boosting code completion accuracy by 50%+, outperforming even 32B-scale models.