RLinf Documentation
Welcome to RLinf!
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models via reinforcement learning. The ‘inf’ in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.
RLinf is unique with:
Macro-to-Micro Flow: a new paradigm, M2Flow, which executes macro-level logical flows through micro-level execution flows. This decouples logical workflow construction (programmability) from physical communication and scheduling (efficiency).
Flexible Execution Modes
Collocated mode: shares all GPUs across all workers.
Disaggregated mode: enables fine-grained pipelining.
Hybrid mode: a customizable combination of different placement modes, integrating both collocated and disaggregated modes.
Auto Scheduling
Dynamic Scheduling: dynamically schedules resources during training to maximize resource utilization.
Static Scheduling: automatically select the most suitable execution mode based on the training workload, without the need for manual resource allocation.
Embodied Agent Support
Fast adaptation support for mainstream VLA models: OpenVLA, OpenVLA-OFT, π₀, GR00T-N1.5
Support for mainstream CPU & GPU-based simulators via standardized RL interfaces: ManiSkill3, LIBERO, IsaacLab
Enables the first RL fine-tuning of the π₀ model family with a flow-matching action expert.
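To make the difference between the collocated, disaggregated, and hybrid execution modes concrete, the sketch below maps worker groups to GPU ids in plain Python. It is an illustration only and does not use RLinf's API: the function names and the worker names (`rollout`, `inference`, `actor`) are hypothetical, chosen just for this example.

```python
# Illustrative sketch only: these helpers are NOT part of RLinf's API.
# Worker names ("rollout", "inference", "actor") are hypothetical.

def collocated(workers, num_gpus):
    """Collocated mode: every worker group shares all GPUs."""
    all_gpus = tuple(range(num_gpus))
    return {worker: all_gpus for worker in workers}


def disaggregated(gpu_counts):
    """Disaggregated mode: each worker group gets its own disjoint GPU range.

    gpu_counts maps a worker name to the number of GPUs it should own.
    """
    placement = {}
    start = 0
    for worker, count in gpu_counts.items():
        placement[worker] = tuple(range(start, start + count))
        start += count
    return placement


def hybrid(groups):
    """Hybrid mode: workers inside a group are collocated with each other,
    while different groups occupy disjoint GPU ranges.

    groups is a list of (worker_names, gpu_count) pairs.
    """
    placement = {}
    start = 0
    for worker_names, count in groups:
        shared = tuple(range(start, start + count))
        for worker in worker_names:
            placement[worker] = shared
        start += count
    return placement


if __name__ == "__main__":
    print(collocated(["rollout", "actor"], 4))
    print(disaggregated({"rollout": 2, "actor": 2}))
    print(hybrid([(("rollout", "inference"), 4), (("actor",), 4)]))
```

In this sketch, disjoint GPU ranges are what allow different worker groups to run at the same time and be pipelined, while sharing a range means the collocated workers must take turns on the same devices.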
RLinf is fast with:
Hybrid mode with fine-grained pipelining: achieves a 120%+ throughput improvement compared to other frameworks.
Automatic Online Scaling Strategy: dynamically scales training resources, with GPU switching completed within seconds, further improving efficiency by 20–40% while preserving the on-policy nature of RL algorithms.
RLinf is flexible and easy to use with:
Multiple Backend Integrations
FSDP + Hugging Face: rapid adaptation to new models and algorithms, ideal for beginners and fast prototyping.
Megatron + SGLang: optimized for large-scale training, delivering maximum efficiency for expert users with demanding workloads.
Adaptive communication via the asynchronous communication channel
Built-in support for popular RL methods, including PPO, GRPO, DAPO, Reinforce++, and more.
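As a rough illustration of what one of these methods computes, the sketch below shows the group-relative advantage at the core of GRPO: each response sampled for the same prompt is scored against the mean and standard deviation of its own group. This is a minimal standalone example, not RLinf's implementation; the function name, the `eps` parameter, and the use of the population standard deviation are assumptions made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption for this sketch: population standard deviation plus a small
    eps to avoid division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt, with binary correctness rewards.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value network is needed to form the baseline.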
- Tutorials
- Configuration
- Usage and Programming Tutorial
- Embodied Intelligence
- Agentic RL
- Supported RL Algorithms
- Proximal Policy Optimization (PPO)
- Group Relative Policy Optimization (GRPO)
- Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO)
- REINFORCE++
- Soft Actor-Critic (SAC) Algorithm
- Cross-Q
- Reinforcement Learning with Prior Data (RLPD)
- Implicit Q-Learning (IQL) Algorithm
- Async Proximal Policy Optimization (Async PPO)
- Extending the Framework
- Advanced Features
- Release Notes
- Publications
- RLinf-USER: Unified System for Real-world Online Policy Learning
- RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
- Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models
- RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation
- πRL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
- WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL
- WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning