Comparison with VeRL#

Last updated: 08/04/2025.

This document is a guide to benchmarking VeRL, covering environment setup, configuration options, and performance results. VeRL is a high-performance framework for training large language models with reinforcement learning algorithms such as GRPO and PPO. Because VeRL currently supports only the collocated mode, we also run RLinf in collocated mode to keep the comparison fair.

Environment Setup#

For streamlined deployment, we recommend using Docker images for training setup. This approach ensures consistent environments and reduces configuration complexity. For detailed environment configuration and alternative installation methods, please refer to the VeRL documentation.

Community Image#

VeRL provides several pre-built Docker images optimized for different inference backends and training configurations:

vLLM with FSDP and Megatron: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1, with Deep-EP support: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1-deepep.
SGLang with FSDP and Megatron: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1 (requires vLLM support, which may cause some package conflicts), with Deep-EP support: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1-deepep.
Preview version of SGLang with FSDP and Megatron, CUDA 12.6: verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1
Preview version of SGLang with FSDP and Megatron, CUDA 12.8: verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.1

Docker Installation and Setup#

Follow these steps to set up your VeRL environment using Docker:

1. Launch Docker Container

# Create and start the container with GPU support
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" \
    --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
docker start verl
docker exec -it verl bash

2. Install VeRL Framework

For pre-built images, install only VeRL without dependencies:

# Install the nightly version (recommended for latest features)
git clone https://github.com/volcengine/verl && cd verl
pip3 install --no-deps -e .

Dataset Preparation#

VeRL requires datasets in Parquet format with a specific schema. The framework expects structured data that includes prompts, ground truth information, and metadata for reward modeling.

Required Data Format:

data = {
    "data_source": data_source,           # Source identifier for the dataset
    "prompt": [                           # Conversation format prompt
        {
            "role": "user",
            "content": question,          # The actual question/prompt
        }
    ],
    "ability": "math",                    # Task category (e.g., "math", "coding", "reasoning")
    "reward_model": {                     # Reward model configuration
        "style": "rule",                  # Reward calculation method
        "ground_truth": solution          # Expected correct answer
    },
    "extra_info": {                       # Additional metadata
        "split": split,                   # Dataset split (train/val/test)
        "index": idx,                     # Sample index
        "answer": answer_raw,             # Raw answer
        "question": question_raw,         # Original question text
    },
}

Data Conversion Tips:

Convert your existing datasets to this format
Ensure all required fields are present
Validate data types and formats before training
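
As a minimal sketch of these tips, the helpers below build one record in the format shown above and check that the required fields are present. The names `make_record` and `validate_record`, the default `data_source` value, and the sample question are hypothetical and not part of VeRL; the sketch also assumes the raw answer and the ground truth are the same string.

```python
REQUIRED_KEYS = ("data_source", "prompt", "ability", "reward_model", "extra_info")


def make_record(question, solution, idx, split="train", data_source="my_math_dataset"):
    """Build one sample in the format shown above (hypothetical helper)."""
    return {
        "data_source": data_source,
        "prompt": [{"role": "user", "content": question}],
        "ability": "math",
        "reward_model": {"style": "rule", "ground_truth": solution},
        "extra_info": {
            "split": split,
            "index": idx,
            "answer": solution,    # assumption: raw answer equals the ground truth
            "question": question,
        },
    }


def validate_record(record):
    """Return a list of problems found in a record; an empty list means it passed."""
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in record]
    prompt = record.get("prompt")
    if not isinstance(prompt, list) or not prompt:
        problems.append("prompt must be a non-empty list of messages")
    elif any(not {"role", "content"} <= set(m) for m in prompt):
        problems.append("each prompt message needs 'role' and 'content'")
    if "ground_truth" not in record.get("reward_model", {}):
        problems.append("reward_model.ground_truth is missing")
    return problems


records = [make_record("What is 2 + 3?", "5", idx=0)]
for rec in records:
    assert validate_record(rec) == []
```

The validated list can then be written to Parquet, for example with `pandas.DataFrame(records).to_parquet("train.parquet")`, assuming pandas and a Parquet engine such as pyarrow are installed.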

Configuration#

VeRL and RLinf differ in many configuration parameters. The example below shows a VeRL launch script, followed by explanations of the key parameters.

Bash example#

set -x
export CUDA_DEVICE_MAX_CONNECTIONS=1

math_train_path=/path/to/dataset/boba.parquet
math_test_path=/path/to/dataset/test_mini.parquet

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files="$math_train_path" \
    data.val_files="$math_test_path" \
    data.train_batch_size=128 \
    data.max_prompt_length=1024 \
    data.max_response_length=27648 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=/path/to/models/DeepSeek-R1-Distill-Qwen-7B \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=4 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=30000 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=30000 \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=30000 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.rollout.n=16 \
    actor_rollout_ref.rollout.temperature=0.6 \
    actor_rollout_ref.rollout.top_k=1000000 \
    actor_rollout_ref.rollout.top_p=1.0 \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","tensorboard"]' \
    trainer.project_name='verl_grpo_boba' \
    trainer.experiment_name='ds_7b_fsdp_sglang' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=8 \
    trainer.val_before_train=False \
    trainer.save_freq=50 \
    trainer.test_freq=-1 \
    trainer.total_epochs=15000 "$@"

Parameter Categories and Explanations#

Batch Size Configuration#

These parameters control how data flows through the training pipeline:

data.train_batch_size: Global training batch size - The global number of prompts processed in one training iteration across all GPUs
actor_rollout_ref.actor.ppo_mini_batch_size: PPO mini-batch size - The global number of prompts used for each gradient update step within a training iteration across all GPUs
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu: Actor micro-batch size - Batch size of samples for one forward_backward pass per GPU
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu: Reference model micro-batch size - Batch size of samples for reference model log prob calculations per GPU
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu: Rollout micro-batch size - Batch size of samples for rollout phase log prob calculations per GPU
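
To make the relationship between these sizes concrete, the sketch below plugs in the values from the Bash example and reads the definitions above literally. It is back-of-the-envelope arithmetic, not a description of how VeRL schedules work internally.

```python
# Values taken from the Bash example above.
train_batch_size = 128     # data.train_batch_size: prompts per iteration (global)
ppo_mini_batch_size = 32   # ppo_mini_batch_size: prompts per gradient update (global)
rollout_n = 16             # rollout.n: responses generated per prompt
n_gpus_per_node = 8        # trainer.n_gpus_per_node
nnodes = 8                 # trainer.nnodes

total_gpus = n_gpus_per_node * nnodes
responses_per_iteration = train_batch_size * rollout_n
updates_per_iteration = train_batch_size // ppo_mini_batch_size
responses_per_update = ppo_mini_batch_size * rollout_n

print(f"GPUs in the job:          {total_gpus}")
print(f"responses per iteration:  {responses_per_iteration}")
print(f"gradient updates per iteration: {updates_per_iteration}")
print(f"responses per gradient update:  {responses_per_update}")
```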

Dynamic Batch Size Management:

actor_rollout_ref.actor.use_dynamic_bsz: Enable dynamic batch sizing for actor training
actor_rollout_ref.actor.ppo_max_token_len_per_gpu: Maximum token count per GPU for actor training
actor_rollout_ref.ref.log_prob_use_dynamic_bsz: Enable dynamic batch sizing for reference model computations
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu: Maximum token count per GPU for reference log prob calculations
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz: Enable dynamic batch sizing for rollout log prob calculations
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu: Maximum token count per GPU for rollout phase
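
The idea behind dynamic batch sizing can be illustrated with a simple greedy grouping: sequences are added to a micro-batch until the next one would exceed the token budget. This is a generic sketch, not VeRL's actual partitioning algorithm; the function name `pack_by_token_budget` and the sample lengths are hypothetical.

```python
def pack_by_token_budget(seq_lens, max_tokens):
    """Greedily group sequence indices so each group's token total is <= max_tokens."""
    batches, current, current_tokens = [], [], 0
    for idx, length in enumerate(seq_lens):
        if length > max_tokens:
            raise ValueError(
                f"sequence {idx} has {length} tokens, over the budget of {max_tokens}"
            )
        # Close the current group when adding this sequence would exceed the budget.
        if current and current_tokens + length > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += length
    if current:
        batches.append(current)
    return batches


# Budget of 30000 tokens, as in ppo_max_token_len_per_gpu in the example above.
lengths = [28672, 1500, 12000, 9000, 8000, 300]
print(pack_by_token_budget(lengths, max_tokens=30000))
```

With a fixed micro-batch size of 1, each of these six sequences would need its own pass; under the token budget, short sequences share a pass while a full-length sequence still fits on its own.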

FSDP (Fully Sharded Data Parallel) Configuration#

FSDP enables training of large models by sharding parameters across multiple GPUs:

actor_rollout_ref.model.use_remove_padding: Remove padding optimization - Eliminates padding tokens to improve computational efficiency and reduce memory usage
actor_rollout_ref.actor.ulysses_sequence_parallel_size: Sequence parallelism size - Number of GPUs to split sequence dimensions across
actor_rollout_ref.model.enable_gradient_checkpointing: Gradient checkpointing - Trade computation for memory by recomputing activations during backward pass

Memory Optimization Options:

actor_rollout_ref.ref.fsdp_config.param_offload: Offload reference model parameters to CPU memory
actor_rollout_ref.actor.fsdp_config.param_offload: Offload actor model parameters to CPU memory
actor_rollout_ref.actor.fsdp_config.optimizer_offload: Offload optimizer states to CPU memory

Model and Algorithm Configuration#

actor_rollout_ref.model.path: Base model path - HuggingFace model path or local directory containing the pre-trained model
actor_rollout_ref.actor.optim.lr: Learning rate - Learning rate for the optimizer
algorithm.adv_estimator: Advantage estimator - Algorithm type; supported values are ["gae", "grpo", "reinforce_plus_plus", "reinforce_plus_plus_baseline", "rloo"]

KL Divergence and Regularization:

actor_rollout_ref.actor.use_kl_loss: Enable KL divergence loss to prevent the model from deviating too far from the reference policy
actor_rollout_ref.actor.kl_loss_coef: KL loss coefficient
actor_rollout_ref.actor.kl_loss_type: Type of KL loss computation ["kl (k1)", "abs", "mse (k2)", "low_var_kl (k3)", "full"]
actor_rollout_ref.actor.entropy_coeff: Entropy coefficient for exploration
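
The k1, k2, and k3 labels refer to commonly used per-token estimators of the KL divergence between the policy and the reference model. The sketch below writes them out for a single token, assuming the standard definitions; VeRL's implementation may differ in details such as clamping, and the "full" variant is omitted because it needs the whole token distribution rather than one log-probability. The function name `kl_penalty_sketch` is hypothetical.

```python
import math


def kl_penalty_sketch(logprob, ref_logprob, kind="low_var_kl"):
    """Per-token KL penalty between the policy and the reference (illustrative)."""
    diff = logprob - ref_logprob          # log(pi / pi_ref) for the sampled token
    if kind == "kl":                      # k1: plain log-ratio, can be negative
        return diff
    if kind == "abs":                     # absolute value of the log-ratio
        return abs(diff)
    if kind == "mse":                     # k2: half the squared log-ratio
        return 0.5 * diff ** 2
    if kind == "low_var_kl":              # k3: r - 1 - log(r) with r = pi_ref / pi
        return math.exp(-diff) - 1.0 + diff
    raise ValueError(f"unsupported kind: {kind}")


for kind in ("kl", "abs", "mse", "low_var_kl"):
    print(kind, kl_penalty_sketch(-1.2, -1.0, kind))
```

The k3 form is never negative and equals zero only when the two log-probabilities match, which is one reason it is a popular choice for a loss term.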

Rollout and Inference Configuration#

actor_rollout_ref.rollout.name: Inference backend - One of ["hf", "sglang", "vllm"]
actor_rollout_ref.rollout.tensor_model_parallel_size: Tensor parallelism - TP size for rollout. Only effective for vllm
actor_rollout_ref.rollout.gpu_memory_utilization: GPU memory usage - Fraction of GPU memory to use for inference
actor_rollout_ref.rollout.n: Samples per prompt - Number of responses to generate for each prompt during rollout

Generation Parameters:

actor_rollout_ref.rollout.temperature: Controls randomness in generation
actor_rollout_ref.rollout.top_k: Top-k sampling parameter
actor_rollout_ref.rollout.top_p: Top-p sampling parameter
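
As an illustration of how these three parameters interact, the sketch below applies temperature scaling, top-k, and top-p filtering to a small list of logits. It follows the usual textbook semantics and is not the implementation used by any particular inference backend; the function name `filtered_distribution` is hypothetical.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]   # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank tokens from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]            # keep only the k most likely tokens

    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:          # smallest prefix whose mass reaches top_p
            break

    mass = sum(probs[i] for i in kept)   # renormalize over the surviving tokens
    return {i: probs[i] / mass for i in kept}


# Settings from the example above: a very large top_k and top_p=1.0 keep every token.
print(filtered_distribution([2.0, 1.0, 0.0], temperature=0.6, top_k=1000000, top_p=1.0))
```

With `top_k=1000000` and `top_p=1.0`, as in the Bash example, neither filter removes any token here, so only the temperature shapes the sampling distribution.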

Training Control Parameters#

trainer.logger: Logging backends - Available options: ["wandb", "mlflow", "swanlab", "vemlp_wandb", "tensorboard", "console", "clearml"]
trainer.project_name: Project name for experiment tracking
trainer.experiment_name: Specific experiment identifier
trainer.n_gpus_per_node: Number of GPUs per compute node
trainer.nnodes: Number of compute nodes in the cluster
trainer.total_epochs: Maximum number of training epochs
trainer.save_freq: Model checkpoint saving frequency (every N steps)
trainer.test_freq: Validation frequency (-1 disables periodic validation)

Multi-Node Training Setup#

For large-scale training across multiple nodes, VeRL uses Ray for distributed coordination. This section covers cluster setup and management.

Ray Cluster Initialization#

Manual Ray Setup:

Start Head Node:

ray start --head --dashboard-host=0.0.0.0

Start Worker Nodes:

ray start --address=<head_node_ip:port>

For detailed multi-node setup instructions, refer to the VeRL Multi-node Documentation.

Automated Ray Cluster Script#

Use this script for automated cluster initialization across multiple nodes:

#!/bin/bash

# Parameter validation
if [ -z "$RANK" ]; then
    echo "Error: RANK environment variable not set!"
    exit 1
fi

# Configuration file path (modify according to actual requirements)
SCRIPT_PATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
REPO_PATH=$(dirname "$SCRIPT_PATH")
RAY_HEAD_IP_FILE=$REPO_PATH/ray_utils/ray_head_ip.txt
RAY_PORT=$MASTER_PORT  # Currently unused: the ray start commands below hard-code port 29500

# Head node startup logic
if [ "$RANK" -eq 0 ]; then
    # Get local IP address (assuming internal network IP)
    IP_ADDRESS=$(hostname -I | awk '{print $1}')
    # Start Ray head node
    echo "Starting Ray head node on rank 0, IP: $IP_ADDRESS"
    # export VLLM_ATTENTION_BACKEND=XFORMERS
    # export VLLM_USE_V1=0
    ray start --head --memory=461708984320 --port=29500

    # Write IP to file
    echo "$IP_ADDRESS" > "$RAY_HEAD_IP_FILE"
    echo "Head node IP written to $RAY_HEAD_IP_FILE"
else
    # Worker node startup logic
    echo "Waiting for head node IP file..."

    # Wait for file to appear (maximum 360 seconds)
    for i in {1..360}; do
        if [ -f "$RAY_HEAD_IP_FILE" ]; then
            HEAD_ADDRESS=$(cat "$RAY_HEAD_IP_FILE")
            if [ -n "$HEAD_ADDRESS" ]; then
                break
            fi
        fi
        sleep 1
    done

    if [ -z "$HEAD_ADDRESS" ]; then
        echo "Error: Could not get head node address from $RAY_HEAD_IP_FILE"
        exit 1
    fi

    echo "Starting Ray worker node connecting to head at $HEAD_ADDRESS"
    # export VLLM_ATTENTION_BACKEND=XFORMERS
    export VLLM_USE_V1=0
    ray start --memory=461708984320 --address="$HEAD_ADDRESS:29500"
fi
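
The worker-side wait in the script above is a simple file-based rendezvous: the head node writes its IP to a file, and each worker polls until the file exists and is non-empty. It therefore assumes the file lives on storage visible to all nodes. The same logic can be sketched in Python as follows; the function name `wait_for_head_address` is hypothetical, and the defaults mirror the 360-second limit and 1-second interval of the script.

```python
import os
import time


def wait_for_head_address(path, timeout_s=360, poll_s=1.0):
    """Poll until `path` exists and is non-empty, then return its stripped contents."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.isfile(path):
            with open(path) as f:
                address = f.read().strip()
            if address:                  # ignore a file that exists but is still empty
                return address
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no head node address found in {path}")
        time.sleep(poll_s)
```

Checking for non-empty contents, as the Bash script also does, avoids reading the file in the brief window after it is created but before the IP is written.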

Benchmark Results#

This section evaluates VeRL on the Boba mathematical reasoning dataset with the DeepSeek-R1-Distill-Qwen-1.5B model. Testing was conducted on Aug 4, 2025.

Both RLinf and VeRL use the parameters below:

Parameter	Value
Model	DeepSeek-R1-Distill-Qwen-1.5B
Dataset	Boba math reasoning dataset
Hardware	1 node × 8 H100 GPUs
Tensor Parallelism	2
Data Parallelism	4
Pipeline Parallelism	1
Context Length	28672
Max Prompt Length	1024
Batch Size Per DP	128
Recompute	20 blocks

The following benchmark results compare RLinf with VeRL. Tests for VeRL were conducted using Commit ID 8fdc4d3 (v0.5.0 release).

In general, for time-related metrics, smaller values are better; for throughput-related metrics, larger values are better; for response length, there is usually no clear conclusion. In the RLinf vs VeRL column of the table below, ↑ means the RLinf value is higher than the VeRL value and ↓ means it is lower, expressed as a percentage of the VeRL value. Improvements of RLinf over VeRL are highlighted in red, while regressions are highlighted in green.

Metric	RLinf	VeRL	RLinf vs VeRL	Unit
response length	13975.00	14254.84		tokens
generation time	266.08	260.92	↑ 1.98%	seconds
prev logprob time	17.78	17.51	↑ 1.54%	seconds
training time	61.12	66.53	↓ 8.13%	seconds
step time	346.33	363.55	↓ 4.74%	seconds
gen throughput	3361.35	3533.27	↓ 4.87%	per-GPU tokens/s
prev logprob throughput	50835.06	52635.84	↓ 3.42%	per-GPU tokens/s
step throughput	19850.13	20022.92	↓ 0.87%	total tokens/s

Note: The RLinf results above do not include ref logprob time.

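In conclusion, overall efficiency is comparable between the two frameworks: RLinf's step time is 4.74% lower and its step throughput is 0.87% lower than VeRL's. The clearest difference is in training time, which RLinf reduces by 8.13% (61.12 s versus 66.53 s).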
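
The RLinf vs VeRL column can be recomputed from the two value columns as the change relative to the VeRL value. The sketch below does this for the reported numbers; the helper name `relative_change` is hypothetical, and results may differ from the table in the last digit because the table was presumably computed from unrounded measurements.

```python
def relative_change(rlinf, verl):
    """Percentage change of the RLinf value relative to the VeRL value."""
    return (rlinf - verl) / verl * 100.0


# (RLinf, VeRL) pairs copied from the benchmark table above.
results = {
    "generation time": (266.08, 260.92),
    "prev logprob time": (17.78, 17.51),
    "training time": (61.12, 66.53),
    "step time": (346.33, 363.55),
    "gen throughput": (3361.35, 3533.27),
    "prev logprob throughput": (50835.06, 52635.84),
    "step throughput": (19850.13, 20022.92),
}

for name, (rlinf, verl) in results.items():
    change = relative_change(rlinf, verl)
    arrow = "↑" if change > 0 else "↓"
    print(f"{name}: {arrow} {abs(change):.2f}%")
```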