Comparison with VeRL#
Last updated: 08/04/2025.
This document describes how to benchmark VeRL, covering environment setup, configuration options, and performance results. VeRL is a high-performance framework for training large language models with reinforcement learning algorithms such as GRPO and PPO. Because VeRL currently supports only collocated mode, we run RLinf in collocated mode as well to keep the comparison fair.
Environment Setup#
For streamlined deployment, we recommend using Docker images for training setup. This approach ensures consistent environments and reduces configuration complexity. For detailed environment configuration and alternative installation methods, please refer to the VeRL documentation.
Community Image#
VeRL provides several pre-built Docker images optimized for different inference backends and training configurations:
- vLLM with FSDP and Megatron: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1; with Deep-EP support: verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.1-deepep.
- SGLang with FSDP and Megatron: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1 (needs vLLM support, but may have some package conflicts); with Deep-EP support: verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.1-deepep.
- Preview version of SGLang with FSDP and Megatron, CUDA 12.6: verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.1
- Preview version of SGLang with FSDP and Megatron, CUDA 12.8: verlai/verl:app-preview-verl0.5-sglang0.4.8-mcore0.12.1
Docker Installation and Setup#
Follow these steps to set up your VeRL environment using Docker:
1. Launch Docker Container
# Create and start the container with GPU support
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" \
--cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
docker start verl
docker exec -it verl bash
2. Install VeRL Framework
For pre-built images, install only VeRL without dependencies:
# Install the nightly version (recommended for latest features)
git clone https://github.com/volcengine/verl && cd verl
pip3 install --no-deps -e .
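Because `--no-deps` skips dependency resolution, it can help to confirm that the packages the image is expected to provide are actually visible inside the container. The following is a minimal sketch using only the Python standard library; the package list is an assumption inferred from the image names above, not an official requirements list.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages):
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}


# Assumed package names; adjust them to the image you actually use.
for name, version in report(["verl", "torch", "vllm", "sglang"]).items():
    print(f"{name:10s} {version or 'NOT INSTALLED'}")
```

A missing entry for the inference backend you plan to use usually means the wrong image tag was pulled.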
Dataset Preparation#
VeRL requires datasets in Parquet format with a specific schema. The framework expects structured data that includes prompts, ground truth information, and metadata for reward modeling.
Required Data Format:
data = {
    "data_source": data_source,  # Source identifier for the dataset
    "prompt": [  # Conversation format prompt
        {
            "role": "user",
            "content": question,  # The actual question/prompt
        }
    ],
    "ability": "math",  # Task category (e.g., "math", "coding", "reasoning")
    "reward_model": {  # Reward model configuration
        "style": "rule",  # Reward calculation method
        "ground_truth": solution  # Expected correct answer
    },
    "extra_info": {  # Additional metadata
        "split": split,  # Dataset split (train/val/test)
        "index": idx,  # Sample index
        "answer": answer_raw,  # Raw answer
        "question": question_raw,  # Original question text
    },
}
Data Conversion Tips:
- Convert your existing datasets to this format.
- Ensure all required fields are present.
- Validate data types and formats before training.
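The tips above can be sketched as a small conversion helper. This is an illustration rather than VeRL's own preprocessing code: `make_record`, `validate_record`, `write_parquet`, and the `my_math_dataset` source name are hypothetical, and the Parquet step assumes pandas and pyarrow are installed.

```python
REQUIRED_TOP_LEVEL = ("data_source", "prompt", "ability", "reward_model", "extra_info")


def make_record(question, solution, idx, split, data_source="my_math_dataset"):
    """Build one sample in the schema shown above (data_source is a placeholder)."""
    return {
        "data_source": data_source,
        "prompt": [{"role": "user", "content": question}],
        "ability": "math",
        "reward_model": {"style": "rule", "ground_truth": solution},
        "extra_info": {"split": split, "index": idx, "answer": solution, "question": question},
    }


def validate_record(record):
    """Raise ValueError if a required field is missing or has the wrong shape."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    prompt = record["prompt"]
    if not isinstance(prompt, list) or not prompt:
        raise ValueError("prompt must be a non-empty list of messages")
    for message in prompt:
        if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
            raise ValueError("each prompt message needs 'role' and 'content'")
    reward = record["reward_model"]
    if not isinstance(reward, dict) or not {"style", "ground_truth"} <= reward.keys():
        raise ValueError("reward_model needs 'style' and 'ground_truth'")


def write_parquet(records, path):
    """Validate and write records to Parquet (needs pandas and pyarrow installed)."""
    import pandas as pd  # imported lazily so the helpers above work without it

    for record in records:
        validate_record(record)
    pd.DataFrame(records).to_parquet(path)
```

For example, `write_parquet([make_record("What is 1 + 1?", "2", 0, "train")], "train.parquet")` would produce a one-row file whose path can then be passed as `data.train_files`.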
Configuration#
VeRL and RLinf differ considerably in how parameters are configured. The example below is a complete launch script, and the sections that follow explain what the main options mean.
Bash example#
set -x
export CUDA_DEVICE_MAX_CONNECTIONS=1
math_train_path=/path/to/dataset/boba.parquet
math_test_path=/path/to/dataset/test_mini.parquet
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files="$math_train_path" \
data.val_files="$math_test_path" \
data.train_batch_size=128 \
data.max_prompt_length=1024 \
data.max_response_length=27648 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=/path/to/models/DeepSeek-R1-Distill-Qwen-7B \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=4 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=30000 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=30000 \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=30000 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=sglang \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.rollout.n=16 \
actor_rollout_ref.rollout.temperature=0.6 \
actor_rollout_ref.rollout.top_k=1000000 \
actor_rollout_ref.rollout.top_p=1.0 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","tensorboard"]' \
trainer.project_name='verl_grpo_boba' \
trainer.experiment_name='ds_7b_fsdp_sglang' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=8 \
trainer.val_before_train=False \
trainer.save_freq=50 \
trainer.test_freq=-1 \
trainer.total_epochs=15000 "$@"
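Every option in the script above is a `key=value` override appended to the `verl.trainer.main_ppo` entry point. When sweeping configurations, it can be convenient to build that argument list programmatically. The sketch below is one way to do so; `build_command` and `format_value` are hypothetical helpers, not part of VeRL.

```python
import json
import shlex


def format_value(value):
    """Render a Python value the way it appears in the shell script above."""
    if isinstance(value, bool):
        return "True" if value else "False"
    if isinstance(value, (list, tuple)):
        # Compact JSON, e.g. ["console","tensorboard"]
        return json.dumps(list(value), separators=(",", ":"))
    return str(value)


def build_command(overrides, module="verl.trainer.main_ppo"):
    """Turn a dict of overrides into an argv list for `python3 -m <module>`."""
    args = ["python3", "-m", module]
    for key, value in overrides.items():
        args.append(f"{key}={format_value(value)}")
    return args


command = build_command(
    {
        "algorithm.adv_estimator": "grpo",
        "data.train_batch_size": 128,
        "data.filter_overlong_prompts": True,
        "trainer.logger": ["console", "tensorboard"],
    }
)
# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run` without going through a shell.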
Parameter Categories and Explanations#
Batch Size Configuration#
These parameters control how data flows through the training pipeline:
- data.train_batch_size: Global training batch size - The global number of prompts processed in one training iteration across all GPUs
- actor_rollout_ref.actor.ppo_mini_batch_size: PPO mini-batch size - The global number of prompts used for each gradient update step within a training iteration across all GPUs
- actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu: Actor micro-batch size - Batch size of samples for one forward_backward pass per GPU
- actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu: Reference model micro-batch size - Batch size of samples for reference model log prob calculations per GPU
- actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu: Rollout micro-batch size - Batch size of samples for rollout phase log prob calculations per GPU
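To make the relationship between these sizes concrete, the sketch below works through the values from the example script (`train_batch_size=128`, `ppo_mini_batch_size=32`, `rollout.n=16`). It assumes that every prompt yields `rollout.n` responses and that all of them are used for training; `batch_plan` is a hypothetical helper.

```python
def batch_plan(train_batch_size, ppo_mini_batch_size, rollout_n):
    """Derive per-iteration counts from the global batch settings."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("train_batch_size should be a multiple of ppo_mini_batch_size")
    return {
        # Responses generated (and trained on) per training iteration.
        "responses_per_iteration": train_batch_size * rollout_n,
        # Gradient update steps per training iteration.
        "updates_per_iteration": train_batch_size // ppo_mini_batch_size,
        # Responses contributing to each gradient update.
        "responses_per_update": ppo_mini_batch_size * rollout_n,
    }


plan = batch_plan(train_batch_size=128, ppo_mini_batch_size=32, rollout_n=16)
print(plan)
```

With these settings, one iteration generates 2048 responses and performs 4 gradient updates of 512 responses each.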
Dynamic Batch Size Management:
- actor_rollout_ref.actor.use_dynamic_bsz: Enable dynamic batch sizing for actor training
- actor_rollout_ref.actor.ppo_max_token_len_per_gpu: Maximum token count per GPU for actor training
- actor_rollout_ref.ref.log_prob_use_dynamic_bsz: Enable dynamic batch sizing for reference model computations
- actor_rollout_ref.ref.log_prob_max_token_len_per_gpu: Maximum token count per GPU for reference log prob calculations
- actor_rollout_ref.rollout.log_prob_use_dynamic_bsz: Enable dynamic batch sizing for rollout log prob calculations
- actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu: Maximum token count per GPU for rollout phase
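The idea behind dynamic batch sizing is to group sequences by a token budget instead of a fixed sample count. The sketch below illustrates that idea with a simple greedy packer. It is not VeRL's actual partitioning algorithm, and `pack_by_token_budget` is a hypothetical name. The budget of 30000 matches the example script, where the longest possible sequence is 1024 + 27648 = 28672 tokens.

```python
def pack_by_token_budget(seq_lengths, max_tokens):
    """Greedily group sequence indices so each group stays within max_tokens."""
    batches, current, used = [], [], 0
    for index, length in enumerate(seq_lengths):
        if length > max_tokens:
            raise ValueError(f"sequence {index} ({length} tokens) exceeds the budget")
        # Start a new micro-batch when the next sequence would overflow this one.
        if current and used + length > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(index)
        used += length
    if current:
        batches.append(current)
    return batches


micro_batches = pack_by_token_budget([20000, 9000, 5000, 28000, 1000], max_tokens=30000)
print(micro_batches)
```

Short sequences share a micro-batch while long ones get their own, which keeps per-GPU memory bounded regardless of how response lengths vary.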
FSDP (Fully Sharded Data Parallel) Configuration#
FSDP enables training of large models by sharding parameters across multiple GPUs:
- actor_rollout_ref.model.use_remove_padding: Remove padding optimization - Eliminates padding tokens to improve computational efficiency and reduce memory usage
- actor_rollout_ref.actor.ulysses_sequence_parallel_size: Sequence parallelism size - Number of GPUs to split sequence dimensions across
- actor_rollout_ref.model.enable_gradient_checkpointing: Gradient checkpointing - Trade computation for memory by recomputing activations during backward pass
Memory Optimization Options:
- actor_rollout_ref.ref.fsdp_config.param_offload: Offload reference model parameters to CPU memory
- actor_rollout_ref.actor.fsdp_config.param_offload: Offload actor model parameters to CPU memory
- actor_rollout_ref.actor.fsdp_config.optimizer_offload: Offload optimizer states to CPU memory
Model and Algorithm Configuration#
- actor_rollout_ref.model.path: Base model path - HuggingFace model path or local directory containing the pre-trained model
- actor_rollout_ref.actor.optim.lr: Learning rate - Learning rate for the optimizer
- algorithm.adv_estimator: Advantage estimator - Algorithm type. Supported values: ["gae", "grpo", "reinforce_plus_plus", "reinforce_plus_plus_baseline", "rloo"]
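The example script selects `grpo` as the advantage estimator. As a rough illustration of what a group-relative advantage looks like, the sketch below normalizes the rewards of the responses sampled for a single prompt. It is a simplified, outcome-level version and not VeRL's implementation; the use of the population standard deviation and the `eps` value are assumptions.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Normalize one prompt's response rewards by the group mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(reward - mean) / (std + eps) for reward in rewards]


# Four responses to the same prompt, scored 1 (correct) or 0 (incorrect).
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses scoring above the group mean receive a positive advantage and those below receive a negative one. When every response in a group gets the same reward, all advantages are zero and the group contributes no policy gradient signal.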
KL Divergence and Regularization:
- actor_rollout_ref.actor.use_kl_loss: Enable KL divergence loss to prevent the model from deviating too far from the reference policy
- actor_rollout_ref.actor.kl_loss_coef: KL loss coefficient
- actor_rollout_ref.actor.kl_loss_type: Type of KL loss computation: ["kl (k1)", "abs", "mse (k2)", "low_var_kl (k3)", "full"]
- actor_rollout_ref.actor.entropy_coeff: Entropy coefficient for exploration
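The k1, k2, and k3 labels refer to common per-token estimators of the KL divergence between the current policy and the reference policy. The sketch below shows their usual scalar forms for a single token. It is an illustration under stated assumptions and not VeRL's code: the clamp range for `low_var_kl` is an assumption, and the `full` variant is omitted because it needs the whole vocabulary distribution.

```python
import math


def kl_estimate(logprob, ref_logprob, kind="low_var_kl"):
    """Per-token KL estimate from the policy and reference log-probabilities."""
    diff = logprob - ref_logprob
    if kind == "kl":  # k1: plain log-ratio, unbiased but high variance
        return diff
    if kind == "abs":  # absolute log-ratio
        return abs(diff)
    if kind == "mse":  # k2: half the squared log-ratio
        return 0.5 * diff * diff
    if kind == "low_var_kl":  # k3: (r - 1) - log r with r = p_ref / p, always >= 0
        value = math.exp(-diff) + diff - 1.0
        return min(max(value, -10.0), 10.0)  # clamp range is an assumption
    raise ValueError(f"unknown kl type: {kind}")


for kind in ("kl", "abs", "mse", "low_var_kl"):
    print(kind, kl_estimate(-0.5, -1.0, kind))
```

All four estimators return zero when the two log-probabilities are equal, and k3 stays non-negative for any input, which is why it is often preferred as a loss term.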
Rollout and Inference Configuration#
- actor_rollout_ref.rollout.name: Inference backend - One of ["hf", "sglang", "vllm"]
- actor_rollout_ref.rollout.tensor_model_parallel_size: Tensor parallelism - TP size for rollout. Only effective for vllm
- actor_rollout_ref.rollout.gpu_memory_utilization: GPU memory usage - Fraction of GPU memory to use for inference
- actor_rollout_ref.rollout.n: Samples per prompt - Number of responses to generate for each prompt during rollout
Generation Parameters:
- actor_rollout_ref.rollout.temperature: Controls randomness in generation
- actor_rollout_ref.rollout.top_k: Top-k sampling parameter
- actor_rollout_ref.rollout.top_p: Top-p sampling parameter
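To show how these three parameters interact, the sketch below applies temperature scaling, top-k truncation, and top-p (nucleus) truncation to a toy set of logits. It is a generic illustration and not the code of any inference backend; treating `top_k <= 0` as "disabled" is a convention of this sketch only.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return renormalized token probabilities after temperature, top-k and top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if top_p < 1.0 and cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# Settings from the example script: effectively no top-k or top-p truncation.
print(filter_distribution([2.0, 1.0, 0.0], temperature=0.6, top_k=1000000, top_p=1.0))
```

With `top_k=1000000` and `top_p=1.0`, as in the example script, no token is filtered out, so sampling is governed by the temperature alone.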
Training Control Parameters#
- trainer.logger: Logging backends - Available options: ["wandb", "mlflow", "swanlab", "vemlp_wandb", "tensorboard", "console", "clearml"]
- trainer.project_name: Project name for experiment tracking
- trainer.experiment_name: Specific experiment identifier
- trainer.n_gpus_per_node: Number of GPUs per compute node
- trainer.nnodes: Number of compute nodes in the cluster
- trainer.total_epochs: Maximum number of training epochs
- trainer.save_freq: Model checkpoint saving frequency (every N steps)
- trainer.test_freq: Validation frequency (-1 disables periodic validation)
Multi-Node Training Setup#
For large-scale training across multiple nodes, VeRL uses Ray for distributed coordination. This section covers cluster setup and management.
Ray Cluster Initialization#
Manual Ray Setup:
Start Head Node:
ray start --head --dashboard-host=0.0.0.0
Start Worker Nodes:
ray start --address=<head_node_ip:port>
For detailed multi-node setup instructions, refer to the VeRL Multi-node Documentation.
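Before launching training, it can be worth checking that every node has joined the cluster. The sketch below compares the GPU count reported by Ray with the `trainer.nnodes` and `trainer.n_gpus_per_node` values you intend to use. `cluster_is_ready` and `check_live_cluster` are hypothetical helpers; the latter assumes Ray is installed and a cluster is already running on the current machine.

```python
def cluster_is_ready(resources, nnodes, n_gpus_per_node):
    """Return True when the resource dict reports at least the expected GPUs."""
    expected_gpus = nnodes * n_gpus_per_node
    return resources.get("GPU", 0) >= expected_gpus


def check_live_cluster(nnodes, n_gpus_per_node):
    """Connect to the running Ray cluster and verify its GPU count."""
    import ray  # imported lazily; requires Ray and a started head node

    ray.init(address="auto")  # attach to the existing cluster
    resources = ray.cluster_resources()
    print(f"Cluster resources: {resources}")
    return cluster_is_ready(resources, nnodes, n_gpus_per_node)


# Offline example with a hand-written resource dict (8 nodes x 8 GPUs).
print(cluster_is_ready({"CPU": 512.0, "GPU": 64.0}, nnodes=8, n_gpus_per_node=8))
```

On the head node, calling `check_live_cluster(8, 8)` after all workers have started should return `True` for the example configuration.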
Automated Ray Cluster Script#
Use this script for automated cluster initialization across multiple nodes:
#!/bin/bash
# Parameter validation
if [ -z "$RANK" ]; then
    echo "Error: RANK environment variable not set!"
    exit 1
fi
# Configuration file path (modify according to actual requirements).
# The file must live on a filesystem shared by all nodes.
SCRIPT_PATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
REPO_PATH=$(dirname "$SCRIPT_PATH")
RAY_HEAD_IP_FILE="$REPO_PATH/ray_utils/ray_head_ip.txt"
RAY_PORT=29500 # Port the Ray head node listens on, can be modified as needed
# Head node startup logic
if [ "$RANK" -eq 0 ]; then
    # Get local IP address (assuming internal network IP)
    IP_ADDRESS=$(hostname -I | awk '{print $1}')
    # Start Ray head node
    echo "Starting Ray head node on rank 0, IP: $IP_ADDRESS"
    # export VLLM_ATTENTION_BACKEND=XFORMERS
    # export VLLM_USE_V1=0
    ray start --head --memory=461708984320 --port="$RAY_PORT"
    # Write IP to file
    echo "$IP_ADDRESS" > "$RAY_HEAD_IP_FILE"
    echo "Head node IP written to $RAY_HEAD_IP_FILE"
else
    # Worker node startup logic
    echo "Waiting for head node IP file..."
    # Wait for file to appear (maximum 360 seconds)
    for i in {1..360}; do
        if [ -f "$RAY_HEAD_IP_FILE" ]; then
            HEAD_ADDRESS=$(cat "$RAY_HEAD_IP_FILE")
            if [ -n "$HEAD_ADDRESS" ]; then
                break
            fi
        fi
        sleep 1
    done
    if [ -z "$HEAD_ADDRESS" ]; then
        echo "Error: Could not get head node address from $RAY_HEAD_IP_FILE"
        exit 1
    fi
    echo "Starting Ray worker node connecting to head at $HEAD_ADDRESS"
    # export VLLM_ATTENTION_BACKEND=XFORMERS
    export VLLM_USE_V1=0
    ray start --memory=461708984320 --address="$HEAD_ADDRESS:$RAY_PORT"
fi
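The worker branch of the script above is a file-based rendezvous: each worker polls a shared file until the head node has written its IP address. The same logic can be sketched in Python for launchers that are not shell based. `wait_for_head_address` is a hypothetical helper, and the defaults mirror the script's 360 one-second attempts.

```python
import time
from pathlib import Path


def wait_for_head_address(path, timeout_s=360.0, poll_s=1.0):
    """Poll `path` until it holds a non-empty address, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    file_path = Path(path)
    while True:
        if file_path.is_file():
            address = file_path.read_text().strip()
            if address:  # ignore a file that exists but is still empty
                return address
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no head address in {file_path} after {timeout_s} s")
        time.sleep(poll_s)
```

A worker would call `wait_for_head_address(...)` with the shared file path and then pass the returned address, together with the chosen port, to `ray start --address=...`.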
Benchmark Results#
This section evaluates VeRL on the Boba mathematical reasoning dataset with the DeepSeek-R1-Distill-Qwen-1.5B model. Tests were conducted on Aug 4, 2025.
Both RLinf and VeRL use the parameters below:
| Params | Value |
|---|---|
| Model | DeepSeek-R1-Distill-Qwen-1.5B |
| Dataset | Boba math reasoning dataset |
| Hardware | 1 node × 8 H100 GPUs |
| Tensor Parallelism | 2 |
| Data Parallelism | 4 |
| Pipeline Parallelism | 1 |
| Context Length | 28672 |
| Max Prompt Length | 1024 |
| Batch Size Per DP | 128 |
| Recompute | 20 blocks |
The following benchmark results compare RLinf with VeRL. Tests for VeRL were conducted using Commit ID 8fdc4d3 (v0.5.0 release).
In general, for time-related metrics, smaller values are better; for throughput-related metrics, larger values are better; and for response length, there is usually no clear conclusion. In the table below, improvements of RLinf over VeRL are highlighted in red, while regressions are highlighted in green.
| Metric | RLinf | VeRL | RLinf vs VeRL | Unit |
|---|---|---|---|---|
| response length | 13975.00 | 14254.84 | | tokens |
| generation time | 266.08 | 260.92 | ↑ 1.98% | seconds |
| prev logprob time | 17.78 | 17.51 | ↑ 1.54% | seconds |
| training time | 61.12 | 66.53 | ↓ 8.13% | seconds |
| step time | 346.33 | 363.55 | ↓ 4.74% | seconds |
| gen throughput | 3361.35 | 3533.27 | ↓ 4.87% | per-GPU tokens/s |
| prev logprob throughput | 50835.06 | 52635.84 | ↓ 3.42% | per-GPU tokens/s |
| step throughput | 19850.13 | 20022.92 | ↓ 0.87% | total tokens/s |
Note: The RLinf results above do not include ref logprob time.
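The "RLinf vs VeRL" column appears to be the relative change of the RLinf value with respect to the VeRL value. The sketch below recomputes it from the rounded numbers in the table, so the last digit can differ slightly from the published column; `relative_change_pct` is a hypothetical helper.

```python
def relative_change_pct(rlinf, verl):
    """Relative change of the RLinf value with respect to VeRL, in percent."""
    return (rlinf - verl) / verl * 100.0


# (RLinf, VeRL) pairs copied from the results table.
RESULTS = {
    "generation time": (266.08, 260.92),
    "prev logprob time": (17.78, 17.51),
    "training time": (61.12, 66.53),
    "step time": (346.33, 363.55),
    "gen throughput": (3361.35, 3533.27),
    "prev logprob throughput": (50835.06, 52635.84),
    "step throughput": (19850.13, 20022.92),
}

for metric, (rlinf, verl) in RESULTS.items():
    print(f"{metric:24s} {relative_change_pct(rlinf, verl):+.2f}%")
```

A positive value means the RLinf number is larger than the VeRL number, which is worse for time metrics and better for throughput metrics.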
In conclusion, overall training efficiency is comparable, and RLinf reduces training time by 8.13% relative to VeRL (61.12 s vs. 66.53 s).