Switch SGLang Versions

RLinf can plug different generation backends into its reinforcement-learning pipeline. The current release supports SGLang and vLLM.

Note

RLinf is compatible with SGLang 0.4.4 through 0.5.4 and vLLM 0.8.5 through 0.8.5.post1. No manual patching is required: the framework detects the installed version and loads the matching shim automatically.
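Before launching a job, it can help to confirm that the installed backend falls inside the ranges listed in the note above. The following is a minimal sketch, not RLinf's own detection logic: the `SUPPORTED` table, `release_tuple`, `is_supported`, and `installed_version` are hypothetical names introduced here, and the check compares only the numeric release part, so it assumes suffixes such as `.post1` do not change compatibility.

```python
import re
from importlib import metadata

# Inclusive bounds copied from the compatibility note; the table itself is illustrative.
SUPPORTED = {
    "sglang": ((0, 4, 4), (0, 5, 4)),
    "vllm": ((0, 8, 5), (0, 8, 5)),  # 0.8.5.post1 has the same numeric release, 0.8.5
}


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release, e.g. '0.8.5.post1' -> (0, 8, 5)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions such as '0.5' to three components before comparing.
    return parts + (0,) * (3 - len(parts))


def is_supported(package: str, version: str) -> bool:
    """Check a version string against the inclusive range for the package."""
    low, high = SUPPORTED[package]
    return low <= release_tuple(version) <= high


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in SUPPORTED:
        found = installed_version(name)
        if found is None:
            print(f"{name}: not installed")
        elif is_supported(name, found):
            print(f"{name} {found}: inside the supported range")
        else:
            print(f"{name} {found}: outside the supported range")
```

Running the script prints one line per backend, which makes a mismatch visible before the rollout engine starts.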

Installation Requirements

CUDA ≥ 11.8 (or 12.x matching your PyTorch build)
Python ≥ 3.8
Sufficient GPU memory for the chosen model
Compatible versions of PyTorch and transformers

Note

Mismatched CUDA / PyTorch wheels are the most common installation issue. Verify both before installing SGLang.
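One way to verify the pairing is to ask PyTorch which CUDA version its wheel was built against and whether a GPU is visible. This is a small sketch; the helper name `describe_torch_build` is hypothetical, and it returns empty values instead of failing when PyTorch is not installed.

```python
import importlib.util


def describe_torch_build() -> dict:
    """Collect the PyTorch version and the CUDA version it was built against."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment.
        return {"torch": None, "cuda_build": None, "gpu_visible": False}

    import torch

    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only wheels
        "gpu_visible": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If `cuda_build` is `None` or does not match the CUDA toolkit and driver on the machine, fix the PyTorch install first, then install SGLang or vLLM.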

Install via pip

# Reference version
pip install sglang==0.4.4

# Recommended for production
pip install sglang==0.4.8

# Latest supported
pip install sglang==0.5.4

# Install vLLM
pip install vllm==0.8.5

Install from Source

# Install SGLang
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout v0.4.8          # pick the tag you need
pip install -e "python[all]"

# Install vLLM (run from outside the sglang checkout)
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.8.5          # pick the tag you need
pip install -e .

Note

Building from source can be time-consuming and heavy on disk space; prefer the pre-built wheels unless you need bleeding-edge fixes.

....
rollout:
    group_name: "RolloutGroup" # SGLang Generation Group Name, used for communication

    gpu_memory_utilization: 0.55 # SGLang's parameter; fraction of GPU memory (VRAM) reserved for the static memory pool

    model:
        model_path: /model/path # model path
        model_type: qwen2.5 # model type
    enforce_eager: False         # if False, the rollout engine captures CUDA graphs, which makes initialization take longer.
    distributed_executor_backend: mp   # ray or mp
    disable_log_stats: False     # if true will log sglang's output
    detokenize: False            # Whether to detokenize the output. During RL we actually don't need to detokenize it. Can be set to True for debugging.
    padding: null               # defaults to tokenizer.pad_token_id if null. Used to filter out Megatron's padding for the rollout engine.
    eos: null                   # defaults to tokenizer.eos_token_id if null.

    rollout_backend: sglang     # [sglang, vllm] selects which rollout backend to use.

    sglang: # used when rollout_backend is sglang
        attention_backend: triton # [flashinfer, triton]; for more options, see SGLang's documentation
        decode_log_interval: 500000 # the interval for SGLang to log the decode time and other stats.
        use_torch_compile: False # enable torch_compile in SGLang for rollout.
        torch_compile_max_bs: 128 # the maximum batch size for torch compile. If the batch size is larger than this, torch compile will not be used.

    vllm: # used when rollout_backend is vllm
        attention_backend: FLASH_ATTN # [FLASH_ATTN, XFORMERS] attention backend used by vLLM; for more information, see vLLM's documentation
        enable_chunked_prefill: True  # enable chunked prefill in vLLM.
        enable_prefix_caching: True  # enable prefix caching in vLLM.
        enable_flash_infer_sampler: True # if True, vLLM uses FlashInfer for sampling.

    tensor_parallel_size: 1 # tp_size
    pipeline_parallel_size: 1 # pp_size

    validate_weight: False # whether to send all weights at first for weight comparison.
    validate_save_dir: null # directory where the weights are saved for comparison when validate_weight is True.
    print_outputs: False         # whether to print the outputs (token ids, texts, etc.) of rollout engine.

    max_running_requests: 64 # the maximum number of running requests in the rollout engine.
    cuda_graph_max_bs: 128 # the maximum batch size for CUDA graphs. If the batch size is larger than this, CUDA graphs will not be used.

...
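In the configuration above, `rollout_backend` decides which of the `sglang` and `vllm` subsections applies. The sketch below shows that selection on a plain dictionary, such as one produced by loading the YAML with `yaml.safe_load`. It is an illustration only, not RLinf's loader: `ALLOWED_BACKENDS`, `select_backend_options`, and `example_rollout` are hypothetical names, and the example dictionary carries just a few of the keys shown above.

```python
ALLOWED_BACKENDS = ("sglang", "vllm")


def select_backend_options(rollout: dict) -> tuple:
    """Return (backend_name, backend_options) for a parsed `rollout` section."""
    backend = rollout.get("rollout_backend")
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(
            f"rollout_backend must be one of {ALLOWED_BACKENDS}, got {backend!r}"
        )
    # Only the subsection named by rollout_backend is used; the other is ignored.
    options = rollout.get(backend) or {}
    return backend, dict(options)


# A trimmed copy of the rollout section, written as a Python dict.
example_rollout = {
    "rollout_backend": "sglang",
    "gpu_memory_utilization": 0.55,
    "sglang": {"attention_backend": "triton", "use_torch_compile": False},
    "vllm": {"attention_backend": "FLASH_ATTN", "enable_chunked_prefill": True},
}

backend, options = select_backend_options(example_rollout)
print(backend, options)
```

Changing `rollout_backend` to `vllm` makes the same call return the `vllm` subsection, so switching backends is a one-line change in the config.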