Reinforcement Learning Training for rStar2#

Multi-turn RL combined with tool invocation has been proven to extend the interaction boundaries of Large Language Models (LLMs) to the real world. This document introduces how to reproduce the experiments from the paper rStar2-Agent: Agentic Reasoning Technical Report under the RLinf framework, using Reinforcement Learning (RL) to train Large Language Models (LLMs) to answer questions by invoking code execution tools.

Environment#

RLinf Environment#

For RLinf environment configuration, please refer to RLinf Installation

Code Judge Runtime Environment#

We use the code judge tool from the rStar2 example. For installation instructions, refer to rStar2 & veRL-SGLang

cd examples/agent/rstar2

# install code judge
sudo apt-get update -y && sudo apt-get install redis -y
git clone https://github.com/0xWJ/code-judge
pip install -r code-judge/requirements.txt
pip install -e code-judge

# install rstar2_agent requirements
pip install -r requirements.txt

cd code-judge

Code Judge Server Setup#

rStar2-Agent uses Code Judge as a tool invocation server to execute Python code generated by the model.

1. Start Redis Server

sudo apt-get update -y && sudo apt-get install redis -y
redis-server --daemonize yes --protected-mode no --bind 0.0.0.0

2. Start Code Judge Server

# Start the main server (master node only)
# Environment variables can be configured as per: https://github.com/0xWJ/code-judge/blob/main/app/config.py
# Replace $WORKSPACE and $MASTER_ADDR with your actual paths

tmux new-session -d -s server \
'cd $WORKSPACE/examples/agent/rstar2/code-judge && \
   MAX_EXECUTION_TIME=4 \
   REDIS_URI="redis://$MASTER_ADDR:6379" \
   RUN_WORKERS=0 \
   uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 16 \
   2>&1 | tee server.log'

3. Start Code Judge Workers

# Launch workers (can be deployed on multiple nodes for increased parallelism)
# Adjust MAX_WORKERS based on your CPU count per node

tmux new-session -d -s worker \
'cd $WORKSPACE/examples/agent/rstar2/code-judge && \
   MAX_EXECUTION_TIME=4 \
   REDIS_URI="redis://$MASTER_ADDR:6379" \
   MAX_WORKERS=64 \
   python run_workers.py \
   2>&1 | tee worker.log'

Reward Computation Tool#

We use Math-Verify to assist in reward computation, which needs to be installed via pip

pip install math-verify

We also use simple rules to ensure the correctness of reward calculation, which requires installing dependencies.

pip install sympy
pip install pylatexenc

Training on 8*H100#

Download the training dataset via examples/agent/rstar2/data_process/process_train_dataset.py and write the path to examples/agent/rstar2/config/rstar2-qwen2.5-7b-megatron.yaml

data:
  # ...
  train_data_paths: ["/path/to/train.jsonl"]
  val_data_paths: ["/path/to/train.jsonl"]

Modify the rollout.model.model_path path in examples/agent/rstar2/config/rstar2-qwen2.5-7b-megatron.yaml

rollout:
  group_name: "RolloutGroup"

  gpu_memory_utilization: 0.5
  model:
    model_path: /path/to/model/Qwen2.5-7B-Instruct
    model_type: qwen2.5

Since the down sample logic is not compatible with the current inference logic, recompute_logprobs should be set to False

algorithm:
   # ...
   recompute_logprobs: False
   shuffle_rollout: False

Start Training#

Run examples/agent/rstar2/run_rstar2.sh to start training.

Training Curves#

Below shows the comparison of reward curves and response length curves between RLinf and Verl.

Qwen2.5-7B-Instruct in RLinf

Qwen2.5-7B-Instruct in RLinf#

Qwen2.5-7B-Instruct in Verl

Qwen2.5-7B-Instruct in Verl#

* We retrain the model using the default settings for steps.

7 B model results#

Engine

AIME 24

AIME 25

Math 500

Average

RLinf

33.65

24.11

79.60

45.79

Verl

31.77

25.94

76.20

44.64

References#