Quickstart 2: Training an LLM for MATH Reasoning with GRPO

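This quickstart walks you through using RLinf to train the DeepSeek-R1-Distill-Qwen-1.5B model on the AReaL-boba math reasoning dataset.

To keep things simple, you can run the scripts below directly on a single GPU to complete the training.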

Dataset Overview

AReaL-boba covers a wide range of math and logical reasoning problems. Here is one example:

Question
--------
What is the unit digit of the product
\[
  (5+1)\,(5^{3}+1)\,(5^{6}+1)\,(5^{12}+1)
\]?
(a) 0   (b) 1   (c) 2   (d) 5   (e) 6
Please reason step-by-step and put your final answer within \boxed{}.

Answer
------
[ "\\boxed{e}" ]
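As a sanity check on the sample above, the following minimal sketch recomputes the unit digit and compares it with the boxed reference answer. The helper names `unit_digit_of_product` and `extract_boxed` are hypothetical and only illustrate the idea; this is not RLinf's reward implementation, and the simple regular expression assumes the boxed content contains no nested braces.

```python
import re


def unit_digit_of_product(factors):
    """Return the last decimal digit of the product of the given integers."""
    digit = 1
    for factor in factors:
        # Only the last digit of each factor affects the last digit of the product.
        digit = (digit * (factor % 10)) % 10
    return digit


def extract_boxed(text):
    """Return the content of the last boxed answer in text, or None if absent.

    Hypothetical helper: nested braces inside the box are not handled.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None


# The four factors from the sample question; each one ends in the digit 6.
factors = [5 + 1, 5**3 + 1, 5**6 + 1, 5**12 + 1]
digit = unit_digit_of_product(factors)

# Answer choices as listed in the question.
choices = {"a": 0, "b": 1, "c": 2, "d": 5, "e": 6}
reference = extract_boxed("\\boxed{e}")

print(digit, reference, choices[reference] == digit)
```

Every power of 5 ends in 5, so each factor ends in 6, and a product of numbers ending in 6 also ends in 6, which matches choice (e).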

Launch Training

Step 1: Download the model and dataset

# Download the model
hf download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--local-dir /path/to/model/DeepSeek-R1-Distill-Qwen-1.5B

# Download the dataset
hf download inclusionAI/AReaL-boba-Data --repo-type=dataset \
--local-dir /path/to/dataset/boba

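Step 2: Edit the configuration file

Before running the script, edit ./examples/reasoning/config/math/qwen2.5-1.5b-single-gpu.yaml to match the paths where you downloaded the model and dataset.

Specifically, point the model settings at the DeepSeek-R1-Distill-Qwen-1.5B checkpoint directory and the data settings at the AReaL-boba-106k.jsonl dataset file. The relevant keys are: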

rollout.model.model_path
data.train_data_paths
data.val_data_paths
actor.tokenizer.tokenizer_model
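The dotted keys above describe nested fields in the YAML file. The sketch below expands them into a nested dictionary to show which values go where. The helper `set_by_dotted_key` is hypothetical and not part of RLinf. Several details are assumptions: the location of AReaL-boba-106k.jsonl inside the download directory, the use of the same file for training and validation, and the use of plain strings as values (the shipped YAML may expect a list of paths for the data fields, so check it before editing).

```python
def set_by_dotted_key(config, dotted_key, value):
    """Set config[a][b][c] = value for a key like 'a.b.c', creating dicts as needed."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Placeholder paths taken from the download commands in Step 1.
MODEL_DIR = "/path/to/model/DeepSeek-R1-Distill-Qwen-1.5B"
# Assumption: the JSONL file sits directly inside the dataset directory.
DATA_FILE = "/path/to/dataset/boba/AReaL-boba-106k.jsonl"

overrides = {
    "rollout.model.model_path": MODEL_DIR,
    "actor.tokenizer.tokenizer_model": MODEL_DIR,
    "data.train_data_paths": DATA_FILE,  # assumption: may need to be a list
    "data.val_data_paths": DATA_FILE,    # assumption: same file for validation
}

config = {}
for key, value in overrides.items():
    set_by_dotted_key(config, key, value)

print(config)
```

Both model-related keys point at the same checkpoint directory because the tokenizer files ship with the model download.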

Step 3: Launch training

After making the changes above, run the following script to start training:

bash examples/reasoning/run_main_grpo_math.sh qwen2.5-1.5b-single-gpu

Viewing Training Results

The final model and metric files are located in: ../results
TensorBoard logs are located in: ../results/grpo-1.5b/tensorboard/. Start TensorBoard as follows:
```
tensorboard --logdir ../results/grpo-1.5b/tensorboard/ --port 6006
```

Once TensorBoard is open, you will see its dashboard. The key metrics we recommend watching are:

rollout/response_length
rollout/reward_scores

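Note

For convenience, the configuration file we provide supports single-GPU training by default. If you have multiple GPUs and want to speed up training, we recommend changing the cluster.component_placement parameter in the configuration file.

Depending on your available resources, set it to 0-1, 0-3, or 0-7 to use 2, 4, or 8 GPUs. See the basic configuration documentation for a more detailed description of Placement configuration. For example, to use 4 GPUs: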

cluster:
  num_nodes: 1
  component_placement:
    actor,rollout,reward: 0-3
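To make the mapping between placement values and GPU counts concrete, here is a minimal sketch that expands a range string into GPU indices. It assumes the value is an inclusive range of GPU indices, which is what the 2/4/8 mapping above implies. The function `parse_gpu_range` is hypothetical and is not RLinf's own parser; treating a bare index such as "0" as a single GPU is also an assumption.

```python
def parse_gpu_range(spec):
    """Expand an inclusive range such as '0-3' into [0, 1, 2, 3].

    A bare index such as '0' is treated as a single GPU (assumption).
    """
    start_text, _, end_text = spec.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start
    if end < start:
        raise ValueError(f"invalid GPU range: {spec!r}")
    return list(range(start, end + 1))


# Number of GPUs implied by each placement value mentioned above.
gpu_counts = {spec: len(parse_gpu_range(spec)) for spec in ("0-1", "0-3", "0-7")}
print(gpu_counts)
```

Because the range is inclusive on both ends, 0-3 covers four devices rather than three.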