GPU Profiling#

本文介绍 RLinf 中的 cluster.profiling 配置，用于对 Ray worker 进程进行系统级 Profiling。

RLinf 支持使用特定 profiling 工具包装指定的 worker group。通过必填字段 backend 来选择后端：

nsight — NVIDIA Nsight Systems（nsys profile）
rocprof_sys — AMD ROCm Systems Profiler（rocprof-sys-python）

所有后端共享相同的公共字段（enabled、worker_groups、steps、 output_dir）。后端专属选项各自独立。

如何启用#

在 YAML 的 defaults 中引入 profiling 预设：

defaults:
  - training_backend/fsdp@actor.fsdp_config
  - weight_syncer/patch_syncer@weight_syncer
  - profile/default@cluster.profiling

对应的配置文件是 examples/embodiment/config/profile/default.yaml。如需切换后端，在主 YAML 或 Hydra CLI 中覆盖 cluster.profiling.backend 即可（详见下方各后端专属章节）。

公共字段#

以下字段适用于所有 profiling 后端：

字段	默认值	说明
`backend`	(必填)	后端标识：`"nsight"` 或 `"rocprof_sys"`。
`enabled`	`true`	总开关。设置为 `false` 可临时关闭，无需删除配置。
`worker_groups`	`null`	需要 profile 的 worker group 名称列表。`null` 表示不 profile 任何 worker。
`steps`	`null`	限定 profiling 的训练 step 索引。`null` 表示覆盖整个 worker 生命周期。
`output_dir`	(自动推导)	输出目录。省略时默认为 `<log_path>/<experiment_name>/profiling/`。

`enabled` 开关#

cluster:
  profiling:
    backend: nsight
    enabled: false

当 enabled: false 时：

不会用 profiler 命令包装任何 worker。
不会创建输出目录。
其余 profiling 配置可以保留，方便后续再次开启。

输出目录#

默认情况下，报告写入：

runner.logger.log_path/runner.logger.experiment_name/profiling/

例如：

../results/libero_spatial_ppo_openpi/profiling/

如果希望写入固定目录，可以显式设置 output_dir：

cluster:
  profiling:
    backend: nsight
    output_dir: /mnt/public/profiles/my_run

如何覆盖 worker_groups#

在主 YAML 中覆盖 worker_groups：

cluster:
  profiling:
    backend: nsight
    worker_groups: [ActorGroup, RolloutGroup]

如果省略 worker_groups 或设为 null，则不会 profile 任何 worker。

这里有一个容易混淆的点：ChannelWorker 不是 ActorGroup / RolloutGroup 某个 rank 的子进程，而是通过 Channel.create(name) 单独 launch 出来的独立 worker group，名字通常就是 Env、Rollout、Actor。因此只 profile ActorGroup 并不会自动覆盖 Actor 这个 channel worker；如果你想看 channel 本身，需要把这些名字显式加进 worker_groups。

对于内置的具身 runner：

ActorGroup: actor 计算 worker
RolloutGroup: rollout 计算 worker
EnvGroup: env 计算 worker
Actor: 由 Channel.create("Actor") 创建出来的 channel worker
Rollout: 由 Channel.create("Rollout") 创建出来的 channel worker
Env: 由 Channel.create("Env") 创建出来的 channel worker

如何只 Profile 特定训练 step#

默认情况下，profiling 会覆盖 worker 进程整个生命周期。steps 可以把采样窗口收窄到指定的训练 step：

cluster:
  profiling:
    backend: nsight
    enabled: true
    steps: [3]            # 只 profile global step 3

多个 step：

cluster:
  profiling:
    steps: [3, 10, 50]

Hydra CLI：

python ... '+cluster.profiling.steps=[3]'

当 steps 被设置时：

对于 nsight 后端，RLinf 会自动把 capture-range=cudaProfilerApi 和 capture-range-end=stop 注入 options。
具身 runner 会在每个列出的 step 之前调用 torch.cuda.profiler.start()，在该 step 之后调用 torch.cuda.profiler.stop()。
最终的 trace 大小由列出的几个 step 决定，与训练总时长无关。

NVIDIA：Nsight Systems（`backend: nsight`）#

默认预设#

内置的 profile/default 预设如下：

backend: nsight
enabled: true
worker_groups: [ActorGroup, RolloutGroup, EnvGroup, Actor, Rollout, Env]
options:
  t: cuda,cudnn,cublas,nvtx,osrt
  sample: process-tree
  cpuctxsw: process-tree
  cudabacktrace: all
  osrt-threshold: 1000
flags: []

覆盖 `nsys profile` 参数#

options 会被直接映射到带值的 nsys profile 参数，flags 则用于输出裸 flag：

cluster:
  profiling:
    backend: nsight
    options:
      t: cuda,cudnn,cublas,nvtx,osrt
      sample: process-tree
      backtrace: fp
      capture-range: cudaProfilerApi
      capture-range-end: stop
    flags: [python-backtrace]

渲染规则：

单字符 key → -t cuda,...
多字符 key → --backtrace=fp
flags 里的项 → --python-backtrace

常用参数：

t: 需要采集的 API（cuda、cudnn、cublas、nvtx、osrt）
sample: CPU sampling 模式
backtrace: CPU sampling 搭配使用的回溯方式（lbr、fp、dwarf）
cpuctxsw: CPU 线程调度时间线
cudabacktrace: CUDA API 调用栈（会增加 overhead）
capture-range / capture-range-end: 用 NVTX 或 CUDA profiler API 控制采样窗口

Compute 路径上的 NVTX 注解#

RLinf 在 actor、rollout、env worker 的关键方法上使用 @Worker.timer("...")。这个装饰器会记录 RLinf timer 指标，并通过 AcceleratorUtil.profiling_range 打开当前加速器对应的 profiling range。对 nsight 后端来说，这些 range 会在时间线和 nsys stats --report nvtx_sum 中显示为带标签的 NVTX 区间。

profiling range 只会在 profiling 窗口打开时发射。profiling 关闭时， Worker.timer 仍然会记录 timing 指标，而加速器 profiling range 会退化为 no-op。

内置注解一览：

Worker group	NVTX label	覆盖范围
Actor	`actor/recv_traj`	从 rollout / env 侧接收一批 trajectory
Actor	`actor/compute_adv`	Advantage / return 计算
Actor	`actor/run_training`	Policy / value 优化（forward + backward + optimizer）
Actor	`actor/sync_model_to_rollout`	从 actor 向 rollout 广播权重
Rollout	`rollout/recv_obs`	从 env channel 拉 observation
Rollout	`rollout/predict`	单步 policy forward
Rollout	`rollout/generate`	多步 generation / unroll
Rollout	`rollout/generate_epoch`	一个完整 rollout epoch
Rollout	`rollout/send_actions`	向 env 侧回传 action
Rollout	`rollout/send_traj`	把完成的 trajectory 发回 actor 侧
Rollout（async）	`rollout/poll_weight_sync` / `rollout/request_weight_sync`	和 actor 的异步权重同步握手
Env	`env/recv_actions`	从 rollout 侧接收下一批 action
Env	`env/step` / `env/bootstrap_step`	单步仿真器步进（以及 episode 起始的 warm-up step）
Env	`env/interact` / `env/interact_once`	完整的环境交互循环（以及其中的一次子迭代）
Env	`env/send_obs` / `env/send_rollout_trajectories`	向下游发送 observation / 完成的 rollout

在自定义 worker 上加注解：

from rlinf.scheduler.worker import Worker

class MyWorker(Worker):
    @Worker.timer("my_worker/my_phase")
    def my_phase(self, batch):
        ...

Ad-Hoc Profiling Range#

在函数内部临时圈出一段区间：

from rlinf.scheduler.hardware import AcceleratorUtil

class MyWorker(Worker):
    def my_phase(self, batch):
        with AcceleratorUtil.profiling_range(
            self._accelerator_type, "my_worker/inner_phase"
        ):
            run_inner_phase(batch)

AcceleratorUtil.profiling_range 会分发到当前加速器后端。如果该加速器没有注册 profiling range 实现，或当前 profiling 未开启，它会退化为 no-op。

AMD：ROCm Systems Profiler（`backend: rocprof_sys`）#

最小配置#

在运行配置中内联配置 rocprof_sys：

backend: rocprof_sys
enabled: true
worker_groups: [ActorGroup, RolloutGroup, EnvGroup]
args:
  T: hip           # 采集 HIP API 调用

RLinf 会把每个匹配 worker 的 Python 解释器包装成：

rocprof-sys-python [args] -- <python_interpreter>

覆盖 `rocprof-sys-python` 参数#

args 映射到带值的 rocprof-sys-python 参数：

cluster:
  profiling:
    backend: rocprof_sys
    args:
      T: hip,hsa,rccl     # 单字符 key → -T hip,hsa,rccl
      output-format: json  # 多字符 key → --output-format=json

渲染规则：

单字符 key → -T hip,hsa,rccl
多字符 key → --output-format=json

注入环境变量#

使用 env 向被 profile 的 worker 注入额外环境变量：

cluster:
  profiling:
    backend: rocprof_sys
    env:
      ROCPROFSYS_SAMPLING_FREQ: "100"

RLinf 会自动从 output_dir 推导 ROCPROFSYS_OUTPUT_PATH 和 ROCPROFSYS_OUTPUT_PREFIX；env 中显式设置的值优先级更高。