GPU Profiling#
Use the cluster.profiling configuration for system-level profiling of Ray
worker processes.
RLinf supports wrapping selected worker groups with a backend-specific profiler
command. The backend is selected by the required backend field:
nsight: NVIDIA Nsight Systems (nsys profile)
rocprof_sys: AMD ROCm Systems Profiler (rocprof-sys-python)
All backends share the same common fields (enabled, worker_groups,
steps, output_dir). Backend-specific options live under their own keys.
How To Enable It#
Add the profiling preset to defaults in your YAML:
defaults:
- training_backend/fsdp@actor.fsdp_config
- weight_syncer/patch_syncer@weight_syncer
- profile/default@cluster.profiling
The corresponding config file is examples/embodiment/config/profile/default.yaml.
To switch backends, override cluster.profiling.backend in your main YAML or on
the Hydra CLI (see the backend-specific sections below).
Common Fields#
These fields apply to every profiling backend:
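| Field | Default | Description |
|---|---|---|
| backend | (required) | Profiling backend: nsight or rocprof_sys. |
| enabled | | Master switch. Set to false to disable profiling while keeping the rest of the config in place. |
| worker_groups | | List of worker group names to profile. |
| steps | | Training step indices to gate profiling around. |
| output_dir | (auto-derived) | Directory for profiling output. When omitted, defaults to runner.logger.log_path/runner.logger.experiment_name/profiling/. |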
The enabled Flag#
cluster:
  profiling:
    backend: nsight
    enabled: false
When enabled: false:
Workers are not wrapped with a profiler command.
No output directory is created.
The rest of the config can stay in place for later reuse.
Output Directory#
By default, reports are written under:
runner.logger.log_path/runner.logger.experiment_name/profiling/
For example:
../results/libero_spatial_ppo_openpi/profiling/
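The default location can be sketched as a simple path join. This is only an illustration of the rule above: the helper name default_profiling_dir and its parameters are hypothetical and are not part of RLinf's API.

```python
import posixpath
from typing import Optional


def default_profiling_dir(
    log_path: str, experiment_name: str, output_dir: Optional[str] = None
) -> str:
    """Return the explicit output_dir if given, else <log_path>/<experiment_name>/profiling."""
    if output_dir:
        # An explicit output_dir overrides the derived location.
        return output_dir
    return posixpath.join(log_path, experiment_name, "profiling")


# Values taken from the example above.
print(default_profiling_dir("../results", "libero_spatial_ppo_openpi"))
```

With the example values this prints ../results/libero_spatial_ppo_openpi/profiling.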
To override, set output_dir explicitly:
cluster:
  profiling:
    backend: nsight
    output_dir: /mnt/public/profiles/my_run
How To Override Worker Groups#
Override worker_groups directly in the main YAML:
cluster:
  profiling:
    backend: nsight
    worker_groups: [ActorGroup, RolloutGroup]
If worker_groups is omitted or null, no worker is profiled. To profile
all workers in a run, list the names of every worker group you care about.
One subtle point: ChannelWorker instances are not children of
ActorGroup or RolloutGroup ranks. Channel.create(name) launches a
separate worker group whose group name is usually Env, Rollout, or
Actor. Profiling ActorGroup does not automatically include the Actor
channel worker. Add those channel group names explicitly if you want channel-side
traces.
For the built-in embodied runners:
ActorGroup: actor compute workers
RolloutGroup: rollout compute workers
EnvGroup: environment compute workers
Actor: the channel worker behind Channel.create("Actor")
Rollout: the channel worker behind Channel.create("Rollout")
Env: the channel worker behind Channel.create("Env")
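The selection behaviour described above can be illustrated with a small sketch. It assumes that groups are matched by exact name, which is an assumption about the matching rule; groups_to_profile is a hypothetical helper, not an RLinf function.

```python
from typing import Iterable, List, Optional


def groups_to_profile(
    launched: Iterable[str], worker_groups: Optional[Iterable[str]]
) -> List[str]:
    """Return the launched groups whose names appear in worker_groups."""
    if not worker_groups:
        # Omitted or null: no worker is profiled.
        return []
    wanted = set(worker_groups)
    return [name for name in launched if name in wanted]


launched = ["ActorGroup", "RolloutGroup", "EnvGroup", "Actor", "Rollout", "Env"]

# Listing ActorGroup does not pull in the "Actor" channel group.
print(groups_to_profile(launched, ["ActorGroup"]))
# Channel-side traces require naming the channel group explicitly.
print(groups_to_profile(launched, ["ActorGroup", "Actor"]))
```

Under this assumption, compute groups and channel groups are independent entries, so each one has to be listed by name.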
How To Profile Only Specific Training Steps#
By default, profiling covers the entire worker lifetime. steps restricts
collection to specific training steps:
cluster:
  profiling:
    backend: nsight
    enabled: true
    steps: [3]  # only profile global step 3
Multiple steps:
cluster:
  profiling:
    steps: [3, 10, 50]
Hydra CLI:
python ... '+cluster.profiling.steps=[3]'
When steps is set:
For the nsight backend, RLinf automatically injects capture-range=cudaProfilerApi and capture-range-end=stop into options.
The embodied runner calls torch.cuda.profiler.start() before each listed step and torch.cuda.profiler.stop() after it.
The resulting trace covers only those steps.
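The gating pattern can be sketched as a plain loop. This is not the embodied runner's actual code: run_with_step_gating is a hypothetical helper, and the start and stop callbacks stand in for torch.cuda.profiler.start and torch.cuda.profiler.stop so that the sketch runs without a GPU. How the runner numbers its steps is also an assumption here.

```python
from typing import Callable, Iterable, List


def run_with_step_gating(
    num_steps: int,
    profile_steps: Iterable[int],
    train_step: Callable[[int], None],
    start: Callable[[], None],
    stop: Callable[[], None],
) -> None:
    """Call start() before and stop() after each step listed in profile_steps."""
    selected = set(profile_steps)
    for step in range(num_steps):
        profiled = step in selected
        if profiled:
            start()
        try:
            train_step(step)
        finally:
            # Close the capture window even if the step raises.
            if profiled:
                stop()


events: List[str] = []
run_with_step_gating(
    num_steps=5,
    profile_steps=[3],
    train_step=lambda s: events.append(f"step {s}"),
    start=lambda: events.append("start"),
    stop=lambda: events.append("stop"),
)
print(events)
```

In a real run on NVIDIA hardware, the two callbacks would be the torch.cuda.profiler functions named above, and the profiler would record only between them.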
NVIDIA: Nsight Systems (backend: nsight)#
Default Preset#
The built-in profile/default preset looks like this:
backend: nsight
enabled: true
worker_groups: [ActorGroup, RolloutGroup, EnvGroup, Actor, Rollout, Env]
options:
  t: cuda,cudnn,cublas,nvtx,osrt
  sample: process-tree
  cpuctxsw: process-tree
  cudabacktrace: all
  osrt-threshold: 1000
flags: []
Overriding nsys profile Options#
options maps to nsys profile flags that take values; flags emits
bare flags:
cluster:
  profiling:
    backend: nsight
    options:
      t: cuda,cudnn,cublas,nvtx,osrt
      sample: process-tree
      backtrace: fp
      capture-range: cudaProfilerApi
      capture-range-end: stop
    flags: [python-backtrace]
Rendering rules:
Single-character keys → -t cuda,...
Multi-character keys → --backtrace=fp
flags entries → --python-backtrace
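The rendering rules can be expressed in a few lines of Python. This is a minimal sketch of the rules as stated, not RLinf's implementation; render_profiler_args is a hypothetical name, and passing the value of a single-character key as a separate argument is an assumption.

```python
from typing import List, Mapping, Sequence


def render_profiler_args(
    options: Mapping[str, object], flags: Sequence[str] = ()
) -> List[str]:
    """Render options and flags into command-line arguments."""
    args: List[str] = []
    for key, value in options.items():
        if len(key) == 1:
            # Single-character key: short option followed by its value.
            args.extend([f"-{key}", str(value)])
        else:
            # Multi-character key: long option in --key=value form.
            args.append(f"--{key}={value}")
    # Bare flags take no value.
    args.extend(f"--{flag}" for flag in flags)
    return args


cmd = ["nsys", "profile"] + render_profiler_args(
    {"t": "cuda,nvtx", "backtrace": "fp"}, ["python-backtrace"]
)
print(" ".join(cmd))
```

For this input the sketch prints nsys profile -t cuda,nvtx --backtrace=fp --python-backtrace.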
Useful options:
t: traced APIs (cuda, cudnn, cublas, nvtx, osrt)
sample: CPU sampling mode
backtrace: CPU backtrace method (lbr, fp, dwarf)
cpuctxsw: CPU thread scheduling trace
cudabacktrace: CUDA API backtraces (adds overhead)
capture-range / capture-range-end: scope collection to NVTX or CUDA profiler API ranges
Compute-Path NVTX Annotations#
RLinf decorates the hot path of actor, rollout, and env workers with
@Worker.timer("..."). The decorator records RLinf timer metrics and opens an
accelerator-specific profiling range through AcceleratorUtil.profiling_range.
For the nsight backend these ranges appear as labelled NVTX intervals in the
timeline and in nsys stats --report nvtx_sum.
The profiling range is emitted only while a profiling window is active. When
profiling is off, Worker.timer still records timing metrics and the
accelerator profiling range falls back to a no-op.
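The "always time, annotate only when profiling" behaviour can be illustrated with a stand-alone decorator. This is a sketch of the pattern only and does not reproduce Worker.timer: the names timer, state, metrics, and fake_range are all hypothetical, and fake_range merely logs push/pop events where a real backend would open an NVTX range (for example via the torch.cuda.nvtx.range context manager).

```python
import time
from contextlib import contextmanager, nullcontext
from functools import wraps

state = {"active": False}  # stand-in for "a profiling window is active"
metrics = {}               # label -> last elapsed time in seconds
range_log = []             # records push/pop events instead of real NVTX calls


@contextmanager
def fake_range(label):
    range_log.append(("push", label))
    try:
        yield
    finally:
        range_log.append(("pop", label))


def timer(label):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            # Open a range only while profiling is active; otherwise use a no-op.
            ctx = fake_range(label) if state["active"] else nullcontext()
            start = time.perf_counter()
            with ctx:
                try:
                    return fn(*args, **kwargs)
                finally:
                    # The timing metric is recorded in both cases.
                    metrics[label] = time.perf_counter() - start
        return wrapper
    return decorator


@timer("demo/phase")
def phase(x):
    return x + 1


phase(1)                 # profiling off: metric recorded, no range
state["active"] = True
phase(2)                 # profiling on: metric recorded and range opened
print(range_log)
```

Only the second call adds push/pop entries to range_log, while metrics is updated by both calls.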
Built-in annotations:
| Worker group | NVTX label | What it covers |
|---|---|---|
| Actor | | Receiving a trajectory batch from the rollout / env side |
| Actor | | Advantage / return computation |
| Actor | | Policy / value optimization step (forward + backward + optimizer) |
| Actor | | Weight broadcast from actor to rollout workers |
| Rollout | | Pulling observations from the env channel |
| Rollout | | Single-step policy forward pass |
| Rollout | | Multi-step generation / unroll |
| Rollout | | A full rollout epoch |
| Rollout | | Sending actions back to env workers |
| Rollout | | Shipping completed trajectories to the actor side |
| Rollout (async) | | Async weight-sync handshake with the actor |
| Env | | Receiving the next-action batch from the rollout side |
| Env | | One simulator step (and the warm-up step at episode start) |
| Env | | The full env interaction loop (and a single sub-iteration of it) |
| Env | | Pushing observations / completed rollouts downstream |
Decorating your own worker method:
from rlinf.scheduler.worker import Worker


class MyWorker(Worker):
    @Worker.timer("my_worker/my_phase")
    def my_phase(self, batch):
        ...
Ad-Hoc Profiling Ranges#
For one-off in-function annotations:
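from rlinf.scheduler.hardware import AcceleratorUtil
from rlinf.scheduler.worker import Worker


class MyWorker(Worker):
    def my_phase(self, batch):
        # Everything inside the with block is reported under this label.
        with AcceleratorUtil.profiling_range(
            self._accelerator_type, "my_worker/inner_phase"
        ):
            run_inner_phase(batch)  # placeholder for your own code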
AcceleratorUtil.profiling_range dispatches to the current accelerator
backend. It is a no-op when the accelerator has no registered profiling range
implementation or when profiling is not active.
AMD: ROCm Systems Profiler (backend: rocprof_sys)#
Minimal Configuration#
Configure rocprof_sys inline in your run config:
backend: rocprof_sys
enabled: true
worker_groups: [ActorGroup, RolloutGroup, EnvGroup]
args:
  T: hip  # trace HIP API calls
RLinf wraps each matching worker’s Python interpreter with:
rocprof-sys-python [args] -- <python_interpreter>
Overriding rocprof-sys-python Arguments#
args maps to rocprof-sys-python flags that take values:
cluster:
  profiling:
    backend: rocprof_sys
    args:
      T: hip,hsa,rccl      # single-char key → -T hip,hsa,rccl
      output-format: json  # multi-char key → --output-format=json
Rendering rules:
Single-character keys → -T hip,hsa,rccl
Multi-character keys → --output-format=json
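Putting the wrapper form and the rendering rules together, the final command line could be assembled roughly as follows. This is an illustrative sketch rather than RLinf's code: build_rocprof_command is a hypothetical helper, and the interpreter path in the example is a placeholder.

```python
import sys
from typing import List, Mapping


def build_rocprof_command(
    args: Mapping[str, object], interpreter: str = sys.executable
) -> List[str]:
    """Build: rocprof-sys-python [args] -- <python_interpreter>."""
    cmd = ["rocprof-sys-python"]
    for key, value in args.items():
        if len(key) == 1:
            cmd.extend([f"-{key}", str(value)])  # e.g. -T hip,hsa,rccl
        else:
            cmd.append(f"--{key}={value}")       # e.g. --output-format=json
    # "--" separates profiler arguments from the wrapped interpreter.
    cmd.extend(["--", interpreter])
    return cmd


print(build_rocprof_command(
    {"T": "hip,hsa,rccl", "output-format": "json"}, "/usr/bin/python3"
))
```

Building the command as a list rather than a single string avoids shell-quoting problems when values contain commas or spaces.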
Injecting Environment Variables#
Use env to pass extra environment variables to profiled workers:
cluster:
  profiling:
    backend: rocprof_sys
    env:
      ROCPROFSYS_SAMPLING_FREQ: "100"
RLinf automatically derives ROCPROFSYS_OUTPUT_PATH and
ROCPROFSYS_OUTPUT_PREFIX from output_dir; values provided in env
take precedence.
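The precedence rule amounts to a dictionary merge in which user-supplied values are applied last. The sketch below assumes nothing about how RLinf derives the two output variables: merge_profiling_env is a hypothetical helper, and the derived values shown are placeholders.

```python
from typing import Dict, Mapping, Optional


def merge_profiling_env(
    derived: Mapping[str, str], user_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Merge derived variables with user-provided ones; user values win."""
    merged = dict(derived)
    merged.update(user_env or {})
    return merged


# Placeholder values; the real ones are derived from output_dir by RLinf.
derived = {
    "ROCPROFSYS_OUTPUT_PATH": "/mnt/public/profiles/my_run",
    "ROCPROFSYS_OUTPUT_PREFIX": "derived_prefix_",
}
user_env = {
    "ROCPROFSYS_SAMPLING_FREQ": "100",
    "ROCPROFSYS_OUTPUT_PREFIX": "custom_",  # overrides the derived prefix
}
env = merge_profiling_env(derived, user_env)
print(env["ROCPROFSYS_OUTPUT_PREFIX"])
```

Here the derived output path is kept, the sampling frequency is added, and the user's prefix replaces the derived one.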
Recommended Workflow#
NVIDIA first pass:
Start with profile/default@cluster.profiling.
Keep enabled: true and use the preset as-is for both CUDA-side and CPU runtime visibility.
Avoid capture-range: nvtx until you have confirmed the target workers emit NVTX ranges.
Use steps: [3] to limit trace size on long runs.
AMD first pass:
Configure rocprof_sys inline:
cluster:
  profiling:
    backend: rocprof_sys
    enabled: true
    worker_groups: [ActorGroup, RolloutGroup, EnvGroup]
    args:
      T: hip
Verify rocprof-sys-python is on PATH in each worker’s environment.
Check output_dir for .json or binary trace files after the run.