Checkpoint Resume
Unexpected events, such as network errors, power loss, or node pre-emptions, can
interrupt a long-running distributed job.
To tackle this challenge, RLinf saves a full checkpoint every runner.save_interval steps and lets
you resume from the most recent snapshot with minimal loss of work.
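As a rough illustration of what the interval means in practice, the sketch below lists the steps at which a checkpoint would exist and how many completed steps would have to be repeated after a failure. It assumes a checkpoint is written whenever the step count reaches a multiple of save_interval; both helper names are hypothetical and not part of RLinf.

```python
def checkpoint_steps(total_steps: int, save_interval: int) -> list[int]:
    """Steps at which a checkpoint would exist (assumption: one save
    every time the step count reaches a multiple of save_interval)."""
    if save_interval <= 0:
        raise ValueError("save_interval must be positive")
    return list(range(save_interval, total_steps + 1, save_interval))


def steps_to_redo(completed_steps: int, save_interval: int) -> int:
    """Completed steps that are lost when resuming from the newest checkpoint."""
    if save_interval <= 0:
        raise ValueError("save_interval must be positive")
    return completed_steps % save_interval


# With save_interval: 50, a job interrupted after 137 completed steps
# resumes from step 100 and repeats 37 steps.
print(checkpoint_steps(137, 50))  # [50, 100]
print(steps_to_redo(137, 50))     # 37
```

A smaller interval reduces the work repeated after a failure, at the cost of writing checkpoints more often.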
Checkpoint layout
Assume the following YAML fragment:
runner:
  task_type: math
  logger:
    log_path: ${runner.output_dir}/${runner.experiment_name}
    project_name: rlinf
    experiment_name: ${runner.experiment_name}
  save_interval: 50
  experiment_name: grpo-1.5b
  output_dir: ./logs
With Megatron as the training backend, checkpoints are written under output_dir/experiment_name/checkpoints/.
With FSDP/FSDP2 as the training backend, they are written under log_path/experiment_name/checkpoints/.
Megatron Checkpoints
A Megatron checkpoint has the following file structure:
logs/grpo-1.5b/checkpoints/
├── global_step_50/
│   ├── actor/
│   │   ├── iter_0000050/
│   │   │   ├── mp_rank_00/
│   │   │   │   ├── distrib_optim.pt
│   │   │   │   └── model_optim_rng.pt
│   │   │   └── mp_rank_01/
│   │   │       ├── distrib_optim.pt
│   │   │       └── model_optim_rng.pt
│   │   └── latest_checkpointed_iteration.txt
│   └── data/
│       └── data.pt
└── global_step_100/
    └── …
Key points
- Sharded weights: files inside mp_rank_* follow the Megatron tensor-parallel layout; each GPU only reloads its own slice.
- Optimizer / RNG state: both the Adam parameters (distrib_optim.pt) and the random-number generators are captured, guaranteeing bit-for-bit reproducibility after resume.
- Data sampler: data.pt stores the dataloader state, so no samples are skipped or repeated.
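Before resuming, it can be useful to check that a Megatron checkpoint directory is complete. The following is a minimal sketch based only on the layout shown above; the function name is hypothetical, and the list of expected files simply mirrors the tree in this section rather than any RLinf or Megatron API.

```python
from pathlib import Path

# File names taken from the directory tree above.
EXPECTED_RANK_FILES = ("distrib_optim.pt", "model_optim_rng.pt")


def missing_megatron_files(step_dir):
    """Return relative paths that the layout above leads us to expect
    but that are absent under one global_step_N directory."""
    step_dir = Path(step_dir)
    missing = []

    # Dataloader state and the iteration tracker sit at fixed locations.
    for rel in ("data/data.pt", "actor/latest_checkpointed_iteration.txt"):
        if not (step_dir / rel).is_file():
            missing.append(rel)

    # Every mp_rank_* directory should hold both per-rank files.
    rank_dirs = sorted((step_dir / "actor").glob("iter_*/mp_rank_*"))
    if not rank_dirs:
        missing.append("actor/iter_*/mp_rank_*")
    for rank_dir in rank_dirs:
        for name in EXPECTED_RANK_FILES:
            if not (rank_dir / name).is_file():
                missing.append((rank_dir / name).relative_to(step_dir).as_posix())
    return missing
```

For example, `missing_megatron_files("logs/grpo-1.5b/checkpoints/global_step_50")` returns an empty list when every file from the tree is present. This only checks that the files exist, not that their contents are valid.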
FSDP/FSDP2 Checkpoint
An FSDP/FSDP2 checkpoint has the following file structure:
experiment_name/checkpoints/
├── global_step_10/
│   └── actor/
│       ├── dcp_checkpoint/
│       │   ├── __0_0.distcp
│       │   ├── __1_0.distcp
│       │   ├── __2_0.distcp
│       │   └── __3_0.distcp
│       └── model_state_dict/
│           └── full_weigths.pt
└── global_step_20/
    └── …
FSDP/FSDP2 saves and loads checkpoints via DCP (torch.distributed.checkpoint), resulting in a set of distributed checkpoint files (.distcp). Each file contains a slice of model parameters, optimizer state, and RNG state.
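To see how the shards are distributed, the sketch below groups the .distcp files in a dcp_checkpoint/ directory by the first number in their names. It assumes names of the form `__<rank>_<index>.distcp`, as in the tree above, with the first number taken to be the rank that wrote the file. The function name is hypothetical, and the sketch only inspects file names; it does not load any tensors.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: __<rank>_<index>.distcp
_SHARD_NAME = re.compile(r"^__(\d+)_(\d+)\.distcp$")


def shards_by_rank(dcp_dir):
    """Map writer rank -> sorted list of shard file names."""
    grouped = defaultdict(list)
    for path in Path(dcp_dir).iterdir():
        match = _SHARD_NAME.match(path.name)
        if path.is_file() and match:
            grouped[int(match.group(1))].append(path.name)
    return {rank: sorted(names) for rank, names in sorted(grouped.items())}
```

For the tree above, the result would have one entry for each of ranks 0 to 3. A rank missing from the result suggests that the checkpoint was not fully written.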
Resuming training
1. Choose the latest checkpoint
   If global_step_10/ is the highest-numbered directory, it is the newest snapshot.
2. Edit the YAML
   runner:
     resume_dir: ${runner.output_dir}/${runner.experiment_name}/checkpoints/global_step_10
3. Relaunch exactly as before
   Start Ray, then run the same run_main_*.sh launcher. RLinf will automatically detect the resume_dir and:
   - Restore model shards, optimizer, RNG and dataloader state on every node/rank.
   - Continue step counting from global_step_10; your next saved checkpoint will be global_step_20 (because save_interval is 10 in this example).
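Picking the newest snapshot by hand is easy to get wrong once step numbers have different lengths. The helper below is a minimal sketch of that first step, assuming only the global_step_N directory naming shown in this document; the function name is hypothetical and not part of RLinf.

```python
import re
from pathlib import Path

_STEP_DIR = re.compile(r"^global_step_(\d+)$")


def latest_checkpoint(checkpoints_dir):
    """Return (step, path) for the highest-numbered global_step_N
    directory, or None if the directory holds no such snapshot."""
    best = None
    for child in Path(checkpoints_dir).iterdir():
        match = _STEP_DIR.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            # Compare numerically: as strings, "global_step_50" would
            # sort after "global_step_100".
            if best is None or step > best[0]:
                best = (step, child)
    return best
```

The returned path is the value to put in runner.resume_dir.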
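Tip
To verify resumption, check the step number in the training log. If the first training step after the relaunch continues from the resumed checkpoint (just past step 10 in this example) instead of restarting from the beginning, the resume is working.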