Weight Synchronization#
Use the weight_syncer mechanism to optimize weight synchronization from the
actor side to the rollout-side policy model in embodied training, reducing
the communication and loading overhead after each parameter update.
At the moment, this capability is mainly intended for the
FSDP actor + HuggingFace rollout path used by
examples/embodiment/train_embodied_agent.py and
examples/embodiment/train_async.py.
Why Weight Syncer Exists#
In embodied RL, every actor update usually needs to be synchronized to the rollout workers. For large models such as OpenPI, OpenVLA, OpenVLA-OFT, and GR00T, full-weight synchronization can become expensive:
The model is large, so full sync can easily become a major part of step time.
Repeatedly loading a full state_dict on the rollout side also adds GPU and CPU overhead.
In async settings, blocking full sync directly hurts rollout throughput and policy freshness.
To address this, RLinf abstracts the logic into a unified WeightSyncer
interface so different synchronization strategies can share the same sender /
receiver workflow.
Core Interface#
WeightSyncer has four main responsibilities:
init_sender(...): one-time sender-side initialization
init_receiver(...): one-time receiver-side initialization
sync(...): send the current version of model weights
apply(...): receive and apply weights, then return the applied version
This means rollout code does not need to care whether the underlying mechanism
is patch-based sync or bucket-based sync. After initialization, it only needs to
call apply(...) through the common interface.
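The four responsibilities above can be pictured with a minimal sketch. Everything in it is hypothetical: the class names, the argument lists (the document elides the real ones as `(...)`), and the in-memory list used as a channel are illustrative assumptions, not RLinf's actual code.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class WeightSyncerSketch(ABC):
    """Hypothetical stand-in for the WeightSyncer interface (not RLinf's code)."""

    @abstractmethod
    def init_sender(self, state_dict: Dict[str, Any]) -> None: ...

    @abstractmethod
    def init_receiver(self, state_dict: Dict[str, Any]) -> None: ...

    @abstractmethod
    def sync(self, state_dict: Dict[str, Any], version: int) -> None: ...

    @abstractmethod
    def apply(self, state_dict: Dict[str, Any]) -> int: ...


class InMemoryFullSyncer(WeightSyncerSketch):
    """Toy strategy: a Python list plays the role of the transport channel."""

    def __init__(self) -> None:
        self._channel: List[Tuple[int, Dict[str, Any]]] = []

    def init_sender(self, state_dict: Dict[str, Any]) -> None:
        pass  # nothing to prepare in this toy strategy

    def init_receiver(self, state_dict: Dict[str, Any]) -> None:
        pass  # nothing to prepare in this toy strategy

    def sync(self, state_dict: Dict[str, Any], version: int) -> None:
        # "Send" a copy of the weights together with their version.
        self._channel.append((version, dict(state_dict)))

    def apply(self, state_dict: Dict[str, Any]) -> int:
        # "Receive" the oldest payload, write it in place, report its version.
        version, payload = self._channel.pop(0)
        state_dict.update(payload)
        return version


def rollout_step(syncer: WeightSyncerSketch, rollout_weights: Dict[str, Any]) -> int:
    # Rollout code depends only on the common interface, not on the strategy.
    return syncer.apply(rollout_weights)
```

Swapping `InMemoryFullSyncer` for another subclass would leave `rollout_step` unchanged, which is the point of sharing one sender / receiver workflow.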
The implementation lives in rlinf/hybrid_engines/weight_syncer/, while the
YAML entry point remains the same independent weight_syncer Hydra config
group.
Supported Sync Strategies#
RLinf currently provides two strategies:
patch: Incremental synchronization. The sender maintains a snapshot and only sends the changed positions and values relative to that snapshot. In the current FSDP actor integration, the incremental patch path only tracks trainable parameters plus persistent buffers; frozen parameters with requires_grad=False are excluded from incremental patch construction.
bucket: Bucketized tensor synchronization. Selected tensors are sent in full, bucket by bucket. In the current FSDP actor integration, the selected key set is typically trainable parameters plus persistent buffers, so frozen parameters are also excluded here.
State Dict Device Requirements#
Different weight_syncer implementations have different requirements for
the sender-side state_dict device:
bucket: There is no special device requirement for the sender-side state_dict passed to sync(...). Parameters can live on either CPU or GPU, and the bucket syncer stages them according to bucket_device and bucket_dtype before sending. On the receiver side, apply(...) uses load_state_dict; PyTorch copies input tensors to the target model parameter device and casts them to the target parameter dtype. In the current actor integration, init_sender(...) also provides the selected key subset, so bucket mode usually transmits trainable parameters plus persistent buffers rather than the entire state_dict.
patch: The sender-side state_dict passed to init_sender(...) and sync(...) is expected to be on GPU. Even when snapshot_device: cpu is used, only the sender-side snapshot stays on CPU; difference comparison, nonzero, and new-value gathering still run on GPU. Providing a CPU sender state_dict would turn patch construction into CPU scanning, which cannot use the current optimized path and is not the intended patch-mode design.
On the receiver side, apply(...) moves patch payload tensors to the target
model parameter device before writing them. The receiver model must still match
the metadata required by patch mode. If init_sync.enabled=true is used,
patch mode can also bootstrap initial state_dict values during
init_sender(...) / init_receiver(...) before the first incremental
patch.
Recommendation#
For the mainstream embodied VLA configurations in RLinf, patch is the
recommended default because:
Weight updates after each actor step are often highly sparse.
Pi-series and other VLM-based policies often freeze most or all of the VLM, so excluding frozen weights can substantially reduce patch comparison and transfer cost.
Actor and rollout usually start from the same checkpoint or model path.
Patch mode often sends far less data than full sync.
But there is one critical caveat:
Warning
The incremental patch path still sends deltas relative to the sender-side snapshot, not an independent full model snapshot.
To make this safe for models that create extra modules locally, RLinf now
supports a one-time init bootstrap in patch mode. The recommended default is
to enable patch.init_sync.enabled=true and use prefixes: null so the
receiver is aligned with the sender before the first real patch.
If you explicitly disable init bootstrap, actor and rollout must still start from the same initial weights, especially for frozen parameters that are excluded from later incremental patch sync.
How To Enable It In YAML#
weight_syncer is exposed as an independent Hydra config group.
In embodied YAMLs, the recommended usage looks like this:
defaults:
  - training_backend/fsdp@actor.fsdp_config
  - weight_syncer/patch_syncer@weight_syncer
The corresponding config files are:
examples/embodiment/config/weight_syncer/patch_syncer.yaml
examples/embodiment/config/weight_syncer/bucket_syncer.yaml
Patch Mode#
A typical patch configuration looks like this:
weight_syncer:
  type: patch
  patch:
    snapshot_device: cpu
    transport_device: cpu
    delta_encoding: true
    compression: none
    init_sync:
      enabled: true
      prefixes: null
      bucket_size: 134217728
The fields mean:
type: Fixed to patch to enable incremental synchronization.
patch.snapshot_device: Device where the snapshot is stored. It can be either cpu or cuda. cpu is currently recommended as the default: it avoids keeping an additional model-sized snapshot in GPU memory, and after GPU-side comparison, asynchronous prefetching, and background snapshot flushing optimizations, its synchronization latency is already close to snapshot_device: cuda. If GPU memory is very abundant, cuda remains the most direct low-latency path.
patch.transport_device: Device used before sending the patch. The default can be cpu. If you want GPU-side compression or GPU transport, this is typically cuda.
patch.delta_encoding: Whether to delta-encode COO coordinates. Enabled by default and recommended.
patch.compression: Compression algorithm. Supported values currently include:
none: no compression
nvcomp_lz4: GPU-side lossless compression via nvCOMP
patch.init_sync.enabled: Whether to perform a one-time bootstrap during init_sender(...) / init_receiver(...) before normal patch sync begins. When enabled, the sender transmits a bucketized state_dict subset once, then patch mode continues with the usual incremental snapshot-based updates.
patch.init_sync.prefixes: Which state_dict key prefixes to bootstrap. If set to null, RLinf bootstraps the full state_dict including parameters and persistent buffers. If set to a list, RLinf only bootstraps keys matching either prefix or prefix. (the prefix followed by a dot). null is the recommended default because targeted prefixes can miss nested module paths such as action_head.value_head.
patch.init_sync.bucket_size: Maximum size in bytes of each init bootstrap bucket. This only affects the one-time init bootstrap path; the normal incremental patch payload format is unchanged.
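A small sketch may help show why a targeted prefix list can miss nested module paths. It assumes the matching rule is "key equals the prefix, or key starts with the prefix followed by a dot"; the function name and the example keys are hypothetical and are not taken from RLinf.

```python
from typing import Iterable, List, Optional


def select_bootstrap_keys(keys: Iterable[str], prefixes: Optional[List[str]]) -> List[str]:
    """Return the keys that would be bootstrapped under the assumed matching rule."""
    keys = list(keys)
    if prefixes is None:
        # prefixes: null -> bootstrap every key
        return keys
    return [
        k for k in keys
        if any(k == p or k.startswith(p + ".") for p in prefixes)
    ]


# Hypothetical state_dict keys, for illustration only.
example_keys = [
    "vlm.layer0.weight",
    "action_head.proj.weight",
    "action_head.value_head.weight",
    "value_head.bias",
]

# A top-level "value_head" prefix does not reach the nested action_head.value_head.
only_top_level = select_bootstrap_keys(example_keys, ["value_head"])
everything = select_bootstrap_keys(example_keys, None)
```

Under this rule, `["value_head"]` selects only `value_head.bias`, while `null` selects all four keys, which is why `null` is the less error-prone default.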
How Patch Mode Works#
Patch mode is roughly split into two stages:
One-time initialization
The receiver sends local model metadata in init_receiver(...).
If init_sync.enabled=true, the sender receives the metadata in init_sender(...) and sends a one-time bucketized bootstrap of the full state_dict or the configured prefix subset.
The receiver applies those bootstrap weights directly to its local state_dict.
The sender then creates its snapshot for the later incremental patch path. That snapshot only covers the keys selected for incremental sync: currently trainable parameters and persistent buffers.
The metadata currently includes:
a fixed parameter order ordered_keys
the original shape of each tensor in original_shapes
the dtype of each receiver-side tensor
The receiver does not store the sender-side snapshot. It only stores the structural information needed to apply patches correctly to its local model. The sender-side snapshot uses the same dtype as the corresponding receiver-side weight, so mixed-precision models with both bfloat16 and float32 weights are handled correctly. Init bootstrap also respects the receiver-side dtype for each key.
Per-sync update
The sender compares the current incremental-sync subset (trainable parameters plus persistent buffers) with the snapshot.
The changed entries are packed into a patch and sent.
The receiver applies those changes directly to local model parameters.
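The per-sync update can be sketched with plain Python lists standing in for tensors. This is a simplified illustration under stated assumptions: the function names are hypothetical, positions are flat indices rather than the 2D coordinates used by the real patch format, and the snapshot is refreshed inline instead of by a background flush.

```python
import copy
from typing import Dict, List, Tuple

Weights = Dict[str, List[float]]
Patch = List[Tuple[str, int, float]]  # (key, flat index, new value)


def build_patch(current: Weights, snapshot: Weights) -> Patch:
    """Collect entries that differ from the snapshot, then refresh the snapshot."""
    patch: Patch = []
    for key, values in current.items():
        old = snapshot[key]
        for i, new_value in enumerate(values):
            if new_value != old[i]:
                patch.append((key, i, new_value))
                old[i] = new_value  # the next patch is relative to this state
    return patch


def apply_patch(model: Weights, patch: Patch) -> None:
    """Write the changed entries directly into the receiver-side weights."""
    for key, i, value in patch:
        model[key][i] = value


actor = {"w": [1.0, 2.0, 3.0], "b": [0.0]}
snapshot = copy.deepcopy(actor)   # sender-side snapshot
rollout = copy.deepcopy(actor)    # receiver starts from the same weights

actor["w"][1] = 2.5               # one sparse update on the actor
patch = build_patch(actor, snapshot)
apply_patch(rollout, patch)
```

The sketch also shows the caveat behind init bootstrap: if `rollout` had started from different values, the untouched entries would stay different, because the patch only carries positions that changed relative to the sender-side snapshot.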
CPU Snapshot Optimization Path#
When snapshot_device: cpu, the sender-side snapshot stays on CPU while the
current state_dict remains on GPU. To avoid moving the patch-building hot
path back to CPU, RLinf applies several optimizations for this case:
The CPU snapshot is stored in pinned memory to enable asynchronous CPU-GPU copies.
Before comparing each tensor, the corresponding CPU snapshot tensor is asynchronously prefetched to the GPU where the state tensor lives.
Snapshot prefetch uses a dedicated CUDA copy stream so it can overlap as much as possible with GPU-side comparison of other tensors.
Difference comparison, nonzero, and new-value gathering all run on GPU, avoiding CPU-side element scanning.
The rows, cols, and values needed by the patch are asynchronously copied into pinned CPU staging buffers, and torch.cuda.Event is used to mark when those copies complete.
After patch construction finishes, the sender can return immediately and continue with the following transfer steps; CPU snapshot flushing is handled by a background thread.
Before the next patch construction starts, RLinf waits for the previous background flush to finish, which preserves snapshot consistency.
Therefore, snapshot_device: cpu no longer means “compare on CPU”. The
effective path is:
CPU snapshot -> GPU prefetch -> GPU compare/nonzero/gather
-> pinned CPU staging -> background snapshot flush
This trades a small amount of extra asynchronous copy and background flushing
for much lower GPU memory usage. In current embodied VLA training
configurations, CPU snapshot synchronization latency can already be close to GPU
snapshot latency. When GPU memory is tight, snapshot_device: cpu is usually
the safer default.
Patch Data Layout#
The current patch representation is based on flattened tensor index information. The main fields are:
ordinals: which tensor changed
nnz_per_tensor: number of changed entries in that tensor
rows / cols: 2D coordinates of changed positions
values: the new values at those positions
version: the sync version carried by this patch
These 2D coordinates come from an internal 2D COO-style view. Tensors are interpreted as:
scalars: (1, 1)
1D tensors: (1, N)
2D tensors: unchanged
3D and higher: (shape[0], prod(shape[1:]))
This makes it possible to express tensors of different ranks with one uniform patch format.
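The shape rules above can be written as a short helper. The function names are hypothetical, and the index conversion assumes a contiguous row-major layout with non-empty dimensions.

```python
import math
from typing import Sequence, Tuple


def coo_view_shape(shape: Sequence[int]) -> Tuple[int, int]:
    """Map an arbitrary tensor shape to the 2D view described above."""
    if len(shape) == 0:
        return (1, 1)               # scalar
    if len(shape) == 1:
        return (1, shape[0])        # 1D tensor becomes a single row
    # 2D stays unchanged; 3D and higher flatten the trailing dimensions.
    return (shape[0], math.prod(shape[1:]))


def flat_to_row_col(flat_index: int, shape: Sequence[int]) -> Tuple[int, int]:
    """Convert a row-major flat index into (row, col) of the 2D view."""
    _, n_cols = coo_view_shape(shape)
    return divmod(flat_index, n_cols)
```

For example, a tensor of shape `(2, 3, 4)` is viewed as `(2, 12)`, so flat index 17 lands at row 1, column 5.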
Delta Encoding#
When delta_encoding=true, rows and cols do not send absolute
coordinates directly. Instead, they send delta-encoded coordinates:
rows stores increments between adjacent row coordinates
if two adjacent entries stay on the same row, cols stores column deltas
when switching to a new row, cols stores the absolute starting column of that row
This helps because:
index values usually become smaller
they can often be downscaled to tighter dtypes such as uint8 or int32
downstream compression becomes more effective
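A round-trip sketch of this encoding, using Python lists in place of coordinate tensors. It assumes coordinates are sorted by row and then by column, and that the first entry is stored relative to row 0 with an absolute column; those details and the function names are assumptions, not confirmed RLinf behavior.

```python
from typing import List, Tuple

Coords = List[Tuple[int, int]]


def delta_encode(coords: Coords) -> Tuple[List[int], List[int]]:
    """Encode sorted (row, col) pairs as row increments and column deltas."""
    rows: List[int] = []
    cols: List[int] = []
    prev_r, prev_c = 0, 0
    for i, (r, c) in enumerate(coords):
        same_row = i > 0 and r == prev_r
        rows.append(r - prev_r)
        # Same row: store the column delta. New row: store the absolute column.
        cols.append(c - prev_c if same_row else c)
        prev_r, prev_c = r, c
    return rows, cols


def delta_decode(rows: List[int], cols: List[int]) -> Coords:
    """Invert delta_encode."""
    coords: Coords = []
    prev_r, prev_c = 0, 0
    for i, (dr, dc) in enumerate(zip(rows, cols)):
        r = prev_r + dr
        same_row = i > 0 and dr == 0
        c = prev_c + dc if same_row else dc
        coords.append((r, c))
        prev_r, prev_c = r, c
    return coords


coords = [(0, 5), (0, 9), (0, 10), (3, 2), (3, 700), (4, 1)]
enc_rows, enc_cols = delta_encode(coords)
# enc_rows == [0, 0, 0, 3, 0, 1]
# enc_cols == [5, 4, 1, 2, 698, 1]
```

Most encoded values are small even though the absolute coordinates are not, which is what allows narrower integer dtypes and better compression ratios.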
Compression#
Patch compression only applies to the incremental patch payload itself, not the full model weights. The one-time init bootstrap uses bucketized weight transfer and does not go through the patch compressor.
RLinf currently provides these patch compressors:
none: send patch tensors directly
nvcomp_lz4: apply GPU-side lossless compression separately to rows, cols, and values
If you enable nvcomp_lz4, you need:
transport_device: cuda
nvidia-nvcomp-cu12 installed in the runtime environment
If you install embodied environments through
bash requirements/install.sh embodied ..., this dependency is installed as
part of the common embodied requirements.
When Patch Mode Is Not A Good Fit#
Patch mode is not a good default in the following cases:
actor and rollout do not share the same state_dict structure or metadata
you disable init bootstrap but cannot guarantee identical initial weights
you need an explicit bootstrap or full sync
updates are not sparse enough for patching to pay off
you want the most conservative synchronization strategy first when debugging correctness issues
Bucket Mode#
A typical bucket configuration looks like this:
weight_syncer:
  type: bucket
  bucket:
    bucket_size: 536870912
    bucket_dtype: null
    bucket_device: cuda
    is_agent: false
    load_instant: true
The fields mean:
type: Fixed to bucket to enable full bucket-based synchronization.
bucket.bucket_size: Maximum size in bytes of each bucket.
bucket.bucket_dtype: Dtype used when sending bucket payloads. If set to null, each tensor keeps its original dtype. If set to bfloat16, float16, or float32, only floating-point tensors are converted; non-floating buffers such as int and bool keep their original dtype to avoid corrupting model state.
bucket.bucket_device: Device where bucket tensors are staged, typically cuda.
bucket.is_agent: A compatibility switch for some agent-side naming behavior. For embodied training, this is usually kept as false.
bucket.load_instant: Whether to call load_state_dict immediately after each bucket is received.
Characteristics Of Bucket Mode#
Bucket mode splits the selected sync subset into multiple chunks and sends them in order. Its main characteristics are:
Advantage: simple semantics for full-tensor transport of the selected keys
Advantage: does not depend on a sender-side snapshot and does not assume sparse updates
Disadvantage: typically much more data is transferred than in patch mode
If load_instant=true, each bucket is loaded immediately after it arrives.
If load_instant=false, the receiver buffers buckets first and loads them at
the end.
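The chunking and the two load_instant behaviors can be sketched as follows. This is an illustration only: the greedy packing rule, the choice to give an oversized tensor its own bucket, and the function names are assumptions rather than RLinf's actual bucketing logic.

```python
from typing import Dict, List


def plan_buckets(tensor_nbytes: Dict[str, int], bucket_size: int) -> List[List[str]]:
    """Greedily pack keys, in order, into buckets of at most bucket_size bytes."""
    buckets: List[List[str]] = []
    current: List[str] = []
    used = 0
    for key, nbytes in tensor_nbytes.items():
        if current and used + nbytes > bucket_size:
            buckets.append(current)
            current, used = [], 0
        current.append(key)  # an oversized tensor ends up alone in its bucket
        used += nbytes
    if current:
        buckets.append(current)
    return buckets


def receive(buckets: List[List[str]], load_instant: bool) -> List[List[str]]:
    """Return the key groups that each load call would see on the receiver."""
    load_calls: List[List[str]] = []
    pending: List[str] = []
    for bucket in buckets:
        if load_instant:
            load_calls.append(list(bucket))   # load as soon as the bucket arrives
        else:
            pending.extend(bucket)            # buffer until all buckets arrived
    if not load_instant and pending:
        load_calls.append(pending)
    return load_calls


sizes = {"a": 300, "b": 300, "c": 500, "d": 1200, "e": 100}
buckets = plan_buckets(sizes, bucket_size=1000)
```

With these sizes, `buckets` is `[["a", "b"], ["c"], ["d"], ["e"]]`. `load_instant=True` issues one load per bucket, while `load_instant=False` issues a single load at the end and needs receiver memory for all buffered buckets.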
Behavior In Async Training#
In async embodied training, if actor.sync_weight_no_wait=true is enabled,
rollout-side weight receiving and applying are handled in a background
asyncio task.
This means:
rollout does not necessarily block immediately when actor requests a sync
new weights only become effective after the background task completes
there may be a small delay between “sync requested” and “sync applied”
In this async path, version propagation matters more. WeightSyncer.apply(...)
returns the version that was actually applied on rollout, and rollout updates
its internal version state from that result.
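The gap between "sync requested" and "sync applied" can be illustrated with a small asyncio sketch. The class, its methods, and the sleep used to model transfer latency are all hypothetical; the only behavior taken from the text is that applying happens in a background task and that the applied version is what updates the rollout's version state.

```python
import asyncio
from typing import Dict, Tuple


class RolloutSketch:
    """Hypothetical rollout worker that applies weights in a background task."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = {"w": 0.0}
        self.version = 0

    async def _receive_and_apply(self, payload: Dict[str, float], version: int) -> int:
        await asyncio.sleep(0.01)       # stands in for receive + apply latency
        self.weights.update(payload)
        return version                  # mimics apply(...) returning the applied version

    async def run(self) -> Tuple[int, int]:
        # The actor requests a sync; rollout does not block on it.
        task = asyncio.create_task(self._receive_and_apply({"w": 1.0}, version=1))
        version_during_sync = self.version   # still the old version here
        await asyncio.sleep(0)               # rollout work would continue here
        # New weights only take effect once the background task completes.
        self.version = await task
        return version_during_sync, self.version


rollout = RolloutSketch()
before, after = asyncio.run(rollout.run())
```

Here `before` is 0 and `after` is 1: any trajectories generated in between were still produced by the old policy version, which is why the applied version is tracked explicitly.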
Performance Suggestions#
If your priority is to reduce synchronization overhead, a good tuning order is:
Start with patch and keep init_sync.enabled=true.
Prefer init_sync.prefixes: null unless you are deliberately optimizing a small, well-understood subset of keys.
Prefer snapshot_device: cpu by default, which avoids an extra model-sized GPU-memory snapshot while providing synchronization latency close to GPU snapshot.
Keep delta_encoding: true.
First get the workflow stable with compression: none, then evaluate whether nvcomp_lz4 is worth enabling.
If GPU memory is very abundant and you are pursuing the lowest possible sync latency, evaluate snapshot_device: cuda.
If you want the simplest per-tensor transport path for the selected sync subset, switch to bucket. If you need to realign the full model including frozen weights, rely on init bootstrap or another explicit full weight load.
Patch mode keeps an extra sender-side snapshot. When snapshot_device: cuda,
that snapshot consumes GPU memory roughly equal to the number of model
parameters multiplied by the byte size of the corresponding receiver-side
weight dtype. For large models or memory-tight setups, reserve enough GPU memory
for this snapshot to avoid OOM during training or synchronization.
When snapshot_device: cpu, this snapshot does not consume GPU memory, but it
does consume one model-sized CPU pinned-memory copy. Its size is also roughly the
number of model parameters multiplied by the byte size of the corresponding
receiver-side weight dtype.
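The rule of thumb "parameter count times dtype byte size" can be turned into a quick estimate. The helper names and the parameter counts below are made up for illustration and do not describe any specific model.

```python
from typing import Dict

# Byte sizes of common floating-point dtypes.
DTYPE_BYTES = {"float32": 4, "bfloat16": 2, "float16": 2}


def snapshot_bytes(param_counts_by_dtype: Dict[str, int]) -> int:
    """Estimate snapshot size as sum(parameter count * bytes per element)."""
    return sum(DTYPE_BYTES[dtype] * count for dtype, count in param_counts_by_dtype.items())


def to_gib(nbytes: int) -> float:
    return nbytes / 2**30


# Hypothetical mixed-precision model: 3B bfloat16 weights plus 0.3B float32 weights.
estimate = snapshot_bytes({"bfloat16": 3_000_000_000, "float32": 300_000_000})
estimate_gib = to_gib(estimate)
```

For this hypothetical model the estimate is 7,200,000,000 bytes, a little over 6.7 GiB. That amount is taken from GPU memory with snapshot_device: cuda, or from pinned host memory with snapshot_device: cpu.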
In this mode, patch comparison still runs on GPU, and CPU snapshot overhead is
reduced through prefetching, event synchronization, and background flushing. For
memory-tight training jobs, this is the currently recommended configuration. In
addition, nvcomp_lz4 requires transport_device to be cuda.
Limitations And Caveats#
The current implementation has several constraints to keep in mind:
if patch.init_sync.enabled=false, patch assumes actor and rollout start from the same initial weights
targeted patch.init_sync.prefixes can miss nested module paths if the configured prefixes are incomplete; null is the safest default
patch is currently designed primarily for the embodied HuggingFace rollout path
high-rank tensors are converted to a 2D view internally; if trailing dimensions cannot be flattened as a view, patch mode will raise an error
compression settings in this document refer to patch payload compression, not compression of the model weights themselves
bucket now shares the same selected-key filtering used by the current actor integration, so it is not a guaranteed full-model realignment path for frozen weights
if your immediate goal is “validate the selected-key transport path with the simplest semantics”, use bucket; if your goal is “make weight sync fast after correctness is verified”, use patch
Recommended Usage Pattern#
A simple rule of thumb is:
default training: use patch + init_sync.enabled=true + prefixes: null
targeted bootstrap only when you know the exact state_dict key paths you want to align
bootstrap or debug the selected-key transport path with the simplest semantics: start with bucket
high sparsity and aggressive optimization: patch + delta_encoding + optional nvcomp
If you are not fully sure the patch assumptions hold for your pipeline, the safest approach is:
first ensure actor and rollout have the same state_dict structure
keep patch init bootstrap enabled, or use bucket first to validate the selected-key transport path
then switch to patch for performance optimization