5D Parallelism Configuration#
RLinf employs NVIDIA Megatron-LM as its large-scale Transformer training backend and exposes five orthogonal parallelism modes:
Tensor Parallelism (TP)
Data Parallelism (DP)
Pipeline Parallelism (PP)
Sequence Parallelism (SP)
Context Parallelism (CP)
A suitable combination lets the system scale from a single node to clusters with hundreds of GPUs while balancing memory, communication, and utilisation.
The sections below describe each of the five modes in turn and show how to configure it at launch.
1. Tensor Parallelism (TP)#
Definition
Tensor Parallelism shards the model’s weight matrices across a process group. Each GPU holds and computes only its slice.
Mechanics
Linear layers – split Linear(h, 4h) by columns, split Linear(4h, h) by rows
Attention projections – shard \(Q,K,V\) along the head dimension
Partial results are combined via all-reduce after each row-parallel layer, i.e. once per attention block and once per MLP block in the forward pass.
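The column/row split described above can be checked numerically. The following is a minimal single-process sketch, not Megatron-LM code: the helper names (`split_cols`, `split_rows`, `all_reduce_sum`, `mlp_tensor_parallel`) are hypothetical, ReLU stands in for the MLP activation, and a plain sum stands in for the all-reduce.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU, used here as a stand-in activation."""
    return [[max(0, v) for v in row] for row in m]

def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks (assumes even division)."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks (assumes even division)."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise sum of every rank's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def mlp_reference(x, w1, w2):
    """Unsharded MLP: Linear(h, 4h) -> activation -> Linear(4h, h)."""
    return matmul(relu(matmul(x, w1)), w2)

def mlp_tensor_parallel(x, w1, w2, tp_size):
    """Each simulated rank holds one column block of w1 and one row block of w2."""
    w1_shards = split_cols(w1, tp_size)  # Linear(h, 4h) split by columns
    w2_shards = split_rows(w2, tp_size)  # Linear(4h, h) split by rows
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    return all_reduce_sum(partials)      # single all-reduce after the row-parallel layer

# Small integer example with h = 2 and 4h = 8.
x = [[1, -2], [3, 4]]
w1 = [[(i * 8 + j) % 5 - 2 for j in range(8)] for i in range(2)]
w2 = [[(i * 2 + j) % 3 - 1 for j in range(2)] for i in range(8)]
```

Because the activation is element-wise, no communication is needed between the two linear layers; the sharded result matches the unsharded one after the final sum.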
Sample YAML
actor:
  model:
    tensor_model_parallel_size: 2 # tp_size = 2
Pros
Bypasses single-GPU memory limits.
Balances compute across GPUs.
Cons
All-reduce after almost every layer → high latency.
Scalability is bounded by the hidden size and the number of attention heads, which must divide evenly across TP ranks.
2. Data Parallelism (DP)#
Definition
Data Parallelism partitions the minibatch; every replica stores a full copy of the model.
Mechanics
Split the global batch across DP ranks.
Each rank computes forward/backward locally.
Gradients are synchronised with an all-reduce before the optimiser step.
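The three steps above can be sketched in a single process. This is an illustrative toy, not RLinf or Megatron-LM code: the model is a one-parameter regression, the function names are hypothetical, and an average over a Python list stands in for the gradient all-reduce.

```python
import math

def local_gradient(w, batch):
    """Mean gradient of the squared error (w*x - y)^2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for a gradient all-reduce followed by division by dp_size."""
    return sum(values) / len(values)

def data_parallel_gradient(w, global_batch, dp_size):
    """Split the global batch across DP ranks, then synchronise the gradients."""
    shard = len(global_batch) // dp_size  # assumes the batch divides evenly
    shards = [global_batch[r * shard:(r + 1) * shard] for r in range(dp_size)]
    return all_reduce_mean([local_gradient(w, s) for s in shards])

batch = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0)]
full = local_gradient(0.5, batch)                         # single-replica gradient
sharded = data_parallel_gradient(0.5, batch, dp_size=2)   # two simulated DP ranks
```

With equally sized shards, the averaged per-rank gradients equal the gradient computed over the whole batch, which is why every replica can apply the same optimiser step.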
Sizing Example
cluster:
  num_nodes: 16
actor:
  model:
    tensor_model_parallel_size: 2 # TP
    pipeline_model_parallel_size: 2 # PP
    context_parallel_size: 2 # CP
# Assuming 8 GPUs per node: world size = 16 × 8 = 128 GPUs
# dp_size = 128 / (TP × PP × CP) = 128 / (2 × 2 × 2) = 16
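The arithmetic in the sizing example can be written as a small helper. This is a sketch under stated assumptions: `data_parallel_size` is a hypothetical function, not an RLinf API, and the value of 8 GPUs per node is inferred from the 128-GPU total in the example.

```python
def data_parallel_size(num_nodes, gpus_per_node, tp, pp, cp):
    """Derive dp_size from the cluster size and the model-parallel sizes."""
    world_size = num_nodes * gpus_per_node
    model_parallel = tp * pp * cp  # GPUs consumed by one model replica
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel

# 16 nodes, assumed 8 GPUs each, with TP = PP = CP = 2.
dp_size = data_parallel_size(num_nodes=16, gpus_per_node=8, tp=2, pp=2, cp=2)
```

SP does not appear in the product because it reuses the TP process group.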
Pros
Simple and model-agnostic.
Well suited to scaling throughput over large datasets.
Cons
Full-model replica on every GPU → memory heavy.
Gradient all-reduce over the entire parameter set.
Usually combined with TP/PP/CP to fit larger models.
3. Pipeline Parallelism (PP)#
Definition
Pipeline Parallelism places different layer stacks on different ranks to form a computation pipeline.
Mechanics
Evenly split layers across pp_size stages.
Use 1F1B (one-forward-one-backward) or similar schedulers to overlap compute.
Schedule Illustration
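GPU 0: [F1][F2][F3][F4][--][--][--][--][--][--][B4][B3][B2][B1]
GPU 1: [--][F1][F2][F3][F4][--][--][--][--][B4][B3][B2][B1][--]
GPU 2: [--][--][F1][F2][F3][F4][--][--][B4][B3][B2][B1][--][--]
GPU 3: [--][--][--][F1][F2][F3][F4][B4][B3][B2][B1][--][--][--]
F = forward micro-batch, B = backward micro-batch, index = micro-batch ID, [--] = idle slot (pipeline bubble).
A stage can run a forward pass only after the previous stage has finished it, and backward passes flow in the opposite direction, so the rows are staggered. For simplicity the illustration runs all forward passes before the backward passes; 1F1B interleaves them to lower peak activation memory.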
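The schedule above can also be generated programmatically, which makes the idle slots explicit and measurable. This is a minimal sketch of an all-forward-then-all-backward (GPipe-style) schedule, not of 1F1B and not of Megatron-LM's scheduler; it assumes every forward and backward slot takes the same time, and the function names are hypothetical.

```python
def gpipe_schedule(num_stages, num_microbatches):
    """Return one row of slot labels per stage; '--' marks an idle slot."""
    half = num_microbatches + num_stages - 1  # slots needed for the forward phase
    grid = [["--"] * (2 * half) for _ in range(num_stages)]
    for stage in range(num_stages):
        for mb in range(num_microbatches):
            # A forward micro-batch reaches this stage after `stage` hops.
            grid[stage][stage + mb] = f"F{mb + 1}"
            # Backward runs in reverse micro-batch order, starting at the last stage.
            hops_back = num_stages - 1 - stage
            grid[stage][half + hops_back + (num_microbatches - 1 - mb)] = f"B{mb + 1}"
    return grid

def render(grid):
    """Format the grid in the same style as the illustration."""
    return [
        f"GPU {i}: " + "".join(f"[{slot}]" for slot in row)
        for i, row in enumerate(grid)
    ]

def bubble_fraction(grid):
    """Share of all slots in which a stage sits idle."""
    idle = sum(row.count("--") for row in grid)
    return idle / (len(grid) * len(grid[0]))

for line in render(gpipe_schedule(num_stages=4, num_microbatches=4)):
    print(line)
```

Under these assumptions the idle share equals (p - 1) / (m + p - 1) for p stages and m micro-batches, so raising the number of micro-batches relative to `pp_size` shrinks the bubble.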
Sample YAML
actor:
  model:
    pipeline_model_parallel_size: 2
Pros
Reduces memory for very deep models.
Only neighbour-to-neighbour communication (activations).
Cons
Pipeline bubbles (idle slots) may lower utilisation.
4. Sequence Parallelism (SP)#
Definition
Megatron’s Sequence Parallelism augments TP to reduce memory for long-context attention and MLP blocks.
Mechanics
Must be enabled with TP; both use the same process group.
The activations entering and leaving the attention and MLP blocks (the LayerNorm and dropout regions) are partitioned along the sequence dimension, while weight shards stay identical to TP.
Sample YAML
actor:
  model:
    tensor_model_parallel_size: 2 # TP is active
    sequence_parallel: True # enable SP

# If TP = 1, SP must be disabled
actor:
  model:
    tensor_model_parallel_size: 1
    sequence_parallel: False
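The rule in the two samples above can be expressed as a small pre-launch check. This is an illustrative sketch only: `local_sequence_length` is a hypothetical helper, not part of RLinf or Megatron-LM, and the requirement that the sequence length divide evenly by the TP size is an assumption of this sketch.

```python
def local_sequence_length(seq_len, tp_size, sequence_parallel):
    """Validate the TP/SP pairing and return the per-rank sequence length
    of the sequence-sharded activations."""
    if sequence_parallel and tp_size <= 1:
        raise ValueError("sequence_parallel requires tensor_model_parallel_size > 1")
    if not sequence_parallel:
        return seq_len  # activations keep the full sequence on every rank
    if seq_len % tp_size != 0:
        # Assumption of this sketch: shards must be equally sized.
        raise ValueError("seq_len must be divisible by tp_size when SP is enabled")
    return seq_len // tp_size

with_sp = local_sequence_length(seq_len=4096, tp_size=2, sequence_parallel=True)
without_sp = local_sequence_length(seq_len=4096, tp_size=1, sequence_parallel=False)
```

The returned value shows where the memory relief comes from: with SP enabled, each TP rank stores only its slice of the sequence in the sharded regions.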
Pros
Significant memory relief for long sequences.
Cons
Extra communication on sequence-dim shuffles.
5. Context Parallelism (CP)#
Definition
Context Parallelism targets ultra-long sequences by chunking the entire attention computation along the sequence axis; all tensors are sharded in that dimension.
Mechanics
Split \(Q,K,V\) and logits into context chunks.
Use ring attention to communicate and incrementally accumulate output.
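The incremental accumulation can be sketched in a single process. This is a simplified simulation under stated assumptions, not Megatron-LM's implementation: it covers one non-causal attention head with no scaling factor, loops over ranks sequentially instead of exchanging K/V chunks over a real ring, and uses hypothetical function names.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, k, v):
    """Plain softmax attention over the full sequence."""
    out = []
    for qi in q:
        scores = [dot(qi, kj) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        denom = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / denom
                    for d in range(len(v[0]))])
    return out

def chunk(seq, parts):
    """Split a sequence into `parts` equal context chunks (assumes even division)."""
    size = len(seq) // parts
    return [seq[i * size:(i + 1) * size] for i in range(parts)]

def ring_attention(q, k, v, cp_size):
    """Each simulated rank keeps its Q chunk and sees one K/V chunk per ring step."""
    q_chunks, k_chunks, v_chunks = chunk(q, cp_size), chunk(k, cp_size), chunk(v, cp_size)
    dim = len(v[0])
    outputs = []
    for rank in range(cp_size):
        local_q = q_chunks[rank]
        running_max = [-math.inf] * len(local_q)
        denom = [0.0] * len(local_q)
        acc = [[0.0] * dim for _ in local_q]
        for step in range(cp_size):
            src = (rank - step) % cp_size  # K/V chunk that would arrive at this step
            for i, qi in enumerate(local_q):
                scores = [dot(qi, kj) for kj in k_chunks[src]]
                new_max = max(running_max[i], max(scores))
                scale = math.exp(running_max[i] - new_max)  # rescale earlier partial sums
                weights = [math.exp(s - new_max) for s in scores]
                denom[i] = denom[i] * scale + sum(weights)
                acc[i] = [a * scale + sum(w * vj[d] for w, vj in zip(weights, v_chunks[src]))
                          for d, a in enumerate(acc[i])]
                running_max[i] = new_max
        for i in range(len(local_q)):
            outputs.append([a / denom[i] for a in acc[i]])
    return outputs

q = [[0.1, 0.2], [0.3, -0.1], [-0.5, 0.4], [0.2, 0.2]]
k = [[0.3, 0.1], [-0.2, 0.5], [0.4, 0.4], [0.0, -0.3]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
```

Each rank only ever holds one K/V chunk at a time, yet the running maximum and denominator let it reproduce the full-sequence softmax result.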
Sample YAML
actor:
  model:
    context_parallel_size: 2
Pros
Breaks the memory wall for contexts of 100k+ tokens.
Pairs well with dynamic batch sizing.
Cons
High bandwidth cost; parameters are not sharded, so model memory is replicated.
Summary#
Megatron-LM’s flexible combination of TP, DP, PP, SP, and CP enables RLinf to scale models by width (TP), data volume (DP), depth (PP), or context length (SP / CP). Select sizes based on model architecture, target sequence length, GPU memory, and interconnect topology for best throughput.