5D Parallelism Configuration

RLinf uses NVIDIA Megatron-LM as its large-scale Transformer training backend and exposes five parallelism modes that can be combined:

Tensor Parallelism (TP)
Data Parallelism (DP)
Pipeline Parallelism (PP)
Sequence Parallelism (SP)
Context Parallelism (CP)

A suitable combination lets the system scale from a single node to clusters with hundreds of GPUs while balancing memory, communication, and utilisation.

The sections below describe each of the five modes and show how to configure it.

1. Tensor Parallelism (TP)

Definition

Tensor Parallelism shards the model’s weight matrices across a process group. Each GPU holds and computes only its slice.

Mechanics

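Linear layers: split Linear(h, 4h) by columns and the following Linear(4h, h) by rows, so no communication is needed between the two.
Attention projections: shard \(Q,K,V\) along the head dimension.
Partial results are combined with an all-reduce after each row-parallel layer, i.e. once per attention block and once per MLP block in the forward pass (and again in the backward pass).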
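The column/row split can be checked on a toy MLP. The following is a minimal pure-Python sketch, not Megatron-LM code: the helper names are hypothetical, ReLU stands in for the real activation, and the all-reduce is simulated by summing the per-rank partial outputs in a single process.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU (assumed activation; it acts per column, so it commutes with a column split)."""
    return [[max(0, x) for x in row] for row in m]

def split_columns(m, parts):
    """Column-parallel sharding: each rank gets a contiguous block of columns."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    """Row-parallel sharding: each rank gets a contiguous block of rows."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

def mlp_reference(x, w1, w2):
    """Unsharded MLP: Linear(h, 4h) -> ReLU -> Linear(4h, h)."""
    return matmul(relu(matmul(x, w1)), w2)

def mlp_tensor_parallel(x, w1, w2, tp_size):
    """Simulate TP: every 'rank' computes a partial output from its own weight shards."""
    w1_shards = split_columns(w1, tp_size)  # Linear(h, 4h) split by columns
    w2_shards = split_rows(w2, tp_size)     # Linear(4h, h) split by rows
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    # Simulated all-reduce: element-wise sum of the per-rank partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

# Toy sizes: h = 2, 4h = 8, two input tokens.
x = [[1, -2], [3, 4]]
w1 = [[1, -1, 2, 0, 1, 3, -2, 1],
      [0, 2, -1, 1, -3, 1, 1, 2]]
w2 = [[1, 0], [2, -1], [0, 3], [1, 1], [-2, 2], [1, -1], [3, 0], [0, 1]]

assert mlp_tensor_parallel(x, w1, w2, tp_size=2) == mlp_reference(x, w1, w2)
```

Only one reduction is needed for the whole block because the column-split output of the first layer is exactly the row-split input the second layer expects.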

Sample YAML

actor:
  model:
    tensor_model_parallel_size: 2    # tp_size = 2

Pros

Bypasses single-GPU memory limits.
Balances compute across GPUs.

Cons

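Several all-reduces per Transformer layer → latency-sensitive; TP is usually kept within a node on a fast interconnect.
Scalability is bounded by layer width: the number of attention heads (and the MLP width) must be divisible by tp_size.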

2. Data Parallelism (DP)

Definition

Data Parallelism partitions each global batch across replicas; every replica stores a full copy of the model.

Mechanics

Split the global batch across DP ranks.
Each rank computes forward/backward locally.
Gradients are synchronised with an all-reduce before the optimiser step.
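The three steps above can be sketched for a one-parameter model. This is an illustrative single-process simulation with hypothetical helper names; it assumes equally sized shards and an averaging all-reduce, and uses exact fractions so the comparison is not affected by rounding.

```python
from fractions import Fraction

def local_gradient(w, samples):
    """Gradient of the mean squared error of y ~ w * x over one shard of (x, y) pairs."""
    return sum(Fraction(2) * (w * x - y) * x for x, y in samples) / len(samples)

def data_parallel_gradient(w, batch, dp_size):
    """Split the global batch across DP ranks, compute local gradients, then average them."""
    shard_len = len(batch) // dp_size  # assumes len(batch) is divisible by dp_size
    shards = [batch[r * shard_len:(r + 1) * shard_len] for r in range(dp_size)]
    grads = [local_gradient(w, shard) for shard in shards]  # one backward pass per rank
    return sum(grads) / dp_size  # simulated all-reduce (average)

batch = [(1, 2), (2, 3), (3, 7), (4, 8), (5, 9), (6, 13), (7, 15), (8, 15)]
w = Fraction(1, 2)

# With equal shard sizes, the averaged gradient equals the full-batch gradient.
assert data_parallel_gradient(w, batch, dp_size=4) == local_gradient(w, batch)
```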

Sizing Example

cluster:
  num_nodes: 16

actor:
  model:
    tensor_model_parallel_size: 2  # TP
    pipeline_model_parallel_size: 2  # PP
    context_parallel_size: 2        # CP
  # 16 nodes x 8 GPUs per node (assumed) = 128 GPUs
  # dp_size = 128 / (TP * PP * CP) = 128 / (2 * 2 * 2) = 16
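The same arithmetic can be written as a small helper. This is a sketch only: `data_parallel_size` is a hypothetical function, not an RLinf or Megatron-LM API, and the figure of 8 GPUs per node is an assumption implied by the 128-GPU total in the example.

```python
def data_parallel_size(world_size, tp=1, pp=1, cp=1):
    """Return the DP size left over once TP, PP and CP have claimed their GPUs."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel

num_nodes = 16
gpus_per_node = 8  # assumption: not stated in the YAML above
dp_size = data_parallel_size(num_nodes * gpus_per_node, tp=2, pp=2, cp=2)
print(dp_size)  # 16
```

SP does not appear in the product because it reuses the TP process group.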

Pros

Simple and model-agnostic.
Scales throughput with the number of replicas, so larger batches and datasets can be processed in the same wall-clock time.

Cons

Every replica holds a full copy of the model (or of its TP/PP shard) → memory heavy.
Gradient all-reduce over the entire parameter set.
Usually combined with TP/PP/CP to fit larger models.

3. Pipeline Parallelism (PP)

Definition

Pipeline Parallelism places different layer stacks on different ranks to form a computation pipeline.

Mechanics

Evenly split layers across pp_size stages.
Use 1F1B (one-forward-one-backward) or a similar schedule so that different stages work on different micro-batches at the same time.

Schedule Illustration

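GPU 0: [F1][F2][F3][F4]                        [B4][B3][B2][B1]
GPU 1:     [F1][F2][F3][F4]                [B4][B3][B2][B1]
GPU 2:         [F1][F2][F3][F4]        [B4][B3][B2][B1]
GPU 3:             [F1][F2][F3][F4][B4][B3][B2][B1]

F = forward micro-batch, B = backward micro-batch, index = micro-batch ID. For simplicity the diagram shows an all-forward-then-all-backward schedule in which every step takes one time slot. A stage can run a backward step only after the next stage has finished it, so the blank slots are pipeline bubbles.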
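The idle time in such a schedule can be estimated with a short calculation. This is a sketch under simplifying assumptions: one time slot per forward or backward step, no communication cost, and a non-interleaved schedule. `bubble_fraction` is a hypothetical helper.

```python
from fractions import Fraction

def bubble_fraction(pp_size, num_microbatches):
    """Fraction of each GPU's time slots that are idle in an idealised pipeline schedule."""
    # Each phase (forward or backward) needs m slots of work plus (p - 1) slots of fill/drain.
    total_slots = 2 * (num_microbatches + pp_size - 1)
    busy_slots = 2 * num_microbatches  # one forward and one backward per micro-batch
    return Fraction(total_slots - busy_slots, total_slots)

print(bubble_fraction(4, 4))   # 3/7
print(bubble_fraction(4, 32))  # 3/35
```

Under these assumptions, four stages with four micro-batches leave each GPU idle for 3/7 of the step, while 32 micro-batches reduce that to 3/35, which is why PP is normally run with many more micro-batches than stages.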

Sample YAML

actor:
  model:
    pipeline_model_parallel_size: 2

Pros

Each stage stores only its own layers, which reduces per-GPU memory for very deep models.
Only neighbour-to-neighbour communication (activations).

Cons

Pipeline bubbles (idle slots) may lower utilisation, especially when the number of micro-batches is small relative to pp_size.

4. Sequence Parallelism (SP)

Definition

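Megatron’s Sequence Parallelism augments TP: activations that TP would otherwise replicate on every rank (such as the LayerNorm and dropout regions around the attention and MLP blocks) are also sharded, along the sequence dimension, which reduces activation memory for long sequences.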

Mechanics

SP requires TP (tensor_model_parallel_size > 1); both use the same process group.
Inputs/outputs of attention and MLP are partitioned across the sequence dimension while weight shards stay identical to TP.
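One way to picture the data movement: the all-reduce that TP performs can be split into a reduce-scatter followed by an all-gather, and between the two each rank holds only its slice of the sequence. The sketch below simulates that equivalence in a single process; the function names mirror the usual collective names but are plain Python stand-ins, not calls into any communication library.

```python
def all_reduce(partials):
    """Every rank ends up with the element-wise sum of all partial [seq][hidden] tensors."""
    summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    return [summed for _ in partials]

def reduce_scatter(partials):
    """Rank r ends up with the summed rows of sequence chunk r only."""
    tp_size = len(partials)
    chunk = len(partials[0]) // tp_size  # assumes seq length divisible by tp_size
    summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(tp_size)]

def all_gather(shards):
    """Every rank ends up with all sequence chunks concatenated in rank order."""
    full = [row for shard in shards for row in shard]
    return [full for _ in shards]

# Two TP ranks, each holding a partial output for 4 tokens with hidden size 2.
partials = [
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    [[10, 20], [30, 40], [50, 60], [70, 80]],
]

seq_shards = reduce_scatter(partials)  # each rank now stores 2 of the 4 tokens
assert [len(s) for s in seq_shards] == [2, 2]
assert all_gather(seq_shards) == all_reduce(partials)
```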

Sample YAML

actor:
  model:
    tensor_model_parallel_size: 2     # TP is active
    sequence_parallel: True           # enable SP

# If TP = 1, SP must be disabled
actor:
  model:
    tensor_model_parallel_size: 1
    sequence_parallel: False

Pros

Significant memory relief for long sequences.

Cons

Adds all-gather and reduce-scatter steps along the sequence dimension in place of the TP all-reduce, and is only available when TP > 1.

5. Context Parallelism (CP)

Definition

Context Parallelism targets ultra-long sequences by chunking the entire attention computation along the sequence axis; all activations are sharded in that dimension, while weights are not.

Mechanics

Split \(Q,K,V\) and logits into context chunks.
Use ring attention: K/V chunks are passed around the CP ranks while each rank incrementally accumulates its output.
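The incremental accumulation can be illustrated with a running (online) softmax. The following is a minimal single-process sketch, not the Megatron-LM implementation: it omits the attention scale factor and causal masking, simulates the ring by indexing K/V chunks in rotating order, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(q_rows, k_rows, v_rows):
    """Reference: softmax(q . k) weighted sum of v for every query, over the whole sequence."""
    dim = len(v_rows[0])
    out = []
    for q in q_rows:
        scores = [dot(q, k) for k in k_rows]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, v_rows)) / total for d in range(dim)])
    return out

def ring_attention(q_rows, k_rows, v_rows, cp_size):
    """Each CP rank keeps its Q chunk and sees one K/V chunk per ring step."""
    chunk = len(q_rows) // cp_size  # assumes sequence length divisible by cp_size
    shard = lambda rows: [rows[i * chunk:(i + 1) * chunk] for i in range(cp_size)]
    q_sh, k_sh, v_sh = shard(q_rows), shard(k_rows), shard(v_rows)
    dim = len(v_rows[0])
    out = []
    for rank in range(cp_size):
        run_max = [-math.inf] * chunk          # running max of scores per local query
        run_sum = [0.0] * chunk                # running softmax denominator
        acc = [[0.0] * dim for _ in range(chunk)]  # running un-normalised output
        for step in range(cp_size):
            src = (rank - step) % cp_size      # K/V chunk arriving at this ring step
            for i, q in enumerate(q_sh[rank]):
                for k, v in zip(k_sh[src], v_sh[src]):
                    score = dot(q, k)
                    new_max = max(run_max[i], score)
                    rescale = math.exp(run_max[i] - new_max)  # 0.0 on the first key
                    weight = math.exp(score - new_max)
                    run_sum[i] = run_sum[i] * rescale + weight
                    acc[i] = [a * rescale + weight * vd for a, vd in zip(acc[i], v)]
                    run_max[i] = new_max
        out.extend([[a / run_sum[i] for a in acc[i]] for i in range(chunk)])
    return out

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[0.2, 0.1], [1.0, -1.0], [0.0, 2.0], [-0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

ref = full_attention(q, k, v)
ring = ring_attention(q, k, v, cp_size=2)
assert all(math.isclose(a, b, abs_tol=1e-9) for r1, r2 in zip(ref, ring) for a, b in zip(r1, r2))
```

No rank ever holds the full score matrix: it only needs its own queries, one K/V chunk at a time, and three small running statistics per query.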

Sample YAML

actor:
  model:
    context_parallel_size: 2

Pros

Makes 100k+ token contexts fit by dividing activation memory across CP ranks.
Pairs well with dynamic batch sizing.

Cons

High bandwidth cost; parameters are not sharded, so model memory is replicated.

Summary

Megatron-LM’s flexible combination of TP, DP, PP, SP, and CP enables RLinf to scale models by width (TP), data volume (DP), depth (PP), or context length (SP / CP). Select sizes based on model architecture, target sequence length, GPU memory, and interconnect topology for best throughput.