YAML Configuration#
Below is a complete reference for the configuration file used in RLinf. Every important key in the YAML is documented so that you can confidently adapt the file to your own cluster, model, or research ideas. Parameters are grouped by their top-level key.
This section has three main parts: Basic Configuration, MATH-specific Configuration, and Embody-specific Configuration, so you can go directly to the part that matches your needs.
Basic Configuration#
hydra#
hydra:
  run:
    dir: .
  output_subdir: null
hydra.run.dir: Working directory for Hydra runs.
hydra.output_subdir: Output subdirectory (null disables subdirectory creation).
cluster#
cluster:
  num_nodes: 1
  component_placement:
    actor,inference,rollout: all
cluster.num_nodes: Physical nodes to use for training.
cluster.component_placement:
The placement strategy for each component.
Each entry under component_placement maps component names to resource ranks (component_names: resource_ranks).
In this simple example of running on GPU nodes:
The key is one or more component names, e.g., rollout, or rollout,inference,actor.
The value is the hardware (e.g., GPU) ranks allocated to those components, which can be:
“all”: use all accelerators in the cluster
A single integer, e.g., “3”: use accelerator 3
A list of integers separated by comma, e.g., “0,2,3”: use accelerator 0, 2, and 3
A range of integers separated by hyphen, e.g., “0-3”: use accelerator 0, 1, 2, and 3
A combination of the above two, e.g., “0-3,5,14”: use accelerators 0, 1, 2, 3, and 5 (on node 0), and 14 (i.e., accelerator 6 on node 1, which implies 8 accelerators per node)
For more advanced usage of component placement (e.g., heterogeneous cluster with different GPU models, robotic hardware, or CPU-only nodes) and customization in code, see Worker Placement Strategy.
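The rank syntax above can be made concrete with a small parser. This is only an illustrative sketch, not RLinf's own implementation: the function names parse_ranks and locate are hypothetical, and the default of 8 accelerators per node is an assumption taken from the "14 is accelerator 6 on node 1" example.

```python
def parse_ranks(spec, total_accelerators):
    """Expand a placement value such as "0-3,5,14" into a sorted list of ranks."""
    spec = str(spec).strip()
    if spec == "all":
        return list(range(total_accelerators))
    ranks = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            ranks.update(range(start, end + 1))  # ranges are inclusive
        else:
            ranks.add(int(part))
    return sorted(ranks)


def locate(rank, accelerators_per_node=8):
    """Map a global rank to (node index, local accelerator index)."""
    return divmod(rank, accelerators_per_node)


print(parse_ranks("0-3,5,14", total_accelerators=16))  # [0, 1, 2, 3, 5, 14]
print(locate(14))  # (1, 6)
```

Global ranks are counted across nodes, which is why rank 14 lands on the second node in a 2 x 8 cluster.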
runner#
runner:
  task_type: math
  logger:
    log_path: ${runner.output_dir}/${runner.experiment_name}
    project_name: rlinf
    experiment_name: ${runner.experiment_name}
    logger_backends: ["tensorboard"] # wandb, swanlab
  max_epochs: 5
  max_steps: -1
  val_check_interval: 1
  save_interval: 50
  seq_length: 2048
  resume_dir: null
  experiment_name: grpo-1.5b
  output_dir: ../results
runner.task_type: Task type identifier, math or embodied.
logger:
runner.logger.log_path: Base directory for log files.
runner.logger.project_name: Project name for experiment tracking.
runner.logger.experiment_name: Specific experiment name.
runner.logger.logger_backends: List of logging backends (tensorboard, wandb, swanlab).
See more details about logger backends in Training Visualisation.
runner.max_epochs: Maximum number of training epochs.
runner.max_steps: Maximum training steps. If set to -1, it is derived automatically from runner.max_epochs.
runner.val_check_interval: How often to launch a validation rollout (-1 to disable).
runner.save_interval: Checkpoint frequency in trainer steps.
runner.seq_length: Total sequence length (prompt + generated response) fed into models.
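Values such as runner.logger.log_path use ${...} interpolation, which the configuration system resolves against other keys when the file is loaded. The toy resolver below only illustrates the idea for plain key references; it is not the real resolver, and the helper name resolve is hypothetical.

```python
import re


def resolve(config, text):
    """Replace each ${a.b.c} reference in `text` with the value found in `config`."""
    def lookup(match):
        node = config
        for key in match.group(1).split("."):
            node = node[key]
        return resolve(config, str(node))  # values may contain references too
    return re.sub(r"\$\{([\w.]+)\}", lookup, text)


config = {
    "runner": {
        "experiment_name": "grpo-1.5b",
        "output_dir": "../results",
        "logger": {"log_path": "${runner.output_dir}/${runner.experiment_name}"},
    }
}
log_path = resolve(config, config["runner"]["logger"]["log_path"])
print(log_path)  # ../results/grpo-1.5b
```

With the defaults shown above, logs therefore end up under ../results/grpo-1.5b.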
algorithm#
algorithm:
  group_size: 2
  logprob_forward_micro_batch_size: 1
  val_rollout_batch_size_per_gpu: 4
  loss_type: ppo
  loss_agg_func: "token-mean"
  kl_beta: 0.0
  kl_penalty_type: low_var_kl
  ratio_clip_eps: 0.2
  entropy_bonus: 0.0
  calculate_entropy: False
  clip_ratio_c: null
  adv_type: grpo
  normalize_advantages: True
  early_stop_imp_ratio: 5.0
  use_valid_token_scale: False
  sampling_params:
    do_sample: True
    temperature: 1.0
    top_k: 1000000
    top_p: 1.0
    repetition_penalty: 1.0
algorithm.group_size: Responses per prompt (set > 1 to enable group baselines).
algorithm.logprob_forward_micro_batch_size: Micro-batch size for log-prob forward passes.
algorithm.val_rollout_batch_size_per_gpu: Validation rollout micro-batch per GPU.
algorithm.loss_type: Policy loss type (e.g., ppo).
algorithm.loss_agg_func: How to aggregate token losses (e.g., token-mean).
algorithm.kl_beta: Weight of KL penalty added to rewards.
algorithm.kl_penalty_type: KL shaping variant (e.g., low_var_kl).
algorithm.ratio_clip_eps: PPO clipping epsilon for importance ratios.
algorithm.entropy_bonus: Entropy reward coefficient.
algorithm.calculate_entropy: Whether to compute/persist entropy terms.
algorithm.adv_type: Advantage estimator type (e.g., grpo).
algorithm.normalize_advantages: Normalize advantages across the batch.
algorithm.early_stop_imp_ratio: Stop an update early if ratios exceed this threshold.
algorithm.use_valid_token_scale: Scale losses/advantages by valid-token masks.
sampling_params:
algorithm.sampling_params.do_sample: Deterministic decoding if False.
algorithm.sampling_params.temperature: Softmax temperature during sampling.
algorithm.sampling_params.top_k: Top-k cutoff (use a very large value to disable).
algorithm.sampling_params.top_p: Nucleus sampling threshold.
algorithm.sampling_params.repetition_penalty: Penalize repeated tokens.
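To show how group_size, normalize_advantages, and ratio_clip_eps typically interact, here is a minimal sketch of the textbook GRPO group baseline and the PPO clipped surrogate. It is an assumption-laden illustration of the general technique, not RLinf's loss code; the function names and the eps stabilizer are hypothetical.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Two responses to the same prompt (group_size = 2), one correct and one not.
advs = group_advantages([5.0, -5.0])
print([round(a, 3) for a in advs])  # [1.0, -1.0]

# An importance ratio of 1.5 is clipped to 1 + ratio_clip_eps = 1.2.
print(clipped_objective(1.5, 1.0))  # 1.2
```

With group_size set to 1 the group baseline degenerates (every advantage is zero), which is why the key is documented as needing a value above 1 for group baselines.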
rollout#
rollout:
  group_name: "RolloutGroup"
  gpu_memory_utilization: 0.55
  model:
    model_path: ../../model/DeepSeek-R1-Distill-Qwen-1.5B/
    model_type: qwen2.5
  recompute_logprobs: True
rollout.gpu_memory_utilization: Target GPU memory utilization fraction.
rollout.group_name: Logical name for rollout/inference workers.
rollout.model.model_path: Path to the HF model used by the generation backend.
rollout.model.model_type: Internal architecture tag used by the backend (e.g., qwen2.5).
rollout.recompute_logprobs: Recompute log-probs for sampled sequences.
actor#
actor:
  group_name: "ActorGroup"
  model:
    megatron_checkpoint: null
  seed: 1234
Top-level
actor.group_name: Logical name for the training (actor) workers.
actor.model.megatron_checkpoint: Path to a megatron model checkpoint to load before training.
actor.seed: Global seed for reproducibility.
reward#
reward:
  use_reward_model: false
reward.use_reward_model: Whether to use a reward model.
critic#
critic:
  use_critic_model: false
critic.use_critic_model: Whether to use a critic model.
MATH-specific Configuration#
runner#
runner:
  enable_dynamic_batch_size: False
  max_tokens_per_mbs: 2048
runner.enable_dynamic_batch_size: Whether to use dynamic batch sizes when training with Megatron.
runner.max_tokens_per_mbs: Upper limit of tokens in a Megatron microbatch when dynamic batching is enabled.
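One way to picture a token-capped micro-batch is greedy packing: sequences are added to the current micro-batch until the next one would exceed the budget. The sketch below is an assumption about the general idea only; how the Megatron backend actually forms dynamic batches is not described here, and pack_by_token_budget is a hypothetical helper.

```python
def pack_by_token_budget(seq_lengths, max_tokens_per_mbs):
    """Greedily group sequence indices so each micro-batch stays within the budget."""
    batches, current, used = [], [], 0
    for index, length in enumerate(seq_lengths):
        if length > max_tokens_per_mbs:
            raise ValueError(f"sequence {index} exceeds the token budget")
        if used + length > max_tokens_per_mbs:
            batches.append(current)  # close the full micro-batch
            current, used = [], 0
        current.append(index)
        used += length
    if current:
        batches.append(current)
    return batches


print(pack_by_token_budget([900, 700, 600, 1500, 300], max_tokens_per_mbs=2048))
# [[0, 1], [2], [3, 4]]
```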
algorithm#
algorithm:
  n_minibatches: 4
  training_batch_size_per_gpu: 1
  rollout_batch_size_per_gpu: null
  sampling_params:
    max_new_tokens: ${subtract:${runner.seq_length}, ${data.max_prompt_length}}
    min_new_tokens: 1
algorithm.n_minibatches: Number of gradient updates per batch.
algorithm.training_batch_size_per_gpu: Micro-batch size on each actor GPU.
algorithm.rollout_batch_size_per_gpu: Inference micro-batch per GPU; null divides the global rollout batch evenly.
sampling_params:
algorithm.sampling_params.max_new_tokens: Max generated tokens; computed from runner.seq_length and data.max_prompt_length.
algorithm.sampling_params.min_new_tokens: Minimum generated tokens.
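The subtract expression simply reserves whatever part of the sequence budget the prompt cannot use. A minimal sketch of that arithmetic follows; the function name and the validation check are assumptions added for illustration.

```python
def max_new_tokens(seq_length, max_prompt_length):
    """Mirror of ${subtract:${runner.seq_length}, ${data.max_prompt_length}}."""
    if max_prompt_length >= seq_length:
        # Assumed sanity check: no room would be left for generation.
        raise ValueError("max_prompt_length must be smaller than seq_length")
    return seq_length - max_prompt_length


# runner.seq_length = 2048 and data.max_prompt_length = 1024 in this reference.
print(max_new_tokens(2048, 1024))  # 1024
```

Raising data.max_prompt_length without raising runner.seq_length therefore shortens the longest possible response.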
rollout#
rollout:
  enforce_eager: False # if False, rollout engine will capture cuda graph, which will take more time to initialize.
  distributed_executor_backend: mp # ray or mp
  disable_log_stats: False
  detokenize: False # Whether to detokenize the output. During RL we actually don't need to detokenize it. Can be set to True for debugging.
  padding: null # will be tokenizer.pad_token_id if null. it is used to filter megatron's padding for rollout engine
  eos: null # will be tokenizer.eos_token_id if null.
  attention_backend: triton
  tensor_parallel_size: 1
  pipeline_parallel_size: 1
  validate_weight: False # whether to send all weights at first for weight comparison.
  validate_save_dir: null # the directory to save the weights for comparison. If validate_weight is True, this will be used to save the weights for comparison.
  print_outputs: False # whether to print the outputs (token ids, texts, etc.) of rollout engine.
  sglang_decode_log_interval: 500000 # the interval for SGLang to log the decode time and other stats.
  max_running_requests: 64 # the maximum number of running requests in the rollout engine.
  cuda_graph_max_bs: 128 # the maximum batch size for cuda graph. If the batch size is larger than this, cuda graph will not be used.
  use_torch_compile: False # enable torch_compile in SGLang for rollout.
  torch_compile_max_bs: 128 # the maximum batch size for torch compile. If the batch size is larger than this, torch compile will not be used.
rollout.enforce_eager: If True, disable CUDA graph capture to shorten warm-up.
rollout.distributed_executor_backend: Backend for launching rollout workers (mp or ray).
rollout.disable_log_stats: Suppress periodic backend stats logging.
rollout.detokenize: Detokenize outputs for debugging (RL usually uses token ids only).
rollout.padding: Pad token id override; null uses tokenizer.pad_token_id. Used to filter Megatron's padding for the rollout engine.
rollout.eos: EOS token id override; null uses tokenizer.eos_token_id.
rollout.attention_backend: Attention kernel backend (e.g., triton).
rollout.tensor_parallel_size: TP degree inside the generation backend.
rollout.pipeline_parallel_size: PP degree inside the generation backend.
See more details about the parallelism in 5D Parallelism Configuration.
rollout.validate_weight: Send full weights once for cross-check/validation.
rollout.validate_save_dir: Directory to store weights for comparison when validation is enabled.
rollout.print_outputs: Print token ids/texts from the engine for debugging.
rollout.sglang_decode_log_interval: Interval for SGLang to log decode stats.
rollout.max_running_requests: Max concurrent decode requests.
rollout.cuda_graph_max_bs: Max batch size eligible for CUDA graph.
rollout.use_torch_compile: Enable torch.compile inside SGLang.
rollout.torch_compile_max_bs: Max batch size eligible for torch.compile.
data#
data:
  type: math
  max_prompt_length: 1024
  rollout_batch_size: 64
  val_rollout_batch_size: null
  num_workers: 2
  prompt_key: prompt
  shuffle: True
  validation_shuffle: True
  seed: 1234
  train_data_paths: ["../../data/boba/AReaL-boba-106k.jsonl"]
  val_data_paths: ["../../data/boba/AReaL-boba-106k.jsonl"]
data.type: Dataset/task family (e.g., math).
data.max_prompt_length: Maximum tokens allowed for prompts.
data.rollout_batch_size: Global rollout batch size across engines.
data.val_rollout_batch_size: Global validation rollout batch size; null falls back to data.rollout_batch_size.
data.num_workers: Data loader workers per actor rank.
data.prompt_key: JSONL key that stores the prompt text.
data.shuffle: Shuffle training data each epoch.
data.validation_shuffle: Shuffle validation data (usually keep True for on-policy eval variety).
data.seed: RNG seed for loaders and sampling.
data.train_data_paths: List of training JSONL file paths.
data.val_data_paths: List of validation JSONL file paths.
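The roles of prompt_key, shuffle, and seed can be sketched with a few lines of standard-library Python. This is an illustration under stated assumptions, not the project's data loader: load_prompts is a hypothetical helper, and the two sample records are made up for the example.

```python
import io
import json
import random


def load_prompts(stream, prompt_key="prompt", shuffle=True, seed=1234):
    """Read one JSON object per line and return the text stored under prompt_key."""
    prompts = [json.loads(line)[prompt_key] for line in stream if line.strip()]
    if shuffle:
        random.Random(seed).shuffle(prompts)  # seeded, so the order is reproducible
    return prompts


# Stand-in for a JSONL file listed in data.train_data_paths.
sample = io.StringIO('{"prompt": "1 + 1 = ?"}\n{"prompt": "2 * 3 = ?"}\n')
print(load_prompts(sample, shuffle=False))  # ['1 + 1 = ?', '2 * 3 = ?']
```

If your dataset stores the question under a different field, only prompt_key needs to change.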
actor#
actor:
  training_backend: megatron
  mcore_gpt: True
  spec_name: decoder_gpt
  offload_optimizer: True
  offload_weight: True
  offload_grad: True
  enable_dp_load_balance: False
  calculate_flops: False
  model:
    precision: fp16
    add_bias_linear: False
    tensor_model_parallel_size: 1
    pipeline_model_parallel_size: 1
    activation: swiglu
    sequence_parallel: True
    # recompute_method: block
    # recompute_granularity: selective
    recompute_method: block
    recompute_granularity: full
    recompute_num_layers: 20
    seq_length: ${runner.seq_length}
    encoder_seq_length: ${runner.seq_length}
    normalization: rmsnorm
    position_embedding_type: rope
    apply_rope_fusion: True
    bias_dropout_fusion: False
    persist_layer_norm: False
    bias_activation_fusion: False
    attention_softmax_in_fp32: True
    batch_p2p_comm: False
    variable_seq_lengths: True
    gradient_accumulation_fusion: False
    moe_token_dispatcher_type: alltoall
    use_cpu_initialization: False
  optim:
    optimizer: adam
    bf16: False
    fp16: True
    lr: 2e-05
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-05
    min_lr: 2.0e-6
    weight_decay: 0.05
    use_distributed_optimizer: True
    overlap_grad_reduce: True
    overlap_param_gather: True
    optimizer_enable_pin: false
    overlap_param_gather_with_optimizer_step: False
    clip_grad: 1.0
    loss_scale_window: 5
  lr_sched:
    lr_warmup_fraction: 0.01
    lr_warmup_init: 0.0
    lr_warmup_iters: 0
    max_lr: 2.0e-5
    min_lr: 0.0
    lr_decay_style: constant
    lr_decay_iters: 10
  tokenizer:
    tokenizer_model: ../../model/DeepSeek-R1-Distill-Qwen-1.5B/
    use_fast: False
    trust_remote_code: True
    padding_side: 'right'
  megatron:
    ddp_bucket_size: null
    distributed_backend: nccl # Support 'nccl' and 'gloo'
    distributed_timeout_minutes: 30
    ckpt_format: torch
    use_dist_ckpt: False
    tp_comm_bootstrap_backend: nccl
    tp_comm_overlap_cfg: null
    use_hf_ckpt: True # if true, will transfer hf model to generate megatron checkpoint and use it for training.
    ckpt: # config for ckpt convertor
      model: DeepSeek-R1-Distill-Qwen-1.5B
      hf_model_path: ${rollout.model.model_path} # path to the hf model
      save_path: ${runner.output_dir}/${runner.experiment_name}/actor/megatron_ckpt_from_hf
      use_gpu_num: 0
      use_gpu_index: null
      process_num: 16 # number of processes to use for checkpointing
      tensor_model_parallel_size: ${actor.model.tensor_model_parallel_size}
      pipeline_model_parallel_size: ${actor.model.pipeline_model_parallel_size}
  fsdp_config:
    strategy: "fsdp"
    sharding_strategy: "no_shard"
    cpu_offload: False
    offload_pin_memory: False
    reshard_after_forward: True
    enable_gradient_accumulation: True
    forward_prefetch: False
    limit_all_gathers: False
    backward_prefetch: null
    use_orig_params: False
    use_liger_kernel: False
    fsdp_size: -1
    mixed_precision:
      param_dtype: ${actor.model.precision}
      reduce_dtype: ${actor.model.precision}
      buffer_dtype: ${actor.model.precision}
    amp_autocast:
      enabled: False
      precision: "bf16"
    grad_scaler:
      enabled: False
Top-level
actor.training_backend: Training backend (megatron).
actor.mcore_gpt: Use Megatron-Core GPT stack.
actor.spec_name: Model spec/preset name (e.g., decoder-only GPT).
actor.offload_optimizer: Offload optimizer state to CPU to reduce GPU memory.
actor.offload_weight: Offload model weights to CPU when possible (ZeRO-style).
actor.offload_grad: Offload gradients to CPU to reduce GPU memory.
actor.enable_dp_load_balance: Enable data-parallel load balancing.
actor.calculate_flops: Compute and log FLOPs for profiling.
Model sub-section
actor.model.precision: Numerical precision for training (e.g., fp16).
actor.model.add_bias_linear: Add bias terms to linear layers.
actor.model.tensor_model_parallel_size: TP degree for actor.
actor.model.pipeline_model_parallel_size: PP degree for actor.
actor.model.activation: Activation function (e.g., swiglu).
actor.model.sequence_parallel: Enable sequence parallelism (requires TP).
actor.model.recompute_method: Activation recompute strategy (e.g., block).
actor.model.recompute_granularity: Recompute scope (e.g., full or selective).
actor.model.recompute_num_layers: Number of layers to checkpoint/recompute.
actor.model.seq_length: Decoder context length for training.
actor.model.encoder_seq_length: Encoder length (for encoder-decoder; mirrors seq_length here).
actor.model.normalization: Norm layer type (e.g., rmsnorm).
actor.model.position_embedding_type: Positional embedding type (e.g., rope).
actor.model.apply_rope_fusion: Use fused RoPE kernels if available.
actor.model.bias_dropout_fusion: Fuse bias + dropout kernels.
actor.model.persist_layer_norm: Persist LN params in higher precision.
actor.model.bias_activation_fusion: Fuse bias + activation kernels.
actor.model.attention_softmax_in_fp32: Compute attention softmax in FP32 for stability.
actor.model.batch_p2p_comm: Batch P2P communications across layers.
actor.model.variable_seq_lengths: Allow variable sequence lengths per micro-batch.
actor.model.gradient_accumulation_fusion: Fused gradient accumulation.
actor.model.moe_token_dispatcher_type: MoE token dispatcher (e.g., alltoall).
actor.model.use_cpu_initialization: Initialize weights on CPU to reduce GPU spikes.
Optimizer
actor.optim.optimizer: Optimizer choice (adam).
actor.optim.bf16 / actor.optim.fp16: Mixed precision flags.
actor.optim.lr: Base learning rate.
actor.optim.adam_beta1 / adam_beta2 / adam_eps: Adam hyper-parameters.
actor.optim.min_lr: Minimum LR (for schedulers that decay below base LR).
actor.optim.weight_decay: L2 weight decay.
actor.optim.use_distributed_optimizer: Use Megatron distributed optimizer.
actor.optim.overlap_grad_reduce: Overlap gradient reduction with backward pass.
actor.optim.overlap_param_gather: Overlap parameter all-gather with forward pass.
actor.optim.optimizer_enable_pin: Pin optimizer memory.
actor.optim.overlap_param_gather_with_optimizer_step: Overlap param gather with step.
actor.optim.clip_grad: Global gradient clipping norm.
actor.optim.loss_scale_window: Dynamic loss scale window for FP16.
LR schedule
actor.lr_sched.lr_warmup_fraction: Warm-up as a fraction of total iters.
actor.lr_sched.lr_warmup_init: Initial LR value during warm-up.
actor.lr_sched.lr_warmup_iters: Warm-up iterations (overrides fraction when > 0).
actor.lr_sched.max_lr / min_lr: LR bounds for schedulers.
actor.lr_sched.lr_decay_style: Decay policy (e.g., constant).
actor.lr_sched.lr_decay_iters: Total decay iterations.
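The interplay between lr_warmup_iters and lr_warmup_fraction can be sketched as follows. This is a simplified illustration: it assumes the fraction is applied to lr_decay_iters and that warm-up is linear, and it only models the constant decay style. The function names are hypothetical, and the numbers in the usage lines are chosen for readability rather than taken from the defaults above.

```python
def warmup_iters(lr_warmup_iters, lr_warmup_fraction, lr_decay_iters):
    """An explicit iteration count wins; otherwise derive it from the fraction."""
    if lr_warmup_iters > 0:
        return lr_warmup_iters
    return int(lr_warmup_fraction * lr_decay_iters)


def lr_at(step, max_lr, lr_warmup_init, n_warmup):
    """Linear warm-up from lr_warmup_init to max_lr, then constant."""
    if n_warmup > 0 and step < n_warmup:
        return lr_warmup_init + (max_lr - lr_warmup_init) * step / n_warmup
    return max_lr


n = warmup_iters(lr_warmup_iters=0, lr_warmup_fraction=0.01, lr_decay_iters=1000)
print(n)  # 10
print(lr_at(5, max_lr=2.0e-5, lr_warmup_init=0.0, n_warmup=n))   # halfway through warm-up
print(lr_at(50, max_lr=2.0e-5, lr_warmup_init=0.0, n_warmup=n))  # 2e-05
```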
Tokenizer
actor.tokenizer.tokenizer_model: Path/name of the tokenizer.
actor.tokenizer.use_fast: Use HF fast tokenizer.
actor.tokenizer.trust_remote_code: Allow custom tokenizer code.
actor.tokenizer.padding_side: left or right padding.
Megatron integration
actor.megatron.ddp_bucket_size: DDP gradient bucket size.
actor.megatron.distributed_backend: Distributed backend (nccl or gloo).
actor.megatron.distributed_timeout_minutes: Backend communication timeout.
actor.megatron.ckpt_format: Checkpoint format (e.g., torch).
actor.megatron.use_dist_ckpt: Use distributed checkpointing (sharded).
actor.megatron.tp_comm_bootstrap_backend: Backend used for TP bootstrap (e.g., nccl).
actor.megatron.tp_comm_overlap_cfg: YAML path for TP comm/compute overlap.
actor.megatron.use_hf_ckpt: Convert/load from a HuggingFace checkpoint for training.
Megatron checkpoint converter
actor.megatron.ckpt.model: Model name for the converter metadata.
actor.megatron.ckpt.hf_model_path: Source HF model path.
actor.megatron.ckpt.save_path: Target directory to write Megatron checkpoints.
actor.megatron.ckpt.use_gpu_num: Number of GPUs to use for conversion.
actor.megatron.ckpt.use_gpu_index: Specific GPU index to use.
actor.megatron.ckpt.process_num: CPU processes for conversion work.
actor.megatron.ckpt.tensor_model_parallel_size: TP degree for converted checkpoints.
actor.megatron.ckpt.pipeline_model_parallel_size: PP degree for converted checkpoints.
FSDP Integration:
actor.fsdp_config.strategy: Determines the FSDP strategy used, supporting fsdp and fsdp2 (case-insensitive).
actor.fsdp_config.sharding_strategy: FSDP/FSDP2 parameter, indicating the sharding strategy used by FSDP, supporting full_shard, shard_grad_op, hybrid_shard, and no_shard.
actor.fsdp_config.cpu_offload: FSDP2 parameter, determines whether FSDP2 places parameters on the CPU side, transmitting them to the GPU side only when necessary.
actor.fsdp_config.offload_pin_memory: FSDP2 parameter, only effective when the cpu_offload option is True. If true, the CPU-side memory is pinned memory to improve transmission efficiency.
actor.fsdp_config.reshard_after_forward: FSDP2 parameter, indicates whether to reshard parameters after forward propagation to save GPU memory.
actor.fsdp_config.enable_gradient_accumulation: FSDP/FSDP2 parameter, indicates whether to enable gradient accumulation. If true, communication and gradient updates are only performed after the last micro-batch. Enabling this increases GPU memory usage but speeds up training.
actor.fsdp_config.forward_prefetch: FSDP parameter, indicates whether to prefetch the next all-gather operation during forward propagation. Enabling this increases GPU memory usage; it is recommended to enable it when GPU memory is sufficient to overlap communication and computation, thereby improving performance.
actor.fsdp_config.limit_all_gathers: FSDP parameter, indicates whether to limit the number of concurrent all-gather operations. It is recommended to enable this when CPU or memory is a bottleneck.
actor.fsdp_config.backward_prefetch: FSDP parameter, indicating the prefetch strategy during backpropagation (null/’pre’/’post’). If ‘pre’, the next all-gather operation is prefetched during gradient computation, resulting in more aggressive overlap and higher throughput. If ‘post’, the next all-gather operation is prefetched after the current gradient computation is complete, which is more conservative than ‘pre’.
actor.fsdp_config.use_orig_params: FSDP parameter, indicating whether to use the module’s original parameters, exposing the original parameters (nn.Module.named_parameters) instead of the flattened parameters of FSDP. This improves compatibility but introduces additional communication overhead and reduces performance.
actor.fsdp_config.use_liger_kernel: FSDP/FSDP2 parameter, determines whether to use liger_kernel (currently only supported for some models, including qwen2.5 and qwen2.5-vl). Enabling it can reduce GPU memory usage and improve training speed.
actor.fsdp_config.fsdp_size: FSDP2 parameter. If not -1, FSDP2 will group slices according to the size specified by this parameter.
actor.fsdp_config.mixed_precision.param_dtype: FSDP/FSDP2 parameter, specifying the parameter type.
actor.fsdp_config.mixed_precision.reduce_dtype: FSDP/FSDP2 parameter, specifying the data type used during reduction.
actor.fsdp_config.mixed_precision.buffer_dtype: FSDP parameter, specifying the data type used for the buffer.
actor.fsdp_config.amp_autocast.enabled: FSDP/FSDP2 parameter, indicating whether automatic mixed-precision training is enabled.
actor.fsdp_config.amp_autocast.precision: FSDP/FSDP2 parameter, indicating the numerical precision used by AMP.
actor.fsdp_config.grad_scaler.enabled: FSDP/FSDP2 parameter, indicating whether the gradient scaler is enabled.
actor.fsdp_config.grad_scaler.init_scale: FSDP/FSDP2 parameter, indicating the initial scale factor used by the gradient scaler to prevent numerical underflow.
actor.fsdp_config.grad_scaler.growth_interval: FSDP/FSDP2 parameter, indicating the number of consecutive steps without gradient overflows required before the scale factor is increased.
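The meaning of init_scale and growth_interval is easiest to see in a toy dynamic loss scaler. The class below is a pure-Python sketch of the usual scheme (back off on overflow, grow after a run of clean steps); it is not the scaler used by the training backend, and growth_factor and backoff_factor are assumed values that do not appear in this configuration.

```python
class ToyGradScaler:
    """Minimal model of dynamic loss scaling for FP16 training."""

    def __init__(self, init_scale=65536.0, growth_interval=2000,
                 growth_factor=2.0, backoff_factor=0.5):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.growth_factor = growth_factor    # assumed, not a documented key
        self.backoff_factor = backoff_factor  # assumed, not a documented key
        self._good_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            # Overflow: shrink the scale and restart the clean-step counter.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0


scaler = ToyGradScaler(init_scale=1024.0, growth_interval=2)
for overflow in [False, False, True, False]:
    scaler.update(overflow)
print(scaler.scale)  # 1024.0
```

In the run above the scale doubles after two clean steps and is halved again by the overflow on the third step.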
reward#
reward:
  reward_type: math
  reward_scale: 5.0
reward.reward_type: Which reward type to use for the training.
reward.reward_scale: Reward magnitude. A correct answer receives +reward_scale; an incorrect answer receives -reward_scale.
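The reward rule for the math task is symmetric, as the following sketch shows. The function name math_reward is hypothetical, and how correctness is judged is outside the scope of this example.

```python
def math_reward(is_correct, reward_scale=5.0):
    """Return +reward_scale for a correct answer and -reward_scale otherwise."""
    return reward_scale if is_correct else -reward_scale


print(math_reward(True), math_reward(False))  # 5.0 -5.0
```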
Embody-specific Configuration#
defaults#
defaults:
  - env/manikill_put_carrot_on_plate_in_scene@env.train
  - env/manikill_put_carrot_on_plate_in_scene@env.eval
defaults: Hydra configuration inheritance. Specifies which environment configurations to load for training and evaluation.
hydra#
hydra:
  searchpath:
    - file://${oc.env:REPO_PATH}/config/
hydra.searchpath: Additional search paths for configuration files.
runner#
runner:
  only_eval: False
  max_prompt_length: 30
runner.only_eval: Run evaluation only without training.
runner.max_prompt_length: Maximum prompt length in tokens.
algorithm#
algorithm:
  normalize_advantages: True
  kl_penalty: kl
  rollout_epoch: 1
  reward_type: chunk_level
  logprob_type: token_level
  entropy_type: token_level
  length_params:
    max_new_token: null
    max_length: 1024
    min_length: 1
algorithm.normalize_advantages: Normalize advantages across the batch.
algorithm.rollout_epoch: Number of rollout epochs per training step.
algorithm.reward_type: Reward aggregation level (chunk_level, action_level).
algorithm.logprob_type: Log probability computation level.
algorithm.entropy_type: Entropy computation level.
length_params:
algorithm.length_params.max_new_token: Maximum new tokens to generate.
algorithm.length_params.max_length: Maximum total sequence length.
algorithm.length_params.min_length: Minimum sequence length.
env#
env:
  group_name: "EnvGroup"
  channel:
    name: "env_buffer_list"
    queue_name: "obs_buffer"
    queue_size: 0
  enable_offload: True
  train:
    total_num_envs: null
    auto_reset: False
    ignore_terminations: False
    use_fixed_reset_state_ids: True
    max_episode_steps: 10
  eval:
    total_num_envs: null
    auto_reset: False
    ignore_terminations: False
    use_fixed_reset_state_ids: True
    max_episode_steps: 10
env.group_name: Logical name for environment worker group.
env.channel.name: Shared memory channel name for inter-process communication.
env.channel.queue_name: Queue name for observation buffer.
env.channel.queue_size: Queue size (0 for unlimited).
env.enable_offload: Enable environment offloading to reduce memory usage.
env.train.total_num_envs: Total number of parallel environments for training.
env.train.auto_reset: Automatically reset environments when episodes terminate.
env.train.ignore_terminations: Ignore episode terminations during training (if enabled, episode only ends when it reaches the max_episode_steps).
env.train.use_fixed_reset_state_ids: Use fixed reset state IDs (false for randomization). Always True for GRPO; defaults to False for PPO.
env.train.max_episode_steps: Maximum number of steps per episode for training.
env.eval.total_num_envs: Total number of parallel environments for evaluation.
env.eval.auto_reset: Automatically reset environments when episodes terminate for evaluation.
env.eval.ignore_terminations: Ignore episode terminations during evaluation (if enabled, episode only ends when it reaches the max_episode_steps for evaluation).
env.eval.use_fixed_reset_state_ids: Use fixed reset state IDs (false for randomization). Always True for GRPO; defaults to False for PPO.
env.eval.max_episode_steps: Maximum number of steps per episode for evaluation.
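How ignore_terminations and max_episode_steps jointly decide when an episode ends can be sketched as below. The helper episode_length and its terminated_at argument are hypothetical; they stand in for whatever termination signal the simulator reports.

```python
def episode_length(terminated_at, max_episode_steps, ignore_terminations):
    """Return the number of steps executed before the episode ends.

    terminated_at is the 1-based step at which the environment reports
    termination, or None if it never does.
    """
    if terminated_at is None or ignore_terminations:
        return max_episode_steps  # only the step limit can end the episode
    return min(terminated_at, max_episode_steps)


print(episode_length(4, 10, ignore_terminations=False))  # 4
print(episode_length(4, 10, ignore_terminations=True))   # 10
```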
rollout#
rollout:
  channel:
    name: ${env.channel.name}
    queue_name: "action_buffer"
    queue_size: 0
  mode: "collocate"
  backend: "huggingface"
  enforce_eager: True
  enable_offload: True
  pipeline_stage_num: 2
rollout.channel.name: Shared memory channel (inherits from env).
rollout.channel.queue_name: Queue name for action buffer.
rollout.channel.queue_size: Queue size.
rollout.mode: Rollout mode (collocate for shared GPU).
rollout.backend: Model backend (huggingface, vllm).
rollout.pipeline_stage_num: Number of pipeline stages for rollout.
actor#
actor:
  channel:
    name: ${env.channel.name}
    queue_name: "replay_buffer"
    queue_size: 0
  training_backend: "fsdp"
  micro_batch_size: 8
  global_batch_size: 160
  enable_offload: True
  model:
    model_path: "/path/to/huggingface_model"
    model_type: "openvla_oft"
    action_dim: 7
    num_action_chunks: 8
    use_proprio: False
    unnorm_key: bridge_orig
    value_type: ${algorithm.reward_type}
    val_micro_batch_size: 8
    center_crop: True
    do_sample: False
    precision: "bf16"
    add_bias_linear: False
    add_qkv_bias: True
    vocab_size: 32000
    hidden_size: 4096
    policy_setup: "widowx_bridge"
    image_size: [224, 224]
    is_lora: True
    lora_rank: 32
    lora_path: /storage/models/oft-sft/lora_004000
    num_images_in_input: 1
    attn_implementation: "flash_attention_2"
    low_cpu_mem_usage: True
    trust_remote_code: True
  tokenizer:
    tokenizer_type: "HuggingFaceTokenizer"
    tokenizer_model: "/storage/download_models/Openvla-oft-SFT-libero10-trajall/"
    extra_vocab_size: 421
    use_fast: False
    trust_remote_code: True
    padding_side: "right"
  optim:
    lr: 1.0e-4
    value_lr: 3.0e-3
    adam_beta1: 0.9
    adam_beta2: 0.999
    adam_eps: 1.0e-05
    clip_grad: 10.0
actor.channel.name: Shared memory channel (inherits from env).
actor.channel.queue_name: Queue name for replay buffer.
actor.training_backend: Training backend (fsdp for distributed training).
actor.micro_batch_size: Micro-batch size per GPU.
actor.global_batch_size: Global batch size across all GPUs.
actor.enable_offload: Enable model offloading to reduce memory usage.
Model Configuration:
actor.model.model_type: Model architecture name (openvla_oft).
actor.model.model_path: Path to huggingface model.
actor.model.action_dim: Action space dimensionality.
actor.model.num_action_chunks: Number of action chunks per sequence.
actor.model.use_proprio: Whether to use proprioceptive information.
actor.model.unnorm_key: Key for action normalization.
actor.model.value_type: Value function type (inherits from algorithm.reward_type).
actor.model.val_micro_batch_size: Micro-batch size for value function computation.
actor.model.center_crop: Whether to center crop input images.
actor.model.do_sample: Whether to use sampling during inference.
actor.model.precision: Numerical precision (bf16, fp16, fp32).
actor.model.add_bias_linear: Add bias to linear layers.
actor.model.add_qkv_bias: Add bias to QKV projections.
actor.model.vocab_size: Vocabulary size.
actor.model.hidden_size: Hidden dimension size.
actor.model.policy_setup: Policy configuration (widowx_bridge).
actor.model.image_size: Input image dimensions [height, width].
actor.model.is_lora: Whether to use LoRA fine-tuning.
actor.model.lora_rank: LoRA rank for low-rank adaptation.
actor.model.lora_path: Path to LoRA weights.
actor.model.num_images_in_input: Number of images in model input.
actor.model.attn_implementation: Attention implementation (flash_attention_2).
actor.model.low_cpu_mem_usage: Use low CPU memory initialization.
actor.model.trust_remote_code: Trust remote code in model loading.
Tokenizer Configuration:
actor.tokenizer.tokenizer_type: Tokenizer type (HuggingFaceTokenizer).
actor.tokenizer.tokenizer_model: Path to tokenizer model.
actor.tokenizer.extra_vocab_size: Additional vocabulary size.
actor.tokenizer.use_fast: Use fast tokenizer implementation.
actor.tokenizer.trust_remote_code: Trust remote code in tokenizer.
actor.tokenizer.padding_side: Padding side (left or right).
Optimizer Configuration:
actor.optim.lr: Learning rate for policy network.
actor.optim.value_lr: Learning rate for value function.
actor.optim.adam_beta1 / adam_beta2: Adam optimizer beta parameters.
actor.optim.adam_eps: Adam optimizer epsilon.
actor.optim.clip_grad: Gradient clipping norm.
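The relationship between actor.micro_batch_size and actor.global_batch_size is commonly expressed as a number of gradient-accumulation steps. The sketch below assumes pure data parallelism across num_gpus devices; both that assumption and the helper name are illustrative and not confirmed by this reference.

```python
def grad_accumulation_steps(global_batch_size, micro_batch_size, num_gpus):
    """Micro-batches each GPU processes before one optimizer step."""
    per_pass = micro_batch_size * num_gpus  # samples consumed by one forward/backward pass
    if global_batch_size % per_pass != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * num_gpus")
    return global_batch_size // per_pass


# micro_batch_size = 8 and global_batch_size = 160 as above, on a hypothetical 4-GPU node.
print(grad_accumulation_steps(160, 8, num_gpus=4))  # 5
```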
Env-based#
The following configuration describes the key parameters of the environment, using Libero-10 as an example.
The path is
Environment Type
env_type: libero
task_suite_name: libero_10
env_type: Specifies the simulator type (libero for Libero benchmark).
task_suite_name: Specifies the task suite (libero_10 for 10-task benchmark).
Episode Configuration
auto_reset: ${algorithm.auto_reset}
ignore_terminations: ${algorithm.ignore_terminations}
max_episode_steps: 512
auto_reset: Automatically reset environment when episode terminates (inherits from algorithm config).
ignore_terminations: Ignore episode terminations during training (inherits from algorithm config).
max_episode_steps: Maximum number of steps per episode (512 for complex Libero tasks).
Reward Configuration
use_rel_reward: true
reward_coef: 5.0
use_rel_reward: Use relative rewards (difference between current and previous step rewards).
reward_coef: Reward coefficient for scaling rewards (5.0 for amplified reward signals).
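A relative reward turns a cumulative or persistent per-step signal into increments. The sketch below illustrates the description above; relative_rewards is a hypothetical helper, and treating the reward before the first step as 0 is an assumption.

```python
def relative_rewards(step_rewards, reward_coef=5.0, use_rel_reward=True):
    """Scale rewards by reward_coef, optionally using step-to-step differences."""
    scaled, previous = [], 0.0  # assumed baseline before the first step
    for reward in step_rewards:
        value = reward - previous if use_rel_reward else reward
        scaled.append(reward_coef * value)
        previous = reward
    return scaled


# A success signal that switches on at the third step and stays on.
print(relative_rewards([0.0, 0.0, 1.0, 1.0]))  # [0.0, 0.0, 5.0, 0.0]
print(relative_rewards([0.0, 0.0, 1.0, 1.0], use_rel_reward=False))  # [0.0, 0.0, 5.0, 5.0]
```

With use_rel_reward enabled, the agent is rewarded once when the signal changes rather than at every subsequent step.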
Randomization and Groups
seed: 0
group_size: 1
use_fixed_reset_state_ids: True
seed: Random seed for environment initialization (0 for reproducibility).
group_size: Number of environments per group (inherits from algorithm.group_size).
use_fixed_reset_state_ids: Use fixed reset state IDs (false for randomization). Always True for GRPO; defaults to False for PPO.
Environment Scaling
total_num_envs: null
total_num_envs: Total number of parallel environments for training or evaluation.
Video Recording
video_cfg:
  save_video: true
  info_on_video: true
  video_base_dir: ${runner.logger.log_path}/video/train
video_cfg.save_video: Enable video recording during training.
video_cfg.info_on_video: Overlay training information on videos.
video_cfg.video_base_dir: Directory to save training videos.
Camera Configuration
init_params:
  camera_heights: 256
  camera_widths: 256
init_params.camera_heights: Camera image height in pixels (256).
init_params.camera_widths: Camera image width in pixels (256).