Embodiment Configuration#

This section covers configuration parameters specific to embodied RL training (robot manipulation, simulators, VLA models). These extend the shared configuration described in Basic Configuration.

defaults#

defaults:
  - env/manikill_put_carrot_on_plate_in_scene@env.train
  - env/manikill_put_carrot_on_plate_in_scene@env.eval

defaults: Hydra configuration inheritance. Specifies which environment configurations to load for training and evaluation.

hydra#

hydra:
  searchpath:
    - file://${oc.env:REPO_PATH}/config/

hydra.searchpath: Additional search paths for configuration files.

runner#

runner:
  only_eval: False
  max_prompt_length: 30
  overlap_env_bootstrap: False

runner.only_eval: Run evaluation only without training.

runner.max_prompt_length: Maximum prompt length in tokens.

runner.overlap_env_bootstrap: Overlap environment bootstrap (reset) with actor training to hide reset latency. This is particularly useful when environment reset is slow. Note: This is only effective when env.train.enable_offload is False. Enabling this may increase GPU memory pressure if the environment and actor share the same accelerator.

algorithm#

algorithm:
  normalize_advantages: True
  kl_penalty: kl

  reward_type: chunk_level
  logprob_type: token_level
  entropy_type: token_level

algorithm.normalize_advantages: Normalize advantages across the batch.

algorithm.reward_type: Reward aggregation level (chunk_level, action_level).

algorithm.logprob_type: Log probability computation level.

algorithm.entropy_type: Entropy computation level.

env#

env:
  group_name: "EnvGroup"
  enable_offload: True

  train:
    rollout_epoch: 1
    total_num_envs: null
    auto_reset: False
    ignore_terminations: False
    use_fixed_reset_state_ids: True
    max_episode_steps: 10

  eval:
    rollout_epoch: 1
    total_num_envs: null
    auto_reset: False
    ignore_terminations: False
    use_fixed_reset_state_ids: True
    max_episode_steps: 10

env.group_name: Logical name for environment worker group.

env.enable_offload: Enable environment offloading to reduce memory usage.

env.train.rollout_epoch: Number of rollout epochs per training step.

env.train.total_num_envs: Total number of parallel environments for training.

env.train.auto_reset: Automatically reset environments when episodes terminate.

env.train.ignore_terminations: Ignore episode terminations during training (if enabled, episode only ends when it reaches the max_episode_steps).

env.train.use_fixed_reset_state_ids: Use fixed reset state IDs (false for randomization). Always True for GRPO, default be False for PPO.

env.train.max_episode_steps: Maximum number of steps per episode for training.

env.eval.rollout_epoch: Number of evaluation rollout epochs; metrics are averaged over passes with the same seeds.

env.eval.total_num_envs: Total number of parallel environments for evaluation.

env.eval.auto_reset: Automatically reset environments when episodes terminate for evaluation.

env.eval.ignore_terminations: Ignore episode terminations during evaluation (if enabled, episode only ends when it reaches the max_episode_steps for evaluation).

env.eval.use_fixed_reset_state_ids: Use fixed reset state IDs (false for randomization). Always True for GRPO, default be False for PPO.

env.eval.max_episode_steps: Maximum number of steps per episode for evaluation.

rollout#

rollout:
  sampling_params:
    do_sample: True
    temperature_train: 1.0
    temperature_eval: 0.6
    top_k: 0
    top_p: 1.0
    repetition_penalty: 1.0
    max_new_tokens: 7

  group_name: "RolloutGroup"
  backend: "huggingface"
  enable_offload: True
  pipeline_stage_num: 2

  model:
    model_path: "/path/to/hf_model"
    precision: ${actor.model.precision}

sampling_params (autoregressive VLA policies):

rollout.sampling_params.do_sample: Deterministic decoding if False.

rollout.sampling_params.temperature_train / temperature_eval: Sampling temperature for training and evaluation.

rollout.sampling_params.top_k / top_p: Top-k and nucleus sampling parameters.

rollout.sampling_params.repetition_penalty: Penalize repeated tokens.

rollout.sampling_params.max_new_tokens: Maximum generated tokens per step (action dimension).

Continuous policies (MLP, CNN, OpenPI, GR00T, etc.) do not use rollout.sampling_params.

rollout.group_name: Logical name for the rollout worker group.

rollout.backend: Model backend (huggingface, vllm).

rollout.enable_offload: Enable rollout model offloading to reduce GPU memory usage.

rollout.pipeline_stage_num: Number of pipeline stages for rollout.

rollout.model.model_path: Model checkpoint path used by rollout (may match actor).

rollout.model.precision: Inference precision for rollout.

actor#

actor:
  group_name: "ActorGroup"
  training_backend: "fsdp"
  micro_batch_size: 8
  global_batch_size: 160
  enable_offload: True

  model:
    model_path: "/path/to/huggingface_model"
    model_type: "openvla_oft"
    action_dim: 7
    num_action_chunks: 8
    use_proprio: False
    unnorm_key: bridge_orig
    value_type: ${algorithm.reward_type}
    val_micro_batch_size: 8
    center_crop: True
    do_sample: False

    precision: "bf16"
    add_bias_linear: False
    add_qkv_bias: True
    vocab_size: 32000
    hidden_size: 4096
    policy_setup: "widowx_bridge"
    image_size: [224, 224]
    is_lora: True
    lora_rank: 32
    lora_path: /storage/models/oft-sft/lora_004000
    num_images_in_input: 1
    attn_implementation: "flash_attention_2"
    low_cpu_mem_usage: True
    trust_remote_code: True

  tokenizer:
    tokenizer_type: "HuggingFaceTokenizer"
    tokenizer_model: "/storage/download_models/Openvla-oft-SFT-libero10-trajall/"
    extra_vocab_size: 421
    use_fast: False
    trust_remote_code: True
    padding_side: "right"

  optim:
    lr: 1.0e-4
    value_lr: 3.0e-3
    adam_beta1: 0.9
    adam_beta2: 0.999
    adam_eps: 1.0e-05
    clip_grad: 10.0

actor.group_name: Logical name for the actor worker group.

actor.training_backend: Training backend (fsdp for distributed training).

actor.micro_batch_size: Micro-batch size per GPU.

actor.global_batch_size: Global batch size across all GPUs.

actor.enable_offload: Enable model offloading to reduce memory usage.

Model Configuration:

actor.model.model_type: Model architecture name (openvla_oft).

actor.model.model_path: Path to huggingface model.

actor.model.action_dim: Action space dimensionality.

actor.model.num_action_chunks: Number of action chunks per sequence.

actor.model.use_proprio: Whether to use proprioceptive information.

actor.model.unnorm_key: Key for action normalization.

actor.model.value_type: Value function type (inherits from algorithm.reward_type).

actor.model.val_micro_batch_size: Micro-batch size for value function computation.

actor.model.center_crop: Whether to center crop input images.

actor.model.do_sample: Whether to use sampling during inference.

actor.model.precision: Numerical precision (bf16, fp16, fp32).

actor.model.add_bias_linear: Add bias to linear layers.

actor.model.add_qkv_bias: Add bias to QKV projections.

actor.model.vocab_size: Vocabulary size.

actor.model.hidden_size: Hidden dimension size.

actor.model.policy_setup: Policy configuration (widowx_bridge).

actor.model.image_size: Input image dimensions [height, width].

actor.model.is_lora: Whether to use LoRA fine-tuning.

actor.model.lora_rank: LoRA rank for low-rank adaptation.

actor.model.lora_path: Path to LoRA weights.

actor.model.num_images_in_input: Number of images in model input.

actor.model.attn_implementation: Attention implementation (flash_attention_2).

actor.model.low_cpu_mem_usage: Use low CPU memory initialization.

actor.model.trust_remote_code: Trust remote code in model loading.

Tokenizer Configuration:

actor.tokenizer.tokenizer_type: Tokenizer type (HuggingFaceTokenizer).

actor.tokenizer.tokenizer_model: Path to tokenizer model.

actor.tokenizer.extra_vocab_size: Additional vocabulary size.

actor.tokenizer.use_fast: Use fast tokenizer implementation.

actor.tokenizer.trust_remote_code: Trust remote code in tokenizer.

actor.tokenizer.padding_side: Padding side (left or right).

Optimizer Configuration:

actor.optim.lr: Learning rate for policy network.

actor.optim.value_lr: Learning rate for value function.

actor.optim.adam_beta1/beta2: Adam optimizer beta parameters.

actor.optim.adam_eps: Adam optimizer epsilon.

actor.optim.clip_grad: Gradient clipping norm.

Environment-Specific Configuration#

The following configuration describes the key parameters of the environment, using Libero-10 as an example.

The path is

Environment Type

env_type: libero
task_suite_name: libero_10

env_type: Specifies the simulator type (libero for Libero benchmark).

task_suite_name: Specifies the task suite (libero_10 for 10-task benchmark).

Episode Configuration

auto_reset: False
ignore_terminations: False
max_episode_steps: 512

auto_reset: Automatically reset environment when episode terminates (configured in env.train / env.eval).

ignore_terminations: Ignore episode terminations during training (configured in env.train).

max_episode_steps: Maximum number of steps per episode (512 for complex Libero tasks).

Reward Configuration

use_rel_reward: true
reward_coef: 5.0

use_rel_reward: Use relative rewards (difference between current and previous step rewards).

reward_coef: Reward coefficient for scaling rewards (5.0 for amplified reward signals).

Randomization and Groups

seed: 0
group_size: 1
use_fixed_reset_state_ids: True

seed: Random seed for environment initialization (0 for reproducibility).

group_size: Number of environments per group (inherits from algorithm.group_size).

use_fixed_reset_state_ids: Use fixed reset state IDs (false for randomization). Always True for GRPO, default be False for PPO.

Environment Scaling

total_num_envs: null

total_num_envs: Total number of parallel environments for training or evaluation.

Video Recording

video_cfg:
  save_video: true
  info_on_video: true
  video_base_dir: ${runner.logger.log_path}/video/train

video_cfg.save_video: Enable video recording during training.

video_cfg.info_on_video: Overlay training information on videos.

video_cfg.video_base_dir: Directory to save training videos.

Camera Configuration

init_params:
  camera_heights: 256
  camera_widths: 256

init_params.camera_heights: Camera image height in pixels (256).

init_params.camera_widths: Camera image width in pixels (256).