DAgger for Embodied Policies#

Figure: A DAgger training loop (image: https://raw.githubusercontent.com/RLinf/misc/main/pic/dagger.jpg).

DAgger (Dataset Aggregation) is an imitation-learning algorithm in which the student policy interacts with the environment, an expert policy relabels the visited states, and the expert-labeled trajectories are aggregated for further training. This page documents RLinf’s simulator-based embodied DAgger workflow. DAgger support currently covers MLP and Pi0 models in both sync and async training pipelines.

For the real-world Franka pipeline, see Using HG-DAgger with Franka.

Overview#

DAgger-finetune an embodied policy: the student acts, an expert relabels visited states, and the aggregated expert data trains the student.

Algorithm: DAgger
Models: MLP · π₀
Environments / Data: ManiSkill · LIBERO · RoboTwin
Training: Sync · Async

You’ll do: install → set student/expert checkpoints → launch run_embodiment.sh (or run_async.sh) → watch env/success_once.
Prerequisites: Installation · a student and an expert checkpoint (steps below).

Supported Configurations#

Model   Environment                Config
-----   ------------------------   -----------------------------------------
MLP     ManiSkill (pick-cube)      maniskill_dagger_mlp.yaml
π₀      LIBERO-Spatial             libero_spatial_dagger_openpi.yaml
π₀      RoboTwin (adjust-bottle)   robotwin_adjust_bottle_dagger_openpi.yaml

How DAgger Works#

DAgger Pipeline

  1. Mixed Rollout Policy

    • During training, the rollout worker chooses the expert action with probability beta.

    • During evaluation, RLinf always uses the student policy.

  2. Expert Relabeling

    • If the student acts in the environment, RLinf runs an extra expert forward pass on the same observation.

    • The expert action is stored as the supervision target for that step.

  3. Replay-Buffer Training

    • Expert-labeled trajectories are written into the replay buffer.

    • The actor then optimizes the embodied_dagger loss on replayed samples.

  4. Beta Scheduling

    • init_beta controls the initial expert-action probability.

    • beta_schedule and beta_decay control how quickly execution shifts from expert to student.

    • beta_min is optional and sets the lower bound of beta.
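The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not RLinf's rollout worker: `collect_step`, the toy `student` and `expert` callables, and the returned tuple layout are hypothetical names chosen for the example. The point it shows is that the expert's action is always the stored label, while beta only decides whose action is executed.

```python
import random
from typing import Callable, Tuple

Policy = Callable[[float], float]  # toy policy: observation -> action


def collect_step(obs: float, student: Policy, expert: Policy,
                 beta: float, rng: random.Random) -> Tuple[float, float, bool]:
    """Return (executed_action, expert_label, expert_acted) for one step.

    Hypothetical helper for illustration only; not an RLinf API.
    """
    expert_acted = rng.random() < beta
    if expert_acted:
        # Expert acts: its action is both executed and stored as the label.
        executed = label = expert(obs)
    else:
        # Student acts: an extra expert forward pass relabels the same observation.
        executed = student(obs)
        label = expert(obs)
    return executed, label, expert_acted


# Toy policies (assumptions): the expert negates the observation, the student outputs 0.
expert = lambda obs: -obs
student = lambda obs: 0.0

rng = random.Random(0)
dataset = []  # aggregated (observation, expert_label) pairs for supervised training
for obs in [0.5, -1.0, 2.0, 0.25]:
    executed, label, _ = collect_step(obs, student, expert, beta=0.5, rng=rng)
    dataset.append((obs, label))
```

In this sketch, beta = 1.0 means the expert always acts and beta = 0.0 means the student always acts; in both cases every stored label comes from the expert.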

Installation#

See Installation for full setup details. The DAgger examples below use the embodied Docker image or an equivalent local environment.

Option 1: Docker Image

docker run -it --rm --gpus all \
   --shm-size 20g \
   --network host \
   --name rlinf \
   -v .:/workspace/RLinf \
   rlinf/rlinf:agentic-rlinf0.2-maniskill_libero
   # For mainland China users, you can use the following for better download speed:
   # docker.1ms.run/rlinf/rlinf:agentic-rlinf0.2-maniskill_libero

Inside the container, switch to the matching virtual environment with the built-in switch_env utility:

source switch_env openpi

Option 2: Custom Environment

# For mainland China users, you can add the `--use-mirror` flag for better download speed.
bash requirements/install.sh embodied --model openpi --env maniskill_libero
# For robotwin environment, please use the following command:
# bash requirements/install.sh embodied --model openpi --env robotwin
source .venv/bin/activate

Checkpoint Setup#

Before launch, fill in the student and expert paths in the chosen YAML file.

1. ManiSkill + MLP

The MLP DAgger config uses a student checkpoint and an expert checkpoint under runner:

runner:
  ckpt_path: null                       # Optional student warm start
  expert_ckpt_path: /path/to/expert_ckpt

The expert model in expert_ckpt_path can be produced by a PPO run; see MLP Policy Reinforcement Learning Training.

2. LIBERO Spatial + Pi0

The Pi0 DAgger config uses separate student and expert model paths:

actor:
  model:
    model_path: /path/to/student_model

rollout:
  model:
    model_path: /path/to/student_model
  expert_model:
    model_path: /path/to/expert_model

You can find pretrained Pi0 checkpoints on Hugging Face for student initialization. For example:

# For mainland China users, you can use the following for better download speed:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download RLinf/RLinf-Pi0-LIBERO-Spatial-Object-Goal-SFT --local-dir /path/to/model

The expert checkpoint can come from a PPO run; see RL on π0 and π0.5 Models.

3. RoboTwin + Pi0

The Pi0 DAgger config uses separate student and expert model paths:

actor:
  model:
    model_path: /path/to/student_model

rollout:
  model:
    model_path: /path/to/student_model
  expert_model:
    model_path: /path/to/expert_model

As with LIBERO, you can find pretrained Pi0 checkpoints on Hugging Face for student initialization. For example:

# For mainland China users, you can use the following for better download speed:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download RLinf/RLinf-Pi0-RoboTwin-SFT-adjust_bottle --local-dir /path/to/model

The expert checkpoint can come from a PPO run; see RL on π0 and π0.5 Models.

RoboTwin also requires the RoboTwin code and its assets to be set up separately. Refer to RL with RoboTwin Benchmark for details, then set the corresponding asset paths in your YAML file:

env:
  train:
    assets_path: /path/to/robotwin_assets
  eval:
    assets_path: /path/to/robotwin_assets
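A leftover placeholder path is easy to miss, so a small pre-launch check can help. The sketch below is a hypothetical helper, not part of RLinf. It assumes the config has already been loaded into a plain Python dict (by whatever YAML loader you use) and reports which of the Pi0 keys shown above are missing, null, or still start with `/path/to/`. The example checkpoint paths are made up.

```python
from typing import Any, Dict, List

PLACEHOLDER_PREFIX = "/path/to/"  # placeholder style used in the snippets above


def unset_paths(cfg: Dict[str, Any], dotted_keys: List[str]) -> List[str]:
    """Return the dotted keys whose value is missing, null, or still a placeholder."""
    problems = []
    for key in dotted_keys:
        node: Any = cfg
        for part in key.split("."):
            # Walk one level down; anything that is not a dict ends the walk.
            node = node.get(part) if isinstance(node, dict) else None
        if node is None or str(node).startswith(PLACEHOLDER_PREFIX):
            problems.append(key)
    return problems


# Keys taken from the Pi0 snippets above.
PI0_KEYS = [
    "actor.model.model_path",
    "rollout.model.model_path",
    "rollout.expert_model.model_path",
]

# Example config as a plain dict; the /data/... paths are invented for the demo.
cfg = {
    "actor": {"model": {"model_path": "/data/ckpt/student"}},
    "rollout": {
        "model": {"model_path": "/data/ckpt/student"},
        "expert_model": {"model_path": "/path/to/expert_model"},
    },
}
print(unset_paths(cfg, PI0_KEYS))  # ['rollout.expert_model.model_path']
```

For the MLP config, runner.ckpt_path is optional and may stay null, so only runner.expert_ckpt_path would belong in the key list.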

Run It#

1. Configuration files

We currently support DAgger training with the following configs:

  • MLP + ManiSkill: examples/embodiment/config/maniskill_dagger_mlp.yaml

  • Pi0 + LIBERO: examples/embodiment/config/libero_spatial_dagger_openpi.yaml

  • Pi0 + RoboTwin: examples/embodiment/config/robotwin_adjust_bottle_dagger_openpi.yaml

2. Key DAgger Parameters

algorithm:
  dagger:
    only_save_expert: False   # Expert acts with probability beta and also relabels student steps
    init_beta: 1.0
    beta_schedule: "exponential"
    beta_decay: 0.99
    beta_min: 0.05            # Optional; defaults to 0.05 in code

  replay_buffer:
    enable_cache: True
    cache_size: 2000
    min_buffer_size: 16
    sample_window_size: 2000

For the MLP ManiSkill example, the config uses a larger replay buffer and beta_decay: 0.98 by default. Check the YAML file you launch for the exact values.
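To get a feel for these values, the sketch below assumes that the exponential schedule multiplies beta by beta_decay once per update and clamps it at beta_min, that is, beta_k = max(beta_min, init_beta * beta_decay^k). That formula and the `beta_at` and `updates_to_floor` helpers are assumptions for illustration; check RLinf's scheduler for the exact rule and for what counts as one update.

```python
import math


def beta_at(k: int, init_beta: float = 1.0, beta_decay: float = 0.99,
            beta_min: float = 0.05) -> float:
    """Beta after k decay updates under the assumed exponential schedule."""
    return max(beta_min, init_beta * beta_decay ** k)


def updates_to_floor(init_beta: float, beta_decay: float, beta_min: float) -> int:
    """Smallest k at which the decayed beta is at or below beta_min."""
    return math.ceil(math.log(beta_min / init_beta) / math.log(beta_decay))


print(beta_at(0), round(beta_at(100), 3), beta_at(1000))
print(updates_to_floor(1.0, 0.99, 0.05))  # 299
print(updates_to_floor(1.0, 0.98, 0.05))  # 149
```

Under this assumption, a decay of 0.99 reaches the 0.05 floor after 299 updates, and the MLP example's 0.98 reaches it after 149, about half as many.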

3. Launch Commands

Use the same config name with either launcher:

Sync Mode

bash examples/embodiment/run_embodiment.sh maniskill_dagger_mlp
bash examples/embodiment/run_embodiment.sh libero_spatial_dagger_openpi
bash examples/embodiment/run_embodiment.sh robotwin_adjust_bottle_dagger_openpi
# For RoboTwin, run the following two commands before running the .sh file:
# export ROBOT_PLATFORM=ALOHA
# export ROBOTWIN_PATH=/path/to/RoboTwin

Async Mode

bash examples/embodiment/run_async.sh maniskill_dagger_mlp
bash examples/embodiment/run_async.sh libero_spatial_dagger_openpi
bash examples/embodiment/run_async.sh robotwin_adjust_bottle_dagger_openpi
# For RoboTwin, run the following two commands before running the .sh file:
# export ROBOT_PLATFORM=ALOHA
# export ROBOTWIN_PATH=/path/to/RoboTwin

Visualization and Results#

1. TensorBoard Logs

tensorboard --logdir ./logs

2. Useful Monitoring Metrics

For metric definitions, see Training metrics. DAgger-specific metrics:

  • env/success_once: Recommended success metric for embodied DAgger runs.

  • train/dagger/actor_loss: Supervised DAgger loss on replayed expert-labeled samples.

  • train/actor/lr: Learning rate.

  • train/actor/grad_norm: Gradient norm.

  • train/replay_buffer/num_trajectories: Number of trajectories stored in the replay buffer.

  • train/replay_buffer/total_samples: Number of replay-buffer samples available for training.

  • train/replay_buffer/cache_size: Number of cached flattened trajectories.
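These scalars can also be read outside the TensorBoard UI, for example to find when a run first crosses a target success rate. The sketch below uses TensorBoard's `EventAccumulator`; `load_scalar` and `first_step_reaching` are hypothetical helper names, and `run_dir` is assumed to be the directory that directly contains the event files, which may be a subdirectory of ./logs.

```python
from typing import List, Optional, Tuple


def load_scalar(run_dir: str, tag: str = "env/success_once") -> List[Tuple[int, float]]:
    """Read (step, value) pairs for one scalar tag from a TensorBoard run directory."""
    # Imported lazily so the helper below works without tensorboard installed.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    acc = EventAccumulator(run_dir)
    acc.Reload()  # parse the event files found in run_dir
    return [(event.step, event.value) for event in acc.Scalars(tag)]


def first_step_reaching(points: List[Tuple[int, float]],
                        threshold: float) -> Optional[int]:
    """First logged step whose value is at least threshold, or None if never reached."""
    for step, value in points:
        if value >= threshold:
            return step
    return None


# Synthetic points for illustration; in practice use load_scalar("<run_dir>").
demo = [(0, 0.10), (10, 0.60), (20, 0.95)]
print(first_step_reaching(demo, 0.9))  # 20
```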

3. Results

SR = success rate.

Configuration     Student init SR   Expert SR   Training Time   Student final SR
---------------   ---------------   ---------   -------------   ----------------
MLP + ManiSkill   0%                100%        20 min          100%
Pi0 + LIBERO      60%               95%         17 h            93%