DAgger for Embodied Policies#
Figure: A DAgger training loop.
DAgger (Dataset Aggregation) is an imitation-learning algorithm that lets the student policy interact with the environment, asks an expert policy to relabel the visited states, and aggregates those expert-labeled trajectories for further training. This page documents RLinf’s simulator-based embodied DAgger workflow. Current DAgger support covers MLP and Pi0 models, and both sync and async training pipelines.
For the real-world Franka pipeline, see Using HG-DAgger with Franka.
Overview#
DAgger-finetune an embodied policy: the student acts, an expert relabels visited states, and the aggregated expert data trains the student.
DAgger
MLP · π₀
ManiSkill · LIBERO · RoboTwin
Sync · Async
run_embodiment.sh (or run_async.sh) → watch env/success_once.
Supported Configurations#
| Model | Environment | Config |
|---|---|---|
| MLP | ManiSkill (pick-cube) | maniskill_dagger_mlp |
| π₀ | LIBERO-Spatial | libero_spatial_dagger_openpi |
| π₀ | RoboTwin (adjust-bottle) | robotwin_adjust_bottle_dagger_openpi |
How DAgger Works#
DAgger Pipeline
Mixed Rollout Policy
During training, the rollout worker chooses the expert action with probability beta.
During evaluation, RLinf always uses the student policy.
Expert Relabeling
If the student acts in the environment, RLinf runs an extra expert forward pass on the same observation.
The expert action is stored as the supervision target for that step.
Replay-Buffer Training
Expert-labeled trajectories are written into the replay buffer.
The actor then optimizes the embodied_dagger loss on replayed samples.
Beta Scheduling
init_beta controls the initial expert-action probability.
beta_schedule and beta_decay control how quickly execution shifts from expert to student.
beta_min is optional and sets the lower bound of beta.
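The four pieces above can be sketched in a few lines of plain Python. This is a minimal illustration, not RLinf's implementation: the function names `beta_at` and `dagger_step` are hypothetical, the policies are toy callables, and the exponential form `init_beta * beta_decay**step` is an assumption about what the `"exponential"` schedule computes.

```python
import random


def beta_at(step, init_beta=1.0, beta_decay=0.99, beta_min=0.05):
    """Assumed exponential schedule: init_beta * beta_decay**step, floored at beta_min."""
    return max(beta_min, init_beta * beta_decay ** step)


def dagger_step(obs, student, expert, beta, rng, training=True):
    """Return (executed_action, supervision_target) for one environment step."""
    if training and rng.random() < beta:
        # The expert acts; its action is also the supervision target.
        action = expert(obs)
        return action, action
    executed = student(obs)
    if not training:
        # Evaluation: the student always acts and nothing is relabeled.
        return executed, None
    # The student acted, so run an extra expert forward pass on the same observation.
    return executed, expert(obs)


# Toy policies (hypothetical): the expert doubles the observation, the student outputs 0.
student = lambda obs: 0.0
expert = lambda obs: 2.0 * obs
rng = random.Random(0)

replay_buffer = []
for step in range(5):
    beta = beta_at(step)
    executed, target = dagger_step(3.0, student, expert, beta, rng)
    # Only the expert-labeled target is stored as supervision for this observation.
    replay_buffer.append((3.0, target))
```

Whichever policy acts during training, the stored target is always the expert's action, which is what lets the actor train with a purely supervised loss on replayed samples.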
Installation#
For installation details, please first refer to Installation. The DAgger examples below use the embodied image or the equivalent local environment.
Option 1: Docker Image
docker run -it --rm --gpus all \
--shm-size 20g \
--network host \
--name rlinf \
-v .:/workspace/RLinf \
rlinf/rlinf:agentic-rlinf0.2-maniskill_libero
# For mainland China users, you can use the following for better download speed:
# docker.1ms.run/rlinf/rlinf:agentic-rlinf0.2-maniskill_libero
Please switch to the corresponding virtual environment via the built-in
switch_env utility in the image:
source switch_env openpi
Option 2: Custom Environment
# For mainland China users, you can add the `--use-mirror` flag for better download speed.
bash requirements/install.sh embodied --model openpi --env maniskill_libero
# For robotwin environment, please use the following command:
# bash requirements/install.sh embodied --model openpi --env robotwin
source .venv/bin/activate
Checkpoint Setup#
Before launch, fill in the student and expert paths in the chosen YAML file.
1. ManiSkill + MLP
The MLP DAgger config uses a student checkpoint and an expert checkpoint under
runner:
runner:
  ckpt_path: null # Optional student warm start
  expert_ckpt_path: /path/to/expert_ckpt
The expert checkpoint at expert_ckpt_path can be produced by a PPO run as described in
MLP Policy Reinforcement Learning Training.
2. LIBERO Spatial + Pi0
The Pi0 DAgger config uses separate student and expert model paths:
actor:
model:
model_path: /path/to/student_model
rollout:
model:
model_path: /path/to/student_model
expert_model:
model_path: /path/to/expert_model
You can find pretrained Pi0 checkpoints on Hugging Face for student initialization. For example:
# For mainland China users, you can use the following for better download speed:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download RLinf/RLinf-Pi0-LIBERO-Spatial-Object-Goal-SFT --local-dir /path/to/model
The expert model checkpoint could also come from a PPO run in RL on π0 and π0.5 Models.
3. RoboTwin + Pi0
The Pi0 DAgger config uses separate student and expert model paths:
actor:
model:
model_path: /path/to/student_model
rollout:
model:
model_path: /path/to/student_model
expert_model:
model_path: /path/to/expert_model
In the same way, you can find pretrained Pi0 checkpoints on Hugging Face for student initialization. For example:
# For mainland China users, you can use the following for better download speed:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download RLinf/RLinf-Pi0-RoboTwin-SFT-adjust_bottle --local-dir /path/to/model
The expert model checkpoint could also come from a PPO run in RL on π0 and π0.5 Models.
In addition, the RoboTwin environment requires separate configuration of the RoboTwin code and corresponding Assets. Refer to RL with RoboTwin Benchmark for details, and then configure the corresponding paths in your YAML file.
env:
  train:
    assets_path: /path/to/robotwin_assets
  eval:
    assets_path: /path/to/robotwin_assets
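Because every example in this section uses a `/path/to/...` placeholder, a quick pre-launch check can catch paths that were never filled in. The sketch below is a hypothetical helper, not part of RLinf. It walks an already-loaded config as a nested dict; loading the YAML itself is left out, and the example dict only mirrors the `runner` and `env` fragments shown above for illustration.

```python
def find_placeholder_paths(config, marker="/path/to/", prefix=""):
    """Return dotted keys whose string values still contain the placeholder marker."""
    hits = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections such as env.train.
            hits.extend(find_placeholder_paths(value, marker, dotted))
        elif isinstance(value, str) and marker in value:
            hits.append(dotted)
    return hits


# Illustrative config: one path was filled in, two still hold placeholders.
example = {
    "runner": {"ckpt_path": None, "expert_ckpt_path": "/path/to/expert_ckpt"},
    "env": {
        "train": {"assets_path": "/data/robotwin_assets"},
        "eval": {"assets_path": "/path/to/robotwin_assets"},
    },
}

unfilled = find_placeholder_paths(example)
print(unfilled)  # ['runner.expert_ckpt_path', 'env.eval.assets_path']
```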
Run It#
1. Configuration files
We currently support DAgger training with the following configs:
MLP + ManiSkill: examples/embodiment/config/maniskill_dagger_mlp.yaml
Pi0 + LIBERO: examples/embodiment/config/libero_spatial_dagger_openpi.yaml
Pi0 + RoboTwin: examples/embodiment/config/robotwin_adjust_bottle_dagger_openpi.yaml
2. Key DAgger Parameters
algorithm:
dagger:
only_save_expert: False # Expert acts with probability beta and also relabels student steps
init_beta: 1.0
beta_schedule: "exponential"
beta_decay: 0.99
beta_min: 0.05 # Optional; defaults to 0.05 in code
replay_buffer:
enable_cache: True
cache_size: 2000
min_buffer_size: 16
sample_window_size: 2000
For the MLP ManiSkill example, the config uses a larger replay buffer and
beta_decay: 0.98 by default. Check the YAML file you launch for the exact
values.
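To get a feel for what the two `beta_decay` values mean, the sketch below counts how many decay steps it takes for beta to reach the `beta_min` floor. It assumes the exponential schedule multiplies beta by `beta_decay` once per step; what counts as a "step" in RLinf (for example, one rollout epoch) is not specified here, and `steps_until_floor` is a hypothetical helper.

```python
def steps_until_floor(init_beta, beta_decay, beta_min):
    """Count decay steps until init_beta * beta_decay**t first drops to beta_min or below."""
    if not 0.0 < beta_decay < 1.0:
        raise ValueError("beta_decay must be in (0, 1)")
    if beta_min <= 0.0:
        raise ValueError("beta_min must be positive")
    beta, steps = init_beta, 0
    while beta > beta_min:
        beta *= beta_decay
        steps += 1
    return steps


for decay in (0.99, 0.98):
    print(decay, steps_until_floor(1.0, decay, 0.05))
```

Under this assumption, `beta_decay: 0.99` reaches the 0.05 floor after 299 steps, while `beta_decay: 0.98` gets there after 149, so the MLP example hands control to the student roughly twice as fast.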
3. Launch Commands
Use the same config name with either launcher:
Sync Mode
bash examples/embodiment/run_embodiment.sh maniskill_dagger_mlp
bash examples/embodiment/run_embodiment.sh libero_spatial_dagger_openpi
bash examples/embodiment/run_embodiment.sh robotwin_adjust_bottle_dagger_openpi
# For RoboTwin, run the following two commands before running the .sh file:
# export ROBOT_PLATFORM=ALOHA
# export ROBOTWIN_PATH=/path/to/RoboTwin
Async Mode
bash examples/embodiment/run_async.sh maniskill_dagger_mlp
bash examples/embodiment/run_async.sh libero_spatial_dagger_openpi
bash examples/embodiment/run_async.sh robotwin_adjust_bottle_dagger_openpi
# For RoboTwin, run the following two commands before running the .sh file:
# export ROBOT_PLATFORM=ALOHA
# export ROBOTWIN_PATH=/path/to/RoboTwin
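When scripting several runs, it can help to assemble the launch command and environment in one place. The helper below is a hypothetical convenience wrapper, not an RLinf API. It only reuses the two launcher scripts, the config names, and the two RoboTwin environment variables shown above, and it assumes the command is run from the RLinf repository root.

```python
import os

LAUNCHERS = {
    "sync": "examples/embodiment/run_embodiment.sh",
    "async": "examples/embodiment/run_async.sh",
}


def build_launch(config_name, mode="sync", robotwin_path=None, base_env=None):
    """Return (command, environment) for a sync or async DAgger launch."""
    if mode not in LAUNCHERS:
        raise ValueError(f"mode must be one of {sorted(LAUNCHERS)}")
    env = dict(os.environ if base_env is None else base_env)
    if robotwin_path is not None:
        # RoboTwin runs need these two variables set before the launcher starts.
        env["ROBOT_PLATFORM"] = "ALOHA"
        env["ROBOTWIN_PATH"] = robotwin_path
    return ["bash", LAUNCHERS[mode], config_name], env


cmd, env = build_launch(
    "robotwin_adjust_bottle_dagger_openpi", mode="async", robotwin_path="/path/to/RoboTwin"
)
print(" ".join(cmd))
# To actually launch:
# import subprocess; subprocess.run(cmd, env=env, check=True)
```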
Visualization and Results#
1. TensorBoard Logs
tensorboard --logdir ./logs
2. Useful Monitoring Metrics
For metric definitions, see Training metrics. DAgger-specific metrics:
env/success_once: Recommended success metric for embodied DAgger runs.
train/dagger/actor_loss: Supervised DAgger loss on replayed expert-labeled samples.
train/actor/lr: Learning rate.
train/actor/grad_norm: Gradient norm.
train/replay_buffer/num_trajectories: Number of trajectories stored in the replay buffer.
train/replay_buffer/total_samples: Number of replay-buffer samples available for training.
train/replay_buffer/cache_size: Number of cached flattened trajectories.
3. Results
| Configuration | Student init SR | Expert SR | Training Time | Student final SR |
|---|---|---|---|---|
| MLP + ManiSkill | 0% | 100% | 20 min | 100% |
| Pi0 + LIBERO | 60% | 95% | 17 h | 93% |
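One way to read the table is to ask how much of the gap between the initial student and the expert was closed by DAgger. The sketch below computes that fraction from the success rates reported above; `gap_closed` is a hypothetical helper and the metric is an illustrative reading aid, not one that RLinf logs.

```python
def gap_closed(init_sr, final_sr, expert_sr):
    """Fraction of the student-to-expert success-rate gap closed during training."""
    if expert_sr == init_sr:
        raise ValueError("expert and initial student success rates are equal")
    return (final_sr - init_sr) / (expert_sr - init_sr)


# Success rates (in percent) taken from the results table.
results = {
    "MLP + ManiSkill": (0, 100, 100),
    "Pi0 + LIBERO": (60, 93, 95),
}

for name, (init_sr, final_sr, expert_sr) in results.items():
    print(f"{name}: {gap_closed(init_sr, final_sr, expert_sr):.1%}")
```

For Pi0 + LIBERO this gives 33 of 35 points, or about 94% of the gap; the MLP student closes the gap completely.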