RL on Dexbotic Models#
This document describes how to fine-tune Dexbotic VLA models with reinforcement learning using the RLinf framework. Dexbotic (dexmal/dexbotic) is an open-source Vision-Language-Action toolbox from Dexmal that provides a unified implementation of various embodied models. The example covers the LIBERO Spatial benchmark with the Dexbotic π0 model and the DM0 model.
The primary objective is to develop a model capable of robotic manipulation through:
Visual Understanding: Processing RGB images from the robot’s camera.
Language Comprehension: Interpreting natural-language task descriptions.
Action Generation: Producing precise robotic actions via flow-based diffusion denoising.
Reinforcement Learning: Optimizing the policy via PPO with environment feedback.
Environment#
LIBERO Environment
Environment: LIBERO simulation benchmark built on top of robosuite (MuJoCo).
Task: Command a 7-DoF robotic arm to perform household manipulation skills (pick-and-place, stacking, spatial rearrangement).
Observation: RGB images (typical resolutions 128 × 128 or 224 × 224) captured by off-screen cameras placed around the workspace.
Action Space: 7-dimensional continuous actions
- 3D end-effector position control (x, y, z)
- 3D rotation control (roll, pitch, yaw)
- Gripper control (open / close)
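The 7-dimensional action can be pictured as a flat vector split into three parts. The sketch below is only an illustration: the component order (position, rotation, gripper) follows the order of the list above and is an assumption, and `ArmAction` and `split_action` are hypothetical names that are not part of LIBERO, Dexbotic, or RLinf.

```python
from typing import NamedTuple, Sequence, Tuple


class ArmAction(NamedTuple):
    """Hypothetical container for one 7-D action (names are illustrative)."""
    position: Tuple[float, ...]   # (x, y, z) end-effector command
    rotation: Tuple[float, ...]   # (roll, pitch, yaw) command
    gripper: float                # open / close command


def split_action(action: Sequence[float]) -> ArmAction:
    """Split a flat 7-D action, assuming the order position, rotation, gripper."""
    if len(action) != 7:
        raise ValueError(f"expected 7 values, got {len(action)}")
    return ArmAction(
        position=tuple(action[0:3]),
        rotation=tuple(action[3:6]),
        gripper=float(action[6]),
    )


example = split_action([0.1, 0.0, -0.05, 0.0, 0.0, 0.2, 1.0])
```

The sign convention of the gripper value (which sign means open) is not stated in this guide, so the sketch leaves it uninterpreted.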
Task Description Format
Dexbotic uses the environment-provided natural-language task description as the language model input.
Data Structure
Images: Main-view and wrist-view RGB tensors, each of shape [batch_size, 224, 224, 3]
States: End-effector pose (position + orientation) and gripper state.
Task Descriptions: Natural-language instructions
Actions: Action chunks of length 50 (configurable); the policy replans every N steps, where N is the num_action_chunks setting.
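As a rough picture of the fields listed above, the following sketch writes down the expected shape of each one. The dictionary keys and the `state_dim=8` default are assumptions made for illustration; only the image shape, the chunk length of 50, and the 7-dimensional action come from this guide.

```python
def expected_shapes(batch_size, chunk_len=50, action_dim=7, state_dim=8):
    """Return illustrative shapes for one batch.

    Key names and state_dim are hypothetical; they are not RLinf identifiers.
    """
    return {
        "main_image": (batch_size, 224, 224, 3),    # main-view RGB
        "wrist_image": (batch_size, 224, 224, 3),   # wrist-view RGB
        "state": (batch_size, state_dim),           # pose + gripper state
        "task_description": (batch_size,),          # one instruction per sample
        "action_chunk": (batch_size, chunk_len, action_dim),
    }


shapes = expected_shapes(batch_size=4)
```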
Algorithm#
Core Algorithm Components
PPO (Proximal Policy Optimization)
Advantage estimation using GAE (Generalized Advantage Estimation)
Policy clipping with ratio limits
Value function clipping
Entropy regularization
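The PPO ingredients listed above can be sketched in a few lines of plain Python. This is a textbook-style illustration for a single trajectory segment without episode terminations, not RLinf's implementation; the function names and the default coefficients (`gamma`, `lam`, `clip`, `vf_coef`, `ent_coef`) are assumptions.

```python
import math


def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one segment (no done flags)."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def ppo_policy_loss(new_logp, old_logp, advantage, clip=0.2):
    """Clipped surrogate loss for one sample."""
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return -min(ratio * advantage, clipped * advantage)


def clipped_value_loss(value, old_value, ret, clip=0.2):
    """Value loss where the new value may move at most `clip` from the old one."""
    v_clipped = old_value + max(-clip, min(clip, value - old_value))
    return 0.5 * max((value - ret) ** 2, (v_clipped - ret) ** 2)


def total_loss(policy_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    """Combine the terms; the entropy bonus is subtracted to encourage exploration."""
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```

In a real trainer these quantities are computed on batched tensors and averaged; the scalar form here only shows how the four components fit together.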
Dexbotic (π0.5-based VLA)
Flow-matching / flow-SDE action generation
Diffusion denoising for action chunks
Value head for critic function
Configurable noise_method (e.g. flow_sde), noise_level, and num_steps for denoising
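To make the roles of `num_steps` and `noise_level` concrete, here is a deliberately simplified sketch of noisy Euler integration of a velocity field, with the log-probability of the sampled path accumulated along the way. It is not the Dexbotic or RLinf sampler: the time convention (noise at t = 0, action at t = 1), the toy velocity field, and the function names are assumptions, and the drift correction that marginal-preserving flow-SDE formulations typically apply is omitted.

```python
import math
import random


def sample_with_logprob(velocity, x0, num_steps=4, noise_level=0.5, rng=None):
    """Integrate dx = v(x, t) dt with Gaussian noise injected at every step."""
    rng = rng or random.Random(0)
    dt = 1.0 / num_steps
    std = noise_level * math.sqrt(dt)          # per-step noise scale
    x, logp = list(x0), 0.0
    for k in range(num_steps):
        t = k * dt
        mean = [xi + vi * dt for xi, vi in zip(x, velocity(x, t))]
        if std > 0.0:
            nxt = [rng.gauss(m, std) for m in mean]
            # Log-density of each Gaussian transition, summed over dimensions.
            logp += sum(
                -0.5 * ((n - m) / std) ** 2
                - math.log(std)
                - 0.5 * math.log(2.0 * math.pi)
                for n, m in zip(nxt, mean)
            )
        else:
            nxt = mean                          # deterministic Euler step
        x = nxt
    return x, logp


def toy_velocity(x, t):
    """Toy field that moves every coordinate in a straight line toward 1.0."""
    return [(1.0 - xi) / (1.0 - t) for xi in x]


action, logp = sample_with_logprob(toy_velocity, [0.0] * 7)
```

With `noise_level=0.0` the loop reduces to deterministic Euler denoising. With noise enabled, each step becomes a Gaussian transition with a tractable log-probability, which is the kind of quantity a PPO ratio needs.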
Dependency Installation#
1. Clone RLinf Repository#
git clone https://github.com/RLinf/RLinf.git
cd RLinf
2. Install Dependencies#
Option 1: Docker Image
Use the Docker image for LIBERO-based embodied training:
docker run -it --rm --gpus all \
--shm-size 20g \
--network host \
--name rlinf \
-v .:/workspace/RLinf \
rlinf/rlinf:agentic-rlinf0.2-maniskill_libero
Please switch to the corresponding virtual environment via the built-in switch_env utility in the image:
source switch_env dexbotic
Option 2: Custom Environment
Install dependencies directly in your environment:
bash requirements/install.sh embodied --model dexbotic --env maniskill_libero
source .venv/bin/activate
Model Download#
π0 model
Before starting training, download the Dexbotic π0 SFT model from HuggingFace:
# Method 1: Using git clone
git lfs install
git clone https://huggingface.co/Dexmal/libero-db-pi0
# Method 2: Using huggingface-hub
pip install huggingface-hub
huggingface-cli download Dexmal/libero-db-pi0 --local-dir libero-db-pi0
Then set rollout.model.model_path and actor.model.model_path in
examples/embodiment/config/libero_spatial_ppo_dexbotic_pi0.yaml to the
local path (e.g. ./libero-db-pi0).
DM0 model
Download the DM0 SFT model from HuggingFace:
# Method 1: Using git clone
git lfs install
git clone https://huggingface.co/Dexmal/DM0-libero
# Method 2: Using huggingface-hub
pip install huggingface-hub
huggingface-cli download Dexmal/DM0-libero --local-dir DM0-libero
Then set rollout.model.model_path and actor.model.model_path in
examples/embodiment/config/libero_spatial_ppo_dexbotic_dm0.yaml to the
local path (e.g. ./DM0-libero).
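If you prefer to script the path change rather than edit the YAML by hand, a small text substitution is enough for the two `model_path` entries. The helper below is a hypothetical convenience, not part of RLinf: it rewrites every `model_path:` line in the text it is given and drops any trailing comment on those lines, so check the result before training.

```python
import re

_MODEL_PATH_LINE = re.compile(r"^([ \t]*model_path:[ \t]*).*$", re.MULTILINE)


def set_model_path(yaml_text: str, new_path: str) -> str:
    """Replace the value of every `model_path:` line, keeping its indentation."""
    return _MODEL_PATH_LINE.sub(
        lambda m: f'{m.group(1)}"{new_path}"', yaml_text
    )


sample_config = (
    "rollout:\n"
    "  model:\n"
    '    model_path: "/path/to/model/DM0-libero"\n'
    "actor:\n"
    "  model:\n"
    '    model_path: "/path/to/model/DM0-libero"\n'
)
updated = set_model_path(sample_config, "./DM0-libero")
```

To apply it to a real file, read the config with `pathlib.Path(...).read_text()`, pass the text through `set_model_path`, and write it back.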
Quick Start#
π0 Model#
Configuration File
examples/embodiment/config/libero_spatial_ppo_dexbotic_pi0.yaml
Key Config Snippets
rollout:
  model:
    model_path: "/path/to/model/libero-db-pi0" # https://huggingface.co/Dexmal/libero-db-pi0
actor:
  model:
    model_path: "/path/to/model/libero-db-pi0"
    num_action_chunks: 5
    num_steps: 4
    action_dim: 7
    add_value_head: True
    dexbotic:
      num_images_in_input: 2
      noise_level: 0.5
      noise_method: "flow_sde"
      train_expert_only: True
      detach_critic_input: True
Launch Command
bash examples/embodiment/run_embodiment.sh libero_spatial_ppo_dexbotic_pi0
DM0 Model#
Configuration File
examples/embodiment/config/libero_spatial_ppo_dexbotic_dm0.yaml
Key Config Snippets
rollout:
  model:
    model_path: "/path/to/model/DM0-libero" # https://huggingface.co/Dexmal/DM0-libero
actor:
  model:
    model_path: "/path/to/model/DM0-libero"
    num_action_chunks: 10
    num_steps: 3
    action_dim: 7
    add_value_head: True
    dexbotic:
      num_images_in_input: 2
      noise_level: 0.5
      noise_method: "flow_sde"
      train_expert_only: True
      detach_critic_input: True
Launch Command
bash examples/embodiment/run_embodiment.sh libero_spatial_ppo_dexbotic_dm0
Evaluation#
π0 Model#
python toolkits/eval_scripts_dexbotic/libero_eval.py \
--config_name db_pi0_libero \
--pretrained_path /path/to/checkpoint \
--task_suite_name libero_spatial \
--num_trials_per_task 50 \
--action_chunk 5 \
--num_steps 10
DM0 Model#
python toolkits/eval_scripts_dexbotic/libero_eval.py \
--config_name dm0_libero \
--pretrained_path /path/to/checkpoint \
--task_suite_name libero_spatial \
--num_trials_per_task 50 \
--action_chunk 10 \
--num_steps 10
You can also use RLinf’s unified VLA evaluation flow; refer to the VLA Evaluation Documentation for details.
Note
The --action_chunk argument controls the replan interval (how many
steps the policy executes before re-querying the model). π0 uses
5 and DM0 uses 10 by default, matching their respective
num_action_chunks training settings.
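The replan interval described in this note amounts to a receding-horizon loop: the model predicts a full chunk, only the first few actions are executed, and the model is queried again. The sketch below uses a toy policy and a toy environment step, both hypothetical stand-ins, to show how the interval changes the number of model queries.

```python
def run_episode(policy, env_step, first_obs, replan_every=5, max_steps=20):
    """Execute `replan_every` actions from each predicted chunk, then replan."""
    obs, executed, queries = first_obs, 0, 0
    while executed < max_steps:
        chunk = policy(obs)                 # full predicted action chunk
        queries += 1
        for action in chunk[:replan_every]:
            obs, done = env_step(action)
            executed += 1
            if done or executed >= max_steps:
                return executed, queries
    return executed, queries


def toy_policy(obs):
    """Stand-in policy: always returns 50 zero actions of dimension 7."""
    return [[0.0] * 7 for _ in range(50)]


def toy_env_step(action):
    """Stand-in environment: never terminates on its own."""
    return "obs", False


steps_5, queries_5 = run_episode(toy_policy, toy_env_step, "obs", replan_every=5)
steps_10, queries_10 = run_episode(toy_policy, toy_env_step, "obs", replan_every=10)
```

In this toy setup, 20 environment steps need four model queries with an interval of 5 and two with an interval of 10, so a longer interval trades reactivity for fewer forward passes.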
Visualization and Results#
TensorBoard Logging
tensorboard --logdir ./logs --port 6006
Key Metrics
Training: train/actor/policy_loss, train/critic/value_loss, train/actor/approx_kl
Environment: env/success_once (episodic success rate), env/episode_len
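As an illustration of how the two environment metrics can be read, the sketch below aggregates per-step success flags. It assumes that `env/success_once` counts an episode as successful if success was signalled at any step and that `env/episode_len` is the mean number of steps; RLinf's exact aggregation may differ, and `summarize_episodes` is a hypothetical helper.

```python
def summarize_episodes(episodes):
    """episodes: list of episodes, each a list of per-step success flags."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    return {
        # Fraction of episodes with at least one successful step.
        "env/success_once": sum(any(ep) for ep in episodes) / n,
        # Mean number of steps per episode.
        "env/episode_len": sum(len(ep) for ep in episodes) / n,
    }


metrics = summarize_episodes([
    [False, False, True],
    [False, False],
    [False, True, False, False],
])
```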