RL with MetaWorld Benchmark

This example provides a comprehensive guide to using the RLinf framework in the MetaWorld environment to fine-tune the OpenVLA-OFT, π0, and π0.5 models through reinforcement learning. It covers the entire process, from environment setup and core algorithm design to training configuration, evaluation, and visualization, along with reproducible commands and configuration snippets.

The primary objective is to develop a model capable of performing robotic manipulation:

  1. Visual Understanding: Processing RGB images from the robot’s camera.

  2. Language Comprehension: Interpreting natural-language task descriptions.

  3. Action Generation: Producing precise robotic actions (end-effector position and gripper control).

  4. Reinforcement Learning: Optimizing the policy via PPO or GRPO with environment feedback.

Environment

MetaWorld Environment

  • Environment: Multi-task simulation environment based on MuJoCo

  • Task: Control a 7-DOF robotic arm to perform various manipulation tasks

  • Observation: RGB images from off-screen cameras around the workspace

  • Action Space: 4-dimensional continuous actions

    • 3D end-effector position control (x, y, z)

    • Gripper control (open/close)

Data Structure

  • Images: RGB tensors [batch_size, 480, 480, 3]

  • Task Descriptions: Natural-language instructions

  • Actions: Normalized continuous values

  • Rewards: Sparse rewards based on task completion
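The observation, action, and reward conventions listed above can be sketched as a small container. This is only an illustration: the names `MetaWorldStep`, `clip_action`, and `sparse_reward`, the assumed action range of [-1, 1], the reward value of 1.0 on success, and the sample instruction are all assumptions, not RLinf's actual data types.

```python
from dataclasses import dataclass
from typing import List, Tuple

IMAGE_SHAPE: Tuple[int, int, int] = (480, 480, 3)  # height, width, RGB channels
ACTION_DIM = 4  # x, y, z end-effector control plus one gripper value


def clip_action(action: List[float], low: float = -1.0, high: float = 1.0) -> List[float]:
    """Clamp each action component into [low, high] (the range is an assumption)."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} action values, got {len(action)}")
    return [min(max(a, low), high) for a in action]


def sparse_reward(task_completed: bool) -> float:
    """Sparse reward: non-zero only on task completion (1.0 is an assumed value)."""
    return 1.0 if task_completed else 0.0


@dataclass
class MetaWorldStep:
    """One hypothetical sample: image shape, instruction, action, reward."""

    image_shape: Tuple[int, int, int]
    task_description: str
    action: List[float]
    reward: float

    def __post_init__(self) -> None:
        # Reject images that do not match the documented per-sample shape.
        if self.image_shape != IMAGE_SHAPE:
            raise ValueError(f"expected image shape {IMAGE_SHAPE}, got {self.image_shape}")
        # Keep the stored action inside the assumed normalized range.
        self.action = clip_action(self.action)


step = MetaWorldStep(
    image_shape=(480, 480, 3),
    task_description="press the button",  # made-up instruction for illustration
    action=[0.2, -1.5, 0.0, 1.0],
    reward=sparse_reward(task_completed=False),
)
print(step.action)  # the out-of-range second component is clamped to -1.0
```

A batch of such samples would stack the images into the `[batch_size, 480, 480, 3]` tensor described above.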

Algorithm

Core Algorithm Components

  1. PPO (Proximal Policy Optimization)

    • Advantage estimation using GAE (Generalized Advantage Estimation)

    • Policy clipping with ratio limits

    • Value function clipping

    • Entropy regularization

  2. GRPO (Group Relative Policy Optimization)

    • For every state / prompt the policy generates G independent actions

    • Compute the advantage of each action by subtracting the group’s mean reward
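The advantage computations and the clipped policy objective listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not RLinf's implementation: the function names and the default values for `gamma`, `lam`, and `clip_eps` are chosen here for the example, and value clipping and entropy regularization are left out.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """GAE for one episode; `values` holds one extra bootstrap entry at the end."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then the exponentially weighted sum of later errors.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clipped_objective(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO surrogate: min of the unclipped and clipped terms."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_advantages(group_rewards: List[float]) -> List[float]:
    """Group-relative advantage: each reward minus the group's mean reward.

    Many GRPO implementations also divide by the group's standard deviation;
    that step is omitted because the text above only mentions the mean.
    """
    mean_reward = sum(group_rewards) / len(group_rewards)
    return [r - mean_reward for r in group_rewards]


# Sparse reward at the last step, zero value estimates, no discounting.
print(gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
# A ratio above 1 + clip_eps earns no extra credit for a positive advantage.
print(ppo_clipped_objective(ratio=1.5, advantage=2.0))
# G = 4 rollouts for the same task, two of which succeeded.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

The practical difference is that GAE needs a learned value function for the baseline, while the group-relative form uses the other rollouts in the group as the baseline.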

Dependency Installation

1. Clone RLinf Repository

# For mainland China users, you can use the following for better download speed:
# git clone https://ghfast.top/github.com/RLinf/RLinf.git
git clone https://github.com/RLinf/RLinf.git
cd RLinf

2. Install Dependencies

Option 1: Docker Image

Run the experiment inside the provided Docker image:

docker run -it --rm --gpus all \
   --shm-size 20g \
   --network host \
   --name rlinf \
   -v .:/workspace/RLinf \
   rlinf/rlinf:agentic-rlinf0.2-metaworld
   # For mainland China users, you can use the following for better download speed:
   # docker.1ms.run/rlinf/rlinf:agentic-rlinf0.2-metaworld

Please switch to the corresponding virtual environment via the built-in switch_env utility in the image:

# To train OpenPi models
source switch_env openpi

# To train OpenVLA-OFT models
# source switch_env openvla-oft

Option 2: Custom Environment

Install dependencies directly in your environment by running the following command:

# For mainland China users, you can add the `--use-mirror` flag to the install.sh command for better download speed.

# To train OpenPi models
bash requirements/install.sh embodied --model openpi --env metaworld

# To train OpenVLA-OFT models
# bash requirements/install.sh embodied --model openvla-oft --env metaworld

source .venv/bin/activate

Model Download

Before starting training, you need to download the corresponding pretrained model:

# Download the model (choose either method)
# Method 1: Using git clone
git lfs install
git clone https://huggingface.co/RLinf/RLinf-Pi0-MetaWorld-SFT
git clone https://huggingface.co/RLinf/RLinf-Pi05-MetaWorld-SFT
git clone https://huggingface.co/RLinf/RLinf-OpenVLAOFT-MetaWorld-SFT

# Method 2: Using huggingface-hub
# For mainland China users, you can use the following for better download speed:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download RLinf/RLinf-Pi0-MetaWorld-SFT --local-dir RLinf-Pi0-MetaWorld-SFT
hf download RLinf/RLinf-Pi05-MetaWorld-SFT --local-dir RLinf-Pi05-MetaWorld-SFT
hf download RLinf/RLinf-OpenVLAOFT-MetaWorld-SFT --local-dir RLinf-OpenVLAOFT-MetaWorld-SFT

Alternatively, download the model from ModelScope at https://www.modelscope.cn/models/RLinf/RLinf-Pi0-MetaWorld.

After downloading, set the model path in the configuration YAML file to the local directory of the downloaded model.

Running the Script

1. Key Cluster Configuration

cluster:
   num_nodes: 1
   component_placement:
      env: 0-3
      rollout: 4-7
      actor: 0-7

rollout:
   pipeline_stage_num: 2

You can flexibly configure which GPUs the env, rollout, and actor components use. In the layout above, env runs on GPUs 0-3, rollout on GPUs 4-7, and actor on all eight. Setting pipeline_stage_num to 2 overlaps rollout and env execution in a pipeline, which improves rollout efficiency.

cluster:
   num_nodes: 1
   component_placement:
      env,rollout,actor: all

The layout above achieves full sharing: the env, rollout, and actor components all run on every GPU.

cluster:
   num_nodes: 1
   component_placement:
      env: 0-1
      rollout: 2-5
      actor: 6-7

The layout above achieves full separation: the env, rollout, and actor components each use their own GPUs without interfering with one another, so offloading is not needed.
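To make the three layouts concrete, the following sketch expands placement strings such as `0-3` or `all` into GPU index sets and checks whether any components overlap. The helpers `parse_gpu_spec`, `expand_placement`, and `is_fully_separated` are hypothetical and written only for this illustration; they are not RLinf's own placement parser, and the total of 8 GPUs is assumed from the indices used in the examples.

```python
from typing import Dict, Set


def parse_gpu_spec(spec: str, total_gpus: int) -> Set[int]:
    """Expand a spec such as "0-3", "5", or "all" into a set of GPU indices."""
    spec = str(spec).strip()
    if spec == "all":
        return set(range(total_gpus))
    if "-" in spec:
        start, end = (int(part) for part in spec.split("-"))
        return set(range(start, end + 1))  # ranges are inclusive on both ends
    return {int(spec)}


def expand_placement(placement: Dict[str, str], total_gpus: int) -> Dict[str, Set[int]]:
    """Map each component to its GPU set; a key may name several components."""
    expanded: Dict[str, Set[int]] = {}
    for names, spec in placement.items():
        gpus = parse_gpu_spec(spec, total_gpus)
        for name in names.split(","):
            expanded[name.strip()] = gpus
    return expanded


def is_fully_separated(expanded: Dict[str, Set[int]]) -> bool:
    """True when no two components share a GPU."""
    names = list(expanded)
    return all(
        expanded[a].isdisjoint(expanded[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )


TOTAL_GPUS = 8  # assumed from the GPU indices 0-7 used in the examples
hybrid = expand_placement({"env": "0-3", "rollout": "4-7", "actor": "0-7"}, TOTAL_GPUS)
shared = expand_placement({"env,rollout,actor": "all"}, TOTAL_GPUS)
separated = expand_placement({"env": "0-1", "rollout": "2-5", "actor": "6-7"}, TOTAL_GPUS)

for label, layout in (("hybrid", hybrid), ("shared", shared), ("separated", separated)):
    print(label, is_fully_separated(layout))
```

Only the third layout reports no overlap, which matches the note that it does not need offloading.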

2. Configuration Files

MetaWorld MT50 multi-task joint training configuration files (In this task setting, both training and inference are performed in a multi-task environment):

  • π0 + PPO: examples/embodiment/config/metaworld_50_ppo_openpi.yaml

  • π0.5 + PPO: examples/embodiment/config/metaworld_50_ppo_openpi_pi05.yaml

  • OpenVLA-OFT + GRPO: examples/embodiment/config/metaworld_50_grpo_openvlaoft.yaml

MetaWorld ML45 joint training configuration files (In this task setting, training is performed on 45 tasks, and inference is performed on 5 OOD tasks):

  • π0 + PPO: examples/embodiment/config/metaworld_45_ppo_openpi.yaml

3. Launch Commands

To start training with the selected configuration, run the following command:

bash examples/embodiment/run_embodiment.sh CHOSEN_CONFIG

For example, to train the π0 model using the PPO algorithm in the MetaWorld environment, run:

bash examples/embodiment/run_embodiment.sh metaworld_50_ppo_openpi
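When launching runs from a script, it can help to validate the configuration name before calling the launcher. The sketch below builds the same command as above from the configuration files listed in this section. The helper `build_launch_command` is hypothetical, and the assumption that the launcher takes the file name without the `.yaml` suffix is inferred from the example command.

```python
from typing import List

LAUNCH_SCRIPT = "examples/embodiment/run_embodiment.sh"

# Configuration names taken from the files listed in this section.
KNOWN_CONFIGS = (
    "metaworld_50_ppo_openpi",
    "metaworld_50_ppo_openpi_pi05",
    "metaworld_50_grpo_openvlaoft",
    "metaworld_45_ppo_openpi",
)


def build_launch_command(config_name: str) -> List[str]:
    """Return the argv list for the launcher, rejecting unknown config names."""
    if config_name.endswith(".yaml"):
        # Strip the suffix, since the example passes the bare config name.
        config_name = config_name[: -len(".yaml")]
    if config_name not in KNOWN_CONFIGS:
        raise ValueError(f"unknown config {config_name!r}; expected one of {KNOWN_CONFIGS}")
    return ["bash", LAUNCH_SCRIPT, config_name]


print(" ".join(build_launch_command("metaworld_50_ppo_openpi")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to start training.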

Visualization and Results

1. TensorBoard Logging

# Launch TensorBoard
tensorboard --logdir ./logs --port 6006

2. Key Monitoring Metrics

  • Training Metrics

    • actor/loss: Policy loss

    • actor/value_loss: Value function loss (PPO)

    • actor/grad_norm: Gradient norm

    • actor/approx_kl: KL divergence between old and new policies

    • actor/pg_clipfrac: Fraction of samples whose policy ratio was clipped

    • actor/value_clip_ratio: Value loss clipping ratio (PPO)

  • Rollout Metrics

    • rollout/returns_mean: Mean episode return

    • rollout/advantages_mean: Mean advantage value

  • Environment Metrics

    • env/episode_len: Mean episode length

    • env/success_once: Task success rate
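To help interpret a few of these metrics, the sketch below computes them from toy data using common conventions. The formulas are assumptions about how such metrics are typically defined, not RLinf's exact definitions: `approx_kl` is the mean difference of old and new log-probabilities, the clip fraction counts samples whose probability ratio leaves the range [1 - eps, 1 + eps], and `success_once` counts an episode as successful if success was reached at any step.

```python
import math
from typing import List


def approx_kl(old_logprobs: List[float], new_logprobs: List[float]) -> float:
    """Sample-based KL estimate: mean of (old_logprob - new_logprob)."""
    diffs = [old - new for old, new in zip(old_logprobs, new_logprobs)]
    return sum(diffs) / len(diffs)


def clip_fraction(old_logprobs: List[float], new_logprobs: List[float],
                  clip_eps: float = 0.2) -> float:
    """Fraction of samples whose ratio new/old falls outside [1 - eps, 1 + eps]."""
    ratios = [math.exp(new - old) for old, new in zip(old_logprobs, new_logprobs)]
    clipped = [abs(ratio - 1.0) > clip_eps for ratio in ratios]
    return sum(clipped) / len(clipped)


def success_once(episode_success_flags: List[List[bool]]) -> float:
    """Fraction of episodes in which success was reached at least once."""
    hits = [any(flags) for flags in episode_success_flags]
    return sum(hits) / len(hits)


# Toy data: four samples with probability ratios of 1.0, 1.1, 1.4, and 0.6.
old = [math.log(0.5)] * 4
new = [math.log(0.5), math.log(0.55), math.log(0.7), math.log(0.3)]
print(clip_fraction(old, new))  # two of the four ratios leave [0.8, 1.2]

# Toy data: per-step success flags for four episodes.
episodes = [
    [False, True, True],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
print(success_once(episodes))  # three of the four episodes succeed at some step
```

A rising clip fraction together with a growing `approx_kl` usually signals that policy updates are becoming too large.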

3. Video Generation

video_cfg:
  save_video: True
  info_on_video: True
  video_base_dir: ${runner.logger.log_path}/video/train

4. WandB Integration

runner:
  task_type: embodied
  logger:
    log_path: "../results"
    project_name: rlinf
    experiment_name: "metaworld_50_ppo_openpi"
    logger_backends: ["tensorboard", "wandb"] # tensorboard, wandb, swanlab

MetaWorld Results

The results for Diffusion Policy, TinyVLA, and SmolVLA in the table below are taken from the SmolVLA paper. The SFT results for π0 and π0.5 are obtained by retraining on the official dataset provided by LeRobot.

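MetaWorld-MT50 Performance Comparison (Success Rate, %)

| Methods          | Easy | Medium | Hard | Very Hard | Avg. |
|------------------|------|--------|------|-----------|------|
| Diffusion Policy | 23.1 | 10.7   | 1.9  | 6.1       | 10.5 |
| TinyVLA          | 77.6 | 21.5   | 11.4 | 15.8      | 31.6 |
| SmolVLA          | 87.1 | 51.8   | 70.0 | 64.0      | 68.2 |
| π0               | 77.9 | 51.8   | 53.3 | 20.0      | 50.8 |
| π0 + PPO         | 92.1 | 74.6   | 61.7 | 84.0      | 78.1 |
| π0.5             | 68.2 | 37.3   | 41.7 | 28.0      | 43.8 |
| π0.5 + PPO       | 86.4 | 55.5   | 75.0 | 66.0      | 70.7 |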
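The effect of PPO fine-tuning can be read off the table by subtracting each SFT row from its PPO counterpart. The short sketch below does this arithmetic with the reported numbers. The helper names are made up for the example, ASCII keys stand in for the π0 and π0.5 rows, and the observation that the Avg. column matches the unweighted mean of the four difficulty columns is inferred from the numbers rather than stated by the source.

```python
DIFFICULTIES = ("Easy", "Medium", "Hard", "Very Hard")

# Success rates (%) copied from the table above.
RESULTS = {
    "pi0": {"Easy": 77.9, "Medium": 51.8, "Hard": 53.3, "Very Hard": 20.0, "Avg.": 50.8},
    "pi0 + PPO": {"Easy": 92.1, "Medium": 74.6, "Hard": 61.7, "Very Hard": 84.0, "Avg.": 78.1},
    "pi0.5": {"Easy": 68.2, "Medium": 37.3, "Hard": 41.7, "Very Hard": 28.0, "Avg.": 43.8},
    "pi0.5 + PPO": {"Easy": 86.4, "Medium": 55.5, "Hard": 75.0, "Very Hard": 66.0, "Avg.": 70.7},
}


def ppo_gain(base: str) -> dict:
    """Percentage-point gain of '<base> + PPO' over '<base>' for every column."""
    tuned = RESULTS[f"{base} + PPO"]
    return {col: round(tuned[col] - RESULTS[base][col], 1) for col in RESULTS[base]}


def mean_of_difficulties(method: str) -> float:
    """Unweighted mean of the four difficulty columns."""
    row = RESULTS[method]
    return sum(row[d] for d in DIFFICULTIES) / len(DIFFICULTIES)


for base in ("pi0", "pi0.5"):
    print(base, ppo_gain(base))

for method, row in RESULTS.items():
    # The reported averages agree with the unweighted mean up to rounding.
    assert abs(mean_of_difficulties(method) - row["Avg."]) < 0.06
```

With these numbers, PPO raises the average success rate by 27.3 points for π0 and 26.9 points for π0.5, and the largest gain for both models is on the Very Hard group (64.0 and 38.0 points).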