RL with OpenSora World Model

This document explains how to launch and manage Vision-Language-Action (VLA) model training tasks in the RLinf framework, using the action-conditioned OpenSora world model (hereafter OpenSora) as the environment backend.

The primary objective is to train the policy in a closed-loop fashion without requiring real robots or traditional physics simulators, by leveraging a visual generation model to simulate how the environment evolves in response to actions.

Similar to finetuning VLAs in the LIBERO environment, this guide focuses on how to run reinforcement learning training tasks in the OpenSora-based simulation environment, highlighting the key capabilities of the model within this framework.

Training with OpenSora aims to give the VLA policy the following capabilities:

  1. Visual Understanding: OpenSora generates future video frames from the current observation and a given action sequence. This gives the policy continuous visual feedback and enables it to process RGB images from real robot cameras.

  2. Language Comprehension: Understanding natural-language task descriptions.

  3. Action Generation: Producing precise robotic actions (position, rotation, gripper control).

  4. Policy Improvement: Leveraging “imagined” trajectories generated by OpenSora to optimize the VLA policy using reinforcement learning methods such as PPO.
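The closed-loop interaction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not RLinf's implementation: `ToyWorldModel`, `toy_policy`, and `rollout` are hypothetical names, frames are represented as strings, and the reward is a random placeholder for the reward classifier.

```python
import random


class ToyWorldModel:
    """Hypothetical stand-in for an action-conditioned video model."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.context = []
        self.task = None

    def reset(self, init_frame, task):
        # A world model cannot reset itself: it needs an initial frame
        # and a task description supplied from outside.
        self.context = [init_frame]
        self.task = task
        return init_frame

    def step(self, action_chunk):
        # A real model would generate one future frame per action,
        # conditioned on recent context frames. Here frames are just labels.
        start = len(self.context)
        frames = [f"frame_{start + i}" for i in range(len(action_chunk))]
        self.context.extend(frames)
        reward = self.rng.random()  # placeholder for the reward classifier
        return frames[-1], reward


def toy_policy(observation, task, chunk=8):
    # Placeholder VLA: returns `chunk` 7-dimensional zero actions.
    return [[0.0] * 7 for _ in range(chunk)]


def rollout(world_model, policy, init_frame, task, num_chunks=4):
    """Collect an imagined trajectory without a robot or physics simulator."""
    obs = world_model.reset(init_frame, task)
    trajectory = []
    for _ in range(num_chunks):
        actions = policy(obs, task)
        obs, reward = world_model.step(actions)
        trajectory.append((actions, obs, reward))
    return trajectory
```

The collected trajectory would then be passed to the RL update. The policy never touches a simulator, only frames produced by the world model.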

Environment

As a world model, OpenSora can theoretically fit any environment for any task while maintaining a consistent interface. Using the LIBERO environment as an example, the environment interfaces and definitions are as follows:

OpenSora Simulating LIBERO Environment

  • Environment: Visual generation model

  • Task: Command a 7-DoF robotic arm to perform a variety of household manipulation skills (pick-and-place, stacking, opening drawers, spatial rearrangement)

  • Observation: Images returned by the visual generation model

  • Action Space: 7-dimensional continuous actions

    • 3D end-effector position control (x, y, z)

    • 3D rotation control (roll, pitch, yaw)

    • Gripper control (open / close)

OpenSora Simulating LIBERO Environment Reset

Unlike traditional simulators, which can be reset directly via reset(), OpenSora needs initial frames and a task description to initialize or reset an episode. You therefore need to download the corresponding initialization dataset in advance and specify its path.
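A reset driven by initialization data might look like the sketch below. It is only an illustration: `InitDataset` and `reset_world_model` are hypothetical names, records are held in memory rather than loaded from the downloaded files, and padding a short history by repeating its first frame is an assumption rather than documented behavior.

```python
import random


class InitDataset:
    """Hypothetical in-memory stand-in for the initialization dataset."""

    def __init__(self, records):
        # Each record pairs initial frames with a task description.
        if not records:
            raise ValueError("initialization dataset is empty")
        self.records = records

    def sample(self, rng):
        return rng.choice(self.records)


def reset_world_model(dataset, rng, condition_frame_length=4):
    """Build the initial context and task for one imagined episode."""
    record = dataset.sample(rng)
    frames = list(record["frames"])
    # Assumption: pad by repeating the first frame so the context always
    # holds `condition_frame_length` frames.
    while len(frames) < condition_frame_length:
        frames.insert(0, frames[0])
    return {
        "context": frames[-condition_frame_length:],
        "task": record["task"],
    }
```

For example, `reset_world_model(InitDataset([{"frames": ["f0"], "task": "pick up the bowl"}]), random.Random(0))` returns a context of four copies of `"f0"` together with the task string.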

Data Structure

  • Images: RGB tensors [batch_size, 256, 256, 3]

  • Task Descriptions: Natural-language instructions

  • Actions: Normalized continuous values converted to discrete tokens

  • Rewards: Provided by the reward classifier in the world model, ranging from 0 to 1
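The conversion between normalized continuous actions and discrete tokens can be illustrated with uniform binning. This is a sketch only: the function names are hypothetical, and the bin count of 256 and the [-1, 1] range are assumptions for illustration, not values confirmed by this guide.

```python
def tokenize_action(action, n_bins=256, low=-1.0, high=1.0):
    """Map each normalized action value to a bin index in [0, n_bins - 1]."""
    width = (high - low) / n_bins
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        index = int((clipped - low) / width)
        # The upper bound `high` falls just outside the last bin, so clamp it.
        tokens.append(min(index, n_bins - 1))
    return tokens


def detokenize_action(tokens, n_bins=256, low=-1.0, high=1.0):
    """Map each bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return [low + (token + 0.5) * width for token in tokens]
```

A round trip through both functions loses at most half a bin width per dimension, which is the price of representing a 7-dimensional continuous action as seven discrete tokens.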

Algorithm

Core Algorithm Components

  1. PPO (Proximal Policy Optimization)

    • Advantage estimation using GAE (Generalized Advantage Estimation)

    • Policy clipping with ratio limits

    • Value function clipping

    • Entropy regularization

  2. GRPO (Group Relative Policy Optimization)

    • For every state / prompt, the policy generates G independent actions

    • Compute the advantage of each action by subtracting the group’s mean reward

  3. Vision-Language-Action Model

    • OpenVLA architecture with multimodal fusion

    • Action tokenization and de-tokenization

    • Value head for critic function
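The group-relative advantage used by GRPO, as described in item 2, can be sketched in a few lines. The function name is hypothetical, and the optional division by the group's standard deviation is a common variant included here as an assumption; the description above only mentions subtracting the mean.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=False, eps=1e-6):
    """Advantage of each sample relative to its group's mean reward.

    `rewards` holds the G rewards obtained for the same state / prompt.
    """
    baseline = mean(rewards)
    advantages = [reward - baseline for reward in rewards]
    if normalize:
        # Optional variant (an assumption here): scale by the group std.
        scale = pstdev(rewards) + eps
        advantages = [adv / scale for adv in advantages]
    return advantages
```

With rewards `[1.0, 0.0, 0.0, 1.0]` the baseline is 0.5, so successful samples receive a positive advantage and failed ones a negative advantage. No learned value function is needed to form the baseline.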

Dependency Installation

1. Clone RLinf Repository

# For mainland China users, you can use the following for better download speed:
# git clone https://ghfast.top/github.com/RLinf/RLinf.git
git clone https://github.com/RLinf/RLinf.git
cd RLinf

2. Install Dependencies

Option 1: Docker Image

Run the experiment in the provided Docker image.

docker run -it --rm --gpus all \
   --shm-size 20g \
   --network host \
   --name rlinf \
   -v .:/workspace/RLinf \
   rlinf/rlinf:agentic-rlinf0.2-opensora
   # For mainland China users, you can use the following for better download speed:
   # docker.1ms.run/rlinf/rlinf:agentic-rlinf0.2-opensora

Option 2: Custom Environment

Install dependencies directly in your environment by running the following command:

# For mainland China users, you can add the `--use-mirror` flag to the install.sh command for better download speed.

bash requirements/install.sh embodied --model openvla-oft --env opensora
source .venv/bin/activate

VLA Model Download

Before starting training, you need to download the corresponding pretrained model:

# Download the model (choose either method)
# Method 1: Using git clone
git lfs install
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-spatial-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-object-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-goal-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero10-traj1

# Method 2: Using huggingface-hub
# For mainland China users, you can use the following for better download speed:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download Haozhan72/Openvla-oft-SFT-libero-spatial-traj1 --local-dir Openvla-oft-SFT-libero-spatial-traj1
hf download Haozhan72/Openvla-oft-SFT-libero-object-traj1 --local-dir Openvla-oft-SFT-libero-object-traj1
hf download Haozhan72/Openvla-oft-SFT-libero-goal-traj1 --local-dir Openvla-oft-SFT-libero-goal-traj1
hf download Haozhan72/Openvla-oft-SFT-libero10-traj1 --local-dir Openvla-oft-SFT-libero10-traj1

After downloading, set the model path and the unnorm_key that matches your checkpoint in the configuration YAML file.

rollout:
   model:
      model_path: Pathto/RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora
actor:
   model:
      model_path: Pathto/RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora
      unnorm_key: libero_90_no_noops_trajall # or libero_130_no_noops_trajall for the RLinf-OpenVLAOFT-LIBERO-130-Base-Lora model

World Model (WM) Download

In addition to the VLA model, you need to download the OpenSora weights and the dataset for simulation initialization. Currently, RLinf only provides weights and data for libero-spatial and libero-object. The download methods are as follows:

# Download the weights and initialization data
# Method 1: Using git clone
git lfs install
git clone https://huggingface.co/RLinf/RLinf-OpenSora-LIBERO-Spatial
git clone https://huggingface.co/RLinf/RLinf-OpenSora-LIBERO-Object

# Method 2: Using huggingface-hub
pip install huggingface-hub
hf download RLinf/RLinf-OpenSora-LIBERO-Spatial --local-dir RLinf-OpenSora-LIBERO-Spatial
hf download RLinf/RLinf-OpenSora-LIBERO-Object --local-dir RLinf-OpenSora-LIBERO-Object

The directory structure of RLinf-OpenSora-LIBERO-Spatial is as follows:

RLinf-OpenSora-LIBERO-Spatial/
    ├── dataset_statistics.json             # Dataset normalization statistics
    ├── dataset/                            # Simulation initialization dataset
    │   ├── traj0.npy
    │   ├── traj1.npy
    │   ├── ...
    │   └── trajN.npy
    ├── model-00001.safetensors             # World model weight files
    ├── model.safetensors.index.json
    ├── config.json
    ├── resnet_rm.pth                       # Reward model weight file
    └── vae/                                # VAE model weight files

After downloading, make sure to correctly specify the model path in the configuration yaml file.

env:
    train:
        opensora_wm_hf_ckpt_path: /Pathto/model/RLinf-OpenSora-LIBERO-Spatial/
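Before training, it can help to confirm that the downloaded directory contains the entries shown in the directory listing above. The helper names below are hypothetical, and the check only covers the entries named in that listing; it is a convenience sketch, not part of RLinf.

```python
from pathlib import Path

# Entries taken from the directory listing above.
EXPECTED_ENTRIES = [
    "dataset_statistics.json",
    "dataset",
    "model.safetensors.index.json",
    "config.json",
    "resnet_rm.pth",
    "vae",
]


def missing_entries(ckpt_dir):
    """Return the expected entries that are absent from `ckpt_dir`."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


def count_init_trajectories(ckpt_dir):
    """Count the traj*.npy files in the initialization dataset."""
    return len(list((Path(ckpt_dir) / "dataset").glob("traj*.npy")))
```

Calling `missing_entries` on the path you set in `opensora_wm_hf_ckpt_path` should return an empty list, and `count_init_trajectories` should return a positive number if the initialization data downloaded correctly.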

Running the Script

Please ensure you have activated the correct Python virtual environment (venv) before running the commands below. If you are using the official Docker image, switch to the openvla-oft environment with source switch_env openvla-oft.

1. Key Parameters Configuration

Taking the OpenVLA-OFT model as an example, configure the following key parameters in actor.model:

actor:
  model:
    model_path: "/path/to/model/Openvla-oft-SFT-libero-spatial-traj1/"    # SFT model path
    model_type: "openvla_oft"                                             # Model type set to openvla_oft
    use_proprio: False                                                    # Whether to use proprioceptive inputs
    num_images_in_input: 1                                                # Number of input images
    num_action_chunks: 8                                                  # Number of action chunks
    unnorm_key: "libero_spatial_no_noops"                                 # Action normalization key (match SFT). For RLinf-OpenVLAOFT-LIBERO-130-Base-Lora model, use libero_130_no_noops_trajall. For RLinf-OpenVLAOFT-LIBERO-90-Base-Lora model, use libero_90_no_noops_trajall.

Note that because the world model does not provide proprioception, does not generate a wrist view, and uses a fixed chunk length, use_proprio defaults to False, num_images_in_input defaults to 1, and num_action_chunks defaults to 8.

2. Environment Configuration

In the environment configuration file, set the following key parameters:

# Override in CHOSEN_CONFIG

# Recommend opensora_libero_spatial for training and libero_spatial for evaluation
env/train: opensora_libero_spatial
env/eval: libero_spatial
env:
   train:
      opensora_wm_hf_ckpt_path: /Pathto/model/RLinf-OpenSora-LIBERO-Spatial/

# In env/train/opensora_libero_spatial.yaml:

env_type: opensora_wm
wm_env_type: libero
# Initial image path for world model initialization
initial_image_path: ${env.train.opensora_wm_hf_ckpt_path}/dataset_for_rlinf_world_model_init/base_policy_rollout_buffer
# It is not recommended to modify any parameters in world_model_cfg
world_model_cfg:
   # Path to dataset statistics for normalization in the world model
   stats_path: /Pathto/model/RLinf-OpenSora-LIBERO-Spatial/best_wm_ckpt/base_policy/dataset_statistics.json
   chunk: 8                     # Align with training and VLA inference length; default 8
   condition_frame_length: 4    # Align with training; context memory length, default 4
   model:
      # Pretrained weights
      from_pretrained: /Pathto/model/RLinf-OpenSora-LIBERO-Spatial/best_wm_ckpt/base_policy/model
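The constraints between the actor settings and the world model settings (matching chunk length, no proprioception, a single input image) can be checked with a small sketch like the one below. The function name is hypothetical, the configuration is passed as a plain nested dictionary, and placing `world_model_cfg` under `env.train` is an assumption about how the merged configuration is laid out.

```python
def check_config(cfg):
    """Return a list of human-readable problems found in `cfg`."""
    problems = []
    actor = cfg["actor"]["model"]
    # Assumption: the merged config exposes world_model_cfg under env.train.
    world_model = cfg["env"]["train"]["world_model_cfg"]

    if actor["num_action_chunks"] != world_model["chunk"]:
        problems.append(
            "actor.model.num_action_chunks must equal world_model_cfg.chunk"
        )
    if actor["use_proprio"]:
        problems.append(
            "use_proprio must be False: the world model provides no proprioception"
        )
    if actor["num_images_in_input"] != 1:
        problems.append(
            "num_images_in_input must be 1: the world model generates no wrist view"
        )
    return problems
```

An empty list means the settings agree. Any returned message points at a parameter to correct before launching training.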

3. Configuration Files

We support the OpenVLA-OFT model with the GRPO algorithm. The corresponding configuration file is:

  • OpenVLA-OFT + GRPO: examples/embodiment/config/opensora_libero_spatial_grpo_openvlaoft.yaml

4. Launch Commands

After choosing a configuration, run the following command to start training:

bash examples/embodiment/run_embodiment.sh CHOSEN_CONFIG

For example, to train OpenVLA-OFT with GRPO in the OpenSora environment:

bash examples/embodiment/run_embodiment.sh opensora_libero_spatial_grpo_openvlaoft

Visualization and Results

1. TensorBoard Logging

# Start TensorBoard
tensorboard --logdir ./logs --port 6006

2. Key Metrics Tracked

  • Training Metrics:

    • train/actor/approx_kl: Approximate KL divergence to monitor policy update magnitude.

    • train/actor/clip_fraction: Fraction of updates where the probability ratio triggered PPO clipping.

    • train/actor/clipped_ratio: Mean of the clipped probability ratios, measuring how much policy updates are affected by clipping.

    • train/actor/grad_norm: Gradient norm.

    • train/actor/lr: Learning rate.

    • train/actor/policy_loss: PPO/GRPO policy loss.

    • train/critic/value_loss: Value function loss.

    • train/critic/value_clip_ratio: Fraction of value targets whose update was clipped in PPO-style value function clipping.

    • train/critic/explained_variance: Explained variance of value function predictions; closer to 1 is better.

    • train/entropy_loss: Policy entropy.

    • train/loss: Total training loss (actor_loss + critic_loss + entropy_loss regularization).

  • Rollout Metrics:

    • rollout/advantages_max: Maximum of the advantage function.

    • rollout/advantages_mean: Mean of the advantage function.

    • rollout/advantages_min: Minimum of the advantage function.

    • rollout/rewards: Reward per action chunk (refer to L414 in libero_env.py).

  • Environment Metrics:

    • env/episode_len: Number of environment steps elapsed in the episode (unit: step).

    • env/return: Episode return. In LIBERO’s sparse-reward setting, this metric is not informative since the reward is almost always 0 until the terminal success step.

    • env/reward: Step-level reward (0 for all intermediate steps and 1 only at successful termination). The logged value is normalized by the number of episode steps, which makes it difficult to interpret as real task performance during training.

    • env/success_once: Recommended metric to monitor training performance. It directly reflects the unnormalized episodic success rate and better represents the true performance of the policy.
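To make a few of the metrics above concrete, the sketch below shows one common way such quantities are computed. It is not RLinf's code: the function names are hypothetical, the KL estimate is the simple mean log-ratio estimator, the clip threshold of 0.2 is an assumed default, and the exact definitions used by the framework may differ.

```python
import math


def ppo_diagnostics(old_logprobs, new_logprobs, clip_eps=0.2):
    """Approximate KL and clip fraction from per-sample log-probabilities."""
    pairs = list(zip(old_logprobs, new_logprobs))
    # Probability ratio pi_new / pi_old for each sampled action.
    ratios = [math.exp(new - old) for old, new in pairs]
    # A sample counts as clipped when its ratio leaves [1 - eps, 1 + eps].
    clipped = [abs(ratio - 1.0) > clip_eps for ratio in ratios]
    # Simple estimator of KL(old || new): mean of (old_logp - new_logp).
    approx_kl = sum(old - new for old, new in pairs) / len(pairs)
    return {
        "approx_kl": approx_kl,
        "clip_fraction": sum(clipped) / len(clipped),
    }


def success_once_rate(episodes):
    """Fraction of episodes with at least one successful step.

    Each episode is a list of per-step success flags (0 or 1).
    """
    return sum(1 for flags in episodes if any(flags)) / len(episodes)
```

Unlike a reward averaged over steps, `success_once_rate` does not shrink as episodes get longer, which is why a success-based metric is easier to read as task performance.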

3. Video Generation

env:
   eval:
      video_cfg:
         save_video: True
         video_base_dir: ${runner.logger.log_path}/video/eval

4. Training Logger Integration

runner:
   task_type: embodied
   logger:
      log_path: "../results"
      project_name: rlinf
      experiment_name: "libero_10_grpo_openvlaoft"
      logger_backends: ["tensorboard"] # wandb, swanlab

LIBERO Partial Results

So far, we have only tested OpenSora as a simulator for the libero-spatial and libero-object environments and trained VLA models in them. Other environments are still being tested.

For each LIBERO suite, we evaluate every combination of task_id and trial_id. For the Object and Spatial suites, we evaluate 500 environments in total (10 tasks × 50 trials).

We evaluate each model according to its training configuration. For the SFT-trained (LoRA-base) models, we set do_sample = False. For the RL-trained models, we set do_sample = True, temperature = 1.6, and rollout_epoch = 2 to elicit the best performance of the RL-tuned policy.

Note

The motivation for choosing OpenSora as a world model simulator comes from WMPO. In the actual training of the world model, we referred to WMPO and OpenSora.

Evaluation results on LIBERO task groups using OpenSora simulator

| Model | Object | Spatial |
| --- | --- | --- |
| Hugging Face OpenVLA-OFT (LoRA-base) | 50.20% | 51.61% |
| OpenVLA-OFT (RLinf-GRPO with OpenSora as world model simulator) | 75.5% | 64.5% |
| Improvement | +25.3% | +12.9% |