RL with OpenSora World Model#

Figure: OpenSora as an action-conditioned world model.

Train a VLA policy closed-loop without real robots or a physics simulator by using the action-conditioned OpenSora world model as the environment backend. OpenSora generates future video frames from the current observation and an action sequence, so the policy can be optimized on “imagined” rollouts with RL (GRPO/PPO).
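As a rough illustration of how imagined rollouts feed GRPO, the sketch below normalizes rewards within a group of rollouts that start from the same initial frames. This is the group-relative advantage commonly described for GRPO, not RLinf's code: the function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts that share a start state.

    Hypothetical helper: uses the population standard deviation and a small
    eps to avoid division by zero; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four imagined rollouts from the same initialization frames; two succeed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # successes get positive values, failures negative values
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.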

Overview#

Train OpenVLA-OFT with GRPO on LIBERO suites simulated by the OpenSora world model.

Environments	LIBERO
Algorithms	GRPO
Tasks	Spatial · Object
Hardware	1 node · GPUs

You’ll do: install → download the VLA model → download the OpenSora world-model weights + init data → launch run_embodiment.sh → watch env/success_once.

Prerequisites: Installation · an OpenVLA-OFT SFT checkpoint · OpenSora world-model weights and init dataset (steps below).

Tasks#

As a world model, OpenSora can in principle fit any task behind a consistent interface. RLinf currently ships weights and init data for two LIBERO suites:

Environment	Task / Suite	Config / Weights	Focus
OpenSora	LIBERO-Spatial	`RLinf/RLinf-OpenSora-LIBERO-Spatial`	Use OpenSora as a learned simulator for LIBERO spatial tasks.
OpenSora	LIBERO-Object	`RLinf/RLinf-OpenSora-LIBERO-Object`	Roll out object manipulation dynamics in the video world model.

Observation and Action#

Field	Description
Observation	RGB frames generated by the world model, `[B, 256, 256, 3]`, seeded from initialization frames.
Action	7-D continuous actions normalized and tokenized to condition generation.
Reward	World-model reward classifier output in `[0, 1]`.
Prompt	Natural-language task description used to condition the video world model.

Unlike a traditional simulator, OpenSora has no reset() that can produce an initial state on its own: every rollout must be seeded with initialization frames and a task description, so you download an initialization dataset and point the config at it.
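One way to picture this seeding requirement is a thin wrapper that emulates a reset by drawing a stored sample. The sketch below is only an illustration: `InitSample`, `ToyWorldModelEnv`, and `reset_from_dataset` are hypothetical names, the frames are placeholder strings, and none of this reflects RLinf's actual environment classes.

```python
import random
from dataclasses import dataclass


@dataclass
class InitSample:
    """One entry of a hypothetical initialization dataset."""
    frames: list           # initialization frames (placeholders here)
    task_description: str  # natural-language prompt for the world model


class ToyWorldModelEnv:
    """Hypothetical wrapper; not RLinf's actual environment class."""

    def __init__(self, init_dataset, seed=0):
        if not init_dataset:
            raise ValueError("need at least one initialization sample")
        self._samples = list(init_dataset)
        self._rng = random.Random(seed)

    def reset_from_dataset(self):
        """Emulate reset() by seeding the rollout from a stored sample."""
        sample = self._rng.choice(self._samples)
        return {"frames": list(sample.frames), "prompt": sample.task_description}


env = ToyWorldModelEnv([InitSample(["frame_0", "frame_1"], "example task description")])
obs = env.reset_from_dataset()
print(obs["prompt"], len(obs["frames"]))
```

Constructing the wrapper with an empty dataset fails immediately, which mirrors the point that the world model cannot start a rollout without initialization data.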

Installation#

First, clone the RLinf repository:

# Mainland China users can use a mirror for faster cloning:
# git clone https://ghfast.top/github.com/RLinf/RLinf.git
git clone https://github.com/RLinf/RLinf.git
cd RLinf

Then set up the dependencies with one of the two methods below — a prebuilt Docker image (recommended) or a custom environment. The general setup (prerequisites, GPU drivers, the in-image switch_env helper, mirrors, and troubleshooting) is documented once in Installation; the commands in this recipe only differ in the Docker image tag and the --env value.

Option 1: Docker image — image tag agentic-rlinf0.3-opensora:

docker run -it --rm --gpus all \
   --shm-size 20g \
   --network host \
   --name rlinf \
   -v .:/workspace/RLinf \
   rlinf/rlinf:agentic-rlinf0.3-opensora
   # Mainland China mirror: docker.1ms.run/rlinf/rlinf:agentic-rlinf0.3-opensora

# Inside the container, switch to the OpenVLA-OFT virtual environment:
source switch_env openvla-oft

Option 2: Custom environment — install bundle --env opensora:

# Add --use-mirror for faster downloads in mainland China.
bash requirements/install.sh embodied --model openvla-oft --env opensora
source .venv/bin/activate

Download the VLA Model#

Download the OpenVLA-OFT SFT checkpoints:

# Method 1: git clone
git lfs install
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-spatial-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-object-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-goal-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero10-traj1

# Method 2: huggingface-hub (set HF_ENDPOINT=https://hf-mirror.com in mainland China)
pip install huggingface-hub
hf download Haozhan72/Openvla-oft-SFT-libero-spatial-traj1 --local-dir Openvla-oft-SFT-libero-spatial-traj1
hf download Haozhan72/Openvla-oft-SFT-libero-object-traj1 --local-dir Openvla-oft-SFT-libero-object-traj1
hf download Haozhan72/Openvla-oft-SFT-libero-goal-traj1 --local-dir Openvla-oft-SFT-libero-goal-traj1
hf download Haozhan72/Openvla-oft-SFT-libero10-traj1 --local-dir Openvla-oft-SFT-libero10-traj1

After downloading, set the model path and unnorm_key in the config:

rollout:
   model:
      model_path: /path/to/model/Openvla-oft-SFT-libero-spatial-traj1/
actor:
   model:
      model_path: /path/to/model/Openvla-oft-SFT-libero-spatial-traj1/
      unnorm_key: libero_spatial_no_noops # must match the SFT checkpoint

Download the World Model#

Download the OpenSora weights and the simulation-initialization dataset. RLinf currently provides weights and data for libero-spatial and libero-object:

# Method 1: git clone
git lfs install
git clone https://huggingface.co/RLinf/RLinf-OpenSora-LIBERO-Spatial
git clone https://huggingface.co/RLinf/RLinf-OpenSora-LIBERO-Object

# Method 2: huggingface-hub
pip install huggingface-hub
hf download RLinf/RLinf-OpenSora-LIBERO-Spatial --local-dir RLinf-OpenSora-LIBERO-Spatial
hf download RLinf/RLinf-OpenSora-LIBERO-Object --local-dir RLinf-OpenSora-LIBERO-Object

The directory structure of RLinf-OpenSora-LIBERO-Spatial is:

RLinf-OpenSora-LIBERO-Spatial/
    ├── dataset_statistics.json             # Dataset normalization statistics
    ├── dataset/                            # Simulation initialization dataset
    │   ├── traj0.npy
    │   ├── traj1.npy
    │   ├── ...
    │   └── trajN.npy
    ├── model-00001.safetensors              # World model weight files
    ├── model.safetensors.index.json
    ├── config.json
    ├── resnet_rm.pth                        # Reward model weight file
    └── vae/                                 # VAE model weight files
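Before editing the config, it can help to confirm that the download is complete. The following sketch checks a local directory against the entries in the listing above; the helper name `missing_entries` is hypothetical, and the check assumes the released layout matches that listing exactly.

```python
from pathlib import Path

# Entries taken from the directory listing above.
EXPECTED_FILES = [
    "dataset_statistics.json",
    "model.safetensors.index.json",
    "config.json",
    "resnet_rm.pth",
]
EXPECTED_DIRS = ["dataset", "vae"]


def missing_entries(root):
    """Return the expected files and directories that are absent under root."""
    root = Path(root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    dataset_dir = root / "dataset"
    # The init dataset should contain at least one trajectory file.
    if dataset_dir.is_dir() and not any(dataset_dir.glob("traj*.npy")):
        missing.append("dataset/traj*.npy")
    return missing


# An empty list means every expected entry was found.
print(missing_entries("RLinf-OpenSora-LIBERO-Spatial"))
```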

After downloading, set the world-model path in the config:

env:
    train:
        opensora_wm_hf_ckpt_path: /Pathto/model/RLinf-OpenSora-LIBERO-Spatial/

Run It#

1. Model parameters

Configure actor.model (OpenVLA-OFT example):

actor:
  model:
    model_path: "/path/to/model/Openvla-oft-SFT-libero-spatial-traj1/"    # SFT model path
    model_type: "openvla_oft"                                             # Model type set to openvla_oft
    use_proprio: False                                                    # Whether to use proprioceptive inputs
    num_images_in_input: 1                                                # Number of input images
    num_action_chunks: 8                                                  # Number of action chunks
    unnorm_key: "libero_spatial_no_noops"                                 # Action normalization key (match SFT)

Because the world model does not provide proprioception, does not generate a wrist view, and uses a fixed chunk length, use_proprio defaults to False, num_images_in_input to 1, and num_action_chunks to 8.
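These three constraints are easy to break when adapting a config from another recipe. The sketch below validates a plain dictionary that mirrors the `actor.model` fields shown above; the function `check_world_model_policy_config` and its `wm_chunk` argument are hypothetical and not part of RLinf.

```python
def check_world_model_policy_config(model_cfg, wm_chunk=8):
    """Return a list of settings that conflict with the world-model backend."""
    problems = []
    if model_cfg.get("use_proprio", False):
        problems.append("use_proprio must be False: no proprioception is available")
    if model_cfg.get("num_images_in_input", 1) != 1:
        problems.append("num_images_in_input must be 1: no wrist view is generated")
    if model_cfg.get("num_action_chunks") != wm_chunk:
        problems.append(f"num_action_chunks must equal the world-model chunk ({wm_chunk})")
    return problems


model_cfg = {
    "model_type": "openvla_oft",
    "use_proprio": False,
    "num_images_in_input": 1,
    "num_action_chunks": 8,
}
print(check_world_model_policy_config(model_cfg))  # empty list when consistent
```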

2. Environment configuration

# Recommended: opensora_libero_spatial for training, libero_spatial for evaluation
env/train: opensora_libero_spatial
env/eval: libero_spatial
env:
   train:
      opensora_wm_hf_ckpt_path: /Pathto/model/RLinf-OpenSora-LIBERO-Spatial/

# In env/train/opensora_libero_spatial.yaml:
env_type: opensora_wm
wm_env_type: libero
# Initial image path for world model initialization
initial_image_path: ${env.train.opensora_wm_hf_ckpt_path}/dataset_for_rlinf_world_model_init/base_policy_rollout_buffer
# It is not recommended to modify any parameters in world_model_cfg
world_model_cfg:
   stats_path: /Pathto/model/RLinf-OpenSora-LIBERO-Spatial/best_wm_ckpt/base_policy/dataset_statistics.json
   chunk: 8                     # Align with training and VLA inference length; default 8
   condition_frame_length: 4    # Context memory length; default 4
   model:
      from_pretrained: /Pathto/model/RLinf-OpenSora-LIBERO-Spatial/best_wm_ckpt/base_policy/model
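To make the roles of `chunk` and `condition_frame_length` concrete, the sketch below models one rollout step with a fixed-size context buffer. It rests on two assumptions that the config does not confirm: that one frame is generated per action in the chunk, and that the context is simply the most recent frames. `fake_generate` is a placeholder for the video model, and frames are represented by integer ids.

```python
from collections import deque

CHUNK = 8                   # actions consumed per world-model step
CONDITION_FRAME_LENGTH = 4  # frames kept as conditioning context


def fake_generate(context, actions):
    """Placeholder for the video model: one new frame id per action."""
    last = context[-1]
    return [last + i + 1 for i in range(len(actions))]


def rollout_step(context, actions):
    """Generate one chunk of frames and slide the context window forward."""
    if len(actions) != CHUNK:
        raise ValueError(f"expected {CHUNK} actions, got {len(actions)}")
    new_frames = fake_generate(list(context), actions)
    context.extend(new_frames)  # the deque's maxlen drops the oldest frames
    return new_frames


# Seed the context with four initialization frames, then step once.
context = deque([0, 1, 2, 3], maxlen=CONDITION_FRAME_LENGTH)
frames = rollout_step(context, actions=[[0.0] * 7 for _ in range(CHUNK)])
print(frames)         # [4, 5, 6, 7, 8, 9, 10, 11]
print(list(context))  # [8, 9, 10, 11]
```

Under these assumptions, a policy whose `num_action_chunks` differs from `chunk` would hand the world model the wrong number of actions per step, which is why the two values are kept aligned.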

3. Launch

OpenVLA-OFT + GRPO uses examples/embodiment/config/opensora_libero_spatial_grpo_openvlaoft.yaml:

bash examples/embodiment/run_embodiment.sh opensora_libero_spatial_grpo_openvlaoft

Visualization and Results#

Watch env/success_once for the unnormalized episodic success rate. For every logged metric, see Training metrics. Enable generated-rollout videos with:

env:
   eval:
      video_cfg:
         save_video: True
         video_base_dir: ${runner.logger.log_path}/video/eval

We evaluate every task_id × trial_id combination — 500 environments (10 tasks × 50 trials) per suite. SFT (LoRA-base) models use do_sample = False; RL-trained models use do_sample = True and temperature_train = 1.6 in rollout.sampling_params, with env.train.rollout_epoch=2.
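The evaluation grid can be enumerated directly, which makes the 500-episode count explicit. This is a minimal sketch; the names `eval_episodes` and `success_rate` are illustrative and do not correspond to RLinf functions.

```python
from itertools import product

NUM_TASKS = 10
TRIALS_PER_TASK = 50

# Every (task_id, trial_id) pair is evaluated exactly once.
eval_episodes = list(product(range(NUM_TASKS), range(TRIALS_PER_TASK)))
print(len(eval_episodes))  # 500


def success_rate(successes):
    """Fraction of evaluated episodes flagged as successful."""
    return sum(successes) / len(successes)
```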

Note

The motivation for choosing OpenSora as a world-model simulator comes from WMPO; we also referred to OpenSora when training the world model.

Evaluation results on LIBERO task groups using the OpenSora simulator
Model	Object	Spatial
OpenVLA-OFT (LoRA-base)	50.20%	51.61%
OpenVLA-OFT (RLinf-GRPO with OpenSora as world model simulator)	75.5%	64.5%
Improvement (percentage points)	+25.3	+12.9
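The improvement row is the absolute difference between the two success rates, in percentage points rather than relative percent. The short check below reproduces it from the table values; `gain_points` is a name chosen for this example.

```python
# (LoRA-base success %, RLinf-GRPO success %) per suite, from the table above.
results = {"Object": (50.20, 75.5), "Spatial": (51.61, 64.5)}


def gain_points(base, rl):
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(rl - base, 1)


gains = {suite: gain_points(base, rl) for suite, (base, rl) in results.items()}
print(gains)  # {'Object': 25.3, 'Spatial': 12.9}
```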