RL with Wan World Model#

This document provides a complete guide for launching and managing Vision-Language-Action Model (VLA) training in RLinf, using the action-conditioned Wan world model (hereafter, Wan) as the environment backend.

The main goal is to run closed-loop policy optimization without real robots or traditional physics simulators by using a video generation model to simulate environment dynamics conditioned on actions.

Similar to VLA finetuning in LIBERO, this guide focuses on running RL training in a Wan-based simulation environment and highlights key capabilities supported by this framework.

Together, the Wan world model and the VLA policy provide the following capabilities:

  1. Visual Understanding: Wan predicts future video frames from current observations and action sequences, providing continuous visual feedback for policy learning.

  2. Language Understanding: Understand natural-language task descriptions.

  3. Action Generation: Produce precise robot actions (position, rotation, gripper control).

  4. Policy Improvement: Use imagined trajectories generated by Wan to optimize VLA policies with RL methods such as PPO/GRPO.

Environment#

As a world model, Wan can in principle be fitted to many task settings while exposing a consistent environment interface. Using LIBERO as an example, the setup is:

Wan Simulating LIBERO

  • Environment: Visual generation model

  • Task: Control a 7-DoF robot arm to execute household manipulation skills (pick-and-place, stacking, opening drawers, spatial rearrangement, etc.)

  • Observation: Images generated by the world model

  • Action Space: 7D continuous actions

    • 3D end-effector position control (x, y, z)

    • 3D rotation control (roll, pitch, yaw)

    • Gripper control (open / close)
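To make the 7D action layout concrete, here is a minimal sketch that splits a flat action vector into its three parts. The `ArmAction` class and `split_action` helper are hypothetical names for illustration, not part of RLinf, and the ordering (position, then rotation, then gripper) is an assumption based on the order in which the components are listed above.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ArmAction:
    """Hypothetical container for one 7D action; not an RLinf class."""

    position: Tuple[float, float, float]  # end-effector (x, y, z) control
    rotation: Tuple[float, float, float]  # (roll, pitch, yaw) control
    gripper: float                        # gripper open/close command


def split_action(action: Sequence[float]) -> ArmAction:
    """Split a flat 7D action vector into position, rotation, and gripper parts.

    Assumed ordering: [x, y, z, roll, pitch, yaw, gripper].
    """
    if len(action) != 7:
        raise ValueError(f"expected 7 values, got {len(action)}")
    x, y, z, roll, pitch, yaw, grip = (float(v) for v in action)
    return ArmAction((x, y, z), (roll, pitch, yaw), grip)


example = split_action([0.1, -0.2, 0.0, 0.0, 0.05, -0.1, 1.0])
print(example.position, example.rotation, example.gripper)
```

Check the actual ordering and value ranges against the normalization statistics of the checkpoint you use before relying on a layout like this.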

Wan Environment Reset

Unlike conventional simulators that reset directly via reset(), Wan requires initialization frames and task descriptions. Therefore, you need to download and configure the initialization dataset in advance.
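The difference from a conventional simulator can be sketched with a toy environment: reset() has no physics state to restore, so it samples a stored initialization frame together with its task description. Everything below (`InitSample`, `ToyWorldModelEnv`, the fake frame update) is a hypothetical stand-in for illustration and is not the RLinf environment API.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Frame = List[List[float]]  # stand-in for an RGB image tensor


@dataclass
class InitSample:
    """One entry of a (hypothetical) initialization dataset."""

    frame: Frame
    task_description: str


class ToyWorldModelEnv:
    """Toy world-model environment; the real Wan backend generates video frames."""

    def __init__(self, init_samples: Sequence[InitSample], seed: int = 0):
        if not init_samples:
            raise ValueError("a world-model env needs at least one initialization sample")
        self._samples = list(init_samples)
        self._rng = random.Random(seed)
        self._frame: Frame = []
        self.task_description = ""

    def reset(self) -> Tuple[Frame, str]:
        # Pick a stored first frame and its instruction instead of resetting physics.
        sample = self._rng.choice(self._samples)
        self._frame = [row[:] for row in sample.frame]
        self.task_description = sample.task_description
        return self._frame, self.task_description

    def step(self, action_chunk: Sequence[Sequence[float]]) -> Tuple[Frame, float]:
        # A real world model would predict future frames conditioned on the actions.
        # This toy version only shifts pixel values so that the loop is runnable.
        shift = sum(sum(action) for action in action_chunk)
        self._frame = [[pixel + shift for pixel in row] for row in self._frame]
        reward = 0.0  # placeholder for a reward-model prediction in [0, 1]
        return self._frame, reward


env = ToyWorldModelEnv([InitSample([[0.0, 0.0]], "pick up the black bowl")])
frame, task = env.reset()
next_frame, reward = env.step([[0.1] * 7] * 8)  # one chunk of 8 seven-dimensional actions
```

The point of the sketch is the data dependency: without an initialization dataset there is nothing for reset() to return, which is why the dataset must be downloaded and configured first.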

Data Structure

  • Images: RGB tensors [batch_size, 256, 256, 3]

  • Task Descriptions: Natural-language instructions

  • Actions: Normalized continuous values that are tokenized for the policy

  • Rewards: Predicted by the world model reward classifier, range [0, 1]

Algorithm#

Core algorithmic components

  1. PPO (Proximal Policy Optimization)

    • GAE (Generalized Advantage Estimation)

    • Ratio-based policy clipping

    • Value clipping

    • Entropy regularization

  2. GRPO (Group Relative Policy Optimization)

    • For each state/prompt, sample G independent actions

    • Compute relative advantages using group mean reward as baseline

  3. Vision-Language-Action Model

    • OpenVLA architecture with multimodal fusion

    • Action tokenization/de-tokenization

    • Critic/value-head support
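The advantage and policy-loss components listed above can be sketched in plain Python. This is a minimal illustration using the textbook formulations, not RLinf's implementation: the function names, default hyperparameters, and the optional standard-deviation normalization in `grpo_advantages` are assumptions, and value clipping and entropy regularization are omitted for brevity.

```python
import math
from typing import List, Sequence


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for a single episode.

    `values` needs one more entry than `rewards`: the bootstrap value of the last state.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def grpo_advantages(group_rewards: Sequence[float],
                    normalize_std: bool = False, eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: each reward minus the mean reward of its group."""
    mean = sum(group_rewards) / len(group_rewards)
    centered = [r - mean for r in group_rewards]
    if not normalize_std:
        return centered
    std = math.sqrt(sum(c * c for c in centered) / len(group_rewards))
    return [c / (std + eps) for c in centered]


def ppo_clipped_loss(new_logps: Sequence[float], old_logps: Sequence[float],
                     advantages: Sequence[float], clip_eps: float = 0.2) -> float:
    """Mean PPO clipped surrogate loss (to be minimized)."""
    losses = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


# Four rollouts of the same task, two of which succeeded.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

With GRPO the group mean replaces the learned value baseline, so no critic is needed to compute advantages; the same clipped surrogate is then applied to the resulting advantages.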

Dependency Installation#

1. Clone RLinf#

# For better download speed in mainland China, you may use:
# git clone https://ghfast.top/github.com/RLinf/RLinf.git
git clone https://github.com/RLinf/RLinf.git
cd RLinf

2. Install dependencies#

Option 1: Docker image

Run experiments in Docker.

docker run -it --rm --gpus all \
   --shm-size 20g \
   --network host \
   --name rlinf \
   -v .:/workspace/RLinf \
   rlinf/rlinf:agentic-rlinf0.2-wan
   # For better image download speed in mainland China:
   # docker.1ms.run/rlinf/rlinf:agentic-rlinf0.2-wan

Option 2: Custom local environment

Install directly in your environment:

# For better dependency download speed in mainland China, add --use-mirror
bash requirements/install.sh embodied --model openvla-oft --env wan
source .venv/bin/activate

VLA Model Download#

Before training, download pretrained VLA checkpoints:

# Method 1: git clone
git lfs install
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-spatial-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-object-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-goal-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero10-traj1


# Method 2: huggingface-hub
# For better download speed in mainland China:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download Haozhan72/Openvla-oft-SFT-libero-spatial-traj1 --local-dir Openvla-oft-SFT-libero-spatial-traj1
hf download Haozhan72/Openvla-oft-SFT-libero-object-traj1 --local-dir Openvla-oft-SFT-libero-object-traj1
hf download Haozhan72/Openvla-oft-SFT-libero-goal-traj1 --local-dir Openvla-oft-SFT-libero-goal-traj1
hf download Haozhan72/Openvla-oft-SFT-libero10-traj1 --local-dir Openvla-oft-SFT-libero10-traj1

After downloading, make sure model_path and unnorm_key are set correctly in the YAML config.

rollout:
   model:
      model_path: /path/to/model/Openvla-oft-SFT-libero-spatial-traj1/
actor:
   model:
      model_path: /path/to/model/Openvla-oft-SFT-libero-spatial-traj1/
      unnorm_key: libero_spatial_no_noops # For RLinf-OpenVLAOFT-LIBERO-90-Base-Lora, use libero_90_no_noops_trajall; for RLinf-OpenVLAOFT-LIBERO-130-Base-Lora, use libero_130_no_noops_trajall

World Model (WM) Download#

Besides the VLA model, you also need Wan checkpoints and initialization data. RLinf currently provides data/checkpoints for three suites: libero-spatial, libero-object, and libero-goal. For each suite, Wan checkpoints are built from 1500 trajectories generated by VLA rollout.

# Method 1: git clone
git lfs install
git clone https://huggingface.co/RLinf/RLinf-Wan-LIBERO-Spatial
git clone https://huggingface.co/RLinf/RLinf-Wan-LIBERO-Object
git clone https://huggingface.co/RLinf/RLinf-Wan-LIBERO-Goal

# Method 2: huggingface-hub
# For better download speed in mainland China:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download RLinf/RLinf-Wan-LIBERO-Spatial --local-dir RLinf-Wan-LIBERO-Spatial
hf download RLinf/RLinf-Wan-LIBERO-Object --local-dir RLinf-Wan-LIBERO-Object
hf download RLinf/RLinf-Wan-LIBERO-Goal --local-dir RLinf-Wan-LIBERO-Goal

The directory structure of RLinf-Wan-LIBERO-Spatial is:

RLinf-Wan-LIBERO-Spatial/
    ├── dataset/                            # Initialization dataset for simulation
    │   ├── traj0.npy                       # Trajectories containing initial frame only
    │   ├── traj1.npy
    │   ├── ...
    │   ├── trajN.npy
    │   ├── traj0_kir.npy                   # Trajectories with pre-keyframe context
    │   ├── traj1_kir.npy
    │   ├── ...
    │   └── trajN_kir.npy
    ├── model-00001.safetensors             # World model checkpoint
    ├── resnet_rm.pth                       # Reward model checkpoint
    └── Wan2.2_VAE.pth                      # VAE checkpoint

After downloading, make sure the model paths are configured correctly in the YAML config.

env:
    train:
        wan_wm_hf_ckpt_path: /Pathto/model/RLinf-Wan-LIBERO-Spatial/
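Before launching training, it can help to check that the downloaded directory matches the layout shown above. The following is a small sketch, assuming the file names from that listing; `missing_wan_assets` is a hypothetical helper and not part of RLinf.

```python
from pathlib import Path
from typing import List

# File names taken from the directory listing of RLinf-Wan-LIBERO-Spatial.
REQUIRED_FILES = ("model-00001.safetensors", "resnet_rm.pth", "Wan2.2_VAE.pth")


def missing_wan_assets(ckpt_root: str) -> List[str]:
    """Return the expected entries that are absent under `ckpt_root`."""
    root = Path(ckpt_root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    dataset = root / "dataset"
    # The initialization dataset must contain at least one .npy trajectory file.
    if not dataset.is_dir() or not any(dataset.glob("*.npy")):
        missing.append("dataset/*.npy")
    return missing


# An empty list means every expected file was found.
print(missing_wan_assets("/Pathto/model/RLinf-Wan-LIBERO-Spatial/"))
```

Running this against the path you set in wan_wm_hf_ckpt_path catches incomplete git-lfs or hf downloads before the training job starts.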

Running the Script#

Before running commands below, ensure the correct Python virtual environment is activated. If you use the official Docker image, switch to openvla-oft with: source switch_env openvla-oft.

1. Key model parameters

For OpenVLA-OFT, configure actor.model as follows:

actor:
  model:
    model_path: "/path/to/model/Openvla-oft-SFT-libero-spatial-traj1/"    # SFT model path
    model_type: "openvla_oft"                                             # model type
    use_proprio: False                                                    # whether to use proprioception
    num_images_in_input: 1                                                # number of image inputs
    num_action_chunks: 8                                                  # number of action chunks
    unnorm_key: "libero_spatial_no_noops"                                 # normalization key (aligned with SFT). RLinf-OpenVLAOFT-LIBERO-130-Base-Lora uses libero_130_no_noops_trajall; RLinf-OpenVLAOFT-LIBERO-90-Base-Lora uses libero_90_no_noops_trajall.

Note: the world model here provides no proprioceptive state, renders no wrist-camera views, and uses a fixed action-chunk length. Therefore, use_proprio=False, num_images_in_input=1, and num_action_chunks=8 are the recommended defaults.

2. Environment configuration

Set key parameters in env config:

# Override in CHOSEN_CONFIG

# Recommended: wan_libero_spatial for train, libero_spatial for eval
env/train: wan_libero_spatial
env/eval: libero_spatial

# In env/train/wan_libero_spatial.yaml:
simulator_type: libero
task_suite_name: libero_spatial
# Whether to enable KeyFrame-Init Rollout
enable_kir: True
# Initialization dataset path for world model reset
initial_image_path: /Pathto/model/RLinf-Wan-LIBERO-Spatial/dataset
# VAE weights
VAE_path: /Pathto/model/RLinf-Wan-LIBERO-Spatial/Wan2.2_VAE.pth
# Pretrained world model weights
model_path: /Pathto/model/RLinf-Wan-LIBERO-Spatial/model-00001.safetensors
# Reward model
reward_model:
  type: ResnetRewModel
  from_pretrained: /Pathto/model/RLinf-Wan-LIBERO-Spatial/resnet_rm.pth

Key parameter notes in environment config:

  • enable_kir: Whether to enable KIR (KeyFrame-Init Rollout). If disabled, environment reset samples only .npy files whose names do not include _kir; if enabled, reset samples from all initialization files in dataset/.

  • reward_model.type: Reward model class. Multiple options are supported, including ResnetRewModel and TaskEmbedResnetRewModel.
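The effect of enable_kir on reset sampling, as described above, amounts to a file-name filter over the initialization dataset. A minimal sketch follows; `select_init_files` is a hypothetical helper written for illustration, and the actual RLinf loader may be organized differently.

```python
from pathlib import Path
from typing import List


def select_init_files(dataset_dir: str, enable_kir: bool) -> List[Path]:
    """List the initialization files that reset may sample from.

    With KIR enabled, every .npy file is a candidate. With KIR disabled,
    files whose names contain "_kir" are excluded.
    """
    files = sorted(Path(dataset_dir).glob("*.npy"))
    if enable_kir:
        return files
    return [f for f in files if "_kir" not in f.name]
```

For a dataset containing traj0.npy and traj0_kir.npy, select_init_files(path, enable_kir=False) would return only traj0.npy, while enable_kir=True would return both.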

3. Configuration files

Currently supported: OpenVLA-OFT + GRPO.

  • OpenVLA-OFT + GRPO: examples/embodiment/config/wan_libero_spatial_grpo_openvlaoft.yaml

4. Launch command

Run:

bash examples/embodiment/run_embodiment.sh CHOSEN_CONFIG

For example:

bash examples/embodiment/run_embodiment.sh wan_libero_spatial_grpo_openvlaoft

Visualization and Results#

1. TensorBoard

tensorboard --logdir ./logs --port 6006

2. Key metrics

  • Training metrics:

    • train/actor/approx_kl: approximate KL for policy update magnitude

    • train/actor/clip_fraction: fraction of samples whose probability ratio was clipped by PPO

    • train/actor/clipped_ratio: mean clipped probability ratio

    • train/actor/grad_norm: gradient norm

    • train/actor/lr: learning rate

    • train/actor/policy_loss: PPO/GRPO policy loss

    • train/critic/value_loss: value loss

    • train/critic/value_clip_ratio: clipped value-update fraction

    • train/critic/explained_variance: value fit quality (closer to 1 is better)

    • train/entropy_loss: policy entropy

    • train/loss: total loss

  • Rollout metrics:

    • rollout/advantages_max: max advantage

    • rollout/advantages_mean: mean advantage

    • rollout/advantages_min: min advantage

    • rollout/rewards: reward per action chunk (see line 414 of libero_env.py)

  • Environment metrics:

    • env/episode_len: episode length in steps

    • env/return: episodic return (in sparse LIBERO reward settings, mostly 0 until success)

    • env/reward: step-level reward (typically 0 except success terminal step)

    • env/success_once: recommended metric for tracking true success rate
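To help interpret several of the metrics above, the sketch below computes them from raw quantities using common definitions. These are assumptions for illustration: RLinf's exact formulas (for example, its KL estimator or how success is aggregated) may differ, and all function names are hypothetical.

```python
import math
from typing import Sequence


def clip_fraction(new_logps: Sequence[float], old_logps: Sequence[float],
                  clip_eps: float = 0.2) -> float:
    """Fraction of samples whose probability ratio falls outside [1 - eps, 1 + eps]."""
    ratios = [math.exp(n - o) for n, o in zip(new_logps, old_logps)]
    return sum(1 for r in ratios if abs(r - 1.0) > clip_eps) / len(ratios)


def approx_kl(new_logps: Sequence[float], old_logps: Sequence[float]) -> float:
    """Simple sample estimate of KL(old || new): mean of (old_logp - new_logp)."""
    return sum(o - n for n, o in zip(new_logps, old_logps)) / len(new_logps)


def explained_variance(values: Sequence[float], returns: Sequence[float]) -> float:
    """1 - Var(returns - values) / Var(returns); 1.0 means a perfect value fit."""
    n = len(returns)
    mean_ret = sum(returns) / n
    var_ret = sum((r - mean_ret) ** 2 for r in returns) / n
    if var_ret == 0.0:
        return float("nan")  # undefined when the returns are constant
    residuals = [r - v for r, v in zip(returns, values)]
    mean_res = sum(residuals) / n
    var_res = sum((e - mean_res) ** 2 for e in residuals) / n
    return 1.0 - var_res / var_ret


def success_once_rate(episode_success_flags: Sequence[Sequence[bool]]) -> float:
    """Share of episodes in which success was flagged at least once."""
    hits = sum(1 for flags in episode_success_flags if any(flags))
    return hits / len(episode_success_flags)
```

A rising clip_fraction together with a large approx_kl usually indicates that policy updates are too aggressive, whereas explained_variance near 0 suggests the critic is not yet predictive.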

3. Video generation

env:
   eval:
      video_cfg:
         save_video: True
         video_base_dir: ${runner.logger.log_path}/video/eval

4. Training logger integration

runner:
   task_type: embodied
   logger:
      log_path: "../results"
      project_name: rlinf
      experiment_name: "libero_10_grpo_openvlaoft"
      logger_backends: ["tensorboard"] # other options: wandb, swanlab

LIBERO Partial Results#

Current evaluation covers Wan simulation on LIBERO Spatial/Object/Goal suites. More environments are still under testing.

For each LIBERO suite, we evaluate all combinations of task_id and trial_id. Across Object, Spatial, and Goal suites, this totals 1500 environments (10 tasks x 150 trials).

Evaluation settings follow training configurations: for both SFT and RL-trained models, we use do_sample=True and temperature=1.6.
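The effect of do_sample=True with temperature=1.6 can be illustrated with temperature-scaled softmax sampling over action-token logits. This is a generic sketch of the technique, not the OpenVLA-OFT decoding code; the function names and example logits are made up for illustration.

```python
import math
import random
from typing import List, Optional, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: Sequence[float], temperature: float = 1.6,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token index from the temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 1.0))  # sharper distribution
print(softmax_with_temperature(logits, 1.6))  # flatter distribution, more exploration
```

A temperature above 1 flattens the distribution, so sampled evaluation is noisier than greedy decoding; comparing SFT and RL-trained models is fair only when both use the same sampling settings, as is done here.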

Note

Wan training and inference are built on top of the DiffSynth-Studio framework. In the evaluation results below, the world model is frozen and only serves RL training of the VLA model; the world model and the VLA are not co-evolved. Users can implement co-evolution themselves, which may yield further performance gains.

Evaluation results on LIBERO suites with Wan simulator#

| Model | Spatial | Object | Goal |
| --- | --- | --- | --- |
| OpenVLA-OFT (LoRA-base) | 61.2% | 36.7% | 48.2% |
| OpenVLA-OFT (RLinf-GRPO with Wan as world model) | 71.5% | 77.9% | 60.1% |
| Improvement | +10.3% | +41.2% | +11.9% |