RL with Wan World Model#
This document provides a complete guide for launching and managing Vision-Language-Action Model (VLA) training in RLinf, using the action-conditioned Wan world model (hereafter, Wan) as the environment backend.
The main goal is to run closed-loop policy optimization without real robots or traditional physics simulators by using a video generation model to simulate environment dynamics conditioned on actions.
Similar to VLA finetuning in LIBERO, this guide focuses on running RL training in a Wan-based simulation environment and highlights key capabilities supported by this framework.
Together, Wan (as the environment) and the VLA (as the policy) provide the following capabilities:
Visual Understanding: Wan predicts future video frames from current observations and action sequences, providing continuous visual feedback for policy learning.
Language Understanding: Understand natural-language task descriptions.
Action Generation: Produce precise robot actions (position, rotation, gripper control).
Policy Improvement: Use imagined trajectories generated by Wan to optimize VLA policies with RL methods such as PPO/GRPO.
Environment#
As a world model, Wan can in principle be fit to many task settings while exposing a consistent environment interface. Using LIBERO as an example, the setup is:
Wan Simulating LIBERO
Environment: Visual generation model
Task: Control a 7-DoF robot arm to execute household manipulation skills (pick-and-place, stacking, opening drawers, spatial rearrangement, etc.)
Observation: Images generated by the world model
Action Space: 7D continuous actions
  - 3D end-effector position control (x, y, z)
  - 3D rotation control (roll, pitch, yaw)
  - Gripper control (open / close)
Wan Environment Reset
Unlike conventional simulators that reset directly via reset(),
Wan requires initialization frames and task descriptions.
Therefore, you need to download and configure the initialization dataset in advance.
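The reset contract described above can be sketched as follows. This is a minimal illustration under stated assumptions, not RLinf's environment class: `InitSample`, `ToyWorldModelEnv`, and the placeholder `step` are hypothetical names introduced only to show that a world-model reset draws its start state from recorded initialization data.

```python
import random
from dataclasses import dataclass


@dataclass
class InitSample:
    """One reset seed (hypothetical): an initial frame plus its task description."""
    frame: list   # stand-in for an RGB image
    task: str     # natural-language instruction


class ToyWorldModelEnv:
    """Hypothetical stand-in for a world-model environment, not RLinf's class."""

    def __init__(self, init_dataset, seed=0):
        if not init_dataset:
            raise ValueError("a world-model env cannot reset without initialization data")
        self.init_dataset = list(init_dataset)
        self.rng = random.Random(seed)
        self.frame = None
        self.task = None

    def reset(self):
        # Unlike a physics simulator, the start state is sampled from recorded data.
        sample = self.rng.choice(self.init_dataset)
        self.frame, self.task = sample.frame, sample.task
        return {"image": self.frame, "task_description": self.task}

    def step(self, action):
        # A real world model would generate the next frame from (frame, action);
        # this placeholder simply returns the current frame unchanged.
        return {"image": self.frame, "task_description": self.task}


dataset = [InitSample(frame=[0, 0, 0], task="pick up the black bowl")]
env = ToyWorldModelEnv(dataset)
obs = env.reset()
```

The constructor fails fast when no initialization data is supplied, which mirrors why the dataset has to be downloaded and configured before training starts.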
Data Structure
Images: RGB tensors [batch_size, 256, 256, 3]
Task Descriptions: Natural-language instructions
Actions: Normalized continuous values that are tokenized for the policy
Rewards: Predicted by the world model reward classifier, range [0, 1]
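To make "normalized continuous values that are tokenized" concrete, the sketch below discretizes each action dimension into uniform bins and maps a token back to its bin center. The bin count (256), the [-1, 1] range, and the function names are assumptions chosen for illustration; the tokenizer actually used by the policy may differ.

```python
N_BINS = 256  # assumed number of action tokens per dimension


def tokenize(value, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map a normalized action value in [low, high] to a bin index."""
    clipped = min(max(value, low), high)
    width = (high - low) / n_bins
    # The upper bound itself would land in bin n_bins, so cap the index.
    return min(int((clipped - low) / width), n_bins - 1)


def detokenize(token, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width


# 7D action: position (x, y, z), rotation (roll, pitch, yaw), gripper.
action = [0.10, -0.25, 0.0, 0.5, -1.0, 1.0, 1.0]
tokens = [tokenize(a) for a in action]
recovered = [detokenize(t) for t in tokens]
```

The round trip is lossy: each recovered value can differ from the original by up to half a bin width.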
Algorithm#
Core algorithmic components
PPO (Proximal Policy Optimization)
  - GAE (Generalized Advantage Estimation)
  - Ratio-based policy clipping
  - Value clipping
  - Entropy regularization
GRPO (Group Relative Policy Optimization)
  - For each state/prompt, sample G independent actions
  - Compute relative advantages using group mean reward as baseline
Vision-Language-Action Model
  - OpenVLA architecture with multimodal fusion
  - Action tokenization/de-tokenization
  - Critic/value-head support
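The two policy-side ideas in the list above, a group-mean baseline and ratio clipping, can be sketched in a few lines. This is an illustrative sketch rather than RLinf's implementation: dividing by the group standard deviation is a common GRPO convention assumed here (the text above only specifies the mean baseline), and `clip_eps=0.2` is an assumed default.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each of the G samples relative to its group's mean reward."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumed normalization, see lead-in
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# G = 4 rollouts from the same initial state, with sparse success rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Because the baseline comes from the group itself, no value network is needed for the advantage. If every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient.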
Dependency Installation#
1. Clone RLinf#
# For better download speed in mainland China, you may use:
# git clone https://ghfast.top/github.com/RLinf/RLinf.git
git clone https://github.com/RLinf/RLinf.git
cd RLinf
2. Install dependencies#
Option 1: Docker image
Run experiments in Docker.
docker run -it --rm --gpus all \
--shm-size 20g \
--network host \
--name rlinf \
-v .:/workspace/RLinf \
rlinf/rlinf:agentic-rlinf0.2-wan
# For better image download speed in mainland China:
# docker.1ms.run/rlinf/rlinf:agentic-rlinf0.2-wan
Option 2: Custom local environment
Install directly in your environment:
# For better dependency download speed in mainland China, add --use-mirror
bash requirements/install.sh embodied --model openvla-oft --env wan
source .venv/bin/activate
VLA Model Download#
Before training, download pretrained VLA checkpoints:
# Method 1: git clone
git lfs install
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-spatial-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-object-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-goal-traj1
git clone https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero10-traj1
# Method 2: huggingface-hub
# For better download speed in mainland China:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download Haozhan72/Openvla-oft-SFT-libero-spatial-traj1 --local-dir Openvla-oft-SFT-libero-spatial-traj1
hf download Haozhan72/Openvla-oft-SFT-libero-object-traj1 --local-dir Openvla-oft-SFT-libero-object-traj1
hf download Haozhan72/Openvla-oft-SFT-libero-goal-traj1 --local-dir Openvla-oft-SFT-libero-goal-traj1
hf download Haozhan72/Openvla-oft-SFT-libero10-traj1 --local-dir Openvla-oft-SFT-libero10-traj1
After download, make sure model_path and unnorm_key are correctly set in yaml.
rollout:
  model:
    model_path: Pathto/RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora
actor:
  model:
    model_path: Pathto/RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora
    unnorm_key: libero_90_no_noops_trajall # For RLinf-OpenVLAOFT-LIBERO-130-Base-Lora, use libero_130_no_noops_trajall
WM (World Model) Model Download#
Besides the VLA model, you also need Wan checkpoints and initialization data.
RLinf currently provides data/checkpoints for three suites:
libero-spatial, libero-object, and libero-goal.
For each suite, the Wan checkpoint is trained on 1,500 trajectories collected from VLA rollouts.
# Method 1: git clone
git lfs install
git clone https://huggingface.co/RLinf/RLinf-Wan-LIBERO-Spatial
git clone https://huggingface.co/RLinf/RLinf-Wan-LIBERO-Object
git clone https://huggingface.co/RLinf/RLinf-Wan-LIBERO-Goal
# Method 2: huggingface-hub
# For better download speed in mainland China:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download RLinf/RLinf-Wan-LIBERO-Spatial --local-dir RLinf-Wan-LIBERO-Spatial
hf download RLinf/RLinf-Wan-LIBERO-Object --local-dir RLinf-Wan-LIBERO-Object
hf download RLinf/RLinf-Wan-LIBERO-Goal --local-dir RLinf-Wan-LIBERO-Goal
The directory structure of RLinf-Wan-LIBERO-Spatial is:
RLinf-Wan-LIBERO-Spatial/
├── dataset/                     # Initialization dataset for simulation
│   ├── traj0.npy                # Trajectories containing initial frame only
│   ├── traj1.npy
│   ├── ...
│   ├── trajN.npy
│   ├── traj0_kir.npy            # Trajectories with pre-keyframe context
│   ├── traj1_kir.npy
│   ├── ...
│   └── trajN_kir.npy
├── model-00001.safetensors      # World model checkpoint
├── resnet_rm.pth                # Reward model checkpoint
└── Wan2.2_VAE.pth               # VAE checkpoint
After download, make sure model paths are correctly configured in yaml.
env:
  train:
    wan_wm_hf_ckpt_path: /Pathto/model/RLinf-Wan-LIBERO-Spatial/
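Before launching a run, it can help to confirm that the downloaded directory matches the layout listed above. The helper below is a hypothetical convenience script, not part of RLinf; the file names are taken from the directory listing, and the demo uses a throwaway directory so it runs without any real download.

```python
import tempfile
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = ("model-00001.safetensors", "resnet_rm.pth", "Wan2.2_VAE.pth")


def missing_checkpoint_entries(ckpt_dir):
    """Return the expected entries that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    if not (root / "dataset").is_dir():
        missing.append("dataset/")
    return missing


# Demo: an empty directory reports everything missing, a populated one reports nothing.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    missing_before = missing_checkpoint_entries(root)
    (root / "dataset").mkdir()
    for name in EXPECTED_FILES:
        (root / name).touch()
    missing_after = missing_checkpoint_entries(root)
```

In practice you would call `missing_checkpoint_entries` on the path you set for `wan_wm_hf_ckpt_path` and stop if the returned list is not empty.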
Running the Script#
Before running commands below, ensure the correct Python virtual environment is activated.
If you use the official Docker image, switch to the openvla-oft environment by running:
source switch_env openvla-oft
1. Key model parameters
For OpenVLA-OFT, configure actor.model as follows:
actor:
  model:
    model_path: "/path/to/model/Openvla-oft-SFT-libero-spatial-traj1/" # SFT model path
    model_type: "openvla_oft" # model type
    use_proprio: False # whether to use proprioception
    num_images_in_input: 1 # number of image inputs
    num_action_chunks: 8 # number of action chunks
    # Normalization key (aligned with SFT).
    # RLinf-OpenVLAOFT-LIBERO-130-Base-Lora uses libero_130_no_noops_trajall;
    # RLinf-OpenVLAOFT-LIBERO-90-Base-Lora uses libero_90_no_noops_trajall.
    unnorm_key: "libero_spatial_no_noops"
Note: world model training here does not provide proprioception, does not render wrist views,
and uses fixed chunk length. Therefore, use_proprio=False, num_images_in_input=1,
and num_action_chunks=8 are recommended defaults.
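A small check like the one below can flag settings that drift from the recommended defaults in the note above. It is a hypothetical helper that operates on a plain dictionary; how the `actor.model` section is loaded into that dictionary is left to your own tooling.

```python
# Recommended values quoted from the note above.
RECOMMENDED = {"use_proprio": False, "num_images_in_input": 1, "num_action_chunks": 8}


def deviations(actor_model_cfg):
    """Map each key that differs from the recommendation to (actual, recommended)."""
    return {
        key: (actor_model_cfg.get(key), expected)
        for key, expected in RECOMMENDED.items()
        if actor_model_cfg.get(key) != expected
    }


# Example: proprioception was switched on, which the world model cannot supply.
cfg = {"use_proprio": True, "num_images_in_input": 1, "num_action_chunks": 8}
report = deviations(cfg)
```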
2. Environment configuration
Set key parameters in env config:
# Override in CHOSEN_CONFIG
# Recommended: wan_libero_spatial for train, libero_spatial for eval
env/train: wan_libero_spatial
env/eval: libero_spatial
# In env/train/wan_libero_spatial.yaml:
simulator_type: libero
task_suite_name: libero_spatial
# Whether to enable KeyFrame-Init Rollout
enable_kir: True
# Initialization dataset path for world model reset
initial_image_path: /Pathto/model/RLinf-Wan-LIBERO-Spatial/dataset
# VAE weights
VAE_path: /Pathto/model/RLinf-Wan-LIBERO-Spatial/Wan2.2_VAE.pth
# Pretrained world model weights
model_path: /Pathto/model/RLinf-Wan-LIBERO-Spatial/model-00001.safetensors
# Reward model
reward_model:
  type: ResnetRewModel
  from_pretrained: /Pathto/model/RLinf-Wan-LIBERO-Spatial/resnet_rm.pth
Key parameter notes in environment config:
enable_kir: Whether to enable KIR (KeyFrame-Init Rollout). If disabled, environment reset samples only .npy files whose names do not include _kir; if enabled, reset samples from all initialization files in dataset/.
reward_model.type: Reward model class. Multiple options are supported, including ResnetRewModel and TaskEmbedResnetRewModel.
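The file-selection rule for enable_kir can be written out directly. This is a sketch of the rule as described above, not RLinf's reset code; `candidate_init_files` is a hypothetical name, and the demo creates empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def candidate_init_files(dataset_dir, enable_kir):
    """List the .npy files a reset may sample from, following the rule above."""
    files = sorted(Path(dataset_dir).glob("*.npy"))
    if enable_kir:
        return files                                   # all initialization files
    return [f for f in files if "_kir" not in f.name]  # initial-frame-only files


with tempfile.TemporaryDirectory() as tmp:
    for name in ("traj0.npy", "traj0_kir.npy", "traj1.npy", "traj1_kir.npy"):
        (Path(tmp) / name).touch()
    without_kir = [f.name for f in candidate_init_files(tmp, enable_kir=False)]
    with_kir = [f.name for f in candidate_init_files(tmp, enable_kir=True)]
```

With KIR enabled the pool of starting points is larger, since resets can also begin from the trajectories that carry pre-keyframe context.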
3. Configuration files
Currently supported: OpenVLA-OFT + GRPO.
OpenVLA-OFT + GRPO:
examples/embodiment/config/wan_libero_spatial_grpo_openvlaoft.yaml
4. Launch command
Run:
bash examples/embodiment/run_embodiment.sh CHOSEN_CONFIG
For example:
bash examples/embodiment/run_embodiment.sh wan_libero_spatial_grpo_openvlaoft
Visualization and Results#
1. TensorBoard
tensorboard --logdir ./logs --port 6006
2. Key metrics
Training metrics:
train/actor/approx_kl: approximate KL for policy update magnitude
train/actor/clip_fraction: fraction of PPO clipping
train/actor/clipped_ratio: mean clipped probability ratio
train/actor/grad_norm: gradient norm
train/actor/lr: learning rate
train/actor/policy_loss: PPO/GRPO policy loss
train/critic/value_loss: value loss
train/critic/value_clip_ratio: clipped value-update fraction
train/critic/explained_variance: value fit quality (closer to 1 is better)
train/entropy_loss: policy entropy
train/loss: total loss
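Two of these metrics have widely used textbook definitions, sketched below to help with reading the curves. These are common formulations and are assumptions here; RLinf's exact computation (for example, masking or per-token averaging) may differ.

```python
import statistics


def explained_variance(values, returns):
    """1 - Var(returns - values) / Var(returns); 1.0 means a perfect value fit."""
    var_returns = statistics.pvariance(returns)
    if var_returns == 0:
        return float("nan")  # undefined when the returns are constant
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - statistics.pvariance(residuals) / var_returns


def clip_fraction(ratios, clip_eps=0.2):
    """Fraction of probability ratios that fall outside [1 - eps, 1 + eps]."""
    outside = [abs(r - 1.0) > clip_eps for r in ratios]
    return sum(outside) / len(outside)


perfect_fit = explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
constant_critic = explained_variance([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])
clipped = clip_fraction([1.0, 1.5, 0.5, 1.1])
```

A critic that predicts a constant explains none of the variance, and negative values indicate predictions worse than a constant. A clip fraction that stays high usually means the policy is moving too far per update.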
Rollout metrics:
rollout/advantages_max: max advantage
rollout/advantages_mean: mean advantage
rollout/advantages_min: min advantage
rollout/rewards: chunk of reward (refer to L414 in libero_env.py)
Environment metrics:
env/episode_len: episode length in steps
env/return: episodic return (in sparse LIBERO reward settings, mostly 0 until success)
env/reward: step-level reward (typically 0 except success terminal step)
env/success_once: recommended metric for tracking true success rate
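One plausible reading of env/success_once is "the episode reached success at least once", averaged over episodes. The sketch below implements that reading; it is an assumption based on the metric name, and the function names are hypothetical.

```python
def success_once(step_success):
    """True if any step of the episode was flagged as successful."""
    return any(step_success)


def success_rate(episodes):
    """Fraction of episodes that succeeded at least once."""
    return sum(success_once(ep) for ep in episodes) / len(episodes)


episodes = [
    [False, False, True],   # succeeds at the final step
    [False, False, False],  # never succeeds
    [False, True, False],   # succeeds once, then the rollout continues
    [False, False, False],
]
rate = success_rate(episodes)
```

Under this reading the metric does not depend on episode length or on where in the episode the success occurs, which is why it tracks task success more directly than the sparse return.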
3. Video generation
env:
  eval:
    video_cfg:
      save_video: True
      video_base_dir: ${runner.logger.log_path}/video/eval
4. Train Log Tool Integration
runner:
  task_type: embodied
  logger:
    log_path: "../results"
    project_name: rlinf
    experiment_name: "libero_10_grpo_openvlaoft"
    logger_backends: ["tensorboard"] # wandb, swanlab
LIBERO Partial Results#
Current evaluation covers Wan simulation on LIBERO Spatial/Object/Goal suites. More environments are still under testing.
For each LIBERO suite, we evaluate all combinations of task_id and trial_id.
Across Object, Spatial, and Goal suites, this totals 1500 environments
(10 tasks x 150 trials).
Evaluation settings follow training configurations:
for both SFT and RL-trained models, we use do_sample=True and temperature=1.6.
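For readers unfamiliar with these decoding settings, the sketch below shows what sampling with a temperature does to a token distribution. It is a generic illustration with made-up logits, not the model's decoding code: logits are divided by the temperature before the softmax, so a temperature above 1 flattens the distribution.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature); also return the probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return index, probs


logits = [2.0, 1.0, 0.0]                     # made-up action-token logits
rng = random.Random(0)
_, probs_t1 = sample_with_temperature(logits, 1.0, rng)
idx, probs_t16 = sample_with_temperature(logits, 1.6, rng)
```

At temperature 1.6 the most likely token receives less probability mass than at 1.0, so evaluation rollouts are more stochastic than greedy decoding would be.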
Note
Wan training and inference are built on top of the DiffSynth-Studio framework. In the evaluation results below, the world model is frozen and only serves RL training of the VLA model; there is no co-evolution between the world model and the VLA. Users can implement co-evolution themselves, which may yield further performance gains.
| Model | Spatial | Object | Goal |
|---|---|---|---|
| OpenVLA-OFT (LoRA-base) | 61.2% | 36.7% | 48.2% |
| OpenVLA-OFT (RLinf-GRPO with Wan as world model) | 71.5% | 77.9% | 60.1% |
| Improvement | +10.3% | +41.2% | +11.9% |
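The Improvement row is the absolute difference in success rate (percentage points), not a relative change. The short check below recomputes it from the two rows of the table; the dictionary names are introduced here for illustration only.

```python
# Success rates (%) copied from the table above.
baseline = {"Spatial": 61.2, "Object": 36.7, "Goal": 48.2}
rl_trained = {"Spatial": 71.5, "Object": 77.9, "Goal": 60.1}

# Absolute gain in percentage points, rounded to one decimal like the table.
gain = {suite: round(rl_trained[suite] - baseline[suite], 1) for suite in baseline}
```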