ManiSkill OOD Evaluation#
ManiSkill OOD evaluation measures how well VLA policies generalize to out-of-distribution ManiSkill scenes. It is built on the Put-on-Plate task family (placing a carrot on a plate) and follows the OOD test protocol from rl4vla, with scenes grouped into Vision, Semantic, and Execution categories.
Related training doc: RL with ManiSkill Benchmark
Environment Setup#
bash requirements/install.sh embodied --model openvla-oft --env maniskill_libero
source .venv/bin/activate
evaluations/maniskill/ currently ships only an OpenVLA-OFT example config. OpenVLA, OpenPI, and other models have ManiSkill training configs but no dedicated eval YAML yet — see Advanced Usage below.
Download ManiSkill assets if not already present:
cd rlinf/envs/maniskill
# For faster downloads in China, you can set:
# export HF_ENDPOINT=https://hf-mirror.com
hf download --repo-type dataset RLinf/maniskill_assets --local-dir ./assets
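Before launching an evaluation, it can be useful to confirm that the download actually populated the asset directory. The following is a minimal sketch, assuming the assets live in `rlinf/envs/maniskill/assets` relative to the repository root; `assets_present` is a hypothetical helper for illustration, not part of RLinf.

```python
from pathlib import Path


def assets_present(repo_root: str = ".") -> bool:
    """Return True if the ManiSkill asset directory exists and is non-empty.

    The directory layout is taken from the download command above;
    this helper itself is hypothetical and not shipped with RLinf.
    """
    assets_dir = Path(repo_root) / "rlinf" / "envs" / "maniskill" / "assets"
    # any() stops at the first entry, so large asset trees are not fully listed.
    return assets_dir.is_dir() and any(assets_dir.iterdir())
```

Calling `assets_present()` from the repository root returns `False` when the directory is missing or empty, which is a cheap check to run before a long evaluation job.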
Models and Checkpoints#
OpenVLA-OFT evaluation typically requires two weight sources:
Base model
rollout.model.model_path: e.g. RLinf/Openvla-oft-SFT-libero10-trajall
ManiSkill LoRA
rollout.model.lora_path: e.g. RLinf/RLinf-OpenVLAOFT-ManiSkill-Base-Lora
To evaluate an RL-trained policy, pass the .pt checkpoint via runner.ckpt_path (single runs) or the CKPT_PATH environment variable (batch mode); the checkpoint then overrides the model initialization.
Example Configs#
Available under evaluations/maniskill/:
| Config file | Description | Model |
|---|---|---|
| maniskill_ood_openvlaoft_eval.yaml | OOD generalization template (default: training scene) | OpenVLA-OFT |
End-to-End Workflow#
Step 1: Activate the environment
source .venv/bin/activate
Step 2: Edit the config
Copy or edit the target YAML and set at least rollout.model.model_path and rollout.model.lora_path. See Configuration Reference (env.eval Field Reference) for general env.eval fields; see Evaluation Configuration below for ManiSkill scene selection and protocol.
Step 3: Launch evaluation
bash evaluations/run_eval.sh maniskill maniskill_ood_openvlaoft_eval \
rollout.model.model_path=/path/to/model \
rollout.model.lora_path=/path/to/lora
Step 4: Check results
The terminal prints eval/success_once; see Logs and Results for logs and videos.
Evaluation Configuration#
ManiSkill evaluation selects scenes via id (environment ID) and obj_set (object split) under env.eval.init_params, and reports eval/success_once (fraction of trajectories with at least one success).
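The definition of eval/success_once can be illustrated with a short sketch. It assumes each trajectory is available as a list of per-step success flags; that input layout and the function name are assumptions for illustration, not RLinf's metric code.

```python
from typing import Sequence


def success_once(per_step_success: Sequence[Sequence[bool]]) -> float:
    """Fraction of trajectories that report success on at least one step."""
    if not per_step_success:
        raise ValueError("need at least one trajectory")
    hits = sum(1 for traj in per_step_success if any(traj))
    return hits / len(per_step_success)


trajectories = [
    [False, False, True, False],   # succeeds once, then loses it: still counts
    [False, False, False, False],  # never succeeds
    [False, True, True, True],
    [False, False, False, True],
]
print(success_once(trajectories))  # 0.75
```

Because a single successful step is enough, a trajectory that reaches the goal and later disturbs the object still counts as a success under this metric.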
Protocol Overview#
RLinf’s ManiSkill OOD protocol matches rl4vla for fair comparison with published results.
In-distribution: PutOnPlateInScene25Main-v3 + obj_set=train (the plate-25-main training task);
Out-of-distribution: 13 variant environments + obj_set=test, split into Vision / Semantic / Execution;
Supplementary runs: the mani-ood mode also runs 3 Semantic tasks with obj_set=train.
Each trajectory is identified by episode_id (i.e. reset_state_id), which fixes the object, plate, pose, and (for some scenes) visual perturbation. With use_fixed_reset_state_ids=True, the env loads deterministic initial conditions from episode_id; with auto_reset=True, the next episode_id is assigned sequentially after each episode.
OOD Scene List#
| Category | Environment ID (id) | Description |
|---|---|---|
| Vision |  | Background image perturbation |
| Vision |  | Texture perturbation (strength 0.3 / 0.5) |
| Vision |  | Whole-scene visual perturbation (strength 0.3 / 0.5) |
| Semantic |  | Unseen carrot objects |
| Semantic |  | Unseen plates |
| Semantic |  | Changed language instructions |
| Semantic |  | Multiple carrots / multiple plates |
| Execution |  | Changed object initial positions |
| Execution |  | Changed robot initial pose |
| Execution |  | Dynamic target position changes |
Key Environment Parameters#
These fields live under env.eval.init_params:
| Field | Purpose |
|---|---|
| id | ManiSkill registered env name; selects the OOD variant (see table above). The default template uses PutOnPlateInScene25Main-v3. |
| obj_set | Object split: train or test. |
|  | Observation mode; VLA eval uses |
| sim_backend | Simulation backend; default |
|  | Action-space setup; OpenVLA-OFT uses |
See env.eval Field Reference in Configuration Reference for general env.eval fields (total_num_envs, max_episode_steps, auto_reset, etc.). ManiSkill eval examples typically use max_episode_steps=80, max_steps_per_rollout_epoch=80, and ignore_terminations=True.
OpenVLA-OFT Model Fields#
In addition to model_path, maniskill_ood_openvlaoft_eval.yaml requires:
rollout:
  model:
    model_type: openvla_oft
    unnorm_key: bridge_orig
    is_lora: True
    lora_path: /path/to/RLinf-OpenVLAOFT-ManiSkill-Base-Lora
    add_value_head: True
    max_prompt_length: 30
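A quick pre-flight check can catch a missing field before the job starts. The sketch below assumes the rollout.model section has already been loaded into a plain dictionary; `missing_model_fields` and the field tuple are hypothetical helpers built from the fields listed above, not an RLinf API.

```python
from typing import List, Mapping

# Fields named in this section; the tuple itself is an illustrative assumption.
REQUIRED_MODEL_FIELDS = (
    "model_path",
    "model_type",
    "unnorm_key",
    "is_lora",
    "lora_path",
    "add_value_head",
    "max_prompt_length",
)


def missing_model_fields(model_cfg: Mapping[str, object]) -> List[str]:
    """Return the required rollout.model fields that are absent or empty."""
    return [
        key
        for key in REQUIRED_MODEL_FIELDS
        if model_cfg.get(key) is None or model_cfg.get(key) == ""
    ]


model_cfg = {
    "model_path": "/path/to/model",
    "model_type": "openvla_oft",
    "unnorm_key": "bridge_orig",
    "is_lora": True,
    "add_value_head": True,
    "max_prompt_length": 30,
}
print(missing_model_fields(model_cfg))  # ['lora_path']
```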
Single-Scene Evaluation#
Override init_params via Hydra to evaluate the default training scene or any OOD scene.
In-distribution (plate-25-main)
bash evaluations/run_eval.sh maniskill maniskill_ood_openvlaoft_eval \
env.eval.init_params.id=PutOnPlateInScene25Main-v3 \
env.eval.init_params.obj_set=train \
rollout.model.model_path=/path/to/model \
rollout.model.lora_path=/path/to/lora
Single OOD scene (Vision example)
bash evaluations/run_eval.sh maniskill maniskill_ood_openvlaoft_eval \
env.eval.init_params.id=PutOnPlateInScene25VisionImage-v1 \
env.eval.init_params.obj_set=test \
rollout.model.model_path=/path/to/model \
runner.ckpt_path=/path/to/checkpoint.pt
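When many single-scene runs are scripted, the command line can be assembled from a dictionary of Hydra overrides. This is a minimal sketch that only reproduces the invocation shape shown above; `build_eval_command` is a hypothetical helper, and the paths are the same placeholders used in the examples.

```python
import shlex
from typing import Mapping


def build_eval_command(config_name: str, overrides: Mapping[str, str]) -> str:
    """Build a run_eval.sh command line with key=value Hydra overrides."""
    args = ["bash", "evaluations/run_eval.sh", "maniskill", config_name]
    args += [f"{key}={value}" for key, value in overrides.items()]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


cmd = build_eval_command(
    "maniskill_ood_openvlaoft_eval",
    {
        "env.eval.init_params.id": "PutOnPlateInScene25VisionImage-v1",
        "env.eval.init_params.obj_set": "test",
        "rollout.model.model_path": "/path/to/model",
        "runner.ckpt_path": "/path/to/checkpoint.pt",
    },
)
print(cmd)
```

Quoting through `shlex.join` keeps paths with spaces intact when the string is later passed to a shell.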
Covering the Full Test Set#
total_num_trials per scene depends on object count, plate count, and pose combinations. With sufficient resources, increase total_num_envs; when resources are limited, set max_steps_per_rollout_epoch to a multiple of max_episode_steps under auto_reset=True so each rollout_epoch serially covers more episode_id values:
env:
  eval:
    total_num_envs: 16
    max_episode_steps: 80
    max_steps_per_rollout_epoch: 320 # 4 * 80; ~4 * total_num_envs trajectories per epoch
    auto_reset: True
    ignore_terminations: True
    use_fixed_reset_state_ids: True
    rollout_epoch: 1
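The arithmetic behind these settings can be sketched as follows. The sketch assumes that, with ignore_terminations=True, every episode runs for exactly max_episode_steps, so each environment completes a whole number of episodes per rollout epoch. The helper names and the trial count of 200 are made up for illustration.

```python
import math


def trajectories_per_epoch(total_num_envs: int,
                           max_episode_steps: int,
                           max_steps_per_rollout_epoch: int) -> int:
    """Trajectories collected in one rollout epoch across all parallel envs."""
    if max_steps_per_rollout_epoch % max_episode_steps != 0:
        raise ValueError(
            "max_steps_per_rollout_epoch should be a multiple of max_episode_steps"
        )
    episodes_per_env = max_steps_per_rollout_epoch // max_episode_steps
    return total_num_envs * episodes_per_env


def rollout_epochs_needed(total_num_trials: int, per_epoch: int) -> int:
    """Smallest number of rollout epochs that covers every trial at least once."""
    return math.ceil(total_num_trials / per_epoch)


per_epoch = trajectories_per_epoch(16, 80, 320)
print(per_epoch)                              # 64
print(rollout_epochs_needed(200, per_epoch))  # 4 (200 is a hypothetical trial count)
```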
Batch OOD Evaluation (mani-ood mode)#
The mani-ood mode runs evaluation on all 13 OOD scenes (obj_set=test) plus 3 Semantic scenes (obj_set=train), for 16 runs total — matching the full OOD protocol in the training docs.
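The 16-run structure can be written down as a list of (environment ID, obj_set) pairs. The sketch below uses placeholder environment IDs, since which three Semantic scenes are rerun with obj_set=train is decided by the batch script; `build_ood_runs` is a hypothetical helper, not part of run_eval.sh.

```python
from typing import List, Sequence, Tuple


def build_ood_runs(ood_env_ids: Sequence[str],
                   semantic_train_env_ids: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair every OOD scene with obj_set=test, then add the obj_set=train reruns."""
    runs = [(env_id, "test") for env_id in ood_env_ids]
    runs += [(env_id, "train") for env_id in semantic_train_env_ids]
    return runs


# Placeholder IDs for illustration only; substitute the registered env names.
ood_ids = [f"OodScene{i:02d}-v1" for i in range(13)]
semantic_ids = ood_ids[5:8]  # stand-ins for the 3 Semantic scenes

runs = build_ood_runs(ood_ids, semantic_ids)
print(len(runs))  # 16
```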
Required environment variables
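| Variable | Description |
|---|---|
| EVAL_NAME | Batch eval name; logs go to logs/eval/<EVAL_NAME>/ |
| CKPT_PATH | Path to RL-trained .pt checkpoint |
| TOTAL_NUM_ENVS | Parallel env count; maps to env.eval.total_num_envs |
| EVAL_ROLLOUT_EPOCH | Eval epochs; maps to rollout_epoch |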
export EVAL_NAME=my_ood_eval
export CKPT_PATH=/path/to/checkpoint.pt
export TOTAL_NUM_ENVS=16
export EVAL_ROLLOUT_EPOCH=1
bash evaluations/run_eval.sh mani-ood maniskill_ood_openvlaoft_eval
Batch logs: logs/eval/<EVAL_NAME>/<timestamp>-<env_id>-<obj_set>/run_ppo.log
The mani-ood mode sets HF_ENDPOINT automatically (default https://hf-mirror.com); override it before running if needed.
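To collect results from a batch run, the per-run directory name from the log path pattern `<timestamp>-<env_id>-<obj_set>` can be split back into its parts. This is a sketch under stated assumptions: the environment ID is taken to end in `-v<number>`, obj_set is taken to be `train` or `test`, and the timestamp format in the example is invented. `parse_run_dir` is a hypothetical helper.

```python
import re
from typing import NamedTuple


class RunDir(NamedTuple):
    timestamp: str
    env_id: str
    obj_set: str


# Assumed shape: <timestamp>-<Name>-v<N>-<train|test>; the timestamp may
# itself contain hyphens, so the env ID and obj_set are anchored at the end.
_RUN_DIR = re.compile(
    r"^(?P<timestamp>.+)-(?P<env_id>[A-Za-z0-9]+-v\d+)-(?P<obj_set>train|test)$"
)


def parse_run_dir(name: str) -> RunDir:
    """Split a batch-eval run directory name into timestamp, env ID, and obj_set."""
    match = _RUN_DIR.match(name)
    if match is None:
        raise ValueError(f"unrecognized run directory name: {name!r}")
    return RunDir(match["timestamp"], match["env_id"], match["obj_set"])


# The timestamp below is a made-up example, not the format run_eval.sh emits.
print(parse_run_dir("20250101-120000-PutOnPlateInScene25Main-v3-train"))
```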
Advanced Usage#
Deriving eval configs from training
ManiSkill also supports training tasks such as PickCube-v1 and PutCarrotOnPlateInScene-v2 (see examples/embodiment/config/env/), but there are no dedicated eval YAMLs yet. Copy the training config and set:
runner.task_type: embodied_eval
runner.only_eval: True
Remove training sections (algorithm, actor, etc.) and keep env.eval and rollout
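The conversion steps above can be sketched on a config that has been loaded into a plain dictionary. This is a minimal illustration, assuming the training sections are top-level keys named `algorithm` and `actor`; a real training config may contain other training-only sections, and `derive_eval_config` is a hypothetical helper.

```python
import copy
from typing import Any, Dict, Iterable


def derive_eval_config(train_cfg: Dict[str, Any],
                       training_sections: Iterable[str] = ("algorithm", "actor"),
                       ) -> Dict[str, Any]:
    """Return an eval-only copy of a training config without mutating the input."""
    eval_cfg = copy.deepcopy(train_cfg)
    for section in training_sections:
        eval_cfg.pop(section, None)  # drop training-only sections if present
    runner = eval_cfg.setdefault("runner", {})
    runner["task_type"] = "embodied_eval"
    runner["only_eval"] = True
    return eval_cfg


# Toy training config; keys and values are placeholders for illustration.
train_cfg = {
    "runner": {"task_type": "embodied"},
    "algorithm": {"name": "ppo"},
    "actor": {"lr": 1e-5},
    "env": {"eval": {"total_num_envs": 16}},
    "rollout": {"model": {"model_path": "/path/to/model"}},
}
eval_cfg = derive_eval_config(train_cfg)
print(sorted(eval_cfg))  # ['env', 'rollout', 'runner']
```

Working on a deep copy leaves the training config untouched, so the same loaded dictionary can still be used for training.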
Adjust parallelism
bash evaluations/run_eval.sh maniskill maniskill_ood_openvlaoft_eval \
env.eval.total_num_envs=32 \
rollout.model.model_path=/path/to/model
Load RL checkpoint
bash evaluations/run_eval.sh maniskill maniskill_ood_openvlaoft_eval \
runner.ckpt_path=/path/to/checkpoint.pt \
rollout.model.model_path=/path/to/model
FAQ#
Asset path: Ensure ManiSkill assets are downloaded to rlinf/envs/maniskill/assets.
GPU simulation: sim_backend: gpu requires an NVIDIA GPU; run_eval.sh sets MUJOCO_GL=osmesa etc. for headless environments.
LoRA path: OpenVLA-OFT eval requires lora_path; without it the ManiSkill policy cannot load correctly.
Checkpoint: Batch mode passes .pt weights via CKPT_PATH; single runs use runner.ckpt_path.
Scene selection: The default YAML points to the training scene PutOnPlateInScene25Main-v3; for OOD scenes, explicitly override env.eval.init_params.id and obj_set, or use mani-ood mode.