LIBERO Evaluation#
LIBERO is a robotic manipulation simulation benchmark built on robosuite (MuJoCo), with suites including Spatial, Object, Goal, and Long. RLinf supports parallel VLA policy evaluation on LIBERO with task-level success metrics.
Related training docs: RL with LIBERO Benchmarks, LIBERO-Pro & LIBERO-Plus
Environment Setup#
bash requirements/install.sh embodied --model openpi --env libero
source .venv/bin/activate
With --env libero, the installer clones LIBERO into .venv/libero (or reuses an existing checkout when LIBERO_PATH is set) and appends it to PYTHONPATH in .venv/bin/activate.
Supported models include openpi, openvla-oft, starvla, and dreamzero — replace --model accordingly during installation.
Example Configs#
Available under evaluations/libero/:
Config file |
Task suite |
Model |
|---|---|---|
|
Spatial |
π₀.₅ |
|
Spatial |
StarVLA |
|
Spatial |
DreamZero |
|
Object |
π₀.₅ |
|
Object |
OpenVLA-OFT |
|
Goal |
π₀ |
|
Goal |
OpenVLA-OFT |
|
Long (libero_10) |
π₀.₅ |
|
Long (libero_10) |
OpenVLA-OFT |
End-to-End Workflow#
Step 1: Activate the environment
source .venv/bin/activate
Step 2: Edit the config
Copy or edit the target YAML and set at least rollout.model.model_path. See Configuration Reference (env.eval Field Reference) for env.eval field descriptions; see Evaluation Configuration below for the LIBERO eval protocol and suite-specific settings.
Step 3: Launch evaluation
bash evaluations/run_eval.sh libero libero_spatial_openpi_pi05_eval \
rollout.model.model_path=/path/to/model
Step 4: Check results
The terminal prints eval/success_once; see Logs and Results for logs.
Evaluation Configuration#
LIBERO evaluation runs one trajectory per (task_id, trial_id) pair in the suite and reports eval/success_once (fraction of trajectories with at least one success). The fields below are all under env.eval and together control parallelism, trajectory length, and test-set coverage.
Evaluation Protocol#
LIBERO provides a fixed set of initial states per task (task_suite.get_task_init_states(task_id), loaded from .pruned_init files).
The official repo defines four standard eval suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long (libero_10), each with 10 tasks and ~50 initial states per task—~500 trajectories to fully evaluate one suite.
RLinf evaluations/libero/ examples cover these four task_suite_name values: libero_spatial, libero_object, libero_goal, and libero_10.
In RLinf’s LiberoEnv, each eval trajectory is uniquely identified by (task_id, trial_id):
task_id: task index within the currenttask_suite_name(0 … n_tasks-1), determining the language instruction and BDDL scene;trial_id: initial-state index for that task, loaded viaget_task_init_states(task_id)[trial_id]into the MuJoCo configuration.
Internally, trials from all tasks are concatenated into a global reset_state_id, which is decoded back into task_id and trial_id.
In eval mode (is_eval=True), all reset_state_id values are traversed in interleaved order—(task0, trial0), (task1, trial0), …, (task0, trial1), …—so parallel envs advance trials evenly across tasks; on auto_reset, the next reset_state_id is assigned in order.
One rollout_epoch should cover every (task_id, trial_id) pair in the suite. Two approaches:
High parallelism: Set
total_num_envs≥ total init states andmax_steps_per_rollout_epoch = max_episode_stepsso each parallel env runs exactly one trajectory (seelibero_spatial_openpi_pi05_eval.yaml).Auto-reset: With
auto_reset=True, finished episodes immediately load the next init state. Setmax_steps_per_rollout_epochto N ×max_episode_stepsto evaluate roughlyN × total_num_envstrajectories per epoch (seelibero_spatial_dreamzero_eval.yaml).
Per-Suite max_episode_steps Reference#
The longest training demo per suite sets a lower bound for the step limit. RLinf eval YAML values vary by model action frequency but should be ≥ this bound and match the training config:
Suite |
Lower bound |
RLinf example values |
Example config |
|---|---|---|---|
|
220 |
240 / 480 |
|
|
280 |
280 / 512 |
|
|
300 |
320 / 512 |
|
|
520 |
520 |
|
See env.eval Field Reference in Configuration Reference for detailed env.eval field descriptions.
Covering the Full Test Set#
Let S = total init states in the suite, E = total_num_envs, T = max_episode_steps.
Option 1: High parallelism (auto_reset optional)#
env:
eval:
total_num_envs: 500 # S for Spatial / Object / Goal / Long
max_episode_steps: 240
max_steps_per_rollout_epoch: 240 # equals max_episode_steps
auto_reset: True # optional when E >= S
rollout_epoch: 1
Option 2: Auto-reset (recommended when memory is limited)
env:
eval:
total_num_envs: 128
max_episode_steps: 480
# N = ceil(S / E); Spatial: ceil(500/128) = 4
max_steps_per_rollout_epoch: 1920 # N * max_episode_steps = 4 * 480
auto_reset: True
ignore_terminations: True
use_fixed_reset_state_ids: True
use_ordered_reset_state_ids: True
rollout_epoch: 1
Trajectories per rollout_epoch ≈ N × total_num_envs where N = max_steps_per_rollout_epoch / max_episode_steps. For example, Spatial (S=500) with E=128 requires N = ceil(500/128) = 4.
Multi-epoch averaging
env:
eval:
rollout_epoch: 2 # same seed, two passes, metrics averaged
Notes#
max_steps_per_rollout_epochmust be divisible byrollout.model.num_action_chunks; startup validation will fail otherwise.Env workers use seed offset
seed + rank × stage_num + stage_idso each worker receives a distinct init-state subset.eval/success_oncein the terminal is the success rate over completed trajectories; withauto_reset, metrics are recorded only when a new episode finishes, avoiding double counting.
Advanced Usage#
LIBERO-PRO
export LIBERO_TYPE=pro
export LIBERO_PERTURBATION=all
bash evaluations/run_eval.sh libero libero_10_openvlaoft_eval
LIBERO-PLUS
export LIBERO_TYPE=plus
export LIBERO_SUFFIX=all
bash evaluations/run_eval.sh libero libero_10_openvlaoft_eval
Adjust parallel scale
bash evaluations/run_eval.sh libero libero_spatial_openpi_pi05_eval \
env.eval.total_num_envs=64 \
rollout.model.model_path=/path/to/model
FAQ#
Rendering issues: On headless systems, try
export MUJOCO_GL=osmesaandexport PYOPENGL_PLATFORM=osmesa(run_eval.shsets these by default).Test coverage: See Evaluation Configuration above; the key is coordinating
total_num_envs,auto_reset, andmax_steps_per_rollout_epoch.