LIBERO Evaluation

LIBERO is a robotic manipulation simulation benchmark built on robosuite (MuJoCo), with suites including Spatial, Object, Goal, and Long. RLinf supports parallel VLA policy evaluation on LIBERO with task-level success metrics.

Related training docs: RL with LIBERO Benchmarks, LIBERO-Pro & LIBERO-Plus

Environment Setup

bash requirements/install.sh embodied --model openpi --env libero
source .venv/bin/activate

With --env libero, the installer clones LIBERO into .venv/libero (or reuses an existing checkout when LIBERO_PATH is set) and appends it to PYTHONPATH in .venv/bin/activate.
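A quick way to confirm that the PYTHONPATH change took effect is to check whether the package resolves from the activated environment. This is a minimal sketch, not part of RLinf; it assumes the checkout exposes a top-level module named `libero`, and `is_importable` is a hypothetical helper name.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the LIBERO checkout exposes a top-level module called "libero".
print("libero importable:", is_importable("libero"))
```

If this prints False after `source .venv/bin/activate`, inspecting PYTHONPATH in the current shell is a reasonable next step.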

Supported models include openpi, openvla-oft, starvla, and dreamzero; pass the matching value to --model during installation.

Example Configs

Available under evaluations/libero/:

Config file	Task suite	Model
`libero_spatial_openpi_pi05_eval.yaml`	Spatial	π₀.₅
`libero_spatial_starvla_eval.yaml`	Spatial	StarVLA
`libero_spatial_dreamzero_eval.yaml`	Spatial	DreamZero
`libero_object_openpi_pi05_eval.yaml`	Object	π₀.₅
`libero_object_openvlaoft_eval.yaml`	Object	OpenVLA-OFT
`libero_goal_openpi_eval.yaml`	Goal	π₀
`libero_goal_openvlaoft_eval.yaml`	Goal	OpenVLA-OFT
`libero_10_openpi_pi05_eval.yaml`	Long (libero_10)	π₀.₅
`libero_10_openvlaoft_eval.yaml`	Long (libero_10)	OpenVLA-OFT

End-to-End Workflow

Step 1: Activate the environment

source .venv/bin/activate

Step 2: Edit the config

Copy or edit the target YAML and set at least rollout.model.model_path. For descriptions of the env.eval fields, see env.eval Field Reference in Configuration Reference; for the LIBERO eval protocol and suite-specific settings, see Evaluation Configuration below.

Step 3: Launch evaluation

bash evaluations/run_eval.sh libero libero_spatial_openpi_pi05_eval \
  rollout.model.model_path=/path/to/model

Step 4: Check results

The terminal prints eval/success_once; see Logs and Results for logs.

Evaluation Configuration

LIBERO evaluation runs one trajectory per (task_id, trial_id) pair in the suite and reports eval/success_once (fraction of trajectories with at least one success). The fields below are all under env.eval and together control parallelism, trajectory length, and test-set coverage.
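The metric definition above can be illustrated with a small sketch. It is not RLinf's implementation; the per-step boolean layout and the function name `success_once` are assumptions made for illustration only.

```python
from typing import Sequence


def success_once(trajectories: Sequence[Sequence[bool]]) -> float:
    """Fraction of trajectories that contain at least one successful step."""
    if not trajectories:
        raise ValueError("need at least one completed trajectory")
    hits = sum(1 for steps in trajectories if any(steps))
    return hits / len(trajectories)


# Hypothetical per-step success flags for four finished trajectories.
per_step_success = [
    [False, False, True],   # succeeds on the last step
    [False, False, False],  # never succeeds
    [True, False, False],   # succeeds early; still counted once
    [False, True, True],    # several successful steps; counted once
]
rate = success_once(per_step_success)
print(rate)  # 0.75
```

A trajectory with several successful steps contributes the same as one with a single successful step, which is what "once" refers to.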

Evaluation Protocol

LIBERO provides a fixed set of initial states per task (task_suite.get_task_init_states(task_id), loaded from .pruned_init files). The official repo defines four standard eval suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long (libero_10). Each suite has 10 tasks with about 50 initial states per task, so fully evaluating one suite takes about 500 trajectories. The examples under evaluations/libero/ cover the four corresponding task_suite_name values: libero_spatial, libero_object, libero_goal, and libero_10.

In RLinf’s LiberoEnv, each eval trajectory is uniquely identified by (task_id, trial_id):

task_id: task index within the current task_suite_name (0 … n_tasks-1), determining the language instruction and BDDL scene;
trial_id: initial-state index for that task, loaded via get_task_init_states(task_id)[trial_id] into the MuJoCo configuration.

Internally, the trials of all tasks are concatenated into a single global index, reset_state_id, which is decoded back into task_id and trial_id. In eval mode (is_eval=True), all reset_state_id values are traversed in interleaved order: (task0, trial0), (task1, trial0), …, (task0, trial1), …, so that parallel envs advance through trials evenly across tasks. On auto_reset, the next reset_state_id is assigned in that same order.
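The indexing scheme can be sketched as follows. This is an illustration under stated assumptions, not the LiberoEnv code: it assumes per-task trials are concatenated in task order, and `decode` and `interleaved_ids` are hypothetical names.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Iterator, List, Tuple


def decode(reset_state_id: int, trials_per_task: List[int]) -> Tuple[int, int]:
    """Map a global id back to (task_id, trial_id)."""
    ends = list(accumulate(trials_per_task))  # exclusive end offset of each task
    if not 0 <= reset_state_id < ends[-1]:
        raise IndexError(f"reset_state_id {reset_state_id} out of range")
    task_id = bisect_right(ends, reset_state_id)
    start = ends[task_id] - trials_per_task[task_id]
    return task_id, reset_state_id - start


def interleaved_ids(trials_per_task: List[int]) -> Iterator[int]:
    """Yield global ids as (task0, trial0), (task1, trial0), ..., (task0, trial1), ..."""
    starts = [0] + list(accumulate(trials_per_task))[:-1]
    for trial_id in range(max(trials_per_task)):
        for task_id, n_trials in enumerate(trials_per_task):
            if trial_id < n_trials:  # skip tasks that have run out of trials
                yield starts[task_id] + trial_id


# Toy suite: task 0 has 3 init states, task 1 has 2.
toy = [3, 2]
order = list(interleaved_ids(toy))
pairs = [decode(i, toy) for i in order]
print(order)  # [0, 3, 1, 4, 2]
print(pairs)  # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2)]
```

With this ordering, the first batch of parallel envs touches every task once before any task sees its second initial state.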

One rollout_epoch should cover every (task_id, trial_id) pair in the suite. Two approaches:

High parallelism: Set total_num_envs ≥ total init states and max_steps_per_rollout_epoch = max_episode_steps so each parallel env runs exactly one trajectory (see libero_spatial_openpi_pi05_eval.yaml).
Auto-reset: With auto_reset=True, finished episodes immediately load the next init state. Set max_steps_per_rollout_epoch to N × max_episode_steps to evaluate roughly N × total_num_envs trajectories per epoch (see libero_spatial_dreamzero_eval.yaml).

Per-Suite `max_episode_steps` Reference

The longest training demonstration in each suite sets a lower bound on max_episode_steps. The values in the RLinf eval YAMLs vary with each model's action frequency, but they should be at least this bound and should match the value used in the training config:

Suite	Lower bound	RLinf example values	Example config
`libero_spatial`	220	240 / 480	`libero_spatial_openpi_pi05_eval` / `libero_spatial_dreamzero_eval`
`libero_object`	280	280 / 512	`libero_object_openpi_pi05_eval` / `libero_object_openvlaoft_eval`
`libero_goal`	300	320 / 512	`libero_goal_openpi_eval` / `libero_goal_openvlaoft_eval`
`libero_10`	520	520	`libero_10_openpi_pi05_eval` / `libero_10_openvlaoft_eval`
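The lower bounds in the table can be turned into a small pre-launch check. This is only a sketch: the bounds are copied from the table above, while `LOWER_BOUND` and `check_step_limit` are hypothetical names that do not exist in RLinf.

```python
# Lower bounds copied from the table above (longest training demo per suite).
LOWER_BOUND = {
    "libero_spatial": 220,
    "libero_object": 280,
    "libero_goal": 300,
    "libero_10": 520,
}


def check_step_limit(task_suite_name: str, max_episode_steps: int) -> int:
    """Return the headroom above the suite's lower bound, or raise if below it."""
    bound = LOWER_BOUND[task_suite_name]
    if max_episode_steps < bound:
        raise ValueError(
            f"{task_suite_name}: max_episode_steps={max_episode_steps} "
            f"is below the lower bound {bound}"
        )
    return max_episode_steps - bound


print(check_step_limit("libero_spatial", 240))  # 20
print(check_step_limit("libero_10", 520))       # 0
```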

See env.eval Field Reference in Configuration Reference for detailed env.eval field descriptions.

Covering the Full Test Set

Let S = total init states in the suite, E = total_num_envs, T = max_episode_steps.

Option 1: High parallelism (`auto_reset` optional)

env:
  eval:
    total_num_envs: 500        # S for Spatial / Object / Goal / Long
    max_episode_steps: 240
    max_steps_per_rollout_epoch: 240   # equals max_episode_steps
    auto_reset: True           # optional when E >= S
    rollout_epoch: 1

Option 2: Auto-reset (recommended when memory is limited)

env:
  eval:
    total_num_envs: 128
    max_episode_steps: 480
    # N = ceil(S / E); Spatial: ceil(500/128) = 4
    max_steps_per_rollout_epoch: 1920   # N * max_episode_steps = 4 * 480
    auto_reset: True
    ignore_terminations: True
    use_fixed_reset_state_ids: True
    use_ordered_reset_state_ids: True
    rollout_epoch: 1

Trajectories per rollout_epoch ≈ N × total_num_envs where N = max_steps_per_rollout_epoch / max_episode_steps. For example, Spatial (S=500) with E=128 requires N = ceil(500/128) = 4.
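The arithmetic for choosing max_steps_per_rollout_epoch can be written out as a short helper. This is a sketch of the formula above using the S, E, and T symbols from this section; `plan_auto_reset` is a hypothetical name, not an RLinf function.

```python
import math
from typing import Dict


def plan_auto_reset(S: int, E: int, T: int) -> Dict[str, int]:
    """Compute the epoch length needed for E envs to cover S init states.

    S: total init states, E: total_num_envs, T: max_episode_steps.
    """
    n = math.ceil(S / E)  # episodes each env must run back to back
    return {
        "episodes_per_env": n,
        "max_steps_per_rollout_epoch": n * T,
        "approx_trajectories": n * E,
    }


print(plan_auto_reset(S=500, E=128, T=480))
# {'episodes_per_env': 4, 'max_steps_per_rollout_epoch': 1920, 'approx_trajectories': 512}
print(plan_auto_reset(S=500, E=500, T=240))
# {'episodes_per_env': 1, 'max_steps_per_rollout_epoch': 240, 'approx_trajectories': 500}
```

The first call reproduces the Option 2 values (4 × 480 = 1920), and the second reproduces Option 1, where each env runs exactly one trajectory.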

Multi-epoch averaging

env:
  eval:
    rollout_epoch: 2           # same seed, two passes, metrics averaged

Notes

max_steps_per_rollout_epoch must be divisible by rollout.model.num_action_chunks; startup validation will fail otherwise.
Each env worker offsets its seed as seed + rank × stage_num + stage_id, so each worker receives a distinct init-state subset.
eval/success_once in the terminal is the success rate over completed trajectories; with auto_reset, metrics are recorded only when a new episode finishes, avoiding double counting.
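The first two notes can be checked by hand before launching. The sketch below mirrors the divisibility rule and the seed-offset formula as stated above; it is not RLinf's startup validation, the function names are hypothetical, and the value 8 for num_action_chunks is an illustrative assumption.

```python
def check_chunking(max_steps_per_rollout_epoch: int, num_action_chunks: int) -> int:
    """Return the number of chunked policy calls per epoch, or raise if not divisible."""
    if max_steps_per_rollout_epoch % num_action_chunks != 0:
        raise ValueError(
            f"max_steps_per_rollout_epoch={max_steps_per_rollout_epoch} is not "
            f"divisible by num_action_chunks={num_action_chunks}"
        )
    return max_steps_per_rollout_epoch // num_action_chunks


def worker_seed(seed: int, rank: int, stage_num: int, stage_id: int) -> int:
    """Seed offset from the note above: seed + rank * stage_num + stage_id."""
    return seed + rank * stage_num + stage_id


# Assumed values for illustration: 8 action chunks, 2 ranks, 2 stages per rank.
print(check_chunking(1920, 8))  # 240
seeds = [worker_seed(0, rank, 2, stage_id) for rank in range(2) for stage_id in range(2)]
print(seeds)  # [0, 1, 2, 3]
```

Because stage_id ranges over 0 … stage_num-1, the offsets for different (rank, stage_id) pairs never collide.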

Advanced Usage

LIBERO-PRO

export LIBERO_TYPE=pro
export LIBERO_PERTURBATION=all
bash evaluations/run_eval.sh libero libero_10_openvlaoft_eval

LIBERO-PLUS

export LIBERO_TYPE=plus
export LIBERO_SUFFIX=all
bash evaluations/run_eval.sh libero libero_10_openvlaoft_eval

Adjust parallel scale

bash evaluations/run_eval.sh libero libero_spatial_openpi_pi05_eval \
  env.eval.total_num_envs=64 \
  rollout.model.model_path=/path/to/model

FAQ

Rendering issues: On headless systems, try export MUJOCO_GL=osmesa and export PYOPENGL_PLATFORM=osmesa (run_eval.sh sets these by default).
Test coverage: See Evaluation Configuration above; the key is coordinating total_num_envs, auto_reset, and max_steps_per_rollout_epoch.
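For the rendering issue above, when the environment is driven from a custom Python entry point rather than run_eval.sh, the same two variables can be set from Python. This is a minimal sketch with a hypothetical helper name; it assumes the variables are set before MuJoCo or PyOpenGL is imported, since they are typically read at import time.

```python
import os
from typing import MutableMapping


def use_osmesa(env: MutableMapping[str, str] = os.environ) -> MutableMapping[str, str]:
    """Select the OSMesa software renderer unless the caller already chose a backend."""
    env.setdefault("MUJOCO_GL", "osmesa")
    env.setdefault("PYOPENGL_PLATFORM", "osmesa")
    return env


# Demonstrated on a plain dict so the real process environment is left untouched.
demo = use_osmesa({"MUJOCO_GL": "egl"})
print(demo)  # {'MUJOCO_GL': 'egl', 'PYOPENGL_PLATFORM': 'osmesa'}
```

Using setdefault keeps any backend that was already exported in the shell, such as the egl value in the demo.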