BEHAVIOR-1K Evaluation#

BEHAVIOR-1K is a large-scale household scene simulation benchmark built on OmniGibson and Isaac Sim. It tasks a dual-arm R1 Pro robot with manipulation skills such as pick-and-place, stacking, and tidying. RLinf supports parallel evaluation of OpenPI and other VLA policies in BEHAVIOR environments and reports metrics such as eval/success_once.

Related training doc: RL with Behavior Benchmark

Environment Setup#

Install dependencies

bash requirements/install.sh embodied --model openpi --env behavior
source .venv/bin/activate

evaluations/behavior/ currently ships an OpenPI π₀.₅ example only. Training also supports OpenVLA-OFT; you can derive an eval YAML from examples/embodiment/config/ (see Configuration Reference).

Hardware and Isaac Sim

BEHAVIOR depends on Isaac Sim 4.5 and has additional GPU and driver requirements; see the Isaac Sim requirements in the training doc. Key points:

A GPU with Ray Tracing support (e.g. RTX 30/40 series) is recommended. GPUs without RT (A100, H100, etc.) produce poor rendering quality with visible artifacts.
Hopper and newer GPUs require NVIDIA driver 570 or later.

You can also run evaluation inside the official Docker image rlinf/rlinf:agentic-rlinf0.2-behavior; see RL with Behavior Benchmark.

Environment variables

Set ISAAC_PATH and OmniGibson data paths before every run (run_eval.sh auto-fills derived variables such as OMNIGIBSON_DATASET_PATH, EXP_PATH, and CARB_APP_PATH):

export ISAAC_PATH=/path/to/isaac-sim
export OMNIGIBSON_DATA_PATH=/path/to/BEHAVIOR-1K-datasets
export OMNIGIBSON_DATASET_PATH=${OMNIGIBSON_DATA_PATH}/behavior-1k-assets/
export OMNIGIBSON_KEY_PATH=${OMNIGIBSON_DATA_PATH}/omnigibson.key
export OMNIGIBSON_ASSET_PATH=${OMNIGIBSON_DATA_PATH}/omnigibson-robot-assets/

BEHAVIOR assets exceed 30 GB; see the “Resource download” section in RL with Behavior Benchmark for download and license setup.

Example Configs#

The following example is available under evaluations/behavior/:

Config file	Env preset	Model
`behavior_openpi_pi05_eval.yaml`	`behavior_r1pro`	π₀.₅

If evaluations/behavior/<config>.yaml is missing, run_eval.sh falls back to examples/embodiment/config/ with the same name (e.g. behavior_ppo_openpi_pi05_eval). Fallback configs include actor / algorithm sections but still work for evaluation when runner.only_eval: True.

End-to-End Workflow#

Step 1: Activate the environment and set paths

source .venv/bin/activate
export ISAAC_PATH=/path/to/isaac-sim
export OMNIGIBSON_DATA_PATH=/path/to/BEHAVIOR-1K-datasets
export OMNIGIBSON_DATASET_PATH=${OMNIGIBSON_DATA_PATH}/behavior-1k-assets/
export OMNIGIBSON_KEY_PATH=${OMNIGIBSON_DATA_PATH}/omnigibson.key
export OMNIGIBSON_ASSET_PATH=${OMNIGIBSON_DATA_PATH}/omnigibson-robot-assets/

Step 2: Prepare the model

Recommended checkpoint: RLinf/RLinf-Pi0-Behavior (download commands in the training doc). Third-party OpenPI weights (e.g. OpenPI-Comet) must be converted to PyTorch format before setting rollout.model.model_path.

Step 3: Edit the config

Copy or edit the target YAML and set at least rollout.model.model_path. Generic env.eval fields are documented in Configuration Reference (env.eval Field Reference); BEHAVIOR-specific fields and the evaluation protocol are covered in Evaluation Configuration below.

The OpenPI fields in behavior_openpi_pi05_eval.yaml must match training (action_dim: 23, num_action_chunks: 32, openpi.config_name: pi05_behavior, etc.).

Step 4: Launch evaluation

bash evaluations/run_eval.sh behavior behavior_openpi_pi05_eval \
  rollout.model.model_path=/path/to/model

Step 5: Check results

The terminal prints eval/success_once; see Logs and Results for logs and videos.

Evaluation Configuration#

BEHAVIOR evaluation runs one task per launch (selected by omni_config.task.activity_name). A single run does not automatically sweep all 50 tasks. The fields below control parallel scale, trajectory length, and initial scene instances.

Evaluation protocol#

BEHAVIOR-1K defines 50 household tasks (names listed in rlinf/envs/behavior/behavior_task.jsonl). The behavior_r1pro preset defaults to turning_on_radio on scene house_double_floor_lower.

Each evaluation trajectory is determined by:

omni_config.task.activity_name: task name (language instruction and BDDL definition);
omni_config.task.activity_definition_id: task definition variant (usually 0);
omni_config.task.activity_instance_id and instance_resample_mode: initial object layout and robot pose.

instance_resample_mode supports three values:

disabled (default): every reset loads the fixed instance for activity_instance_id; if activity_instance_dir is set, the matching JSON is read from that directory.
offline: every reset randomly picks a cached instance from activity_instance_dir (download official 2025-challenge-task-instances or generate files with instance_generator.py).
online: online object resampling on reset (requires online_object_sampling: True and use_presampled_robot_pose: False; slower startup).

Note

A single launch does not automatically sweep all tasks or init states. To evaluate multiple tasks, change activity_name and rerun, or wrap launches in a batch script. For multiple instances, use instance_resample_mode: offline and average over rollout_epoch.

Generic `env.eval` fields#

Field	BEHAVIOR guidance
`total_num_envs`	Global parallel env count. Each BEHAVIOR env uses roughly 10 GiB VRAM; the example defaults to `8`.
`rollout_epoch`	Number of eval rounds with the same config; metrics are averaged. The example defaults to `2`.
`max_episode_steps`	Max steps per trajectory. The π₀.₅ example uses `4096` (the preset default `2000` may be too short for long-horizon tasks).
`max_steps_per_rollout_epoch`	Total interaction steps per rollout round; must be divisible by `rollout.model.num_action_chunks`. Without `auto_reset`, usually equals `max_episode_steps`.
`num_env_subprocess`	Isaac sim subprocesses per env worker (default `1`). Increasing this can reduce stepping bottlenecks but multiplies VRAM and process overhead; `total_num_envs` must be divisible by `num_env_subprocess × pipeline_stage_num`.
`skip_intermediate_obs_in_chunk`	When `True`, skips intermediate observations inside action chunks for faster stepping; saved videos only contain chunk-boundary frames.

Key `omni_config` fields#

These live under env.eval.omni_config (inherited from examples/embodiment/config/env/behavior_r1pro.yaml and overridable in eval YAML):

env:
  eval:
    omni_config:
      task:
        activity_name: turning_on_radio
        activity_definition_id: 0
        activity_instance_id: 0
        activity_instance_dir: null          # directory of cached instance JSON files
        instance_file_format: tro_state        # template | tro_state
        instance_resample_mode: disabled       # disabled | offline | online
      scene:
        scene_model: house_double_floor_lower
        partial_scene_load: true               # load task-relevant rooms only

For full field descriptions (partial_scene_load, instance_generator.py, etc.), see “behavior_r1pro.yaml key settings” in RL with Behavior Benchmark.

GPU and cluster placement#

BEHAVIOR stepping is slow; allocate enough GPUs to env workers and share or split placement with rollout:

cluster:
  component_placement:
    rollout,env: all          # share all GPUs (example default)

You can also place env and rollout on separate GPUs to reduce memory pressure; see “Key cluster settings” in the training doc.

Advanced Usage#

Switch evaluation task

bash evaluations/run_eval.sh behavior behavior_openpi_pi05_eval \
  rollout.model.model_path=/path/to/model \
  env.eval.omni_config.task.activity_name=picking_up_trash

Random offline instance sampling

env:
  eval:
    omni_config:
      task:
        activity_instance_dir: ${oc.env:OMNIGIBSON_DATA_PATH}/2025-challenge-task-instances
        instance_file_format: tro_state
        instance_resample_mode: offline
    rollout_epoch: 5

Adjust parallel scale

bash evaluations/run_eval.sh behavior behavior_openpi_pi05_eval \
  rollout.model.model_path=/path/to/model \
  env.eval.total_num_envs=4 \
  env.eval.num_env_subprocess=2

Evaluate from a training config

bash evaluations/run_eval.sh behavior behavior_ppo_openpi_pi05_eval \
  rollout.model.model_path=/path/to/model

FAQ#

Data download: BEHAVIOR assets are large; complete Isaac Sim, OmniGibson asset, and license setup per RL with Behavior Benchmark before evaluation.
ISAAC_PATH not set: run_eval.sh defaults to /path/to/isaac-sim; Isaac Sim will fail to start without a valid path.
Headless mode: run_eval.sh sets OMNIGIBSON_HEADLESS=1 by default.
Out of memory: Lower total_num_envs or num_env_subprocess; each env uses about 10 GiB VRAM.
Blurry or blocky rendering: The GPU lacks Ray Tracing; use RTX 30/40 series or newer.
Very slow startup: First load of a large scene is expensive; keep partial_scene_load: true to load only task-relevant rooms.
Fewer video frames than expected: skip_intermediate_obs_in_chunk: True skips intermediate chunk frames and keeps only observations consumed by the policy.
Instance load failure: JSON filenames under activity_instance_dir must match activity_name, activity_definition_id, and scene_model; see rlinf/envs/behavior/instance_loader.py.
Step count validation error: max_steps_per_rollout_epoch must be divisible by rollout.model.num_action_chunks.