BEHAVIOR-1K Evaluation#
BEHAVIOR-1K is a large-scale household scene simulation benchmark built on OmniGibson and Isaac Sim. It tasks a dual-arm R1 Pro robot with manipulation skills such as pick-and-place, stacking, and tidying. RLinf supports parallel evaluation of OpenPI and other VLA policies in BEHAVIOR environments and reports metrics such as eval/success_once.
Related training doc: RL with Behavior Benchmark
Environment Setup#
Install dependencies
bash requirements/install.sh embodied --model openpi --env behavior
source .venv/bin/activate
evaluations/behavior/ currently ships an OpenPI π₀.₅ example only. Training also supports OpenVLA-OFT; you can derive an eval YAML from examples/embodiment/config/ (see Configuration Reference).
Hardware and Isaac Sim
BEHAVIOR depends on Isaac Sim 4.5 and has additional GPU and driver requirements; see the Isaac Sim requirements in the training doc. Key points:
A GPU with hardware ray tracing (RT cores), e.g. RTX 30/40 series, is recommended. Data-center GPUs without RT cores (A100, H100, etc.) produce poor rendering quality with visible artifacts.
Hopper and newer GPUs require NVIDIA driver 570 or later.
You can also run evaluation inside the official Docker image rlinf/rlinf:agentic-rlinf0.2-behavior; see RL with Behavior Benchmark.
Environment variables
Set ISAAC_PATH and OmniGibson data paths before every run (run_eval.sh auto-fills derived variables such as OMNIGIBSON_DATASET_PATH, EXP_PATH, and CARB_APP_PATH):
export ISAAC_PATH=/path/to/isaac-sim
export OMNIGIBSON_DATA_PATH=/path/to/BEHAVIOR-1K-datasets
export OMNIGIBSON_DATASET_PATH=${OMNIGIBSON_DATA_PATH}/behavior-1k-assets/
export OMNIGIBSON_KEY_PATH=${OMNIGIBSON_DATA_PATH}/omnigibson.key
export OMNIGIBSON_ASSET_PATH=${OMNIGIBSON_DATA_PATH}/omnigibson-robot-assets/
BEHAVIOR assets exceed 30 GB; see the “Resource download” section in RL with Behavior Benchmark for download and license setup.
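Before launching, it can help to confirm that the variables above point at real locations. The sketch below is a hypothetical preflight helper (check_behavior_env is not part of RLinf or run_eval.sh). It assumes the four path variables are directories and that OMNIGIBSON_KEY_PATH is a regular file, as the exports above suggest.

```shell
# Hypothetical preflight helper; not part of RLinf or run_eval.sh.
# Prints one line per problem and returns non-zero if any were found.
check_behavior_env() {
    rc=0
    # Variables that the exports above point at directories.
    for var in ISAAC_PATH OMNIGIBSON_DATA_PATH OMNIGIBSON_DATASET_PATH OMNIGIBSON_ASSET_PATH; do
        # Indirect lookup: read the value of the variable named by $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "unset: $var"
            rc=1
        elif [ ! -d "$value" ]; then
            echo "not a directory: $var=$value"
            rc=1
        fi
    done
    # The license key is a file, not a directory.
    if [ -z "${OMNIGIBSON_KEY_PATH:-}" ]; then
        echo "unset: OMNIGIBSON_KEY_PATH"
        rc=1
    elif [ ! -f "$OMNIGIBSON_KEY_PATH" ]; then
        echo "not a file: OMNIGIBSON_KEY_PATH=$OMNIGIBSON_KEY_PATH"
        rc=1
    fi
    return "$rc"
}
```

Calling check_behavior_env after the exports and before run_eval.sh surfaces a missing path immediately rather than after Isaac Sim has started loading.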
Example Configs#
The following example is available under evaluations/behavior/:
| Config file | Env preset | Model |
|---|---|---|
| behavior_openpi_pi05_eval.yaml | behavior_r1pro | π₀.₅ |
If evaluations/behavior/<config>.yaml is missing, run_eval.sh falls back to examples/embodiment/config/ with the same name (e.g. behavior_ppo_openpi_pi05_eval). Fallback configs include actor / algorithm sections but still work for evaluation when runner.only_eval: True.
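The lookup order described above can be sketched as a small function. This is only an illustration of the documented fallback, not the code in run_eval.sh; the name resolve_eval_config and the optional repo-root argument are hypothetical.

```shell
# Sketch of the documented lookup order; run_eval.sh may implement it differently.
# Usage: resolve_eval_config <config-name> [repo-root]
resolve_eval_config() {
    name=$1
    root=${2:-.}
    # Try the evaluation directory first, then the training config directory.
    for dir in evaluations/behavior examples/embodiment/config; do
        if [ -f "$root/$dir/$name.yaml" ]; then
            echo "$root/$dir/$name.yaml"
            return 0
        fi
    done
    echo "no config named $name.yaml found" >&2
    return 1
}
```

Running resolve_eval_config behavior_ppo_openpi_pi05_eval from the repository root shows which of the two files a launch with that name would pick up under this assumed order.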
End-to-End Workflow#
Step 1: Activate the environment and set paths
source .venv/bin/activate
export ISAAC_PATH=/path/to/isaac-sim
export OMNIGIBSON_DATA_PATH=/path/to/BEHAVIOR-1K-datasets
export OMNIGIBSON_DATASET_PATH=${OMNIGIBSON_DATA_PATH}/behavior-1k-assets/
export OMNIGIBSON_KEY_PATH=${OMNIGIBSON_DATA_PATH}/omnigibson.key
export OMNIGIBSON_ASSET_PATH=${OMNIGIBSON_DATA_PATH}/omnigibson-robot-assets/
Step 2: Prepare the model
Recommended checkpoint: RLinf/RLinf-Pi0-Behavior (download commands in the training doc). Third-party OpenPI weights (e.g. OpenPI-Comet) must be converted to PyTorch format before setting rollout.model.model_path.
Step 3: Edit the config
Copy or edit the target YAML and set at least rollout.model.model_path. Generic env.eval fields are documented in Configuration Reference (env.eval Field Reference); BEHAVIOR-specific fields and the evaluation protocol are covered in Evaluation Configuration below.
The OpenPI fields in behavior_openpi_pi05_eval.yaml must match training (action_dim: 23, num_action_chunks: 32, openpi.config_name: pi05_behavior, etc.).
Step 4: Launch evaluation
bash evaluations/run_eval.sh behavior behavior_openpi_pi05_eval \
rollout.model.model_path=/path/to/model
Step 5: Check results
The terminal prints eval/success_once; see Logs and Results for logs and videos.
Evaluation Configuration#
BEHAVIOR evaluation runs one task per launch (selected by omni_config.task.activity_name). A single run does not automatically sweep all 50 tasks. The fields below control parallel scale, trajectory length, and initial scene instances.
Evaluation protocol#
RLinf's BEHAVIOR environment covers 50 household tasks (names listed in rlinf/envs/behavior/behavior_task.jsonl), a subset of the 1,000 activities that BEHAVIOR-1K defines. The behavior_r1pro preset defaults to turning_on_radio on scene house_double_floor_lower.
Each evaluation trajectory is determined by:
omni_config.task.activity_name: task name (language instruction and BDDL definition);
omni_config.task.activity_definition_id: task definition variant (usually 0);
omni_config.task.activity_instance_id and instance_resample_mode: initial object layout and robot pose.
instance_resample_mode supports three values:
disabled (default): every reset loads the fixed instance for activity_instance_id; if activity_instance_dir is set, the matching JSON is read from that directory.
offline: every reset randomly picks a cached instance from activity_instance_dir (download the official 2025-challenge-task-instances or generate files with instance_generator.py).
online: online object resampling on reset (requires online_object_sampling: True and use_presampled_robot_pose: False; slower startup).
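The requirements of the three modes can be checked before a launch. The helper below is a hypothetical sketch (check_resample_mode is not an RLinf function). It takes the four values as plain strings and assumes that offline mode always needs a non-null activity_instance_dir, which the description above implies but does not state outright.

```shell
# Hypothetical consistency check for instance_resample_mode; not part of RLinf.
# Usage: check_resample_mode <mode> <activity_instance_dir> \
#            <online_object_sampling> <use_presampled_robot_pose>
check_resample_mode() {
    mode=$1
    dir=$2
    online=$3
    presampled=$4
    case "$mode" in
        disabled)
            ;;
        offline)
            # Assumption: offline sampling needs a directory of cached instances.
            case "$dir" in
                ""|null)
                    echo "offline mode needs activity_instance_dir"
                    return 1 ;;
            esac ;;
        online)
            # Accept both True/true and False/false spellings.
            case "$online:$presampled" in
                [Tt]rue:[Ff]alse) ;;
                *)
                    echo "online mode needs online_object_sampling: True and use_presampled_robot_pose: False"
                    return 1 ;;
            esac ;;
        *)
            echo "unknown instance_resample_mode: $mode"
            return 1 ;;
    esac
    return 0
}
```

For example, check_resample_mode offline null False True reports the missing directory and returns a non-zero status.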
Note
A single launch does not automatically sweep all tasks or init states. To evaluate multiple tasks, change activity_name and rerun, or wrap launches in a batch script. For multiple instances, use instance_resample_mode: offline and average over rollout_epoch.
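A batch wrapper of the kind mentioned in the note might look like the sketch below. The function name run_behavior_tasks and the DRY_RUN switch are hypothetical; the launch command and override keys are the ones used elsewhere in this guide. Task names are passed explicitly because the layout of behavior_task.jsonl is not described here.

```shell
# Hypothetical wrapper: one launch per task.
# Set DRY_RUN=1 to print the commands instead of running them.
# Usage: run_behavior_tasks <model-path> <task> [<task> ...]
run_behavior_tasks() {
    model_path=$1
    shift
    failed=0
    for task in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "bash evaluations/run_eval.sh behavior behavior_openpi_pi05_eval rollout.model.model_path=$model_path env.eval.omni_config.task.activity_name=$task"
        else
            # Keep going after a failed task so one crash does not end the batch.
            bash evaluations/run_eval.sh behavior behavior_openpi_pi05_eval \
                "rollout.model.model_path=$model_path" \
                "env.eval.omni_config.task.activity_name=$task" \
                || { echo "failed: $task" >&2; failed=1; }
        fi
    done
    return "$failed"
}
```

With DRY_RUN=1, run_behavior_tasks /path/to/model turning_on_radio picking_up_trash prints two launch commands without starting Isaac Sim, which is a cheap way to review the overrides first.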
Generic env.eval fields#
| Field | BEHAVIOR guidance |
|---|---|
| total_num_envs | Global parallel env count. Each BEHAVIOR env uses roughly 10 GiB VRAM; the example defaults to … |
| rollout_epoch | Number of eval rounds with the same config; metrics are averaged. The example defaults to … |
| … | Max steps per trajectory. The π₀.₅ example uses … |
| max_steps_per_rollout_epoch | Total interaction steps per rollout round; must be divisible by rollout.model.num_action_chunks. |
| num_env_subprocess | Isaac Sim subprocesses per env worker (default …) |
| … | When … |
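The total step count per rollout round has to be a multiple of rollout.model.num_action_chunks (the FAQ states this for max_steps_per_rollout_epoch). The hypothetical helper below rounds a step budget down to the nearest valid value; align_steps is an illustrative name, not an RLinf utility.

```shell
# Hypothetical helper: round a step budget down to a multiple of the chunk size.
# Usage: align_steps <max_steps_per_rollout_epoch> <num_action_chunks>
align_steps() {
    steps=$1
    chunks=$2
    if [ "$chunks" -le 0 ] || [ "$steps" -lt "$chunks" ]; then
        echo "step budget must cover at least one chunk of size $chunks" >&2
        return 1
    fi
    # Drop the remainder so the result divides evenly by the chunk size.
    echo $(( $steps - $steps % $chunks ))
}
```

With num_action_chunks: 32, as in the π₀.₅ example, align_steps 2000 32 prints 1984, which can then be passed as the step override.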
Key omni_config fields#
These live under env.eval.omni_config (inherited from examples/embodiment/config/env/behavior_r1pro.yaml and overridable in eval YAML):
env:
  eval:
    omni_config:
      task:
        activity_name: turning_on_radio
        activity_definition_id: 0
        activity_instance_id: 0
        activity_instance_dir: null # directory of cached instance JSON files
        instance_file_format: tro_state # template | tro_state
        instance_resample_mode: disabled # disabled | offline | online
      scene:
        scene_model: house_double_floor_lower
        partial_scene_load: true # load task-relevant rooms only
For full field descriptions (partial_scene_load, instance_generator.py, etc.), see “behavior_r1pro.yaml key settings” in RL with Behavior Benchmark.
GPU and cluster placement#
BEHAVIOR stepping is slow, so allocate enough GPUs to the env workers. Env and rollout can either share GPUs or be placed on separate ones:
cluster:
  component_placement:
    rollout,env: all # share all GPUs (example default)
You can also place env and rollout on separate GPUs to reduce memory pressure; see “Key cluster settings” in the training doc.
Advanced Usage#
Switch evaluation task
bash evaluations/run_eval.sh behavior behavior_openpi_pi05_eval \
rollout.model.model_path=/path/to/model \
env.eval.omni_config.task.activity_name=picking_up_trash
Random offline instance sampling
env:
  eval:
    omni_config:
      task:
        activity_instance_dir: ${oc.env:OMNIGIBSON_DATA_PATH}/2025-challenge-task-instances
        instance_file_format: tro_state
        instance_resample_mode: offline
    rollout_epoch: 5
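Offline sampling only varies the initial state if the directory actually holds several instances for the chosen task. The sketch below counts candidate files. It is a hypothetical helper that assumes instance files end in .json and contain the activity name somewhere in their filename; the real matching rule is defined in rlinf/envs/behavior/instance_loader.py and may be stricter.

```shell
# Hypothetical helper: count cached instance files whose name mentions a task.
# Assumptions: files end in .json and include the activity name in the filename.
# Usage: count_task_instances <activity_instance_dir> <activity_name>
count_task_instances() {
    # tr strips the padding that some wc implementations add.
    find "$1" -type f -name "*$2*.json" | wc -l | tr -d ' '
}
```

A count of 0 or 1 for count_task_instances "$OMNIGIBSON_DATA_PATH/2025-challenge-task-instances" turning_on_radio would suggest checking the download or the naming rule before relying on averaged results.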
Adjust parallel scale
bash evaluations/run_eval.sh behavior behavior_openpi_pi05_eval \
rollout.model.model_path=/path/to/model \
env.eval.total_num_envs=4 \
env.eval.num_env_subprocess=2
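To pick a starting value for total_num_envs, the roughly 10 GiB-per-env figure quoted in this guide can be turned into a rough per-GPU estimate. The helper below is a hypothetical sketch: max_envs_per_gpu is an illustrative name, and the optional second argument (memory set aside for the policy when rollout and env share a GPU) is an assumption to tune for your model.

```shell
# Hypothetical estimate based on the ~10 GiB-per-env figure quoted in this guide.
# Usage: max_envs_per_gpu <gpu_vram_gib> [reserved_gib_for_policy]
max_envs_per_gpu() {
    vram=$1
    reserved=${2:-0}
    per_env=10
    free=$(( $vram - $reserved ))
    if [ "$free" -lt "$per_env" ]; then
        # Not enough headroom for even one environment.
        echo 0
    else
        echo $(( $free / $per_env ))
    fi
}
```

For instance, max_envs_per_gpu 48 16 prints 3. Multiplying that by the number of GPUs gives an upper bound for total_num_envs under these assumptions; actual usage depends on the scene and should be confirmed with a short run.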
Evaluate from a training config
bash evaluations/run_eval.sh behavior behavior_ppo_openpi_pi05_eval \
rollout.model.model_path=/path/to/model
FAQ#
Data download: BEHAVIOR assets are large; complete Isaac Sim, OmniGibson asset, and license setup per RL with Behavior Benchmark before evaluation.
ISAAC_PATH not set: run_eval.sh defaults to /path/to/isaac-sim; Isaac Sim will fail to start without a valid path.
Headless mode: run_eval.sh sets OMNIGIBSON_HEADLESS=1 by default.
Out of memory: Lower total_num_envs or num_env_subprocess; each env uses about 10 GiB VRAM.
Blurry or blocky rendering: The GPU lacks Ray Tracing; use RTX 30/40 series or newer.
Very slow startup: First load of a large scene is expensive; keep partial_scene_load: true to load only task-relevant rooms.
Fewer video frames than expected: skip_intermediate_obs_in_chunk: True skips intermediate chunk frames and keeps only observations consumed by the policy.
Instance load failure: JSON filenames under activity_instance_dir must match activity_name, activity_definition_id, and scene_model; see rlinf/envs/behavior/instance_loader.py.
Step count validation error: max_steps_per_rollout_epoch must be divisible by rollout.model.num_action_chunks.