Evaluation

RLinf provides a unified embodied evaluation entry point. It runs parallel rollouts in simulation or on real robots and reports task-level metrics such as success rate. This module covers environment setup, a quick first evaluation, and end-to-end workflows per benchmark.
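As a rough illustration of what a task-level success rate means, the sketch below aggregates per-episode outcomes into one rate per task. The function name `success_rate_by_task`, the `(task, succeeded)` record shape, and the sample episodes are assumptions made for this example; they are not RLinf's actual data structures or output format.

```python
from collections import defaultdict


def success_rate_by_task(episodes):
    """Aggregate (task_name, succeeded) pairs into a success rate per task.

    Hypothetical helper for illustration; not part of RLinf.
    """
    totals = defaultdict(int)     # episodes seen per task
    successes = defaultdict(int)  # successful episodes per task
    for task, succeeded in episodes:
        totals[task] += 1
        successes[task] += int(bool(succeeded))
    # Every task in `totals` has at least one episode, so no division by zero.
    return {task: successes[task] / totals[task] for task in totals}


# Made-up rollout outcomes, as if collected from parallel environments.
episodes = [
    ("libero_spatial", True),
    ("libero_spatial", False),
    ("libero_goal", True),
    ("libero_goal", True),
]
rates = success_rate_by_task(episodes)
```

Because the rate is computed per task, a benchmark with uneven episode counts across tasks is not dominated by its most frequently sampled task.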

Supported Benchmarks

The table below lists benchmarks that have example configs under evaluations/ and can be launched directly with run_eval.sh.

| Benchmark | Task / env preset | Example config |
| --- | --- | --- |
| RealWorld | realworld_franka_sft_env, realworld_bin_relocation | realworld/realworld_eval.yaml, realworld/realworld_pnp_eval.yaml, realworld/realworld_pnp_eval_dreamzero.yaml |
| BEHAVIOR-1K | behavior_r1pro | behavior/behavior_openpi_pi05_eval.yaml |
| LIBERO | libero_spatial, libero_object, libero_goal, libero_10 | libero/libero_spatial_openpi_pi05_eval.yaml, etc. |
| ManiSkill OOD | maniskill_ood_template (out-of-distribution generalization) | maniskill/maniskill_ood_openvlaoft_eval.yaml |
| PolaRiS | polaris_droid_tapeintocontainer, polaris_droid_movelattecup, etc. | polaris/polaris_tapeintocontainer_openpi_pi05_eval.yaml, polaris/polaris_movelattecup_openpi_eval.yaml |
| RoboTwin | robotwin_place_empty_cup, robotwin_adjust_bottle, robotwin_place_shoe, robotwin_click_bell | robotwin/robotwin_place_empty_cup_openvlaoft_eval.yaml, etc. |

LIBERO variants: Standard LIBERO, LIBERO-PRO, and LIBERO-PLUS are supported via environment variables (see LIBERO Evaluation).

Config fallback: If evaluations/<benchmark>/<config>.yaml does not exist, run_eval.sh falls back to examples/embodiment/config/ with the same config name, so training configs can be reused for evaluation.
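The fallback order described above can be sketched as a small lookup. This is only an illustration of the documented behavior, not the logic inside run_eval.sh itself; the function name `resolve_eval_config` and its `repo_root` parameter are assumptions introduced for the example.

```python
from pathlib import Path


def resolve_eval_config(repo_root, benchmark, config_name):
    """Return the first existing config path, mirroring the documented order.

    Hypothetical helper: run_eval.sh may implement the lookup differently.
    """
    root = Path(repo_root)
    candidates = [
        # 1. Benchmark-specific evaluation config.
        root / "evaluations" / benchmark / f"{config_name}.yaml",
        # 2. Fallback: a training config with the same name.
        root / "examples" / "embodiment" / "config" / f"{config_name}.yaml",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    searched = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"No config named {config_name!r}; searched: {searched}")
```

Checking the evaluation directory first means a dedicated evaluation config takes precedence over a training config that shares its name.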

Get Started

| Page | What you get |
| --- | --- |
| Overview | Evaluation architecture and the evaluations/ layout. |
| Installation | Environment setup and benchmark-specific variables. |
| Quick Tour | Run your first LIBERO Spatial evaluation in about 5 minutes. |

Guides

End-to-end evaluation workflows per benchmark (setup → config → launch → results):

| Benchmark | Workflow |
| --- | --- |
| RealWorld | Franka real-robot evaluation and deployment. |
| BEHAVIOR-1K | BEHAVIOR-1K evaluation. |
| LIBERO | LIBERO / LIBERO-PRO / LIBERO-PLUS. |
| ManiSkill OOD | ManiSkill out-of-distribution evaluation. |
| PolaRiS | PolaRiS tabletop manipulation. |
| RoboTwin | RoboTwin bimanual manipulation. |

Reference

| Page | What you get |
| --- | --- |
| Configuration | Hydra YAML structure and required fields. |
| CLI | run_eval.sh usage and Hydra overrides. |
| Models | Supported models and example configs. |
| Results | Logs, metrics, and video output. |