Evaluation

RLinf provides a unified embodied evaluation entry point. It runs parallel rollouts in simulation or on real robots and reports task-level metrics such as success rate. This module covers environment setup, a quick first evaluation, and end-to-end workflows per benchmark.
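As a rough illustration of what a task-level success rate means, the sketch below aggregates per-episode outcomes into one rate per task. It is not RLinf's metric code: the function name `success_rate_by_task`, the `(task, succeeded)` record shape, and the toy episode list are all assumptions made for this example.

```python
from collections import defaultdict


def success_rate_by_task(episodes):
    """Aggregate (task_name, succeeded) pairs into a per-task success rate.

    Hypothetical helper for illustration; RLinf's own reporting may differ.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for task, succeeded in episodes:
        totals[task] += 1
        successes[task] += int(bool(succeeded))
    # Success rate = successful episodes / total episodes, per task.
    return {task: successes[task] / totals[task] for task in totals}


# Toy rollout outcomes (made-up values, not real benchmark results).
episodes = [
    ("libero_spatial", True),
    ("libero_spatial", False),
    ("libero_object", True),
    ("libero_object", True),
]
rates = success_rate_by_task(episodes)
print(rates)
```

With parallel rollouts, each worker would contribute its own episode records, and the aggregation runs once over the combined list.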

Supported Benchmarks

The table below lists benchmarks that have example configs under evaluations/ and can be launched directly with run_eval.sh.

Benchmark	Task / env preset	Example config
RealWorld	`realworld_franka_sft_env`, `realworld_bin_relocation`	`realworld/realworld_eval.yaml`, `realworld/realworld_pnp_eval.yaml`, `realworld/realworld_pnp_eval_dreamzero.yaml`
BEHAVIOR-1K	`behavior_r1pro`	`behavior/behavior_openpi_pi05_eval.yaml`
LIBERO	`libero_spatial`, `libero_object`, `libero_goal`, `libero_10`	`libero/libero_spatial_openpi_pi05_eval.yaml`, etc.
ManiSkill OOD	`maniskill_ood_template` (out-of-distribution generalization)	`maniskill/maniskill_ood_openvlaoft_eval.yaml`
PolaRiS	`polaris_droid_tapeintocontainer`, `polaris_droid_movelattecup`, etc.	`polaris/polaris_tapeintocontainer_openpi_pi05_eval.yaml`, `polaris/polaris_movelattecup_openpi_eval.yaml`
RoboTwin	`robotwin_place_empty_cup`, `robotwin_adjust_bottle`, `robotwin_place_shoe`, `robotwin_click_bell`	`robotwin/robotwin_place_empty_cup_openvlaoft_eval.yaml`, etc.

LIBERO variants: Standard LIBERO, LIBERO-PRO, and LIBERO-PLUS are supported via environment variables (see LIBERO Evaluation).

Config fallback: If evaluations/<benchmark>/<config>.yaml does not exist, run_eval.sh falls back to examples/embodiment/config/ with the same config name, so training configs can be reused for evaluation.
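The lookup order described above can be sketched in Python as follows. This is only an illustration of the fallback rule, not the actual logic inside run_eval.sh; the function name `resolve_eval_config` and the `repo_root` parameter are hypothetical.

```python
from pathlib import Path


def resolve_eval_config(repo_root: Path, benchmark: str, config_name: str) -> Path:
    """Return the YAML file an evaluation launch would pick up (illustrative only)."""
    # Preferred location: evaluations/<benchmark>/<config>.yaml
    primary = repo_root / "evaluations" / benchmark / f"{config_name}.yaml"
    if primary.is_file():
        return primary
    # Fallback: examples/embodiment/config/<config>.yaml, same config name
    fallback = repo_root / "examples" / "embodiment" / "config" / f"{config_name}.yaml"
    if fallback.is_file():
        return fallback
    raise FileNotFoundError(
        f"No config named {config_name!r} under {primary.parent} or {fallback.parent}"
    )
```

Because the evaluation-specific directory is checked first, a config placed under evaluations/ takes precedence over a training config with the same name.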

Get Started

Page	What you get
Overview	Evaluation architecture and the `evaluations/` layout.
Installation	Environment setup and benchmark-specific variables.
Quick Tour	Run your first LIBERO Spatial evaluation in ~5 minutes.

Guides

End-to-end evaluation workflows per benchmark (setup → config → launch → results):

Benchmark	Workflow
RealWorld	Franka real-robot evaluation and deployment.
BEHAVIOR-1K	BEHAVIOR-1K evaluation.
LIBERO	LIBERO / LIBERO-PRO / LIBERO-PLUS.
ManiSkill OOD	ManiSkill out-of-distribution evaluation.
PolaRiS	PolaRiS tabletop manipulation.
RoboTwin	RoboTwin bimanual manipulation.

Reference

Page	What you get
Configuration	Hydra YAML structure and required fields.
CLI	`run_eval.sh` usage and Hydra overrides.
Models	Supported models and example configs.
Results	Logs, metrics, and video output.

Related Documentation

Resource	Where
Per-benchmark setup and training examples	RL with Embodied Simulators
Installation details	Installation
Math reasoning LLM evaluation (non-embodied)	LLMEvalKit
Model-specific standalone eval scripts (outside the unified entry)	`toolkits/standalone_eval_scripts/`
