Real-World Evaluation#

RLinf supports evaluating and deploying VLA policies on Franka arms, including Bin-relocation (pick-and-place) tasks and a YAML-configurable generic environment (FrankaEnv-v1) for custom tasks.

Environment Setup#

Hardware

Franka Emika Panda arm + Intel RealSense cameras (default)
One GPU node (training / rollout) and one robot control node (direct Franka and camera access, no GPU required)
All nodes on the same LAN; the arm only needs to reach the control node

For ZED cameras or Robotiq grippers, see Using ZED + Robotiq with Franka.

Dependencies

Install dependencies separately on the control node and the GPU node:

# Robot control node (Franka + cameras + ROS)
bash requirements/install.sh embodied --env franka
source .venv/bin/activate
source <your_catkin_ws>/devel/setup.bash

# GPU / rollout node (π₀ evaluation)
bash requirements/install.sh embodied --model openpi --env franka
source .venv/bin/activate

DreamZero real-robot evaluation also requires DreamZero dependencies on the GPU node; see DreamZero Supervised Fine-Tuning and Franka Real-World Deployment.

Node topology

Real-robot evaluation typically uses a 1 GPU node + 1 Franka control node heterogeneous layout:

`RLINF_NODE_RANK`	Role	Notes
rank 0	GPU / head	Runs `rollout`; submit `run_eval.sh` on this node
rank 1	Robot control	Runs `env` workers with direct Franka and camera access

realworld_pnp_eval.yaml and realworld_pnp_eval_dreamzero.yaml use this two-node layout; realworld_eval.yaml (custom tasks) is a single-node layout with both env and rollout on the Franka node.

For full Ray cluster setup, firmware versions, and libfranka compatibility, see Real-World RL with Franka and Real-World Robot Training Launch.

Example Configs#

The following examples live under evaluations/realworld/:

Config file	Task	Model
`realworld_pnp_eval.yaml`	Bin-relocation (PnP)	π₀
`realworld_pnp_eval_dreamzero.yaml`	Bin-relocation (PnP)	DreamZero
`realworld_eval.yaml`	Custom task (`FrankaEnv-v1`)	π₀

If evaluations/realworld/<config>.yaml is missing, run_eval.sh falls back to the same name under examples/embodiment/config/ (set runner.task_type: embodied_eval and runner.only_eval: True). See CLI Reference. Dual Franka deployment currently uses this fallback path with realworld_eval_dual_franka.

Pre-flight Checks#

Before running evaluation, complete these checks in order:

Camera connection (control node):
```
python -m toolkits.realworld_check.test_franka_camera
```
Record the camera serials and set env.eval.override_cfg.camera_serials.
Target pose (PnP tasks, control node):
```
export FRANKA_ROBOT_IP=<robot_ip>
python -m toolkits.realworld_check.test_franka_controller
# Enter getpos_euler to read the target end-effector pose
```
Set the result in env.eval.override_cfg.target_ee_pose. For PnP, this pose is the lowest point in the motion workspace and is used for success checking and workspace clipping; see Running Pi0 SFT with Franka.
Ray cluster (all nodes):
```
ray status
```
Both the GPU node and the Franka node should appear online.
Dummy mode (optional): See examples/embodiment/config/realworld_dummy_franka_sac_cnn.yaml and set is_dummy: True in override_cfg to verify cluster wiring without real robot motion.

Warning

Verify workspace limits (ee_pose_limit_min / ee_pose_limit_max) and the emergency stop before real-robot evaluation. Use a small env.eval.rollout_epoch on the first run.

Starting the Ray Cluster#

On each node, before ray start, align environment variables (you can use ray_utils/realworld/setup_before_ray.sh):

source ray_utils/realworld/setup_before_ray.sh
export RLINF_NODE_RANK=<0|1>          # unique within the cluster
# On multi-NIC hosts, pin the reachable interface:
# export RLINF_COMM_NET_DEVICES=<network_interface>

On the control node, also source the ROS / franka workspace (unless already in the setup script):

source <your_catkin_ws>/devel/setup.bash

Then start Ray (let <head_ip> be the head node IP):

# GPU node (rank 0, head)
export RLINF_NODE_RANK=0
ray start --head --port=6379 --node-ip-address=<head_ip>

# Franka control node (rank 1)
export RLINF_NODE_RANK=1
ray start --address=<head_ip>:6379

Important

ray start freezes the Python interpreter and environment variables at launch time. Complete venv, ROS, and PYTHONPATH setup on every node before starting Ray.

End-to-End Workflow (PnP / π₀)#

Step 1: Start the Ray cluster

Start Ray on all nodes as above and confirm both nodes appear in ray status.

Step 2: Prepare the model

π₀ PnP evaluation requires:

rollout.model.model_path: Pi0 base model directory containing <repo_id>/norm_stats.json (generated during SFT prep; see Running Pi0 SFT with Franka)
runner.ckpt_path: SFT-exported full_weights.pt

Step 3: Edit the config

Update evaluations/realworld/realworld_pnp_eval.yaml at minimum:

cluster:
  node_groups:
    - label: "4090"
      node_ranks: 0          # GPU node rank
    - label: franka
      node_ranks: 1          # Franka control node rank
      hardware:
        type: Franka
        configs:
          - robot_ip: ROBOT_IP
            node_rank: 1

runner:
  ckpt_path: /path/to/full_weights.pt

env:
  eval:
    rollout_epoch: 20
    override_cfg:
      target_ee_pose: [0.50, 0.00, 0.01, 3.14, 0.0, 0.0]
      camera_serials: ["CAMERA_SERIAL_1", "CAMERA_SERIAL_2"]
      task_description: "pick up the object and place it into the container"

rollout:
  model:
    model_path: /path/to/pi0-model
    openpi:
      config_name: "pi0_realworld"

node_ranks and component_placement must match the actual RLINF_NODE_RANK on each machine.

Step 4: Launch evaluation

On the GPU / head node:

bash evaluations/run_eval.sh realworld_pnp_eval

You can also override fields via Hydra without editing the YAML:

bash evaluations/run_eval.sh realworld_pnp_eval \
  rollout.model.model_path=/path/to/pi0-model \
  runner.ckpt_path=/path/to/full_weights.pt

Step 5: Check results

The policy runs with runner.only_eval: True; task metrics appear in the terminal. Logs are described in Logs and Results.

Evaluation Config Reference#

The following applies to realworld_pnp_eval.yaml; custom-task evaluation is covered below, and DreamZero in DreamZero Supervised Fine-Tuning and Franka Real-World Deployment.

Required fields#

Field	Location	Notes
`robot_ip`	`cluster.node_groups[].hardware.configs`	Franka robot IP
`node_ranks` / `component_placement`	`cluster`	Must match `RLINF_NODE_RANK`; env on Franka node, rollout on GPU node
`target_ee_pose`	`env.eval.override_cfg`	PnP target pose `[x,y,z,rx,ry,rz]`; affects success checks and motion clipping
`camera_serials`	`env.eval.override_cfg`	RealSense serial list (not a `node_groups` field)
`task_description`	`env.eval.override_cfg`	Language instruction; must match SFT training
`model_path`	`rollout.model`	Pi0 base model directory (with `norm_stats.json`)
`ckpt_path`	`runner`	SFT checkpoint (`full_weights.pt`)
`config_name`	`rollout.model.openpi`	`"pi0_realworld"` for PnP

Key `env.eval` fields#

Field	Notes
`rollout_epoch`	Number of evaluation rounds; default 20
`max_episode_steps`	Max steps per trajectory; default 200
`max_steps_per_rollout_epoch`	Steps per rollout round; must be divisible by `rollout.model.num_action_chunks` (default 10 for PnP)
`total_num_envs`	Parallel env count; typically 1 on real hardware
`use_spacemouse`	Enable spacemouse intervention; usually `False` for eval

`run_eval.sh` behavior#

The realworld benchmark does not call setup_sim_env (no simulation env vars needed)
Logs go to logs/<timestamp>-<config_name>/eval_embodiment.log
Submit the eval entry script on the head (GPU) node

For DreamZero real-robot evaluation (realworld_pnp_eval_dreamzero.yaml), see DreamZero Supervised Fine-Tuning and Franka Real-World Deployment.

Custom Real-Robot Task Evaluation#

This section is for tasks you define yourself, not the built-in Bin-relocation (PnP) task above.

RLinf ships a generic real-robot environment FrankaEnv-v1: set task_description, goal/reset poses, and workspace limits in YAML—no new Python env class required. It is commonly used to deploy policies trained with supervised fine-tuning (SFT) on your own demonstration data (see Running Pi0 SFT with Franka). The env template is examples/embodiment/config/env/realworld_franka_sft_env.yaml; the eval config is evaluations/realworld/realworld_eval.yaml.

Node topology

realworld_eval.yaml is single-node: both env and rollout run on the Franka node (num_nodes: 1), suitable when GPU and control share one machine.

Key `override_cfg` fields#

override_cfg:
  task_description: "pick up the object and place it into the container"
  target_ee_pose: [0.5, 0.0, 0.1, -3.14, 0.0, 0.0]   # goal pose
  reset_ee_pose:  [0.5, 0.0, 0.2, -3.14, 0.0, 0.0]    # reset pose (above goal)
  max_num_steps: 300
  reward_threshold: [0.01, 0.01, 0.01, 0.2, 0.2, 0.2]
  action_scale: [1.0, 1.0, 1.0]
  ee_pose_limit_min: [0.4, -0.2, 0.05, -3.64, -0.5, -0.5]
  ee_pose_limit_max: [0.6,  0.2, 0.35, -2.64,  0.5,  0.5]

Launch evaluation

Replace ROBOT_IP and MODEL_PATH, then run:

bash evaluations/run_eval.sh realworld_eval

For data collection, SFT training, and deployment on custom tasks, see Running Pi0 SFT with Franka.

Dual Franka Deployment#

Dual Franka SFT deployment reuses the unified evaluation launcher with the fallback config examples/embodiment/config/realworld_eval_dual_franka.yaml. Set rollout.model.model_path to the staged checkpoint directory and actor.model.openpi_data.repo_id to the repo id that contains norm_stats.json.

bash evaluations/run_eval.sh realworld_eval_dual_franka \
    rollout.model.model_path=/path/to/deploy/global_step_<N> \
    actor.model.openpi_data.repo_id=<repo_id>/tcp_rot6d_v1 \
    env.eval.override_cfg.task_description="handover the object"

For the full collection, SFT, checkpoint staging, and pedal-control workflow, see Using Dual Franka.

Viewing Results#

Terminal metrics: task success rate, return, etc. (exact fields vary by environment)
Logs: logs/<timestamp>-<config_name>/eval_embodiment.log
Videos: when env.eval.video_cfg.save_video: True, videos go to video_base_dir or <log_path>/video/eval/

See Logs and Results for details.

FAQ#

Safety: Verify workspace limits and emergency stop before evaluation; use a small rollout_epoch on the first run.
Node topology: env workers must run on nodes with direct Franka access; node_ranks must match RLINF_NODE_RANK. PnP and Dual Franka use two nodes; custom-task eval uses one.
Cameras not found: Run python -m toolkits.realworld_check.test_franka_camera on the control node and verify camera_serials.
Abnormal actions: Check that norm_stats.json is under model_path/<repo_id>/ and that openpi.config_name matches training.
Ray shows only one node: Check firewall rules, RLINF_COMM_NET_DEVICES, and that the head IP is reachable from other nodes.
Step-count errors: Ensure max_steps_per_rollout_epoch is divisible by num_action_chunks.
ZED / Robotiq: See Using ZED + Robotiq with Franka.