Evaluation Tutorial 1: Embodied VLA#
Introduction#
RLinf provides out-of-the-box evaluation scripts to evaluate the performance of embodied agents in both in-distribution and out-of-distribution tasks. List of currently supported evaluation environments:
All startup scripts for evaluation are located in the examples/embodiment/ directory.
Quick Start#
Evaluation Launch Command
# run eval:
bash examples/embodiment/eval_embodiment.sh libero_10_grpo_openvlaoft_eval
Key YAML Configuration Fields
Any YAML file can be used for evaluation with the eval_embodiment.sh script, provided it includes the relevant env.eval configuration. Taking examples/embodiment/config/libero_10_grpo_openvlaoft_eval.yaml as an example here, you can modify the following in the configuration file as needed:
Adjust model path : Modify the following two parameters to load the model to be evaluated:
rollout.model.model_path
runner.ckpt_path(Optional) –.ptformat file path, set this parameter if you want to evaluate a specific checkpoint.
Note
If you need to convert checkpoint files from .distcp format to .pt format, please refer to the Checkpoint Convertor documentation for detailed instructions.
Control environment random seed: You can adjust
env.seedto change the environment’s random function for result reproducibility, etc.
Note
When multiple workers launch environments, the seeds in different workers have a fixed offset: seed = seed + self._rank * self.stage_num + stage_id.
Adjust evaluation epochs: We can adjust
algorithm.eval_rollout_epochto control the number of evaluation epochs. Note that we assume each epoch should complete the evaluation of the entire test set. Furthermore, since the random seeds are identical for each evaluation, the final evaluation result is equivalent to the average result of the Policy evaluated over multiple rounds on the same test set.Adjust the number of evaluation environments per epoch: To fully evaluate the entire test set (for example, Libero-10 has 500 initial states while Libero-90 has 4500 initial states), we can adjust the number of loaded environments in the following two ways:
The first method is to increase
env.eval.total_num_envsto control the number of environments loaded in parallel (distributed evenly across all workers). However, this can easily lead to OOM (Out Of Memory) issues in resource-constrained settings, for instance, if you only have a single 40GB GPU;Therefore, we offer an alternative: enable
env.eval.auto_reset=Trueand then adjustmax_steps_per_rollout_epochto be N times the value ofmax_episode_steps. In this case, the total number of environments in eacheval_rollout_epochevaluated will beN * env.eval.total_num_envs;
Adjust max interaction steps per trajectory: You can adjust
env.eval.max_episode_stepsto control the maximum number of interaction steps in a single trajectory.Record environment video: You can set
env.eval.video_cfg.save_video=Trueto record videos of the environment during evaluation.Control evaluation sampling method: You can adjust
algorithm.sampling_paramsto control the sampling method during rollout evaluation.Eval is set True: You can set the value
cfg.runner.only_eval=True, and we also set it automatically ineval_embodied_agent.pywhich is called byeval_embodimen.sh.
Evaluation Launch Script
#! /bin/bash
export EMBODIED_PATH="$( cd "$(dirname "${BASH_SOURCE[0]}" )" && pwd )"
export REPO_PATH=$(dirname $(dirname "$EMBODIED_PATH"))
export SRC_FILE="${EMBODIED_PATH}/eval_embodied_agent.py"
export MUJOCO_GL="osmesa"
export PYOPENGL_PLATFORM="osmesa"
export PYTHONPATH=${REPO_PATH}:$PYTHONPATH
# Base path to the BEHAVIOR dataset, which is the BEHAVIOR-1k repo's dataset folder
# Only required when running the behavior experiment.
export OMNIGIBSON_DATA_PATH=$OMNIGIBSON_DATA_PATH
export OMNIGIBSON_DATASET_PATH=${OMNIGIBSON_DATASET_PATH:-$OMNIGIBSON_DATA_PATH/behavior-1k-assets/}
export OMNIGIBSON_KEY_PATH=${OMNIGIBSON_KEY_PATH:-$OMNIGIBSON_DATA_PATH/omnigibson.key}
export OMNIGIBSON_ASSET_PATH=${OMNIGIBSON_ASSET_PATH:-$OMNIGIBSON_DATA_PATH/omnigibson-robot-assets/}
export OMNIGIBSON_HEADLESS=${OMNIGIBSON_HEADLESS:-1}
# Base path to Isaac Sim, only required when running the behavior experiment.
export ISAAC_PATH=${ISAAC_PATH:-/path/to/isaac-sim}
export EXP_PATH=${EXP_PATH:-$ISAAC_PATH/apps}
export CARB_APP_PATH=${CARB_APP_PATH:-$ISAAC_PATH/kit}
export HYDRA_FULL_ERROR=1
if [ -z "$1" ]; then
CONFIG_NAME="maniskill_ppo_openvlaoft"
else
CONFIG_NAME=$1
fi
LOG_DIR="${REPO_PATH}/logs/$(date +'%Y%m%d-%H:%M:%S')" #/$(date +'%Y%m%d-%H:%M:%S')"
MEGA_LOG_FILE="${LOG_DIR}/eval_embodiment.log"
mkdir -p "${LOG_DIR}"
CMD="python ${SRC_FILE} --config-path ${EMBODIED_PATH}/config/ --config-name ${CONFIG_NAME} runner.logger.log_path=${LOG_DIR}"
echo ${CMD}
${CMD} 2>&1 | tee ${MEGA_LOG_FILE}
After finish the evaluation, Evaluation Results will be output to the terminal and log files:
[INFO 04:00:43 RLinf] {
'eval/success_once': array(0.11328125, dtype=float32),
'eval/return': array(0.43945312, dtype=float32),
'eval/episode_len': array(512., dtype=float32),
'eval/reward': array(0.00085831, dtype=float32),
'eval/success_at_end': array(0.08789062, dtype=float32)
}
The field success_once represents the success rate (i.e., completing the task at least once within a single episode trajectory).
If TensorBoard is enabled, these metrics will also be recorded in TensorBoard (TensorBoard is enabled by default).
Quick Start — ManiSkill3 OOD#
Note
Currently, only ManiSkill is supported for OOD test.
Launch Method
Modify variables such as EVAL_NAME, CKPT_PATH, and CONFIG_NAME (which can be set to maniskill_ppo_openvlaoft_quickstart for a quick test) in examples/embodiment/eval_mani_ood.sh.
Then, execute the following command in the terminal to start the evaluation.
bash examples/embodiment/eval_mani_ood.sh
Full Launch Script
#! /bin/bash
export HF_ENDPOINT=https://hf-mirror.com
export EMBODIED_PATH="$( cd "$(dirname "${BASH_SOURCE[0]}" )" && pwd )"
export REPO_PATH=$(dirname $(dirname "$EMBODIED_PATH"))
export SRC_FILE="${EMBODIED_PATH}/eval_embodied_agent.py"
export HYDRA_FULL_ERROR=1
EVAL_NAME=YOUR_EVAL_NAME
CKPT_PATH=YOUR_CKPT_PATH # Optional: .pt file or None, if None, will use the checkpoint in rollout.model.model_path
CONFIG_NAME=YOUR_CFG_NAME # env.eval must be maniskill_ood_template
TOTAL_NUM_ENVS=YOUR_TOTAL_NUM_ENVS # total number of evaluation environments
EVAL_ROLLOUT_EPOCH=YOUR_EVAL_ROLLOUT_EPOCH # eval rollout epoch, total_trajectory_num = eval_rollout_epoch * total_num_envs
for env_id in \
"PutOnPlateInScene25VisionImage-v1" "PutOnPlateInScene25VisionTexture03-v1" "PutOnPlateInScene25VisionTexture05-v1" \
"PutOnPlateInScene25VisionWhole03-v1" "PutOnPlateInScene25VisionWhole05-v1" \
"PutOnPlateInScene25Carrot-v1" "PutOnPlateInScene25Plate-v1" "PutOnPlateInScene25Instruct-v1" \
"PutOnPlateInScene25MultiCarrot-v1" "PutOnPlateInScene25MultiPlate-v1" \
"PutOnPlateInScene25Position-v1" "PutOnPlateInScene25EEPose-v1" "PutOnPlateInScene25PositionChangeTo-v1" ; \
do
obj_set="test"
LOG_DIR="${REPO_PATH}/logs/eval/${EVAL_NAME}/$(date +'%Y%m%d-%H:%M:%S')-${env_id}-${obj_set}"
MEGA_LOG_FILE="${LOG_DIR}/run_ppo.log"
mkdir -p "${LOG_DIR}"
CMD="python ${SRC_FILE} --config-path ${EMBODIED_PATH}/config/ \
--config-name ${CONFIG_NAME} \
runner.logger.log_path=${LOG_DIR} \
algorithm.eval_rollout_epoch=${EVAL_ROLLOUT_EPOCH} \
env.eval.total_num_envs=${TOTAL_NUM_ENVS} \
env.eval.init_params.id=${env_id} \
env.eval.init_params.obj_set=${obj_set} \
runner.ckpt_path=${CKPT_PATH}"
echo ${CMD} > ${MEGA_LOG_FILE}
${CMD} 2>&1 | tee -a ${MEGA_LOG_FILE}
done
for env_id in \
"PutOnPlateInScene25Carrot-v1" "PutOnPlateInScene25MultiCarrot-v1" \
"PutOnPlateInScene25MultiPlate-v1" ; \
do
obj_set="train"
LOG_DIR="${REPO_PATH}/logs/eval/${EVAL_NAME}/$(date +'%Y%m%d-%H:%M:%S')-${env_id}-${obj_set}"
MEGA_LOG_FILE="${LOG_DIR}/run_ppo.log"
mkdir -p "${LOG_DIR}"
CMD="python ${SRC_FILE} --config-path ${EMBODIED_PATH}/config/ \
--config-name ${CONFIG_NAME} \
runner.logger.log_path=${LOG_DIR} \
algorithm.eval_rollout_epoch=${EVAL_ROLLOUT_EPOCH} \
env.eval.total_num_envs=${TOTAL_NUM_ENVS} \
env.eval.init_params.id=${env_id} \
env.eval.init_params.obj_set=${obj_set} \
runner.ckpt_path=${CKPT_PATH}"
echo ${CMD} > ${MEGA_LOG_FILE}
${CMD} 2>&1 | tee -a ${MEGA_LOG_FILE}
done
This script first evaluates 13 Out-Of-Distribution (OOD) tasks, and then evaluates 3 In-Distribution (ID) tasks.
All logs are saved in logs/eval/<EVAL_NAME>/…/run_ppo.log.