RL with Franka-Sim Benchmark#
This document is a complete guide to launching and managing Vision-Language-Action (VLA) model training tasks in the RLinf framework. It also explains how to fine-tune a VLA model in the Franka-Sim simulation environment to perform robotic manipulation tasks.
The main goal is to enable the model to acquire the following capabilities:
Visual understanding: process RGB images captured from robot cameras;
Language understanding: interpret natural language task descriptions;
Action generation: produce accurate robot actions (position, rotation, gripper control);
Reinforcement learning: optimize policies with PPO using environment feedback.
Environment#
The Franka-Sim environments are built on top of the serl project. Two minimal Franka-Sim simulation tasks are provided:
PandaPickCube-v0
PandaPickCubeVision-v0
Task Definition#
Task: control a Franka Panda robot arm to pick up a cube and move it to a target position;
Observation:
PandaPickCube-v0: proprioceptive states + target position;
PandaPickCubeVision-v0: multi-view RGB images (third-person + wrist camera) + proprioceptive states;
Action Space: 4D continuous actions
3D end-effector position control (x, y, z)
gripper control (open/close)
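To make the 4D action layout concrete, here is a minimal sketch of how a raw policy output could be clipped and split into named fields. The `PandaAction` container, the `decode_action` helper, and the `[-1, 1]` range are assumptions made for illustration; they are not part of RLinf or serl, and the source does not say whether the position values are absolute targets or deltas.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PandaAction:
    """Hypothetical container for one 4D action (not an RLinf or serl type)."""

    x: float
    y: float
    z: float
    gripper: float


def decode_action(
    raw: Sequence[float], low: float = -1.0, high: float = 1.0
) -> PandaAction:
    """Clip a 4D policy output to [low, high] and split it into named fields.

    The [-1, 1] default range is an assumption about the normalization.
    """
    if len(raw) != 4:
        raise ValueError(f"expected 4 action values, got {len(raw)}")
    clipped = [min(max(float(v), low), high) for v in raw]
    return PandaAction(*clipped)


# The out-of-range y value is clipped to the lower bound.
action = decode_action([0.2, -1.7, 0.0, 1.0])
print(action)
```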
Data Structure#
PandaPickCube-v0
States: proprioceptive states and target location
end-effector 3D position
end-effector 3D velocity
gripper open/close state (1D)
cube 3D position
PandaPickCubeVision-v0
Images: RGB tensors from a third-person view and a wrist camera view
States: proprioceptive states
end-effector 3D position
end-effector 3D velocity
gripper open/close state (1D)
Task Descriptions: natural language instructions
Actions: normalized continuous action values
Rewards: dense rewards based on task progress
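As a rough illustration of the PandaPickCube-v0 state layout listed above, the sketch below packs the four listed fields into one flat vector. The `PickCubeState` class, the field order, and the example values are hypothetical; the actual environment may expose observations in a different structure (for example, a dictionary).

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class PickCubeState:
    """Hypothetical state container mirroring the fields listed above."""

    ee_pos: Vec3    # end-effector 3D position
    ee_vel: Vec3    # end-effector 3D velocity
    gripper: float  # gripper open/close state (1D)
    cube_pos: Vec3  # cube 3D position

    def flatten(self) -> List[float]:
        """Concatenate all fields into one flat list (order is an assumption)."""
        return [*self.ee_pos, *self.ee_vel, self.gripper, *self.cube_pos]


state = PickCubeState(
    ee_pos=(0.3, 0.0, 0.5),
    ee_vel=(0.0, 0.0, 0.0),
    gripper=1.0,
    cube_pos=(0.4, 0.1, 0.02),
)
flat = state.flatten()
print(len(flat))  # 3 + 3 + 1 + 3 = 10
```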
Algorithms#
The core algorithm components include:
PPO (Proximal Policy Optimization)
GAE (Generalized Advantage Estimation) for advantage estimation;
policy clipping with ratio constraints;
value function clipping;
entropy regularization.
SAC (Soft Actor-Critic)
Learning Q-values by Bellman backups and entropy regularization.
Learning policy to maximize entropy-regularized Q.
Learning temperature parameter for exploration-exploitation trade-off.
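The components listed above can be sketched in a few lines of plain Python. This is a textbook-style illustration, not RLinf's implementation: the function names, default hyperparameters (`gamma`, `lam`, `clip_eps`, `alpha`), and the list-based interface are all assumptions, and value clipping, the entropy bonus, and temperature learning are left out for brevity.

```python
import math
from typing import List, Sequence


def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    dones: Sequence[bool],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation over a single trajectory."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping stops at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_loss(
    log_ratios: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped surrogate loss (negated objective, to be minimized)."""
    losses = []
    for log_ratio, adv in zip(log_ratios, advantages):
        ratio = math.exp(log_ratio)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


def soft_q_target(
    reward: float,
    done: bool,
    next_q: float,
    next_log_prob: float,
    alpha: float = 0.2,
    gamma: float = 0.99,
) -> float:
    """SAC Bellman target with the entropy term weighted by temperature alpha."""
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * (next_q - alpha * next_log_prob)
```

With `gamma=1` and `lam=1`, `gae` reduces to the Monte Carlo return minus the value estimate, which is a convenient sanity check.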
Dependency Installation#
1. Clone the RLinf repository#
# For faster downloads in mainland China (optional):
# git clone https://ghfast.top/github.com/RLinf/RLinf.git
git clone https://github.com/RLinf/RLinf.git
cd RLinf
2. Install dependencies#
Option 1: Docker image#
Run experiments using the official Docker image:
docker run -it --rm --gpus all \
    --shm-size 20g \
    --network host \
    --name rlinf \
    -v .:/workspace/RLinf \
    rlinf/rlinf:agentic-rlinf0.2-frankasim
# For faster Docker pulls in mainland China (optional):
# docker.1ms.run/rlinf/rlinf:agentic-rlinf0.2-frankasim
Option 2: Custom environment#
# To accelerate dependency downloads in China, append --use-mirror to install.sh
bash requirements/install.sh embodied --model openvla --env frankasim
source .venv/bin/activate
Model Download#
If you are training the CNN policy, first download the provided ResNet checkpoint. Skip this section if you are training the MLP policy.
ResNet Checkpoint Download
# Download the ResNet checkpoint (choose either method)
# Method 1: Using git clone
git lfs install
git clone https://huggingface.co/RLinf/RLinf-ResNet10-pretrained
# Method 2: Using huggingface-hub
# For mainland China users, you can use the following for better download speed:
# export HF_ENDPOINT=https://hf-mirror.com
pip install huggingface-hub
hf download RLinf/RLinf-ResNet10-pretrained --local-dir RLinf-ResNet10-pretrained
After downloading, update actor.model.model_path and rollout.model.model_path in the config YAML so that both point to the downloaded model directory:
rollout:
  model:
    model_path: Pathto/RLinf/RLinf-ResNet10-pretrained
actor:
  model:
    model_path: Pathto/RLinf/RLinf-ResNet10-pretrained
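A quick consistency check can catch a config where only one of the two paths was updated. The sketch below works on an already-loaded config represented as nested dictionaries; `get_nested` and `check_model_paths` are hypothetical helpers, not RLinf functions, and the requirement that both paths match is an assumption based on the example above.

```python
from typing import Any, Mapping


def get_nested(cfg: Mapping[str, Any], dotted_key: str) -> Any:
    """Look up a dotted key such as 'actor.model.model_path' in nested dicts."""
    node: Any = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node


def check_model_paths(cfg: Mapping[str, Any]) -> str:
    """Return the shared model path, or raise if actor and rollout disagree."""
    actor_path = get_nested(cfg, "actor.model.model_path")
    rollout_path = get_nested(cfg, "rollout.model.model_path")
    if actor_path != rollout_path:
        raise ValueError(
            f"actor ({actor_path}) and rollout ({rollout_path}) paths differ"
        )
    return actor_path


# Placeholder path copied from the example config; replace with a real one.
cfg = {
    "rollout": {"model": {"model_path": "Pathto/RLinf/RLinf-ResNet10-pretrained"}},
    "actor": {"model": {"model_path": "Pathto/RLinf/RLinf-ResNet10-pretrained"}},
}
print(check_model_paths(cfg))
```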
Running the Script#
1. Key configuration parameters#
Example 1: Pipeline overlap (recommended)#
cluster:
  num_nodes: 2
  component_placement:
    env: 0-7
    rollout: 8-15
    actor: 0-15
rollout:
  pipeline_stage_num: 2
This configuration enables pipeline overlap between rollout and env to increase throughput.
Example 2: Fully separated (no interference, usually no offload needed)#
cluster:
  num_nodes: 2
  component_placement:
    env: 0-3
    rollout: 4-7
    actor: 8-15
This configuration isolates env, rollout, and actor on different GPU groups, so offload is usually unnecessary.
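The difference between the two placements is easiest to see by expanding the GPU ranges and checking which components share devices. The helpers below are a hypothetical illustration and do not reflect how RLinf parses `component_placement` internally; they only handle the simple `start-end` form used in the examples.

```python
from typing import Dict, Set, Tuple


def parse_gpu_range(spec: str) -> Set[int]:
    """Parse '0-7' (inclusive) or a single index like '3' into GPU indices."""
    start, _, end = spec.partition("-")
    first = int(start)
    last = int(end) if end else first
    if last < first:
        raise ValueError(f"invalid GPU range: {spec!r}")
    return set(range(first, last + 1))


def shared_gpus(placement: Dict[str, str]) -> Dict[Tuple[str, str], Set[int]]:
    """Return the GPUs shared by each pair of components, omitting empty pairs."""
    parsed = {name: parse_gpu_range(spec) for name, spec in placement.items()}
    names = sorted(parsed)
    overlaps: Dict[Tuple[str, str], Set[int]] = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            common = parsed[first] & parsed[second]
            if common:
                overlaps[(first, second)] = common
    return overlaps


overlap_cfg = {"env": "0-7", "rollout": "8-15", "actor": "0-15"}
separated_cfg = {"env": "0-3", "rollout": "4-7", "actor": "8-15"}

# In the pipeline-overlap layout the actor shares GPUs with env and rollout;
# in the fully separated layout no pair of components shares a GPU.
print(sorted(shared_gpus(overlap_cfg)))
print(shared_gpus(separated_cfg))
```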
2. Launch command#
After selecting a configuration, start training from the repository root directory:
bash examples/embodiment/run_embodiment.sh CHOSEN_CONFIG
The Franka-Sim environment currently supports training an MLP policy with PPO or a CNN policy with SAC:
bash examples/embodiment/run_embodiment.sh frankasim_ppo_mlp
bash examples/embodiment/run_async.sh frankasim_sac_cnn_async
Visualization and Results#
1. TensorBoard logs#
tensorboard --logdir ./logs --port 6006
2. Key metrics to monitor#
Training metrics#
train/actor/approx_kl: approximate KL divergence, used to monitor policy update magnitude
train/actor/clip_fraction: fraction of samples affected by PPO clipping
train/actor/clipped_ratio: mean clipped probability ratio
train/actor/grad_norm: gradient norm
train/actor/lr: learning rate
train/actor/policy_loss: policy loss
train/critic/value_loss: value function loss
train/critic/value_clip_ratio: fraction of samples affected by value clipping
train/critic/explained_variance: value fit quality, closer to 1 is better
train/entropy_loss: policy entropy
train/loss: total loss (actor + critic + entropy regularization)
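For readers unfamiliar with these diagnostics, the sketch below shows common textbook definitions of three of them. These formulas are assumptions for illustration; RLinf's exact estimators (for example, which KL approximation it uses) may differ, and the function names are hypothetical.

```python
import math
from statistics import pvariance
from typing import Sequence


def approx_kl(old_logp: Sequence[float], new_logp: Sequence[float]) -> float:
    """Simple KL estimate: mean of (old log-prob minus new log-prob)."""
    diffs = [o - n for o, n in zip(old_logp, new_logp)]
    return sum(diffs) / len(diffs)


def clip_fraction(
    old_logp: Sequence[float], new_logp: Sequence[float], clip_eps: float = 0.2
) -> float:
    """Fraction of samples whose probability ratio leaves [1 - eps, 1 + eps]."""
    ratios = [math.exp(n - o) for o, n in zip(old_logp, new_logp)]
    clipped = [abs(r - 1.0) > clip_eps for r in ratios]
    return sum(clipped) / len(clipped)


def explained_variance(values: Sequence[float], returns: Sequence[float]) -> float:
    """1 - Var(returns - values) / Var(returns); 1.0 means a perfect value fit."""
    var_returns = pvariance(returns)
    if var_returns == 0:
        return float("nan")  # undefined when all returns are identical
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


# A value function that predicts the returns exactly scores 1.0, while a
# constant prediction explains none of the variance and scores 0.0.
print(explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(explained_variance([2.0, 2.0, 2.0], [1.0, 2.0, 3.0]))
```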
Rollout metrics#
rollout/advantages_max: maximum advantage
rollout/advantages_mean: mean advantage
rollout/advantages_min: minimum advantage
rollout/rewards: reward statistics per chunk
Environment metrics#
env/episode_len: episode length (steps)
env/return: total episode return (less informative for sparse rewards)
env/reward: step-level reward
env/success_once: recommended metric, reflects unnormalized success rate
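One plausible reading of env/success_once is "the episode reached the success condition on at least one step". The sketch below computes that quantity and the episode return from per-step data under this assumption. The helper names and the input format are hypothetical, and RLinf's actual aggregation may differ.

```python
from typing import Sequence


def episode_return(step_rewards: Sequence[float]) -> float:
    """Total return of one episode: the sum of its step-level rewards."""
    return float(sum(step_rewards))


def success_once(step_success: Sequence[bool]) -> bool:
    """True if the success flag was raised on at least one step (assumed meaning)."""
    return any(step_success)


def success_once_rate(episodes: Sequence[Sequence[bool]]) -> float:
    """Fraction of episodes that succeeded at least once."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(success_once(ep) for ep in episodes) / len(episodes)


# Three episodes of per-step success flags; two of them succeed at some step.
flags = [[False, True, False], [False, False], [True]]
print(success_once_rate(flags))
print(episode_return([0.1, 0.2, 0.5]))
```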
3. Video generation#
Video generation is currently supported only in PandaPickCubeVision-v0:
env:
  eval:
    video_cfg:
      save_video: True
      video_base_dir: ${runner.logger.log_path}/video/eval
4. Logging backend integration#
runner:
  task_type: embodied
  logger:
    log_path: "../results"
    project_name: rlinf
    experiment_name: "maniskill_ppo_openvla"
    logger_backends: ["tensorboard"] # wandb, swanlab
Simulation Results#
The following shows the training curve of asynchronous SAC+CNN in the simulation environment. Within one hour, the policy learns to grasp successfully and remains stable thereafter.
Success rate curve