MLP Policy Reinforcement Learning Training
This example demonstrates the complete workflow for training Reinforcement Learning (RL) agents using MLP (Multi-Layer Perceptron) policy networks within the RLinf framework.
The MLP policy is primarily designed for robotics control tasks utilizing low-dimensional state inputs. It supports training across various simulation environments, including ManiSkill3, FrankaSim, and Libero-Spatial.
The current configuration covers PPO-MLP, SAC-MLP, and GRPO-MLP algorithm setups, enabling rapid validation of environments, training pipelines, and network architectures.
The primary goal is to equip the model with the following capabilities:
State Understanding: Process low-dimensional proprioceptive data from the environment (joint angles, end-effector pose, object states, etc.).
Action Generation: Produce continuous control actions (end-effector position deltas, joint targets, gripper commands, etc.).
Reinforcement Learning: Optimize policies using PPO, SAC, or GRPO based on environmental feedback.
Environments
RLinf currently supports a diverse range of embodied intelligence environments. You can select different environment configurations via the defaults list using env/<env_name>@env.train and env/<env_name>@env.eval.
Specific parameters such as parallel environment count, episode length, reset protocols, and video recording can be overridden under the env.train / env.eval nodes.
Currently supported environments (covered in this example) include:
maniskill_pick_cube (ManiSkill3)
libero_spatial (LIBERO Spatial)
frankasim_pickcube_state (MuJoCo / FrankaSim)
You can also train on custom tasks by referencing specific environment configurations:
Reference the environment in the configuration file via defaults (training and evaluation can be specified separately).
ManiSkill3:
defaults:
- env/maniskill_pick_cube@env.train
- env/maniskill_pick_cube@env.eval
LIBERO Spatial:
defaults:
- env/libero_spatial@env.train
- env/libero_spatial@env.eval
FrankaSim:
defaults:
- env/frankasim_pickcube_state@env.train
- env/frankasim_pickcube_state@env.eval
Algorithms
Core Algorithm Components
PPO (Proximal Policy Optimization)
Adopts an on-policy Actor-Critic framework.
Uses GAE (Generalized Advantage Estimation) for advantage function estimation (set adv_type: gae).
Utilizes ratio clipping to constrain policy updates, with optional KL divergence constraints.
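The two PPO ingredients named above, GAE and ratio clipping, can be sketched in plain Python. This is a minimal illustration, not RLinf's implementation: the function names `compute_gae` and `ppo_clipped_loss` are hypothetical, and the `gamma`, `lam`, and `clip_eps` defaults are illustrative values rather than the framework's settings.

```python
import math


def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one trajectory using GAE.

    `values[t]` is the critic estimate at step t and `last_value` is the
    bootstrap estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # A terminal step cuts both the bootstrap and the advantage carry-over.
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate loss (a quantity to minimize)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the unclipped and clipped objectives.
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With `gamma = lam = 1` the advantages reduce to reward-to-go minus the value estimate, which is a convenient sanity check when debugging a new environment.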
SAC (Soft Actor-Critic)
Learns Q-values via Bellman backups and entropy regularization (off-policy).
Uses an MLP as the Actor policy network; ensure Q-related heads/structures are enabled in the configuration (add_q_head: True).
Supports Automatic Entropy Tuning via entropy_tuning (e.g., alpha_type: softplus) to balance exploration and exploitation.
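To make the entropy-tuning idea concrete, the sketch below parameterizes the temperature as `alpha = softplus(raw_alpha)` and takes one gradient step on the standard SAC temperature loss. It is an assumption about what a softplus-style `alpha_type` could look like, not RLinf's code: `alpha_step`, the learning rate, and the choice of target entropy are all hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    """Derivative of softplus, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def alpha_step(raw_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on the temperature parameter.

    Loss: alpha_loss = -alpha * mean(log_prob + target_entropy),
    with alpha = softplus(raw_alpha).
    Returns (new_raw_alpha, alpha, alpha_loss).
    """
    alpha = softplus(raw_alpha)
    gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    alpha_loss = -alpha * gap
    # Chain rule: d(alpha_loss)/d(raw_alpha) = -gap * softplus'(raw_alpha).
    grad = -gap * sigmoid(raw_alpha)
    return raw_alpha - lr * grad, alpha, alpha_loss
```

When the policy's entropy falls below the target (the mean log-probability is larger than `-target_entropy`), the step raises `alpha` and pushes the policy toward more exploration; in the opposite case it lowers `alpha`.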
GRPO (Group Relative Policy Optimization)
For each state/prompt, the policy generates G independent actions.
Uses the group average reward as a baseline to calculate the relative advantage of each action.
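The group-relative baseline described above can be sketched as follows. The helper name `group_relative_advantages` is hypothetical, and the optional division by the group standard deviation is a common GRPO convention that the text above does not confirm for RLinf, so it is exposed as a flag.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Advantages for G samples drawn from the same state/prompt.

    The group mean reward serves as the baseline. With `normalize=True`
    (an assumption, see the note above) the centered rewards are also
    divided by the group standard deviation.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    std = pstdev(rewards)
    return [c / (std + eps) for c in centered]
```

Note that if every sample in a group earns the same reward, all advantages are zero and that group contributes no policy gradient.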
Installation & Dependencies
To run in simulation environments, follow the instructions in Installation.
This configuration series uses Hydra’s searchpath to load external configuration directories via environment variables:
hydra.searchpath: file://${oc.env:EMBODIED_PATH}/config/
Please ensure that EMBODIED_PATH is correctly set and that dependencies/resources for ManiSkill3 / FrankaSim are installed.
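A quick pre-flight check can catch a missing EMBODIED_PATH before Hydra fails to resolve the search path. The helper below is a hypothetical convenience, not part of RLinf; it only mirrors the `file://${oc.env:EMBODIED_PATH}/config/` interpolation shown above.

```python
import os


def resolve_config_searchpath(environ=None):
    """Return (searchpath, config_dir_exists) derived from EMBODIED_PATH.

    `environ` defaults to the process environment; pass a dict to test.
    """
    environ = os.environ if environ is None else environ
    root = environ.get("EMBODIED_PATH")
    if not root:
        raise RuntimeError(
            "EMBODIED_PATH is not set; export it before launching training."
        )
    searchpath = f"file://{root}/config/"
    exists = os.path.isdir(os.path.join(root, "config"))
    return searchpath, exists
```

Calling `resolve_config_searchpath()` at the top of a launch script gives an early, readable error instead of a configuration lookup failure later on.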
Running Scripts
1. Configuration Files
RLinf provides several default MLP configurations covering different environments and algorithm settings:
ManiSkill + PPO + MLP: maniskill_ppo_mlp
ManiSkill + SAC + MLP: maniskill_sac_mlp
FrankaSim + PPO + MLP: franka_sim_ppo_mlp
2. Key Parameter Configuration
2.1 Model Parameters (Model)
The MLP model is introduced via model/mlp_policy@actor.model and can be overridden in different configurations. Key fields include:
model_type: "mlp_policy" # Use MLP policy network as actor (Multi-Layer Perceptron; fits low-dim state inputs)
model_path: ""
policy_setup: "panda-qpos" # Select action semantics and control mode; 'panda-qpos' usually implies joint space control (e.g., qpos/joint targets or deltas)
obs_dim: 42 # Input dimension of the state vector (must match environment state output)
action_dim: 8 # Output dimension of the action vector (must match environment action space)
num_action_chunks: 1 # Number of action chunks generated per forward pass
hidden_dim: 256 # Width/Channel size of MLP hidden layers
precision: "32" # Model parameter and computation precision
add_value_head: True # Whether to attach an additional value head to the policy network
is_lora: False # Whether to enable LoRA
lora_rank: 32 # LoRA rank dimension 'r'; only effective when is_lora=True
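To show how obs_dim, action_dim, hidden_dim, and add_value_head relate to each other, here is a dependency-free sketch of a small MLP actor. It is purely illustrative: the class `TinyMLPPolicy`, the two hidden layers, the tanh activation, and the initialization are assumptions, and the actual RLinf model may differ in all of these.

```python
import math
import random


class TinyMLPPolicy:
    """Hypothetical two-hidden-layer MLP with an optional value head."""

    def __init__(self, obs_dim=42, action_dim=8, hidden_dim=256,
                 add_value_head=True, seed=0):
        rng = random.Random(seed)

        def linear(n_in, n_out):
            # Uniform init scaled by fan-in; biases start at zero.
            scale = 1.0 / math.sqrt(n_in)
            w = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                 for _ in range(n_out)]
            return w, [0.0] * n_out

        self.obs_dim = obs_dim
        self.l1 = linear(obs_dim, hidden_dim)
        self.l2 = linear(hidden_dim, hidden_dim)
        self.action_head = linear(hidden_dim, action_dim)
        self.value_head = linear(hidden_dim, 1) if add_value_head else None

    @staticmethod
    def _apply(layer, x):
        w, b = layer
        return [sum(wi * xi for wi, xi in zip(row, x)) + bi
                for row, bi in zip(w, b)]

    def forward(self, obs):
        """Map a state vector to (action_mean, value or None)."""
        if len(obs) != self.obs_dim:
            raise ValueError(
                f"expected obs of length {self.obs_dim}, got {len(obs)}"
            )
        h = [math.tanh(v) for v in self._apply(self.l1, obs)]
        h = [math.tanh(v) for v in self._apply(self.l2, h)]
        action_mean = self._apply(self.action_head, h)
        value = None
        if self.value_head is not None:
            value = self._apply(self.value_head, h)[0]
        return action_mean, value
```

The length check in `forward` reflects the constraint noted in the configuration comments: obs_dim must match the environment's state output, and the action head width must match its action space.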
2.2 Cluster & Hardware Configuration (Cluster)
For real-robot training, a multi-node configuration is used, deploying the Actor/Policy on GPU servers and the Env/Robot on control machines (NUC/Industrial PC). For specific configurations, please refer to Real-World RL with Franka.
3. Launch Commands
ManiSkill (PPO-MLP)
bash examples/embodiment/run_embodiment.sh maniskill_ppo_mlp
ManiSkill (SAC-MLP)
bash examples/embodiment/run_embodiment.sh maniskill_sac_mlp
Libero-Spatial (GRPO-MLP)
bash examples/embodiment/run_embodiment.sh libero_spatial_0_grpo_mlp
FrankaSim (PPO-MLP)
bash examples/embodiment/run_embodiment.sh franka_sim_ppo_mlp
Visualization & Results
1. TensorBoard Logs
# Launch TensorBoard
tensorboard --logdir ../results
2. Key Monitoring Metrics
Environment Metrics:
env/episode_len: Actual environment steps taken in an episode (Unit: step).
env/return: Total cumulative return of the episode.
env/reward: Step-level reward signal.
env/success_once: Flag indicating if success was achieved at least once in the episode (if provided by environment).
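The relationship between these environment metrics can be illustrated with a small helper that summarizes one episode from its per-step rewards and success flags. `summarize_episode` is hypothetical, and reporting env/reward as the mean per-step reward is an assumption about the aggregation, which the metric description does not specify.

```python
def summarize_episode(rewards, successes):
    """Summarize one episode.

    rewards:   per-step reward values
    successes: per-step booleans from the environment's success signal
    """
    if not rewards:
        raise ValueError("episode must contain at least one step")
    total = sum(rewards)
    return {
        "env/episode_len": len(rewards),
        "env/return": total,
        # Assumption: the step-level reward is logged as a per-step mean.
        "env/reward": total / len(rewards),
        # True if any step in the episode reported success.
        "env/success_once": any(successes),
    }
```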
Training Metrics (SAC):
train/sac/critic_loss: Q-function loss.
train/sac/actor_loss: Policy loss.
train/sac/alpha_loss: Temperature parameter loss.
train/sac/alpha: Temperature parameter value.
train/replay_buffer/size: Replay buffer size.
Training Metrics (PPO):
Policy Loss
Value Loss
Approx KL / KL (Estimated KL Divergence)
Clip Frac (Ratio clipping proportion)
Entropy (Policy entropy)
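For readers interpreting the PPO curves, the sketch below shows one common way Approx KL and Clip Frac are computed from the log-probabilities of sampled actions under the old and updated policies. The function name and the particular KL estimator are assumptions; the exact formulas RLinf logs are not stated here.

```python
import math


def ppo_diagnostics(logp_new, logp_old, clip_eps=0.2):
    """Return (approx_kl, clip_frac) for a batch of sampled actions."""
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
    ratios = [math.exp(lr) for lr in log_ratios]
    n = len(ratios)
    # Estimator (r - 1) - log(r): non-negative per sample, zero when r == 1.
    approx_kl = sum((r - 1.0) - lr for r, lr in zip(ratios, log_ratios)) / n
    # Fraction of samples whose ratio falls outside the clipping range.
    clip_frac = sum(1 for r in ratios if abs(r - 1.0) > clip_eps) / n
    return approx_kl, clip_frac
```

A Clip Frac that stays near zero suggests the updates are small relative to the clipping range, while a persistently high value, together with a rising Approx KL, usually points to a learning rate or number of update epochs that is too aggressive.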