Implicit Q-Learning (IQL) Algorithm
1. Introduction
Implicit Q-Learning (IQL) is a widely used algorithm for offline reinforcement learning (Offline RL). It learns a high-quality policy from a fixed dataset without ever evaluating the Q-function on actions that do not appear in that dataset.
The key idea of IQL is to decouple value learning from policy learning:
Value (State Value Function): learns a value function biased toward high-return regions via expectile regression.
Critic (Q-value Function): fits action values with Bellman regression.
Actor (Policy Model): updates the policy with advantage-weighted behavior cloning (Advantage-Weighted BC).
This design avoids aggressive extrapolation on out-of-distribution (OOD) actions and is stable on offline benchmarks such as D4RL.
For more details, see the original IQL paper, "Offline Reinforcement Learning with Implicit Q-Learning" (Kostrikov et al.).
2. Objective Function
Let the state value function be \(V_{\psi}(s)\), the Q function be \(Q_{\phi}(s, a)\), and the policy be \(\pi_{\theta}(a|s)\). IQL is usually trained with the following three objectives:
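(1) Q-function regression
\[
\mathcal{L}_Q(\phi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma V_{\psi}(s') - Q_{\phi}(s, a)\right)^2\right]
\]
where \(\mathcal{D}\) is the offline dataset, \(r\) is the reward, \(s'\) is the next state, and \(\gamma\) is the discount factor.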
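As a rough illustration of objective (1), the sketch below computes the mean squared TD error between \(Q(s, a)\) and the target \(r + \gamma V(s')\) on a toy batch in plain Python. The tabular `Q` and `V` dictionaries, the `q_loss` helper, and the `done` mask that drops the bootstrap term on terminal transitions are assumptions made for this example; they are not taken from RLinf.

```python
def q_loss(batch, q_fn, v_fn, discount=0.99):
    """Mean squared TD error over a batch of (s, a, r, s_next, done) tuples."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # The target bootstraps from V(s'), never from Q(s', a') for an unseen a'.
        # Assumption: terminal transitions (done = 1.0) do not bootstrap.
        target = r + discount * (1.0 - done) * v_fn(s_next)
        total += (target - q_fn(s, a)) ** 2
    return total / len(batch)


# Hypothetical tabular stand-ins for the V and Q networks.
V = {0: 1.0, 1: 2.0}
Q = {(0, 0): 1.5, (1, 1): 0.5}

# Each entry is (state, action, reward, next_state, done).
batch = [(0, 0, 1.0, 1, 0.0), (1, 1, 0.5, 0, 1.0)]

loss = q_loss(batch, lambda s, a: Q[(s, a)], lambda s: V[s])
print(f"Q loss: {loss:.4f}")
```

Because the target only uses \(V(s')\), the regression touches state-action pairs that are present in the dataset and nothing else.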
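(2) Expectile regression for Value
\[
\mathcal{L}_V(\psi) = \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[\rho_{\tau}\left(Q_{\phi}(s, a) - V_{\psi}(s)\right)\right]
\]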
where \(\rho_{\tau}(u)=|\tau-\mathbb{I}(u<0)|u^2\), and \(\tau\) is the expectile coefficient (e.g., 0.7).
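To make the asymmetry of \(\rho_{\tau}\) concrete, here is a minimal sketch in plain Python. It implements \(\rho_{\tau}(u)\) as defined above and finds the \(\tau\)-expectile of a handful of sample Q-values by fixed-point iteration. The helper names `expectile_loss` and `expectile`, the sample values, and the fixed-point solver are illustrative assumptions; in training, the same minimum would be approached by gradient descent on \(V_{\psi}\).

```python
def expectile_loss(u, tau=0.7):
    """rho_tau(u) = |tau - 1(u < 0)| * u^2."""
    weight = (1.0 - tau) if u < 0 else tau
    return weight * u * u


def expectile(samples, tau=0.7, iters=100):
    """Find v minimizing sum_i rho_tau(q_i - v) by fixed-point iteration."""
    v = sum(samples) / len(samples)  # start from the mean (the tau = 0.5 solution)
    for _ in range(iters):
        # Samples above the current estimate get weight tau, samples below get 1 - tau.
        weights = [(1.0 - tau) if q - v < 0 else tau for q in samples]
        v = sum(w * q for w, q in zip(weights, samples)) / sum(weights)
    return v


# Hypothetical Q-values of several dataset actions taken in the same state.
q_samples = [0.0, 1.0, 2.0, 3.0, 4.0]

v_mean = expectile(q_samples, tau=0.5)  # symmetric loss recovers the mean
v_high = expectile(q_samples, tau=0.7)  # tau > 0.5 pulls the estimate upward
print(v_mean, v_high)
```

With \(\tau = 0.5\) the loss is an ordinary squared error and the estimate is the mean. Raising \(\tau\) penalizes underestimation more than overestimation, so the estimate moves toward the better actions seen in the data.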
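(3) Advantage-weighted behavior cloning for Actor
\[
\mathcal{L}_{\pi}(\theta) = -\mathbb{E}_{(s, a) \sim \mathcal{D}}\left[\exp\left(\beta\left(Q_{\phi}(s, a) - V_{\psi}(s)\right)\right)\log \pi_{\theta}(a|s)\right]
\]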
where \(\beta\) is the temperature coefficient that controls the scale of advantage weights.
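The effect of \(\beta\) on the behavior-cloning weights can be sketched as follows. The weight of each dataset action is \(\exp(\beta (Q - V))\), and the actor loss is the negative weighted log-likelihood. The helper names, the toy numbers, and the upper clip `max_weight=100.0` are assumptions for this example; clipping is a common stabilization trick, and whether RLinf applies it is not stated here.

```python
import math


def advantage_weights(q_values, v_values, beta=3.0, max_weight=100.0):
    """exp(beta * (Q - V)) per sample, clipped from above (the clip is an assumption)."""
    return [
        min(math.exp(beta * (q - v)), max_weight)
        for q, v in zip(q_values, v_values)
    ]


def actor_loss(log_probs, weights):
    """Negative advantage-weighted log-likelihood of the dataset actions."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(weights)


# Hypothetical values for three dataset transitions.
q_values = [1.0, 0.5, 3.0]
v_values = [1.0, 1.0, 1.0]
log_probs = [-1.0, -1.0, -1.0]  # log pi(a|s) under the current policy

weights = advantage_weights(q_values, v_values)
loss = actor_loss(log_probs, weights)
print(weights, loss)
```

Actions with zero advantage keep a weight of 1, actions with negative advantage are down-weighted, and actions with large positive advantage dominate the update until the clip takes effect. A larger \(\beta\) makes the policy more selective, while \(\beta \to 0\) recovers plain behavior cloning.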
3. Configuration
In RLinf, IQL can be used for offline embodied tasks (e.g., the D4RL benchmark datasets).
Using d4rl_iql_mujoco.yaml as an example, the key configuration is:
algorithm:
  loss_type: "offline_iql"
  batch_size: 256
  actor_lr: 3.0e-4
  value_lr: 3.0e-4
  critic_lr: 3.0e-4
  hidden_dims: [256, 256]
  discount: 0.99
  tau: 0.005
  expectile: 0.7
  temperature: 3.0
  gamma: 0.99
  dropout_rate: null
  opt_decay_schedule: "cosine"
env:
  dataset_type: "d4rl"
  train:
    env_type: "d4rl"
    env_name: "halfcheetah-medium-v2"
actor:
  worker_cls: "iql"
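Two of these keys are easy to misread, so a small sketch may help. It assumes, following the reference IQL implementation rather than anything confirmed for RLinf, that `tau` is the soft-update (Polyak) rate of a target Q-network and not the expectile coefficient \(\tau\) from Section 2 (which is set by `expectile`), and that `opt_decay_schedule: "cosine"` means a cosine learning-rate decay. The helpers `soft_update` and `cosine_lr` and the value of `total_steps` are hypothetical.

```python
import math


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def cosine_lr(base_lr, step, total_steps):
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))


# Toy parameters standing in for network weights.
target = soft_update([0.0, 2.0], [1.0, 2.0], tau=0.005)

total_steps = 1000  # hypothetical training length
lr_start = cosine_lr(3.0e-4, 0, total_steps)
lr_mid = cosine_lr(3.0e-4, total_steps // 2, total_steps)
lr_end = cosine_lr(3.0e-4, total_steps, total_steps)
print(target, lr_start, lr_mid, lr_end)
```

Under these assumptions, a small `tau` such as 0.005 makes the target network trail the online Q-network slowly, which keeps the regression targets for the value function from shifting abruptly.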