Multi-Node Ray Cluster Setup#
This guide explains how to start a Ray cluster across multiple machines and run RLinf training on top of it. It applies to all Ray-based RLinf workloads, including embodied RL, reasoning, and agents.
Prerequisites#
Before you begin, confirm that:
All nodes can reach the head at
<head_ip>:6379(or your chosen port).Every node has the same versions of Python, Ray (
ray>=2.47.0), and RLinf dependencies.cluster.num_nodesin your config matches the actual cluster size (see Basic Configuration).
Important
Ray freezes the Python interpreter path and environment variables when you run
ray start. All processes Ray launches on that node inherit that environment.
On each node, source your virtualenv and install dependencies before
ray start. Packages installed after ray start are not visible to Ray workers.
Step 1: Set environment variables on each node#
On every node, before ray start, set the node rank (required):
export RLINF_NODE_RANK=<0..N-1> # unique in the cluster; head is usually 0
If the machine has multiple NICs and Ray/collective communication should use a specific interface:
export RLINF_COMM_NET_DEVICES=<interface> # e.g. eth0, enp3s0
Use ip addr or ifconfig to find which interface is reachable from other nodes.
Step 2: Start the Ray cluster#
Option A: Manual ray start on each node#
Pick one machine as head (RLINF_NODE_RANK=0). Its IP must be reachable from all
workers; call it <head_ip>.
Head node:
export RLINF_NODE_RANK=0
# optional: export RLINF_COMM_NET_DEVICES=eth0
ray start --head --port=6379 --node-ip-address=<head_ip>
Worker nodes (RLINF_NODE_RANK = 1, 2, …, N-1):
export RLINF_NODE_RANK=1
# optional: export RLINF_COMM_NET_DEVICES=eth0
ray start --address='<head_ip>:6379'
--node-ip-address must be the address other nodes use to reach the head (private IP,
VPC IP, or overlay IP; see cloud platform notes below). Port 6379 can be changed if
unused; head and workers must use the same port.
Step 3: Enable code sync (optional)#
When the driver and workers do not share the same filesystem (cloud-edge,
heterogeneous datacenters, etc.), enable Ray task-level code sync before starting
the training script. The driver packs the rlinf/ package into runtime_env.py_modules;
workers do not need a matching local checkout.
export RLINF_CODE_WORKING_DIR=auto
Details:
RLINF_CODE_WORKING_DIR- unset /0/false: sync off (default; each node must have a consistent local tree). -auto: infer the repo root from the installedrlinfpackage or current directory. - absolute path: repo root containingpyproject.tomlandrlinf/, or therlinfpackage directory.
Only the rlinf/ subtree is synced—not examples/, docs/, etc. Example configs
and data paths must still be reachable on each node or via shared storage/NFS.
Note
When code sync is enabled
Do not store large files under ``rlinf/``: sync packages the entire local ``rlinf/`` directory from the launch node. Avoid logs, checkpoints, caches, datasets, or other large artifacts under
rlinf/to keep packaging and transfer fast.Prepare models and assets separately: model weights, simulator assets, datasets, etc. are not synced. Download them on each worker to the paths in your config, or mount via NFS/shared storage, and verify every node can access those paths.
These variables take effect on RLinf’s first ray.init (Cluster initialization). Do not call ray.init manually outside the training process.
Step 4: Verify cluster status#
On any node that has run ray start:
ray status
Confirm node count, per-node CPU/GPU, and state (ALIVE) match expectations.
For example, with 2 nodes and 8 GPUs each, you should see 16 GPUs total.
You can also wait for resources with (argument = total GPU count in the cluster):
bash ray_utils/check_ray.sh 16
If nodes are missing, check firewalls, whether --node-ip-address is reachable, and
worker ray start errors.
Step 5: Launch RLinf training#
Set
cluster.num_nodesin your task YAML to the actual node count (consistent withRLINF_NODE_RANK).On any node that has joined the Ray cluster,
cdto the RLinf repo and run an entry script, e.g. embodied:
cd <RLinf repo root>
bash examples/embodiment/run_embodiment.sh libero_spatial_ppo_openpi
Reasoning example:
bash examples/reasoning/run_main_grpo_math.sh qwen2.5-1.5b-grpo-megatron
The launch node must:
Have run
ray startlocally (ray statusshows the full cluster);Reach config files, models, and data (shared storage or code sync);
If code sync is enabled, have
export RLINF_CODE_WORKING_DIR=...in the same terminal.
Note
For node_groups and cross-machine placement, see Worker Placement Strategy and Heterogenous Hardware Setup.
Cloud-edge component_placement examples are in Cloud-Edge Training Setup.
Stopping and rebuilding the cluster#
After ray stop on all nodes and clearing state, restart using this guide:
ray stop
rm -f ray_utils/ray_head_ip.txt # if you used start_ray.sh
After changing the Python environment or RLINF_NODE_RANK, ray stop on all affected
nodes and ray start again.
FAQ#
Worker cannot reach head
Check that <head_ip> is pingable/telnet-able from workers, security groups/iptables allow
port 6379, and the head did not bind --node-ip-address to 127.0.0.1.
``ray status`` shows fewer nodes than ``cluster.num_nodes``
Wait for workers to finish joining; confirm ray start succeeded on each node and you
are not mixing multiple independent Ray clusters.
Worker cannot import latest code
Check RLINF_CODE_WORKING_DIR; without sync, rlinf/ must match on all nodes. With
sync, only the rlinf/ package is shipped (avoid large files under rlinf/); configs
under examples/, models, and simulator assets must still be local or on shared storage.
Ray version mismatch
All nodes should use the same ray version (RLinf requires ray>=2.47.0).