Cluster Interface
This section introduces the Cluster class in RLinf, which launches and manages remote nodes and GPUs. Using the metadata produced by the placement strategy, it relies on Ray to schedule the training resources for distributed training.
Cluster
- class rlinf.scheduler.cluster.Cluster
A singleton class that manages the cluster resources for Ray workers.
- exception NamespaceConflictError
Raised when there is a namespace conflict in Ray initialization.
- classmethod find_free_port()
Find a free port on the node.
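A common way to find a free port is to bind a socket to port 0 and let the operating system choose. The sketch below illustrates that general technique with the standard library only; it is an assumption about how such a helper can work, not RLinf's actual implementation.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port that is currently unused on this node."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 tells the OS to pick any free port.
        sock.bind(("", 0))
        # getsockname() returns (host, port) for an AF_INET socket.
        return sock.getsockname()[1]


port = find_free_port()
print(port)
```

The socket is closed before the port number is returned, so another process could claim the port in the meantime. Callers that need a firm guarantee should bind the port themselves as soon as possible.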
- classmethod has_initialized()
Check if the cluster has been initialized.
- __init__(num_nodes=None, cluster_cfg=None, distributed_log_dir=None)
Initialize the cluster.
- Parameters:
num_nodes (int) – The number of nodes in the cluster. When you wish to acquire the cluster instance in a process other than the main driver process, do not pass this argument; instead, call the Cluster() constructor without arguments. If num_nodes is 0, the cluster is initialized with all Ray-connected nodes.
cluster_cfg (Optional[DictConfig]) – The cluster’s configuration dictionary. If set, num_nodes is ignored and inferred from the config.
distributed_log_dir (Optional[str]) – Output directory for split logs. This must be provided when distributed_logging is True.
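The calling pattern described above (configure once, then obtain the same instance with a no-argument constructor) can be illustrated with a plain singleton. `SingletonCluster` below is a hypothetical stand-in, not RLinf code: it lives in a single process, and raising an error when the first call omits `num_nodes` is an assumption of this sketch. How RLinf shares the instance across processes is not shown.

```python
class SingletonCluster:
    """Hypothetical stand-in for a singleton cluster handle (not RLinf code)."""

    _instance = None

    def __new__(cls, num_nodes=None):
        if cls._instance is None:
            # Assumption of this sketch: the first construction must configure
            # the cluster, so a missing num_nodes is treated as an error.
            if num_nodes is None:
                raise RuntimeError("first construction must pass num_nodes")
            cls._instance = super().__new__(cls)
            cls._instance._num_nodes = num_nodes
        # Every later call returns the already-configured instance.
        return cls._instance

    @classmethod
    def has_initialized(cls) -> bool:
        return cls._instance is not None

    @property
    def num_nodes(self) -> int:
        return self._num_nodes


before = SingletonCluster.has_initialized()  # nothing constructed yet
driver = SingletonCluster(num_nodes=2)       # first call configures the instance
worker_view = SingletonCluster()             # later call passes no arguments
print(before, driver is worker_view, worker_view.num_nodes)
```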
- static get_full_env_var_name(var)
Get the full environment variable name with system prefix.
- Parameters:
var (ClusterEnvVar)
- Return type:
str
- static get_sys_env_var(env_var, default=None)
Get the system environment variable for the cluster.
- Parameters:
env_var (ClusterEnvVar)
default (str | None)
- Return type:
str | None
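The two helpers above follow a familiar pattern: prepend a system-wide prefix to a short variable name, then look the full name up in the environment with an optional default. The sketch below shows that pattern in isolation. The `DEMO_` prefix and the `DemoEnvVar` members are hypothetical; RLinf's real prefix and the members of ClusterEnvVar are not given in this section.

```python
import os
from enum import Enum
from typing import Optional

SYSTEM_PREFIX = "DEMO_"  # hypothetical prefix, not RLinf's actual one


class DemoEnvVar(Enum):
    """Hypothetical stand-in for ClusterEnvVar."""

    LOG_LEVEL = "LOG_LEVEL"
    TIMEOUT = "TIMEOUT"


def get_full_env_var_name(var: DemoEnvVar) -> str:
    # Full name = system prefix + short variable name.
    return f"{SYSTEM_PREFIX}{var.value}"


def get_sys_env_var(env_var: DemoEnvVar, default: Optional[str] = None) -> Optional[str]:
    # Fall back to `default` when the prefixed variable is not set.
    return os.environ.get(get_full_env_var_name(env_var), default)


os.environ["DEMO_LOG_LEVEL"] = "debug"  # set one variable for the demo
os.environ.pop("DEMO_TIMEOUT", None)    # make sure the other one is unset

level = get_sys_env_var(DemoEnvVar.LOG_LEVEL)
timeout = get_sys_env_var(DemoEnvVar.TIMEOUT, default="30")
print(level, timeout)
```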
- property num_nodes
Get the number of nodes in the cluster.
- property num_accelerators
Get the number of accelerators in the cluster.
- property accelerator_ranks: list[list[int]]
Get the global accelerator ranks for each node in the cluster.
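To make the `list[list[int]]` shape concrete, the sketch below builds one inner list per node from per-node accelerator counts. It assumes global ranks are numbered contiguously, node by node, which this section does not confirm; `global_accelerator_ranks` is a hypothetical helper, not part of RLinf.

```python
from itertools import accumulate


def global_accelerator_ranks(accelerators_per_node: list[int]) -> list[list[int]]:
    """Assign contiguous global ranks to accelerators, one inner list per node."""
    # offsets[i] is the number of accelerators on all nodes before node i.
    offsets = [0, *accumulate(accelerators_per_node)]
    return [
        list(range(start, start + count))
        for start, count in zip(offsets, accelerators_per_node)
    ]


# Three nodes with 4, 4, and 2 accelerators respectively.
ranks = global_accelerator_ranks([4, 4, 2])
print(ranks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Under this assumption, the total number of accelerators is the sum of the inner list lengths, and indexing the outer list by node rank gives that node's global ranks.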
- static get_alive_nodes()
Get the list of alive nodes in the Ray cluster.
- get_node_group(label='cluster')
Get the node group information by label.
- Parameters:
label (Optional[str]) – The label of the node group.
- Returns:
The node group information.
- Return type:
Optional[NodeGroupInfo]
- get_node_info(node_rank)
Get the NodeInfo of a specific node rank.
- Parameters:
node_rank (int)
- get_node_ip(node_rank)
Get the IP address of a specific node by its rank.
- Parameters:
node_rank (int)
- Return type:
str