Tips
Determining the rendezvous IP address and port
When launching a simulation across multiple ranks, it is often convenient to have rank 0 choose the Gloo rendezvous address and port, then share them with the other ranks through a small JSON file.
In the example below, rank 0 extracts the IPv4 address of the high-speed
network interface named hsn0, binds a socket to port 0 to let the OS
pick a free port, and writes both values to gloo_rdzv.json. All other
ranks wait until that file exists, then read it and reuse the same rendezvous
endpoint.
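import json, os, socket, subprocess, time

# `rank` is assumed to be defined already (e.g. by the launcher or MPI).
rdzv = "./gloo_rdzv.json"
if rank == 0:
    # IPv4 address of the hsn0 interface
    ip = subprocess.check_output(
        "ip -o -4 addr show hsn0 | awk '{print $4}' | cut -d/ -f1",
        shell=True, text=True
    ).strip()
    # Bind to port 0 so the OS picks a free port, then release it
    s = socket.socket(); s.bind((ip, 0))
    port = s.getsockname()[1]; s.close()
    # Write to a temporary file and rename it, so that other ranks
    # never observe a partially written JSON file
    with open(rdzv + ".tmp", "w") as f:
        json.dump({"ip": ip, "port": port}, f)
    os.replace(rdzv + ".tmp", rdzv)
else:
    # Wait for rank 0 to publish the endpoint
    while not os.path.exists(rdzv): time.sleep(0.1)
# All ranks (including rank 0) read the same endpoint
with open(rdzv) as f:
    data = json.load(f)
master_addr, master_port = data["ip"], data["port"]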
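The same handoff can also be wrapped in a few small helpers. The sketch below is one possible arrangement, not part of any library: the function names pick_free_port, publish_endpoint, and wait_for_endpoint and the timeout and poll parameters are hypothetical. It assumes the JSON file lives on a filesystem that every rank can see, and it adds a timeout so that waiting ranks fail with an error instead of hanging if rank 0 never writes the file.

```python
import json
import os
import socket
import time


def pick_free_port(ip):
    """Bind to port 0 so the OS assigns a free port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((ip, 0))
        return s.getsockname()[1]


def publish_endpoint(path, ip, port):
    """Write the endpoint to a temporary file, then rename it into place.

    os.replace is atomic on POSIX when both paths are on the same
    filesystem, so readers see either no file or a complete one.
    """
    tmp = f"{path}.tmp.{os.getpid()}"
    with open(tmp, "w") as f:
        json.dump({"ip": ip, "port": port}, f)
    os.replace(tmp, path)


def wait_for_endpoint(path, timeout=60.0, poll=0.1):
    """Poll until the file exists, then return (ip, port).

    Raises TimeoutError if the file does not appear within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f"rendezvous file {path!r} did not appear")
        time.sleep(poll)
    with open(path) as f:
        data = json.load(f)
    return data["ip"], data["port"]
```

With these helpers, rank 0 would call publish_endpoint(rdzv, ip, pick_free_port(ip)) and every rank would then call wait_for_endpoint(rdzv). Two caveats apply to this approach in general: a file left over from a previous run will be picked up immediately, so it is worth deleting it before launching, and the port is released before Gloo binds to it, so another process could in principle claim it in between.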