# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""
In the context of Torch Distributed Elastic we use the term *rendezvous* to
refer to a particular functionality that combines a **distributed
synchronization** primitive with **peer discovery**.

It is used by Torch Distributed Elastic to gather participants of a training
job (i.e. nodes) such that they all agree on the same list of participants and
everyone's roles, as well as to make a consistent collective decision on when
training can begin/resume.

Torch Distributed Elastic rendezvous provides the following critical
functionalities:

**Barrier**:

Nodes performing rendezvous will all block until the rendezvous is considered
complete - this happens when at least ``min`` total number of nodes have
joined the rendezvous barrier (for the same job). This also implies the
barrier is not necessarily of fixed size.

There's an additional small waiting time after reaching ``min`` number of
nodes - this is used to ensure the rendezvous is not completed "too quickly"
(which could potentially exclude additional nodes attempting to join at
approximately the same time).

If ``max`` number of nodes is gathered at the barrier, the rendezvous is
completed immediately.

There's also an overall timeout which causes the rendezvous to fail if ``min``
number of nodes is never reached - this is meant to be a simple fail-safe to
help release partially allocated job resources, in case there's a problem with
the resource manager, and is meant to be interpreted as non-retryable.

**Exclusivity**:

A simple distributed barrier would not be sufficient, as we also need to
ensure that only one group of nodes exists at any given time (for a given
job). In other words, new nodes (i.e. joining late) should not be able to form
a parallel independent group of workers for the same job.

Torch Distributed Elastic rendezvous ensures that if a group of nodes has
already completed a rendezvous (and hence might already be training), then
additional "late" nodes attempting to rendezvous will only announce themselves
as waiting, and will have to wait until the (previously completed) existing
rendezvous is destroyed first.

**Consistency**:

When a rendezvous is completed, all its members will agree on the job
membership and everyone's role in it. This role is represented using an
integer, called rank, that is between 0 and world size.

Note that ranks are *not stable*, in the sense that the same node can be
assigned a different rank in the next (re-)rendezvous.

**Fault-tolerance**:

Torch Distributed Elastic rendezvous is designed to tolerate node failures
during the rendezvous process. Should a process crash (or lose network
connectivity, etc.) between joining the rendezvous and its completion, then a
re-rendezvous with the remaining healthy nodes will happen automatically.

A node can also fail *after* it has completed (or *has been observed* by other
nodes to have completed) the rendezvous - this scenario will be handled by the
Torch Distributed Elastic ``train_loop`` instead (where it will also trigger a
re-rendezvous).

**Shared key-value store**:

When the rendezvous is completed, a shared key-value store is created and
returned. This store implements a ``torch.distributed.Store`` API (see
`distributed communication docs
<https://pytorch.org/docs/stable/distributed.html>`__).

This store is only shared by the members of the completed rendezvous. It is
intended to be used by Torch Distributed Elastic to exchange information
necessary to initialize job control and data-planes.
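As an illustration only (not part of the rendezvous contract), the minimal
sketch below shows how a node might consume the shared store once a rendezvous
completes. The key name ``master_addr`` and its value are hypothetical, and
the exact return type of ``next_rendezvous()`` may differ across releases; a
``(store, rank, world_size)`` tuple is assumed here:

::

    # `rdzv_handler` is a RendezvousHandler, e.g. constructed as shown below.
    # Blocks until the rendezvous barrier for this job is complete.
    store, rank, world_size = rdzv_handler.next_rendezvous()

    if rank == 0:
        # Publish a (hypothetical) key; the store is visible to all members
        # of this completed rendezvous.
        store.set("master_addr", "10.0.0.1")
    else:
        # get() blocks until the key has been set (subject to store timeout).
        master_addr = store.get("master_addr")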
**Waiting workers and rendezvous closing**:

Torch Distributed Elastic rendezvous handler object provides additional
functionalities, which are technically not part of the rendezvous process:

1. Querying how many workers arrived late at the barrier, who can participate
   in the *next* rendezvous.

2. Setting the rendezvous *closed* to signal all nodes not to participate in
   the next rendezvous.

**DynamicRendezvousHandler**:

Torch Distributed Elastic comes with the :py:class:`.DynamicRendezvousHandler`
class that implements the rendezvous mechanism described above. It is a
backend-agnostic type that expects a particular :py:class:`.RendezvousBackend`
instance to be specified during construction.

Torch distributed users can either implement their own backend type or use one
of the following implementations that come with PyTorch:

- :py:class:`.C10dRendezvousBackend`: Uses a C10d store (by default
  ``TCPStore``) as the rendezvous backend. The main advantage of using a C10d
  store is that it requires no 3rd-party dependency (such as etcd) to
  establish a rendezvous.

- :py:class:`.EtcdRendezvousBackend`: Supersedes the legacy
  :py:class:`.EtcdRendezvousHandler` class. Passing an
  :py:class:`.EtcdRendezvousBackend` instance to
  :py:class:`.DynamicRendezvousHandler` is functionally equivalent to
  instantiating an :py:class:`.EtcdRendezvousHandler`.

::

    from torch.distributed import TCPStore
    from torch.distributed.elastic.rendezvous.c10d_rendezvous_backend import C10dRendezvousBackend
    from torch.distributed.elastic.rendezvous.dynamic_rendezvous import DynamicRendezvousHandler

    # Host the store on this node; 29400 is torchelastic's default
    # rendezvous port.
    store = TCPStore("localhost", 29400, is_master=True)

    backend = C10dRendezvousBackend(store, "my_run_id")

    rdzv_handler = DynamicRendezvousHandler.from_backend(
        run_id="my_run_id",
        store=store,
        backend=backend,
        min_nodes=2,
        max_nodes=4,
    )
"""

from .api import *  # noqa: F403
from .registry import _register_default_handlers


_register_default_handlers()


__all__ = [
    "RendezvousClosedError",
    "RendezvousConnectionError",
    "RendezvousError",
    "RendezvousHandler",
    "RendezvousHandlerCreator",
    "RendezvousHandlerRegistry",
    "RendezvousParameters",
    "RendezvousStateError",
    "RendezvousTimeoutError",
    "rendezvous_handler_registry",
]