- # Copyright (c) Facebook, Inc. and its affiliates.
- # All rights reserved.
- #
- # This source code is licensed under the BSD-style license found in the
- # LICENSE file in the root directory of this source tree.
- """
- In the context of Torch Distributed Elastic we use the term *rendezvous* to
- refer to a particular functionality that combines a **distributed
- synchronization** primitive with **peer discovery**.
- It is used by Torch Distributed Elastic to gather participants of a training
- job (i.e. nodes) such that they all agree on the same list of participants and
- everyone's roles, as well as make a consistent collective decision on when
- training can begin/resume.
- Torch Distributed Elastic rendezvous provides the following critical
- functionalities:
- **Barrier**:
- Nodes performing rendezvous will all block until the rendezvous is considered
- complete - this happens when at least ``min`` total nodes have joined the
- rendezvous barrier (for the same job). This also implies that the barrier is not
- necessarily of fixed size.
- There is an additional short waiting period after the ``min`` number of nodes is
- reached - this is used to ensure the rendezvous is not completed "too quickly"
- (which could exclude additional nodes attempting to join at approximately the
- same time).
- If the ``max`` number of nodes is gathered at the barrier, the rendezvous
- completes immediately.
- There is also an overall timeout that causes the rendezvous to fail if the
- ``min`` number of nodes is never reached. This is a simple fail-safe to help
- release partially allocated job resources in case there is a problem with the
- resource manager, and it should be treated as non-retryable.
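- As a rough sketch of how these knobs are typically supplied, the parameters below
- map directly onto the behavior described above; the backend name, endpoint, and
- timeout values are illustrative only, not defaults:
- ::
-     from torch.distributed.elastic.rendezvous import (
-         RendezvousParameters,
-         rendezvous_handler_registry,
-     )
-     # Illustrative values: min_nodes/max_nodes control the barrier size, while
-     # last_call_timeout is the grace period after reaching ``min`` and
-     # join_timeout is the overall timeout.
-     params = RendezvousParameters(
-         backend="c10d",
-         endpoint="localhost:29400",
-         run_id="my_run_id",
-         min_nodes=2,
-         max_nodes=4,
-         join_timeout=600,
-         last_call_timeout=30,
-     )
-     handler = rendezvous_handler_registry.create_handler(params)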
- **Exclusivity**:
- A simple distributed barrier would not be sufficient, as we also need to ensure
- that only one group of nodes exists at any given time (for a given job). In
- other words, new nodes (i.e. joining late) should not be able to form a parallel
- independent group of workers for the same job.
- Torch Distributed Elastic rendezvous ensures that if a group of nodes has
- already completed a rendezvous (and hence might already be training), then
- additional "late" nodes attempting to rendezvous will only announce themselves
- as waiting, and will have to wait until the (previously completed) existing
- rendezvous is destroyed first.
- **Consistency**:
- When a rendezvous is completed, all its members will agree on the job membership
- and everyone's role in it. This role is represented using an integer, called
- rank, that is between 0 and world size.
- Note that ranks are *not stable*, in the sense that the same node can be
- assigned a different rank in the next (re-)rendezvous.
- **Fault-tolerance**:
- Torch Distributed Elastic rendezvous is designed to tolerate node failures
- during the rendezvous process. Should a process crash (or lose network
- connectivity, etc.) between joining the rendezvous and its completion, a
- re-rendezvous with the remaining healthy nodes will happen automatically.
- A node can also fail *after* it has completed (or *has been observed* by other
- nodes to have completed) the rendezvous - this scenario will be handled by the
- Torch Distributed Elastic ``train_loop`` instead (where it will also trigger a
- re-rendezvous).
- **Shared key-value store**:
- When the rendezvous is completed, a shared key-value store is created and
- returned. This store implements a ``torch.distributed.Store`` API (see
- `distributed communication docs
- <https://pytorch.org/docs/stable/distributed.html>`__).
- This store is only shared by the members of the completed rendezvous. It
- is intended to be used by Torch Distributed Elastic to exchange information
- necessary to initialize the job control and data planes.
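- For illustration only - a minimal sketch of how members of a completed rendezvous
- might use the returned store; the key name and address below are hypothetical:
- ::
-     # ``store`` is the torch.distributed.Store returned by a completed
-     # rendezvous; the key name and address are purely illustrative.
-     if rank == 0:
-         store.set("master_addr", "node0.example.com")
-     master_addr = store.get("master_addr")  # waits until the key is set; returns bytes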
- **Waiting workers and rendezvous closing**:
- The Torch Distributed Elastic rendezvous handler object provides additional
- functionality that is technically not part of the rendezvous process:
- 1. Querying how many workers arrived late at the barrier and can therefore
-    participate in the *next* rendezvous.
- 2. Setting the rendezvous *closed* to signal all nodes not to participate in the
-    next rendezvous (see the sketch below).
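- Both calls are part of the :py:class:`.RendezvousHandler` API; a minimal sketch,
- assuming ``rdzv_handler`` is an already constructed handler:
- ::
-     # Number of nodes that arrived late and are waiting for the next rendezvous.
-     num_waiting = rdzv_handler.num_nodes_waiting()
-     # Mark the rendezvous closed so that no node participates in a next rendezvous.
-     rdzv_handler.set_closed()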
- **DynamicRendezvousHandler**:
- Torch Distributed Elastic comes with the :py:class:`.DynamicRendezvousHandler`
- class that implements the rendezvous mechanism described above. It is a backend-
- agnostic type that expects a particular :py:class:`.RendezvousBackend` instance
- to be specified during construction.
- Torch distributed users can either implement their own backend type or use one
- of the following implementations that come with PyTorch:
- - :py:class:`.C10dRendezvousBackend`: Uses a C10d store (by default
- ``TCPStore``) as the rendezvous backend. The main advantage of using a C10d
- store is that it requires no 3rd-party dependency (such as etcd) to establish
- a rendezvous.
- - :py:class:`.EtcdRendezvousBackend`: Supersedes the legacy
- :py:class:`.EtcdRendezvousHandler` class. Passing an
- :py:class:`.EtcdRendezvousBackend` instance to
- :py:class:`.DynamicRendezvousHandler` is functionally equivalent to
- instantiating an :py:class:`.EtcdRendezvousHandler`.
- ::
-     from torch.distributed import TCPStore
-     from torch.distributed.elastic.rendezvous.c10d_rendezvous_backend import C10dRendezvousBackend
-     from torch.distributed.elastic.rendezvous.dynamic_rendezvous import DynamicRendezvousHandler
-     # The port is illustrative; is_master=True makes this process host the store.
-     store = TCPStore("localhost", 29400, is_master=True)
-     backend = C10dRendezvousBackend(store, "my_run_id")
-     rdzv_handler = DynamicRendezvousHandler.from_backend(
-         run_id="my_run_id",
-         store=store,
-         backend=backend,
-         min_nodes=2,
-         max_nodes=4,
-     )
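- A hedged usage sketch of the resulting handler; note that the exact return value
- of ``next_rendezvous()`` differs across PyTorch versions (older releases return a
- ``(store, rank, world_size)`` tuple, newer ones a ``RendezvousInfo`` object):
- ::
-     # Blocks until the rendezvous completes or raises, e.g., RendezvousTimeoutError.
-     rdzv_result = rdzv_handler.next_rendezvous()
-     # ... run training ...
-     # Release rendezvous resources when this node is done with the job.
-     rdzv_handler.shutdown()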
- """
- from .api import * # noqa: F403
- from .registry import _register_default_handlers
- _register_default_handlers()
- __all__ = [
- "RendezvousClosedError",
- "RendezvousConnectionError",
- "RendezvousError",
- "RendezvousHandler",
- "RendezvousHandlerCreator",
- "RendezvousHandlerRegistry",
- "RendezvousParameters",
- "RendezvousStateError",
- "RendezvousTimeoutError",
- "rendezvous_handler_registry",
- ]