# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""
In the context of Torch Distributed Elastic we use the term *rendezvous* to
refer to a particular functionality that combines a **distributed
synchronization** primitive with **peer discovery**.

It is used by Torch Distributed Elastic to gather participants of a training
job (i.e. nodes) such that they all agree on the same list of participants and
everyone's roles, as well as make a consistent collective decision on when
training can begin/resume.

Torch Distributed Elastic rendezvous provides the following critical
functionalities:

**Barrier**:

Nodes performing rendezvous will all block until the rendezvous is considered
complete - this happens when at least ``min`` total number of nodes have joined
the rendezvous barrier (for the same job). This also implies the barrier is not
necessarily of fixed size.

There's an additional small waiting time after reaching ``min`` number of
nodes - this is used to ensure the rendezvous is not completed "too quickly"
(which could potentially exclude additional nodes attempting to join at
approximately the same time).

If ``max`` number of nodes is gathered at the barrier, the rendezvous is
completed immediately.

There's also an overall timeout which causes the rendezvous to fail if ``min``
number of nodes is never reached - this is meant to be a simple fail-safe to
help release partially allocated job resources, in case there's a problem with
the resource manager, and is meant to be interpreted as non-retryable.
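
As a rough sketch, the ``min``/``max`` bounds and the timeouts described above
are typically expressed through :py:class:`.RendezvousParameters`. The endpoint
value and the timeout keyword names below are illustrative assumptions and may
vary per backend::

    # Sketch only: the endpoint and the timeout keyword names are illustrative
    # and backend-dependent, not an authoritative list.
    params = RendezvousParameters(
        backend="c10d",
        endpoint="localhost:29400",
        run_id="my_run_id",
        min_nodes=2,
        max_nodes=4,
        join_timeout=600,       # assumed: overall timeout in seconds
        last_call_timeout=30,   # assumed: extra wait after reaching ``min``
    )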

**Exclusivity**:

A simple distributed barrier would not be sufficient, as we also need to ensure
that only one group of nodes exists at any given time (for a given job). In
other words, new nodes (i.e. joining late) should not be able to form a parallel
independent group of workers for the same job.

Torch Distributed Elastic rendezvous ensures that if a group of nodes has
already completed a rendezvous (and hence might already be training), then
additional "late" nodes attempting to rendezvous will only announce themselves
as waiting, and will have to wait until the (previously completed) existing
rendezvous is destroyed first.

**Consistency**:

When a rendezvous is completed, all its members will agree on the job membership
and everyone's role in it. This role is represented using an integer, called
rank, that is between 0 and world size.

Note that ranks are *not stable*, in the sense that the same node can be
assigned a different rank in the next (re-)rendezvous.

**Fault-tolerance**:

Torch Distributed Elastic rendezvous is designed to tolerate node failures
during the rendezvous process. Should a process crash (or lose network
connectivity, etc) between joining the rendezvous and it being completed, then
a re-rendezvous with remaining healthy nodes will happen automatically.

A node can also fail *after* it has completed (or *has been observed* by other
nodes to have completed) the rendezvous - this scenario will be handled by the
Torch Distributed Elastic ``train_loop`` instead (where it will also trigger a
re-rendezvous).

**Shared key-value store**:

When the rendezvous is completed, a shared key-value store is created and
returned. This store implements a ``torch.distributed.Store`` API (see
`distributed communication docs
<https://pytorch.org/docs/stable/distributed.html>`__).

This store is only shared by the members of the completed rendezvous. It
is intended to be used by Torch Distributed Elastic to exchange information
necessary to initialize job control and data-planes.
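
As an illustration, assuming ``store`` is the key-value store returned for a
completed rendezvous and ``rank`` is this node's rank in it, the store can be
used like any other ``torch.distributed.Store``. The key and value below are
placeholders::

    # Sketch: exchange a value through the shared rendezvous store.
    # ``store`` and ``rank`` are assumed to come from a completed rendezvous.
    if rank == 0:
        store.set("MASTER_ADDR", "node0.example.com")  # placeholder value
    master_addr = store.get("MASTER_ADDR")  # blocks until the key is set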

**Waiting workers and rendezvous closing**:

Torch Distributed Elastic rendezvous handler object provides additional
functionalities, which are technically not part of the rendezvous process:

1. Querying how many workers arrived late at the barrier, who can participate in
   *next* rendezvous.

2. Setting the rendezvous *closed* to signal all nodes not to participate in
   next rendezvous.
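
Assuming ``rdzv_handler`` is a :py:class:`.RendezvousHandler` instance such as
the one constructed in the example at the end of this docstring, a minimal
sketch of these two operations is::

    # Sketch: query late arrivals and close the rendezvous.
    num_waiting = rdzv_handler.num_nodes_waiting()  # nodes waiting for the *next* round
    if job_is_finished:  # placeholder for job-level shutdown logic
        rdzv_handler.set_closed()  # signal all nodes not to rendezvous again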

**DynamicRendezvousHandler**:

Torch Distributed Elastic comes with the :py:class:`.DynamicRendezvousHandler`
class that implements the rendezvous mechanism described above. It is a backend-
agnostic type that expects a particular :py:class:`.RendezvousBackend` instance
to be specified during construction.

Torch distributed users can either implement their own backend type or use one
of the following implementations that come with PyTorch:

- :py:class:`.C10dRendezvousBackend`: Uses a C10d store (by default
  ``TCPStore``) as the rendezvous backend. The main advantage of using a C10d
  store is that it requires no 3rd-party dependency (such as etcd) to establish
  a rendezvous.

- :py:class:`.EtcdRendezvousBackend`: Supersedes the legacy
  :py:class:`.EtcdRendezvousHandler` class. Passing an
  :py:class:`.EtcdRendezvousBackend` instance to
  :py:class:`.DynamicRendezvousHandler` is functionally equivalent to
  instantiating an :py:class:`.EtcdRendezvousHandler`.

::

    store = TCPStore("localhost")

    backend = C10dRendezvousBackend(store, "my_run_id")

    rdzv_handler = DynamicRendezvousHandler.from_backend(
        run_id="my_run_id",
        store=store,
        backend=backend,
        min_nodes=2,
        max_nodes=4
    )
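
As a follow-up sketch, the handler is then used to actually join the
rendezvous. Note that the return value of ``next_rendezvous()`` differs across
PyTorch releases; the ``(store, rank, world_size)`` tuple form shown here is
an assumption based on older releases::

    # Sketch: join the rendezvous and discover this node's rank.
    store, rank, world_size = rdzv_handler.next_rendezvous()
    print(f"joined rendezvous as rank {rank} of {world_size}")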
"""

from .api import *  # noqa: F403
from .registry import _register_default_handlers

_register_default_handlers()

__all__ = [
    "RendezvousClosedError",
    "RendezvousConnectionError",
    "RendezvousError",
    "RendezvousHandler",
    "RendezvousHandlerCreator",
    "RendezvousHandlerRegistry",
    "RendezvousParameters",
    "RendezvousStateError",
    "RendezvousTimeoutError",
    "rendezvous_handler_registry",
]