__init__.py

#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
  7. """
  8. Torchelastic agent and user worker failover contract:
  9. **TL;DR;**:
  10. * TE(torchelastic) expects user workers to finish with the 5 minutes drift
  11. * It is better to design DDP app to fail for all workers, rather than a single one.
  12. * TE does not synchronize number of restarts between agents
  13. * TE re-rendezvous does not trigger restart decrease
  14. * When a single agent finishes its job(successfully or not), it will close rendezvous.
  15. If other agents still have workers in progress, they will be terminated.
  16. * Based on above, scale down does not work if at least single agent finishes the job.
  17. * When Scale up is detected by agents, it will not decrease ``max_restarts``
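As a minimal sketch of the "fail for all workers" guidance above (not the canonical
torchelastic example), the worker entrypoint can simply let exceptions propagate; the
``record`` decorator from ``torch.distributed.elastic.multiprocessing.errors`` records
the failure for the agent's error reporting, the process exits non-zero, and the agent
restarts the whole worker group. ``train`` is a hypothetical user function.

.. code-block:: python

    from torch.distributed.elastic.multiprocessing.errors import record

    def train():
        ...  # hypothetical user training logic

    @record
    def main():
        # Do NOT swallow exceptions here: letting them escape makes this
        # process exit non-zero, which the TE agent treats as a worker
        # failure and restarts the whole worker group, not just this rank.
        train()

    if __name__ == "__main__":
        main()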
In general, TE (torchelastic) can launch arbitrary user code, but some
clarification is needed around what failover mechanisms torchelastic
provides and what failover mechanisms it expects from user workers.

Torchelastic currently supports DDP-style applications. That means that
TE expects *ALL* workers to finish at approximately the same time. In practice,
it is nearly impossible to guarantee that all workers in an arbitrary
DDP application finish at the same time, so TE provides a finalization barrier
that waits TIMEOUT (5 minutes) for worker finalization.
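One way to stay within that drift (a sketch only, assuming every rank runs the same
hypothetical ``run_training`` function) is to synchronize once before exiting:

.. code-block:: python

    import torch.distributed as dist

    def run_training():
        ...  # hypothetical user training logic

    def main():
        # Under a TE agent the env:// defaults (RANK, WORLD_SIZE, MASTER_ADDR,
        # MASTER_PORT) are already set in the environment.
        dist.init_process_group(backend="gloo")
        run_training()
        # Ranks that finish early wait here, so all workers exit well within
        # the finalization TIMEOUT instead of drifting apart.
        dist.barrier()
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()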
**Worker Failure**

When a worker fails, TE will check the number of restarts
available; if more than 0 restarts remain, TE will start a new rendezvous
round and restart the worker process. The new rendezvous round will cause
other TE agents to terminate their workers.

.. note:: TE agents do not synchronize restarts between themselves.
          When a single agent performs a restart, it will trigger a local
          ``max_restarts`` decrease; other agents will not decrease their
          ``max_restarts``.
A single worker failure can cause the whole cluster to fail:
if a single worker keeps failing, it will drive the TE agent's
``max_restarts`` to zero. This will cause that agent to finish its
work and close the rendezvous. If there are any other workers on different
agents, they will be terminated.
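For illustration only (endpoint, run id, and ``trainer`` are placeholders), the
restart budget is typically set when the agent is launched, e.g. via ``LaunchConfig``
and ``elastic_launch`` from ``torch.distributed.launcher.api``:

.. code-block:: python

    from torch.distributed.launcher.api import LaunchConfig, elastic_launch

    def trainer():
        ...  # hypothetical per-worker entrypoint

    if __name__ == "__main__":
        config = LaunchConfig(
            min_nodes=2,
            max_nodes=2,                              # fixed-size job
            nproc_per_node=4,
            rdzv_backend="c10d",
            rdzv_endpoint="node0.example.com:29400",  # placeholder endpoint
            run_id="my_job",                          # placeholder run id
            max_restarts=3,  # local budget; NOT synchronized across agents
        )
        elastic_launch(config, trainer)()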
**Re-Rendezvous**

Re-rendezvous occurs when TE agents detect a new node
trying to join the cluster. TE will not decrease ``max_restarts``. TE agents
will terminate their workers and start a new rendezvous round.

Note about DynamicRendezvous (etcd-v2, c10d-experimental): if the rendezvous
already has ``max_nodes`` participants, the new node won't be added to the wait
list right away, since there is no need to tear down a rendezvous that is
already fully utilized. The new node will wait until its timeout (600 secs by
default), periodically checking the number of participants. If that number
drops below ``max_nodes``, it will be added to the wait list; otherwise, it
will time out after 600 secs.
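If the 600-second default matters for a deployment, it can usually be raised through
the rendezvous configuration; this is only a sketch, assuming the ``join_timeout``
key honored by the dynamic rendezvous handlers and a placeholder endpoint:

.. code-block:: python

    from torch.distributed.launcher.api import LaunchConfig

    config = LaunchConfig(
        min_nodes=2,
        max_nodes=4,
        nproc_per_node=4,
        rdzv_backend="c10d",
        rdzv_endpoint="node0.example.com:29400",  # placeholder endpoint
        rdzv_configs={"join_timeout": 900},       # seconds; default is 600
    )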
*Scale up event*. When a scale up event happens, the torchelastic rendezvous
will detect that there are new nodes trying to join. The torchelastic agent
will stop all workers and perform a re-rendezvous. Note: when a scale up event
happens, ``max_restarts`` will *not* decrease.

*Scale down event*. When a scale down event happens, the rendezvous will not
notify the torchelastic agent about it. If the TE agent was launched with
``max_restarts=0``, it relies on the underlying scheduler to handle the job
restart. If ``max_restarts>0``, the TE agent will terminate its workers and
start a new rdzv round, which is a *Scale up event*.
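Scale events presuppose a job launched with an elastic node range
(``min_nodes`` < ``max_nodes``). A sketch of such a configuration follows
(endpoint and run id are placeholders), with the ``max_restarts`` trade-off
above noted in the comments:

.. code-block:: python

    from torch.distributed.launcher.api import LaunchConfig

    config = LaunchConfig(
        min_nodes=1,                              # job survives down to 1 node
        max_nodes=4,                              # new nodes may join up to 4
        nproc_per_node=8,
        rdzv_backend="c10d",
        rdzv_endpoint="node0.example.com:29400",  # placeholder endpoint
        run_id="elastic_job",                     # placeholder run id
        # max_restarts=0: rely on the underlying scheduler to restart the job;
        # max_restarts>0: the agent handles a scale down as a scale up event.
        max_restarts=2,
    )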
  58. """