#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
"""Metrics API.

**Overview**:

The metrics API in torchelastic is used to publish telemetry metrics.
It is designed to be used by torchelastic's internal modules to
publish metrics for the end user with the goal of increasing visibility
and helping with debugging. However, you may use the same API in your
jobs to publish metrics to the same metrics ``sink``.

A ``metric`` can be thought of as timeseries data
and is uniquely identified by the string-valued tuple
``(metric_group, metric_name)``.

torchelastic makes no assumptions about what a ``metric_group`` is
and what relationship it has with ``metric_name``. It is totally up
to the user to use these two fields to uniquely identify a metric.

.. note:: The metric group ``torchelastic`` is reserved by torchelastic for
          platform level metrics that it produces.
          For instance torchelastic may output the latency (in milliseconds)
          of a re-rendezvous operation from the agent as
          ``(torchelastic, agent.rendezvous.duration.ms)``

A sensible way to use metric groups is to map them to a stage or module
in your job. You may also encode certain high level properties of
the job such as the region or stage (dev vs prod).
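
For example, a job might fold such properties into the group name it
publishes under. The group name below is made up for illustration
(``put_metric`` itself is covered in the next section):

::

    import torch.distributed.elastic.metrics as metrics

    # hypothetical group "trainer.prod" encoding <module>.<stage>
    metrics.put_metric("batch_latency_ms", 42, "trainer.prod")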
**Publish Metrics**:

Using torchelastic's metrics API is similar to using python's logging
framework. You first have to configure a metrics handler before
trying to add metric data.

The example below measures the latency for the ``calculate()`` function.

::

    import time
    import torch.distributed.elastic.metrics as metrics

    # makes all metrics other than the ones from "my_module" go to /dev/null
    metrics.configure(metrics.NullMetricHandler())
    metrics.configure(metrics.ConsoleMetricHandler(), "my_module")

    def my_method():
        start = time.time()
        calculate()
        end = time.time()
        metrics.put_metric("calculate_latency", int(end - start), "my_module")
You may also use the ``torch.distributed.elastic.metrics.prof`` decorator
to conveniently and succinctly profile functions

::

    # -- in module examples.foobar --

    import torch.distributed.elastic.metrics as metrics

    metrics.configure(metrics.ConsoleMetricHandler(), "foobar")
    metrics.configure(metrics.ConsoleMetricHandler(), "Bar")

    @metrics.prof
    def foo():
        pass

    class Bar:
        @metrics.prof
        def baz(self):
            pass

``@metrics.prof`` will publish the following metrics

::

    <leaf_module or classname>.success - 1 if the function finished successfully
    <leaf_module or classname>.failure - 1 if the function threw an exception
    <leaf_module or classname>.duration.ms - function duration in milliseconds
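
Following that naming scheme, successful calls to ``foo()`` and
``Bar().baz()`` above would be expected to produce metric names along
these lines (shown for illustration):

::

    foobar.foo.success
    foobar.foo.duration.ms
    Bar.baz.success
    Bar.baz.duration.ms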
**Configuring Metrics Handler**:

`torch.distributed.elastic.metrics.MetricHandler` is responsible for emitting
the added metric values to a particular destination. Metric groups can be
configured with different metric handlers.

By default torchelastic emits all metrics to ``/dev/null``.
By adding the following configuration, metrics in the
``torchelastic`` and ``my_app`` metric groups will be printed out to
console.

::

    import torch.distributed.elastic.metrics as metrics

    metrics.configure(metrics.ConsoleMetricHandler(), group="torchelastic")
    metrics.configure(metrics.ConsoleMetricHandler(), group="my_app")
**Writing a Custom Metric Handler**:

If you want your metrics to be emitted to a custom location, implement
the `torch.distributed.elastic.metrics.MetricHandler` interface
and configure your job to use your custom metric handler.

Below is a toy example that prints the metrics to ``stdout``

::

    import torch.distributed.elastic.metrics as metrics

    class StdoutMetricHandler(metrics.MetricHandler):
        def emit(self, metric_data):
            ts = metric_data.timestamp
            group = metric_data.group_name
            name = metric_data.name
            value = metric_data.value
            print(f"[{ts}][{group}]: {name}={value}")

    metrics.configure(StdoutMetricHandler(), group="my_app")

Now all metrics in the group ``my_app`` will be printed to stdout as:

::

    [1574213883.4182858][my_app]: my_metric=<value>
    [1574213940.5237644][my_app]: my_metric=<value>
"""
from typing import Optional

from .api import (  # noqa: F401
    ConsoleMetricHandler,
    MetricData,
    MetricHandler,
    MetricsConfig,
    NullMetricHandler,
    configure,
    get_elapsed_time_ms,
    getStream,
    prof,
    profile,
    publish_metric,
    put_metric,
)
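
# Note: the default implementation below is a no-op; a platform-specific
# build may perform real initialization via the optional ``static_init``
# import at the bottom of this file.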
def initialize_metrics(cfg: Optional[MetricsConfig] = None):
    pass


try:
    from torch.distributed.elastic.metrics.static_init import *  # type: ignore[import] # noqa: F401 F403
except ModuleNotFoundError:
    pass