Metadata-Version: 2.3
Name: aidot-training-telemetry
Version: 1.0.0
Summary: Utilities for monitoring training of large foundation models
License: NVIDIA Mission Control, License no 744-SW7001+P1CMI36
Author: Stefania Alborghetti
Author-email: salborghetti@nvidia.com
Requires-Python: >=3.10
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Utilities
Provides-Extra: ai-tools
Requires-Dist: numpy (>=1.24.0,<3.0.0) ; extra == "ai-tools"
Requires-Dist: nvtx (>=0.2.11,<0.3.0)
Requires-Dist: opentelemetry-api (>=1.30.0,<2.0.0)
Requires-Dist: opentelemetry-exporter-otlp-proto-grpc (>=1.34.1,<2.0.0)
Requires-Dist: opentelemetry-exporter-otlp-proto-http (>=1.34.1,<2.0.0)
Requires-Dist: opentelemetry-sdk (>=1.30.0,<2.0.0)
Requires-Dist: torch (>=2.7.1,<3.0.0) ; extra == "ai-tools"
Requires-Dist: tzlocal (>=5.3,<6.0)
Description-Content-Type: text/markdown

# Training Telemetry

A Python library that records events, metrics, and errors during model training in standardized formats:

* structured key=value logs
* JSON for text files
* OpenTelemetry (OTEL) trace spans
* NVTX code markers for NVIDIA Nsight Systems

## Overview

The objective of this library is to provide a standard format for logging events, metrics and errors that can be adopted by existing frameworks and applications for training large AI models. The end result is that the runtime performance and errors of these training models can be monitored in a consistent manner, without impacting the training performance. Time spans provide detailed information on how each training process spends its time during startup, training and checkpoint saving. Errors can be analyzed and correlated with infrastructure events once the application fails, in order to provide users with more actionable information.

This library is lightweight and intentionally has very few dependencies, so as to facilitate integration with training frameworks that normally have a long list of dependencies. The API is provided on two levels:

* A context-based API, where monitoring can be done via context managers or function decorators
* A low-level recorder API with start/stop/event/error functions for callback implementations and other low-level requirements
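
The relationship between the two levels can be illustrated with a minimal sketch that assumes nothing about the library's actual classes. `ToyRecorder`, `span`, and `timed` are hypothetical names built on the standard library only: the recorder exposes explicit start/stop calls, and the context manager and decorator are thin wrappers over them.

```python
import functools
import time
from contextlib import contextmanager


class ToyRecorder:
    """Hypothetical low-level recorder; not the library's real class."""

    def __init__(self):
        self.finished = []  # list of (name, duration_seconds) tuples
        self._open = {}     # span name -> start timestamp

    def start(self, name):
        self._open[name] = time.perf_counter()

    def stop(self, name):
        started = self._open.pop(name)
        self.finished.append((name, time.perf_counter() - started))


recorder = ToyRecorder()


@contextmanager
def span(name):
    """Context-level API: wraps start/stop around a code block."""
    recorder.start(name)
    try:
        yield
    finally:
        # Always close the span, even if the block raises.
        recorder.stop(name)


def timed(name):
    """Decorator form of the same idea."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with span(name):
                return func(*args, **kwargs)
        return wrapper
    return decorator


@timed("step")
def step(x):
    return x * 2


with span("loop"):
    result = step(21)
```

In this sketch the inner `step` span finishes before the outer `loop` span, so the recorder stores them in that order. Callback-style integrations would call `start` and `stop` directly instead of using the wrappers.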

The following events are currently supported:

- Application runtime and application-specific metrics
- Training loop progress and timing
- Individual iteration metrics, including loss, accuracy, TFLOPS, consumed samples, forward and backward times
- Checkpoint saves, including global and local checkpoints, async and sync checkpoint strategies
- Errors and exceptions
- Model validation and testing
- Custom metrics and events

Events are logged by one or more of the following backends:

* A Python logger backend, logging events as messages using a logger at INFO level with structured log format
* A file logger backend, where each event is logged as a one-line JSON object
* An OpenTelemetry backend, where each event is converted to a span and sent to the OTEL collector

Events have metrics attached to them. A special class of events, error events, captures error messages and stack traces.
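
To make the first two backend formats concrete, the sketch below renders one event both as a key=value log message and as a one-line JSON object. The event fields (`name`, `duration_s`, `loss`) and the helper names are illustrative assumptions; the library's actual schema may differ.

```python
import json


def to_key_value(event):
    """Render a flat dict as space-separated key=value pairs."""
    return " ".join(f"{key}={value}" for key, value in event.items())


def to_json_line(event):
    """Render a dict as a single-line JSON object."""
    return json.dumps(event, separators=(",", ":"))


# Hypothetical event; field names are assumptions, not the library's schema.
event = {"name": "training_iteration", "duration_s": 0.25, "loss": 1.5}

kv_line = to_key_value(event)
json_line = to_json_line(event)
```

The key=value form is convenient for grep-style searches in application logs, while the JSON form can be parsed back into a dictionary without a custom parser.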

## Key Features

- Context managers for timing code blocks
- Event recording with customizable metrics
- Exception handling and error reporting
- Flexible backend system for storing/analyzing telemetry data as log messages, JSON objects or OTEL traces
- Low overhead monitoring

## Installation

The library package is stored in two locations:

* the [GitLab package repository](https://gitlab-master.nvidia.com/ai-efficiency/training_telemetry/-/packages)
* the [NVIDIA nv-shared-pypi-local repository on Artifactory](https://urm.nvidia.com/ui/repos/tree/General/nv-shared-pypi-local/)

To download from the GitLab package repository, a GitLab access token is required, whereas anyone in NVIDIA can download from Artifactory. Downloads from GitLab are intended only for library developers, because packages there may change without a version increment; users should download from Artifactory.

```bash
pip install aidot-training-telemetry --index-url=https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi/simple
```

or

```bash
pip install aidot-training-telemetry --index-url=https://username:password@urm.nvidia.com/artifactory/api/pypi/sw-aidot-heimdall-pypi-local/simple --extra-index-url https://pypi.org/simple
```

where `username` and `password` are your Artifactory username and password.

To add the URL only once:

```bash
python3 -m pip config set global.index-url https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi/simple
```

and then simply:

```bash
pip install aidot-training-telemetry
```

If you want to download from GitLab, change the URL and add your token at the beginning:

```bash
pip install aidot-training-telemetry --index-url https://__token__:<your_personal_token>@gitlab-master.nvidia.com/api/v4/projects/166461/packages/pypi/simple
```

where you need to replace `<your_personal_token>` with a Personal Access Token that has at least `read_registry` permissions, generated by following the instructions on this [page](https://archives.docs.gitlab.com/17.4/ee/user/profile/personal_access_tokens/).


If using Poetry, run the following command to add the repository URL:

```bash
poetry config repos.nv-shared https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi/simple
```

and then add the dependency to your project:

```bash
poetry add nvidia aidot-training-telemetry
```

More instructions on this [page](https://confluence.nvidia.com/pages/viewpage.action?pageId=161553571).


## Usage

Using the context API, initialize the main function with:

```python
def get_application_metrics():
    return ApplicationMetrics.create(
        rank=get_rank(),
        world_size=get_world_size(),
        node_name="localhost",
        timezone=str(get_localzone()),
        total_iterations=num_epochs * len(dataloader),
        checkpoint_enabled=True,
        checkpoint_strategy="sync",
    )


@application_running(metrics=get_application_metrics())
def main():
    [...]
```

This will capture any exceptions not handled by the application, and log them as an error event before re-raising them.
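
The capture-and-re-raise behaviour described above follows a common decorator pattern, sketched below with the standard library only. `report_errors` and `error_events` are hypothetical stand-ins and do not reflect how `application_running` is implemented.

```python
import functools
import traceback

error_events = []  # stand-in for a telemetry backend


def report_errors(func):
    """Hypothetical decorator: record unhandled exceptions, then re-raise."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            error_events.append(
                {
                    "error_type": type(exc).__name__,
                    "message": str(exc),
                    "stack_trace": traceback.format_exc(),
                }
            )
            raise  # propagate the original exception unchanged
    return wrapper


@report_errors
def main():
    raise ValueError("bad learning rate")


try:
    main()
except ValueError:
    pass  # the caller still sees the exception
```

Because the bare `raise` re-raises the same exception object, callers and process exit codes behave exactly as they would without the decorator; the only side effect is the recorded error event.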

For the training loop and iterations:

```python
with training_iteration() as training_iteration_span:
    [...]
    training_iteration_span.add_metrics(
        IterationMetrics.create(
            current_iteration=current_iteration,
            num_iterations=len(dataloader),
            loss=loss.item(),
            accuracy=accuracy.item(),
        )
    )
```

For checkpoint monitoring:

```python
with checkpoint_save() as checkpoint_save_span:
    [...]
    checkpoint_save_span.add_metrics(
        CheckpointSaveMetrics.create(
            checkpoint_type=CheckPointType.LOCAL,
            current_iteration=current_iteration,
            checkpoint_directory=temp_dir,
            checkpoint_filename=os.path.basename(checkpoint_file_name),
        )
    )
```

For a concrete example, refer to the [torch example](training_telemetry/torch/example.py) or the usage examples.

It's also possible to create spans and events manually; refer to the [recorder](training_telemetry/recorder.py) API for how to do this.

## Contributing

### Set up

Install Poetry and then install the dependencies with:

```bash
poetry install
poetry install --extras "ai-tools"
```

Run the second install command only if you are working outside a container that already has torch and numpy installed. These extras are mainly needed for running the examples and for configuring backends by rank number.

### Run tests

Run the tests with `poetry run pytest`.

### Run mypy

Run the type checker with `poetry run mypy training_telemetry/`.

### Run black

Run the black formatter with `poetry run black .`.

### Run isort

Run the isort formatter with `poetry run isort .`.

Black and isort are also run automatically as a pre-commit hook.

### Run the torch example

Run the torch example with `poetry run python training_telemetry/torch/example.py`.

### Upload to NVIDIA Artifactory

Follow these [instructions](https://confluence.nvidia.com/pages/viewpage.action?pageId=161553571) to set up an Artifactory account and obtain upload permissions.

Then, configure the nv-shared and sw-aidot-heimdall repositories in Poetry:

```bash
poetry config repos.nv-shared https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local
```

```bash
poetry config repos.sw-aidot-heimdall https://urm.nvidia.com/artifactory/api/pypi/sw-aidot-heimdall-pypi-local
```

The first repository is a shared NVIDIA repository with anonymous downloads that we use to share the library within NVIDIA. The second one is a repository that was created to allow publishing the library externally through [Kitmaker](https://kitmaker.gitlab-master-pages.nvidia.com/kitmaker-docs/users/wheels/release.html). It does not currently support anonymous downloads, so we upload the library to both.

To store your credentials:

```bash
poetry config http-basic.nv-shared <username> <password>
```

```bash
poetry config http-basic.sw-aidot-heimdall <username> <password>
```

The username is the LDAP username (not the email) and the password is either the Artifactory encrypted password or an identity token, as explained in the [instructions](https://confluence.nvidia.com/pages/viewpage.action?pageId=161553571).

Build:

```bash
poetry build --clean
```

and upload with:

```bash
poetry publish --repository nv-shared
```

```bash
poetry publish --repository sw-aidot-heimdall
```

Once a release has been uploaded, it cannot be changed or deleted, so be careful. Increment the version number in [pyproject.toml](./pyproject.toml) and in [version.py](./training_telemetry/version.py) after uploading.
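
Because the version lives in two files, a small consistency check before building can catch a missed increment. The sketch below is an illustration only: it assumes that `pyproject.toml` contains a `version = "..."` line and that `version.py` contains a `__version__ = "..."` assignment, which may not match the actual layout of either file. The helper names are hypothetical.

```python
import re

# Matches `version = "x"` or `__version__ = "x"` at the start of a line.
VERSION_PATTERN = re.compile(
    r'^\s*(?:__version__|version)\s*=\s*["\']([^"\']+)["\']', re.MULTILINE
)


def extract_version(text):
    """Return the first version string found in the given file contents."""
    match = VERSION_PATTERN.search(text)
    if match is None:
        raise ValueError("no version assignment found")
    return match.group(1)


def versions_match(pyproject_text, version_py_text):
    """True when both files declare the same version string."""
    return extract_version(pyproject_text) == extract_version(version_py_text)


# Sample contents; the real files may be laid out differently.
pyproject_sample = (
    '[tool.poetry]\nname = "aidot-training-telemetry"\nversion = "1.0.1"\n'
)
version_py_sample = '__version__ = "1.0.1"\n'

in_sync = versions_match(pyproject_sample, version_py_sample)
```

In practice the two strings would be read from disk, for example with `pathlib.Path("pyproject.toml").read_text()`, and the check could run in a pre-commit hook or CI job.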
