About traceml

Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.

p

Published by

README.md

TraceML

Find slow PyTorch training bottlenecks: DataLoader stalls, low GPU utilization, DDP/FSDP rank stragglers, memory creep, and run regressions.

Quickstart • Compare Runs • Read Output • Use With Your Stack • FAQ

Training bottleneck guides: Slow PyTorch Training • DataLoader Bottlenecks • Low GPU Utilization • DDP Rank Stragglers • Memory Creep

TraceML gives every PyTorch training run a structured performance fingerprint with low overhead (<2% in our current benchmark runs). It answers the questions that usually come before heavyweight operator-level profiling:

Are my GPUs waiting on a slow dataloader (input-bound)?
Is one distributed rank consistently slower than the others (straggler)?
Is memory usage silently creeping upward during the run (memory creep)?
Did a recent code or infrastructure change slow training down (regression)?

Where TraceML Fits in the Stack

TraceML does not replace torch.profiler. It is the low-overhead, always-on first pass that tells you where to aim heavier profiling tools.

Tool	Best used for	Output	Cost / overhead
TraceML	Classifying high-level bottlenecks: input, compute, wait, memory, rank skew	JSON fingerprint, text summary, live views	<2% in current benchmark runs; small code wrapper
`torch.profiler`	Inspecting expensive ops, kernels, and CUDA activity	Profiler trace	Higher overhead; requires profiler context
Nsight Systems	Debugging low-level CUDA and kernel behavior	GPU timeline	Separate profiler run
W&B / MLflow	Tracking training metrics and experiment history	Metrics dashboard / run history	Logging integration
`nvidia-smi`	Checking machine-level GPU health and utilization	Terminal metrics	No code changes

3-Minute Quickstart

1. Install the package

pip install traceml-ai

2. Wrap your training step

import traceml_ai as traceml

traceml.init(mode="auto")

for batch in dataloader:
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()

3. Run your script

traceml run train.py

For DDP, FSDP, and multi-node runs, see Distributed Training.

What You Get: The Output

TraceML writes two end-of-run artifacts:

logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt

You can re-print a saved summary later without rerunning training:

traceml view logs/<run_name>/final_summary.json

Instead of guessing why training feels slow, you get a compact diagnosis of where step time and memory went:

+----------------------------------------------------------------------------+
|  Step Time                                                                 |
|  - Diagnosis: INPUT STRAGGLER                                              |
|  - Scope: compared over last 460 aligned steps across 4 global ranks       |
|  - Stats: total 303.7ms | input 254.5ms | compute 259.5ms | wait 40.5ms    |
|  - Why: r0 input was slower than median global rank (254.5/3.8ms).         |
+----------------------------------------------------------------------------+

In this example, rank 0 is the slow input rank, which can hold back the aligned distributed step.

For experiment trackers, call traceml.summary() near the end of your script to get a flat dict of diagnosis statuses and average metrics. Keep final_summary.json when you want the full run artifact or an input for traceml compare.

Catching Regressions (Compare Mode)

Compare a slow run against a known good baseline to identify which metrics changed:

traceml compare input_slow/final_summary.json input_fixed/final_summary.json

+--------------------------------------------------------------------------------------+
|  TraceML Compare                                                                     |
+--------------------------------------------------------------------------------------+
|  Verdict: IMPROVEMENT                                                                |
|  Why: Step time decreased by 95.6%.                                                  |
|                                                                                      |
|  Metric                         A                B                Delta              |
|  Total step                     294.0 ms         13.0 ms          -280.9 ms (-95.6%) |
|  Input                          66.4 ms          2.7 ms           -63.7 ms (-95.9%)  |
+--------------------------------------------------------------------------------------+

See Compare Runs for the full report format.

Display Modes

TraceML controls what you see during training with the --mode flag, without changing the final saved artifacts.

Mode flag	Experience during training	Supported topology
`--mode=summary` (default)	Silent execution	Single-node and multi-node multi-GPU
`--mode=cli`	Live terminal display	Single-node, including multi-GPU
`--mode=dashboard`	Live browser display	Single-node; requires `pip install "traceml-ai[dashboard]"`

Current support

Works today:

Single GPU training
Single-node multi-GPU DDP / FSDP
Multi-node DDP summary reports
Multi-node runs on Slurm (sbatch template + guide)
Run-to-run comparison from final_summary.json
Custom PyTorch loops, Hugging Face, PyTorch Lightning, and Ray Train

On the roadmap:

Multi-node live CLI / browser dashboard
Explicit collective / NCCL timing

Overhead

Overhead: In our benchmark runs, TraceML adds <2% overhead on single GPU and <1% on single-node multi-GPU at default settings.

Learn More

Feedback

For bugs, unexpected results, or feature requests, open a GitHub issue and use the matching issue template. The templates ask for the details we need to reproduce training-environment problems, including hardware, topology, launch command, TraceML version, PyTorch/CUDA versions, and redacted summary output.

GitHub issues: open an issue

If TraceML helped you find a real bottleneck, use the "I found a bottleneck" issue template. These reports help other training teams recognize similar problems.

Security reports: see SECURITY.md

Email: [email protected]

Contributing

Contributions are welcome, especially:

real slowdown examples and repros
distributed training edge cases
docs improvements
framework integrations

See CONTRIBUTING.md for development setup and contribution guidelines.

License

Apache 2.0. See LICENSE.

TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).

traceml