TraceML
Find slow PyTorch training bottlenecks: DataLoader stalls, low GPU utilization, DDP/FSDP rank stragglers, memory creep, and run regressions.
Quickstart • Compare Runs • Read Output • Use With Your Stack • FAQ
Training bottleneck guides: Slow PyTorch Training • DataLoader Bottlenecks • Low GPU Utilization • DDP Rank Stragglers • Memory Creep
TraceML gives every PyTorch training run a structured performance fingerprint with low overhead (<2% in our current benchmark runs). It answers the questions that usually come before heavyweight operator-level profiling:
- Are my GPUs waiting on a slow dataloader (input-bound)?
- Is one distributed rank consistently slower than the others (straggler)?
- Is memory usage silently creeping upward during the run (memory creep)?
- Did a recent code or infrastructure change slow training down (regression)?
Where TraceML Fits in the Stack
TraceML does not replace torch.profiler. It is the low-overhead, always-on
first pass that tells you where to aim heavier profiling tools.
| Tool | Best used for | Output | Cost / overhead |
|---|---|---|---|
| TraceML | Classifying high-level bottlenecks: input, compute, wait, memory, rank skew | JSON fingerprint, text summary, live views | <2% in current benchmark runs; small code wrapper |
| torch.profiler | Inspecting expensive ops, kernels, and CUDA activity | Profiler trace | Higher overhead; requires profiler context |
| Nsight Systems | Debugging low-level CUDA and kernel behavior | GPU timeline | Separate profiler run |
| W&B / MLflow | Tracking training metrics and experiment history | Metrics dashboard / run history | Logging integration |
| nvidia-smi | Checking machine-level GPU health and utilization | Terminal metrics | No code changes |
3-Minute Quickstart
1. Install the package
pip install traceml-ai
2. Wrap your training step
import traceml_ai as traceml

traceml.init(mode="auto")

for batch in dataloader:
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
3. Run your script
traceml run train.py
For DDP, FSDP, and multi-node runs, see Distributed Training.
What You Get: The Output
TraceML writes two end-of-run artifacts:
logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt
You can re-print a saved summary later without rerunning training:
traceml view logs/<run_name>/final_summary.json
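If you would rather inspect the JSON artifact programmatically, a minimal standard-library sketch could look like the following. The schema of final_summary.json is not described in this README, so the helpers (hypothetical names `load_summary` and `top_level_keys`) only load the file and list its top-level keys without assuming any field names.

```python
import json
from pathlib import Path


def load_summary(path):
    """Load a saved summary file and return its parsed JSON content."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)


def top_level_keys(summary):
    """Return the sorted top-level keys if the summary is a JSON object, else []."""
    if isinstance(summary, dict):
        return sorted(summary)
    return []
```

For example, `top_level_keys(load_summary("logs/<run_name>/final_summary.json"))` shows which sections a given run's artifact contains before you write code against specific fields.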
Instead of guessing why training feels slow, you get a compact diagnosis of where step time and memory went:
+----------------------------------------------------------------------------+
| Step Time |
| - Diagnosis: INPUT STRAGGLER |
| - Scope: compared over last 460 aligned steps across 4 global ranks |
| - Stats: total 303.7ms | input 254.5ms | compute 259.5ms | wait 40.5ms |
| - Why: r0 input was slower than median global rank (254.5/3.8ms). |
+----------------------------------------------------------------------------+
In this example, rank 0 is the slow input rank, which can hold back the aligned distributed step.
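To make the "slower than median global rank" reasoning concrete, here is a small sketch of the comparison. It is an illustration only, not TraceML's actual detection logic: the function name `find_input_stragglers`, the `ratio_threshold` parameter, the per-rank values for ranks 1 to 3, and the choice to include every rank in the median are all assumptions.

```python
from statistics import median


def find_input_stragglers(input_ms_by_rank, ratio_threshold=2.0):
    """Return {rank: ratio} for ranks whose input time is at least
    ratio_threshold times the median input time across all ranks.

    ratio_threshold is an illustrative assumption, not a TraceML setting.
    """
    med = median(input_ms_by_rank.values())
    if med <= 0:
        return {}
    return {
        rank: ms / med
        for rank, ms in input_ms_by_rank.items()
        if ms / med >= ratio_threshold
    }


# Hypothetical per-rank input times in ms; rank 0 mirrors the 254.5 ms
# value from the example summary, the other ranks are made up.
input_ms = {0: 254.5, 1: 3.7, 2: 3.8, 3: 3.8}
stragglers = find_input_stragglers(input_ms)
print(stragglers)  # only rank 0 is flagged, at roughly 67x the median
```

Using the median rather than the mean keeps a single very slow rank from inflating the baseline it is compared against.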
For experiment trackers, call traceml.summary() near the end of your script
to get a flat dict of diagnosis statuses and average metrics. Keep
final_summary.json when you want the full run artifact or an input for
traceml compare.
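Because a flat summary dict can mix status strings with numeric averages, it may help to split it before logging, since some trackers accept only numbers for metrics. The sketch below shows one way to do that. The helper name `split_for_tracker`, the `traceml/` prefix, and the example keys are hypothetical; the real keys returned by traceml.summary() are not listed in this README.

```python
from numbers import Real


def split_for_tracker(summary, prefix="traceml/"):
    """Split a flat dict into numeric metrics and string tags, prefixing each key."""
    metrics, tags = {}, {}
    for key, value in summary.items():
        name = f"{prefix}{key}"
        # bool is a subclass of int, so treat it as a tag rather than a metric.
        if isinstance(value, bool) or not isinstance(value, Real):
            tags[name] = str(value)
        else:
            metrics[name] = float(value)
    return metrics, tags


# Hypothetical stand-in for the dict returned by traceml.summary().
example = {"step_time_status": "INPUT STRAGGLER", "step_time_avg_ms": 303.7}
metrics, tags = split_for_tracker(example)
```

The two dicts can then be passed to your tracker's own calls, for example `mlflow.log_metrics(metrics)` and `mlflow.set_tags(tags)`, or merged into a single `wandb.log(...)` payload.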
Catching Regressions (Compare Mode)
Compare a slow run against a known good baseline to identify which metrics changed:
traceml compare input_slow/final_summary.json input_fixed/final_summary.json
+--------------------------------------------------------------------------------------+
| TraceML Compare |
+--------------------------------------------------------------------------------------+
| Verdict: IMPROVEMENT |
| Why: Step time decreased by 95.6%. |
| |
| Metric A B Delta |
| Total step 294.0 ms 13.0 ms -280.9 ms (-95.6%) |
| Input 66.4 ms 2.7 ms -63.7 ms (-95.9%) |
+--------------------------------------------------------------------------------------+
See Compare Runs for the full report format.
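The Delta column in the report above is plain arithmetic on the two runs' values. A minimal sketch of that calculation, assuming A is the baseline and B is the candidate (the function name `metric_delta` is hypothetical and this is not TraceML's own code):

```python
def metric_delta(baseline_ms, candidate_ms):
    """Return (absolute delta in ms, percent change) of candidate relative to baseline."""
    delta = candidate_ms - baseline_ms
    percent = (delta / baseline_ms * 100.0) if baseline_ms else float("nan")
    return delta, percent


# Values taken from the Input row of the report above.
delta, percent = metric_delta(66.4, 2.7)
print(f"{delta:+.1f} ms ({percent:+.1f}%)")  # -63.7 ms (-95.9%)
```

Applying the same function to the displayed Total step values (294.0 and 13.0) gives -281.0 ms rather than the reported -280.9 ms, presumably because the report computes deltas from unrounded measurements.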
Display Modes
The --mode flag controls what you see during training. It does not change the
final saved artifacts.
| Mode flag | Experience during training | Supported topology |
|---|---|---|
| --mode=summary (default) | Silent execution | Single-node and multi-node multi-GPU |
| --mode=cli | Live terminal display | Single-node, including multi-GPU |
| --mode=dashboard | Live browser display | Single-node; requires pip install "traceml-ai[dashboard]" |
Current Support
Works today:
- Single GPU training
- Single-node multi-GPU DDP / FSDP
- Multi-node DDP summary reports
- Multi-node runs on Slurm (sbatch template + guide)
- Run-to-run comparison from final_summary.json
- Custom PyTorch loops, Hugging Face, PyTorch Lightning, and Ray Train
On the roadmap:
- Multi-node live CLI / browser dashboard
- Explicit collective / NCCL timing
Overhead
In our benchmark runs, TraceML adds <2% overhead on single GPU and <1% on single-node multi-GPU at default settings.
Learn More
- Quickstart
- Find why PyTorch training is slow
- Find DataLoader Bottlenecks
- Debug Low GPU Utilization
- Debug DDP Rank Stragglers
- Find PyTorch Memory Creep
- Distributed Training
- Running on Slurm
- Use With Your Stack
- Compare Runs
- How to Read Output
- FAQ
Feedback
For bugs, unexpected results, or feature requests, open a GitHub issue and use the matching issue template. The templates ask for the details we need to reproduce training-environment problems, including hardware, topology, launch command, TraceML version, PyTorch/CUDA versions, and redacted summary output.
GitHub issues: open an issue
If TraceML helped you find a real bottleneck, use the "I found a bottleneck" issue template. These reports help other training teams recognize similar problems.
Security reports: see SECURITY.md
Email: [email protected]
Contributing
Contributions are welcome, especially:
- real slowdown examples and repros
- distributed training edge cases
- docs improvements
- framework integrations
See CONTRIBUTING.md for development setup and contribution guidelines.
License
Apache 2.0. See LICENSE.
TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).