Per-Step Energy
Using matcha wrap
$ matcha wrap torchrun --standalone --nproc_per_node=1 train_gpt.py
warmup_step:20/20
step:1/20000 train_loss:6.9357 train_time:438ms energy:106.7J/step
step:2/20000 train_loss:16.7414 train_time:833ms energy:154.0J/step
step:3/20000 train_loss:8.7524 train_time:1258ms energy:221.8J/step
...
matcha_energy gpus:NVIDIA H100 80GB HBM3 total:97271J (27.02Wh) duration:202.9s avg_power:479W peak_power:701W
Non-step lines such as config, warmup, and validation output pass through unchanged.
At the end, the same summary line as matcha run is printed:
matcha_energy gpus:NVIDIA H100 80GB HBM3 total:97271J (27.02Wh) duration:202.9s avg_power:479W peak_power:701W samples:2025
How step detection works
Matcha scans each line of stdout for patterns that indicate a training step. The following patterns are recognized:
| Pattern | Example |
|---|---|
| step N | step 100 |
| step:N | step:100/20000 |
| iter N | iter 100 |
| iteration N | iteration 100 |
| [N/M] | [100/20000] |
Lines containing warmup are always skipped.
When Matcha sees step N, it ends the measurement for the previous step and starts a new one. Energy is computed from the NVML readings collected between the two step markers.
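The detection rules above can be sketched in a few lines of Python. This is only an illustration: the regular expressions and the `detect_step` helper are assumptions modeled on the pattern table, not Matcha's actual implementation, and the case-sensitive `warmup` check is likewise a guess.

```python
import re

# Hypothetical patterns modeled on the table above; Matcha's real rules may differ.
STEP_PATTERNS = [
    re.compile(r"\bstep[:\s]\s*(\d+)"),            # "step 100", "step:100/20000"
    re.compile(r"\biter(?:ation)?[:\s]\s*(\d+)"),  # "iter 100", "iteration 100"
    re.compile(r"\[(\d+)/\d+\]"),                  # "[100/20000]"
]


def detect_step(line):
    """Return the step number found in a log line, or None if there is none."""
    # Lines containing "warmup" are always skipped (case-sensitive here by assumption).
    if "warmup" in line:
        return None
    for pattern in STEP_PATTERNS:
        match = pattern.search(line)
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    for sample in ["warmup_step:20/20", "step:1/20000 train_loss:6.9357", "[100/20000]"]:
        print(repr(sample), "->", detect_step(sample))
```

In a wrapper built this way, each non-`None` result would close the measurement window for the previous step and open a new one.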
Step gaps
If your training script does not log every step, for example it logs step 10 and then step 200, Matcha measures the total energy across that window and divides it by the number of steps in the window:
step:10/20000 ... energy:242.1J/step avg_power:535W peak_power:606W
step:200/20000 ... energy:356.6J/step avg_power:581W peak_power:700W
The energy field always shows a per-step average, whether the gap is 1 step or 500.
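The per-step averaging can be illustrated with a small sketch. It assumes the power readings arrive as `(time_s, power_w)` pairs and integrates them with the trapezoidal rule. Both the data layout and the integration method are assumptions for illustration, as are the `energy_joules` and `window_stats` names; how Matcha actually turns NVML readings into joules is not specified here.

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def window_stats(samples, prev_step, cur_step):
    """Per-step energy, mean power, and peak power for one logging window."""
    steps = cur_step - prev_step
    duration = samples[-1][0] - samples[0][0]
    if steps <= 0 or duration <= 0:
        raise ValueError("window must span at least one step and a positive duration")
    total = energy_joules(samples)
    return {
        "energy_per_step": total / steps,  # averaged over the step gap
        "avg_power": total / duration,     # time-weighted mean power
        "peak_power": max(p for _, p in samples),
    }


if __name__ == "__main__":
    # Three made-up readings one second apart, covering steps 10 through 20.
    readings = [(0.0, 500.0), (1.0, 600.0), (2.0, 500.0)]
    print(window_stats(readings, prev_step=10, cur_step=20))
```

With these made-up readings the window holds 1100 J over 10 steps, so the reported value would be 110 J/step regardless of how many steps the gap covers.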
Reading per-step output
Each step line gets three fields appended:
| Field | Meaning |
|---|---|
| energy | Energy per step in joules, averaged over the step gap |
| avg_power | Mean GPU power during this window |
| peak_power | Peak GPU power during this window |
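For post-processing a saved log, the appended fields can be pulled out with a short script. The sketch below assumes the `name:value` layout shown in the example lines above (with `J/step` and `W` suffixes) and the `step:N` marker style; `parse_step_line` is a hypothetical helper, not part of Matcha.

```python
import re

# Matches the appended fields, e.g. "energy:242.1J/step" or "avg_power:535W".
FIELD_RE = re.compile(r"\b(energy|avg_power|peak_power):([0-9]+(?:\.[0-9]+)?)")
STEP_RE = re.compile(r"\bstep:(\d+)")


def parse_step_line(line):
    """Return a dict with the step number and energy fields, or None for other lines."""
    step = STEP_RE.search(line)
    fields = {name: float(value) for name, value in FIELD_RE.findall(line)}
    if step is None or "energy" not in fields:
        return None
    fields["step"] = int(step.group(1))
    return fields


if __name__ == "__main__":
    line = "step:10/20000 ... energy:242.1J/step avg_power:535W peak_power:606W"
    print(parse_step_line(line))
```

Collecting these dicts across a run makes it straightforward to sort by `energy` and spot the steps or windows where consumption spikes.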
When to use wrap vs run
Use matcha run for benchmark and production runs. It has zero overhead.
Use matcha wrap for diagnosis: finding energy spikes, identifying inefficient phases, and comparing step-level behavior between configurations. Because it intercepts stdout, it adds minor I/O overhead, though in our benchmarks this was within run-to-run variance.
matcha run (recommended): total energy at the end, zero overhead, no stdout piping. Use for benchmarks and production.
matcha wrap (diagnostic): energy per training step, parses stdout for step markers. Use for finding inefficiencies.