Per-Step Energy

Using matcha wrap


$ matcha wrap torchrun --standalone --nproc_per_node=1 train_gpt.py
warmup_step:20/20
step:1/20000 train_loss:6.9357 train_time:438ms energy:106.7J/step
step:2/20000 train_loss:16.7414 train_time:833ms energy:154.0J/step
step:3/20000 train_loss:8.7524 train_time:1258ms energy:221.8J/step
...
 
Non-step lines like config, warmup, and validation pass through unchanged.

At the end, the same summary line as matcha run is printed:

matcha_energy gpus:NVIDIA H100 80GB HBM3 total:97271J (27.02Wh) duration:202.9s avg_power:479W peak_power:701W
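The figures in the matcha_energy summary line from the example run can be cross-checked with simple arithmetic. The sketch below assumes that the watt-hour value is the joule total divided by 3600 and that avg_power is total energy divided by duration. Both assumptions agree with the example numbers, but this is not a statement of how Matcha computes them internally.

```python
# Values copied from the example summary line.
total_j = 97271.0      # total energy in joules
duration_s = 202.9     # run duration in seconds

# 1 Wh = 3600 J (assumed conversion behind the parenthesised Wh figure).
watt_hours = total_j / 3600.0

# Average power as energy over time (an assumption that matches the example).
avg_power_w = total_j / duration_s

print(f"{watt_hours:.2f} Wh, {avg_power_w:.0f} W")  # prints "27.02 Wh, 479 W"
```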

How step detection works

Matcha scans each line of stdout for patterns that indicate a training step. The recognized patterns are listed in the Pattern/Example table at the end of this section.

Lines containing warmup are always skipped.
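To make the matching rules concrete, here is a minimal sketch of a detector for the documented patterns (`step N`, `step:N`, `iter N`, `iteration N`, `[N/M]`). The regular expressions, the function name `detect_step`, and the case-insensitive matching are all assumptions made for illustration. Matcha's actual implementation may differ.

```python
import re
from typing import Optional

# Hypothetical regexes approximating the documented patterns.
STEP_PATTERNS = [
    re.compile(r"\bstep[: ]\s*(\d+)", re.IGNORECASE),        # "step 100", "step:100/20000"
    re.compile(r"\biter(?:ation)?\s+(\d+)", re.IGNORECASE),  # "iter 100", "iteration 100"
    re.compile(r"\[(\d+)/\d+\]"),                            # "[100/20000]"
]


def detect_step(line: str) -> Optional[int]:
    """Return the step number found in a log line, or None if there is none."""
    # Lines containing "warmup" are skipped (case-insensitivity is an assumption).
    if "warmup" in line.lower():
        return None
    for pattern in STEP_PATTERNS:
        match = pattern.search(line)
        if match:
            return int(match.group(1))
    return None


print(detect_step("step:3/20000 train_loss:8.7524 train_time:1258ms"))  # 3
print(detect_step("warmup_step:20/20"))                                 # None
```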

When Matcha sees step N, it ends the measurement for the previous step and starts a new one. Energy is computed from the NVML readings collected between the two step markers.
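One plausible way to turn readings into energy is to integrate timestamped power samples over the window between two step markers. The sketch below assumes the readings are (timestamp in seconds, power in watts) pairs and uses the trapezoidal rule. The sampling format, the function name `energy_joules`, and the sample values are hypothetical, and Matcha may compute this differently, for example from a cumulative energy counter.

```python
from typing import Sequence, Tuple


def energy_joules(samples: Sequence[Tuple[float, float]]) -> float:
    """Integrate (timestamp_s, power_w) samples with the trapezoidal rule."""
    total = 0.0
    # Pair each sample with its successor and add the area of each trapezoid.
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Hypothetical samples collected between two step markers.
window = [(0.0, 400.0), (0.5, 500.0), (1.0, 600.0)]
energy = energy_joules(window)
print(energy)  # 500.0 (joules)
```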

Step gaps

If your training script does not log every step, for example logging step 10 and then step 200, Matcha measures the total energy across that window and divides by the number of steps in the gap.

The energy field always shows a per-step average, whether the gap is 1 step or 500.
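The averaging can be illustrated with a short sketch. It assumes the number of steps in a gap is the difference between the two logged step numbers. The function name `energy_per_step` and the 19000 J window total are made-up values for illustration only.

```python
def energy_per_step(window_energy_j: float, prev_step: int, cur_step: int) -> float:
    """Average the energy measured between two logged steps over the step gap."""
    steps = cur_step - prev_step  # assumed definition of the gap size
    if steps <= 0:
        raise ValueError("cur_step must be greater than prev_step")
    return window_energy_j / steps


# Hypothetical window: 19000 J measured between step 10 and step 200.
print(energy_per_step(19000.0, 10, 200))  # 100.0 (J/step over 190 steps)
```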

Reading per-step output

Each step line gets three fields appended:

Field	Meaning
`energy`	Energy per step in joules, averaged over the step gap
`avg_power`	Mean GPU power during this window
`peak_power`	Peak GPU power during this window
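For post-processing, the appended energy field can be pulled out of a saved log with a small parser. The sketch below relies only on the `energy:<value>J/step` format visible in the example output. The exact formatting of `avg_power` and `peak_power` is not shown here, so they are not parsed. The regex and the function name `parse_energy` are illustrative assumptions.

```python
import re
from typing import Optional, Tuple

# Matches lines such as "step:1/20000 ... energy:106.7J/step".
ENERGY_RE = re.compile(r"step:(\d+)/\d+.*?energy:(\d+(?:\.\d+)?)J/step")


def parse_energy(line: str) -> Optional[Tuple[int, float]]:
    """Return (step, joules_per_step) for a wrapped step line, else None."""
    match = ENERGY_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


log = """warmup_step:20/20
step:1/20000 train_loss:6.9357 train_time:438ms energy:106.7J/step
step:2/20000 train_loss:16.7414 train_time:833ms energy:154.0J/step
step:3/20000 train_loss:8.7524 train_time:1258ms energy:221.8J/step"""

# Keep only the lines that carry an energy field.
readings = [r for r in map(parse_energy, log.splitlines()) if r is not None]

# The step with the highest per-step energy is a candidate spike.
spike = max(readings, key=lambda r: r[1])
print(spike)  # (3, 221.8)
```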

When to use wrap vs run

Use matcha run for benchmark and production runs. It has zero overhead.

Use matcha wrap for diagnosis: finding energy spikes, identifying inefficient phases, and comparing step-level behavior between configurations. It intercepts stdout, so there is minor I/O overhead, though in our benchmarks this was within run-to-run variance.

Pattern	Example
`step N`	`step 100`
`step:N`	`step:100/20000`
`iter N`	`iter 100`
`iteration N`	`iteration 100`
`[N/M]`	`[100/20000]`