
Per-Step Energy

Using matcha wrap

MATCHA WRAP

    $ matcha wrap torchrun --standalone --nproc_per_node=1 train_gpt.py
    warmup_step:20/20
    step:1/20000 train_loss:6.9357 train_time:438ms energy:106.7J/step
    step:2/20000 train_loss:16.7414 train_time:833ms energy:154.0J/step
    step:3/20000 train_loss:8.7524 train_time:1258ms energy:221.8J/step
    matcha_energy gpus:NVIDIA H100 80GB HBM3 total:97271J (27.02Wh) duration:202.9s avg_power:479W peak_power:701W

Non-step lines like config, warmup, and validation pass through unchanged.

At the end of the run, matcha wrap prints the same summary line as matcha run:

SUMMARY

    matcha_energy gpus:NVIDIA H100 80GB HBM3 total:97271J (27.02Wh) duration:202.9s avg_power:479W peak_power:701W samples:2025
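The summary line is a flat list of `key:value` fields, so it is easy to post-process. The sketch below is a hypothetical helper (`parse_summary` is not part of Matcha) that extracts the numeric fields from the line shown above and cross-checks two of them: watt-hours are joules divided by 3600, and average power is total energy divided by duration.

```python
import re

SUMMARY = (
    "matcha_energy gpus:NVIDIA H100 80GB HBM3 total:97271J (27.02Wh) "
    "duration:202.9s avg_power:479W peak_power:701W samples:2025"
)

# Hypothetical helper, not part of Matcha: captures the numeric part of
# each known field and ignores the unit suffix (J, s, W).
FIELD = re.compile(r"(total|duration|avg_power|peak_power|samples):([0-9.]+)")


def parse_summary(line):
    """Return the numeric summary fields as a dict of floats."""
    return {key: float(value) for key, value in FIELD.findall(line)}


fields = parse_summary(SUMMARY)
watt_hours = fields["total"] / 3600                 # 1 Wh = 3600 J
mean_watts = fields["total"] / fields["duration"]   # J / s = W
print(f"{watt_hours:.2f} Wh, {mean_watts:.0f} W")   # 27.02 Wh, 479 W
```

Both derived values agree with the `27.02Wh` and `avg_power:479W` fields in the summary line.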

How step detection works

Matcha scans each line of stdout for patterns that indicate a training step. The following patterns are recognized:

| Pattern | Example |
| --- | --- |
| step N | step 100 |
| step:N | step:100/20000 |
| iter N | iter 100 |
| iteration N | iteration 100 |
| [N/M] | [100/20000] |

Lines containing warmup are always skipped.
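The matching rules above can be approximated with a few regular expressions. This is a minimal sketch, not Matcha's actual implementation: the regexes, the `detect_step` name, and the case-sensitive `warmup` check are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed regexes for the documented patterns; Matcha's own may differ.
STEP_PATTERNS = [
    re.compile(r"\bstep[:\s]\s*(\d+)"),             # step N, step:N
    re.compile(r"\biter(?:ation)?[:\s]\s*(\d+)"),   # iter N, iteration N
    re.compile(r"\[(\d+)/\d+\]"),                   # [N/M]
]


def detect_step(line: str) -> Optional[int]:
    """Return the step number found in a log line, or None."""
    if "warmup" in line:          # warmup lines are always skipped
        return None
    for pattern in STEP_PATTERNS:
        match = pattern.search(line)
        if match:
            return int(match.group(1))
    return None


for log_line in ("warmup_step:20/20",
                 "step:1/20000 train_loss:6.9357 train_time:438ms",
                 "[100/20000] loss 3.1"):
    print(repr(log_line), "->", detect_step(log_line))
```

Checking for `warmup` before trying any pattern means a line such as `warmup_step:20/20` is never counted as a step, even though it contains `step:`.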

When Matcha sees step N, it ends the measurement for the previous step and starts a new one. Energy is computed from the NVML readings collected between the two step markers.
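How the NVML readings are turned into joules is not specified here. NVML exposes both an instantaneous power reading (`nvmlDeviceGetPowerUsage`, in milliwatts) and, on newer GPUs, a cumulative energy counter (`nvmlDeviceGetTotalEnergyConsumption`, in millijoules). The sketch below assumes the first approach: timestamped power samples collected between two step markers, integrated with the trapezoidal rule. The function name and the sample values are illustrative only.

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples with the trapezoidal rule.

    Assumption: samples are sorted by time and cover the window
    between two consecutive step markers.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)   # mean power * elapsed time
    return total


# Made-up readings at roughly 10 Hz, not measured data.
window = [(0.0, 400.0), (0.1, 500.0), (0.2, 600.0), (0.3, 500.0)]
print(f"{energy_joules(window):.1f} J")
```

A window with fewer than two samples yields 0 J in this sketch, which is one reason very short steps are harder to measure with polled power readings.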

Step gaps

If your training script does not log every step, for example it logs step 10 and then step 200, Matcha measures the total energy across that window and divides by the number of steps in the window:

STEP GAPS

    step:10/20000 … energy:242.1J/step avg_power:535W peak_power:606W
    step:200/20000 … energy:356.6J/step avg_power:581W peak_power:700W

The energy field always shows a per-step average, whether the gap is 1 step or 500.
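As a worked illustration of the averaging, the sketch below divides the energy of a window by the step gap. It assumes the "number of steps" is the difference between the two logged step numbers; the function name and the 1900 J figure are made up for the example.

```python
def energy_per_step(window_joules: float, prev_step: int, step: int) -> float:
    """Average energy per step over the gap between two logged steps."""
    gap = step - prev_step          # assumed definition of the step gap
    if gap <= 0:
        raise ValueError("step numbers must increase")
    return window_joules / gap


# Illustrative numbers: 1900 J measured between step 10 and step 200.
print(energy_per_step(1900.0, 10, 200))   # 10.0
```

With a gap of 1 the function returns the window energy unchanged, which matches the case where every step is logged.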

Reading per-step output

Each step line gets three fields appended:

| Field | Meaning |
| --- | --- |
| energy | Energy per step in joules, averaged over the step gap |
| avg_power | Mean GPU power during this window |
| peak_power | Peak GPU power during this window |
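To analyze these fields after a run, the annotated step lines can be parsed back into numbers. This is a hypothetical post-processing sketch, not a Matcha feature. It assumes the `step:N/M` log form and the field order shown in the examples on this page.

```python
import re

# Assumed layout: "step:N/M ... energy:XJ/step avg_power:YW peak_power:ZW"
STEP_FIELDS = re.compile(
    r"step:(?P<step>\d+)/\d+.*?"
    r"energy:(?P<energy>[0-9.]+)J/step\s+"
    r"avg_power:(?P<avg>[0-9.]+)W\s+"
    r"peak_power:(?P<peak>[0-9.]+)W"
)


def parse_step_line(line):
    """Return the appended energy fields as a dict, or None if absent."""
    m = STEP_FIELDS.search(line)
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "energy_j": float(m["energy"]),
        "avg_power_w": float(m["avg"]),
        "peak_power_w": float(m["peak"]),
    }


line = "step:200/20000 … energy:356.6J/step avg_power:581W peak_power:700W"
print(parse_step_line(line))
```

Collecting these dicts over a whole log gives a per-step series that can be plotted or scanned for unusually high `energy_j` values.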

When to use wrap vs run

Use matcha run for benchmark and production runs. It has zero overhead.

Use matcha wrap for diagnosis: finding energy spikes, identifying inefficient phases, and comparing step-level behavior between configurations. Because it intercepts stdout, there is minor I/O overhead, though in our benchmarks this was within run-to-run variance.

RECOMMENDED: matcha run
Total energy at the end
Zero overhead · No stdout piping
Use for benchmarks and production

DIAGNOSTIC: matcha wrap
Energy per training step
Parses stdout for step markers
Use for finding inefficiencies