Quick Start

Run your training with Matcha

Prefix your training command with `matcha run`. When the process exits, Matcha prints a `matcha_energy` summary line after your training output:


$ matcha run torchrun --standalone --nproc_per_node=1 train_gpt.py
warmup_step:20/20
step:1/20000 train_loss:6.9357 train_time:409ms
step:1312/20000 val_loss:2.2944 train_time:600036ms
stopping_early: wallclock_cap step:1312/20000
 
matcha_energy gpus:NVIDIA H100 80GB HBM3 total:370342J (102.87Wh) duration:687.8s avg_power:538W peak_power:701W

Reading the output

| Field | Meaning |
| --- | --- |
| `gpus` | GPU model and count (auto-detected) |
| `total` | Total energy consumed, in joules (J) and watt-hours (Wh) |
| `duration` | Wall-clock time from start to finish |
| `avg_power` | Mean GPU power draw across the run |
| `peak_power` | Highest instantaneous power reading |
| `samples` | Number of NVML readings (taken at 100 ms intervals) |
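
The energy fields are related by simple arithmetic, which makes the summary line easy to sanity-check. The short sketch below uses the values from the example output above; the variable names are illustrative and not part of Matcha.

```python
# Values copied from the example matcha_energy line above.
total_joules = 370342   # total:370342J
duration_s = 687.8      # duration:687.8s

# 1 Wh = 3600 J, so divide joules by 3600 to get watt-hours.
total_wh = total_joules / 3600

# Mean power is total energy divided by elapsed time (1 W = 1 J/s).
avg_power_w = total_joules / duration_s

print(f"{total_wh:.2f} Wh")    # 102.87 Wh
print(f"{avg_power_w:.0f} W")  # 538 W
```

Both results match the `total` and `avg_power` values reported in the example.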

What counts as duration

Duration covers everything from when Matcha starts to when your training process exits. This includes model compilation, data loading, warmup steps, training, validation, checkpointing, and serialization. It is wall-clock time, not just training time.
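
To estimate how much of a run falls outside the training loop, compare `duration` with the last `train_time` your script printed. This sketch uses the numbers from the example output above. It assumes `train_time` counts only time spent inside the training loop, which depends on the training script rather than on Matcha.

```python
duration_s = 687.8            # duration from the matcha_energy line
train_time_s = 600036 / 1000  # last train_time (ms) printed by the script

# Time Matcha measured that the script's own timer did not report
# (assumed here to be compilation, data loading, warmup, and so on).
outside_training_s = duration_s - train_time_s

print(f"{outside_training_s:.1f} s outside the training loop")   # 87.8 s
print(f"{outside_training_s / duration_s:.1%} of the duration")  # 12.8%
```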

Zero overhead

Matcha does not pipe or intercept your training output; your process writes directly to the terminal. The only work Matcha does is poll NVML from a background thread, which has no measurable impact on training performance.
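
As an illustration of the general approach (not Matcha's actual implementation), the sketch below polls a power reading from a background thread and integrates the samples into energy. `PowerSampler` and `read_power_w` are hypothetical names. On a real system the reader could wrap `pynvml.nvmlDeviceGetPowerUsage`, which reports milliwatts; here a constant stand-in is used so the example runs without a GPU.

```python
import threading
import time


class PowerSampler:
    """Poll a power reading from a background thread (hypothetical helper)."""

    def __init__(self, read_power_w, interval_s=0.1):
        self._read_power_w = read_power_w  # callable returning watts
        self._interval_s = interval_s      # 0.1 s mirrors the 100 ms interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples = []                  # list of (timestamp_s, watts)

    def _take_sample(self):
        self.samples.append((time.monotonic(), self._read_power_w()))

    def _run(self):
        while not self._stop.is_set():
            self._take_sample()
            # wait() acts as the sleep and returns early once stop() is called
            self._stop.wait(self._interval_s)
        self._take_sample()  # final reading so the last interval is covered

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def energy_joules(self):
        # Trapezoidal integration of power (W) over time (s) gives joules.
        total = 0.0
        for (t0, p0), (t1, p1) in zip(self.samples, self.samples[1:]):
            total += 0.5 * (p0 + p1) * (t1 - t0)
        return total


if __name__ == "__main__":
    sampler = PowerSampler(lambda: 500.0)  # stand-in reader: constant 500 W
    sampler.start()
    time.sleep(0.5)                        # the training process would run here
    sampler.stop()
    peak_w = max(watts for _, watts in sampler.samples)
    print(len(sampler.samples), sampler.energy_joules(), peak_w)
```

Because the sampling thread spends almost all of its time waiting on the event, the main workload is left undisturbed between readings.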