
Examples

All examples were run on NVIDIA H100 80GB HBM3 GPUs on RunPod.

Baseline training (17M params, 1xH100)

A nanoGPT-style model trained on FineWeb with a 1,024-token BPE vocabulary, under a 10-minute wallclock cap.

BASELINE · 17M PARAMS · 1xH100 · 10 MIN

```
$ matcha run torchrun --standalone --nproc_per_node=1 train_gpt.py
step:1/20000 train_loss:6.9357 train_time:409ms
step:1000/20000 train_loss:2.4040 train_time:454482ms
step:1312/20000 val_loss:2.2944 val_bpb:1.3589
stopping_early: wallclock_cap step:1312/20000
matcha_energy gpus:NVIDIA H100 80GB HBM3 total:370342J(102.87Wh) duration:687.8s avg_power:538W peak_power:701W
```

steps: 1,312 · energy: 103 Wh · avg power: 538W (77% TDP) · cost: $0.01 energy · $0.48 compute

Key numbers: 1,312 steps in 10 minutes, 103 Wh total, 538W average draw (77% of 700W TDP).
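Every number on the card derives from the joule total on the matcha_energy line. A sketch of the arithmetic; the $0.10/kWh electricity price and $2.50/hr H100 rental rate are assumptions chosen for illustration (they happen to reproduce the card's cost figures), not values matcha reports:

```python
# Reproducing the baseline card's numbers from matcha's joule total.
total_j = 370_342            # from the matcha_energy line
duration_s = 687.8

wh = total_j / 3600                      # 102.87 Wh
avg_w = total_j / duration_s             # ~538 W
tdp_frac = avg_w / 700                   # ~0.77 of the H100's 700 W TDP

# Assumed rates (hypothetical, for illustration only): $0.10/kWh
# electricity, $2.50/hr H100 rental.
energy_cost = wh / 1000 * 0.10           # ~$0.01
compute_cost = duration_s / 3600 * 2.50  # ~$0.48
print(f"{wh:.2f} Wh, {avg_w:.0f} W ({tdp_frac:.0%} TDP), "
      f"${energy_cost:.2f} energy, ${compute_cost:.2f} compute")
```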

SOTA model (27M params, 1xH100)

A larger model that adds SWA, quantization-aware training, GPTQ compression, and sliding-window evaluation.
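Before the run output: quantization-aware training typically fake-quantizes weights on the forward pass while letting gradients flow through unchanged. A minimal int6 sketch of that idea (generic QAT with a straight-through estimator, not the actual train_gpt_sota.py code; int6 matches the final_int6 checkpoint below):

```python
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Per-channel symmetric fake quantization to int6 with a
    straight-through estimator. Generic QAT sketch, not matcha's code."""
    qmax = 2 ** (6 - 1) - 1                                  # 31
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    wq = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    # Forward pass sees quantized weights; backward sees the identity.
    return w + (wq - w).detach()
```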

SOTA · 27M PARAMS · 1xH100 · 30 MIN (TRAINING + GPTQ + EVAL)

```
$ matcha run torchrun --standalone --nproc_per_node=1 train_gpt_sota.py
step:909/20000 val_loss:2.3253 val_bpb:1.3772
stopping_early: wallclock_cap step:909/20000
gptq:generated 64 sequences in 204.3s
final_int6_sliding_window val_bpb:1.8762
matcha_energy gpus:NVIDIA H100 80GB HBM3 total:976085J(271.13Wh) duration:1825.5s avg_power:535W peak_power:704W
```

steps: 909 · energy: 271 Wh · avg power: 535W (76% TDP) · vs baseline: 2.6× energy (GPTQ + sliding eval)

Key numbers: 271 Wh total (2.6× the baseline); post-training GPTQ quantization and sliding-window evaluation add 20+ minutes beyond the 10-minute training cap.
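A rough decomposition of that total, assuming the training phase held its ~535W average for the full 10-minute cap (an assumption; matcha only reports the run-level total):

```python
total_j = 976_085            # from the matcha_energy line
train_j = 535 * 600          # ~321 kJ if training held 535 W for the 10-min cap
post_j = total_j - train_j   # ~655 kJ left for GPTQ + sliding-window eval
print(post_j / total_j)      # ~0.67: roughly two-thirds of the energy
                             # is spent after training stops
```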

Per-step energy (1xH100)

PER-STEP ENERGY · GPU RAMP-UP VISIBLE

```
$ matcha wrap torchrun --standalone --nproc_per_node=1 train_gpt.py
step:0/20000 … energy:60.3J/step     ← compiling, 200W
step:1/20000 … energy:106.7J/step    ← warming up, 354W
step:2/20000 … energy:154.0J/step    ← ramping, 508W
step:3/20000 … energy:221.8J/step    ← near full, 551W
step:200/20000 … energy:368.1J/step  ← sustained, 590W
matcha_energy gpus:NVIDIA H100 80GB HBM3 total:97271J(27.02Wh) duration:202.9s avg_power:479W peak_power:701W
```

First 3 steps draw 50-75% less power than steady state; the torch.compile overhead is visible in the energy profile.

Notice the GPU ramp-up: step 0 draws only 200W because the model is still compiling, step 3 reaches 551W, and by step 200 it sustains 590W near full utilization.
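matcha attributes joules to individual steps; if you wanted to approximate the same measurement by hand, one generic approach (not matcha's implementation) is to sample GPU power over NVML while a step runs:

```python
import threading
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def measure_step(step_fn, interval_s=0.05):
    """Run step_fn while sampling GPU power; return (seconds, joules).

    step_fn should block until GPU work finishes (e.g. end with
    torch.cuda.synchronize()), or the sample window will be too short.
    """
    watts = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            # nvmlDeviceGetPowerUsage reports milliwatts
            watts.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    start = time.perf_counter()
    thread.start()
    step_fn()                        # e.g. one forward/backward/optimizer step
    stop.set()
    thread.join()
    elapsed = time.perf_counter() - start
    avg_w = sum(watts) / max(len(watts), 1)
    return elapsed, avg_w * elapsed  # joules = watts x seconds
```

At a 50ms sampling interval a fast step yields only a few samples, so single-step joules are noisy; sustained figures like the 368.1 J/step at step 200 average that noise out.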