Examples
All examples were run on NVIDIA H100 80GB HBM3 GPUs on RunPod.
Baseline training (17M params, 1xH100)
A nanoGPT-style model trained on FineWeb with a 1,024-token BPE vocabulary, under a 10-minute wallclock cap.
Key numbers: 1,312 steps in 10 minutes, 103 Wh total, 538 W average draw (77% of the 700 W TDP).
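A minimal sketch of how a run's total energy and average draw can be measured, assuming NVML's cumulative energy counter via nvidia-ml-py (pynvml); the counter, the 10-minute cap handling, and the `train_step` stub are illustrative stand-ins, not the exact instrumentation used here.

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # single H100

def total_energy_mj(h):
    # Cumulative GPU energy in millijoules since the driver was loaded
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(h)

def train_step():
    # Placeholder for one optimizer step of the actual model
    time.sleep(0.5)

WALLCLOCK_CAP_S = 10 * 60  # 10-minute training cap
start_energy = total_energy_mj(handle)
start_time = time.monotonic()

step = 0
while time.monotonic() - start_time < WALLCLOCK_CAP_S:
    train_step()
    step += 1

elapsed_h = (time.monotonic() - start_time) / 3600
energy_wh = (total_energy_mj(handle) - start_energy) / 1000 / 3600  # mJ -> Wh
print(f"{step} steps, {energy_wh:.1f} Wh, {energy_wh / elapsed_h:.0f} W average")

pynvml.nvmlShutdown()
```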
SOTA model (27M params, 1xH100)
A larger model with SWA, quantization-aware training, GPTQ compression, and sliding window evaluation.
Key numbers: 271 Wh total (2.6x the baseline). Post-training GPTQ quantization and sliding window evaluation add 20+ minutes of GPU time beyond the 10-minute training cap, which accounts for the extra energy.
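Because most of the extra energy comes from the post-training phases, per-phase accounting makes the breakdown visible. A minimal sketch under the same NVML assumption; the phase functions (`run_training`, `run_gptq`, `run_sliding_window_eval`) are hypothetical placeholders for the actual pipeline stages.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def energy_wh(h):
    # Cumulative GPU energy since driver load, converted from mJ to Wh
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(h) / 1000 / 3600

def run_training(): ...            # placeholder: 10-minute capped training loop
def run_gptq(): ...                # placeholder: post-training GPTQ compression
def run_sliding_window_eval(): ... # placeholder: sliding window evaluation

phases = {
    "training": run_training,
    "gptq_quantization": run_gptq,
    "sliding_window_eval": run_sliding_window_eval,
}

report = {}
for name, phase in phases.items():
    before = energy_wh(handle)
    phase()
    report[name] = energy_wh(handle) - before

for name, wh in report.items():
    print(f"{name}: {wh:.1f} Wh")
print(f"total: {sum(report.values()):.1f} Wh")

pynvml.nvmlShutdown()
```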
Per-step energy (1xH100)
Notice the GPU ramp-up: step 0 draws only 200 W because the model is still compiling, step 3 reaches 551 W, and by step 200 the GPU sustains 590 W near full utilization.
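A minimal sketch of per-step energy logging that would expose this ramp-up, again assuming the NVML cumulative energy counter; `train_step` is a hypothetical placeholder in which step 0 stands in for the slow, low-power compilation step.

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def energy_mj():
    # Cumulative GPU energy in millijoules since driver load
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

def train_step():
    # Placeholder for one optimizer step; in practice step 0 includes
    # compilation, so it is slow and draws well below TDP
    time.sleep(0.1)

prev_e, prev_t = energy_mj(), time.monotonic()
for step in range(201):
    train_step()
    e, t = energy_mj(), time.monotonic()
    step_joules = (e - prev_e) / 1000
    avg_watts = step_joules / (t - prev_t)  # low during compile, near TDP later
    print(f"step {step}: {step_joules:.1f} J, {avg_watts:.0f} W")
    prev_e, prev_t = e, t

pynvml.nvmlShutdown()
```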