Observability
Matcha plugs into any observability stack you already have. Pick one output or several at once: a JSONL file, a Prometheus /metrics endpoint, or OTLP push to a collector.
matcha wrap forwards metrics to three sinks in parallel, any or all of which feed into your stack (Grafana, ClickHouse, and similar):
- JSONL file: --output run.jsonl
- Prometheus /metrics endpoint: --prometheus :9400
- OTLP push: --otlp https://...
JSONL for files and ingestion
Write structured records to a file with --output. Every run emits a session_start record, one step record per step, and a session_end record. Per-GPU breakdowns are included on every step.

    matcha wrap --output run.jsonl \
      --label team=capacity --label config=lr_3e-4 \
      torchrun --standalone --nproc_per_node=8 train_gpt.py

A single step record, formatted for readability:

    {"type":"step","step":1,"energy_j":2354.0,"duration_s":0.612,
     "avg_power_w":3847,"peak_power_w":4120,
     "train_metrics":{"train_loss":6.9357,"step_avg_ms":612.0},
     "gpus":[{"idx":0,"energy_j":323.5,"avg_power_w":528.6},
             {"idx":1,"energy_j":301.2}]}

Stream straight into ClickHouse, DuckDB, Loki, or anything that reads JSON lines:

    cat run.jsonl | clickhouse-client --query "INSERT INTO energy_steps FORMAT JSONEachRow"

Use --json to emit the same records to stdout instead of a file. With matcha wrap, prefer --output so JSONL does not interleave with your training output.
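For quick offline analysis without a database, the records can also be summed directly. The following is a minimal sketch, not part of Matcha: the `summarise` helper and the sample lines are hypothetical, and it relies only on the `type`, `energy_j`, `duration_s`, and `gpus` fields shown in the step record above. Other record types are skipped because their fields are not documented here.

```python
import json
from collections import defaultdict


def summarise(lines):
    """Sum energy and duration over all step records in an iterable of JSONL lines."""
    total_j = 0.0
    total_s = 0.0
    per_gpu_j = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # Only step records carry per-step energy; skip session_start / session_end.
        if rec.get("type") != "step":
            continue
        total_j += rec["energy_j"]
        total_s += rec["duration_s"]
        for gpu in rec.get("gpus", []):
            per_gpu_j[gpu["idx"]] += gpu["energy_j"]
    return {
        "energy_j": total_j,
        "duration_s": total_s,
        # Average power over the measured steps (joules / seconds = watts).
        "avg_power_w": total_j / total_s if total_s else 0.0,
        "per_gpu_energy_j": dict(per_gpu_j),
    }


# Hypothetical sample: session records are reduced to their "type" field only.
SAMPLE = [
    '{"type":"session_start"}',
    '{"type":"step","step":1,"energy_j":2354.0,"duration_s":0.612,'
    '"gpus":[{"idx":0,"energy_j":323.5},{"idx":1,"energy_j":301.2}]}',
    '{"type":"session_end"}',
]

if __name__ == "__main__":
    print(summarise(SAMPLE))
```

To run it against a real file, pass the open file object instead of the sample list, for example `summarise(open("run.jsonl"))`.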
Prometheus (pull)

    matcha wrap --prometheus :9400 torchrun train.py

Matcha serves a /metrics endpoint on the given port for any Prometheus-compatible scraper. Both live GPU gauges and step-level metrics are exposed.

    matcha_step_energy_joules{run_id="...",gpu="0"} 2354.0
    matcha_step_duration_seconds{run_id="..."} 0.612
    matcha_step_peak_power_watts{run_id="..."} 4120
    matcha_step_gpu_energy_deviation_ratio{gpu="3"} -0.18 # straggler
    matcha_metric_train_loss{run_id="..."} 6.9357
    matcha_metric_step_avg_ms{run_id="..."} 612.0

One-line alert on a stuck rank:

    - alert: GpuStraggler
      expr: matcha_step_gpu_energy_deviation_ratio < -0.15
      for: 5m

OpenTelemetry / OTLP (push)
Install the OTLP extra and point Matcha at an OTLP/HTTP collector. The metric set matches the Prometheus endpoint, so dashboards port between deployments.

    pip install 'usematcha[otlp]'
    matcha wrap \
      --otlp https://otlp-gateway.grafana.net/otlp \
      --otlp-header "Authorization=Basic <token>" \
      torchrun train.py

--otlp-header can be repeated for multiple headers. Works with Grafana Cloud, Honeycomb, Datadog, or any OTel collector accepting OTLP/HTTP.
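If your collector expects HTTP Basic authentication, the `<token>` placeholder is the base64 encoding of `user:password` (RFC 7617). The sketch below shows one way to build the header value. The `basic_auth_header` helper is hypothetical, and which credentials form the pair is an assumption to check against your provider's documentation (Grafana Cloud, for instance, typically pairs an instance ID with an API token).

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Return a KEY=VALUE string in the form shown for --otlp-header above."""
    # RFC 7617: credentials are "user:password", base64-encoded.
    raw = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization=Basic {token}"


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    print(basic_auth_header("user", "pass"))
```

Avoid pasting real credentials into shell history; reading them from environment variables before calling the helper is usually safer.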
Training metrics (automatic)
In wrap mode, Matcha parses numeric fields from your stdout and surfaces them alongside energy as matcha_metric_*. No configuration, no code changes. Works out of the box for nanoGPT, modded-nanogpt, parameter-golf, DeepSpeed, and HF Trainer.
| Stdout pattern | Extracted metric |
|---|---|
| train_loss:6.9357 | matcha_metric_train_loss |
| lr=1.23e-4 | matcha_metric_lr |
| {'loss': 2.3, 'grad_norm': 0.8} | matcha_metric_loss, matcha_metric_grad_norm |
| step_avg:612.0ms | matcha_metric_step_avg_ms |
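To get a feel for how the patterns in the table could map to metric names, here is a rough sketch of this kind of extraction using one regular expression. It is an illustration only and not Matcha's actual parser: `METRIC_RE` and `extract_metrics` are hypothetical names, and the real matching rules may differ (for example in which units or key formats are recognized).

```python
import re

# key, optional closing quote (dict-style output), ':' or '=', a number,
# and an optional "ms" unit suffix that is folded into the metric name.
METRIC_RE = re.compile(
    r"""(?P<key>[A-Za-z_]\w*)['"]?\s*[:=]\s*"""
    r"""(?P<val>[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)"""
    r"""(?P<unit>ms)?"""
)


def extract_metrics(line: str) -> dict:
    """Return {metric_name: value} for every key/number pair found in a stdout line."""
    metrics = {}
    for match in METRIC_RE.finditer(line):
        name = match.group("key")
        if match.group("unit"):
            # step_avg:612.0ms becomes step_avg_ms
            name = f"{name}_{match.group('unit')}"
        metrics[f"matcha_metric_{name}"] = float(match.group("val"))
    return metrics


if __name__ == "__main__":
    for sample in (
        "train_loss:6.9357",
        "lr=1.23e-4",
        "{'loss': 2.3, 'grad_norm': 0.8}",
        "step_avg:612.0ms",
    ):
        print(sample, "->", extract_metrics(sample))
```

If a metric you log does not show up, checking your stdout line against a simple pattern like this one is a reasonable first debugging step.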
Pairing matcha_metric_train_loss with matcha_step_energy_joules in Grafana gives you an efficiency curve: joules per unit of loss reduction.
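The same curve can be approximated offline from step records. The sketch below is one possible definition, stated as an assumption: it treats the first record's loss as the baseline and divides the energy spent after that step by the loss reduction achieved so far. `joules_per_loss_unit` and the sample records are hypothetical; only the `step`, `energy_j`, and `train_metrics.train_loss` fields from the step record format are used.

```python
def joules_per_loss_unit(steps):
    """Return [(step, joules per unit of loss reduction)] for ordered step records."""
    if not steps:
        return []
    baseline = steps[0]["train_metrics"]["train_loss"]
    spent_j = 0.0
    curve = []
    for rec in steps[1:]:
        spent_j += rec["energy_j"]
        drop = baseline - rec["train_metrics"]["train_loss"]
        # The ratio is undefined until the loss has actually improved.
        if drop > 0:
            curve.append((rec["step"], spent_j / drop))
    return curve


# Made-up records for illustration; real values come from run.jsonl.
SAMPLE_STEPS = [
    {"step": 1, "energy_j": 2000.0, "train_metrics": {"train_loss": 7.0}},
    {"step": 2, "energy_j": 2000.0, "train_metrics": {"train_loss": 6.5}},
    {"step": 3, "energy_j": 2000.0, "train_metrics": {"train_loss": 6.0}},
]

if __name__ == "__main__":
    for step, ratio in joules_per_loss_unit(SAMPLE_STEPS):
        print(step, ratio)
```

A rising ratio over time means each further unit of loss reduction is costing more energy, which is the signal an efficiency dashboard would surface.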
Labels and run IDs
Attach arbitrary labels to a run for filtering across JSONL, Prometheus, and OTLP. Use --run-id (or the MATCHA_RUN_ID env var) to pin a stable identifier so retries and resumed runs stay grouped.

    matcha wrap --run-id exp-2026-04-18-a \
      --label team=capacity --label config=lr_3e-4 \
      --prometheus :9400 torchrun train.py

Labels appear as fields on every JSONL record and as Prometheus / OTLP label dimensions. --label is repeatable.
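When runs are launched from a script or scheduler, it can help to build the command line from a dictionary so labels stay consistent across retries. This is a minimal sketch under that assumption: `matcha_wrap_argv` is a hypothetical helper, and it uses only the `--run-id`, `--label`, and `--prometheus` flags shown above.

```python
import shlex


def matcha_wrap_argv(train_cmd, run_id=None, labels=None, prometheus=None):
    """Build an argv list for `matcha wrap` followed by the training command."""
    argv = ["matcha", "wrap"]
    if run_id:
        argv += ["--run-id", run_id]
    # --label is repeatable: emit one flag per key=value pair.
    for key, value in (labels or {}).items():
        argv += ["--label", f"{key}={value}"]
    if prometheus:
        argv += ["--prometheus", prometheus]
    return argv + list(train_cmd)


if __name__ == "__main__":
    argv = matcha_wrap_argv(
        ["torchrun", "train.py"],
        run_id="exp-2026-04-18-a",
        labels={"team": "capacity", "config": "lr_3e-4"},
        prometheus=":9400",
    )
    # shlex.join quotes arguments safely for display or logging.
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` directly, which avoids shell quoting issues in label values.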