Observability

Matcha exports to the observability stack you already run. Enable one output or several at once: a JSONL file, a Prometheus /metrics endpoint, or OTLP push to a collector.

JSONL for files and ingestion

Write structured records to a file with --output. Every run emits a session_start record, one step record per step, and a session_end record. Per-GPU breakdowns are included on every step.

A single step record, formatted for readability:


{"type":"step","step":1,"energy_j":2354.0,"duration_s":0.612,
 "avg_power_w":3847,"peak_power_w":4120,
 "train_metrics":{"train_loss":6.9357,"step_avg_ms":612.0},
 "gpus":[{"idx":0,"energy_j":323.5,"avg_power_w":528.6},
         {"idx":1,"energy_j":301.2}]}

Stream the records straight into ClickHouse, DuckDB, Loki, or anything else that reads JSON Lines.

Use --json to emit the same records to stdout instead of a file. With matcha wrap, prefer --output so JSONL does not interleave with your training output.
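
Because every record is one JSON object per line, a few lines of Python are enough to summarize a run. The sketch below is illustrative only: it assumes the session records carry "type" values of session_start and session_end (mirroring the "type":"step" field shown above), and it reads from an in-memory stand-in rather than a real file written with --output.

```python
import io
import json


def summarize_steps(lines):
    """Aggregate step records from an iterable of JSONL lines."""
    total_energy_j = 0.0
    total_duration_s = 0.0
    steps = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("type") != "step":
            continue  # skip session_start / session_end records
        total_energy_j += record["energy_j"]
        total_duration_s += record["duration_s"]
        steps += 1
    # Mean power over the run is total energy divided by total step time.
    mean_power_w = total_energy_j / total_duration_s if total_duration_s else 0.0
    return {
        "steps": steps,
        "energy_j": total_energy_j,
        "duration_s": total_duration_s,
        "mean_power_w": mean_power_w,
    }


# In-memory stand-in for a JSONL file; the values are illustrative.
sample = io.StringIO(
    '{"type":"session_start"}\n'
    '{"type":"step","step":1,"energy_j":2354.0,"duration_s":0.612}\n'
    '{"type":"step","step":2,"energy_j":2300.0,"duration_s":0.600}\n'
    '{"type":"session_end"}\n'
)
summary = summarize_steps(sample)
print(summary)
```

To run it against a real file, pass an open file handle instead of the StringIO object.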

Prometheus (pull)

Matcha serves a /metrics endpoint on the port you configure, for any Prometheus-compatible scraper. The endpoint exposes both live GPU gauges and step-level metrics, for example:


matcha_step_energy_joules{run_id="...",gpu="0"}     2354.0
matcha_step_duration_seconds{run_id="..."}          0.612
matcha_step_peak_power_watts{run_id="..."}          4120
matcha_step_gpu_energy_deviation_ratio{gpu="3"}    -0.18   # straggler
matcha_metric_train_loss{run_id="..."}              6.9357
matcha_metric_step_avg_ms{run_id="..."}             612.0

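A stuck rank shows up as a strongly negative matcha_step_gpu_energy_deviation_ratio, which makes that series a natural target for a one-line alert.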
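
In production this check would normally live in a Prometheus alerting rule. The Python sketch below only illustrates the logic offline: it parses a scraped /metrics payload and lists GPUs whose deviation ratio falls below a threshold. The -0.15 threshold, the run_id value, and the helper names are hypothetical, and reading a negative ratio as "straggler" is inferred from the sample output above.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        m = SAMPLE_RE.match(line)
        if m is None:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def find_stragglers(text, threshold=-0.15):
    """List GPU ids whose energy deviation ratio is at or below threshold."""
    return sorted(
        labels["gpu"]
        for name, labels, value in parse_samples(text)
        if name == "matcha_step_gpu_energy_deviation_ratio"
        and "gpu" in labels
        and value <= threshold
    )


# Illustrative scrape; only the metric names come from the listing above.
scrape = """
matcha_step_duration_seconds{run_id="demo"} 0.612
matcha_step_gpu_energy_deviation_ratio{gpu="0"} 0.04
matcha_step_gpu_energy_deviation_ratio{gpu="3"} -0.18
"""
print(find_stragglers(scrape))
```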

OpenTelemetry / OTLP (push)

Install the OTLP extra and point Matcha at an OTLP/HTTP collector. The metric set matches the Prometheus endpoint, so dashboards port between deployments.

--otlp-header can be repeated for multiple headers. Works with Grafana Cloud, Honeycomb, Datadog, or any OTel collector accepting OTLP/HTTP.

Training metrics (automatic)

In wrap mode, Matcha parses numeric fields from your stdout and surfaces them alongside energy as matcha_metric_*. No configuration, no code changes. Works out of the box for nanoGPT, modded-nanogpt, parameter-golf, DeepSpeed, and HF Trainer.

Stdout pattern	Extracted metric
`train_loss:6.9357`	`matcha_metric_train_loss`
`lr=1.23e-4`	`matcha_metric_lr`
`{'loss': 2.3, 'grad_norm': 0.8}`	`matcha_metric_loss`, `matcha_metric_grad_norm`
`step_avg:612.0ms`	`matcha_metric_step_avg_ms`
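
To make the mapping in the table concrete, here is a minimal regex sketch that reproduces those four rows. It is not Matcha's actual parser; the pattern and the extract_metrics helper are assumptions made for illustration, and a real implementation would need to handle far more log formats.

```python
import re

# key, optional closing quote (dict-style logs), ':' or '=', a number,
# and an optional 'ms' unit suffix that is folded into the metric name.
FIELD_RE = re.compile(
    r"\b(?P<key>[A-Za-z_]\w*)'?\s*[:=]\s*"
    r"(?P<num>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
    r"(?P<unit>ms)?"
)


def extract_metrics(line):
    """Map numeric fields in one stdout line to matcha_metric_* names."""
    metrics = {}
    for m in FIELD_RE.finditer(line):
        name = m.group("key")
        if m.group("unit"):
            name += "_" + m.group("unit")  # step_avg + ms -> step_avg_ms
        metrics["matcha_metric_" + name] = float(m.group("num"))
    return metrics


for line in (
    "train_loss:6.9357",
    "lr=1.23e-4",
    "{'loss': 2.3, 'grad_norm': 0.8}",
    "step_avg:612.0ms",
):
    print(line, "->", extract_metrics(line))
```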

Pairing matcha_metric_train_loss with matcha_step_energy_joules in Grafana gives you an efficiency curve: joules per unit of loss reduction.
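
One way to put a number on that curve outside Grafana is to divide the energy spent by the loss reduction it bought. The sketch below uses one possible definition, chosen here as an assumption: energy from the second step onward, divided by the drop in loss between the first and last step. The function name and sample values are hypothetical.

```python
def joules_per_loss_unit(steps):
    """Energy spent per unit of loss reduction.

    `steps` is a list of (energy_j, train_loss) pairs in step order.
    Returns None when there are fewer than two steps or loss did not fall.
    """
    if len(steps) < 2:
        return None
    # Energy of steps 2..n is what moved the loss from its first
    # recorded value to its last one.
    energy_j = sum(energy for energy, _ in steps[1:])
    loss_drop = steps[0][1] - steps[-1][1]
    if loss_drop <= 0:
        return None
    return energy_j / loss_drop


# Illustrative values only.
history = [(2354.0, 6.9357), (2300.0, 6.5), (2310.0, 6.0)]
efficiency = joules_per_loss_unit(history)
print(efficiency)
```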

Labels and run IDs

Attach arbitrary labels to a run with --label so you can filter across JSONL, Prometheus, and OTLP. Use --run-id (or the MATCHA_RUN_ID env var) to pin a stable identifier so retries and resumed runs stay grouped.

Labels appear as fields on every JSONL record and as Prometheus / OTLP label dimensions. --label is repeatable.
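
Since labels travel with every JSONL record, grouping runs by label takes only a small script. The sketch below assumes labels appear as top-level keys on each record and uses a hypothetical label named experiment; check your own output for the actual field layout.

```python
import json
from collections import defaultdict


def energy_by_label(lines, label):
    """Sum step energy per value of the given label field."""
    totals = defaultdict(float)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Only step records carry energy_j; skip records without the label.
        if record.get("type") == "step" and label in record:
            totals[record[label]] += record["energy_j"]
    return dict(totals)


# Illustrative records; "experiment" is a hypothetical label key.
records = [
    '{"type":"step","step":1,"energy_j":2354.0,"experiment":"baseline"}',
    '{"type":"step","step":1,"energy_j":2100.0,"experiment":"fp8"}',
    '{"type":"step","step":2,"energy_j":2346.0,"experiment":"baseline"}',
]
totals = energy_by_label(records, "experiment")
print(totals)
```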