
Observability

Matcha plugs into any observability stack you already have. Pick one output or several at once: a JSONL file, a Prometheus /metrics endpoint, or OTLP push to a collector.


matcha wrap can forward metrics to any combination of three sinks in parallel, each of which feeds your existing stack (Grafana, ClickHouse, and similar):

  • JSONL file — --output run.jsonl
  • Prometheus /metrics endpoint — --prometheus :9400
  • OTLP push — --otlp https://...

JSONL for files and ingestion

Write structured records to a file with --output. Every run emits a session_start record, one step record per step, and a session_end record. Per-GPU breakdowns are included on every step.

matcha wrap --output run.jsonl \
  --label team=capacity --label config=lr_3e-4 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

A single step record, formatted for readability:

{"type":"step","step":1,"energy_j":2354.0,"duration_s":0.612,
 "avg_power_w":3847,"peak_power_w":4120,
 "train_metrics":{"train_loss":6.9357,"step_avg_ms":612.0},
 "gpus":[{"idx":0,"energy_j":323.5,"avg_power_w":528.6},
         {"idx":1,"energy_j":301.2}]}
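As a rough illustration of consuming these records, the sketch below sums energy_j per GPU across all step records in a JSONL stream. It relies only on the fields visible in the step record above (type, gpus, idx, energy_j). The sample lines are invented for illustration, and the bare session_start / session_end records are an assumption about their shape, not Matcha's actual output.

```python
import json
from collections import defaultdict


def per_gpu_energy(lines):
    """Sum energy_j per GPU index across all step records.

    `lines` is any iterable of JSONL strings; an open file works too.
    """
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Only step records are assumed to carry a "gpus" array;
        # session_start / session_end records are skipped.
        if record.get("type") != "step":
            continue
        for gpu in record.get("gpus", []):
            totals[gpu["idx"]] += gpu.get("energy_j", 0.0)
    return dict(totals)


# Invented sample data; the session record shapes are assumptions.
sample = [
    '{"type":"session_start"}',
    '{"type":"step","step":1,"energy_j":2354.0,'
    '"gpus":[{"idx":0,"energy_j":323.5},{"idx":1,"energy_j":301.2}]}',
    '{"type":"step","step":2,"energy_j":2340.0,'
    '"gpus":[{"idx":0,"energy_j":320.0},{"idx":1,"energy_j":300.0}]}',
    '{"type":"session_end"}',
]
print(per_gpu_energy(sample))

# On a real run, pass the file handle instead:
#   with open("run.jsonl") as f:
#       print(per_gpu_energy(f))
```

Reading line by line keeps memory flat for long runs, since no record needs to stay resident after its energy has been added to the totals.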

Stream straight into ClickHouse, DuckDB, Loki, or anything that reads JSON lines:

cat run.jsonl | clickhouse-client --query "INSERT INTO energy_steps FORMAT JSONEachRow"

Use --json to emit the same records to stdout instead of a file. With matcha wrap, prefer --output so JSONL does not interleave with your training output.

Prometheus (pull)

matcha wrap --prometheus :9400 torchrun train.py

Matcha serves a /metrics endpoint on the given port for any Prometheus-compatible scraper. Both live GPU gauges and step-level metrics are exposed.

matcha_step_energy_joules{run_id="...",gpu="0"} 2354.0
matcha_step_duration_seconds{run_id="..."} 0.612
matcha_step_peak_power_watts{run_id="..."} 4120
matcha_step_gpu_energy_deviation_ratio{gpu="3"} -0.18  # straggler
matcha_metric_train_loss{run_id="..."} 6.9357
matcha_metric_step_avg_ms{run_id="..."} 612.0
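For a quick check without a Prometheus server, the endpoint's text output can be parsed directly. The sketch below is a deliberately minimal parser for the `name{labels} value` line shape shown above. It is not a complete implementation of the Prometheus exposition format (it ignores timestamps, escaped label values, and special values such as NaN), and the host in the closing comment is an assumption.

```python
import re

# metric_name{label="value",...} number   (the label block is optional)
_SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>[^}]*)\})?"
    r"\s+(?P<value>[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)"
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and comment lines (# HELP, # TYPE, ...).
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append(
            (match.group("name"), labels, float(match.group("value")))
        )
    return samples


text = """\
matcha_step_duration_seconds{run_id="abc"} 0.612
matcha_step_peak_power_watts{run_id="abc"} 4120
matcha_step_gpu_energy_deviation_ratio{gpu="3"} -0.18
"""
for name, labels, value in parse_metrics(text):
    print(name, labels, value)

# Against a live run (host and port are assumptions for illustration):
#   from urllib.request import urlopen
#   text = urlopen("http://localhost:9400/metrics").read().decode()
```

For anything beyond ad hoc inspection, a maintained parser or a real scraper is the safer choice.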

Alert on a straggling or stuck rank with a single expression:

- alert: GpuStraggler
  expr: matcha_step_gpu_energy_deviation_ratio < -0.15
  for: 5m
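This page does not state how the deviation ratio is computed. One plausible reading, used here purely as an assumption, is each GPU's step energy relative to the mean across GPUs: (energy - mean) / mean. The sketch below applies that assumed definition and the same -0.15 threshold to a single step. The energy values are invented, and the rule's `for: 5m` clause (the condition must hold continuously) is not modeled.

```python
def energy_deviation_ratios(gpu_energies_j):
    """Map gpu idx -> (energy - mean) / mean for one step.

    ASSUMED definition; Matcha's actual formula may differ.
    """
    if not gpu_energies_j:
        return {}
    mean = sum(gpu_energies_j.values()) / len(gpu_energies_j)
    if mean == 0:
        return {idx: 0.0 for idx in gpu_energies_j}
    return {idx: (e - mean) / mean for idx, e in gpu_energies_j.items()}


def stragglers(ratios, threshold=-0.15):
    """GPU indices whose ratio falls below the alert threshold."""
    return sorted(idx for idx, r in ratios.items() if r < threshold)


# Invented per-GPU energies (joules) for one step; GPU 3 lags its peers.
step_energy = {0: 300.0, 1: 300.0, 2: 300.0, 3: 240.0}
ratios = energy_deviation_ratios(step_energy)
print(stragglers(ratios))
```

Under this reading, a GPU that does less work per step than its peers draws less energy, so a persistently negative ratio points at a rank that is waiting rather than computing.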

OpenTelemetry / OTLP (push)

Install the OTLP extra and point Matcha at an OTLP/HTTP collector. The metric set matches the Prometheus endpoint, so dashboards port between deployments.

pip install 'usematcha[otlp]'

matcha wrap \
  --otlp https://otlp-gateway.grafana.net/otlp \
  --otlp-header "Authorization=Basic <token>" \
  torchrun train.py

--otlp-header can be repeated for multiple headers. Works with Grafana Cloud, Honeycomb, Datadog, or any OTel collector accepting OTLP/HTTP.

Training metrics (automatic)

In wrap mode, Matcha parses numeric fields from your training command's stdout and surfaces them alongside energy as matcha_metric_*. No configuration, no code changes. Works out of the box for nanoGPT, modded-nanogpt, parameter-golf, DeepSpeed, and HF Trainer.

Stdout pattern                     Extracted metric
train_loss:6.9357                  matcha_metric_train_loss
lr=1.23e-4                         matcha_metric_lr
{'loss': 2.3, 'grad_norm': 0.8}    matcha_metric_loss, matcha_metric_grad_norm
step_avg:612.0ms                   matcha_metric_step_avg_ms
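To make the mapping in the table concrete, here is a small regex-based sketch that reproduces those four rows. It is not Matcha's parser. The pattern, the `extract_metrics` name, and the rule that only an `ms` suffix is folded into the metric name are all assumptions chosen to match the examples shown.

```python
import re

# key:value, key=value, or 'key': value, with an optional "ms" unit suffix.
_FIELD = re.compile(
    r"\b(?P<key>[A-Za-z_][A-Za-z0-9_]*)['\"]?\s*[:=]\s*"
    r"(?P<num>[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)"
    r"(?P<unit>ms\b)?"
)


def extract_metrics(line, prefix="matcha_metric_"):
    """Return {metric_name: value} for every numeric field in `line`."""
    metrics = {}
    for m in _FIELD.finditer(line):
        name = prefix + m.group("key")
        # A trailing "ms" unit becomes part of the metric name.
        if m.group("unit"):
            name += "_ms"
        metrics[name] = float(m.group("num"))
    return metrics


for line in [
    "train_loss:6.9357",
    "lr=1.23e-4",
    "{'loss': 2.3, 'grad_norm': 0.8}",
    "step_avg:612.0ms",
]:
    print(line, "->", extract_metrics(line))
```

A single permissive pattern covers colon, equals, and dict-style logging, which is why one pass can handle output from several training frameworks. The trade-off is that any `word: number` pair in the log is treated as a metric.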

Pairing matcha_metric_train_loss with matcha_step_energy_joules in Grafana gives you an efficiency curve: joules per unit of loss reduction.
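The same efficiency figure can be approximated offline from step records. The sketch below assumes each record has the energy_j and train_metrics.train_loss fields shown in the step record example, and it attributes the loss reduction to the energy spent after the first loss reading. That attribution is a modeling choice made here, not something Matcha defines, and the sample values are invented.

```python
def joules_per_loss_reduction(steps):
    """Energy spent per unit of train_loss reduction.

    `steps` is a list of step records (dicts) in step order.
    Returns None when there is no measurable reduction.
    """
    points = [
        (s["energy_j"], s["train_metrics"]["train_loss"])
        for s in steps
        if "train_loss" in s.get("train_metrics", {})
    ]
    if len(points) < 2:
        return None
    reduction = points[0][1] - points[-1][1]
    if reduction <= 0:
        return None
    # Count only energy spent after the first loss reading, i.e. on the
    # steps that produced the observed reduction.
    energy = sum(e for e, _ in points[1:])
    return energy / reduction


# Invented values for illustration.
steps = [
    {"type": "step", "energy_j": 2000.0, "train_metrics": {"train_loss": 7.0}},
    {"type": "step", "energy_j": 2000.0, "train_metrics": {"train_loss": 6.0}},
    {"type": "step", "energy_j": 2000.0, "train_metrics": {"train_loss": 5.0}},
]
print(joules_per_loss_reduction(steps))
```

Comparing this number across runs with different --label config values is one way to rank configurations by energy efficiency rather than by wall-clock time alone.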

Labels and run IDs

Attach arbitrary labels to a run for filtering across JSONL, Prometheus, and OTLP. Use --run-id (or the MATCHA_RUN_ID env var) to pin a stable identifier so retries and resumed runs stay grouped.

matcha wrap --run-id exp-2026-04-18-a \
  --label team=capacity --label config=lr_3e-4 \
  --prometheus :9400 torchrun train.py

Labels appear as fields on every JSONL record and as Prometheus / OTLP label dimensions. --label is repeatable.