
Lead with traces, not dashboards
We instrument user journeys end to end and derive metrics from the resulting traces. Dashboards stay slim; exploratory debugging happens in traces, with guardrails on cardinality.
```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout");

// ChargeInput and paymentClient are application-level types, shown as-is.
export async function charge(userId: string, payload: ChargeInput) {
  return tracer.startActiveSpan("charge", async (span) => {
    span.setAttribute("user.id", userId);
    span.setAttribute("cart.items", payload.items.length);
    try {
      const result = await paymentClient.charge(payload);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: "charge failed" });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

Rules we keep
- Every alert links to a runbook and an owning team
- Dashboards cap at 12 charts—anything else is a trace query
- Sampling tuned per route with business impact in mind
Adopt sampling that tracks revenue or risk—not just traffic—so the right customers stay in view during incidents.
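One way to make route-aware sampling concrete is a deterministic head-sampling decision keyed on the trace ID, so every service in a trace agrees without coordination. A minimal sketch; the route-to-rate table and the hash are illustrative assumptions, not a specific OpenTelemetry API:

```typescript
// Sketch of route-aware head sampling: revenue-critical routes keep every
// trace, noisy routes keep almost none. Rates here are examples.
type SampleRates = Record<string, number>;

const rates: SampleRates = {
  "/checkout": 1.0,   // always keep revenue-critical traffic
  "/search": 0.1,
  "/healthz": 0.001,  // near-zero for health-check noise
};

const DEFAULT_RATE = 0.05;

// Deterministic per-trace decision: hash the trace ID into [0, 1) so every
// service in the same trace reaches the same verdict independently.
export function shouldSample(route: string, traceId: string): boolean {
  const rate = rates[route] ?? DEFAULT_RATE;
  let hash = 0;
  for (const ch of traceId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash / 0xffffffff < rate;
}
```

During an incident this keeps the customers who matter in view: checkout traces are always present, while health checks rarely are.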
Dashboards that stay lean
- Ship a single service health score: availability, latency, error rate
- One panel per user journey; everything else is a saved trace query
- Alert on burn rate and user impact, not raw error counts
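The burn-rate alerting above can be sketched as a small calculation: burn rate is the observed error ratio divided by the error budget, and paging requires both a short and a long window to burn fast so a brief blip stays quiet. The SLO target and the 14.4 threshold are illustrative assumptions:

```typescript
// Sketch of multiwindow burn-rate math. Tune the target and threshold to
// your own error-budget policy; these numbers are examples.
const SLO_TARGET = 0.999;                 // 99.9% availability
const ERROR_BUDGET = 1 - SLO_TARGET;      // 0.1% of requests may fail

// Burn rate 1 consumes the budget exactly over the SLO window; 14.4
// sustained for an hour eats ~2% of a 30-day budget.
export function burnRate(errors: number, total: number): number {
  if (total === 0) return 0;
  return errors / total / ERROR_BUDGET;
}

// Page only when both windows burn fast: the short window gives speed,
// the long window filters out momentary spikes.
export function shouldPage(
  short: { errors: number; total: number },
  long: { errors: number; total: number },
  threshold = 14.4,
): boolean {
  return (
    burnRate(short.errors, short.total) >= threshold &&
    burnRate(long.errors, long.total) >= threshold
  );
}
```

This is alerting on user impact rather than raw error counts: the same absolute error number pages a low-traffic route and stays silent on a high-traffic one.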
Key takeaways
- Traces-first
- Cardinality budgets
- Runbooks linked to alerts
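The cardinality-budget takeaway can be enforced with a small guard at attribute-set time. A sketch under assumptions: the 100-value limit and the "overflow" sentinel are illustrative choices, not an OpenTelemetry feature:

```typescript
// Sketch of a per-key cardinality budget: once a key has seen too many
// distinct values, collapse new values to "overflow" so span and metric
// attributes cannot explode. The limit of 100 is an example.
const MAX_VALUES_PER_KEY = 100;
const seen = new Map<string, Set<string>>();

export function guardAttribute(key: string, value: string): string {
  let values = seen.get(key);
  if (!values) {
    values = new Set();
    seen.set(key, values);
  }
  if (values.has(value)) return value;            // already budgeted
  if (values.size >= MAX_VALUES_PER_KEY) return "overflow";
  values.add(value);
  return value;
}
```

Wrapping `span.setAttribute` calls with a guard like this keeps exploratory trace queries cheap while still letting known-hot values through unchanged.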


