Core metrics
Track request volume, latency percentiles, success rate, retry rate, and webhook delivery lag.
Logging model
Log structured fields at request boundaries:
request_idendpointtransaction_idstatus_coderetry_attempt
Alerting baselines
Create alerts for:
- sustained
429spikes, - elevated
5xxrates, - webhook backlog growth,
- confidence distribution shifts.
On-call readiness
Document escalation path, rollback steps, and runbook links for fast incident handling.