
For SREs — Failure Modes, SLOs, Synthetic Checks

Reliability is a feature. This guide helps SREs design defensible SLOs, enumerate failure modes, wire synthetic checks, and run incidents with evidence and fast recovery.


TL;DR (defaults we recommend)

  • Anchor on user-journey SLOs (checkout, login, search), backed by component SLIs.
  • Adopt brownout first: degrade non-essential features before failing.
  • Per-attempt timeout < retry budget < client deadline, with backoff + jitter and circuit breakers.
  • Keep error budgets visible; throttle launches when budget is burning too fast.
  • Synthetic journeys in each region + deep checks for critical dependencies.
  • Drill regional failover, data restore (RTO/RPO), and runbook exercises quarterly.
  • Capture artifacts: dashboards, logs, traces, packet captures when relevant.

SLI/SLO patterns

User-journey SLIs (preferred)

  • checkout_success_rate = successful_checkouts / checkout_attempts
  • otp_verify_latency_p95_ms at /otp/verify
  • message_delivery_rate = delivered / accepted (rolling 1h)
  • search_time_to_first_result_p95_ms

Component SLIs (supporting)

  • HTTP availability = 1 - 5xx_rate
  • Queue age_p95
  • DB error_rate, latency_p95
  • Cache hit_ratio

SLOs (examples)

  • Checkout success ≥ 99.5% (monthly).
  • OTP verify p95 ≤ 300 ms; p99 ≤ 800 ms (weekly).
  • API availability ≥ 99.9% (monthly).

Error budget policy

  • Burn-rate alerts at 14× (fast) and 6× (medium) over paired short/long windows (e.g., 5m/30m and 1h/6h); see the sketch after this list.
  • Freeze risky launches when weekly budget < 25% remaining.
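
A burn rate is the observed error rate divided by the error-budget rate; at 14× a 30-day budget is gone in roughly two days. A minimal Python sketch of the multi-window check, with counter sourcing and thresholds as illustrative assumptions:

def burn_rate(good: int, total: int, objective: float) -> float:
    """Burn rate = observed error rate / allowed error rate."""
    if total == 0:
        return 0.0
    error_rate = 1 - good / total
    budget_rate = 1 - objective  # e.g. 0.005 for a 99.5% SLO
    return error_rate / budget_rate

def should_page_fast(counts_short, counts_long, objective=0.995) -> bool:
    # Both windows must burn: the long window proves sustained impact,
    # the short window proves it is still happening right now.
    return (burn_rate(*counts_short, objective) >= 14
            and burn_rate(*counts_long, objective) >= 14)

# 980 good of 1000 attempts = 2% errors against a 0.5% budget -> ~4x burn.
print(burn_rate(980, 1000, 0.995))  # ~4.0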

Failure modes (catalog)

| Mode | Detection | Mitigation |
|------|-----------|------------|
| Dependency brownout (slow 5xx/429) | latency p95↑, queue age↑ | open breaker, fallback cache, brownout UI |
| Regional outage | synthetic fail in region | failover DNS/traffic, read-only mode |
| Thundering herd | spikes on retries | jitter, max attempts, token bucket |
| Hot shard | p99↑ on one DB table/index | rebalance, shard split, cache |
| Noisy neighbor | CPU steal, IO wait | quotas, limiters, isolate pool |
| Memory leak | restart count, RSS↑ | restart window, heap dump analysis |
| Clock skew | auth/signature failures | time-sync alerts, NTP fallback |
| Config blast radius | error surge post-change | staged rollout, canary, fast revert |
| Webhook storm/replay | duplicate event ids | inbox dedupe, rate limits |
| DNS/PKI failure | TLS errors, NXDOMAIN | stale DNS cache, secondary CA |

Keep tables short; details live in runbooks.


Resiliency design (practical defaults)

  • Backoff + jitter for 429/5xx/timeouts; cap the total deadline per call (see the sketch after this list).
  • Circuit breakers on high failure rate or long latency; half-open probes.
  • Bulkheads: separate pools for critical vs background traffic.
  • Load shedding: return 503 for non-critical when saturated; protect core routes.
  • Brownout toggles: disable avatars, recommendations, heavy queries under stress.
  • Idempotency on unsafe POSTs to avoid double effects on retries.

See: ../50-non-functional/resiliency-and-timeouts.md, ../40-api-and-data-contracts/idempotency-and-retries.md.
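
These defaults are mechanical enough to sketch. A minimal Python version of capped full-jitter retries plus a consecutive-failure breaker (thresholds and the TransientError taxonomy are illustrative, not a specific library's API):

import random
import time

class TransientError(Exception):
    """Raised by op() for 429/5xx/timeouts; anything else is not retried."""

def call_with_retries(op, *, attempts=3, base=0.1, cap=2.0, deadline=5.0):
    """Retry op() on transient errors with full-jitter backoff, never
    exceeding the overall per-call deadline."""
    start = time.monotonic()
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                raise
            # Full jitter: sleep uniformly in [0, min(cap, base * 2^attempt)].
            pause = random.uniform(0, min(cap, base * 2 ** attempt))
            if time.monotonic() - start + pause > deadline:
                raise  # budget exhausted: fail fast rather than overrun
            time.sleep(pause)

class CircuitBreaker:
    """Open after `threshold` consecutive failures; after `reset_after`
    seconds, allow a single half-open probe."""
    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold, self.reset_after = threshold, reset_after
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.reset_after

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

Full jitter matters because synchronized retries are what turn a blip into a thundering herd.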


Synthetic monitoring

Journeys (per region):

  • Home → Login OTP → Search → Add to cart → Checkout (stub payment)
  • Health endpoints are not enough—exercise real dependencies (DB, cache, queue, PSP sandbox).

Deep checks:

  • DB: SELECT 1 + lightweight app query.
  • Queue: publish → consume loopback.
  • Webhooks: signed callback to a canary endpoint.
  • DNS/TLS: expiry, chain, OCSP stapling.
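
Deep checks stay cheap when they reuse one connection and one probe message per run. A sketch assuming a DB-API connection and a queue client with publish/consume methods (both client objects and the app query are hypothetical):

import time

def check_db(conn) -> bool:
    """SELECT 1 proves the socket; the app query proves a real read path."""
    cur = conn.cursor()
    cur.execute("SELECT 1")
    cur.execute("SELECT id FROM products LIMIT 1")  # hypothetical app query
    return cur.fetchone() is not None

def check_queue(queue) -> bool:
    """Publish a unique probe and confirm it round-trips the loopback."""
    probe = f"probe-{time.time_ns()}"
    queue.publish("synthetic.loopback", probe)            # hypothetical client API
    return queue.consume("synthetic.loopback", timeout=5) == probe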

Design rules

  • Use tenant/test accounts with idempotent operations.
  • Tag metrics with synthetic=true, region, journey.
  • Store evidence: rendered screenshots, HARs, response bodies (redacted).
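
Putting the journey and these rules together: a minimal checkout probe with per-step timeouts, an overall deadline, and the synthetic/region/journey tags attached to the result (all endpoints and the test OTP are hypothetical):

import time
import requests  # any HTTP client with per-request timeouts works

def run_checkout_journey(base_url: str, region: str, deadline_ms: int = 3000) -> dict:
    """One synthetic checkout pass using a dedicated test account."""
    started = time.monotonic()
    ok = False
    try:
        s = requests.Session()
        # Per-step timeouts so one slow hop cannot consume the whole deadline.
        s.get(f"{base_url}/", timeout=2)
        s.post(f"{base_url}/otp/verify", json={"code": "000000"}, timeout=2)
        s.post(f"{base_url}/cart/items", json={"sku": "TEST-SKU"}, timeout=2)
        r = s.post(f"{base_url}/checkout", json={"payment": "stub"}, timeout=2)
        ok = r.status_code == 200
    except requests.RequestException:
        ok = False
    elapsed_ms = (time.monotonic() - started) * 1000
    return {"synthetic": "true", "region": region, "journey": "checkout_full",
            "ok": ok and elapsed_ms <= deadline_ms, "elapsed_ms": round(elapsed_ms)}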

Capacity & scaling

  • Headroom: plan for normal daily peak; burst for flash sales.
  • Autoscale on concurrency or queue depth, not CPU alone (sketch after this list).
  • Control cold starts; keep warm pools for latency-sensitive paths.
  • Run load tests with production-like data, then set SLO budgets accordingly.
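
For queue-depth autoscaling, the arithmetic is a drain-time target, as in this sketch (throughput and bounds are illustrative; the same shape underlies KEDA-style scalers):

import math

def desired_replicas(backlog: int, per_replica_rps: float,
                     drain_target_s: float = 60.0,
                     minimum: int = 2, maximum: int = 50) -> int:
    """Enough replicas to drain the backlog within the target window,
    clamped so scaling never flaps to zero or runs away."""
    needed = math.ceil(backlog / (per_replica_rps * drain_target_s))
    return max(minimum, min(maximum, needed))

# 12,000 queued messages at 20 msg/s per replica, 60s drain target:
print(desired_replicas(12_000, 20))  # ceil(12000 / 1200) = 10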

Disaster recovery (RTO/RPO)

  • Define RTO (time to restore service) and RPO (acceptable data-loss window) per system.
  • Test point-in-time restore for DB; verify binlog replay window.
  • Cross-region replication lag monitored; failover drills with read-only and promotion runbooks.
  • Backups encrypted, tested, labeled with checksums.
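
"Tested, labeled with checksums" is scriptable; a minimal sketch that verifies a backup's SHA-256 label before restore drills (path and label source are assumptions):

import hashlib

def verify_backup(path: str, expected_sha256: str) -> bool:
    """Stream the backup and compare against its recorded checksum."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest() == expected_sha256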

Incident response (lite playbook)

  • Declare early; assign roles: Incident Commander, Ops, Comms, Scribe.
  • Stabilize with brownouts, breakers, and rate limits.
  • Diagnose using dashboards + logs + traces; capture timeline.
  • Mitigate (rollback, failover, config fix).
  • Communicate status every 15–30 min to stakeholders.
  • Close out: record impact, MTTR, and root cause(s); keep the review blameless.
  • Follow-up: action items with owners and dates; link PRs and dashboards.

Observability you must have

  • RED metrics per route; USE metrics per resource.
  • Traces with exemplars tied to latency histograms.
  • Structured logs with correlation_id, msgid, err.code.
  • Black-box synthetics + white-box component probes.
  • Budget dashboards: burn rates, availability, latency p95/p99, backlog sizes.
  • Alert routing: severity, ownership, runbook links; paging rules per tier.

See: ./for-developers.md for log/metrics/trace contracts.
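
For the log contract, one JSON object per line carrying the correlation fields keeps grep and log pipelines happy. A Python sketch (field names follow the list above; the err_code attribute is an assumption):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the correlation fields."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
            "msgid": getattr(record, "msgid", None),
            "err.code": getattr(record, "err_code", None),
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.error("otp verify failed",
             extra={"correlation_id": "abc-123", "msgid": "m-9",
                    "err_code": "OTP_EXPIRED"})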


Guardrails & change management

  • Feature flags with kill switches and safe defaults.
  • Progressive delivery: canary %, region-by-region, automatic rollback on guardrail breach.
  • Config schema with validation; review + diff for changes.
  • Secret rotation: planned jobs with alarms on failure.
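
"Safe defaults" means the flag read itself must never take down the request path. A sketch of the fail-safe lookup (the store and flag names are hypothetical):

def flag_enabled(name: str, default: bool, store: dict) -> bool:
    """Resolve a feature flag; fall back to the safe default when the
    store is unreachable or the key is missing."""
    try:
        return bool(store[name])
    except Exception:
        return default  # fail safe, never raise into the request path

FLAGS = {"recommendations": True}
print(flag_enabled("recommendations", False, FLAGS))  # True
print(flag_enabled("heavy_queries", False, FLAGS))    # False -> safe default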

MAE Scenarios (copy/adapt)

SRE-001 Main — Checkout synthetic passes

  • Steps: run hourly journey in all regions
  • Expected: p95 ≤ 2s; success ≥ 99.5%
  • Oracles: synthetic dashboard; HAR + screenshots stored

SRE-002 Alt — Dependency brownout (payments)

  • Steps: PSP sandbox returns 5xx + slow
  • Expected: breaker opens; brownout to “pay later”; error rate < budget
  • Oracles: breaker metrics; route-level error rate; user-journey SLI still ok

SRE-101 Exception — Regional failure drill

  • Steps: block egress in region ap-sg
  • Expected: failover to ap-jp in ≤ 5 min; RPO ≤ 60s
  • Oracles: traffic split, replication lag, error budget burn

SRE-102 Exception — Config misdeploy

  • Steps: push bad DB pool size
  • Expected: canary detects; auto-rollback; alert with runbook link
  • Oracles: deploy logs; canary diff; rollback event

SRE-201 Cross-feature — Quiet hours vs paging

  • Expected: customer-facing alerts respect quiet hours; on-call ignores quiet hours
  • Oracles: alert routes; escalation policy test

Review checklist (quick gate)

  • User-journey SLIs/SLOs defined; component SLIs mapped
  • Error budget policy with multi-window burn-rate alerts
  • Timeouts/retries/breakers consistent across services
  • Brownout controls and load shedding implemented
  • Synthetic checks for journeys and deep dependencies
  • Capacity model + load-test results; autoscaling rules
  • Backups verified; RTO/RPO documented and tested
  • Failover drill records; runbooks current and linked
  • Observability dashboards + paging rules + runbooks wired

CSV seeds

SLO registry

slo_id,description,objective,window,owner
checkout_success,Checkout success rate,>=99.5%,30d,Payments SRE
api_availability,HTTP availability (1 - 5xx rate),>=99.9%,30d,Platform SRE
otp_verify_latency_p95,p95 ms for /otp/verify,<=300ms,7d,Identity SRE

Burn-rate alerts (multi-window)

slo_id,window,threshold_x
checkout_success,5m,14
checkout_success,1h,6
api_availability,30m,14
api_availability,6h,6

Synthetic registry

journey,region,frequency,deadline_ms
checkout_full,ap-sg,5m,3000
login_otp,us-east,5m,1500
search_basic,eu-west,5m,2000

Failover drill log

date,region,duration_min,success,notes
2025-09-10,ap-sg,18,true,"rds promotion 3m; cache warm 5m"

Queries & rules (snippets)

PromQL — burn rate

# Fast burn (5m) for availability SLO: error rate over the 0.1% budget
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
) / (1 - 0.999)

PromQL — latency p95

histogram_quantile(0.95, sum(rate(request_duration_ms_bucket{route="/otp/verify"}[5m])) by (le, region))

SLO spec (YAML)

slo:
  id: checkout_success
  objective: ">=99.5%"
  window: 30d
sli:
  numerator:
    metric: checkout_success_count{region="$region"}
  denominator:
    metric: checkout_attempt_count{region="$region"}
alerts:
  - burn_rate: 14x
    windows: [5m, 30m]
  - burn_rate: 6x
    windows: [1h, 6h]
runbook: "link://runbooks/checkout"

Templates

Runbook skeleton

Service: <name>
Owner: <team>  Pager: <rotation>
Dashboards: <links>
SLOs: <list>
Dependencies: <list + health links>
Guardrails: <flags, breakers, rate limits>
Symptoms: <common alerts + meanings>
Checks: <commands/queries to run>
Mitigations: <brownouts, rollbacks, failovers>
Comms: <channels + cadence + templates>
Post-incident: <data to collect + forms>

Chaos experiment

Name: <dep> brownout
Hypothesis: breaker opens within 60s and error budget burn < 1%
Method: inject latency + 5xx for 5m
Metrics: route error rate, p95, queue age
Abort: p95 > target by 2× for 2m

Related

  • Resiliency & Timeouts: ../50-non-functional/resiliency-and-timeouts.md
  • Performance p95/p99: ../50-non-functional/performance-p95-p99.md
  • Compatibility Matrix (networks/locales): ../50-non-functional/compatibility-matrix.md
  • Observability & Logs (devs): ./for-developers.md
  • PM Acceptance (evidence and SLIs): ./for-pms.md
  • Messaging playbook (quiet hours & policies): ../55-domain-playbooks/messaging-and-notifications.md