Performance (p95/p99)

Users feel the tail, not the average. This playbook shows how to design, run, and judge performance work using p95/p99 latency, throughput, and error budgets—without boiling the ocean.


TL;DR (defaults we recommend)

  • Define SLIs first (what you measure), then SLOs (targets).
  • Use open‑loop load (constant arrival rate) for tail latency; closed‑loop for capacity checks.
  • Track p50/p95/p99, error rate, and throughput per endpoint/feature.
  • Publish budgets: p95 ≤ 300 ms, p99 ≤ 800 ms for interactive APIs; adjust for your domain.
  • Always log correlation_id, attempt, error_code, and retry data; separate 2xx and non‑2xx percentiles.
  • Prove fixes with A/B before/after runs and time‑series percentiles (not just single aggregates).

1) Concepts (SLI/SLO/SLA & percentiles)

  • SLI (Service Level Indicator): a measurement, e.g., “Checkout API end‑to‑end latency.”
  • SLO (Objective): a target over a window, e.g., “p95 ≤ 300 ms, p99 ≤ 800 ms over 28 days.”
  • SLA (Agreement): business contract; usually a subset of SLOs with penalties—don’t test SLAs directly.

Percentiles

  • p95: 95% of requests are at or below this latency. p99 exposes long tails (locks, GC, cold caches).
  • Prefer HDR histograms or exponential (base‑2) buckets; never average percentiles (they don’t compose that way; merge the underlying samples or histograms instead, see the sketch below).
  • Report percentiles per status class (2xx vs 4xx/5xx) and per attempt (first vs retries).
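
A minimal sketch, assuming numpy is available and raw per‑request samples are at hand (at scale, prefer HDR/bucketed histograms); the function name is illustrative:

import numpy as np

def percentiles_by_status(samples):
    """samples: list of (latency_ms, http_status) pairs for one time window."""
    ok = [ms for ms, code in samples if 200 <= code < 300]
    bad = [ms for ms, code in samples if not (200 <= code < 300)]
    report = {}
    for label, lat in (("2xx", ok), ("non-2xx", bad)):
        if lat:
            p50, p95, p99 = np.percentile(lat, [50, 95, 99])
            report[label] = {"p50": p50, "p95": p95, "p99": p99, "count": len(lat)}
    return report

# Merge raw samples (or histogram buckets) across hosts before computing percentiles;
# averaging per-host p95 values does not give the fleet-wide p95.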

2) What to measure (SLIs)

Layer      | SLI                                  | Why it matters
User       | End‑to‑end interaction time          | Actual experience; budgets per page/workflow
API        | Request latency (TTFB & total)       | Contracts for services; retriable vs terminal
Job        | Time‑to‑completion, queue wait       | Backlogs, SLOs for async work
Infra      | CPU, memory, GC, disk IOPS, network  | Root‑cause tails; saturation
Dependency | Downstream call latency/errors       | Hot spots; budget splits

Augment with Rate‑Errors‑Duration (RED) and Utilization‑Saturation‑Errors (USE).
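
As a rough illustration (record shape and names are assumptions, not a prescribed schema), RED can be derived per endpoint from raw request records over a window:

from collections import defaultdict

def red_per_endpoint(records, window_s):
    """records: dicts like {"endpoint": "GET /orders", "latency_ms": 85, "status": 200}."""
    by_ep = defaultdict(list)
    for r in records:
        by_ep[r["endpoint"]].append(r)
    out = {}
    for ep, rs in by_ep.items():
        durations = sorted(r["latency_ms"] for r in rs)
        p95 = durations[int(0.95 * (len(durations) - 1))]                    # nearest-rank approximation
        out[ep] = {
            "rate_rps": len(rs) / window_s,                                  # Rate
            "error_rate": sum(r["status"] >= 500 for r in rs) / len(rs),     # Errors
            "p95_ms": p95,                                                   # Duration (tail)
        }
    return out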


3) Budgets (set them explicitly)

Interactive APIs (CRUD/eligibility/verify)

  • p50 ≤ 100 ms, p95 ≤ 300 ms, p99 ≤ 800 ms, error rate ≤ 0.1% (share of non‑2xx responses), over a 7–28 day window.

Background jobs (per item)

  • p95 ≤ 2 s, p99 ≤ 5 s; queue wait p95 ≤ 500 ms.

Web UI (ballpark; adapt to product)

  • Action → paint ≤ 200 ms feels instant; ≤ 1 s keeps flow; beyond that, show progress.

Budget splits (example, server‑side 300 ms p95)

  • Auth + routing: 20 ms
  • Business compute: 120 ms
  • Dependencies (sum): 120 ms (no single dep > 80 ms)
  • Serialization + network: 40 ms

As a rule of thumb, keep p99 ≤ 2.5 × p95; a wider gap signals tail problems (a quick check is sketched below).
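
A tiny sanity check over the split above and the tail‑ratio rule of thumb, as a sketch (numbers are the example’s, not prescriptions):

# Example split for a 300 ms server-side p95 budget (values from the split above).
P95_BUDGET_MS = 300
SPLIT_MS = {
    "auth_routing": 20,
    "business_compute": 120,
    "dependencies_total": 120,
    "serialization_network": 40,
}

assert sum(SPLIT_MS.values()) <= P95_BUDGET_MS, "budget split exceeds the endpoint p95 budget"

def tail_ratio_ok(p95_ms, p99_ms, max_ratio=2.5):
    """p99 should stay within ~2.5x p95; a wider gap points at tail problems."""
    return p99_ms <= max_ratio * p95_ms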


4) Workload modeling (make it realistic)

  • Traffic mix: % per endpoint/use case; include hot paths.
  • Data mix: payload sizes (p10/p50/p90), item counts, i18n strings, real cardinalities.
  • Think‑time (UI), diurnal cycles, spikes (deploys, campaigns).
  • Cache behavior: warm/cold ratios; TTL expirations; skew (hot keys).
  • Read/write mix and transaction sizes.
  • Security: authn/authz enabled; TLS on; real token scopes.
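
One way to make the model above concrete is to encode it as data the load generator samples from; a minimal sketch with illustrative numbers (endpoints and ratios are assumptions, not recommendations):

import random

TRAFFIC_MIX = {                      # endpoint -> share of requests (illustrative)
    "POST /discount/verify": 0.50,
    "GET /orders": 0.35,
    "POST /refunds": 0.15,
}
PAYLOAD_KB_BUCKETS = [1, 4, 32]      # rough p10/p50/p90 payload sizes
WARM_CACHE_RATIO = 0.8               # 80% warm, 20% cold

def next_request():
    endpoint = random.choices(list(TRAFFIC_MIX), weights=list(TRAFFIC_MIX.values()))[0]
    return {
        "endpoint": endpoint,
        "payload_kb": random.choice(PAYLOAD_KB_BUCKETS),
        "cache": "warm" if random.random() < WARM_CACHE_RATIO else "cold",
    }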

5) Test types

  • Baseline smoke — verify environment; steady 1–5 RPS.
  • Ramp — step up to target RPS; observe p95/p99 slope.
  • Steady‑state — 15–60 min at target; judge SLOs.
  • Spike — 2–5× burst for 1–5 min; ensure graceful shed.
  • Soak — 2–24 h to surface leaks, GC stalls, clock drift.
  • Stress / Breakpoint — push until p95 or error rate violates SLO; record knee.
  • Faulted — inject latency/500s from dependencies; verify retries/backoff and timeouts.

6) Open‑loop vs Closed‑loop load

  • Open‑loop (constant arrival rate): sends requests on schedule regardless of response → better at revealing tail and queueing.
  • Closed‑loop (concurrency controlled): each client waits for a response before sending the next → can mask tails (coordinated omission: slow responses throttle the offered load).
  • Use open‑loop to validate p95/p99, then closed‑loop to find the maximum sustainable RPS (a minimal open‑loop sender is sketched below).
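
A minimal open‑loop sender sketched with asyncio; send_request is a placeholder for your client call, and results are gathered only after the schedule ends so slow responses never delay the next send:

import asyncio, time

async def open_loop(arrival_rps, duration_s, send_request):
    """Fire requests on a fixed schedule, independent of how long responses take."""
    interval = 1.0 / arrival_rps
    end = time.monotonic() + duration_s
    next_send = time.monotonic()
    in_flight = []
    while time.monotonic() < end:
        in_flight.append(asyncio.create_task(send_request()))   # do not await here
        next_send += interval
        await asyncio.sleep(max(0.0, next_send - time.monotonic()))
    return await asyncio.gather(*in_flight, return_exceptions=True)

# Closed-loop for capacity checks: N workers each await send_request() in a loop,
# so offered load falls as latency rises. Run with: asyncio.run(open_loop(200, 60, fn)).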

Little’s Law: L = λ × W

  • Given arrival rate λ and average latency W, expect concurrency L. Use it to right‑size workers/pools.
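  • Example: at λ = 200 RPS and W = 0.3 s average latency, expect L = 200 × 0.3 = 60 requests in flight; a worker or connection pool much smaller than ~60 will queue and stretch the tail.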

7) Oracles & evidence

  • Latency histograms (p50/p95/p99) per endpoint and per status.
  • Time‑series percentiles (1–5 min windows) to see regressions.
  • Trace spans with correlation_id, attempt, idempotency_key, downstream timings.
  • Metrics: CPU, GC time, heap, thread pool queue length, connection pool saturation, queue depth.
  • Logs: structured route, params_hash, error_code, retry_count, timeout=true/false (example record below).
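
For illustration only (field names follow the list above; the exact schema is an assumption), one structured record per attempt might look like:

# One record per attempt; values are made up for illustration.
evidence_record = {
    "correlation_id": "c-7f3a9",
    "route": "POST /discount/verify",
    "attempt": 1,
    "idempotency_key": "idem-20250916-001",
    "status": 200,
    "latency_ms": 270,
    "error_code": None,
    "retry_count": 0,
    "timeout": False,
    "params_hash": "9c1df0",                       # hash of request params, not the raw payload
    "downstream_ms": {"tax-api": 80, "cache": 3},  # per-dependency timings
}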

8) Selecting test cases (examples)

Eligibility API (POST /discount/verify)

  1. Typical payload, cache warm → ensure p95 ≤ 300 ms.
  2. Cold cache (first hit after TTL) → p99 ≤ 800 ms; only first request misses.
  3. Dependency adds 200 ms at p95 → endpoint stays within budget via fallback or tightened compute.
  4. Retry budget: at most 1 retry; total latency still under the p99 budget.
  5. Error rate under load ≤ 0.1%; no surge of 429/5xx.

Pagination under writes (see contracts doc)

  • p95 list latency ≤ 200 ms at 200 RPS while concurrent inserts occur; no dupes/skips.

Refunds

  • With provider timeouts, the client retries with the same idempotency key. End‑to‑end p99 ≤ 2 s; exactly one ledger entry.

9) Fix patterns (what usually works)

  • N+1 → add joins/batch endpoints; use request coalescing.
  • Hot lock / contention → shard, reduce critical section, use copy‑on‑write snapshots.
  • GC stalls → tune heap, reduce allocations, pool objects/buffers.
  • Cold start → warm pools, pre‑JIT, move lazy loading off the critical path (background only).
  • Chatty deps → batch calls into fewer, larger requests; cache results; add circuit breakers.
  • Large payloads → gzip/br, paginate, project only needed fields.
  • Indexes → verify query plans; add covering indexes; avoid wildcard sorts without tiebreakers.
  • Thread/conn pools → size using Little’s Law; add backpressure and timeouts.
  • Retries → exponential backoff + jitter; honor Retry‑After (sketch below).
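
A minimal sketch of that last pattern (full‑jitter backoff, honoring a Retry‑After hint); TransientError stands in for whatever your client raises on retriable failures:

import random
import time

class TransientError(Exception):
    """Placeholder for your client's retriable error (timeouts, 429/503)."""

def backoff_delay(attempt, base_s=0.1, cap_s=5.0, retry_after_s=None):
    """Exponential backoff with full jitter; a server-supplied Retry-After wins."""
    if retry_after_s is not None:
        return retry_after_s
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))

def call_with_retries(op, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError as err:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt, retry_after_s=getattr(err, "retry_after", None)))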

10) Acceptance gates

A run passes if (see the gate‑check sketch after this list):

  • For each SLO’d endpoint:
    • p50/p95/p99 within budgets over the steady‑state window.
    • Error rate ≤ budget; no burst of 5xx.
    • Retries within allowed limits; no retry storms.
  • System metrics show no saturation (CPU < 75% p95, queue depth bounded).
  • Logs/traces show no long‑tail outliers tied to recurring causes (locks, GC, cold I/O).
  • Under fault, graceful degradation keeps user‑visible p95 within “brownout” budget and functional correctness holds.
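
A minimal automated gate over the latency and error budgets, assuming CSV files shaped like the section‑13 seeds (file names are illustrative); the other gates (saturation, retries, degradation) still need their own checks:

import csv

def load_budgets(path):
    with open(path, newline="") as f:
        return {row["endpoint"]: row for row in csv.DictReader(f)}

def run_passes(results_path, budgets_path):
    """True if every steady-state window stays within its endpoint's latency and error budgets."""
    budgets = load_budgets(budgets_path)
    with open(results_path, newline="") as f:
        for row in csv.DictReader(f):
            budget = budgets[row["endpoint"]]
            for col in ("p50_ms", "p95_ms", "p99_ms"):
                if float(row[col]) > float(budget[col]):
                    return False                      # latency budget blown in this window
            if float(row["error_rate"].rstrip("%")) > float(budget["error_budget"].rstrip("%")):
                return False                          # error budget blown
    return True

# Example: run_passes("results.csv", "budgets.csv") -> wire into CI as a hard gate.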

11) Anti‑patterns

  • Reporting a single run’s average latency only.
  • Mixing 2xx + 4xx/5xx into percentiles.
  • Ignoring warmup (JIT, caches) and autoscaling ramp.
  • Using closed‑loop only → hides tails.
  • Running on tiny datasets that fit entirely in memory when prod does not.
  • No time‑series (p95 spikes hidden by long windows).
  • Comparing runs with different traffic mixes or payload sizes.
  • Treating client‑side timeouts as success (requests the client cancels never get counted as failures).

12) Review checklist (quick gate)

  • SLIs defined; SLOs with p50/p95/p99 and error budgets set per endpoint/job
  • Workload model covers mix, data sizes, cache states, diurnal/spikes
  • Test plan includes baseline, ramp, steady, spike, soak, stress, faulted
  • Open‑loop for tails; closed‑loop for capacity; Little’s Law applied
  • Evidence: histograms + time‑series percentiles, traces, infra metrics
  • Fault injection covers dependency latency/5xx; retries/backoff honored
  • Acceptance gates agreed and automated in CI/CD
  • Before/after comparisons for regressions with the same mix & settings

13) CSV seeds

Load plan

phase,duration_s,arrival_rps,concurrency,notes
baseline,60,5,10,env check
ramp,300,"25->200",auto,step 25 rps every 60s
steady,1800,200,auto,measure p95/p99
spike,120,500,auto,burst traffic
soak,14400,100,auto,leak/GC check
faulted,600,200,auto,tax api +200ms p95 + 1% 5xx

Budgets per endpoint

endpoint,p50_ms,p95_ms,p99_ms,error_budget
POST /discount/verify,100,300,800,0.1%
GET /orders,80,200,500,0.1%
POST /refunds,150,400,2000,0.05%

Results sample

endpoint,window_start,p50_ms,p95_ms,p99_ms,rps,error_rate
POST /discount/verify,2025-09-16T12:00:00Z,95,270,620,210,0.03%
POST /discount/verify,2025-09-16T12:05:00Z,110,340,920,220,0.07%

14) Templates

Test plan header

# Perf Plan — <Feature/Endpoint>
Owner: <team>  Date: <YYYY-MM-DD>

SLIs: <list>
SLOs: p50 <= _, p95 <= _, p99 <= _, error rate <= _

Workload:
- Traffic mix: <…>
- Data mix: <sizes, distributions>
- Cache states: warm/cold %, TTLs

Phases: baseline → ramp → steady → spike → soak → stress → faulted

Acceptance gates:
- <bullets>

Evidence:
- Histograms (p50/p95/p99)
- Time-series charts
- Traces (top outliers)
- Infra metrics (CPU/GC/IO)

Run record

timestamp,commit,env,seed,notes
2025-09-16,abc123,staging-3,42,"post index fix"

15) Related

  • Idempotency & Retries: ../40-api-and-data-contracts/idempotency-and-retries.md
  • Pagination & Filtering (list invariants): ../40-api-and-data-contracts/pagination-and-filtering.md
  • Error Taxonomy (separate 2xx and 5xx): ../40-api-and-data-contracts/error-taxonomy.md
  • State Models (timeouts/retries): ../20-techniques/state-models.md
  • Cross-feature Interactions (promo × tax × gift card): ../30-scenario-patterns/cross-feature-interactions.md
  • Observability hooks (evidence capture): ../80-tools-and-integrations/observability-hooks.md