Postmortems & Anti‑patterns
Blameless, fast, and practical. Use this page to document what failed, why, and how we changed our test design so it never happens again. Focus on artifacts and repeatable test‑design updates (scenarios, cases, gates), not individual blame.
How to run a 30‑minute micro‑postmortem
Goal: capture enough signal to update scenarios, cases, and gates within the same day.
- Assemble evidence pack (10 min)
  - Timeline of events (UTC+0), who/what, timestamps
  - Logs (structured), traces, metrics screenshots
  - Inputs/requests, outputs/responses, user message(s)
  - Rollback/build IDs, feature flags, config diffs
- Root cause & contributing factors (10 min)
  - Technique: 5‑Whys for depth, Fishbone for breadth (people/process/tech/data)
  - Categorize: Functional · API/Data Contract · Non‑functional · Security · i18n · Deployment · Observability
- Update test design (10 min)
  - Add/modify scenarios (MAE) and cases with explicit expected results
  - Add/adjust checklists/gates (performance, security, API)
  - Link evidence → tests (traceability), and open a PR referencing this note
Keep it blameless. Stick to facts & artifacts. Prefer small, immediate design changes over long essays.
Template (copy/paste)
# Postmortem: <concise title>
**Date/time (UTC+0):**
**Impact:** <users/apps affected, duration, severity, metrics>
**Category:** Functional | API/Data Contract | Non-functional | Security | i18n | Deployment | Observability
## Context
Short description + what changed (deploy ID/flag/config).
## Timeline (UTC+0)
- t0 — <event>
- t+X — <event>
## Evidence Pack
- Logs: <paths or paste sanitized snippet>
- Traces: <span names / IDs>
- Metrics: <charts or numbers>
- Requests/Responses: <sanitized samples>
## Root Cause
Primary cause written as a single, testable statement.
## Contributing Factors
- <factor 1>
- <factor 2>
## Fix
- Code/config: <summary>
- Test design: <new scenarios/cases>
- Gates/checklists: <new or updated checks>
## Verification
How we confirmed the fix (tests, monitors, synthetic checks).
## Lessons & Anti-patterns
- <lesson>
- <anti-pattern>
## Updates to Repo
- Scenarios: <links under 30-*/ or 70-*/>
- Cases: <links>
- Checklists/Gates: <links under 60-*/ or 65-*/>
- Traceability: <link to matrix>
> Owner: <name or role> · Review: <peers> · Labels: field-note, postmortem, <domain tags>
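One way to keep notes consistent with the template is a small lint step that reports missing sections before a postmortem PR is merged. The following is a minimal sketch, not an existing repo tool: the function name `missing_sections` is hypothetical, and the heading list is copied from the template above.

```python
import re

# Required "## " headings, copied from the template above.
REQUIRED_SECTIONS = [
    "Context",
    "Timeline (UTC+0)",
    "Evidence Pack",
    "Root Cause",
    "Contributing Factors",
    "Fix",
    "Verification",
    "Lessons & Anti-patterns",
    "Updates to Repo",
]

# Matches level-2 markdown headings only ("## Title").
_H2 = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(note_text: str) -> list[str]:
    """Return the required headings that do not appear in the note."""
    found = {match.group(1) for match in _H2.finditer(note_text)}
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

A CI gate could fail the build when `missing_sections(text)` is non-empty; whether that is too strict for quick field notes is a team decision.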
Guidance & Do/Don’t
Do
- Redact PII/secrets; keep examples sanitized.
- Pin everything to timestamps and artifacts (logs/metrics/traces/requests).
- Convert findings into concrete tests & gates same‑day.
- Use relative repo links so readers can jump to new content.
Don’t
- Hunt for blame (focus on systems and tests instead).
- Ship fixes without verification evidence.
- Write “lessons” that don’t change a checklist, scenario, or gate.
Mapping findings → repo updates
- Functional gaps → 30-scenario-patterns/* · 20-techniques/* · 60-checklists/functional-coverage.md
- API/Data Contract → 40-api-and-data-contracts/* · 60-checklists/api-coverage.md
- Performance/Resilience → 50-non-functional/* · 60-checklists/performance-review.md
- AuthZ/Roles → 30-scenario-patterns/roles-and-permissions.md
- i18n/Encoding → 55-domain-playbooks/i18n-l10n.md · 20-techniques/crud-grids.md
- Traceability → 65-review-gates-metrics-traceability/*
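If the mapping is consulted often, it can be kept as data so a script or PR template can suggest where updates belong. This is only a sketch: the paths are taken from the list above, while the names `REPO_TARGETS` and `targets_for` are hypothetical.

```python
# Finding type -> repo locations to update, as listed above.
REPO_TARGETS = {
    "Functional gaps": [
        "30-scenario-patterns/*",
        "20-techniques/*",
        "60-checklists/functional-coverage.md",
    ],
    "API/Data Contract": [
        "40-api-and-data-contracts/*",
        "60-checklists/api-coverage.md",
    ],
    "Performance/Resilience": [
        "50-non-functional/*",
        "60-checklists/performance-review.md",
    ],
    "AuthZ/Roles": ["30-scenario-patterns/roles-and-permissions.md"],
    "i18n/Encoding": [
        "55-domain-playbooks/i18n-l10n.md",
        "20-techniques/crud-grids.md",
    ],
    "Traceability": ["65-review-gates-metrics-traceability/*"],
}


def targets_for(finding_type: str) -> list[str]:
    """Return the repo locations for a finding type, or raise ValueError."""
    if finding_type not in REPO_TARGETS:
        known = ", ".join(sorted(REPO_TARGETS))
        raise ValueError(f"unknown finding type {finding_type!r}; expected one of: {known}")
    return REPO_TARGETS[finding_type]
```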
Example 1 — Double refund due to missing idempotency
Category: API/Data Contract · Money‑moving
Impact: 37 duplicate refunds over 4h; manual reconciliation required
Context
Retries were added for refund API timeouts. Payment creation used idempotency keys; refunds did not.
Timeline
- 08:13 — spike in gateway timeouts (external)
- 08:14 — client retries POST /refunds without key → duplicate records
- 08:27 — support reports double payout
- 08:42 — flag refund path; begin rollback; start reconciliation
Evidence Pack
- Logs show identical payloads with different request IDs
- DB: duplicate refunds referencing same charge ID
- Metrics: spike in refund count vs charge count
Root Cause
POST /refunds lacked an idempotency key contract; client retries created duplicates.
Contributing Factors
- Assumed “only charges need idempotency”
- No contract test for “same key ⇒ same result”
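The missing contract ("same key ⇒ same result") can be illustrated with an in-memory stand-in for the refund endpoint. This is a sketch under assumptions, not the production service: `RefundService`, `IdempotencyConflict`, and the `rf_` ID format are hypothetical, and rejecting a reused key with a different payload is one common design choice rather than something the incident notes specify.

```python
import hashlib
import json
from dataclasses import dataclass, field


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request payload."""


@dataclass
class RefundService:
    # idempotency key -> (request hash, stored refund)
    _by_key: dict = field(default_factory=dict)
    refunds: list = field(default_factory=list)

    def create_refund(self, idempotency_key: str, payload: dict) -> dict:
        # Hash a canonical form of the payload so retries can be compared.
        request_hash = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest()
        seen = self._by_key.get(idempotency_key)
        if seen is not None:
            stored_hash, stored_refund = seen
            if stored_hash != request_hash:
                raise IdempotencyConflict(idempotency_key)
            return stored_refund  # replay: no new refund is created
        refund = {"refund_id": f"rf_{len(self.refunds) + 1}", **payload}
        self.refunds.append(refund)
        self._by_key[idempotency_key] = (request_hash, refund)
        return refund
```

A contract test against the real API would make the same two assertions: repeating a request with the same key returns the same refund, and a new key creates a new one.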
Fix
- Require Idempotency-Key header on refunds; persist request hash
- Retry policy: 100ms, 200ms, 400ms (+ jitter), stop on non‑retryables
- Tests added
  - 40-api-and-data-contracts/idempotency-and-retries.md examples → cases under 70-mini-projects/refund-workflow/cases.md
  - Contract test: same key returns same refund; new key creates new refund
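The retry policy in the fix can be sketched as below. The delays come from the fix list; everything else is an assumption made for illustration: the exception class names, the jitter ceiling of 50 ms, and the reading that three delays mean three retries after the first attempt.

```python
import random
import time


class RetryableError(Exception):
    """Transient failure such as a gateway timeout (hypothetical name)."""


class NonRetryableError(Exception):
    """Failure that retrying cannot fix, e.g. a validation error (hypothetical name)."""


BASE_DELAYS_S = (0.1, 0.2, 0.4)  # 100ms, 200ms, 400ms


def call_with_retries(operation, sleep=time.sleep, rng=random.random, max_jitter_s=0.05):
    """Run operation(); back off and retry only on RetryableError."""
    for delay in BASE_DELAYS_S:
        try:
            return operation()
        except RetryableError:
            # Jitter spreads out clients that all timed out together.
            sleep(delay + rng() * max_jitter_s)
        # NonRetryableError is not caught, so it stops the loop immediately.
    return operation()  # final attempt; any error propagates to the caller
```

The operation passed in should send the same Idempotency-Key on every attempt; retries with a fresh key per attempt would recreate the duplicate-refund problem.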
Verification
- Replayed traffic against staging with synthetic timeouts; no duplicates
- Audit logs show dedupe events
Lessons & Anti‑patterns
- Anti‑pattern: idempotency only for “create” but not “reverse” operations
- Lesson: money‑moving = both directions need idempotency + backoff
Updates to Repo
- Scenarios/Cases: 70-mini-projects/refund-workflow/*
- Checklist: 60-checklists/api-coverage.md (“Idempotency for unsafe ops incl. refunds”)
- Gate: add synthetic retry test in CI
Example 2 — Missing/duplicated items from pagination
Category: Functional · Consistency
Impact: Admin list view intermittently missed new items under load
Context
Offset‑based pagination sorted by created_at DESC without a tiebreaker; rapid inserts reordered items.
Timeline
- 14:05 — import job inserts 5k rows in 2m
- 14:08 — users report “disappearing items”
- 14:22 — reproduced locally with parallel inserts
Evidence Pack
- SQL explains re‑ordering around identical created_at values
- Screen recordings of duplicates/skips while paginating
Root Cause
Unstable sort (no unique tiebreaker) allowed items to shift between pages.
Contributing Factors
- Offset pagination under high write rate
- No test that mutated data while paging
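The second contributing factor is easy to reproduce in miniature. The sketch below uses plain Python lists rather than the real database, with made-up IDs and timestamps, and shows only the duplicate half of the problem: rows inserted between two page fetches push already-seen rows onto the next page.

```python
def page_by_offset(rows, offset, limit):
    """Offset pagination over rows sorted newest-first by created_at only."""
    ordered = sorted(rows, key=lambda r: r["created_at"], reverse=True)
    return ordered[offset:offset + limit]


# Six existing rows (ids 1..6); higher created_at means newer.
rows = [{"id": i, "created_at": 100 + i} for i in range(1, 7)]

page1 = page_by_offset(rows, offset=0, limit=3)

# A writer inserts two newer rows before the client asks for page 2.
rows += [{"id": 7, "created_at": 107}, {"id": 8, "created_at": 108}]

page2 = page_by_offset(rows, offset=3, limit=3)

seen = [r["id"] for r in page1 + page2]
duplicates = sorted({i for i in seen if seen.count(i) > 1})
print("page 1:", [r["id"] for r in page1])
print("page 2:", [r["id"] for r in page2])
print("shown twice:", duplicates)
```

Deletes between fetches shift rows the other way and produce skips instead of duplicates.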
Fix
- Switch to cursor pagination with (created_at, id) composite sort
- Tests added
  - Insert‑while‑paging scenario under 30-scenario-patterns/cross-feature-interactions.md
  - Assertions for no duplicates, no skips in 20-techniques/state-models.md example
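A cursor over (created_at, id) and an insert-while-paging check can be sketched with an in-memory SQLite database. The `items` table, the integer timestamps, and the `fetch_page` helper are assumptions for illustration; the real schema and query layer will differ.

```python
import sqlite3


def fetch_page(conn, limit, cursor=None):
    """Return (rows, next_cursor), newest first, ordered by (created_at, id)."""
    if cursor is None:
        rows = conn.execute(
            "SELECT created_at, id FROM items "
            "ORDER BY created_at DESC, id DESC LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        created_at, item_id = cursor
        # Strictly "after" the cursor in the composite sort order.
        rows = conn.execute(
            "SELECT created_at, id FROM items "
            "WHERE created_at < ? OR (created_at = ? AND id < ?) "
            "ORDER BY created_at DESC, id DESC LIMIT ?",
            (created_at, created_at, item_id, limit),
        ).fetchall()
    next_cursor = rows[-1] if len(rows) == limit else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, created_at INTEGER NOT NULL)")
# Ten rows with deliberate created_at ties (ids 2/3, 4/5, ... share a value).
conn.executemany(
    "INSERT INTO items (id, created_at) VALUES (?, ?)",
    [(i, 100 + i // 2) for i in range(1, 11)],
)

seen, cursor, next_id = [], None, 11
while True:
    rows, cursor = fetch_page(conn, limit=3, cursor=cursor)
    seen.extend(item_id for _, item_id in rows)
    if cursor is None:
        break
    # Simulated concurrent writer: a newer row lands between page fetches.
    conn.execute("INSERT INTO items (id, created_at) VALUES (?, ?)", (next_id, 200 + next_id))
    next_id += 1

assert len(seen) == len(set(seen)), "duplicate item across pages"
assert set(seen) == set(range(1, 11)), "an original item was skipped"
```

Rows written after paging starts are newer than the cursor, so they do not appear mid-session; that trade-off is usually acceptable for an admin list and can be covered by a refresh.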
Verification
- Synthetic writer during paging test → stable results
- Monitors show consistent page sizes
Lessons & Anti‑patterns
- Anti‑pattern: sorting solely on non‑unique column
- Lesson: always add a tiebreaker and test with concurrent writes
Updates to Repo
- Pattern note: 05-field-notes/patterns-that-worked.md#pattern-6
- Checklist: add “Stable sort or cursor with tiebreaker” to 60-checklists/functional-coverage.md
Example 3 — Unicode normalization crash on search
Category: i18n/Encoding
Impact: 502s for queries containing decomposed characters; user search failures (JP + emoji)
Context
Backend normalized user input to NFC; index stored strings in mixed NFC/NFD; equality and LIKE behaved inconsistently.
Timeline
- 03:12 — alert on 5xx for search endpoint
- 03:20 — found correlation with iOS input methods
- 04:05 — identified mixed normalization in import pipeline
Evidence Pack
- Request samples with decomposed glyphs (NFD)
- DB samples: é stored as e + U+0301 (combining acute accent) in some rows
- Trace spans show error path during collation compare
Root Cause
Normalization mismatch (NFC vs NFD) between query path and stored data caused collation errors and misses.
Contributing Factors
- No data‑variation tests for normalization
- Mixed DB collation across tables
Fix
- Normalize to NFC at ingest; migrate affected rows
- Enforce DB collation consistently
- Tests added
  - Data variation grid (charset/length/normalization) under 20-techniques/crud-grids.md
  - Domain playbook updates under 55-domain-playbooks/i18n-l10n.md
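Normalizing at ingest and finding rows that need migration can both be done with Python's standard `unicodedata` module. The helper names below are hypothetical, and the sketch assumes stored values can be read back as Python strings.

```python
import unicodedata


def normalize_for_storage(text: str) -> str:
    """Normalize to NFC so composed and decomposed input compare equal."""
    return unicodedata.normalize("NFC", text)


def rows_needing_migration(values):
    """Return stored values that are not already in NFC."""
    return [v for v in values if not unicodedata.is_normalized("NFC", v)]


composed = "caf\u00e9"     # é as a single code point (NFC)
decomposed = "cafe\u0301"  # e followed by a combining acute accent (NFD)

# Raw strings differ even though they render the same.
assert composed != decomposed
# After normalization, query input and stored data match.
assert normalize_for_storage(composed) == normalize_for_storage(decomposed)
```

`unicodedata.is_normalized` requires Python 3.8 or later; on older versions, comparing a value with its normalized form gives the same answer.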
Verification
- Replay queries with NFD/NFC → same results
- Incident dashboards trend back to baseline
Lessons & Anti‑patterns
- Anti‑pattern: testing only ASCII/UTF‑8 “happy” paths
- Lesson: include normalization cases in the grid by default
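A data variation grid with normalization as a default axis might be generated as follows. The sample strings, the two repeat counts used for length variation, and the `build_grid` name are illustrative assumptions, not the contents of the repo's grid.

```python
import itertools
import unicodedata

# Sample values per charset (illustrative only).
BASE_STRINGS = {
    "ascii": "cafe",
    "latin_accent": "caf\u00e9",
    "japanese": "\u304c\u304e",  # hiragana with voiced marks, which decompose under NFD
    "emoji": "\U0001F600",
}
FORMS = ("NFC", "NFD")
REPEATS = (1, 50)  # short and long variants


def build_grid():
    """Return one test case per charset x normalization form x length."""
    cases = []
    for (charset, base), form, repeat in itertools.product(
        BASE_STRINGS.items(), FORMS, REPEATS
    ):
        value = unicodedata.normalize(form, base * repeat)
        cases.append(
            {"charset": charset, "form": form, "repeat": repeat, "value": value}
        )
    return cases
```

Each case can then be fed to the search endpoint with the expectation that NFC and NFD variants of the same base string return identical results.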
Updates to Repo
- Checklist: “Normalization (NFC/NFD) covered” in 60-checklists/compatibility-review.md
- Case pack added to 70-mini-projects/checkout-discount-code/cases.md (remark field)
Anonymization & privacy
- Replace user IDs/emails with neutral tokens (e.g., user_12345).
- Mask amounts, account numbers, and secrets.
- Remove external customer names unless you have explicit permission.
- If in doubt, generalize the detail and keep the test‑design change exact.
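A first-pass redaction helper for pasted log snippets could look like the sketch below. The patterns are deliberately simple assumptions (emails, `key=value` style secrets, long digit runs) and will not catch everything, so a human review of the snippet is still needed before it is committed.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(
    r"\b(api[_-]?key|token|password|secret)(\s*[:=]\s*)(\S+)", re.IGNORECASE
)
LONG_DIGITS_RE = re.compile(r"\b\d{8,19}\b")  # likely account or card numbers


def redact(text, token_map=None):
    """Replace emails with stable neutral tokens and mask secrets and long numbers."""
    token_map = {} if token_map is None else token_map

    def email_token(match):
        email = match.group(0).lower()
        if email not in token_map:
            token_map[email] = f"user_{len(token_map) + 1:05d}"
        return token_map[email]

    text = EMAIL_RE.sub(email_token, text)
    text = SECRET_RE.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
    text = LONG_DIGITS_RE.sub("[ACCOUNT]", text)
    return text
```

Passing the same `token_map` across several snippets keeps one person mapped to one token, which preserves the timeline without exposing who it was.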
Metrics to track over time
- Defect leakage per release
- Coverage diff (new/removed scenarios & cases)
- Time to contain (t0 → mitigation)
- Mean time to test‑design update (incident → PR merged)
- Recurrence rate (same root cause category)
Aim for: shorter time to contain, faster test‑design updates, and a falling recurrence rate per category.
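Three of these metrics can be computed from a small incident record. The sketch below rests on assumptions the list does not spell out: the `Incident` fields are hypothetical, the test-design clock starts at t0, and recurrence rate is taken as the share of incidents whose category already appeared in an earlier incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    category: str
    t0: datetime                 # first sign of the incident
    mitigated_at: datetime       # mitigation in place
    test_pr_merged_at: datetime  # test-design PR merged


def time_to_contain(incident: Incident) -> timedelta:
    """t0 -> mitigation."""
    return incident.mitigated_at - incident.t0


def mean_time_to_test_update(incidents) -> timedelta:
    """Average of (PR merged - t0) across incidents."""
    total = sum((i.test_pr_merged_at - i.t0 for i in incidents), timedelta())
    return total / len(incidents)


def recurrence_rate(incidents) -> float:
    """Share of incidents whose category was already seen earlier."""
    seen, repeats = set(), 0
    for incident in sorted(incidents, key=lambda i: i.t0):
        if incident.category in seen:
            repeats += 1
        seen.add(incident.category)
    return repeats / len(incidents)
```

Both aggregate functions expect at least one incident; an empty list raises `ZeroDivisionError`, which a reporting script would need to handle.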