Design Notes

“Refunds must be predictable, auditable, and exactly-once.” These notes capture architecture, contracts, failure modes, and test hooks for Refund Workflow v1. Pair with the Brief for outcomes and acceptance.

1) Goals & non-goals

Goals

Deterministic, auditable state machine for refunds (full/partial).
Exactly-once provider submission and ledger posting.
Strong observability and reconciliation with payment provider.
Safe under retries, outages, and abuse attempts.

Non-goals (v1)

Split-tender, multi-currency conversions, RMA logistics.
Chargeback dispute automation (separate process).
Subscription proration.

2) Architecture (logical)

[UI / Agent Console]
        |
        v
[Refund API]  --(RPC)-->  [Policy Service]
     |                    [Ledger Service]
     |                    [Provider Adapter(s)]
     |                    [Attachments Store]
     |                           |
     |                           +-- PSP(s) / Bank rails
     |
     +--(DB)-----> Refunds (state), Idempotency store, Outbox/Inbox
     +--(Cache)--> Short-lived reads (refund status)
     +--(Events)--> Event Bus (refund.*)

Patterns

Outbox: queue provider submissions transactionally with refund state.
Inbox: dedupe provider webhooks by (provider, event_id); keep the payload hash alongside for verification.
Saga: refund approval → submission → pending → completion/failure.
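
The outbox pattern above can be sketched as follows. This is a minimal illustration, not the service's actual code: it uses SQLite as a stand-in database, a trimmed refunds table, and a hypothetical outbox table and function name (enqueue_submission). The point it shows is that the state change and the queued provider job commit in one transaction, or not at all.

```python
import json
import sqlite3


def demo_db() -> sqlite3.Connection:
    """Build an in-memory database with one approved refund (demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE refunds (refund_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    # Hypothetical outbox layout; the real columns are not specified here.
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY, refund_id TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO refunds VALUES ('rf_123', 'approved')")
    conn.commit()
    return conn


def enqueue_submission(conn: sqlite3.Connection, refund_id: str) -> bool:
    """Move approved -> submitting and queue the provider job atomically."""
    with conn:  # one transaction: both writes commit, or neither does
        cur = conn.execute(
            "UPDATE refunds SET state = 'submitting' "
            "WHERE refund_id = ? AND state = 'approved'",
            (refund_id,),
        )
        if cur.rowcount == 0:
            return False  # not approved, or already enqueued
        conn.execute(
            "INSERT INTO outbox (refund_id, payload) VALUES (?, ?)",
            (refund_id, json.dumps({"refund_id": refund_id})),
        )
    return True
```

Calling enqueue_submission twice for the same refund queues only one job, because the second UPDATE no longer finds the row in the approved state.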

3) State machine (detailed)

requested → approved → submitting → provider_pending → completed | failed | canceled

Transitions

requested → approved (auto or agent decision).
approved → submitting (enqueue job; mark attempt).
submitting → provider_pending (accepted by provider, async outcome).
provider_pending → completed/failed (webhook or poll).
requested|approved → canceled (if allowed before submit).

Rules

Only approved refunds may be submitted.
Amount cannot exceed remaining refundable.
State transitions are idempotent; repeated webhooks are safe.
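
The transitions and rules above can be expressed as a small lookup table. The sketch below is illustrative only; the names State, ALLOWED, InvalidTransition, and transition are assumptions, and it treats re-applying the current state as a no-op so that a repeated webhook is safe.

```python
from enum import Enum


class State(str, Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    SUBMITTING = "submitting"
    PROVIDER_PENDING = "provider_pending"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"


# Allowed transitions, copied from the list above. Terminal states
# (completed, failed, canceled) have no outgoing transitions.
ALLOWED = {
    State.REQUESTED: {State.APPROVED, State.CANCELED},
    State.APPROVED: {State.SUBMITTING, State.CANCELED},
    State.SUBMITTING: {State.PROVIDER_PENDING},
    State.PROVIDER_PENDING: {State.COMPLETED, State.FAILED},
}


class InvalidTransition(Exception):
    """Raised when a transition is not in the ALLOWED table."""


def transition(current: State, target: State) -> State:
    """Return the new state; re-applying the current state is a no-op."""
    if current == target:
        return current  # idempotent replay, e.g. a repeated webhook
    if target not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Because only approved has an edge to submitting, the rule "only approved refunds may be submitted" falls out of the table rather than needing a separate check.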

4) Data model (persistence)

CREATE TABLE refunds (
  refund_id TEXT PRIMARY KEY,
  order_id TEXT NOT NULL,
  user_id TEXT NOT NULL,
  amount_minor INTEGER NOT NULL,
  currency TEXT NOT NULL,
  reason TEXT NOT NULL,                         -- taxonomy in brief
  state TEXT NOT NULL CHECK (state IN (
    'requested','approved','submitting','provider_pending',
    'completed','failed','canceled'
  )),
  provider_refund_id TEXT,
  provider_attempts INTEGER NOT NULL DEFAULT 0,
  last_error_code TEXT,
  last_error_at TIMESTAMPTZ,
  idempotency_key TEXT,                         -- for create
  created_at TIMESTAMPTZ NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL
);
CREATE UNIQUE INDEX refunds_order_remaining_guard
  ON refunds(order_id, refund_id);              -- per-order lookup; refund_id is already unique,
                                                -- so over-refund is prevented by the ledger guard

CREATE TABLE idempotency_keys (
  key TEXT PRIMARY KEY,
  request_hash TEXT NOT NULL,
  response_json JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL
);

CREATE TABLE inbox_webhooks (
  provider TEXT NOT NULL,
  event_id TEXT NOT NULL,
  hash TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL,
  PRIMARY KEY(provider, event_id)
);
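
One way the inbox_webhooks primary key can double as the dedupe check is sketched below. It uses SQLite's INSERT OR IGNORE as a stand-in (in PostgreSQL the equivalent would be INSERT ... ON CONFLICT DO NOTHING); the function names are hypothetical.

```python
import sqlite3


def open_inbox() -> sqlite3.Connection:
    """Create an in-memory copy of the inbox_webhooks table (SQLite types)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inbox_webhooks ("
        "provider TEXT NOT NULL, event_id TEXT NOT NULL, hash TEXT NOT NULL, "
        "received_at TEXT NOT NULL, PRIMARY KEY (provider, event_id))"
    )
    return conn


def record_webhook(conn, provider, event_id, payload_hash, received_at) -> bool:
    """Return True if the event is new, False if it was already recorded."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox_webhooks "
            "(provider, event_id, hash, received_at) VALUES (?, ?, ?, ?)",
            (provider, event_id, payload_hash, received_at),
        )
    # rowcount is 0 when the primary key already existed and the row was ignored
    return cur.rowcount == 1
```

A handler would apply the state change only when record_webhook returns True and treat False as a no-op replay.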

Ledger

Post REFUND_PENDING on approval; REFUND_SETTLED on completion.
Guard remaining balance: the sum of completed and in-flight refunds (approved, submitting, provider_pending) must not exceed the captured amount.
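
The balance guard amounts to simple arithmetic over the refunds for an order. The sketch below assumes that every refund from approval onward (approved, submitting, provider_pending, completed) reserves part of the captured amount, and that failed, canceled, and not-yet-approved refunds do not. The type and function names are illustrative.

```python
from dataclasses import dataclass

# Assumption: these states hold a claim on the captured amount.
RESERVING_STATES = {"approved", "submitting", "provider_pending", "completed"}


@dataclass(frozen=True)
class RefundRow:
    amount_minor: int
    state: str


def remaining_refundable(captured_minor: int, refunds: list[RefundRow]) -> int:
    """Captured amount minus everything already refunded or in flight."""
    reserved = sum(r.amount_minor for r in refunds if r.state in RESERVING_STATES)
    return captured_minor - reserved


def can_approve(captured_minor: int, refunds: list[RefundRow], amount_minor: int) -> bool:
    """True if a new refund of amount_minor fits in the remaining balance."""
    return 0 < amount_minor <= remaining_refundable(captured_minor, refunds)
```

For example, with 10000 captured, 4000 completed, 3000 provider_pending, and 2000 failed, the remaining refundable amount is 3000. In a real service this check would need to run inside the same transaction that records the approval, so that two concurrent requests cannot both pass it.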

5) Contracts (API)

Create refund (idempotent)

POST /v1/orders/{order_id}/refunds

Headers: Idempotency-Key, X-Correlation-Id

Body (simplified):

{ "amount_minor": 2500, "currency": "USD", "reason": "not_received", "evidence": [] }

Success 202:

{ "refund_id": "rf_123", "state": "approved", "message_id": "refund.request.accepted" }

Errors:

400 ERR.VALIDATION.amount.range
400 ERR.BUSINESS.refund.exceeds_remaining
402 ERR.BUSINESS.refund.not_captured
403 ERR.AUTHZ.scope
409 ERR.CONFLICT.idempotency

Agent decision

POST /v1/refunds/{refund_id}/decision

{ "decision": "approve|deny", "note": "..." }

Submit refund (system)

POST /v1/refunds/{refund_id}/submit → enqueues provider job; returns 202.

Read/list

GET /v1/refunds/{refund_id}
GET /v1/orders/{order_id}/refunds

Webhooks (provider)

POST /webhooks/payments with refund.succeeded|refund.failed (signed + timestamp).

6) Idempotency & exactly-once

Create: store request hash + response; on replay with same key/identical hash → return stored response; with different hash → 409.
Submit: generate provider idempotency key per provider contract; store provider_refund_id once; prevent duplicate posts.
Webhooks: dedupe by (provider,event_id); if already applied → no-op.
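
The create-side rule (same key and same hash replays the stored response; same key and a different hash is a conflict) can be sketched like this. The in-memory dictionary stands in for the idempotency_keys table, and the class and function names are assumptions made for illustration.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Would map to 409 ERR.CONFLICT.idempotency."""


def request_hash(body: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyStore:
    """In-memory stand-in for the idempotency_keys table."""

    def __init__(self):
        self._rows = {}  # key -> (request_hash, response)

    def execute(self, key: str, body: dict, handler):
        digest = request_hash(body)
        row = self._rows.get(key)
        if row is not None:
            stored_hash, stored_response = row
            if stored_hash != digest:
                raise IdempotencyConflict(key)  # same key, different request
            return stored_response  # replay: handler is not called again
        response = handler(body)
        self._rows[key] = (digest, response)  # stored once, never overwritten
        return response
```

This sketch ignores concurrency. With a real table, two simultaneous requests carrying the same key would need the primary key on idempotency_keys (or an equivalent lock) to decide which one runs the handler.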

7) Provider adapters

One adapter per PSP with uniform interface:
- submit_refund(order_ref, amount_minor, currency, key) -> { provider_refund_id, state }
- verify_webhook(headers, body) -> { event_id, type, provider_refund_id, status }
Map PSP-specific states to our state machine.
Handle clock skew and signature validation.

8) Reconciliation

Daily job joins ledger entries with provider refunds by provider_refund_id.
Emit refund_mismatch_rate_pct.
Export CSV of mismatches with reasons (missing webhook, amount mismatch, currency mismatch).
Alert when mismatch exceeds threshold.
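
The daily join can be sketched as below. The record shape, the reason labels, and the definition of the rate (mismatched provider refunds divided by all provider refunds, as a percentage) are assumptions for illustration; the inputs are taken to be settled ledger entries and provider-reported successful refunds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundRecord:
    provider_refund_id: str
    amount_minor: int
    currency: str


def reconcile(ledger: list[RefundRecord], provider: list[RefundRecord]):
    """Return (mismatches, mismatch_rate_pct) for one reconciliation run."""
    ledger_by_id = {e.provider_refund_id: e for e in ledger}
    mismatches = []
    for p in provider:
        entry = ledger_by_id.get(p.provider_refund_id)
        if entry is None:
            # Provider settled the refund but our ledger never heard about it.
            mismatches.append((p.provider_refund_id, "missing_webhook"))
        elif entry.currency != p.currency:
            mismatches.append((p.provider_refund_id, "currency_mismatch"))
        elif entry.amount_minor != p.amount_minor:
            mismatches.append((p.provider_refund_id, "amount_mismatch"))
    rate = 100.0 * len(mismatches) / len(provider) if provider else 0.0
    return mismatches, rate
```

The list of (provider_refund_id, reason) pairs maps directly onto the CSV export, and the rate is the value that would feed refund_mismatch_rate_pct.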

9) Security, privacy, abuse

RBAC: only agents with scope can approve/deny; dual-control for goodwill > threshold.
Rate-limit refund create & decision endpoints.
No PAN or sensitive PII in logs; attachment storage encrypted; signed URLs with short TTL.
Prevent over-refund by transactional check against captured minus prior refunds.

10) Observability (signals)

Logs (msgid)

MSG.refund.requested
MSG.refund.approved
MSG.refund.submitted
MSG.refund.completed
MSG.refund.failed

Error codes

ERR.BUSINESS.refund.not_captured
ERR.BUSINESS.refund.exceeds_remaining
ERR.CONFLICT.idempotency
ERR.DEPENDENCY.timeout

Metrics

refund_create_latency_ms (histogram)
refund_outcome_total{state}
refund_mismatch_rate_pct
webhook_replay_dedup_total

Traces

Spans: policy.evaluate, ledger.post, provider.submit, webhook.process.
Attach exemplars to histograms.

11) Failure modes & mitigations

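Provider timeout/5xx → outcome unknown, so keep the refund in submitting (provider_pending means the provider accepted it); retry with the same provider idempotency key using backoff + jitter; surface ETA to user.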
Webhook lost → poll status via adapter; reconcile job detects gap.
Idempotency key reused with a different request body → 409 ERR.CONFLICT.idempotency; the stored response is never overwritten.
Ledger post failure → outbox ensures retry; state blocked from completed until ledger write succeeds.
Partial provider outage → circuit breaker opens; queue submissions; show “processing” state.

12) Performance & capacity

Budgets: create p95 ≤ 250 ms; status read p95 ≤ 150 ms.
Expected throughput aligned with order volume; queue sized for burst 3×.
Storage: attachments limited by type/size/count; virus scan pipeline.

Message IDs for each state; short actionable copy.
ARIA live updates on status changes; focus returns to actionable control.
Emails/notifications localized; timelines vary by rail (card vs bank).
Dark mode/RTL verified; long text expansion ×1.3.

14) Testing strategy

Unit: amounts, remaining logic, state transitions, idempotency hashing.
Contract: OpenAPI schemas; reject unknown fields.
Integration: provider simulators (succeeded/failed/timeout); signature checks.
E2E: MAE scenarios from the brief with evidence bundle.
Chaos: provider timeouts and webhook delays.
Telemetry assertions: msgid/err.code in logs; metrics changes; traces present.

15) Open questions

Dual-control trigger thresholds by currency—static or FX-adjusted?
Allow refund to different instrument (out of scope v1 but design migration path)?
SLA visualization to customer—show expected date by rail?

16) API schema stub (OpenAPI fragment)

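paths:
  /v1/orders/{order_id}/refunds:
    post:
      parameters:
        - {name: order_id, in: path, required: true, schema: {type: string}}
        - {name: Idempotency-Key, in: header, schema: {type: string}}
        - {name: X-Correlation-Id, in: header, schema: {type: string}}
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [amount_minor, currency, reason]
              properties:
                amount_minor: {type: integer, minimum: 1}
                currency: {type: string, pattern: '^[A-Z]{3}$'}
                reason: {$ref: '#/components/schemas/RefundReason'}
                evidence: {type: array, items: {$ref: '#/components/schemas/Attachment'}}
      responses:
        '202':
          description: Refund accepted
          content: {application/json: {schema: {$ref: '#/components/schemas/Refund'}}}
        '400':
          description: Validation or business-rule error
          content: {application/json: {schema: {$ref: '#/components/schemas/Error'}}}
        '402':
          description: Payment not captured
          content: {application/json: {schema: {$ref: '#/components/schemas/Error'}}}
        '403':
          description: Missing scope
          content: {application/json: {schema: {$ref: '#/components/schemas/Error'}}}
        '409':
          description: Idempotency conflict
          content: {application/json: {schema: {$ref: '#/components/schemas/Error'}}}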

17) Links

Brief → ./brief.md
API Coverage → ../../60-checklists/api-coverage.md
Performance → ../../60-checklists/performance-review.md
Security → ../../60-checklists/security-review.md
SRE (SLOs & breakers) → ../../57-cross-discipline-bridges/for-sres.md
Traceability → ../../65-review-gates-metrics-traceability/traceability.md
