

Human-in-the-loop — review notes → regenerate cycles

Use people where they add the most signal, and make every review train the system. This page standardizes reviewer rubrics, queues, feedback schemas, and regenerate loops so AI features improve predictably.


TL;DR

  • Gate AI outputs with risk tiers and confidence thresholds.
  • Reviews create structured feedback (machine-actionable), not freeform prose.
  • Regeneration cycles are bounded (N tries) with escalation and ownership.
  • Evidence is attached on every hop: schema validation, citations/grounding, policy.
  • Close the loop: turn reviews into goldens, update evals, and track drift.

When to put a human in the loop

  • Policy-sensitive: security, privacy, legal, compliance, abuse.
  • High-risk impacts: money movement, medical, safety.
  • Low confidence: model confidence < threshold or schema_invalid.
  • Missing evidence: no source supports the claim (RAG empty).
  • Long-tail: novel intents, unseen languages/locales, domain edge-cases.

Default thresholds (edit to fit):

  • Auto-approve if confidence ≥ 0.85 AND schema_valid AND no_policy_flags.
  • Route to review if 0.5 ≤ confidence < 0.85 OR needs_citation.
  • Auto-refuse/regenerate if confidence < 0.5 OR schema_invalid OR policy_flagged.
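
A minimal sketch of that routing rule in Python (field names such as confidence, schema_valid, policy_flags, and needs_citation are illustrative, as are the thresholds; overlapping conditions resolve toward the stricter route):

# Sketch of the default routing rule above. Field names and thresholds are
# illustrative defaults, not a fixed API; the stricter route wins on overlap.
AUTO_APPROVE_MIN = 0.85
REVIEW_MIN = 0.50

def route(confidence: float, schema_valid: bool,
          policy_flags: list[str], needs_citation: bool) -> str:
    # Hard failures never auto-approve: broken structure, policy hits, very low confidence.
    if not schema_valid or policy_flags or confidence < REVIEW_MIN:
        return "regenerate_or_refuse"
    # Clean, confident, fully grounded outputs skip the queue.
    if confidence >= AUTO_APPROVE_MIN and not needs_citation:
        return "auto_approve"
    # Everything else goes to a human reviewer.
    return "review"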

Workflow (states & signals)

intake → draft → critic → review → regenerate? → approve → publish | refuse | escalate

  • intake: create task with payload, context, and trace_id.
  • draft: first model output (schema-constrained).
  • critic: automated checks (schema, safety, citations).
  • review: human rubric applies; feedback captured as structured deltas.
  • regenerate: model retries using feedback; max N cycles (default: 2).
  • approve/publish: changes applied; evidence bundled.
  • refuse: deterministic refusal with message id.
  • escalate: to subject-matter expert when unresolved.

Queue priorities: P0 (safety/security) → P1 (customer-visible) → P2 (backlog).
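
One way to keep the flow enforceable is to encode it as a transition table (a sketch; the set of legal hops is one reading of the diagram above, not a normative spec):

# States from the flow above; the allowed transitions are an illustrative
# reading of the diagram, not a normative spec.
TRANSITIONS = {
    "intake":     {"draft"},
    "draft":      {"critic"},
    "critic":     {"review", "regenerate", "refuse"},
    "review":     {"approve", "regenerate", "refuse", "escalate"},
    "regenerate": {"critic", "escalate", "refuse"},
    "approve":    {"publish"},
    "publish":    set(),
    "refuse":     set(),
    "escalate":   set(),
}

def advance(state: str, nxt: str) -> str:
    # Reject hops the workflow does not allow (e.g. draft -> publish).
    if nxt not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {nxt}")
    return nxt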


Reviewer rubric (short, high-signal)

Binary gates

  • Schema valid?
  • Policy clean? (no PII/leakage, allowed domain)
  • Grounded? (citations support claims)
  • Duplicates removed?
  • Clear & complete?

Scaled fields (0..3)

  • Task relevance
  • Factuality
  • Usefulness

Evidence attachments

  • Validation error (if any), citation list, diff vs input, screenshots/logs.

Keep comments short and actionable. Long notes belong in a linked doc/postmortem.


Feedback schema (machine-actionable)

{
	"version": "1.0",
	"decision": "approve|regenerate|refuse|escalate",
	"reasons": ["SCHEMA_INVALID", "POLICY_BREACH", "GROUNDING_MISSING", "LOW_CONFIDENCE", "DUPLICATE", "AMBIGUOUS"],
	"edits": [
		{ "op": "replace", "path": "/title", "value": "Short, plain title" },
		{ "op": "delete", "path": "/items/2" }
	],
	"hints": ["add_citations", "tighten_claims", "dedup_items"],
	"msgid": "MSG.review.feedback",
	"evidence": ["link://trace/abc123", "link://screens/shot1"]
}
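
Because edits use JSON-Pointer-style paths, a consumer can apply them mechanically. A sketch (apply_edits is hypothetical; a production version would use a real JSON Patch library with full pointer escaping):

import copy

def apply_edits(draft: dict, edits: list[dict]) -> dict:
    # Applies replace/delete ops with JSON-Pointer-like paths such as "/items/2".
    # Sketch only: no ~0/~1 escaping, no "add" op, and list deletes shift indices.
    out = copy.deepcopy(draft)
    for edit in edits:
        *parents, leaf = edit["path"].lstrip("/").split("/")
        node = out
        for key in parents:
            node = node[int(key)] if isinstance(node, list) else node[key]
        target = int(leaf) if isinstance(node, list) else leaf
        if edit["op"] == "replace":
            node[target] = edit["value"]
        elif edit["op"] == "delete":
            del node[target]
    return out

Applied to the example above, the two edits would retitle the draft and drop the third item.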

Error taxonomy seeds

  • SCHEMA_INVALID — JSON fails validation
  • POLICY_BREACH — safety/privacy/compliance rule violated
  • GROUNDING_MISSING — claim lacks citation/support
  • LOW_CONFIDENCE — model uncertainty below threshold
  • DUPLICATE — outputs repeat meaningfully identical items
  • AMBIGUOUS — user intent unclear; ask 1 question
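
If the same codes are shared between the critic, reviewer tooling, and logs, pinning them in one place avoids drift (a small sketch; names mirror the list above):

from enum import Enum

class Reason(str, Enum):
    # Mirrors the taxonomy above; add codes here before using them in feedback.
    SCHEMA_INVALID = "SCHEMA_INVALID"
    POLICY_BREACH = "POLICY_BREACH"
    GROUNDING_MISSING = "GROUNDING_MISSING"
    LOW_CONFIDENCE = "LOW_CONFIDENCE"
    DUPLICATE = "DUPLICATE"
    AMBIGUOUS = "AMBIGUOUS"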

Regenerate loop (bounded)

  • N cycles max (default 2). On last failure → escalate or refuse.
  • Each cycle diffs output vs previous, applies reviewer edits/hints, re-runs critic.
  • Idempotency: attach attempt_id and preserve prior best result.
  • Cold-path: if the critic finds only schema issues, auto-regenerate without waiting for a human.

Regeneration prompt (sketch)

SYSTEM: Apply ONLY the structured feedback below. Do not invent content.
INPUT: <current_output_json/>
FEEDBACK: <feedback_json/>
OUTPUT: Return minified JSON that validates the schema. Keep IDs stable; dedup.
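
Putting the loop rules and the prompt sketch together (generate and critic are assumed hooks; only the bounded control flow is the point):

MAX_CYCLES = 2  # default N from above

def regenerate_loop(draft: dict, feedback: dict, generate, critic,
                    max_cycles: int = MAX_CYCLES) -> dict:
    # generate(prompt) -> dict and critic(output) -> list[str] are assumed hooks.
    best = draft
    for attempt in range(1, max_cycles + 1):
        prompt = {
            "system": "Apply ONLY the structured feedback below. Do not invent content.",
            "input": best,
            "feedback": feedback,
            "attempt_id": attempt,          # idempotency: tag every retry
        }
        candidate = generate(prompt)
        issues = critic(candidate)          # schema, safety, citations
        if not issues:
            return {"decision": "approve", "output": candidate, "attempts": attempt}
        best = candidate                    # a real system would score and keep the prior best
        feedback = {**feedback, "reasons": issues}
    # Budget spent: hand off rather than looping forever.
    return {"decision": "escalate", "output": best, "attempts": max_cycles}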

Observability

Logs (msgid)

  • MSG.hitl.queue.accepted
  • MSG.hitl.review.feedback
  • MSG.hitl.regenerate.requested
  • MSG.hitl.regenerate.succeeded|failed
  • MSG.hitl.approved|refused|escalated
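
A small sketch of emitting these as one JSON line per event (field names other than msgid and trace_id are assumptions):

import json, logging, time

log = logging.getLogger("hitl")

def emit(msgid: str, trace_id: str, **fields) -> None:
    # One JSON line per event keeps msgids grep-able and dashboard-friendly.
    log.info(json.dumps({"msgid": msgid, "trace_id": trace_id,
                         "ts": time.time(), **fields}))

emit("MSG.hitl.regenerate.requested", "tr_abc", attempt=1, reasons=["GROUNDING_MISSING"])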

Metrics

  • hitl_queue_depth
  • hitl_time_to_decision_ms (p50/p95)
  • hitl_cycles_per_item (avg)
  • hitl_approval_rate_pct
  • hitl_escalation_rate_pct
  • hitl_schema_invalid_rate_pct
  • hitl_grounding_missing_rate_pct
  • Quality: inter-rater agreement (kappa), post-ship edit rate

Traces

  • Spans: critic.validate, review.capture, regenerate.call, with exemplars for slow items.

Roles, permissions, audit

  • Reviewer: apply rubric; cannot change prompts/schemas.
  • Editor: can update goldens and tweak few-shot examples.
  • Owner: adjusts thresholds, N cycles, escalation policy.
  • Auditor: read-only access to logs, evidence, decisions.

Audit log fields: who, when, decision, reasons, diff_hash, trace_id.


Data & privacy

  • Redact secrets (tokens, account numbers) before display.
  • Remove PII from training/evals unless consented and necessary.
  • Keep reviewer freeform notes out of prompts; use structured feedback only.
  • Retention: purge raw inputs after T days unless required for audit.

Integrating with this repo

  • Map REQ → SCN → CASE → Evidence (65-review-gates-metrics-traceability).
  • Gate in CI at G4: require schema-valid outputs and no policy flags for auto-merge; otherwise route to HITL queue.
  • Evaluation drift feeds back into the goldens in tests/goldens/… via PR review.

MAE scenarios (HITL)

HITL-MAIN-001 — Low-confidence, single pass

  • Given model confidence=0.7, schema valid
  • Then route to review → approve with zero edits

HITL-ALT-001 — Missing citations

  • Given output claims facts with sources=[]
  • Then reviewer requests regenerate with GROUNDING_MISSING → model adds citations → approve

HITL-ALT-002 — Dedup fix

  • Given near-duplicate items
  • Then feedback DUPLICATE + edits delete one → regenerate passes

HITL-EXC-001 — Schema invalid

  • Given invalid JSON
  • Then auto-regenerate once; if still invalid → escalate

HITL-EXC-002 — Policy breach

  • Given PII leaked in output
  • Then refuse with POLICY_BREACH and notify owner
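
These scenarios translate directly into automated checks. A sketch for HITL-EXC-001, reusing the hypothetical regenerate_loop sketched earlier (the asserted behavior is the scenario's, not a shipped API):

def test_hitl_exc_001_schema_invalid_escalates():
    # Given invalid JSON: the critic keeps reporting SCHEMA_INVALID.
    generate = lambda prompt: {"title": None}   # stand-in model that never recovers
    critic = lambda output: ["SCHEMA_INVALID"]

    # Then: auto-regenerate once; if still invalid, escalate.
    result = regenerate_loop(draft={"title": None},
                             feedback={"reasons": ["SCHEMA_INVALID"]},
                             generate=generate, critic=critic,
                             max_cycles=1)
    assert result["decision"] == "escalate"
    assert result["attempts"] == 1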

Checklists

Setup

  • Risk tiers and thresholds defined
  • Feedback schema & error taxonomy stable
  • Critic validators (schema/safety/citations) in CI
  • Queues with priorities & SLAs
  • Reviewer training & calibration session held

Per-PR

  • Evidence bundle attached (inputs, outputs, feedback, cycles)
  • Metrics dashboard links (queue depth, time to decision)
  • Goldens updated when appropriate
  • Trace IDs included for slow/failed cycles

Templates

Reviewer form JSON

{
	"decision": "regenerate",
	"reasons": ["GROUNDING_MISSING"],
	"hints": ["add_citations"],
	"edits": [],
	"msgid": "MSG.review.feedback",
	"notes": "Keep claims narrow; cite source 12, p.3"
}

Escalation note

Case: <id>
Reason: POLICY_BREACH (PII)
What’s needed: SME review for redaction policy
Links: <trace/dashboards/evidence>

Queue item (YAML)

id: case_123
priority: P1
state: review
confidence: 0.62
attempts: 1
trace_id: tr_abc
sla_deadline: 2025-09-17T09:00:00Z

CSV seeds

Queue register

id,priority,state,attempts,confidence,owner
case_001,P1,review,0,0.68,AI
case_002,P0,review,1,0.42,Security
case_003,P2,approve,0,0.90,Docs

Reviewer roster

name,role,time_zone
Alice,Reviewer,UTC+08
Ben,Owner,UTC+08
Chin,Auditor,UTC+08

Error taxonomy

code,meaning
SCHEMA_INVALID,JSON failed validation
POLICY_BREACH,Safety or privacy issue
GROUNDING_MISSING,Citations/evidence absent
LOW_CONFIDENCE,Model score below threshold
DUPLICATE,Duplicate content
AMBIGUOUS,Needs clarification

Common pitfalls

  • Freeform reviewer notes that models can’t consume.
  • Unlimited regenerate loops; costs rise, quality stalls.
  • Mixing policy and taste—keep the rubric objective.
  • Not updating goldens; regressions repeat.
  • Missing audit trail; hard to explain decisions later.

Sign-off

  • Thresholds tuned; queue SLAs met at p95.
  • Regenerate cycles bounded; escalation path exercised.
  • Reviewers calibrated (kappa ≥ 0.7).
  • Dashboards & traces wired; owners on-call.
  • Goldens/evals updated from review learnings.