
Failure Files

Five incidents that punched us in the face, taught us humility, and upgraded the system permanently.

Success stories are comfortable. Failure files are useful. The most meaningful upgrades in this project did not come from happy-path planning sessions with elegant diagrams. They came from incidents that cornered us and forced a decision: mature the workflow now, or keep paying the same pain tax every week.

What follows is not a museum of mistakes. It is a catalog of turning points. Each incident left a scar. Each scar became a guardrail. Each guardrail made the next cycle faster, calmer, and less dependent on late-night heroics and improvised Slack archaeology.

The real metric of failure is not that it happened. The real metric is whether the exact same failure can walk back in next Tuesday wearing a fake mustache.

Incident 1: Route Drift and Phantom Endpoints

Symptom: endpoints behaved differently depending on which execution path handled the request.

Root cause: route logic had been split across layers without a single canonical registration contract.

Impact: support noise, confusing regressions, and delayed incident triage.

Change: central route registry and parity validation checks.

Guardrail: route behavior now validated from one source of truth before release.
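A central registry with a parity check fits in a few lines. This is a minimal sketch under assumed names (`RouteRegistry`, `parity_report` are illustrative, not the project's actual API); the point is that every execution path is compared against one canonical table before release.

```python
class RouteRegistry:
    """Single source of truth for route registration (illustrative sketch)."""

    def __init__(self):
        self._routes = {}  # (method, path) -> handler name

    def register(self, method, path, handler):
        key = (method.upper(), path)
        if key in self._routes and self._routes[key] != handler:
            raise ValueError(f"conflicting registration for {key}")
        self._routes[key] = handler

    def parity_report(self, layer_routes):
        """Compare one execution path's route table against the registry.

        Returns (drift, missing): routes whose handler disagrees with the
        canonical one, and canonical routes the layer does not serve at all.
        """
        drift = [
            (key, self._routes.get(key), handler)
            for key, handler in layer_routes.items()
            if self._routes.get(key) != handler
        ]
        missing = sorted(set(self._routes) - set(layer_routes))
        return drift, missing
```

Running the parity report in CI against each layer's route table turns phantom endpoints from a production surprise into a failed build.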

Incident 2: Task Lifecycle Ambiguity

Symptom: work that looked complete remained operationally incomplete because status never transitioned correctly.

Root cause: stage movement depended on human memory rather than enforced workflow state.

Impact: duplicate work, stale review queues, and noisy handoffs.

Change: required status fields and explicit move operations.

Guardrail: task stage no longer changes without state alignment.

Once lifecycle state became explicit, handoff confusion dropped immediately.
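Enforced stage movement can be as small as a transition table plus an explicit move operation. The stage names below are assumptions for illustration, not the team's real workflow:

```python
# Allowed transitions between lifecycle stages (illustrative stage names).
TRANSITIONS = {
    "todo": {"in_progress"},
    "in_progress": {"in_review", "todo"},
    "in_review": {"done", "in_progress"},
    "done": set(),
}

class Task:
    def __init__(self, task_id, status="todo"):
        self.task_id = task_id
        self.status = status

    def move(self, new_status):
        """Explicit move operation: refuses transitions the table forbids."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"{self.task_id}: cannot move {self.status} -> {new_status}"
            )
        self.status = new_status
```

Because `move` is the only way status changes, "looks done" and "is done" can no longer diverge silently: a task skipping review raises an error instead of slipping through on memory.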

Incident 3: Emergency Fixes Without Memory

Symptom: urgent fixes landed quickly but were impossible to audit later.

Root cause: in urgent moments, process fell back to "fix first, explain never."

Impact: recurring incidents with no durable learnings.

Change: slim hotfix template with mandatory fix summary and test evidence.

Guardrail: no hotfix is considered complete without a written artifact.
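The "no artifact, no done" rule is easy to automate. A sketch, assuming hypothetical field names (`fix_summary`, `test_evidence`) standing in for the template's required sections:

```python
# Required sections of the slim hotfix template (hypothetical field names).
REQUIRED_FIELDS = ("fix_summary", "test_evidence")

def hotfix_gaps(record):
    """Return the required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]

def hotfix_complete(record):
    """A hotfix counts as complete only when every artifact field is filled."""
    return not hotfix_gaps(record)
```

Wired into the merge gate, this makes "fix first, explain never" mechanically impossible: the fix still lands fast, but the record lands with it.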

Incident 4: Token Burn and Context Drag

Symptom: sessions spent too much effort reloading old context, not solving current problems.

Root cause: unbounded read patterns and overlong conversational state.

Impact: rising cost, longer cycle times, and quality drift from stale assumptions.

Change: strict bootload discipline and progressive disclosure for large docs.

Guardrail: narrow file reads, written summaries, and explicit task contracts.
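Narrow reads can be enforced mechanically: cap how much of a document enters a session and flag the rest for summarization instead of reloading. This is a hypothetical helper, not the project's actual tooling:

```python
def bounded_read(text, max_chars=4000):
    """Return at most max_chars of a document plus a truncation flag.

    A True flag signals the caller to summarize the remainder in writing
    rather than pull the full document back into conversational state.
    """
    if len(text) <= max_chars:
        return text, False
    return text[:max_chars], True
```

The budget number is arbitrary here; what matters is that the cap is enforced by code, so context drag shows up as an explicit truncation flag instead of silent cost growth.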

Incident 5: Role Bleed Across PM, Coder, and QA

Symptom: planning leaked into coding, coding leaked into QA, QA leaked into release decisions.

Root cause: role boundaries were culturally understood but not operationally enforced.

Impact: unstable acceptance criteria and hidden risk transfer.

Change: role mapping embedded in task metadata and gate sequence.

Guardrail: every stage has one accountable owner and one clear exit rule.
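One accountable owner and one exit rule per stage can be encoded directly in the gate sequence. The roles, stages, and exit rules below are illustrative assumptions:

```python
# Illustrative gate sequence: (stage, accountable role, exit rule).
GATES = [
    ("plan", "pm", "acceptance criteria written"),
    ("implement", "coder", "tests pass"),
    ("verify", "qa", "evidence attached"),
    ("release", "pm", "rollback plan noted"),
]

def owner_for(stage):
    """Return the single accountable role for a stage."""
    for name, role, _exit_rule in GATES:
        if name == stage:
            return role
    raise KeyError(f"unknown stage: {stage}")
```

With the mapping in task metadata rather than team folklore, role bleed becomes visible: a QA decision made at the release gate is flagged as the wrong owner, not absorbed as hidden risk transfer.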

What Improved After the Pain

The Never Again Checklist

Failure is expensive. Repeated failure is optional. The point of this list is not to sound battle-hardened on the internet. The point is to preserve learning so the same scar only needs to happen once, not once per sprint.

Incidents are inevitable. Unlearned incidents are inexcusable.
If you cannot point to the guardrail that came out of a failure, that failure is not history. It is pending.