The long version.
For anyone who wants the technical depth — here is what each phase actually produces, why it exists, and how it connects to the next. Broadly: the pipeline is a chain of artefacts where each phase takes in a document from the previous phase and produces a new document the next phase can read. No verbal handoffs. No "I think we talked about that at Wednesday's stand-up".
Allium — drift as a data point.
Allium is an interrogation language: you write a specification, then run an elicitation pass that finds gaps, ambiguities and inconsistencies. It produces open questions that must be answered before we move on. Later, a distillation pass compares the finished code against the spec, and drift surfaces in three categories: specified-but-not-implemented, implemented-but-not-specified, and behavioural drift. Each item gets an individual decision: fix now, defer (track in spec), or dismiss with a reason. No group decisions to "deal with it later": every finding has an owner and a date.
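To make the three drift categories concrete, here is a minimal Python sketch. It is not Allium syntax or Allium's API: the rule names, the dictionary representation, and the `classify_drift` helper are all assumptions invented for illustration. It only shows how a comparison of spec against code can sort each difference into exactly one category.

```python
from dataclasses import dataclass
from enum import Enum


class Drift(Enum):
    SPECIFIED_NOT_IMPLEMENTED = "specified-but-not-implemented"
    IMPLEMENTED_NOT_SPECIFIED = "implemented-but-not-specified"
    BEHAVIOURAL = "behavioural drift"


@dataclass(frozen=True)
class Finding:
    rule: str
    category: Drift


def classify_drift(spec: dict[str, str], code: dict[str, str]) -> list[Finding]:
    """Compare two hypothetical maps of rule name -> behaviour summary."""
    findings = []
    for rule in sorted(spec.keys() | code.keys()):
        if rule not in code:
            findings.append(Finding(rule, Drift.SPECIFIED_NOT_IMPLEMENTED))
        elif rule not in spec:
            findings.append(Finding(rule, Drift.IMPLEMENTED_NOT_SPECIFIED))
        elif spec[rule] != code[rule]:
            findings.append(Finding(rule, Drift.BEHAVIOURAL))
    return findings


# Hypothetical example data, not taken from a real project.
spec = {"close_issue": "requires resolution note", "reopen_issue": "owner only"}
code = {"close_issue": "no note required", "bulk_close": "admin only"}

for finding in classify_drift(spec, code):
    print(finding.rule, "->", finding.category.value)
```

Each printed finding would then receive its own decision (fix now, defer, or dismiss with a reason) rather than being handled as a batch.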
TLA+ — proofs, not tests.
A test that passes shows that one execution worked. It says nothing about the executions you didn't test. TLA+ models the system as a state space and explores it exhaustively: the TLC model checker tries every sequence of events the model allows, within the bounds you configure, and reports the states where invariants break, together with the trace that leads there. For case management: can an issue be both closed and in-progress at the same time? For SSE: what happens if a client reconnects mid-update? For idempotency: what happens when the same idempotency key arrives in two simultaneous requests? TLC answers by exhaustive search over the model, not by sampling.
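The idea of an exhaustive walk can be illustrated with a toy explorer in Python. This is not TLA+ and not TLC; it is a small breadth-first search over a hand-written model, and the state layout, step names, and the "at most one charge" invariant are assumptions made for this sketch. It models two requests carrying the same idempotency key, each doing a check followed by a write, and searches every interleaving for a state where the invariant breaks.

```python
from collections import deque

# State: (pc_a, pc_b, key_stored, charges)
# pc is one of "check", "write", "done" for each of the two requests.


def successors(state, atomic):
    """Yield every state reachable in one step by either request."""
    pcs, key_stored, charges = (state[0], state[1]), state[2], state[3]
    for i in (0, 1):
        pc = pcs[i]
        stored, count = key_stored, charges
        if pc == "check":
            if key_stored:
                new_pc = "done"  # key already seen: skip the side effect
            elif atomic:
                # check-and-write as one indivisible step
                new_pc, stored, count = "done", True, charges + 1
            else:
                new_pc = "write"  # decided to write, but has not written yet
        elif pc == "write":
            new_pc, stored, count = "done", True, charges + 1
        else:
            continue  # request already finished
        new_pcs = list(pcs)
        new_pcs[i] = new_pc
        yield (new_pcs[0], new_pcs[1], stored, count)


def find_violation(atomic):
    """Breadth-first search; returns the shortest trace that charges twice."""
    start = ("check", "check", False, 0)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state[3] > 1:  # invariant: the side effect happens at most once
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state, atomic):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None  # every reachable state satisfies the invariant


for step in find_violation(atomic=False) or []:
    print(step)
print("atomic variant:", find_violation(atomic=True))
```

In this toy model the non-atomic variant yields a trace in which both requests pass the check before either writes, while the atomic variant explores every reachable state without finding a violation. A real TLA+ specification would express the same steps as actions and the charge limit as an invariant for TLC to check.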
Destructive tests — six categories.
Functional tests verify that the happy path works. Destructive tests verify that the system doesn't fall over when somebody tries to break it. We run at least eight destructive scenarios per surface, across six categories.
Input validation bombards fields with overlong strings, invalid characters, nulls and empty values. Authorisation tries to read and write resources the user doesn't own. The injection category feeds XSS payloads into text fields and SQL injection strings into search parameters. Race conditions force two clients to change the same resource at the same moment. Boundary values exercise maximum lengths, maximum counts, and pagination at the final page. State corruption sends changes in the wrong order or against resources that have already been deleted.
None of this is theoretical. Every category has caught real bugs in production systems we've built — before they shipped.
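As an illustration of two of the six categories, input validation and boundary values, here is a small Python sketch. The `validate_title` function, the 200-character limit, and the payload names are hypothetical, chosen only to show the shape of such a test: hostile inputs must be rejected, and values exactly at the limit must still be accepted.

```python
MAX_TITLE = 200  # hypothetical limit, assumed for this sketch


def validate_title(value):
    """Hypothetical validator: non-empty text, bounded length, no control chars."""
    if not isinstance(value, str):
        return False
    if not value.strip():
        return False
    if len(value) > MAX_TITLE:
        return False
    if any(ord(ch) < 32 for ch in value):
        return False
    return True


# Input validation: every one of these must be rejected.
DESTRUCTIVE_INPUTS = {
    "null": None,
    "empty": "",
    "whitespace only": "   ",
    "overlong": "a" * (MAX_TITLE + 1),
    "control character": "title\x00",
    "wrong type": 12345,
}

# Boundary values: these sit exactly on the limit and must be accepted.
BOUNDARY_INPUTS = {
    "max length": "a" * MAX_TITLE,
    "single character": "a",
}


def run_destructive_checks():
    """Return the names of scenarios where the validator misbehaved."""
    failures = []
    for name, payload in DESTRUCTIVE_INPUTS.items():
        if validate_title(payload):
            failures.append(f"accepted hostile input: {name}")
    for name, payload in BOUNDARY_INPUTS.items():
        if not validate_title(payload):
            failures.append(f"rejected valid boundary input: {name}")
    return failures


print(run_destructive_checks() or "all scenarios passed")
```

The other categories (authorisation, injection, race conditions, state corruption) need a running system with users and stored resources, so they are not covered by this sketch.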
AI-augmented development — without AI religion.
Claude Code is used as a tool, not a replacement. The pipeline above is not AI-generated: it is documented, versioned, and enforced via deterministic hooks. The hooks block feature work without a spec, require clarification before planning, run Allium elicitation before implementation, and refuse to release a feature without both Playwright and TLA+ validation. The AI follows the method; it doesn't invent it. The result: AI speed with human review of architecture decisions.
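A deterministic hook of this kind can be as simple as a check that the required artefacts exist before a stage is allowed to start. The Python sketch below is an assumption about how such a gate might look, not the actual hooks: the stage names, file names, and the `gate` function are invented for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping of stage -> artefacts that must already exist.
REQUIRED_ARTEFACTS = {
    "feature": ["spec.md"],
    "plan": ["spec.md", "clarifications.md"],
    "implement": ["spec.md", "clarifications.md", "elicitation.md"],
    "release": [
        "spec.md",
        "clarifications.md",
        "elicitation.md",
        "playwright-report.json",
        "tlc-result.txt",
    ],
}


def missing_artefacts(stage, feature_dir):
    """List the required files that are absent from the feature directory."""
    root = Path(feature_dir)
    return [name for name in REQUIRED_ARTEFACTS[stage] if not (root / name).is_file()]


def gate(stage, feature_dir):
    """Return 0 to allow the stage, 1 to block it (usable as an exit code)."""
    missing = missing_artefacts(stage, feature_dir)
    if missing:
        print(f"BLOCKED {stage}: missing {', '.join(missing)}")
        return 1
    print(f"OK {stage}")
    return 0


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    gate("feature", tmp)                      # blocked: no spec yet
    (Path(tmp) / "spec.md").write_text("spec")
    gate("feature", tmp)                      # allowed
    gate("release", tmp)                      # blocked: validation artefacts missing
```

Because the gate only inspects files, it gives the same answer every time for the same directory, whether a person or an AI assistant triggers it.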
This method takes longer in the first week. It saves months once the system is a year old.