Specs and tests: the stable stack for AI coding
Specs and tests as the stable stack for agent work: four named fixes that turn fuzzy scopes into reviewable, parallel-safe delivery.

The incident review found the contradiction before we did: the rules file said one thing, the skill the agent had just activated said another, and the agent had shipped on the skill while the sprint clock ran. Parallelism punishes fuzzy scopes first. That is the case for specs and tests as the stable layer in AI coding. Specs and tests are the written contract, expected behavior plus executable proof, that every agent-written change gets checked against. Stacks and models rotate; the contract is what holds.
Where the rules file and the skill disagreed
Counter-thesis: review cannot compress what the repo never recorded, and no amount of model quality writes the record for you.
The wrong path: We believed reviewers would absorb implicit intent. We tested that belief at a time when hooks existed but nobody read their transcripts, and forks without receipts ate the sprint budget before lint ever failed.
Diagnosis: Brooks's law, which says that adding people to a late software project makes it later, has an agent-era analogue. Adding workers to underspecified work makes it later, and agents are workers you can add in seconds, which makes the spec the bottleneck on day one.
Thesis: stable specs and tests are what make agent parallelism safe.
Four fixes that stabilize the stack
The stable stack is not a framework choice; it is the set of artifacts that stay true while agents work in parallel.
Cursor scope fog. `.mdc` language sounds precise until reviewers argue about what it meant. Rules compete with chat memory: split-brain coordination without a referee.
Named fix: Scope ledger. The parent chat carries a five-line ledger: goal, allowed paths, forbidden paths, verification command, merge owner. Review shifts from debating prompts to checking ledgers against diffs.
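One way to make "checking ledgers against diffs" mechanical is sketched below. It is a minimal illustration, not a Cursor feature: the `ScopeLedger` class, its field names, the `out_of_scope` helper, and the sample paths are all hypothetical, and glob matching through Python's `fnmatch` module is an assumption about how a team might write its path lines.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ScopeLedger:
    """The five ledger lines from the text; field names are illustrative."""
    goal: str
    allowed_paths: tuple[str, ...]
    forbidden_paths: tuple[str, ...]
    verification_command: str
    merge_owner: str


def out_of_scope(ledger: ScopeLedger, changed_paths: list[str]) -> list[str]:
    """Return changed paths that are forbidden or not covered by any allowed pattern."""
    violations = []
    for path in changed_paths:
        # Note: fnmatchcase's "*" also matches "/" characters.
        forbidden = any(fnmatchcase(path, pat) for pat in ledger.forbidden_paths)
        allowed = any(fnmatchcase(path, pat) for pat in ledger.allowed_paths)
        if forbidden or not allowed:
            violations.append(path)
    return violations


# Hypothetical ledger and diff, for illustration only.
ledger = ScopeLedger(
    goal="Add retry to the billing client",
    allowed_paths=("src/billing/*", "tests/billing/*"),
    forbidden_paths=("src/billing/migrations/*",),
    verification_command="pytest tests/billing",
    merge_owner="alice",
)
diff_paths = [
    "src/billing/client.py",
    "src/billing/migrations/0042.py",
    "src/auth/session.py",
]
violations = out_of_scope(ledger, diff_paths)
print(violations)
```

Here the migration file is flagged because it matches a forbidden pattern, and the auth file is flagged because no allowed pattern covers it. Forbidden patterns win over allowed ones, which is a design choice in this sketch rather than something the ledger format dictates.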
Recursive handoff blur. Parallel children return summaries that omit child-owned paths, the telephone game at machine speed.
Named fix: Child receipt block. Every child returns the paths it touched, the commands it ran, and the tests that prove its regression guards hold. Parents merge evidence instead of confidence.
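A receipt only helps if the parent can compare it with what actually changed. The sketch below assumes a receipt is a small record with three lists and that the parent already has the child's diff paths; `ChildReceipt`, `receipt_gaps`, and the sample values are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChildReceipt:
    """What a child hands back: paths, commands, and regression tests."""
    child: str
    paths_touched: list[str]
    commands_run: list[str]
    regression_tests: list[str]


def receipt_gaps(receipt: ChildReceipt, diff_paths: list[str]) -> list[str]:
    """List the ways a receipt falls short of the diff it claims to describe."""
    gaps = []
    undeclared = sorted(set(diff_paths) - set(receipt.paths_touched))
    if undeclared:
        gaps.append("paths in diff but not in receipt: " + ", ".join(undeclared))
    if not receipt.commands_run:
        gaps.append("no commands recorded")
    if not receipt.regression_tests:
        gaps.append("no regression tests named")
    return gaps


# Hypothetical receipt from one child fork.
receipt = ChildReceipt(
    child="fork-api",
    paths_touched=["src/api/users.py"],
    commands_run=["pytest tests/api -q"],
    regression_tests=[],
)
gaps = receipt_gaps(receipt, ["src/api/users.py", "src/api/routes.py"])
for gap in gaps:
    print(gap)
```

An empty result means the receipt at least accounts for every changed path and names its evidence. It does not prove the named tests are good ones, which remains a review question.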
Review queue theater. CI is green and reviewers still ask "why this approach?" and find no written answer. Humans optimize for passing checks.
Named fix: Decision stub. The PR template forces three lines: constraints considered, rejected alternatives, verification proof. Debate moves from vibes to explicit tradeoffs.
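The three-line stub can be enforced with a small check on the PR body. The following is a sketch under stated assumptions: the exact labels, the `Label: value` line format, and the `missing_decision_fields` helper are hypothetical, and a real template may phrase them differently.

```python
# Assumed labels for the three stub lines; adapt them to your own PR template.
REQUIRED_FIELDS = (
    "Constraints considered",
    "Rejected alternatives",
    "Verification proof",
)


def missing_decision_fields(pr_body: str) -> list[str]:
    """Return the stub labels that are absent or left empty in the PR body."""
    filled = set()
    for line in pr_body.splitlines():
        label, sep, value = line.partition(":")
        # A field counts only if the label is followed by non-empty text.
        if sep and value.strip():
            filled.add(label.strip().lower())
    return [name for name in REQUIRED_FIELDS if name.lower() not in filled]


# Hypothetical PR body with one field left blank.
pr_body = """Constraints considered: no schema changes this sprint
Rejected alternatives:
Verification proof: pytest tests/billing -q, transcript attached
"""
missing = missing_decision_fields(pr_body)
print(missing)
```

A check like this can only confirm that the lines exist and are non-empty. Whether the stated tradeoffs are sound is still the reviewer's call.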
MCP blast radius. A connector wired for a demo ends up touching data nobody listed on the diagram. Least privilege needs explicit trust boundaries.
Named fix: Connector card. One markdown card per MCP server: allowed actions, forbidden actions, owner, rollback. Operators know what off looks like before they need it.
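The four card fields map naturally onto a default-deny lookup. The sketch below is illustrative only: `ConnectorCard`, `is_permitted`, the server name, and the action names are hypothetical and are not part of the Model Context Protocol or any MCP SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorCard:
    """One card per MCP server, mirroring the fields named in the text."""
    server: str
    allowed_actions: frozenset[str]
    forbidden_actions: frozenset[str]
    owner: str
    rollback: str


def is_permitted(card: ConnectorCard, action: str) -> bool:
    """Default deny: an action must be listed as allowed and not as forbidden."""
    if action in card.forbidden_actions:
        return False
    return action in card.allowed_actions


# Hypothetical card for a hypothetical issue-tracker connector.
card = ConnectorCard(
    server="issue-tracker",
    allowed_actions=frozenset({"read_issue", "comment_issue"}),
    forbidden_actions=frozenset({"delete_issue"}),
    owner="platform-team",
    rollback="remove the server entry from the MCP config and revoke its token",
)
for action in ("read_issue", "delete_issue", "export_users"):
    print(action, is_permitted(card, action))
```

The unlisted `export_users` action is refused just like the explicitly forbidden one, which is the least-privilege behavior the card is meant to document.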
---
description: Delegation boundary snapshot (adapt globs to your repo)
globs:
- "**/*"
alwaysApply: false
---
- Cursor: keep scopes explicit in `.mdc`; forbid undeclared MCP domains.
- Claude Code: cite `CLAUDE.md` precedence before expanding bash scope.
- Codex: ensure `AGENTS.md` carries replay-friendly verification notes for CLI runs.
In our methodology the spec belongs to Plan: agents inherit whatever the Plan step wrote down, and nothing else. The cluster's remaining patterns live on the agentic coding governance page, and AI coding tools that keep working after rollout shows what happens to the same contract at tool-selection time.
Checking the contract at merge
A spec did its job when the reviewer never has to ask, "Why did the agent touch this file?"
| Gate | Question |
|---|---|
| Risk routing | Were red folders touched, and who approved? |
| Replay proof | Which commands prove the regression guards hold? |
| Receipt match | Does the PR body list scopes and a verification transcript? |
| Rules precedence | Which `.mdc`, `SKILL.md`, or `CLAUDE.md` governed behavior? |
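The four gates can be read as four boolean checks on a merge candidate. The sketch below assumes a team has already collected the relevant facts into a record; `MergeCandidate`, `gate_results`, the `RED_FOLDERS` prefixes, and the `Scopes:` / `Verification:` markers are hypothetical choices, not a prescribed format.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed high-risk path prefixes; every repo will define its own.
RED_FOLDERS = ("auth/", "infra/")


@dataclass
class MergeCandidate:
    changed_paths: list[str]
    red_folder_approver: Optional[str]
    replay_commands: list[str]
    pr_body: str
    governing_rules: list[str]


def gate_results(pr: MergeCandidate) -> dict[str, bool]:
    """Answer each gate question from the table with pass or fail."""
    touched_red = any(path.startswith(RED_FOLDERS) for path in pr.changed_paths)
    body = pr.pr_body.lower()
    return {
        "Risk routing": not touched_red or pr.red_folder_approver is not None,
        "Replay proof": bool(pr.replay_commands),
        "Receipt match": "scopes:" in body and "verification:" in body,
        "Rules precedence": bool(pr.governing_rules),
    }


# Hypothetical candidate that touches a red folder without an approver.
candidate = MergeCandidate(
    changed_paths=["auth/session.py", "docs/readme.md"],
    red_folder_approver=None,
    replay_commands=["pytest tests/auth -q"],
    pr_body="Scopes: auth/\nVerification: pytest tests/auth -q, transcript attached",
    governing_rules=["CLAUDE.md"],
)
failed_gates = [gate for gate, passed in gate_results(candidate).items() if not passed]
print(failed_gates)
```

The checks are deliberately shallow: they confirm that an answer was recorded, and leave judging the answer to the human who owns the merge.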
Synthesis: picture two clocks, one for shipping and one for explainability. When only the shipping clock ticks, the team pays later with interest.
Hard constraints still belong to humans: threat models, customer promises, and blast radius decisions stay off autopilot.
Best ways to use this research
- Best for: Cursor teams deciding which rule, subagent, skill, or MCP boundary to standardize next around specs and tests for agent work.
- Best first artifact: turn the scope ledger into a `.mdc` rule, `AGENTS.md` note, subagent receipt, or review checklist before the next parallel run.
- Best comparison angle: compare the spec-first workflow against the current Cursor review path, connector scope, and team rule file; keep the path that leaves the shortest auditable trail.
Common questions
Do coding agents need specs and tests to be reliable? Yes, because they inherit ambiguity faster than people do. Specs pin expected behavior, tests prove it by executing, and the pair gives parallel agents a contract that does not depend on chat memory. Without it, every fork resolves ambiguity its own way.
Why does parallel agent work fail without clear scopes? Because parallelism punishes fuzzy scopes first. Two agents with overlapping, unwritten boundaries will both touch the contested path, and the conflict surfaces at merge time as a mystery. A five-line scope ledger per fork keeps the boundaries checkable against the diffs.
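Overlap between forks can be caught before the run instead of at merge time. A minimal sketch, assuming each fork's allowed paths are written as glob patterns and that a list of repo files is available; `contested_paths` and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase


def contested_paths(
    repo_files: list[str], allowed_a: list[str], allowed_b: list[str]
) -> list[str]:
    """Return files that both forks are allowed to touch."""
    def covered(path: str, patterns: list[str]) -> bool:
        # Note: fnmatchcase's "*" also matches "/" characters.
        return any(fnmatchcase(path, pat) for pat in patterns)

    return [p for p in repo_files if covered(p, allowed_a) and covered(p, allowed_b)]


# Hypothetical file list and two fork scopes.
repo_files = ["src/api/users.py", "src/api/shared/errors.py", "src/ui/form.tsx"]
fork_a = ["src/api/*"]
fork_b = ["src/ui/*", "src/api/shared/*"]
contested = contested_paths(repo_files, fork_a, fork_b)
print(contested)
```

Any path in the result needs a single named owner in one ledger and a forbidden entry in the other before the forks start.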
What is a scope ledger? Five lines carried in the parent chat: goal, allowed paths, forbidden paths, verification command, merge owner. It is the smallest spec that still lets a reviewer test an agent diff against a written boundary instead of against someone's memory of the prompt.
Further reading
- Cursor docs: Agent
- OWASP Top 10 for Large Language Model Applications
- NIST AI Risk Management Framework
- Google Search Central: helpful, people-first content
- Google Search Central: generative AI content guidance
- Model Context Protocol specification
- OpenAI Skills repository
Next step
If your specs live in chat and your tests live in hope, we can help you move both into the repo: contact.