
What to do when AI coding tools regress

When AI coding tools regress, the teams that recover fastest are the ones whose receipts survive the update: connector cards, child receipts, decision stubs.

Landscape Near Paris, landscape painting by Georges Michel (1835).
Rogier Muller · March 7, 2026 · 6 min read

When an AI coding tool regresses, compare the new diffs against your written scopes before you touch the tool itself. A tooling regression is a release that quietly shifts agent behavior your workflow was counting on. The agent that behaved on Friday misbehaves on Monday, and the merge queue does not wait for you to figure out why. The teams that recover in hours are the ones whose pull requests describe boundaries, not vibes.

The editors and agents discussed here are familiar names: Cursor (Anysphere's AI code editor), plus Claude Code and the Codex CLI sitting next to it in the same repos. The fix below is the same regardless of which one shipped the surprise.

Find out what actually changed

Your first move is not debugging. It is diffing behavior against what you wrote down.

If the work left a trail, you can rerun the same intent against the new release and read the difference directly. If it did not, you spend the first day reconstructing intent from chat history, and that reconstruction is the expensive half of the outage. The whole game is having something to compare against.

So the question that matters on a bad Monday is simple. Did the regression change what the agent did, or only how it narrated it? A scope you wrote in advance answers that in seconds.
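
As a rough sketch of that question in code, the snippet below compares two recorded runs of the same intent. The `RunRecord` shape and the three labels are assumptions made for illustration, not a format any of the tools emit; the point is only that commands and touched paths count as behavior, while the summary counts as narration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One recorded agent run (hypothetical shape, filled in by hand or by a wrapper script)."""
    intent: str                      # the task as written before delegation
    commands: tuple[str, ...]        # commands the agent ran, in order
    touched_paths: frozenset[str]    # files the run changed
    summary: str                     # what the agent said it did


def classify_shift(before: RunRecord, after: RunRecord) -> str:
    """Label the difference between a run on the old release and one on the new."""
    if before.intent != after.intent:
        raise ValueError("records describe different intents; nothing to compare")
    # Commands and touched paths are what the agent did.
    if before.commands != after.commands or before.touched_paths != after.touched_paths:
        return "behavior changed"
    # Same actions, different wording: only the narration moved.
    if before.summary != after.summary:
        return "narration only"
    return "no change"


friday = RunRecord(
    intent="Add retry to the invoice exporter",
    commands=("pytest tests/export",),
    touched_paths=frozenset({"src/export/retry.py"}),
    summary="Added a retry wrapper.",
)
monday = RunRecord(
    intent="Add retry to the invoice exporter",
    commands=("pytest tests/export",),
    touched_paths=frozenset({"src/export/retry.py", "infra/deploy.yaml"}),
    summary="Added a retry wrapper.",
)
verdict = classify_shift(friday, monday)
```

Here the summaries are identical but Monday's run touched an extra file, so the verdict is a behavior change even though the narration never mentioned it.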

Write a scope ledger before you delegate

Put five lines in the parent chat or the task description before any agent starts work: goal, allowed paths, forbidden paths, verification command, merge owner. Reviewers argue endlessly about what a vaguely worded rule "meant," so spell the boundary out in plain language they can hold a diff against.

When a release shifts behavior, diff-versus-ledger is the fastest detector a reviewer has. The agent touched a folder that was on the forbidden list? You found the regression's edge without reading a single line of model output.
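
A minimal sketch of the diff-versus-ledger check, assuming the five ledger lines are captured in a small record and the changed paths come from something like `git diff --name-only`. The field names, the example paths, and the `check_diff` helper are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ScopeLedger:
    """The five ledger lines (field names are illustrative)."""
    goal: str
    allowed_paths: tuple[str, ...]     # glob patterns the agent may touch
    forbidden_paths: tuple[str, ...]   # glob patterns the agent must not touch
    verification_command: str
    merge_owner: str


def check_diff(ledger: ScopeLedger, changed_paths: list[str]) -> dict[str, list[str]]:
    """Sort changed paths into forbidden hits and paths nobody declared.

    Note: fnmatch-style "*" also matches "/", so "infra/*" covers nested files.
    """
    forbidden = [
        p for p in changed_paths
        if any(fnmatchcase(p, pat) for pat in ledger.forbidden_paths)
    ]
    undeclared = [
        p for p in changed_paths
        if p not in forbidden
        and not any(fnmatchcase(p, pat) for pat in ledger.allowed_paths)
    ]
    return {"forbidden": forbidden, "undeclared": undeclared}


ledger = ScopeLedger(
    goal="Add retry to the invoice exporter",
    allowed_paths=("src/export/*", "tests/export/*"),
    forbidden_paths=("infra/*", "src/auth/*"),
    verification_command="pytest tests/export",
    merge_owner="@alice",
)
result = check_diff(
    ledger,
    ["src/export/retry.py", "infra/k8s/deploy.yaml", "README.md"],
)
```

A non-empty `forbidden` list is the regression's edge; `undeclared` paths are a softer signal that the scope was either too narrow or quietly exceeded.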

Here is a tool-agnostic snapshot you can adapt per repo. The same contract reads against Cursor, Claude Code, or the Codex CLI.

---
description: Delegation boundary snapshot (adapt globs to your repo)
globs:
  - "**/*"
alwaysApply: false
---

- Cursor: keep scopes explicit in `.mdc`; forbid undeclared MCP domains.
- Claude Code: cite `CLAUDE.md` precedence before expanding bash scope.
- Codex: ensure `AGENTS.md` carries replay-friendly verification notes for CLI runs.

Bound the blast radius with three small artifacts

You cannot stop a release from changing behavior. You can shrink how much a misbehaving run is allowed to touch. Three lightweight artifacts do most of that work.

A connector card is one markdown card per MCP server: allowed actions, forbidden actions, owner, rollback. Connectors ship as capability demos by default, and least privilege needs explicit trust boundaries (MCP specification). A regression is the worst time to learn a connector reaches data nobody listed.
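
One way to keep a connector card honest is to lint it. The sketch below assumes the card's four fields have already been read out of the markdown into a plain dictionary; the key names, the action names, and both helpers are made up for illustration and are not part of the MCP specification.

```python
REQUIRED_FIELDS = ("allowed_actions", "forbidden_actions", "owner", "rollback")


def card_problems(card: dict) -> list[str]:
    """Return human-readable problems with a connector card (empty list means clean)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not card.get(name)]
    overlap = set(card.get("allowed_actions", [])) & set(card.get("forbidden_actions", []))
    problems += [f"action both allowed and forbidden: {a}" for a in sorted(overlap)]
    return problems


def unexpected_actions(card: dict, observed: list[str]) -> list[str]:
    """Actions a run actually performed that the card never allowed."""
    allowed = set(card.get("allowed_actions", []))
    return [a for a in observed if a not in allowed]


# Hypothetical card for a ticketing connector.
card = {
    "allowed_actions": ["read_ticket", "comment_ticket"],
    "forbidden_actions": ["delete_ticket", "comment_ticket"],
    "owner": "",
    "rollback": "Revoke the connector token and revert the last comment batch.",
}
problems = card_problems(card)
surprises = unexpected_actions(card, ["read_ticket", "export_customers"])
```

The second helper is the one that matters during a regression: it turns "a connector reached data nobody listed" into a list you can read.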

A child receipt block makes every chained agent return the paths it touched, the commands it ran, and the tests proving its guards held. Chained agents love to hand back a tidy summary that omits child-owned paths, which is exactly where a behavior shift hides. With receipts, the regression's footprint is something you can enumerate instead of guess.
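
To make "enumerate instead of guess" concrete, here is a small sketch that checks a diff against the receipts child agents returned. The `ChildReceipt` fields mirror the three items above; the agent names, paths, and commands are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ChildReceipt:
    """What one chained agent hands back (hypothetical shape)."""
    agent: str
    paths_touched: list[str] = field(default_factory=list)
    commands_run: list[str] = field(default_factory=list)
    guard_tests: list[str] = field(default_factory=list)  # tests proving guards held


def unreceipted_paths(diff_paths: list[str], receipts: list[ChildReceipt]) -> list[str]:
    """Paths present in the diff that no child receipt claims."""
    claimed = {p for r in receipts for p in r.paths_touched}
    return sorted(set(diff_paths) - claimed)


def receipts_without_proof(receipts: list[ChildReceipt]) -> list[str]:
    """Agents that changed files but listed no guard tests."""
    return [r.agent for r in receipts if r.paths_touched and not r.guard_tests]


receipts = [
    ChildReceipt(
        agent="schema-agent",
        paths_touched=["db/migrations/0007.sql"],
        commands_run=["make migrate"],
        guard_tests=["tests/test_migrations.py"],
    ),
    ChildReceipt(
        agent="api-agent",
        paths_touched=["src/api/routes.py"],
        commands_run=["make lint"],
    ),
]
diff_paths = ["db/migrations/0007.sql", "src/api/routes.py", "src/api/auth.py"]

gaps = unreceipted_paths(diff_paths, receipts)
unproven = receipts_without_proof(receipts)
```

Any path in `gaps` was changed by something that did not report it, which is the first place to look when a tidy summary and the diff disagree.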

A decision stub forces three lines into the PR template: constraints considered, rejected alternatives, verification proof. When tool behavior shifts, the stub shows you which decisions the old behavior was holding up.
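
A stub only helps if it is actually filled in. The check below is a sketch that assumes each stub line is written as `Label: text` in the PR body; that layout and the `missing_stub_lines` helper are assumptions, so adjust the labels to whatever your template uses.

```python
import re

STUB_LABELS = ("Constraints considered", "Rejected alternatives", "Verification proof")


def missing_stub_lines(pr_body: str) -> list[str]:
    """Return stub labels that are absent or left empty in the PR body."""
    missing = []
    for label in STUB_LABELS:
        # Label at line start, a colon, then at least one non-space character
        # on the same line ([ \t] keeps the match from spilling onto the next line).
        pattern = rf"^[ \t]*{re.escape(label)}[ \t]*:[ \t]*\S"
        if not re.search(pattern, pr_body, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(label)
    return missing


pr_body = """\
Constraints considered: exporter API must stay backward compatible
Rejected alternatives:
Verification proof: pytest tests/export passed locally
"""
missing = missing_stub_lines(pr_body)
```

Run as a pre-merge check, this flags the empty "Rejected alternatives" line before a reviewer has to ask for it.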

Triage with the four gates

Regression triage starts with four questions a good PR already answers. Keep this table near your review checklist.

| Gate | Question |
| --- | --- |
| Connector truth | Which MCP servers fired, and were they expected? |
| Reviewer path | Can someone unfamiliar trace intent without chat replay? |
| Risk routing | Were red folders touched, and who approved? |
| Replay proof | Which commands prove regression guards? |

And a checklist you can paste straight into a PR body:

  • Scopes in the PR body match folders in the diff.
  • Primary-doc links were smoke-checked after publishing edits.
  • MCP connectors mentioned (if any) list owners.
  • Verification command output is pasted or linked.

None of this replaces architecture judgement. Agents speed up execution, not ownership. OWASP's LLM Top 10 and NIST's AI Risk Management Framework cover the risk classes that outlive any single release, and they are worth a read before you standardize anything.

Common questions

  • What should a team do first when AI coding tools regress?

    Compare diffs against your scope ledgers before debugging the tool. The ledger tells you whether the regression changed what the agent did or only how it described it. Teams without ledgers spend the first day reconstructing intent from chat, which is the slow, expensive half of the outage and the part you can skip with one written snapshot.

  • How do you tell a tool regression from a bad prompt?

    Replay receipts settle it. If your AGENTS.md mandates an intent line, a command transcript, and a diff summary, you can rerun the same intent against the new release and compare behavior directly. Without transcripts there is nothing to replay, so the argument stays an unresolvable matter of opinion on both sides.

  • Can you prevent regressions from reaching the main branch?

    Not fully, but you can shrink their reach. Connector cards bound what a misbehaving run can touch, child receipt blocks enumerate its footprint, and the decision stub records which behaviors the merge depended on. Containment is the realistic goal here. Treating prevention as guaranteed is just marketing.

  • Why do regressions hurt agent-heavy teams more?

    Because agents multiply the surfaces a release can shift: rules files, hooks, connectors, delegation chains. A team running one editor tends to absorb an update quietly. A team running parallel agent streams inherits permission drift nobody signed off on. Written receipts are how that second team stays fast through the bad week anyway.

  • Where should a small team start this week?

    Start with a single connector card for your riskiest MCP server, then add a scope ledger to your next delegated task. Both are markdown, both take minutes, and both pay off the first time a release surprises you. Behavior-in-files is the same pattern as OpenAI's skills repository.

Where to go next

Pick one artifact and ship it before the next update lands, not after. If you want the full receipt catalog with a rollout order for a recovery week, browse the AI coding governance topics or bring the habit to a hands-on training session.
