Long-Running Agent Loops
Why long-running coding agents help on iterative, verification-heavy tasks.

Some coding agents work better when they stay alive. They keep context, keep checking, and keep moving through a task without restarting every few minutes.
That matters in real engineering work. The hard part is usually not one edit. It is the sequence: inspect, change, verify, recover, continue. A long-running loop can hold that together. A short-lived run often cannot.
The useful idea here is simple: treat the agent as a worker with a durable session, not as a prompt-response toy. That workflow holds even when the tool changes.
Why long-running loops help
Long-running agents reduce re-explaining. They can carry forward what they tried, what failed, and what still looks risky. That lowers the cost of iteration, especially on tasks that touch several files or need verification after each change.
They also make recovery easier. If the agent hits a bad branch, it can back up and try another path without losing the thread. In short sessions, that context is usually gone. The next run starts from scratch and repeats the same mistakes.
This works best when the task has a clear finish line but an unclear path. Examples include:
- fixing a bug that spans app code and tests
- updating a feature and checking UI behavior
- refactoring a module while preserving existing behavior
- chasing a failing test that needs several small probes
What changes in the workflow
The main shift is from prompt quality to loop design. You still need a good task description, but the bigger gains come from how the agent is allowed to work.
A practical loop usually has four parts:
- Start with a bounded task and a clear success condition.
- Let the agent inspect the codebase before changing anything.
- Require a verification step after each meaningful edit.
- Keep the session alive long enough to recover from a wrong turn.
That last point is easy to miss. If the agent is killed too early, you lose continuity. If it runs too long without checks, you risk drift. The useful middle ground is a session that can persist, but only inside a tight review loop.
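The four-part loop above can be sketched as a small driver. This is a minimal sketch, not any real tool's API: names like `inspect`, `propose_edit`, `verify`, and `revert` are hypothetical stand-ins for whatever your agent framework exposes.

```python
def run_session(agent, task, max_steps=10):
    """Drive one bounded task inside a persistent session.

    Assumes a hypothetical agent object with inspect/propose_edit/
    apply/verify/revert methods; these names are illustrative.
    """
    agent.inspect(task)                  # look before changing anything
    for _ in range(max_steps):
        edit = agent.propose_edit(task)
        if edit is None:                 # agent believes the task is done
            return agent.verify(task)    # final check before declaring success
        agent.apply(edit)
        if not agent.verify(task):       # verification after each meaningful edit
            agent.revert(edit)           # recover without losing the session
    return False                         # out of budget: fail loudly, not silently


class ToyAgent:
    """Toy stand-in for demonstration: finishes after two passing edits."""

    def __init__(self):
        self.edits = 0

    def inspect(self, task):
        pass                             # a real agent would read the codebase here

    def propose_edit(self, task):
        return None if self.edits >= 2 else "edit"

    def apply(self, edit):
        self.edits += 1

    def verify(self, task):
        return True                      # toy checks always pass

    def revert(self, edit):
        self.edits -= 1
```

Calling `run_session(ToyAgent(), "fix failing test")` returns `True` once the toy agent stops proposing edits. The point of the structure is the bounded step budget plus the verify/revert pair: the session persists, but every edit has to earn its place.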
Where this pattern breaks down
Long-running sessions are not free. They can accumulate bad assumptions. They can also wander if the task is underspecified. The longer the session lives, the more important it is to keep the scope narrow.
There is also a review cost. A persistent agent can produce more intermediate state, more partial edits, and more chances for subtle mistakes. That is fine if the team checks diffs and runs tests. It is not fine if the team expects the agent to be correct by default.
Another limit: long-running does not automatically mean better planning. If the agent starts with a weak plan, it may simply spend more time going the wrong way. Persistence helps execution more than judgment.
How to implement it in practice
If you are setting this up for a team, start small.
- Use long-running sessions only for tasks that need inspection plus verification.
- Keep one task per session. Do not mix unrelated work.
- Ask the agent to report what it changed and why before it moves on.
- Make test runs part of the loop, not a final afterthought.
- Save the session state or transcript if your tool supports it, so a human can review the path later.
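The "report what changed and why" and "save the transcript" habits can be combined into one lightweight record per session. A minimal sketch, assuming an in-memory transcript; `StepRecord` and `SessionTranscript` are illustrative names, not a feature of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    what: str       # what the agent changed
    why: str        # the agent's stated reason, captured before moving on
    verified: bool  # did the check pass after this edit?


@dataclass
class SessionTranscript:
    task: str  # one task per session; unrelated work gets its own transcript
    steps: list = field(default_factory=list)

    def record(self, what, why, verified):
        self.steps.append(StepRecord(what, why, verified))

    def unverified(self):
        """Edits a human reviewer should look at first."""
        return [s for s in self.steps if not s.verified]


session = SessionTranscript("fix flaky auth test")
session.record("patched token refresh", "expiry was off by one", True)
session.record("loosened retry timeout", "guessing at a race condition", False)
```

Even this much structure lets a reviewer walk the path the agent took and jump straight to the steps that never passed a check.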
The useful question for a team is not "which model is best?" It is "which loop gives us the fewest false starts per completed task?" That framing stays stable as tools change.
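That metric is cheap to track. A minimal sketch, assuming each session is simply labeled completed or abandoned; the field name is illustrative.

```python
def false_starts_per_completed(sessions):
    """Abandoned sessions divided by completed ones.

    sessions: list of dicts, each with a 'completed' bool (an assumed
    schema for illustration). Lower is better; infinity means the loop
    is not finishing anything.
    """
    completed = sum(1 for s in sessions if s["completed"])
    abandoned = len(sessions) - completed
    if completed == 0:
        return float("inf")
    return abandoned / completed
```

For example, three sessions where one was abandoned gives a ratio of 0.5, which a team can compare across loop designs rather than across models.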
A concrete example
Jarrod Watts has described Codex as long-running by design, and that matches the pattern above. The product matters less than the workflow: keep the agent in the task long enough to inspect, adjust, and verify without restarting the whole process.
That same pattern can apply in other agent IDEs and CLIs. The implementation details differ, but the operating principle does not. Durable context helps when the work is iterative and the codebase is messy enough that the first answer is rarely the last one.
What teams should watch
Long-running agents work best when the surrounding process is disciplined. Without that, they can become expensive ways to make the same mistakes more slowly.
The main tradeoff is control versus continuity. Short runs are easier to reset. Long runs are better at carrying intent. Most teams need both, depending on the task.
A good default is to reserve long-running sessions for work that benefits from repeated verification and stateful recovery. Use shorter runs for isolated edits, simple transformations, or tasks where the answer is already obvious.
Methodology note
This is a Build-level pattern: the value comes from how the agent is allowed to execute, not from a new theory of coding assistants. See our methodology for how we separate workflow claims from tool-specific features.
Bottom line
Long-running coding agents are useful when the task is iterative, the codebase is messy, and verification matters. They are less useful when the task is simple or the scope is vague.
The practical lesson is straightforward: keep the session alive when continuity helps, but keep the loop tight enough that the agent still has to earn each next step.