
Stop using CSS selectors in E2E tests

CSS selectors in E2E tests churn every time an agent regenerates markup. Durable selectors, decision stubs, and scope ledgers keep the suite reviewable.

Rogier Muller · April 6, 2026 · 6 min read

If your E2E suite goes red every time an agent touches the UI, the fix is to stop pinning tests to CSS selectors and pin them to what the user sees or a stable test id instead. A durable E2E selector is one tied to user-visible text or an explicit data-testid, not to a class chain an agent will rewrite tomorrow. Coding agents like Cursor, Anysphere's AI code editor, regenerate markup fast and freely, so a selector that depends on markup shape is a failure waiting for its next refactor.

Here is the loop I keep watching break. A test goes red, the proposed fix is "let the agent regenerate the test," the new test passes, and two days later it goes red again on a fresh batch of equally arbitrary selectors. The test never told you anything changed in the product, because nothing did. The markup just moved.

Why CSS selectors keep breaking your suite

A CSS selector couples your test to implementation detail. The test is supposed to check behavior, but .flex > div:nth-child(2) > button.primary checks structure. Change the structure without changing the behavior and the test fails anyway.

This was annoying before agents and it is constant now. An agent reshuffles a component for readability, your selector chain breaks, and the suite reports a regression that is not a regression. You spend review time arguing about an nth-child nobody chose on purpose.

The deeper problem is that the selector strategy was never written down. When the fix is "regenerate," the next agent session re-decides the strategy from scratch, picks different arbitrary selectors, and the cycle resets. The cure is not a smarter selector. It is making the selector choice a written, reviewable decision.

Pick selectors a refactor cannot reach

Tie each E2E selector to something a user would notice if it changed: visible text, a role, or a test id you put there on purpose. If a selector would survive a full markup rewrite that keeps the behavior identical, it is durable. If it would not, it is a tripwire.
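The durable-versus-tripwire test above can be roughed out as a lint check. This is a minimal sketch, not a complete rule set: `classify_selector` and both pattern lists are hypothetical, and the `role=` and `text=` prefixes assume a Playwright-style selector notation that you would adapt to your own runner.

```python
import re

# Assumed heuristics: shapes tied to something a user sees or a deliberate test id.
DURABLE_PATTERNS = [
    re.compile(r"^\[data-testid=(['\"])[\w-]+\1\]$"),  # explicit test id
    re.compile(r"^role=\w+"),                          # accessible role
    re.compile(r"^text="),                             # visible text
]

# Assumed heuristics: shapes that depend on how the markup happens to be nested.
BRITTLE_PATTERNS = [
    re.compile(r":nth-(child|of-type)\("),  # positional lookup
    re.compile(r"\s[>+~]\s"),               # structural combinator
    re.compile(r"\.[\w-]+"),                # class name
]


def classify_selector(selector: str) -> str:
    """Return 'durable', 'tripwire', or 'review' for a selector string."""
    s = selector.strip()
    if any(p.search(s) for p in DURABLE_PATTERNS):
        return "durable"
    if any(p.search(s) for p in BRITTLE_PATTERNS):
        return "tripwire"
    # Anything unrecognised goes to a human instead of being guessed at.
    return "review"


if __name__ == "__main__":
    for candidate in (
        ".flex > div:nth-child(2) > button.primary",
        '[data-testid="checkout-submit"]',
        "text=Save",
    ):
        print(candidate, "->", classify_selector(candidate))
```

A check like this only flags shapes. Whether a given test id or label is the right thing to pin to is still a review decision.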

Then record the choice so it survives the next agent. A one-line decision note on the PR does this: what you tied the selector to, and the class-chain alternative you rejected. Now the strategy outlives any single session, instead of getting re-decided by whoever runs next.

Here is the boundary file I keep at the root so the rule holds across tools:

---
description: Delegation boundary snapshot (adapt globs to your repo)
globs:
  - "**/*"
alwaysApply: false
---

- Cursor: keep scopes explicit in `.mdc`; forbid undeclared MCP domains.
- Claude Code: cite `CLAUDE.md` precedence before expanding bash scope.
- Codex: ensure `AGENTS.md` carries replay-friendly verification notes for CLI runs.

The same file works whether the session runs in Claude Code, Anthropic's coding agent, or is scripted through the Codex CLI. Test code is code, so it goes through our methodology at the verification step: the suite is what every other change is pinned against.

Let agents fix tests, but inside a fence

Agents can regenerate a failing test. The danger is scope creep: you ask for "fix the failing test" and the diff quietly rewrites the component so the broken test passes. That hides a real regression behind a green check.

Fence it with a short scope ledger in the parent chat: goal, allowed paths, forbidden paths, verification command, merge owner. Put the test directory in allowed paths and the application markup in forbidden paths. Now the agent can fix the test or flag the app, never silently both.
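If you want the ledger to be checkable and not just a chat message, one option is to hold it as data and compare it against the files a diff touched. The sketch below assumes glob-style paths; `ScopeLedger`, `out_of_scope`, and the example paths and command are hypothetical names, not part of any tool.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ScopeLedger:
    goal: str
    allowed_paths: tuple[str, ...]
    forbidden_paths: tuple[str, ...]
    verification_command: str
    merge_owner: str


def out_of_scope(ledger: ScopeLedger, changed_files: list[str]) -> list[str]:
    """Return changed files that are forbidden or not explicitly allowed."""
    violations = []
    for path in changed_files:
        forbidden = any(fnmatchcase(path, g) for g in ledger.forbidden_paths)
        allowed = any(fnmatchcase(path, g) for g in ledger.allowed_paths)
        # Deny by default: a path nobody listed is treated as out of scope.
        if forbidden or not allowed:
            violations.append(path)
    return violations


if __name__ == "__main__":
    ledger = ScopeLedger(
        goal="fix the failing checkout spec",
        allowed_paths=("tests/e2e/*",),        # hypothetical test directory
        forbidden_paths=("src/*",),            # hypothetical application markup
        verification_command="npm run test:e2e",  # hypothetical command
        merge_owner="reviewer-on-call",
    )
    print(out_of_scope(ledger, ["tests/e2e/cart.spec.ts", "src/components/Cart.tsx"]))
```

The changed-file list would typically come from something like `git diff --name-only`. A non-empty result means the agent should have flagged the app and left it alone.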

Make each child agent hand back a receipt: which specs it touched, which commands it ran, and the output that proves the regression guard still fires. A summary that omits the specs it rewrote is how half a suite changes without anyone deciding it should.
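A receipt is easiest to trust when it is compared against what actually changed. As a rough sketch under assumed names (`Receipt` and `receipt_gaps` are hypothetical), the check below lists what a receipt leaves out: specs that changed but were not reported, a verification command that never ran, or missing output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    specs_touched: frozenset[str]
    commands_run: tuple[str, ...]
    guard_output: str


def receipt_gaps(
    receipt: Receipt, changed_specs: list[str], verification_command: str
) -> list[str]:
    """Return human-readable gaps; an empty list means the receipt is complete."""
    gaps = []
    # Specs that changed in the diff but are missing from the receipt.
    unreported = sorted(set(changed_specs) - receipt.specs_touched)
    if unreported:
        gaps.append("unreported specs: " + ", ".join(unreported))
    if verification_command not in receipt.commands_run:
        gaps.append("verification command not run: " + verification_command)
    if not receipt.guard_output.strip():
        gaps.append("no output proving the regression guard fired")
    return gaps


if __name__ == "__main__":
    receipt = Receipt(
        specs_touched=frozenset({"tests/e2e/cart.spec.ts"}),
        commands_run=("npm run test:e2e",),  # hypothetical command
        guard_output="",
    )
    changed = ["tests/e2e/cart.spec.ts", "tests/e2e/login.spec.ts"]
    for gap in receipt_gaps(receipt, changed, "npm run test:e2e"):
        print(gap)
```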

One more boundary that gets forgotten: the browser-automation connector that drives your suite can usually reach far more than your suite. Give every MCP server a small card listing allowed actions, forbidden actions, owner, and rollback. The OWASP Top 10 for LLM applications reads like a list of what goes wrong when nobody writes that card.
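The card itself can be a few lines of data. In this sketch the class, the server name, and the action names are all hypothetical, and nothing here is an MCP API. It only shows the four fields and a deny-by-default lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorCard:
    server: str
    allowed_actions: frozenset[str]
    forbidden_actions: frozenset[str]
    owner: str
    rollback: str


def is_permitted(card: ConnectorCard, action: str) -> bool:
    """Deny by default: the action must be listed as allowed and not forbidden."""
    return action in card.allowed_actions and action not in card.forbidden_actions


if __name__ == "__main__":
    card = ConnectorCard(
        server="browser-automation",  # hypothetical server name
        allowed_actions=frozenset({"navigate", "click", "read_text"}),
        forbidden_actions=frozenset({"download_file"}),
        owner="qa-team",
        rollback="disable the connector in the agent config",
    )
    print(is_permitted(card, "click"), is_permitted(card, "download_file"))
```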

Review a regenerated spec like any agent diff

A regenerated test gets the same four gate questions as any other agent change. Run them before you approve.

| Gate | Question |
| --- | --- |
| Connector truth | Which MCP servers fired, and were they expected? |
| Reviewer path | Can someone unfamiliar trace intent without chat replay? |
| Risk routing | Were red folders touched, and who approved? |
| Replay proof | Which commands prove regression guards? |

Paste this checklist into the PR template so the handoff carries its own evidence:

  • Any MCP connectors mentioned list their owners.
  • Verification command output is pasted or linked.
  • Forked agent work lists parent and child responsibilities.
  • Red-folder paths received explicit human acknowledgement.

None of this replaces test design judgement. Agents speed up execution, not ownership. The NIST AI Risk Management Framework makes the general point, and a suite makes it concrete: whoever owns the merge owns what the suite stops catching. This is plain agentic coding governance, and it matters doubly when several streams run at once, which is the problem of keeping parallel coding agents from colliding.

Common questions

  • Why do CSS selectors in E2E tests keep breaking? Because they bind the test to markup structure instead of behavior. CSS selectors inherit every refactor, and agents refactor markup constantly, so the suite fails without the product changing. Each regeneration that "fixes" the test just plants the next failure in fresh, equally arbitrary selectors.

  • What should agents use instead of CSS selectors? Selectors tied to what the user sees or to a stable test id, with the choice recorded in a one-line decision note. The note matters as much as the selector. A written strategy survives regeneration, while an unwritten one gets re-decided by whichever agent session runs next.

  • Should an agent be allowed to regenerate failing tests? Yes, inside a fence. The scope ledger allows the test directory and forbids application markup, so the agent cannot rewrite the app to satisfy the spec. The child receipt then lists every spec it touched and the command output proving the suite actually ran.

  • How do I tell a real failure from selector churn? Sample your last red-to-green test diffs and sort them into two piles: behavior actually changed, or selectors just moved. If most of the churn is the second pile, you have a selector problem, not a product bug. That ratio is the fastest signal of whether your suite tests behavior or structure.
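The two-pile sort in the last answer comes down to a single ratio. A minimal sketch, assuming you have already labelled each sampled red-to-green diff by hand as "behavior" or "selector" (the labels and the function name are hypothetical):

```python
from collections import Counter
from typing import Optional


def churn_ratio(labels: list[str]) -> Optional[float]:
    """Share of sampled test fixes where only selectors moved.

    Returns None when the sample contains neither label.
    """
    counts = Counter(labels)
    total = counts["behavior"] + counts["selector"]
    if total == 0:
        return None
    return counts["selector"] / total


if __name__ == "__main__":
    sample = ["selector", "selector", "behavior", "selector"]
    print(churn_ratio(sample))
```

A ratio above one half means most of the sampled churn was selectors moving, which points at the selector strategy and not the product.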

Where to start

Put a decision note on your next test-fix PR recording the selector strategy and the alternative you rejected, then fence the agent loop with a scope ledger. Our training runs a session where teams fence a loop around their own flakiest suite and keep the receipts.
