Trustworthy evals for coding teams
Shared definitions, versioned data, and review gates make evals useful across teams.

The situation
Most teams ask for “better evals,” but the request is usually framed too narrowly: they want a runner, a dashboard, or a score. The harder part is agreeing on what counts as good, keeping that definition stable as models change, and making the results useful across teams.
That is the main point in Phil Hetzel’s Braintrust talk: an eval platform is more than execution. It is shared measurement infrastructure. If the data, labels, and versioning are weak, the score becomes a local opinion instead of a decision tool.
This matters for agentic coding teams because the same problem shows up across IDEs, CLIs, and shared automation. One group tests prompt quality, another tests tool use, and a third tests review behavior. Without a common measurement layer, each team optimizes a different target.
The real question is not “Can we run evals?” It is “Can we trust the result enough to change behavior, training, or release gates?”
Walkthrough
Start by separating three layers: the task definition, the dataset, and the scoring rule. If those move together, you cannot tell whether a model improved or the benchmark drifted. Keep each layer versioned on its own, even if the first version is simple.
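One way to make that separation concrete is to record all three versions on every run, so a score can always be traced to the exact combination that produced it. The sketch below is illustrative only; the names (EvalManifest, run_id) are assumptions, not part of any particular platform.
# eval_manifest.py (illustrative sketch)
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalManifest:
    task_spec: str   # e.g. "code-review-eval@0.1.0"
    dataset: str     # e.g. "code-review-set@v3"
    rubric: str      # e.g. "code-review-rubric@v2"

    def run_id(self) -> str:
        # Any change to any layer produces a new identifier,
        # so two runs are only comparable when all three parts match.
        return f"{self.task_spec}+{self.dataset}+{self.rubric}"

manifest = EvalManifest("code-review-eval@0.1.0", "code-review-set@v3", "code-review-rubric@v2")
print(manifest.run_id())
If two runs disagree, the first question is whether their manifests match; if they do not, the difference is about the benchmark, not the model.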
A lightweight pattern is a skill-style package for the workflow itself. Claude’s Skills model is a useful reference here because it treats repeatable work as a folder of instructions, scripts, and resources that loads only when relevant. That progressive-disclosure idea also fits eval operations: keep the task spec close to the task, not buried in a wiki. See the official docs at cursor.com/docs, Claude Skills, and the open repositories at github.com/anthropics/skills and github.com/openai/skills.
A minimal package might look like this:
# SKILL.md
name: code-review-eval
version: 0.1.0
purpose: Score whether an agent's code change is safe, minimal, and test-backed.
inputs:
- diff
- test_results
- reviewer_notes
scoring:
- correctness
- scope_control
- evidence_of_tests
- explanation_quality
versioning:
  dataset: v3
  rubric: v2
  model: gpt-5.5
Then make the dataset boring and explicit. Each example should say what task it represents, what the expected outcome is, and what evidence is acceptable. If the team cannot explain why an example is in the set, it should not be in the set. This is where many eval platforms fail: they accumulate examples faster than they accumulate labels that can survive review.
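In practice that means each example carries its own rationale. A hypothetical record shape follows; the field names are assumptions rather than a standard schema.
# dataset_example.py (illustrative sketch)
example = {
    "id": "review-017",
    "task": "Fix a failing null check with a one-line change",
    "expected_outcome": "Patch touches only the affected function and adds a regression test",
    "acceptable_evidence": ["diff", "test_results"],
    "reason_included": "Covers the common case of a minimal, test-backed fix",
    "label_version": "v3",
}

def survives_review(ex: dict) -> bool:
    # If an example cannot say why it is in the set, it should not be in the set.
    required = ("task", "expected_outcome", "acceptable_evidence", "reason_included")
    return all(ex.get(field) for field in required)

assert survives_review(example)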
For agentic coding, add a second layer of measurement for tool boundaries. A good agent may solve the task while still violating policy by reaching for the wrong connector, reading too much context, or writing outside its allowed scope. OpenAI’s Codex docs and Anthropic’s Skills docs both point toward this broader view: the system is not only generating output, it is operating inside permissions, tools, and workflows. That means your evals should include boundary checks, not just answer quality.
A small rule stub can help teams review those boundaries consistently:
# AGENTS.md
## Review gates
- Do not approve changes that pass tests but exceed the requested file scope.
- Require a note when the agent uses an external tool or connector.
- Flag any eval run that changes rubric text without a dataset version bump.
- Treat missing evidence as a failed run, not a neutral one.
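Some of those gates can be checked mechanically. Below is a minimal sketch of the scope gate, assuming the run record exposes the requested paths and the paths the diff actually touched; both names are hypothetical.
# scope_gate.py (illustrative sketch)
def exceeds_scope(requested_paths: set[str], touched_paths: set[str]) -> bool:
    # Passing tests is not enough; the change must stay inside the requested files.
    return not touched_paths.issubset(requested_paths)

requested = {"src/parser.py", "tests/test_parser.py"}
touched = {"src/parser.py", "src/config.py"}  # the agent also edited config

if exceeds_scope(requested, touched):
    print("FLAG: change exceeds requested file scope; do not approve on test results alone")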
If you already use a rule file or workspace guidance, keep the eval policy close to it. Cursor’s docs are a reminder that teams often manage agent behavior through local project instructions and workspace conventions, not only through a central platform. That makes governance easier to adopt, but only if the rules are short enough to be read and enforced.
The next step is review discipline. Evals become credible when someone other than the author can reproduce the result from the same dataset, rubric, and model snapshot. That means every run needs enough metadata to answer four questions: what changed, what was tested, what model or tool version ran, and who approved the change. If any of those are missing, the score should not be used as a release signal.
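That completeness check is easy to automate. Here is a sketch, assuming run metadata is stored as a flat record; the field names are placeholders for whatever your platform records.
# release_gate.py (illustrative sketch)
REQUIRED_FIELDS = ("what_changed", "dataset_version", "model_version", "approved_by")

def usable_as_release_signal(run: dict) -> bool:
    missing = [field for field in REQUIRED_FIELDS if not run.get(field)]
    if missing:
        print("Not a release signal; missing: " + ", ".join(missing))
        return False
    return True

usable_as_release_signal({
    "what_changed": "rubric v1 -> v2",
    "dataset_version": "v3",
    "model_version": "gpt-5.5",
    "approved_by": "",  # no approver recorded, so the score cannot gate a release
})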
A practical operating model is to treat eval changes like code changes: small diffs, explicit review, and a clear owner. That is also where our methodology is useful: in the Review step, the goal is not to admire the result but to check whether the evidence is complete enough to trust.
Tradeoffs and limits
Evals are expensive to keep honest. The more teams use them, the more pressure there is to optimize the benchmark instead of the product. If the same group writes the rubric, curates the dataset, and interprets the score, the platform can drift into self-justification.
There is also a maintenance cost. Versioned datasets, label audits, and boundary checks add work every time a model or tool changes. That overhead is real, but it is usually cheaper than shipping a misleading score to multiple teams.
Another limit is that no eval platform fully captures real-world use. Agentic coding systems fail in context, with permissions, partial information, and human review. So the platform should be treated as a decision aid, not a substitute for production observation.
Further reading

Eval platforms need governance
Eval platforms need versioned data, clear rubrics, and review gates to stay useful across teams.

Fast Evals for Better Decisions
Small, quick evals that fit the edit loop and support real coding decisions.

Agent Boundaries for Teams
Set clear read/write and tool limits for agentic coding across IDEs, CLIs, and shared tools.