
Composer 2 for Cursor Teams

How Composer 2 fits Cursor subagents, skills, and team workflows, with practical evals, rules, and review checks.

Rogier Muller · May 8, 2026 · 5 min read

The situation

Field note: Composer benchmarks tempt teams to skip eval design—we install a tiny repo-specific suite before recommending model swaps so reviewers measure regressions, not vibes.

Official anchors: Composer 2 post, Cursor Agent docs.

Cursor’s Composer 2 release is best read as a measurement story, not just a model announcement. The official post says Composer 2 is frontier-level for coding, priced at $0.50/M input and $2.50/M output tokens, and backed by a technical report on training and benchmarks. For Cursor users, that matters because model quality only becomes useful when it is paired with the right Cursor rules, subagents, and reviewable team workflows.
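For budgeting, a quick worked example at those list prices (the task size is illustrative, not from the post): a task consuming 400k input and 60k output tokens costs 0.4 × $0.50 + 0.06 × $2.50 = $0.35. At that scale, review time rather than token spend is usually the dominant team cost.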

The release also points to long-horizon coding tasks, benchmark gains, and continued pretraining before reinforcement learning. For engineering teams, the practical question is simple: what should change in the repo this week so agent work is easier to trust?

If you are building Cursor training for a team, the answer is usually not “use the newest model everywhere.” It is to tighten the operating surface around it: scoped rules, explicit handoffs, and a small evaluation loop that shows whether the model is helping on your codebase. See the related training topic on Cursor subagents and skills.

Walkthrough

Start with the smallest measurable unit: one repo task, one rule scope, one review path. Composer 2 is useful when the task is already well-framed. That means your first job is to reduce ambiguity in the files Cursor reads automatically.

A practical Cursor setup usually has three layers:

  1. Project rules for local conventions.
  2. Team memory for durable repo boundaries.
  3. Task-specific delegation for work that should be isolated in a Cursor subagent.

If your team still has a large, flat rule file, split it. Cursor’s current rule model is built around scoped .mdc files rather than a single monolith. Keep the rule close to the code it governs.

---
description: API route conventions for billing code
globs: app/billing/**,app/api/billing/**
alwaysApply: false
---

- Keep billing handlers small and side-effect free.
- Add tests for any change to pricing, retries, or idempotency.
- If a change touches external calls, ask for a review checklist before merge.
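
Where these scoped rules live matters as much as what they say. A typical layout keeps each .mdc under .cursor/rules/ next to the area it governs (file names here are illustrative):

.cursor/
  rules/
    billing-api.mdc    # the scoped rule above
    frontend.mdc       # other area-specific conventions
AGENTS.md              # repo-wide constraints, covered next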

Use AGENTS.md for repo-wide constraints that should survive across tasks and tools. The point is not to write more instructions; it is to make the instructions easier to verify.

# AGENTS.md

## Repo boundaries
- Do not change deployment config without explicit review.
- Prefer small diffs over broad refactors.
- When a task spans services, delegate the service-specific work to a subagent and return a summary.

## Review rule
- Every agent-authored PR must include a short note on tests run, files touched, and known risks.

Then define what you will measure. Composer 2’s release highlights benchmark gains, but your team needs repo-specific checks. A good starter eval set is small: 10 to 20 representative tasks, each with a clear pass/fail criterion. Include at least one long-horizon task, because that is where agent workflows often fail first.
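
One way to keep that suite honest is to encode each task as data with an executable pass/fail check, so "did it work" is a command exit code rather than a reviewer impression. A minimal sketch in Python; the file layout and field names are assumptions, not a Cursor API:

# run_evals.py - tiny repo-specific eval runner (illustrative sketch)
import json
import subprocess
import sys

def run_task(task):
    # Each task carries a shell command whose exit code is the pass/fail
    # criterion, e.g. a pytest selection or a lint check on touched files.
    result = subprocess.run(task["check_cmd"], shell=True)
    return result.returncode == 0

def main():
    with open("evals/tasks.json") as f:
        tasks = json.load(f)  # 10-20 entries: {"id", "prompt", "check_cmd"}
    failures = [t["id"] for t in tasks if not run_task(t)]
    print(f"{len(tasks) - len(failures)}/{len(tasks)} passed")
    if failures:
        print("failed:", ", ".join(failures))
        sys.exit(1)

if __name__ == "__main__":
    main()

Run it after every model or rule change, and keep the long-horizon task in the same file so regressions surface in one place.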

Team artifact

Composer rollout spreadsheet rows reviewers fill in (a sample header row follows the list):

  • Task ID + Composer tier/version captured
  • Rule file (.mdc) globs attached for the files touched
  • Regression signal (flaky test count, retry count) logged weekly
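
Kept as a plain CSV in the repo, the sheet can start as small as this (column names are suggestions, not a required schema):

task_id,composer_version,rule_globs,flaky_tests,retries,review_minutes,week
BILL-142,composer-2,app/billing/**,0,1,12,2026-W19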

What to verify

  • Did the agent follow the scoped rule file for the touched area?
  • Did the subagent return a summary that a reviewer can inspect quickly?
  • Did the diff stay within the intended files? (a scripted check is sketched after this list)
  • Did tests or lint catch the risky part of the change?
  • Would a teammate accept the result without re-running the whole task?
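
The diff-scope question is the easiest to automate. A sketch, assuming the agent's branch is checked out with main as the base; the allowlist is illustrative:

# check_diff_scope.py - fail when an agent branch edits files outside the allowed globs
import fnmatch
import subprocess
import sys

ALLOWED = ["app/billing/**", "app/api/billing/**", "tests/billing/**"]  # illustrative

def touched_files(base="main"):
    out = subprocess.run(["git", "diff", "--name-only", f"{base}...HEAD"],
                         capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line]

def main():
    # Note: fnmatch treats * loosely (it can cross / boundaries), which is
    # close enough for an allowlist check like this one.
    out_of_scope = [f for f in touched_files()
                    if not any(fnmatch.fnmatch(f, g) for g in ALLOWED)]
    if out_of_scope:
        print("out-of-scope edits:", *out_of_scope, sep="\n  ")
        sys.exit(1)
    print("diff stayed within the intended files")

if __name__ == "__main__":
    main()

Wire it into CI for agent-authored branches and the third checklist item answers itself.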

This is where measurement becomes practical. Track not only task success, but also correction rate, number of follow-up prompts, and review time. If Composer 2 reduces retries but increases overreach, that is still a useful finding. If it improves code quality but makes diffs harder to review, your team may need stricter delegation boundaries.
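
If the rollout sheet above is the log, the weekly roll-up is a few lines of Python (column names match the illustrative CSV header earlier):

# weekly_rollup.py - aggregate reviewer-logged rows into per-week signals
import csv
from collections import defaultdict
from statistics import median

weeks = defaultdict(list)
with open("composer_rollout.csv") as f:
    for row in csv.DictReader(f):
        weeks[row["week"]].append(row)

for week, rows in sorted(weeks.items()):
    retries = sum(int(r["retries"]) for r in rows)
    review = median(int(r["review_minutes"]) for r in rows)
    print(f"{week}: {len(rows)} tasks, {retries} retries, median review {review} min")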

A short methodology note: in the Review step, prefer evidence from the diff and test output over model confidence. That keeps Cursor training grounded in artifacts instead of impressions.

Tradeoffs and limits

Composer 2’s benchmark gains do not remove the need for guardrails. A stronger model can still produce broad edits, miss repo-specific conventions, or overfit to the prompt if your rules are vague. The release notes point to better coding intelligence, not automatic correctness.

There is also a measurement trap: if your evals are too synthetic, they will reward polished-looking answers instead of useful repo work. Keep at least part of the set tied to real tasks from your engineering team, such as a flaky test fix, a scoped refactor, or a documentation update with code impact.

Cursor subagents help with isolation, but they can also hide context if the handoff is too thin. Summaries need to be specific enough for review, and the parent task still needs enough context to decide whether the result is safe to merge.

Finally, do not treat token price as the only cost. In team workflows, the bigger cost is often review time, rework, and unclear ownership. A cheaper model that produces harder-to-review diffs can be more expensive in practice.
