# Evaluating AI Coding Agents in Practice
## The Problem With Most Agent Evaluations
AI coding agent evaluations often look rigorous right up until you ask what they actually measure.
Usually it is some mix of:
- toy tasks
- optimistic success definitions
- vague human judgments
- cherry-picked demos
- benchmarks that reward local cleverness over production usefulness
That does not tell you whether an agent is genuinely helpful in real engineering work.
It tells you whether the agent can look good in a controlled presentation.
Those are different things.
## What I Actually Want to Know

If I am evaluating an AI coding agent, I care about questions like these:
- Can it complete meaningful work in a real repository?
- Can it operate under realistic constraints and incomplete context?
- When it fails, how does it fail?
- How much human steering does it really need?
- Does it reduce delivery time without increasing hidden risk?
That is a much harder evaluation problem than "did it produce code that seems plausible?"
## The Wrong Unit of Measurement
One of the biggest mistakes in agent evaluation is using a unit of work that is too small.
If the task is basically:
- rename a variable
- write a helper function
- fix a trivial lint issue
then the agent may look excellent while telling you nothing important.
The useful unit is not a micro-edit. It is a meaningful engineering slice.
Examples:
- implement a feature behind an existing API boundary
- add a missing negative test matrix for a controller
- fix a bug report end-to-end and prove the fix
- refactor a flow without changing behavior and keep tests green
Those tasks force the agent to navigate context, ambiguity, verification, and tradeoffs.
That is where the real signal starts.
## Good Tasks Are Boringly Real
The best evaluation tasks are usually not flashy.
They are the kind of work a decent engineer actually gets assigned:
- update an existing feature
- wire a new endpoint into existing patterns
- add tests around a failure mode
- fix a deployment-blocking config issue
- improve a fragile setup script
I want tasks that expose whether the agent can live inside existing engineering reality.
That means:
- established code patterns
- existing tests
- mixed-quality documentation
- nontrivial constraints
- a dirty worktree sometimes
If the task does not feel like something that could show up in a team's backlog, it probably will not predict production usefulness very well.
## Success Needs a Hard Definition
This is where many evaluations collapse.
The task is declared a success because the output looks directionally right or because the agent got "most of the way there."
That is not strong enough.
For each task, I want explicit acceptance criteria.
Task: Add pagination support to the orders endpoint.
Success means:
- endpoint accepts page and pageSize
- response includes stable pagination metadata
- existing filters still work
- tests cover happy path and boundary cases
- lint and typecheck pass

Now the task can be scored against something real.
Without that, humans tend to over-credit the agent for effort instead of outcome.
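One way to make that concrete is to encode the acceptance criteria as named boolean checks, so the task passes only when every criterion holds. This is a sketch; the criterion names are illustrative, not a fixed schema:

```python
# Sketch: score a task against explicit acceptance criteria.
# Criterion names are illustrative, not a fixed schema.

def score_task(criteria_results: dict[str, bool]) -> str:
    """A task passes only if every acceptance criterion holds."""
    return "pass" if all(criteria_results.values()) else "fail"

results = {
    "accepts page and pageSize": True,
    "stable pagination metadata": True,
    "existing filters still work": True,
    "happy path and boundary tests": False,  # one miss fails the whole task
    "lint and typecheck pass": True,
}
```

The point is the `all()`: "most of the way there" still scores as a failure.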
## The Most Important Metric Is Intervention
People love binary scores:
- solved
- not solved
That is useful, but it is not enough.
The metric I care about most is intervention load.
How many times did a human need to step in, and what kind of help was required?
Examples:
- clarified the task once
- pointed the agent at the correct file
- explained a repo-specific convention
- corrected a wrong assumption about a data model
- stopped a risky or irrelevant action
That tells you a lot more about operational value than a simple pass/fail.
Two agents can both finish a task, but if one needed six rescue prompts and the other needed none, those are not equivalent results.
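A minimal way to make intervention load first-class is to record every intervention as data and tally by kind. The kind labels below are assumptions for illustration, not a standard taxonomy:

```python
from dataclasses import dataclass

@dataclass
class Intervention:
    minute: int
    kind: str  # e.g. "context", "assumption", "safety" (labels are illustrative)
    note: str

def intervention_load(interventions: list[Intervention]) -> dict[str, int]:
    """Tally interventions by kind; two passing runs with different
    tallies are not equivalent results."""
    counts: dict[str, int] = {}
    for item in interventions:
        counts[item.kind] = counts.get(item.kind, 0) + 1
    return counts
```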
## Failure Taxonomy Matters
When an agent fails, I want a precise reason, not just a miss.
I typically bucket failures into categories like:
- context failure: did not find or use relevant information
- planning failure: took the wrong approach despite enough context
- execution failure: correct plan, bad implementation
- verification failure: changed code but did not prove it correctly
- boundary failure: touched the wrong scope or ignored constraints
This matters because different failures imply different fixes.
If most misses are context failures, maybe retrieval or tool design is the problem.
If most misses are verification failures, maybe the agent needs stronger testing workflows or better incentives around proof.
If most misses are boundary failures, you probably need guardrails, not a smarter model.
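The taxonomy above can be pinned down as an enum so every miss lands in exactly one bucket; `dominant_failure` is a hypothetical helper for spotting where to invest fixes:

```python
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    CONTEXT = "context"            # did not find or use relevant information
    PLANNING = "planning"          # wrong approach despite enough context
    EXECUTION = "execution"        # correct plan, bad implementation
    VERIFICATION = "verification"  # changed code but did not prove it correct
    BOUNDARY = "boundary"          # wrong scope or ignored constraints

def dominant_failure(failures: list[FailureMode]) -> FailureMode:
    """The most common failure mode suggests where to invest fixes."""
    return Counter(failures).most_common(1)[0][0]
```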
## Repository Quality Affects the Result
This is another thing people understate.
An agent is not operating in a vacuum. It is operating in your system.
If the repository has:
- inconsistent structure
- weak naming
- poor tests
- hidden conventions
- misleading docs
then some evaluation failures are really repo failures exposed by an agent.
That is not an excuse. It is part of the truth.
In practice, useful evaluation needs to account for the quality of the environment as well as the quality of the agent.
## Time Is Not Just Wall Clock
People often ask whether the agent was faster.
That is fine, but raw elapsed time can be misleading.
I want to separate:
- agent working time
- human waiting time
- human intervention time
- verification time
An agent that produces code quickly but dumps a giant verification burden back on the user is not really saving that much.
The workflow needs to be evaluated end-to-end.
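One way to keep that separation honest is to refuse to report a total until all four segments have been measured. A sketch, with segment names assumed for illustration:

```python
# Segment names are assumptions matching the four categories above.
REQUIRED_SEGMENTS = {
    "agent_working", "human_waiting", "human_intervention", "verification",
}

def end_to_end_minutes(segments: dict[str, float]) -> float:
    """Total the whole workflow; refuse to report a number if any
    time segment was not measured."""
    missing = REQUIRED_SEGMENTS - segments.keys()
    if missing:
        raise ValueError(f"unmeasured time segments: {sorted(missing)}")
    return sum(segments.values())
```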
## The Evaluation Harness Should Look Like Work
I like a harness that captures:
- task prompt
- acceptance criteria
- available tools
- repository snapshot
- interventions with timestamps
- final result and verification output
- failure category if applicable
This can be extremely simple.
```json
{
  "taskId": "agent-eval-014",
  "task": "Add pagination to orders endpoint",
  "acceptanceCriteria": [
    "supports page and pageSize",
    "returns pagination metadata",
    "existing filters unchanged",
    "tests added",
    "build passes"
  ],
  "interventions": [
    {
      "minute": 9,
      "type": "context",
      "note": "Pointed agent to existing pagination helper"
    }
  ],
  "result": "pass_with_intervention",
  "verification": {
    "tests": "pass",
    "build": "pass"
  },
  "failureMode": null
}
```

This is enough structure to learn from.
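A record like that is also easy to validate on ingestion. A minimal sketch, using the field names from the example above, that rejects incomplete records:

```python
import json

REQUIRED_FIELDS = {
    "taskId", "task", "acceptanceCriteria",
    "interventions", "result", "verification", "failureMode",
}

def load_record(raw: str) -> dict:
    """Parse one harness record; reject it if any field is missing,
    so gaps in the data surface at write time, not analysis time."""
    record = json.loads(raw)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"incomplete record, missing: {sorted(missing)}")
    return record
```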
## You Need a Mix of Task Shapes
A single category of task will distort the result.
I want a portfolio like this:
| Task type | What it tests |
|---|---|
| bug fix | diagnosis and verification |
| feature addition | planning and integration |
| refactor | boundary discipline |
| test-writing | behavioral understanding |
| config/devx fix | environment reasoning |
That produces a much more useful picture of the agent's actual range.
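If each harness record also carries a task type (a `taskType` field is assumed here; it is not part of the earlier record example), the portfolio view falls out of a simple aggregation:

```python
def pass_rate_by_type(records: list[dict]) -> dict[str, float]:
    """Pass rate per task type, counting pass_with_intervention as a pass.
    Assumes each record carries a hypothetical 'taskType' field."""
    totals: dict[str, tuple[int, int]] = {}
    for record in records:
        passed, seen = totals.get(record["taskType"], (0, 0))
        passed += record["result"].startswith("pass")
        totals[record["taskType"]] = (passed, seen + 1)
    return {t: passed / seen for t, (passed, seen) in totals.items()}
```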
## Benchmarks Are Fine, But They Are Not Enough
I am not anti-benchmark. Benchmarks are good for comparability.
The problem is when benchmarks become the whole story.
Benchmarks rarely capture:
- messy repos
- ambiguous requirements
- changing worktrees
- weak documentation
- long-running verification loops
- the cost of human rescue
Real work does.
So for teams evaluating agents for actual use, I think the right model is:
benchmark for broad signal, repository tasks for operational truth.
## My Practical Evaluation Rules

If I were setting up an internal evaluation today, I would use these rules:
- Use tasks from real repos, not toy repos.
- Require explicit acceptance criteria per task.
- Track human interventions, not just outcomes.
- Record failure categories with discipline.
- Separate code generation from verification burden.
- Mix bug, feature, refactor, test, and config tasks.
- Review results for system fixes, not just model rankings.
That last one matters a lot.
Sometimes the right conclusion is not "replace the model."
Sometimes the right conclusion is:
- improve repo structure
- tighten tool contracts
- add guardrails
- improve context retrieval
- make verification easier to run
## The Main Takeaway
Most AI coding agent evaluations are too forgiving, too synthetic, or too vague to tell you much.
Useful evaluation is less glamorous.
It starts with:
- real tasks
- hard acceptance criteria
- tracked interventions
- explicit failure modes
- end-to-end verification
That is how you stop measuring optimism and start measuring whether the agent is actually useful in engineering practice.