# Evaluating AI Coding Agents in Practice
## The Problem With Most Agent Evaluations
AI coding agent evaluations often look rigorous right up until you ask what they actually measure.
Usually it is some mix of:
- toy tasks
- optimistic success definitions
- vague human judgments
- cherry-picked demos
- benchmarks that reward local cleverness over production usefulness
That does not tell you whether an agent is genuinely helpful in real engineering work.
It tells you whether the agent can look good in a controlled presentation.
Those are different things.
## What I Actually Want to Know

If I am evaluating an AI coding agent, I care about questions like these:
- Can it complete meaningful work in a real repository?
- Can it operate under realistic constraints and incomplete context?
- When it fails, how does it fail?
- How much human steering does it really need?
- Does it reduce delivery time without increasing hidden risk?
That is a much harder evaluation problem than "did it produce code that seems plausible?"
## The Wrong Unit of Measurement
One of the biggest mistakes in agent evaluation is using a unit of work that is too small.
If the task is basically:
- rename a variable
- write a helper function
- fix a trivial lint issue
then the agent may look excellent while telling you nothing important.
The useful unit is not a micro-edit. It is a meaningful engineering slice.
Examples:
- implement a feature behind an existing API boundary
- add a missing negative test matrix for a controller
- fix a bug report end-to-end and prove the fix
- refactor a flow without changing behavior and keep tests green
Those tasks force the agent to navigate context, ambiguity, verification, and tradeoffs.
That is where the real signal starts.
## Good Tasks Are Boringly Real
The best evaluation tasks are usually not flashy.
They are the kind of work a decent engineer actually gets assigned:
- update an existing feature
- wire a new endpoint into existing patterns
- add tests around a failure mode
- fix a deployment-blocking config issue
- improve a fragile setup script
I want tasks that expose whether the agent can live inside existing engineering reality.
That means:
- established code patterns
- existing tests
- mixed-quality documentation
- nontrivial constraints
- a dirty worktree sometimes
If the task does not feel like something that could show up in a team's backlog, it probably will not predict production usefulness very well.
## Success Needs a Hard Definition
This is where many evaluations collapse.
The task is declared a success because the output looks directionally right or because the agent got "most of the way there."
That is not strong enough.
For each task, I want explicit acceptance criteria.
Task: Add pagination support to the orders endpoint.
Success means:
- endpoint accepts page and pageSize
- response includes stable pagination metadata
- existing filters still work
- tests cover happy path and boundary cases
- lint and typecheck pass

Now the task can be scored against something real.
Without that, humans tend to over-credit the agent for effort instead of outcome.
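One way to make that concrete is to encode the acceptance criteria as named boolean checks, so the task passes only when every criterion holds. This is a sketch; the criterion names are illustrative, not a fixed schema:

```python
# Sketch: score a task against explicit acceptance criteria.
# Criterion names are illustrative, not a fixed schema.

def score_task(criteria_results: dict[str, bool]) -> str:
    """A task passes only if every acceptance criterion holds."""
    return "pass" if all(criteria_results.values()) else "fail"

results = {
    "accepts page and pageSize": True,
    "stable pagination metadata": True,
    "existing filters still work": True,
    "happy path and boundary tests": False,  # one miss fails the whole task
    "lint and typecheck pass": True,
}
```

The point is the `all()`: "most of the way there" still scores as a failure.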
## The Most Important Metric Is Intervention
People love binary scores:
- solved
- not solved
That is useful, but it is not enough.
The metric I care about most is intervention load.
How many times did a human need to step in, and what kind of help was required?
Examples:
- clarified the task once
- pointed the agent at the correct file
- explained a repo-specific convention
- corrected a wrong assumption about a data model
- stopped a risky or irrelevant action
That tells you a lot more about operational value than a simple pass/fail.
Two agents can both finish a task, but if one needed six rescue prompts and the other needed none, those are not equivalent results.
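A minimal way to make intervention load first-class is to record every intervention as data and tally by kind. The kind labels below are assumptions for illustration, not a standard taxonomy:

```python
from dataclasses import dataclass

@dataclass
class Intervention:
    minute: int
    kind: str  # e.g. "context", "assumption", "safety" (labels are illustrative)
    note: str

def intervention_load(interventions: list[Intervention]) -> dict[str, int]:
    """Tally interventions by kind; two passing runs with different
    tallies are not equivalent results."""
    counts: dict[str, int] = {}
    for item in interventions:
        counts[item.kind] = counts.get(item.kind, 0) + 1
    return counts
```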
## Failure Taxonomy Matters
When an agent fails, I want a precise reason, not just a miss.
I typically bucket failures into categories like:
- context failure: did not find or use relevant information
- planning failure: took the wrong approach despite enough context
- execution failure: correct plan, bad implementation
- verification failure: changed code but did not prove it correctly
- boundary failure: touched the wrong scope or ignored constraints
This matters because different failures imply different fixes.
If most misses are context failures, maybe retrieval or tool design is the problem.
If most misses are verification failures, maybe the agent needs stronger testing workflows or better incentives around proof.
If most misses are boundary failures, you probably need guardrails, not a smarter model.
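The taxonomy above can be pinned down as an enum so every miss lands in exactly one bucket; `dominant_failure` is a hypothetical helper for spotting where to invest fixes:

```python
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    CONTEXT = "context"            # did not find or use relevant information
    PLANNING = "planning"          # wrong approach despite enough context
    EXECUTION = "execution"        # correct plan, bad implementation
    VERIFICATION = "verification"  # changed code but did not prove it correct
    BOUNDARY = "boundary"          # wrong scope or ignored constraints

def dominant_failure(failures: list[FailureMode]) -> FailureMode:
    """The most common failure mode suggests where to invest fixes."""
    return Counter(failures).most_common(1)[0][0]
```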
## Repository Quality Affects the Result
This is another thing people understate.
An agent is not operating in a vacuum. It is operating in your system.
If the repository has:
- inconsistent structure
- weak naming
- poor tests
- hidden conventions
- misleading docs
then some evaluation failures are really repo failures exposed by an agent.
That is not an excuse. It is part of the truth.
In practice, useful evaluation needs to account for the quality of the environment as well as the quality of the agent.
## Time Is Not Just Wall Clock
People often ask whether the agent was faster.
That is fine, but raw elapsed time can be misleading.
I want to separate:
- agent working time
- human waiting time
- human intervention time
- verification time
An agent that produces code quickly but dumps a giant verification burden back on the user is not really saving that much.
The workflow needs to be evaluated end-to-end.
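One way to keep that separation honest is to refuse to report a total until all four segments have been measured. A sketch, with segment names assumed for illustration:

```python
# Segment names are assumptions matching the four categories above.
REQUIRED_SEGMENTS = {
    "agent_working", "human_waiting", "human_intervention", "verification",
}

def end_to_end_minutes(segments: dict[str, float]) -> float:
    """Total the whole workflow; refuse to report a number if any
    time segment was not measured."""
    missing = REQUIRED_SEGMENTS - segments.keys()
    if missing:
        raise ValueError(f"unmeasured time segments: {sorted(missing)}")
    return sum(segments.values())
```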
## The Evaluation Harness Should Look Like Work
I like a harness that captures:
- task prompt
- acceptance criteria
- available tools
- repository snapshot
- interventions with timestamps
- final result and verification output
- failure category if applicable
This can be extremely simple.
```json
{
  "taskId": "agent-eval-014",
  "task": "Add pagination to orders endpoint",
  "acceptanceCriteria": [
    "supports page and pageSize",
    "returns pagination metadata",
    "existing filters unchanged",
    "tests added",
    "build passes"
  ],
  "interventions": [
    {
      "minute": 9,
      "type": "context",
      "note": "Pointed agent to existing pagination helper"
    }
  ],
  "result": "pass_with_intervention",
  "verification": {
    "tests": "pass",
    "build": "pass"
  },
  "failureMode": null
}
```

This is enough structure to learn from.
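A record like that is also easy to validate on ingestion. A minimal sketch, using the field names from the example above, that rejects incomplete records:

```python
import json

REQUIRED_FIELDS = {
    "taskId", "task", "acceptanceCriteria",
    "interventions", "result", "verification", "failureMode",
}

def load_record(raw: str) -> dict:
    """Parse one harness record; reject it if any field is missing,
    so gaps in the data surface at write time, not analysis time."""
    record = json.loads(raw)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"incomplete record, missing: {sorted(missing)}")
    return record
```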
## You Need a Mix of Task Shapes
A single category of task will distort the result.
I want a portfolio like this:
| Task type | What it tests |
|---|---|
| bug fix | diagnosis and verification |
| feature addition | planning and integration |
| refactor | boundary discipline |
| test-writing | behavioral understanding |
| config/devx fix | environment reasoning |
That produces a much more useful picture of the agent's actual range.
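If each harness record also carries a task type (a `taskType` field is assumed here; it is not part of the earlier record example), the portfolio view falls out of a simple aggregation:

```python
def pass_rate_by_type(records: list[dict]) -> dict[str, float]:
    """Pass rate per task type, counting pass_with_intervention as a pass.
    Assumes each record carries a hypothetical 'taskType' field."""
    totals: dict[str, tuple[int, int]] = {}
    for record in records:
        passed, seen = totals.get(record["taskType"], (0, 0))
        passed += record["result"].startswith("pass")
        totals[record["taskType"]] = (passed, seen + 1)
    return {t: passed / seen for t, (passed, seen) in totals.items()}
```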
## Benchmarks Are Fine, But They Are Not Enough
I am not anti-benchmark. Benchmarks are good for comparability.
The problem is when benchmarks become the whole story.
Benchmarks rarely capture:
- messy repos
- ambiguous requirements
- changing worktrees
- weak documentation
- long-running verification loops
- the cost of human rescue
Real work does.
So for teams evaluating agents for actual use, I think the right model is:
benchmark for broad signal, repository tasks for operational truth.
## My Practical Evaluation Rules

If I were setting up an internal evaluation today, I would use these rules:
- Use tasks from real repos, not toy repos.
- Require explicit acceptance criteria per task.
- Track human interventions, not just outcomes.
- Record failure categories with discipline.
- Separate code generation from verification burden.
- Mix bug, feature, refactor, test, and config tasks.
- Review results for system fixes, not just model rankings.
That last one matters a lot.
Sometimes the right conclusion is not "replace the model."
Sometimes the right conclusion is:
- improve repo structure
- tighten tool contracts
- add guardrails
- improve context retrieval
- make verification easier to run
## The Main Takeaway
Most AI coding agent evaluations are too forgiving, too synthetic, or too vague to tell you much.
Useful evaluation is less glamorous.
It starts with:
- real tasks
- hard acceptance criteria
- tracked interventions
- explicit failure modes
- end-to-end verification
That is how you stop measuring optimism and start measuring whether the agent is actually useful in engineering practice.