📅 April 6, 2026 · 7 min read

Evaluating AI Coding Agents in Practice

#ai-agents #evaluation #developer-tools #testing #automation

The Problem With Most Agent Evaluations

AI coding agent evaluations often look rigorous right up until you ask what they actually measure.

Usually it is some mix of:

  • toy tasks
  • optimistic success definitions
  • vague human judgments
  • cherry-picked demos
  • benchmarks that reward local cleverness over production usefulness

That does not tell you whether an agent is genuinely helpful in real engineering work.

It tells you whether the agent can look good in a controlled presentation.

Those are different things.

What I Actually Want to Know

If I am evaluating an AI coding agent, I care about questions like these.

  1. Can it complete meaningful work in a real repository?
  2. Can it operate under realistic constraints and incomplete context?
  3. When it fails, how does it fail?
  4. How much human steering does it really need?
  5. Does it reduce delivery time without increasing hidden risk?

That is a much harder evaluation problem than "did it produce code that seems plausible?"

The Wrong Unit of Measurement

One of the biggest mistakes in agent evaluation is using a unit of work that is too small.

If the task is basically:

  • rename a variable
  • write a helper function
  • fix a trivial lint issue

then the agent may look excellent while telling you nothing important.

The useful unit is not a micro-edit. It is a meaningful engineering slice.

Examples:

  • implement a feature behind an existing API boundary
  • add a missing negative test matrix for a controller
  • fix a bug report end-to-end and prove the fix
  • refactor a flow without changing behavior and keep tests green

Those tasks force the agent to navigate context, ambiguity, verification, and tradeoffs.

That is where the real signal starts.

Good Tasks Are Boringly Real

The best evaluation tasks are usually not flashy.

They are the kind of work a decent engineer actually gets assigned:

  • update an existing feature
  • wire a new endpoint into existing patterns
  • add tests around a failure mode
  • fix a deployment-blocking config issue
  • improve a fragile setup script

I want tasks that expose whether the agent can live inside existing engineering reality.

That means:

  • established code patterns
  • existing tests
  • mixed-quality documentation
  • nontrivial constraints
  • a dirty worktree sometimes

If the task does not feel like something that could show up in a team's backlog, it probably will not predict production usefulness very well.

Success Needs a Hard Definition

This is where many evaluations collapse.

The task is declared a success because the output looks directionally right or because the agent got "most of the way there."

That is not strong enough.

For each task, I want explicit acceptance criteria.

```text
Task: Add pagination support to the orders endpoint.

Success means:
- endpoint accepts page and pageSize
- response includes stable pagination metadata
- existing filters still work
- tests cover happy path and boundary cases
- lint and typecheck pass
```

Now the task can be scored against something real.

Without that, humans tend to over-credit the agent for effort instead of outcome.
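One way to keep that discipline is to score each criterion as a strict boolean with no partial credit. A minimal sketch in Python (the criterion names are illustrative, taken from the pagination example, not from any particular harness):

```python
# Score a task against explicit acceptance criteria.
# Every criterion is pass/fail; there is no partial credit,
# which prevents over-crediting effort instead of outcome.

def score_task(criteria_results: dict[str, bool]) -> str:
    """Return 'pass' only if every acceptance criterion holds."""
    if not criteria_results:
        raise ValueError("a task needs explicit acceptance criteria")
    return "pass" if all(criteria_results.values()) else "fail"

# Hypothetical results for the pagination task above.
results = {
    "accepts page and pageSize": True,
    "stable pagination metadata": True,
    "existing filters still work": True,
    "tests cover happy path and boundaries": False,  # boundary tests missing
    "lint and typecheck pass": True,
}
print(score_task(results))  # fail: "most of the way there" is not success
```

The point of the all-or-nothing rule is exactly the one above: "directionally right" scores the same as "not done."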

The Most Important Metric Is Intervention

People love binary scores:

  • solved
  • not solved

That is useful, but it is not enough.

The metric I care about most is intervention load.

How many times did a human need to step in, and what kind of help was required?

Examples:

  • clarified the task once
  • pointed the agent at the correct file
  • explained a repo-specific convention
  • corrected a wrong assumption about a data model
  • stopped a risky or irrelevant action

That tells you a lot more about operational value than a simple pass/fail.

Two agents can both finish a task, but if one needed six rescue prompts and the other needed none, those are not equivalent results.
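A small intervention log is enough to separate those two runs. A sketch, with intervention kinds that are illustrative rather than a fixed taxonomy:

```python
from dataclasses import dataclass

@dataclass
class Intervention:
    minute: int
    kind: str   # e.g. "context", "convention", "assumption", "safety"
    note: str

def intervention_load(interventions: list[Intervention]) -> dict[str, int]:
    """Count interventions by kind so two 'passes' can be compared fairly."""
    load: dict[str, int] = {}
    for item in interventions:
        load[item.kind] = load.get(item.kind, 0) + 1
    return load

run_a: list[Intervention] = []  # solved unaided
run_b = [
    Intervention(4, "context", "pointed agent at the correct file"),
    Intervention(9, "assumption", "corrected data-model assumption"),
    Intervention(15, "safety", "stopped an irrelevant migration"),
]
print(intervention_load(run_b))  # {'context': 1, 'assumption': 1, 'safety': 1}
```

Both runs can report "solved," but only the log shows that one of them needed three rescues.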

Failure Taxonomy Matters

When an agent fails, I want a precise reason, not just a miss.

I typically bucket failures into categories like:

  • context failure: did not find or use relevant information
  • planning failure: took the wrong approach despite enough context
  • execution failure: correct plan, bad implementation
  • verification failure: changed code but did not prove it correctly
  • boundary failure: touched the wrong scope or ignored constraints

This matters because different failures imply different fixes.

If most misses are context failures, maybe retrieval or tool design is the problem.

If most misses are verification failures, maybe the agent needs stronger testing workflows or better incentives around proof.

If most misses are boundary failures, you probably need guardrails, not a smarter model.
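That category-to-fix mapping can be made mechanical. A sketch that tallies recorded failures and reports the fix implied by the dominant category (the suggested-fix strings are my paraphrases of the points above, not canonical labels):

```python
from collections import Counter

# Map each failure category to the kind of fix it usually implies.
SUGGESTED_FIX = {
    "context": "improve retrieval or tool design",
    "planning": "improve task framing and available context",
    "execution": "tighten code-level feedback loops",
    "verification": "strengthen testing workflows and proof incentives",
    "boundary": "add guardrails, not a smarter model",
}

def diagnose(failures: list[str]) -> str:
    """Return the suggested fix for the most common failure category."""
    if not failures:
        return "no failures recorded"
    dominant, _count = Counter(failures).most_common(1)[0]
    return SUGGESTED_FIX[dominant]

print(diagnose(["context", "verification", "context", "context"]))
# → improve retrieval or tool design
```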

Repository Quality Affects the Result

This is another thing people understate.

An agent is not operating in a vacuum. It is operating in your system.

If the repository has:

  • inconsistent structure
  • weak naming
  • poor tests
  • hidden conventions
  • misleading docs

then some evaluation failures are really repo failures exposed by an agent.

That is not an excuse. It is part of the truth.

In practice, useful evaluation needs to account for the quality of the environment as well as the quality of the agent.

Time Is Not Just Wall Clock

People often ask whether the agent was faster.

That is fine, but raw elapsed time can be misleading.

I want to separate:

  • agent working time
  • human waiting time
  • human intervention time
  • verification time

An agent that produces code quickly but dumps a giant verification burden back on the user is not really saving that much.

The workflow needs to be evaluated end-to-end.
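Breaking elapsed time into those four buckets makes the hidden burden visible. A sketch with made-up numbers:

```python
def workflow_cost(times: dict[str, float]) -> dict[str, float]:
    """Break total elapsed time into shares, so a fast generator with a
    heavy verification burden does not masquerade as a time saver."""
    total = sum(times.values())
    return {k: round(v / total, 2) for k, v in times.items()}

# Hypothetical run: the agent was quick, but verification dominated.
run = {
    "agent_working_min": 12,
    "human_waiting_min": 3,
    "human_intervention_min": 5,
    "verification_min": 40,
}
shares = workflow_cost(run)
print(shares["verification_min"])  # 0.67 of the whole workflow
```

In this made-up run the agent "worked" for only 12 of 60 minutes; the headline "the agent wrote it in 12 minutes" would hide two thirds of the real cost.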

The Evaluation Harness Should Look Like Work

I like a harness that captures:

  • task prompt
  • acceptance criteria
  • available tools
  • repository snapshot
  • interventions with timestamps
  • final result and verification output
  • failure category if applicable

This can be extremely simple.

```json
{
  "taskId": "agent-eval-014",
  "task": "Add pagination to orders endpoint",
  "acceptanceCriteria": [
    "supports page and pageSize",
    "returns pagination metadata",
    "existing filters unchanged",
    "tests added",
    "build passes"
  ],
  "interventions": [
    {
      "minute": 9,
      "type": "context",
      "note": "Pointed agent to existing pagination helper"
    }
  ],
  "result": "pass_with_intervention",
  "verification": {
    "tests": "pass",
    "build": "pass"
  },
  "failureMode": null
}
```

This is enough structure to learn from.
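Records this simple are also easy to validate before they enter the dataset. A sketch of a loader that rejects incomplete runs (the required-field set mirrors the record above; the rejection policy is my assumption):

```python
import json

REQUIRED = {"taskId", "task", "acceptanceCriteria",
            "interventions", "result", "verification"}

def load_record(raw: str) -> dict:
    """Parse one harness record; reject it if a required field is missing,
    so half-logged runs never silently skew the results."""
    record = json.loads(raw)
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"incomplete record, missing: {sorted(missing)}")
    return record

raw = ('{"taskId": "agent-eval-014", '
       '"task": "Add pagination to orders endpoint", '
       '"acceptanceCriteria": ["tests added"], "interventions": [], '
       '"result": "pass_with_intervention", '
       '"verification": {"tests": "pass", "build": "pass"}}')
print(load_record(raw)["result"])  # pass_with_intervention
```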

You Need a Mix of Task Shapes

A single category of task will distort the result.

I want a portfolio like this:

| Task type | What it tests |
| --- | --- |
| bug fix | diagnosis and verification |
| feature addition | planning and integration |
| refactor | boundary discipline |
| test-writing | behavioral understanding |
| config/devx fix | environment reasoning |

That produces a much more useful picture of the agent's actual range.
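A portfolio check is a one-liner worth automating. A sketch, using the five task shapes from the table above:

```python
# Check that an evaluation run covers every task shape in the portfolio.
EXPECTED_SHAPES = {"bug fix", "feature addition", "refactor",
                   "test-writing", "config/devx fix"}

def portfolio_gaps(task_types: list[str]) -> set[str]:
    """Return the task shapes the evaluation run is missing."""
    return EXPECTED_SHAPES - set(task_types)

# A run skewed toward bug fixes, with no config/devx task at all.
run = ["bug fix", "bug fix", "feature addition", "refactor", "test-writing"]
print(portfolio_gaps(run))  # {'config/devx fix'}
```

If the gap set is nonempty, the headline score is measuring a narrower agent than the one you will actually deploy.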

Benchmarks Are Fine, But They Are Not Enough

I am not anti-benchmark. Benchmarks are good for comparability.

The problem is when benchmarks become the whole story.

Benchmarks rarely capture:

  • messy repos
  • ambiguous requirements
  • changing worktrees
  • weak documentation
  • long-running verification loops
  • the cost of human rescue

Real work does.

So for teams evaluating agents for actual use, I think the right model is:

benchmark for broad signal, repository tasks for operational truth.

My Practical Evaluation Rules

If I were setting up an internal evaluation today, I would use these rules.

  1. Use tasks from real repos, not toy repos.
  2. Require explicit acceptance criteria per task.
  3. Track human interventions, not just outcomes.
  4. Record failure categories with discipline.
  5. Separate code generation from verification burden.
  6. Mix bug, feature, refactor, test, and config tasks.
  7. Review results for system fixes, not just model rankings.

That last one matters a lot.

Sometimes the right conclusion is not "replace the model."

Sometimes the right conclusion is:

  • improve repo structure
  • tighten tool contracts
  • add guardrails
  • improve context retrieval
  • make verification easier to run

The Main Takeaway

Most AI coding agent evaluations are too forgiving, too synthetic, or too vague to tell you much.

Useful evaluation is less glamorous.

It starts with:

  • real tasks
  • hard acceptance criteria
  • tracked interventions
  • explicit failure modes
  • end-to-end verification

That is how you stop measuring optimism and start measuring whether the agent is actually useful in engineering practice.