Agent evaluation and observability: how to measure an AI coworker in production
An agent in production doesn't have unit tests; it has evaluation sets. How to make an AI coworker measurable without reading through every run by hand.
An agent without measurement is a promise with a logo
A traditional application is measured on functional tests and uptime. An AI coworker isn't a traditional application. It makes decisions on data that's a little different every day, with models that are periodically updated and tools that change too. Without ongoing measurement, "it works" is a feeling, not a fact.
That doesn't have to be complicated, but it does have to be a fixed part of the work. The question isn't whether you need evaluation and observability, but what you minimally need to capture in order to steer.
The four metrics everyone needs
Four metrics form the base of any serious agent measurement. They're relatively easy to capture and together give a good picture of whether an agent is healthy; a minimal sketch of how to compute them follows the list below.
- Success rate: the share of runs that finish without error or escalation. Track it per agent and per input type.
- Tool fidelity: the share of tool calls that match what the agent says it's doing, validated against a ground-truth schema.
- Escalation rate: the share of runs deliberately handed to a human. A healthy escalation rate is greater than zero; an agent that never escalates is usually swallowing cases it should hand off.
- P95 duration: how long the slowest five percent of runs take, i.e. the time within which 95 percent of runs finish. A spike there is almost always a signal.
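As a rough illustration, the sketch below computes all four metrics from a list of run records. The Run schema and its field names are assumptions for the example, not a standard; your runtime will log its own fields.

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class Run:
    """One finished agent run as logged by the runtime (illustrative schema)."""
    agent: str
    succeeded: bool        # finished without error or escalation
    escalated: bool        # deliberately handed to a human
    tool_calls_total: int
    tool_calls_valid: int  # calls that matched the ground-truth schema
    duration_s: float

def base_metrics(runs: list[Run]) -> dict[str, float]:
    """The four base metrics over a set of runs for one agent."""
    n = len(runs)
    tool_total = sum(r.tool_calls_total for r in runs)
    durations = sorted(r.duration_s for r in runs)
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,
        "tool_fidelity": sum(r.tool_calls_valid for r in runs) / max(tool_total, 1),
        "escalation_rate": sum(r.escalated for r in runs) / n,
        # nearest-rank p95: 95 percent of runs finish within this many seconds
        "p95_duration_s": durations[min(ceil(0.95 * n) - 1, n - 1)],
    }
```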
An evaluation set, not just logs
Logs tell you what happened, not whether it was correct. An evaluation set is fifty to two hundred realistic cases with an expected outcome. On every prompt, tool, or model change you rerun the set and see at a glance which regressions slipped in. For a serious agent in production an evaluation set isn't a luxury, it's a precondition.
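A minimal sketch of what running such a set can look like, assuming a run_agent() callable that returns the agent's final answer and cases stored as JSON lines with "input" and "expected" fields (all names are illustrative):

```python
import json

def load_cases(path: str) -> list[dict]:
    """Each case: {"id": ..., "input": ..., "expected": ...} on its own line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def evaluate(run_agent, cases: list[dict]) -> dict:
    """Run every case and collect the ones that miss their expected outcome."""
    failures = []
    for case in cases:
        got = run_agent(case["input"])
        if not meets_expectation(got, case["expected"]):
            failures.append({"id": case["id"], "expected": case["expected"], "got": got})
    return {
        "pass_rate": 1 - len(failures) / len(cases),
        "failures": failures,  # review these before shipping the change
    }

def meets_expectation(got: str, expected: str) -> bool:
    # Deliberately crude (substring match); swap in a scoring function or
    # an LLM-as-judge per case type as the set matures.
    return expected.lower() in got.lower()
```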
What a dashboard needs to show
An agent dashboard doesn't need many tabs. Per agent: success rate over 7 and 30 days, p95 duration, escalations, cost per run, and a list of the most recent failed runs with direct links to the trace. For a process owner that's enough to steer on without having to read every log.
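One way to produce exactly that view is a small rollup over the run log. The sketch below assumes runs land in a pandas DataFrame with columns like started_at, succeeded, escalated, duration_s, cost_eur and trace_url; those names are placeholders for whatever your logging actually writes.

```python
import pandas as pd

def dashboard_rollup(runs: pd.DataFrame, agent: str) -> dict:
    """Per-agent dashboard numbers over 7- and 30-day windows, plus recent failures."""
    df = runs[runs["agent"] == agent]
    now = pd.Timestamp.now(tz="UTC")

    def window(days: int) -> pd.DataFrame:
        return df[df["started_at"] >= now - pd.Timedelta(days=days)]

    def summary(w: pd.DataFrame) -> dict:
        return {
            "success_rate": w["succeeded"].mean(),
            "escalation_rate": w["escalated"].mean(),
            "p95_duration_s": w["duration_s"].quantile(0.95),
            "cost_per_run_eur": w["cost_eur"].mean(),
        }

    recent_failures = (
        df[~df["succeeded"]]
        .sort_values("started_at", ascending=False)
        .head(10)[["run_id", "started_at", "trace_url"]]  # direct link to the trace
    )
    return {
        "7d": summary(window(7)),
        "30d": summary(window(30)),
        "recent_failures": recent_failures.to_dict("records"),
    }
```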
Traces for the hard work
For the cases that did go wrong, a per-run trace is indispensable: which prompt was sent, which tools were called with which parameters, which results came back, and how the reasoning proceeded from there. Good platforms ship this out of the box; if you build your own, you maintain it yourself. A trace is the difference between "the agent did something wrong" and a targeted fix.
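If you do maintain it yourself, a trace doesn't need to be more than one structured line per step, written as the run happens. A minimal sketch, with illustrative field names and a hypothetical tool:

```python
import json
import time
import uuid

class Trace:
    """Collects one record per step of a run and appends them to a JSONL file."""

    def __init__(self, run_id: str | None = None):
        self.run_id = run_id or str(uuid.uuid4())
        self.steps: list[dict] = []

    def record(self, kind: str, **payload) -> None:
        """kind: 'prompt', 'tool_call', 'tool_result', or 'reasoning'."""
        self.steps.append({"run_id": self.run_id, "ts": time.time(),
                           "kind": kind, **payload})

    def flush(self, path: str) -> None:
        with open(path, "a") as f:
            for step in self.steps:
                f.write(json.dumps(step, default=str) + "\n")

# Usage inside an agent loop (crm.lookup is a made-up tool name):
# trace = Trace()
# trace.record("prompt", text=prompt)
# trace.record("tool_call", tool="crm.lookup", params={"id": 42})
# trace.record("tool_result", tool="crm.lookup", result=result)
# trace.flush("runs.jsonl")
```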
Cost per run and model routing
An agent runs on a model billed per token. Cost per run is a serious metric, not only for the CFO but as a signal: an agent that suddenly uses three times as many tokens likely has a runaway loop or an altered prompt. Model routing (a smaller model for easy tasks, a larger one for genuinely hard ones) stays an ad-hoc habit rather than a design choice if cost isn't measured.
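A sketch of both ideas together: cost per run computed from token counts, and an explicit routing rule instead of an ad-hoc one. The model names, prices, and the classify_difficulty() helper are placeholders, not real list prices or a real classifier.

```python
PRICE_PER_1K_TOKENS = {  # (input, output) in EUR per 1,000 tokens, illustrative only
    "small-model": (0.0002, 0.0006),
    "large-model": (0.003, 0.012),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one run; log this per run so spikes show up on the dashboard."""
    p_in, p_out = PRICE_PER_1K_TOKENS[model]
    return input_tokens / 1000 * p_in + output_tokens / 1000 * p_out

def route_model(task: str) -> str:
    """Make the small-vs-large choice an explicit rule rather than a habit."""
    return "large-model" if classify_difficulty(task) == "hard" else "small-model"

def classify_difficulty(task: str) -> str:
    # Placeholder heuristic; in practice a cheap classifier or a rule set per task type.
    return "hard" if len(task) > 2000 else "easy"
```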
How to start without a team of six
For anyone who currently has one or two agents in production, the first step is: log all agent runs to one place, define an evaluation set of fifty cases per agent, and build one dashboard with the four basic metrics. That alone puts you ahead of many teams running twenty agents without any measurement in place. Book a Quick Scan if you'd like to see how we do this for agents we put in production.