Your Questions About AI Agents and Production Feedback Answered

By: Austin Parker

On April 1st, I joined Akshay Utture from Augment Code for a webinar on how AI agents use production feedback to improve code. We covered a lot of ground—the DORA report's findings on AI driving both throughput and instability, why observability is shifting left into the development process, and how to close the gap between "code that was written" and "code that actually works in production."
We got a ton of great questions from attendees, and I didn't have time to answer all of them during the session. So, here are my answers to the ones I found most interesting, and most representative of what people are actually grappling with right now.
On trust
At what point do you trust the agent's production feedback enough to let it merge code without human review? And how do you build that trust incrementally? Is there an observability-driven confidence score, or is it vibes?
It's a little bit vibes. I know that's not the answer people want, but let me explain why that's actually okay.
Trust in an AI agent should work the same way trust works with a new team member. You don't hand someone full production access on day one. You start with low-risk, well-understood tasks. You review everything. Over time, as the results are consistently good, you widen the aperture. The observability data is what lets you make that determination with evidence instead of gut feeling.
I think about this in concentric rings. The innermost ring is changes where the agent can self-verify: it makes a change, runs the tests, checks the SLOs, confirms that the metrics didn't regress. If all of that passes, the change can be auto-merged, but only for categories of changes where you've already validated work this way. Next ring out, changes that get flagged for human review but come pre-annotated with observability context: "Here's what I changed, here's what I observed in production after deploying to staging, here's what I expect to happen." Everything else, you still want a human deeply involved.
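To make that innermost ring concrete, here's a minimal sketch of an auto-merge gate. The helper names (`run_tests`, `check_slos`, `metrics_regressed`) and the category list are hypothetical stand-ins for your own CI and observability APIs, not a prescribed implementation:

```python
# Hypothetical innermost-ring gate: auto-merge only for categories of
# changes you've already validated work this way, and only when every
# self-verification check passes.

AUTO_MERGE_CATEGORIES = {"dependency-bump", "docs", "config-tweak"}

def can_auto_merge(change, run_tests, check_slos, metrics_regressed):
    """Return True only when the agent can fully self-verify the change."""
    if change["category"] not in AUTO_MERGE_CATEGORIES:
        return False  # everything else gets flagged for human review
    if not run_tests(change):
        return False
    if not check_slos(change):
        return False
    if metrics_regressed(change):
        return False
    return True

# Example: a docs change that passes every check can auto-merge.
ok = can_auto_merge(
    {"category": "docs"},
    run_tests=lambda c: True,
    check_slos=lambda c: True,
    metrics_regressed=lambda c: False,
)
```

Anything outside the validated categories falls through to the next ring out, regardless of how the checks look.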
The mechanism for building trust incrementally is observability. Tight SLOs give you a machine-readable definition of "working." Production telemetry gives you the evidence. Over time, you look at the track record: how often did the agent's self-assessment match reality? When it was wrong, how wrong was it? That pattern of evidence is your confidence score. It just isn't a single number. It's more like a batting average that you develop intuition around.
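As a rough sketch of that batting average, assuming a hypothetical history format where each record pairs the agent's self-assessment with the outcome production telemetry actually showed:

```python
# Sketch: turning an agent's track record into a "batting average."
# Field names are illustrative; the data would come from your own
# record of agent changes and their observed production outcomes.

def batting_average(history):
    """Fraction of changes where the agent's self-assessment matched reality."""
    if not history:
        return 0.0
    hits = sum(1 for h in history if h["self_assessment"] == h["observed_outcome"])
    return hits / len(history)

history = [
    {"self_assessment": "pass", "observed_outcome": "pass"},
    {"self_assessment": "pass", "observed_outcome": "fail"},
    {"self_assessment": "pass", "observed_outcome": "pass"},
    {"self_assessment": "fail", "observed_outcome": "fail"},
]
avg = batting_average(history)  # 3 of 4 matched -> 0.75
```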
On drift
How do you detect agent drift in production—when they go off-script or get stuck?
This is one of the most practical problems in running agents in production, and I think people are overcomplicating it.
"Drift" really means two different failure modes. One is the agent diverging from its intended task: going on a side quest, fixating on the wrong signal, chasing its tail in a loop. The other is the agent silently producing bad outputs that look correct but aren't.
For the first one, instrument the agent loop itself. If you're emitting telemetry from your agent—traces of each tool call, the decisions it's making, how many iterations it's taking—you can set up alerts on pretty simple heuristics. An agent that's made fifteen tool calls without converging on an answer is probably stuck. An agent whose queries are getting progressively broader instead of more specific is probably lost. These patterns are visible in trace data, and honestly, they're easier to detect than a lot of traditional application problems.
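Those two heuristics can be sketched in a few lines over trace-derived data. The thresholds, field names, and the query-breadth proxy here are illustrative, not prescriptive:

```python
# Sketch of stuck/lost detection over trace-derived agent telemetry.
# MAX_TOOL_CALLS and the "converged" field are illustrative choices.

MAX_TOOL_CALLS = 15

def looks_stuck(tool_calls):
    """Many tool calls without ever converging on an answer."""
    return len(tool_calls) >= MAX_TOOL_CALLS and not any(
        c.get("converged") for c in tool_calls
    )

def looks_lost(query_lengths):
    """Queries getting progressively broader instead of more specific.

    Uses query length as a crude proxy for specificity: each successive
    query no longer (i.e., no more specific) than the last.
    """
    return len(query_lengths) >= 3 and all(
        later <= earlier for earlier, later in zip(query_lengths, query_lengths[1:])
    )

stuck = looks_stuck([{"converged": False}] * 15)
```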
For the second, you need output validation that goes beyond "did it complete without errors." This is where SLOs matter. If an agent is making code changes that technically pass tests but cause a gradual SLO burn, you need that burn rate alert to fire and trigger a review. The agent doesn't need to know it's drifting—your observability platform does.
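For reference, the burn-rate math itself is small; a minimal sketch, with the 99.9% target and request counts as illustrative numbers:

```python
# Sketch of an SLO burn-rate check. With a 99.9% target, the error
# budget is 0.1% of requests; burn rate is how fast you're consuming
# that budget relative to the allowance.

def burn_rate(errors, requests, slo_target=0.999):
    """Observed error rate divided by the error budget rate."""
    error_budget = 1.0 - slo_target           # e.g. 0.001 for a 99.9% SLO
    observed_error_rate = errors / requests
    return observed_error_rate / error_budget

# 50 errors in 10,000 requests against a 99.9% SLO:
# observed rate 0.005, budget 0.001 -> burning budget 5x too fast.
rate = burn_rate(errors=50, requests=10_000)
should_alert = rate > 1.0  # sustained burn above 1.0 exhausts the budget
```

In practice you'd evaluate this over multiple windows (e.g., fast and slow burn), but the gradual burn the text describes is exactly what a slow-window version of this check catches.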
One more thing on this: return control to the human operator frequently. The best agent workflows I've seen don't try to run autonomously for hours. They complete a bounded task, present results, and wait for confirmation before proceeding. Keeps the blast radius small. Makes drift a nuisance instead of a catastrophe.
On agents that don’t support observability
How do you hook into agents that don't natively support observability?
This describes the situation most people are in. Not everyone is running a hand-rolled agent with perfect OpenTelemetry instrumentation. Most folks are using Cursor, Claude Code, Copilot, or some internal wrapper around an LLM API, and they don't get to control the internals.
You have options at several layers, though.
If the agent talks to your systems (makes API calls, writes to a database, deploys code), those downstream systems should already be instrumented. You might not be able to see inside the agent, but you can see what it does. The traces from your existing services will show you the impact of agent-authored changes, and that's often enough to detect problems.
Many agents and coding tools are also starting to support OpenTelemetry natively. Claude Code, for instance, now emits OpenTelemetry metrics and logs, which you can send straight to Honeycomb. It's one environment variable away from useful data about token usage, tool calls, and session behavior. As MCP becomes more widespread, more tools will adopt this pattern.
For agents you're building internally, instrument the wrapper even if you can't instrument the model. Every tool call an agent makes, every prompt it constructs, every response it receives—those are all spans. You don't need to look inside the LLM to understand the agent's behavior; you need to look at its interactions with the outside world. Same principle we've always applied to third-party dependencies in distributed systems.
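Here's a sketch of that wrapper pattern using a stand-in recorder. In a real system you'd emit OpenTelemetry spans via the SDK instead, but the shape of the data (one span per tool call, with inputs, errors, and duration attached) is the point:

```python
import time

# Sketch of instrumenting an agent wrapper: every tool call becomes a
# span-like event. The recorder is a stand-in for an OpenTelemetry
# tracer; field names are illustrative.

class AgentTelemetry:
    def __init__(self):
        self.spans = []

    def record_tool_call(self, tool_name, fn, **kwargs):
        """Run a tool and record its inputs, outcome, and duration."""
        start = time.monotonic()
        error = None
        try:
            result = fn(**kwargs)
        except Exception as exc:
            error = str(exc)
            raise
        finally:
            self.spans.append({
                "name": f"tool.{tool_name}",
                "attributes": {"tool.args": kwargs, "error": error},
                "duration_s": time.monotonic() - start,
            })
        return result

telemetry = AgentTelemetry()
answer = telemetry.record_tool_call(
    "search",
    lambda query: f"results for {query}",
    query="p95 latency spike",
)
```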
Where things get harder is with fully black-box SaaS agents that don't expose any telemetry. For those, you're limited to observing their effects—watching the PRs they open, the deployments they trigger, the downstream metrics after those deployments. Less precise, but it's something. And if a tool won't let you observe what it's doing in your production systems, that should factor into whether you use it at all.
On token costs and unified views
Can we observe token consumption and cost in Honeycomb? What about correlating API success/failure, code quality signals, and LLM response performance into a unified view?
I'm combining two related questions here because the answer to both is yes.
On tokens and cost: if your agent is instrumented with OpenTelemetry, or if you're using something like Claude Code that emits OpenTelemetry data, you can send token counts, model usage, and session metadata to Honeycomb. We have a board template for Claude Code monitoring that shows this kind of data. Token usage per session, per model, per tool call, broken down by user or team. From there, attaching cost data as a derived column is straightforward.
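A rough sketch of that cost derivation, with made-up placeholder prices (substitute your provider's actual rates):

```python
# Sketch of deriving session cost from token telemetry. Prices per
# million tokens are hypothetical placeholders, not real rates.

PRICE_PER_MTOK = {
    "input": 3.00,    # $ per 1M input tokens (placeholder)
    "output": 15.00,  # $ per 1M output tokens (placeholder)
}

def session_cost(input_tokens, output_tokens):
    """Dollar cost of a session, the same math a derived column would do."""
    return (
        (input_tokens / 1_000_000) * PRICE_PER_MTOK["input"]
        + (output_tokens / 1_000_000) * PRICE_PER_MTOK["output"]
    )

# A session with 200k input and 50k output tokens:
cost = session_cost(200_000, 50_000)  # 0.60 + 0.75 = $1.35
```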
The unified view question is more interesting and I think the industry is still figuring it out. The direction is clear, though. The telemetry you emit from your agent—tool calls, token usage, latency, model selection—should live alongside the telemetry from the systems it interacts with. When an agent makes a code change that causes a spike in P95 latency, you should be able to trace from the agent's action to the production impact in a single investigation.
High-cardinality wide events were built for this. Instead of siloing "agent metrics" from "application metrics," you want both in the same system, correlated by trace context. The agent's commit hash, the deployment it triggered, the SLO impact it caused—all queryable together. We're building towards this with Honeycomb MCP, because it's the foundation of a real feedback loop rather than two disconnected dashboards.
On balancing learning with execution
How do you balance learning with execution when code generation is delegated to the AI agents?
This question gets at something I think is genuinely underappreciated. When an AI agent writes code for you, you ship faster, but you might not understand faster. And if the humans on the team stop building mental models of the system, that problem compounds.
The DORA report speaks to this directly: AI tends to exacerbate the preexisting conditions of an organization. If your team already has strong shared understanding and good observability practices, AI accelerates your output while your existing feedback loops keep everyone informed. If your team is already operating without deep system understanding, AI just helps you ship more code you don't understand into systems you can't debug.
My honest answer is that observability is the learning mechanism now. When an agent writes code and deploys it, the production telemetry is how the team learns what that code actually does. Instead of learning by writing, you learn by observing. The mental model comes from watching behavior in production rather than from authoring every line.
That means your instrumentation has to be rich enough to teach. Barebones APM spans that tell you "this endpoint was slow" aren't sufficient. You need custom attributes that describe business context: what user action triggered this, what experiment variant was active, what the expected behavior was. That metadata is how you and your agents build understanding of the system over time.
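As a sketch, the difference between a bare span and one that teaches is just attributes. All the field names here are illustrative:

```python
# Sketch: enriching a bare timing event with the business context that
# lets humans (and agents) learn from it. Field names are illustrative.

def build_wide_event(base, user_action, experiment_variant, expected_behavior):
    """Attach business-context attributes to a span-like event."""
    event = dict(base)
    event.update({
        "app.user_action": user_action,
        "app.experiment_variant": experiment_variant,
        "app.expected_behavior": expected_behavior,
    })
    return event

event = build_wide_event(
    {"name": "checkout", "duration_ms": 412},
    user_action="clicked 'Place order'",
    experiment_variant="one-click-checkout",
    expected_behavior="order created within 500ms",
)
```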
There's a cultural piece, too. Teams that succeed treat agent-generated code as a first draft that needs to be understood, not just shipped. They review it, but they also use the review process to learn. When something breaks in production, they investigate the same way they would if a human wrote the code, because the production behavior is what matters regardless of who authored the change.
On the traditional SDLC
How can we reimagine AI-generated code getting to prod with the lessons from traditional SDLC but optimized for the AI era?
The traditional SDLC was designed around an assumption: writing code is expensive and slow, so we should build elaborate gates to prevent bad code from reaching production. Code review, integration testing, staging environments, change advisory boards—all exist because the cost of catching a bug in production was assumed to be orders of magnitude higher than catching it earlier.
AI changes the economics. Code is cheap. Iteration is cheap. Refactors that would have taken a week now take an afternoon. But the need to verify behavior in production hasn't gone away. If anything, it's more important because the volume and velocity of changes have outpaced our ability to review them all upfront.
So, what does the optimized version look like? I'd keep the parts of the SDLC that are about defining intent. Specs, SLOs, acceptance criteria—these describe what you want the software to do. They're the contract that both humans and agents can verify against. Don't let AI speed tempt you into skipping this.
Then, shift verification from pre-production gates to production observation. Instead of trying to catch every bug in a staging environment that doesn't match production anyway, invest in observability that tells you immediately when production behavior deviates from intent. Canary deployments, feature flags, SLO-based deployment gates—these let you ship fast and catch problems where they matter: when real users are affected.
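A minimal sketch of one such gate—an SLO-based canary check that compares the canary's error rate against the baseline before promoting. The tolerance and counts are illustrative:

```python
# Sketch of an SLO-based canary gate: promote only if the canary's
# error rate stays within a tolerance of the baseline's. The 1.5x
# tolerance is an illustrative choice, not a recommendation.

def canary_decision(baseline_errors, baseline_total,
                    canary_errors, canary_total,
                    max_ratio=1.5):
    """Return 'promote' or 'rollback' based on relative error rates."""
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    if baseline_rate == 0:
        return "promote" if canary_rate == 0 else "rollback"
    return "promote" if canary_rate <= baseline_rate * max_ratio else "rollback"

# Baseline at 0.1% errors, canary at 0.5% -> roll back.
decision = canary_decision(10, 10_000, 5, 1_000)
```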
And then close the loop. The biggest thing the traditional SDLC gets wrong for the AI era is that it's linear. Code, review, test, deploy, monitor. The version I'd advocate for is circular: observe, hypothesize, change, deploy, observe. Production telemetry feeds back into the next iteration. An agent that deploys a change and then watches SLOs to confirm it worked is operating in the tight feedback loop that the traditional SDLC was designed to avoid needing, because we assumed feedback from production was too slow and too expensive. It isn't anymore.
The DORA data backs this up. The highest-performing teams aren't the ones with the most gates; they're the ones with the shortest feedback loops. AI just makes that dynamic more pronounced.
Conclusion
We covered a lot more during the webinar itself. If you missed it, you can watch the recording here. And if you want to see what production feedback loops look like in practice, try Honeycomb for free and connect our MCP server to your favorite AI coding tool. It takes about five minutes to get from zero to "huh, that's interesting."