ICYMI: Is This Code Worth Running? Here’s How to Know
Over the last three months, we’ve been exploring what about software development and observability changes with AI, and what doesn’t. Our conclusion: these five principles will remain true, even when 90% of the code is AI-driven.

By: Rox Williams

The agentic AI space is moving fast. Models are improving, context windows are expanding, and the ways people build and operate agents are changing so quickly that any thoughts we share could feel dated by the time you read this. We’re building the first generation of AI-native sociotechnical systems, and it feels overwhelming. But at the end of the day, AI is still software, and the code it writes is still software.
Just like humans, AI agents are typically constrained by incomplete context, fragmented tooling, and limits on what they can hold in working memory. Observability has always been about managing the tension between context and precision, giving engineers access to the rich data and fast feedback loops they need to deeply understand their systems. Whether held by humans or AI agents, the outcome will continue to depend on the quality, completeness, and accessibility of the context available to them.
We may have built up abstractions and practices that no longer serve us, but there are some core principles that will remain true, even when the vast majority of code is AI-driven.
Developers have always needed fast feedback loops to learn and innovate
When you're generating code 100x faster, every deploy without observation becomes an open loop. Open loops compound into technical debt at agentic speed. If developers don't have fast, flexible, high-cardinality feedback, they've lost their best opportunity to learn. AI amplifies and accelerates that learning system, but it amplifies the absence of learning too, adding unchecked chaos to a compounding pile of mystery deploys. We’ve all known that “it works on my machine” was never good enough. Why would it be any different for AI?
Observability grounds the feedback between production and development in data rather than assumptions. As AI accelerates code generation, the bottleneck has moved from building to learning and validating. Developers need feedback loops that can keep pace, or issues pile up and confidence erodes. Verification in production remains the only way to confirm whether AI- or human-generated code is achieving the desired intent.
The hard truth is that few developers follow their changes to production to learn from how their code actually behaves. The entire DevOps movement was essentially a twenty-year effort to achieve a single feedback loop connecting developers with production.
Setting up and tracking feedback channels that are useful to developers has historically been difficult with traditional three-pillars observability tooling. Frustration has preempted learning. But this has to change with AI. If we’re no longer building understanding by writing or reviewing the code ourselves, we have to rely on observability to understand it. And to be useful in an AI world, observability has to evolve to better support both developers and AI agents. The good news is that AI can also improve developer feedback loops by meeting developers where they already work. At Honeycomb, we're building a conversational interface that brings production telemetry directly into the conversation, without asking developers to jump from dashboard to dashboard. By removing the context-switching tax, feedback loops feel more like fuel than friction.
Context has always been what makes data valuable
Without context, data is useless. Context has always been what helps teams understand why software is or isn’t working. The problem is that humans hold a tremendous amount of context in their heads, where it’s unavailable for AI to use in its reasoning.
In 2026, AI can process all kinds of data at orders of magnitude the speed of humans. But without structure and context, it’s likely to hallucinate. A lossy or inaccurate feedback loop slows human teams and reinforces the wrong lessons; with AI, the effect is only amplified.
Agents cannot rely on human intuition to bridge the gaps the way humans have historically managed. By capturing more context, it becomes available to AI (as well as other human team members). This requires thinking carefully about how we capture and store contextualizing data about our software.
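One common way to make captured context usable, for humans and agents alike, is to record each unit of work as a single wide, structured event rather than scattered log lines, so relationships between fields are preserved. A minimal sketch (all field names here are hypothetical, and a real system would ship events to a telemetry backend rather than print them):

```python
import json
import time

# Hypothetical sketch: one wide, structured event per request, combining
# infrastructure, code-level, and business context in a single record so
# the relationships between fields are preserved for later analysis.
def emit_wide_event(**fields):
    event = {"timestamp": time.time(), **fields}
    print(json.dumps(event))  # in practice: send to your telemetry backend
    return event

event = emit_wide_event(
    service="checkout",
    endpoint="/api/cart/confirm",
    duration_ms=182.4,
    status_code=200,
    user_id="u_1234",
    plan="enterprise",
    region="us-east-1",
    build_sha="abc123",
    feature_flags=["new-pricing"],
    cart_value_usd=249.99,
)
```

Because every field lives on the same event, a later query can correlate any attribute with any other (say, latency by plan and region) without joining across separate silos.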
As we capture more context, the data doesn't become linearly more powerful as the dataset widens; it becomes combinatorially more powerful. Adding a 30th field to a structured log event doesn't give you one more thing to query; it gives you over a billion possible field combinations. Humans may scratch the surface of this combinatorial power in what they choose to analyze; AI can explore all of it in a fraction of the time and find what’s useful.
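The arithmetic behind that claim is simple: each field can either appear in a given query or not, so n fields admit 2^n possible field combinations.

```python
# Each field can be included in or excluded from a query,
# so n fields yield 2**n possible field combinations.
def field_combinations(n: int) -> int:
    return 2 ** n

print(field_combinations(29))  # 536,870,912
print(field_combinations(30))  # 1,073,741,824 -- over a billion
```

That single 30th field doubles the space of questions you can ask, which is why widening events pays off combinatorially rather than linearly.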
From an observability perspective, the more relationships preserved between data points, the more powerful the dataset becomes. This favors support for high dimensionality to map the overall system and provide AI with a rich knowledge graph. In contrast, the three pillars model destroys the relational seams that make data valuable. AI-SRE agents have already started going back to raw telemetry to find the richer signal that siloed pillars fail to provide.
This is why agents love Honeycomb's arbitrarily wide structured event store. AI agents skip the dashboards and query your data iteratively, consuming all the attributes to answer specific questions. With Honeycomb, AI gets the context and structure it needs to make novel discoveries at 100x speed. And novel discoveries are useful when you face novel problems.
Complex systems fail in unpredictable ways
Pre-built dashboards are designed for humans looking for known failure modes. They underutilize AI, which can analyze data far faster than a human, and they offer little help in the face of genuinely novel behavior. This is especially true in complex systems, which fail in ways you can't predict.
AI and agentic workflows introduce a new dimension of uncertainty that pre-AI tooling was never designed to handle. Unknown-unknowns are no longer the exception. The same input can produce different outputs, thus requiring fast and contextually rich feedback loops that can surface meaningful signal amid the noise. The traditional approach assumes you know what to look for—but most people don't know what question to ask, especially as systems grow more complex.
Observability has always been the mechanism for building confidence to deploy in the face of uncertainty, not by eliminating unknowns, but by giving teams the precision to find and understand them quickly when they surface. AI amplifies both the complexity and the rate of change. Observability (like Honeycomb!) that can handle unknown-unknowns and novel failures isn't a nice-to-have as AI becomes load-bearing infrastructure. It's the foundation that makes confident, fast deployment possible at all.
To debug hidden, unforeseen issues, both humans and AI need an explorable dataset with high cardinality, high dimensionality, speed, flexibility, and rich context. By consolidating as much context as possible into a single, explorable dataset, Honeycomb gives AI agents direct access to that rich telemetry, enabling them to investigate production behavior at agentic speed with the same precision a senior engineer would apply.
Pre-production testing was always insufficient
Testing increases confidence in deploying code, but pre-production testing was always insufficient. Verifying in production has always been the source of truth on code viability and business impact. Formal methods and test suites are flight simulators; production is flying the actual plane.
AI amplifies this reality in at least two ways. First, nondeterministic output means behavior observed in testing isn’t repeatable, let alone predictive of production. Second, dependencies on external foundation models introduce a new source of change outside your control. LLMs add new interaction surfaces you have to observe to understand what’s happening to a system in production.
The good news is that OpenTelemetry's generative AI semantic conventions already capture the key attributes engineers need to understand LLM interactions. This includes model name, token usage, input/output messages, tool calls, finish reasons, and reasoning traces. This gives teams a standardized, vendor-neutral foundation for observing AI behavior in production without starting from scratch.
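As a rough sketch of what those conventions cover, here are the kinds of span attributes involved. The `gen_ai.*` names below follow the OpenTelemetry generative AI semantic conventions as of this writing, but the conventions are still evolving, so check the current spec before depending on them:

```python
# Sketch: attributes an LLM-call span might carry under the OpenTelemetry
# generative AI semantic conventions (names per the semconv at time of
# writing; verify against the current spec). In a real service these would
# be set on a span via an OpenTelemetry SDK rather than built as a dict.
def genai_span_attributes(model, input_tokens, output_tokens, finish_reason):
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": [finish_reason],
    }

attrs = genai_span_attributes("gpt-4o", 412, 128, "stop")
```

With these attributes on every LLM span, questions like “which model version drove the token spike?” become ordinary telemetry queries instead of archaeology.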
Cost has always been an attribute of a system
Code running in production carries a cost. Sometimes the cost impact is not directly felt by development teams, but it’s still an attribute of the system. What used to be capex-intensive has been shifting to usage-based costs: cloud services, ephemeral infrastructure, and API metering.
With AI, token consumption cost becomes another such attribute. Every inference costs money, the costs are direct and variable per request, and you can't optimize what you can't see. However, optimizing token spend before business value is realized is premature optimization. AI reinforces the need to integrate cost information, including token cost, to completely understand the impact of software. Without that visibility, runaway token costs will undermine AI's ability to deliver net value.
The Intercom team learned this firsthand while building Fin, their AI customer service agent. After implementing an "eager request" optimization that significantly improved Fin's response latency, they discovered a costly side effect. When conversations were routed away from Fin entirely, the LLM requests fired speculatively were wasted. Engineering only found out after Finance raised the issue with them.
The signal existed in the data, but it wasn't surfaced in real time where engineers could act on it. By adding cost per interaction into their traces, cost became a first-class, real-time signal rather than a lagging financial report. They connected every optimization back to a single customer-centric metric and linked both performance and cost signals to the same telemetry. The result was a 60% reduction in median time to first token alongside real-time cost optimization, with SLOs defending both.
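A minimal sketch of that pattern: derive a per-request cost from token usage and attach it to the request's trace as an attribute. The prices and the attribute name here are illustrative assumptions, not real rates:

```python
# Hypothetical sketch: turning token usage into a per-interaction cost
# attribute so cost appears in traces alongside latency. The per-1K-token
# prices below are made-up placeholders; substitute your provider's rates.
PRICE_PER_1K_INPUT = 0.0025   # USD, assumed rate
PRICE_PER_1K_OUTPUT = 0.0100  # USD, assumed rate

def interaction_cost_usd(input_tokens: int, output_tokens: int) -> float:
    return round(
        input_tokens / 1000 * PRICE_PER_1K_INPUT
        + output_tokens / 1000 * PRICE_PER_1K_OUTPUT,
        6,
    )

# Attach to the request's span/event under a name like "llm.cost_usd"
# (hypothetical attribute name) so SLOs can defend cost as well as latency.
cost = interaction_cost_usd(input_tokens=1200, output_tokens=300)
```

Once cost is a span attribute, wasted speculative requests like the ones in the Fin example show up in real-time queries instead of next quarter's finance report.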
Answering the question with AI: is this code worth running?
To answer that question, we need to understand how the code is working and whether it’s delivering more value than it costs. With the right guardrails, AI can help build the code and answer those questions.
Agents are fast, but they need a system of checks and balances. That system needs to be fast, context-rich, and tethered to production.
Fast feedback preserves relevance as the half-life of feedback shortens. Wide context gives AI more related attributes to reason over. Novel failures are normal and demand explorability. Production is the ultimate test. Costs can’t be ignored. And for all of these things, you need observability.