It Can Only Goodhart Happen

Is the era of tokenmaxxing over before it even began? The rise of token leaderboards to the death of token leaderboards at companies like Amazon seem to have taken place in less than three months!

By: Austin Parker

| June 8, 2026

AI & LLMs

Honeycomb Canvas: The Multiplayer Workspace for the Agentic Era

Blog

May 20, 2026

Honeycomb Canvas: The Multiplayer Workspace for the Agentic Era

Auto-investigations that start the moment a trigger fires. Custom skills that encode your team's runbooks. A live workspace where humans and agents investigate production together. Now available to all Honeycomb customers.

When a measure becomes a target, it ceases to be a good measure.

Charles Goodhart, 1975

You’ve probably read this quote in relation to any number of things over the years. People complaining about arbitrary metrics like PRs merged, lines of code produced, and now, token usage. But is the era of tokenmaxxing over before it even began? The rise of token leaderboards to the death of token leaderboards at companies like Amazon seem to have taken place in less than three months!

To steal a line from Cat Hicks on Bluesky, you probably mean Campbell’s Law when you’re talking about Goodhart’s Law.

“The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”

The real dirty secret here is that Campbell is usually the more relevant pull than Goodhart for pretty much any conversation around ‘developer productivity’ because it is fiendishly difficult to measure. It’s practically impossible to disentangle the impacts of process or policy on overall outcomes other than in some extremely thin ways.

The most rigorous recent look at this found that the usual recommended best practices (code more, ship smaller PRs, collaborate) nudge cycle time in the directions you'd expect. The catch is that the effects are small and almost entirely drowned out by variance, most of it within a single person from one month to the next. Any one measurement of an individual tells you basically nothing about their typical performance. Their conclusion is that improving delivery is a systems-level problem, not an individual-scorekeeping one; Which is to say, the thing everyone wants to measure resolves into noise the moment you point it at a person.

Yet here we are, another day older and not a whit wiser, issuing new social credit scores to incentivize certain behaviors, understanding nothing. I think the discussion about token ROI is actually a pretty foolish one, personally. It’s a stalking horse for a deeper question that organizations are having difficulty even describing, let alone answering. That question? What would you do with intelligence as a service?

Monkeys, Shakespeare, that sort of thing

You’re perhaps familiar with the infinite monkey theorem? The notion is that given infinite time, a monkey typing infinite letters into a keyboard would randomly produce all written works, including the work of Shakespeare. Despite real-world trials failing to demonstrate this, it’s an interesting thought experiment and a direct proof exists (with some hand waving), but over an infinite amount of time, the probability of infinite monkeys replicating Hamlet does turn out to be 1.

The argument for tokenmaxxing reduces, somewhat, to a much more realistic infinite monkey theorem. It asks, “For any problem in your domain that can be expressed through text generation, how much value can be derived from generating infinite text?” This should be a somewhat appealing proposition. If you hold that LLMs are capable of producing, on average, median solutions to nearly any problem put before them that can be solved through the application of verifiable text (e.g., computer programs), then the law of large numbers would indicate that some of those solutions will randomly be out-of-distribution and better than the median—and, irrespective of that, that the median is actually probably better than the status quo.

This is not an entirely foolish proposition, especially if you engage in some game theory. If everyone else is tokenmaxxing, then you should be too, because you’re leaving potential upside on the table. Tokenmaxxing allows for one contributor to do the ‘work’ of five, ten, a hundred, a thousand more. If the only constraint is your budget and the opportunity cost, then it makes perfect sense to engage in wild amounts of arbitrage. You never know which of the ten thousand Rust rewrites of a core service will be the one that solves some thorny technical problem that’s been plaguing you forever!

Leverage AI-powered observability with Honeycomb Intelligence

Learn more about Honeycomb MCP, Canvas, and Anomaly Detection.

Learn More

Arrayed against the tokenmaxxers are a loose coalition of people who look at that position and wonder how you’re able to breathe with your head so far in the clouds it’s outside of the atmosphere.

The ‘anti’ position, as far as there is a coherent one, is that even if all of that is true, the downstream costs and impacts net out to be an overall drag on the business, team, and organization. Yes, you can rewrite every application in Rust. This is unhelpful if nobody understands a lick of Rust, though. You can burn through your entire backlog twice over and add in every feature you’ve ever wanted, but your codebase is now an unmanageable and unmaintainable mess (at least without the application of even more tokens to correct it) and you’re using twelve different serif fonts in the UI. Beyond that, while you might have created infinite output, verification costs scale as well. You’re paying twice (or many times more over) for potential improvements that create a drag on everything else that you do. Intelligence on tap is great, but judgement is still a scarce resource.

A token does not equal a token

My position is that both of these camps are maximalist in ways that don't really hold up to scrutiny. A token is not a fungible unit of intelligence. Agentic engineering uses a tremendous amount of tokens across different models and different modalities in ways that are difficult to map back to a specific deliverable. Models will misunderstand the problem or solution space and rabbit hole down unworkable paths, use tools incorrectly, and encounter any number of transient failure states, just like humans.

Decision-makers and business leaders are looking for the calculated balm of a clean and tidy cloud bill, but it simply does not exist in AI. This isn’t ‘insert dollars, receive computation.’ Even for AI application and agent delivery, there’s no level of guardrail or harness that will give you 100% error-free engagement. Someone will always do something to foil your best-laid plans, guardrails, and evaluations.

Thus, we need a better mental model of AI spend. Focusing on tokens is reductive and probably harmful, especially as we move towards a world where inference can be run locally. It seems likely that we’ll always have some models available on-demand due to the enormous requirements of memory and storage for weights, but it seems rather likely to me that we’ll move towards smaller models for more well-constrained local tasks (such as search and some forms of tool usage) with routers that are able to dispatch requests to a variety of endpoints—thus making the token math even more confusing.

There’s actually a pretty fascinating parallel here with story points, indeed, with most social attempts at making work legible. My story point looks different than your story point, and it's influenced by dozens of factors. Tenure, moxie, customer needs, business needs. We assume the truth will out when we look at these measures across an entire business unit, but when you get down to brass tacks there’s not much you can actually do to normalize the measurement of productivity across different people without clamping those measurements to a median. This should echo the tokenmaxxing dilemma: if you’re already in a world where you expect median output, then being able to have a single dial that you can crank for infinite median output seems like a no-brainer!

Illegible activities

What these measures miss is that there exist factors that these blunt measurements fail to capture, or model. Prompting and driving an LLM is a skill. Having good judgement and being able to build internal alignment around increasing the rate of change is a talent. Being able to hold a complex system in your head and understand implicit architectural constraints is an art. These are all illegible activities! When you start trying to incentivize behavior through various mechanisms like token leaderboards or mandatory AI usage, Campbell’s Law invariably creeps in and your measures become corrupted.

So, what matters? The same things that mattered before, to be honest. High performing teams need psychological safety, room for experimentation, and robust CI/CD. You need to test in production, and have excellent observability. We might be in the middle of a revolution, but not everything is revolutionized at once. This is an excellent moment to take stock and ask what still matters, and how it matters; don’t assume that past knowledge is now irrelevant.

The irony of the AI era is that the technical capability to achieve these things is within our reach. AI agents are effortlessly capable of adding instrumentation, writing tests, and improving your CI/CD pipelines. You will get an outsized ROI by turning your token budgets inward and improving developer experience. Structure those logs! Let a thousand Claudes bloom across your codebase, normalizing your telemetry and doing that OpenTelemetry migration you’ve been putting off! Send the data to Honeycomb and use Canvas to analyze agent behavior, if you’d like to actually see what they’re all doing (see this docs page for more). Demystify the token generation pipeline, actually dig into it, understand it alongside the agents.

The important thing isn’t that you’re counting and measuring the output, it’s that you’re building a system of understanding. You’re learning alongside the agents by observing their outputs, in the same way that you’ve come to understand productive teams by observing their behavior and characteristics. It doesn’t make the unmeasurable suddenly measurable, but it does allow you to build a muscle of knowledge that you can then apply to your processes.

But my biggest piece of advice is stop mandating this stuff. The reality is that AI doesn’t make your team, company, or organization better—it only accentuates the differences that already exist. If you’re finding that some teams become far more productive when you give them AI, good! That’s a good data point. Rather than exclaiming, “Wow, look at what AI can do!” it’s probably worthwhile to talk to them and understand what stopped them before rather than focusing on what AI has enabled them to do. You’ll probably learn something far more useful than you will through a hundred token leaderboards or AI mandates.