Honeycomb Incident Report: Kafka Maintenance on May 4 and 7, 2026

On May 4th in the EU instance, and May 7th in the US instance, Honeycomb ran its only scheduled maintenance session with major planned downtime in the last five years. The maintenance aimed to replace the decade-old Kafka cluster at the core of event ingestion in all of Honeycomb with a newer, more reliable and scalable cluster.

By: Fred Hebert

June 19, 2026

Incident Response

On May 4th in the EU instance, and May 7th in the US instance, Honeycomb ran its only scheduled maintenance session with major planned downtime in the last five years. During the maintenance window, Honeycomb kept accepting your events, but recent data was not queryable while the work was underway: about 25 minutes in the EU instance, and roughly an hour and a half in the US. For that period, dashboards and queries showed gaps in recent data, and triggers and SLOs did not behave as usual. No data was lost, and everything was backfilled once the window closed. For many of you the experience was worse than we prepared you for, and we sincerely apologize. What we did not get right was the preparation.

The maintenance aimed to replace the decade-old Kafka cluster at the core of event ingestion in all of Honeycomb with a newer, more reliable and scalable cluster. Because such maintenance windows are so rare in Honeycomb’s history, our existing processes for proactive customer communications had not been adapted to an event like this.

Goal conflicts in choosing when to schedule it, misunderstandings as to the scope of the impact, and challenges around choosing the right communication channels to make our users aware of the maintenance ahead of time all played a role in us falling short of many of our customers’ expectations.

The experience our customers had during this maintenance window did not meet the bar we hold ourselves to. The advance notice we provided was not broad enough, and the description of customer-visible impact was not sufficient to prepare for the experience many of you had.

There is no good time for scheduling an impacting maintenance window, and we tried to pick the best time based on the options available. This was a one-time infrastructure migration and we do not anticipate or plan to perform scheduled maintenance of this nature again. Regardless, we are using this experience to establish clear processes for customer communication, advance notice, and scheduling decision-making.

This report inventories these challenges and identifies remediation items, including a new process that covers engineering planning, the writing and validation of communications, and the delivery of those communications with clear timelines.

Timeline

Technical background

Since its founding 10 years ago, Honeycomb has operated a Kafka cluster as a core component of its ingest pipeline. All events absorbed by the frontend ingest service are validated, and then inserted in the Kafka cluster on a per-topic basis. Honeycomb tightly controls its interactions with Kafka to provide the semantics it requires. Briefly, the architecture through which events go can be described as:

An ingest service validates all telemetry events submitted by customers, and is tasked to insert them to Kafka; enqueuing the message is necessary for Honeycomb to return a successful ingestion status for message batches.
All events go into a single Kafka topic. The topic is divided into multiple partitions for horizontal scalability. Each Honeycomb team contains one or more environments, each of which contains one or more datasets. Each environment is divided into subsets that are spread across multiple topic partitions. Each topic partition is replicated across three Availability Zones (AZs) for fault tolerance.
Honeycomb’s ingestion service keeps control of how individual events are allocated to topic partitions, such that it can safely tolerate Kafka nodes and leaders—and even whole partitions—becoming unavailable without dropping traffic.
Honeycomb’s query service ("Retriever") maintains a tight 1:1 coupling with Kafka. The query service has as many shards as there are Kafka partitions, and each shard has two query nodes (in different AZs).
To ensure consistency of consumed data, each Retriever node consumes all events and stores the Kafka-generated offset as a marker of its progress, with various consistency checks in place to avoid skipping or dual-consuming events. These offsets are stored in multiple redundant areas.
Other services, such as SLO processing, Service Maps, or Anomaly Detection features, consume from the same topic partitions. These however use more traditional Kafka Consumer Groups to process each event only once for analysis, and store generated data elsewhere.
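To make the two ends of this pipeline concrete, here is a minimal sketch, not Honeycomb's actual code. `PartitionPicker` and `OffsetTracker` are hypothetical names, and the round-robin policy is an assumption. The tracker also assumes that offsets within a partition are contiguous, which is a simplification made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionPicker:
    """Hypothetical ingest-side chooser: the producer decides the partition."""

    # Partitions assigned to each environment subset (illustrative mapping).
    assigned: dict[str, list[int]]
    # Partitions the producer currently considers unavailable.
    unavailable: set[int] = field(default_factory=set)
    _cursor: dict[str, int] = field(default_factory=dict)

    def pick(self, subset: str) -> int:
        """Round-robin over the subset's healthy partitions only."""
        healthy = [p for p in self.assigned[subset] if p not in self.unavailable]
        if not healthy:
            raise RuntimeError(f"no healthy partition for {subset}")
        i = self._cursor.get(subset, 0)
        self._cursor[subset] = i + 1
        return healthy[i % len(healthy)]


class OffsetTracker:
    """Hypothetical consumer-side check: offsets must advance by exactly one."""

    def __init__(self, last_offset: int = -1) -> None:
        self.last_offset = last_offset

    def accept(self, offset: int) -> bool:
        """Return True for a new event, False for a replay; raise on a gap."""
        if offset <= self.last_offset:
            return False  # already consumed: skip instead of double-counting
        if offset != self.last_offset + 1:
            raise ValueError(f"gap: expected {self.last_offset + 1}, got {offset}")
        self.last_offset = offset
        return True
```

A consumer built this way depends on the exact offsets assigned by one specific cluster, so its stored progress marker has no meaning on a different cluster.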

This architecture was supported through self-managed Kafka nodes on what is now legacy infrastructure, on which they were the last component to run. On its own this isn’t necessarily a problem, but over the years, the third-party tiered storage approach we were using was becoming problematic:

The recovery time following a hard node failure went from roughly 8h-12h per node initially to 48h-72h or more.
An AZ-wide failure, which we test yearly, risked taking a week rather than a few hours to properly recover from, leaving the cluster degraded in case a second AZ failure were to happen.
Scaling the cluster became challenging under the old infrastructure.
An open source community equivalent was available.

Our Kafka infrastructure itself also had some flaws. The biggest one was that our tight coupling to Kafka offsets meant that if we ever lost quorum on a Kafka topic partition, we would be stuck for days with it being unreadable while unplanned, emergency migrations would be required. This was not a theoretical risk: it happened on December 5, 2025.

Our migration intended to move the infrastructure on which Kafka runs from EC2 to EKS, switch the tiered storage implementation we rely on from proprietary to open source, replace ZooKeeper with KRaft, and move from manual, Chef-managed nodes to Strimzi-managed open-source ones.

While it is possible to run zero-downtime migrations, the amount of work and planning required to make it happen would likely leave us in a brittle state for months during a more complex preparation period. The migration itself would not be without its own risks, especially given the sensitivity of our workload to exact offset IDs, and any issue could lead to hours of downtime and data loss or duplication.

By comparison, the recovery mechanism required for our December 5 outage gave us the foundations to do a much faster, but more user-visible migration. An hour or two of delayed ingest (but with no data loss) could be traded off in order to switch writes across clusters and reset the consumer offsets across all applications.

The benefit of this latter approach is that it unblocked our ability to scale much faster: within 2 to 3 months instead of 9 to 12 months for a zero-downtime approach. This meant that we would also get shorter node replacement delays sooner, an earlier ability to scale up to growing demand, and a well-exercised path for disaster recovery in case of major outages, which would still be required in our toolbox regardless of the chosen migration path.

We considered it safer to take guaranteed bounded downtime than to delay reliability and scaling work by multiple quarters while still risking potentially worse outages during the time it would take to prepare the zero-downtime process.

The migration

Because Honeycomb is designed to be high-availability and has never previously needed to do planned maintenance with extended disruption to users, no up-to-date formal communications process was in place or well-practiced. In general, our internal SLOs are far more stringent than our shared SLAs, and since no downtime is expected as part of routine operations, we rarely expect to communicate to customers about operational plans.

In the past, the most common use case for operational notices to customers had been our yearly AZ failover tests—for which we actually expect no downtime but plan for the worst. Previously, engineers had notified customers about this work via a status page notice. Since no impact was detected by customers in these cases, there was no customer feedback on how wide-reaching or effective this process may have been, despite it becoming the routine approach.

The team owning the Kafka migration was aware of this practice. They wrote a status page advance notice for the Kafka migration in the EU Honeycomb instance based on past patterns, and also communicated it internally within Honeycomb.

Honeycomb employees from the field, who more often talk to and advocate on behalf of customers, felt the notice was too short for them and for customers to react in time, given that actual downtime was expected. After some discussion, it was agreed to postpone both migrations by a few weeks to let everyone prepare more thoroughly.

Multiple factors played in choosing the final migration dates and times:

Engineering wanted a date where the majority of the organization was online and available for support in case something went wrong. Best case estimates for a perfect migration ran within one hour of impact or so, but the worst case could take up to 6 hours.
Engineering wanted the migrations to be done close together, to benefit as much as possible from recent experience and to avoid any decay in information acquired during all drills and practice runs.
Because ingest would be delayed for a prolonged period, customers’ SLOs and trigger alerts would be impacted. Some alerts would fire because they are “heartbeats” (monitoring that data is entering the system), while others would keep working if their window was longer than 1-2h (such as the alerts some customers run on runaway daily spend). It was therefore possible that our customers would need to actively monitor their systems. We believed our customers’ engineering teams would prefer to handle this outside of their peak traffic time but during their employees’ weekday waking hours. We chose times for each cluster that would meet these criteria for the greatest number of that cluster’s customers, validating peak traffic assumptions against customer ingestion data.
Most of our customers run 24/7 businesses, such that no time is ideal for any planned downtime.

After discussion, May 4th was chosen for the EU cluster—a bank holiday in some countries—which let us do a daytime migration without significantly impacting most customers’ business or waking them up at night.

No such date was available for the US cluster migration, with the next holiday being on May 25th, too far in the future for comfort. An internal compromise was reached by doing the migration on May 7th, but shifting its hours to be after NYSE and Nasdaq trading hours, such that it would be after peak time for many customers, and still within business or at least daytime hours for many.

The team could not identify a time that satisfied all priorities raised by all stakeholders, but the chosen window appeared to be a “least bad” option given the overlapping constraints.

The migration went as planned at a technical level. Delays in ingested data becoming queryable were limited to 25 minutes in the EU and roughly 1h30 in the US, no abnormal data loss was detected, and the migration finished just in time to avoid the us-east-1 thermal event in AWS.

The communications

In hindsight, it is clear that we missed the mark on ample and timely customer communication about this maintenance event. We received feedback that multiple customers were surprised and upset by the impact.

Advance notice of the events was first shared on our status page. The status page has a subscription mechanism, but it is voluntary; customers must sign up in advance to receive these notices.

Following an internal impact analysis that assessed how many customer false alarms might trigger and how many real alarms would fail to trigger depending on proposed migration times, additional communications were re-drafted through a collaboration of people from customer success, support, and engineering, and reviewed by other people in the organization.

The communications we sent out included the following description of impact: “Event ingestion will continue uninterrupted. Honeycomb will keep accepting your data throughout the maintenance window. However, you may notice a gap in visible data during the cutover period. This is expected. All data will be backfilled and fully present within approximately three hours of the maintenance window closing.” It further warned that triggers monitoring missing events would fire, that those that monitored for specific values wouldn’t, and that other features would experience delays.
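The difference between the two kinds of triggers mentioned in that warning can be shown with a small sketch. It is a simplified model written for this report, not how Honeycomb triggers are implemented; the function names and the threshold value are hypothetical.

```python
from typing import Sequence


def heartbeat_fires(visible_event_count: int) -> bool:
    """A missing-data trigger fires when no events are visible in its window."""
    return visible_event_count == 0


def threshold_fires(visible_values: Sequence[float], limit: float) -> bool:
    """A value trigger fires only if some visible value exceeds the limit."""
    return any(value > limit for value in visible_values)


# During the cutover, events are accepted but not yet queryable, so the
# evaluation window looks empty even if the customer's system is misbehaving.
visible_during_gap: list[float] = []

false_alarm = heartbeat_fires(len(visible_during_gap))          # fires with nothing wrong
missed_alarm = not threshold_fires(visible_during_gap, 500.0)   # stays silent regardless
```

Both outcomes are problems for the person on call: the first is noise, and the second is a real signal that only shows up once the backfill completes.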

This impact list is significant because it reflects what was known within Honeycomb at the time, and critically, two drastically distinct interpretations resulted from the ambiguity in the meaning of "ingestion" and the expected delay:

Honeycomb employees closer to the design of our systems and familiar with migration processes run in the previous weeks understood the impact to mean “ingest will continue, but all indexing will be paused and will then catch up” and “the delay on data becoming queryable may be as long as three hours.”
Others interpreted it to mean “there will be some lag in the data, for a period of three hours,” meaning, for example, that the data could be consistently 10 minutes late, but for many hours.
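The gap between these two readings is easier to see with numbers. The sketch below models how stale the newest queryable data is at each minute under each interpretation. The 90-minute pause matches the rough US figure in this report; the catch-up rate of 2x and the 10-minute constant lag are illustrative assumptions only.

```python
def paused_then_catch_up(t: float, pause_min: float, catch_up_rate: float) -> float:
    """Staleness in minutes when indexing pauses, then drains its backlog.

    catch_up_rate is how many minutes of data are indexed per minute once
    indexing resumes; it must exceed 1.0 for the backlog to shrink.
    """
    if catch_up_rate <= 1.0:
        raise ValueError("catch_up_rate must be greater than 1.0")
    if t <= pause_min:
        return t  # nothing new becomes queryable while paused
    return max(0.0, pause_min - (catch_up_rate - 1.0) * (t - pause_min))


def constant_lag(t: float, lag_min: float, duration_min: float) -> float:
    """Staleness in minutes when data is uniformly late for a fixed duration."""
    return lag_min if t <= duration_min else 0.0


for minute in (45, 90, 135, 180):
    first = paused_then_catch_up(minute, pause_min=90, catch_up_rate=2.0)
    second = constant_lag(minute, lag_min=10, duration_min=180)
    print(f"t={minute:>3} min  paused: {first:5.1f}  constant: {second:5.1f}")
```

Under the first reading, staleness peaks at the full length of the pause; under the second, it never exceeds ten minutes. A team planning for the second would see little reason to arrange fallback monitoring.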

The first interpretation was more technically accurate and much more severe in terms of impact, but the second interpretation influenced which communication channels were chosen and how much planning was done on the field side to prepare our customers.

In the end, we reached out to a part of our organization that owned email communications to send the messages. Because past experience showed that broad email sending could trigger spam traps, we took measures to filter the list of customer emails to those with a high likelihood of deliverability. Filters were narrowed down to “team owners who were active within the last 90 days”, which, in practice, was too few users. Not all power users of Honeycomb are team owners, and as we determined after the fact, not all relevant users were visible to our email selection mechanism.
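As an illustration of how such a filter can miss the people who most need the notice, here is a small sketch. The `User` fields, the role names, and the sample users are all hypothetical; only the rule itself (team owners active within the last 90 days) comes from the description above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class User:
    email: str
    role: str              # "owner" or "member" (hypothetical role names)
    last_active: date
    manages_triggers: bool


def owners_active_within(users: list[User], today: date, days: int = 90) -> list[User]:
    """Keep only team owners who were active within the last `days` days."""
    return [
        u for u in users
        if u.role == "owner" and (today - u.last_active).days <= days
    ]


today = date(2026, 5, 7)
users = [
    User("owner@example.com", "owner", date(2026, 4, 20), manages_triggers=False),
    User("dormant-owner@example.com", "owner", date(2026, 1, 5), manages_triggers=False),
    User("oncall@example.com", "member", date(2026, 5, 6), manages_triggers=True),
]

notified = owners_active_within(users, today)
missed = [u for u in users if u.manages_triggers and u not in notified]
```

In this made-up sample, the only person who maintains triggers is an active member rather than an owner, so the filter drops exactly the user whose alerts would be affected.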

Alternative methods like shared chat channels were not relied upon, because they had historically tended to be used for purposes other than maintenance. Few teams could be told face-to-face because the migration schedule proceeded before most teams could have met with their representatives, but even then, the internal underestimation of impact meant that few extra measures were planned in the first place.

The EU migration completed quickly enough, impacted few enough customers, and was scheduled in such a way that no complaints or questions were raised during the migration, which reinforced the belief that our preparation and scheduling had been adequate. The US cluster’s migration, however, provided direct feedback that we were off the mark. Once again, we are sorry for the impact that this caused to customers.

Analysis

Three themes came out of our post-incident analysis:

Challenges in scheduling
Challenges in communicating
Preparedness for disruptive maintenance

Challenges in scheduling

Given the time and criticality tradeoffs of the migrations, the migration had been planned months in advance of the maintenance event. It became critical to parts of our strategic roadmap in terms of growth. However, the lack of process for communicating about maintenance events meant that our customer-facing team members were not systematically looped into these plans for communication coordination until close to the time of the migration.

As our customer-facing team members and engineers began to coordinate, we identified that additional coordination time (both internally and with our customers) would have been valuable. However, competing scheduled events around launches, operational and risk management concerns, and scaling needs for our growth, all made it impractical to wait.

By then two major tension points were established:

The need for the migration to be done in a way that would minimize its worst possible outcomes, and optimize for customers being around with their own hands available to monitor their systems through whatever fallback mechanisms they may have.
The need for the migration to be done in a way that would minimize impact for customers who require Honeycomb to do their jobs during working hours (and with 24/7 alerting and incident support).

The compromise worked well in the EU by allowing a daytime migration during a bank holiday. The EU cluster is also smaller, contains primarily EU-based customers, and was migrated within 1/3 the time the US cluster required.

No such compromise existed for the longer-standing US environment, which also has a larger user base spread more widely around the globe, including APAC and EMEA.

We do not anticipate needing to do disruptive maintenance for the foreseeable future of our technical roadmap, measured in years.

However, it is clear as well that if we are to run maintenance events with an expectation of disruptive impact to customers in the future, two things are needed:

A much longer advance notice must be given to properly communicate and provide workaround mechanisms to everyone who will be impacted.
Decision support about acceptable or ideal hours, with workarounds provided or supported.

Challenges in communicating

Plainly: we failed to reach enough of you, with enough advance notice, and the notice we did send had imprecise impact language.

In investigating these events, we noted the following elements:

Three email systems, with distinct ownership and purpose, mixing operational, billing/transactional, and marketing origins.
Shared chat channels (70% of Enterprise customers), with a mixed set of purposes.
A status page, used for outages, supporting subscription, and engineer-controlled.
Not all enterprise customers prefer to communicate with their account team in the same way; some have a regular weekly or monthly cadence, while others meet more asynchronously.
Support engineers and their portal, used for customers contacting us; support engineers are regarded as the best group of people to interact between engineering and field teams. They, however, own none of the Honeycomb-to-customers communication channels.

Ultimately, our fragmented set of tools and owners, used for various purposes, without established procedures for this type of maintenance, led us to improvise what we thought would be adequate communications. However, they weren’t acceptably broad, varied, clear, or ahead of time enough for our customers, as we have heard from your feedback.

Preparedness for disruptive maintenance

It was obvious to people within Honeycomb early on that we didn’t have a good process for this, because we had never needed to do such disruptive maintenance at any point in the past decade of Honeycomb, but also couldn’t defer maintenance without risk. The overarching theme of this investigation is coping with figuring out the requirements as we go, under time pressure.

The challenges identified by this investigation likely need clarification and follow-up for the future:

Maintaining awareness of disruptive events or incoming needs for stability.
Clarifying requirements for advance notice of maintenance within the organization.
The mechanisms to be used for advance notice of maintenance must be defined ahead of time.
There is a proliferation and fragmentation of communication tools and channels, with scattered ownership, which creates potential for inconsistencies over time based on who is currently involved and what they are familiar with.
Drafting and publishing advance notices requires cross-disciplinary support from all parts of the organization, and requires clear leads for the overall outcome (even if transient, such as during incident command).
The goal conflicts in deciding when to schedule such maintenance are still not resolved fully, and preemptive guidelines would be of use.
It should be possible to silence alarms during maintenance windows to prevent spurious, un-actionable alerts from firing for missing data.
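The last item in this list could take many forms. The sketch below shows one possible shape and does not describe an existing Honeycomb feature: `MaintenanceWindow`, `should_notify`, the alert kind names, and the window times are all hypothetical. The three-hour grace period mirrors the backfill estimate given in the maintenance notice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime
    grace: timedelta = timedelta(0)  # extra time for backfill to complete

    def covers(self, at: datetime) -> bool:
        """True if `at` falls inside the window or its grace period."""
        return self.start <= at < self.end + self.grace


def should_notify(kind: str, fired_at: datetime,
                  windows: Iterable[MaintenanceWindow]) -> bool:
    """Suppress only missing-data alerts inside a window; pass everything else."""
    if kind != "missing_data":
        return True
    return not any(w.covers(fired_at) for w in windows)


# Hypothetical window times, for illustration only.
window = MaintenanceWindow(
    start=datetime(2026, 5, 7, 21, 0, tzinfo=timezone.utc),
    end=datetime(2026, 5, 7, 22, 30, tzinfo=timezone.utc),
    grace=timedelta(hours=3),
)
during = datetime(2026, 5, 7, 21, 30, tzinfo=timezone.utc)
print(should_notify("missing_data", during, [window]))
print(should_notify("threshold", during, [window]))
```

Suppressing only the missing-data kind is a deliberate choice in this sketch: alerts on values that are still visible stay actionable during the window, so silencing them would hide real problems.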

Even if we do not anticipate needing to make use of such a plan in the future, there are benefits that would come from establishing protocols in the first place. Gaps in communication across departments surfaced during this incident and will require further efforts to address.

Protocols for the future

In line with the analysis and events surfaced in this report, we established a protocol and matching runbook to properly support disruptive maintenance of this kind, even if we have none planned for the future. While the details are out of scope for this report, some key elements will be provided here.

Longer timelines for internal coordination. Projects will require an increased notice time between engineering and other internal teams. This will give us more capacity to properly analyze impact, design workarounds, field effective communications and preparation, and give more time for our customers to do the same.
Clearer internal accountability and support. For both communications and scheduling, a clear ownership structure has been established to remove ambiguity and reduce the risk of surprises. We have also planned for earlier cross-team involvement for various departments to ensure a more accurate understanding of impact.
More time for customers to raise concerns. The guidelines we have established make room for earlier direct communications and early warnings so feedback can be gathered and incorporated in our workarounds and scheduling.
Workarounds to be provided. We expect to provide useful workarounds for both ourselves and our users to minimize or properly manage the impact of maintenance.
Wider, clearer communications. We have identified broader communication channels with specific owners, stakeholders, and channels to be used for maintenance notices. Guidance around messaging and how we validate its effectiveness are now in place, and we expect clear calls to action in the future.

Although most plans are imperfect upon meeting reality, we believe that the structure outlined previously should help by reducing the amount of uncertainty and improvisation required, and by increasing the buffer space available to cope with any surprise encountered.

While we hope not to need this plan, much like disaster recovery exercises and other contingencies, having it ready should make Honeycomb a more reliable partner.