A very problematic migration for an observability company (Honeycomb), with many lessons learned:
In early 2021, observability company Honeycomb dealt with a series of outages related to an architectural migration of their Kafka cluster, culminating in a 12-hour incident, an extremely long outage for the company.
We turned on tiered storage, progressively uploading all of our longer-retention data to S3 and keeping only a few hours of data locally. We then booted up 6 new hosts of type m6g.xlarge and manually moved the topics over, until we could retire the 36 i3en.xlarge instances.
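The write-up doesn't include the exact commands used for the manual move, but Kafka's stock tooling does this with a reassignment plan fed to `kafka-reassign-partitions.sh`. A minimal, hypothetical plan (topic name and broker IDs are made up for illustration) looks like:

```json
{
  "version": 1,
  "partitions": [
    {"topic": "events", "partition": 0, "replicas": [201, 202, 203]},
    {"topic": "events", "partition": 1, "replicas": [202, 203, 204]}
  ]
}
```

The plan is applied with `kafka-reassign-partitions.sh --bootstrap-server <broker> --reassignment-json-file plan.json --execute`, and progress is checked later with `--verify`. Each moved partition must fully re-replicate onto its new brokers before the old hosts can be retired, which is why replication throughput on the new instances matters so much.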
Things were going fine until we lost an instance, and all hell broke loose. One of our engineers discovered that we had mistakenly migrated to the m6g.large instance type (the same one we used in internal environments) instead of the m6g.xlarge type we had planned for.
The mistake was subtle enough that reviews and careful rollouts over several days never surfaced it. Since nothing went wrong, we kept putting more and more load onto the new hosts; only when a failure happened did they prove unable to cope with the demand.
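The gap between the planned and actual instance types is easy to quantify from AWS's published specs: an m6g.large has exactly half the vCPUs and memory of an m6g.xlarge. A small sketch (instance counts from the post, specs from AWS's public instance-type tables):

```python
# Published AWS specs for the instance types involved.
INSTANCE_SPECS = {
    "i3en.xlarge": {"vcpu": 4, "mem_gib": 32},  # old fleet
    "m6g.xlarge":  {"vcpu": 4, "mem_gib": 16},  # planned new fleet
    "m6g.large":   {"vcpu": 2, "mem_gib": 8},   # what was actually provisioned
}

def cluster_capacity(instance_type, count):
    """Aggregate vCPU and memory for a fleet of identical instances."""
    spec = INSTANCE_SPECS[instance_type]
    return {"vcpu": spec["vcpu"] * count, "mem_gib": spec["mem_gib"] * count}

planned = cluster_capacity("m6g.xlarge", 6)
actual = cluster_capacity("m6g.large", 6)

print(planned)  # {'vcpu': 24, 'mem_gib': 96}
print(actual)   # {'vcpu': 12, 'mem_gib': 48}
```

So the cluster was running with half the intended headroom; with the load spread across all six brokers it held up, but losing even one instance pushed the remainder past what they could absorb.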
We started moving partitions back to the older, bigger machines, and things got even worse: the new hosts were replicating slower than expected…