Debugging ZooKeeper at PagerDuty
"After more than a month of tireless research and testing, we have finally got to the bottom of our ZooKeeper mystery. Corruption during AES encryption in Xen v4.1 or v3.4 paravirtual guests running a Linux 3.0+ kernel, combined with the lack of TCP checksum validation in IPSec Transport mode, which leads to the admission of corrupted TCP data on a ZooKeeper node, resulting in an unhandled exception from which ZooKeeper is unable to recover. Jeez. Talk about a needle in a haystack… Even after all this, we are still unsure where precisely the bug lies."

And instead of dumping Xen for KVM, the author proposes workarounds. Hrmpf.

One bug the article leaves out, IMO, is the missing sanity checking on deserialized values, in this case specifically scheme_len.
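To make that concrete, here is a minimal sketch of the kind of check I mean, assuming a wire format where a field is a 4-byte length followed by that many bytes. BoundedReader, readBuffer and maxLen are made-up names for illustration; this is not ZooKeeper's actual deserialization code.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration; not ZooKeeper's real code. */
final class BoundedReader {
    private BoundedReader() {}

    /**
     * Reads a 4-byte big-endian length followed by that many bytes.
     * The length is validated before anything is allocated, so a
     * corrupted prefix fails with a clear error instead of a huge
     * allocation or a read past the end of the data.
     */
    static byte[] readBuffer(ByteBuffer in, int maxLen) {
        if (in.remaining() < Integer.BYTES) {
            throw new IllegalArgumentException("truncated length prefix");
        }
        int len = in.getInt();
        // Reject negative or implausibly large lengths (maxLen is an assumed policy limit).
        if (len < 0 || len > maxLen) {
            throw new IllegalArgumentException("implausible field length: " + len);
        }
        // Reject lengths that claim more bytes than were actually received.
        if (len > in.remaining()) {
            throw new IllegalArgumentException(
                "field length " + len + " exceeds remaining " + in.remaining());
        }
        byte[] out = new byte[len];
        in.get(out);
        return out;
    }
}
```

With something like this in place, a bit-flipped length turns into a rejected request rather than an exception somewhere deep in the server.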

They also didn’t bother (afaict) to fix ZooKeeper to not go into zombie mode when its main thread catches an exception.
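A rough sketch of the fail-fast alternative, assuming the failure surfaces as an exception escaping a critical thread: install a handler that logs and then kills the process, so a supervisor can restart it instead of leaving a half-dead server running. FailFast and its methods are hypothetical names, and ZooKeeper's real thread structure is more involved than this.

```java
import java.util.function.IntConsumer;

/** Hypothetical fail-fast helper; not part of ZooKeeper. */
final class FailFast {
    private FailFast() {}

    /**
     * Builds a handler that logs the dying thread and then calls the
     * supplied exit action with status 1. The exit action is injectable
     * so the handler can be exercised without killing the JVM.
     */
    static Thread.UncaughtExceptionHandler handler(IntConsumer exit) {
        return (thread, error) -> {
            System.err.println("fatal: thread " + thread.getName() + " died: " + error);
            exit.accept(1);
        };
    }

    /** Installs the handler JVM-wide; halt() skips shutdown hooks that might hang. */
    static void installDefault() {
        Thread.setDefaultUncaughtExceptionHandler(
            handler(code -> Runtime.getRuntime().halt(code)));
    }
}
```

Calling FailFast.installDefault() early in startup would be enough for this sketch. Whether halt() or a cleaner System.exit() is the right call depends on how much you trust the shutdown path.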

Or, you know, configure the JVM to heap dump on OOM (-XX:+HeapDumpOnOutOfMemoryError).

Also, there's the fact that they’re still running a 2.6 kernel in production, which means they’re on some ancient OS release.