The perils of retry in a scaled-out system - the query of death.

In the comments: "A long time ago, Google used to have problems with the search frontends and a “query of death” that happened to trigger a crash bug. “Keep retrying” seemed like a good strategy in the face of occasional failures, until they started bringing down all the frontends in a datacenter in rapid succession…" - @Joel_Webber

Originally shared by Craig Ulmer

Resilience is difficult in distributed systems. The other day we found a creative way to crash our Accumulo cluster. This cluster uses about a dozen beefy nodes (100+ GB of RAM, many cores, many disks) and houses a large number of records. Accumulo has been great in that it usually does the right thing: as our data grows, Accumulo splits tables into smaller tablets that are migrated to other nodes in the cluster. Under the hood, everything is stored in HDFS, so if a node in the cluster crashes, Accumulo can just spin the dead tablets up on other nodes. These resilience features are automatic and happen without the user ever knowing.

Things have been running smoothly, but a few weeks ago we had a user that wanted to export data to a new service so they could demo their work to others. While preparing the service, the user ran a client that (right or wrong) exported all of the data in Accumulo to their service. The user noticed the export was taking a long time. Then the user noticed Accumulo was sluggish, and that services were dying one by one, until everything was dead. What happened?

Two things. First, the user was requesting that every tablet server fetch all of its data off disk and ship it back to the client. That's a lot of data for Accumulo to assemble in memory and send back. When I configured the Accumulo services, I'd used the large-memory config files as a template and then bumped the settings up to match the 64 GB of RAM we had at the time. At some point we bumped the system's memory up, and eventually accumulated enough data that we could use all that memory. However, we never stress tested at those new levels. When faced with this request, one of the larger tablet servers used all of its heap and then crashed with out-of-memory (OOM) errors.

Second, and more interestingly, Accumulo's reliability mechanisms saw the failure and respawned the tablet on another node. Since the client's request was still waiting to be resolved, the first thing the new tablet server would do when it came online was try to fill the pending request. It would eventually run out of memory and OOM crash just like its brother. It took about a minute and a half for this process to happen, given the sizes involved. We watched the failure march through all the nodes in the cluster, until every single node's tablet service had crashed. Bang. Bang. Bang… We call it passing the gun.
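The failure mode can be sketched as a toy loop: a pending oversized request gets retried against whichever node picks up the tablet next, killing nodes one by one. Node names and heap sizes here are made up for illustration, not from the real cluster.

```python
def cascade(nodes_heap_gb, request_gb):
    """Return the list of nodes crashed, in order, by one retried request."""
    crashed = []
    alive = dict(nodes_heap_gb)
    while alive:
        node, heap_gb = next(iter(alive.items()))  # tablet respawns here next
        if request_gb <= heap_gb:
            return crashed  # request finally fits; the cascade stops
        crashed.append(node)  # OOM: node dies, but the request stays pending
        del alive[node]
    return crashed  # every node crashed

print(cascade({"n1": 64, "n2": 64, "n3": 64}, 100))  # → ['n1', 'n2', 'n3']
```

The point of the toy model is that the request outlives every individual crash, so the recovery machinery keeps handing it a fresh victim.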

The obvious fixes here were to boost the Xmx settings for Accumulo and put some better limits on our batch scanners. To be fair, Accumulo did what we told it to do, which was unfortunate because we told it to do something dumb. It's just interesting to me that the reliability mechanisms turned a failure at a single node into something that brought down the whole distributed system. As I said, resilience is difficult. #accumulo
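One generic guard against this pattern is a bounded retry budget with exponential backoff, so a poisoned request eventually surfaces as a failure instead of marching on forever. A minimal sketch; the fetch callable and the limits are hypothetical, not Accumulo's API:

```python
import time

def fetch_with_retries(fetch, max_attempts=3, base_delay=0.5):
    """Call fetch(), making at most max_attempts total attempts and backing
    off exponentially between them, so one bad request can't hammer every
    replica in rapid succession."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            time.sleep(base_delay * 2 ** attempt)
```

A caller that gives up after three attempts converts a query of death into a single failed request, which is exactly the trade you want.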

Retrying once can take you from 99% availability to 99.99% availability. Retrying more than once: see above.
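The arithmetic behind that claim, assuming attempts fail independently with the same probability:

```python
def availability(p_success: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries
    succeeds, given per-attempt success probability p_success."""
    return 1 - (1 - p_success) ** attempts

# One attempt at 99% leaves a 1% failure rate; one retry squares the
# failure probability to 0.01% -> 99.99% availability.
one_try = availability(0.99, 1)
with_retry = availability(0.99, 2)
```

The catch, as the story above shows, is that the independence assumption breaks for a query of death: every retry fails for the same reason, so extra attempts multiply the damage instead of the nines.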