"A Cascading Failure of Distributed Systems" , , and some more.

“A Cascading Failure of Distributed Systems” #docker, #kafka, #kubernetes and some more. https://medium.com/@daniel.p.woods/on-infrastructure-at-scale-a-cascading-failure-of-distributed-systems-7cff2a3cd2df

How did sudden logging-sidecar activity cause load on the docker daemons? Docker/containerd shouldn’t be on the path for sidecar->Kafka communication.

@Adrian_Colley FTA: “it was cumulatively enough to put a high load on the docker daemons for the nodes in the Kubernetes cluster.” I interpret that as “high CPU load”. If that is correct, it means the CPU load cumulatively started to hit the cgroup upper limits on CPU usage, which caused the docker daemon to report these nodes to Kubernetes as “unhealthy”, which in turn made Kubernetes frantically move stuff around until the cascade collapsed under its own weight.
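
One way to sanity-check that interpretation (the post-mortem doesn't show this, so it's only a sketch): if the containers really were hitting their cgroup CPU quotas, the kernel's cpu.stat for their cgroups would show throttled enforcement periods. The path below is an assumption; it depends on cgroup v1 vs v2 and on how kubelet lays out the pod cgroups.

```python
# Sketch: report how often a cgroup was CPU-throttled. The path is a
# placeholder; on cgroup v1 it is /sys/fs/cgroup/cpu/<pod cgroup>/cpu.stat,
# on cgroup v2 cpu.stat sits directly in the container's cgroup directory.
from pathlib import Path

def throttle_stats(cpu_stat_path: str) -> dict:
    """Parse a cpu.stat file into {key: int} (nr_periods, nr_throttled, ...)."""
    stats = {}
    for line in Path(cpu_stat_path).read_text().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

s = throttle_stats("/sys/fs/cgroup/cpu/kubepods/cpu.stat")  # hypothetical path
periods = max(s.get("nr_periods", 0), 1)
print(f"throttled in {100 * s.get('nr_throttled', 0) / periods:.1f}% of periods")
```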

I could understand it if the kubelet were seeing timeouts trying to talk to the docker daemon, but then why would the docker daemon be slow just because the production containers were maxing out their CPU limits?
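
For what it's worth, the daemon's responsiveness can be measured directly instead of being inferred from container CPU usage. A minimal sketch, assuming the Docker SDK for Python is installed on the node:

```python
# Time a ping to the local Docker daemon; a busy node with a healthy daemon
# should still answer quickly, while a wedged daemon will hit the timeout.
import time
import docker

client = docker.from_env(timeout=5)  # fail fast rather than hang

start = time.monotonic()
try:
    client.ping()
    print(f"docker daemon answered in {time.monotonic() - start:.2f}s")
except Exception as exc:  # connection errors and timeouts both land here
    print(f"docker daemon unresponsive after {time.monotonic() - start:.2f}s: {exc}")
```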

On the other hand, if the node’s load average was high, but the kubelet-containerd interaction was still working, then that isn’t a good reason to declare the node unhealthy.
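
The Ready condition each kubelet reports is the signal the control plane actually acts on when it evicts and reschedules pods, so it is worth looking at directly rather than at load averages. A minimal sketch, assuming the official kubernetes Python client and a working kubeconfig:

```python
# List each node's Ready condition as reported by its kubelet.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod

for node in client.CoreV1Api().list_node().items:
    ready = next(c for c in node.status.conditions if c.type == "Ready")
    print(f"{node.metadata.name}: Ready={ready.status} ({ready.reason})")
```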

Either way, there’s a design fault that contributed to the outage but didn’t get a mention in the post-mortem.