A Kubernetes cluster's DNS infrastructure was killed by the kernel’s out-of-memory handling. With DNS down, monitoring also failed. An hour’s downtime. “From this incident we learned that our Kubernetes DNS infrastructure is far from resilient enough.” It’s a €5bn annual revenue company.
The CoreDNS pods did not recover from the initial OOM kill; they kept running out of memory and getting killed again. This led to a total DNS outage for the entire cluster.
Because of the DNS outage, the aggregation layer service opened circuit breakers to downstream services as it was unable to resolve any of the hostnames and had no DNS caching.
As a side effect of the DNS outage, our internal monitoring for the cluster was also completely down, as it needs to talk to external services to trigger alerts and push metrics.
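The postmortem notes that the aggregation layer had no DNS caching, so every failed lookup turned straight into a failed downstream call. As a rough illustration of what client-side caching could look like, here is a minimal Python sketch of a cache that keeps serving the last known answer when resolution fails. This is not the company's fix: the class name `StaleTolerantDnsCache`, the 30-second TTL, and the injectable `resolve` and `clock` parameters are all assumptions made for the example.

```python
import socket
import time
from typing import Callable, Dict, List, Optional, Tuple


def _system_resolve(hostname: str) -> List[str]:
    """Resolve a hostname with the OS resolver and return its unique addresses."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


class StaleTolerantDnsCache:
    """Hypothetical helper: caches lookups and serves stale answers on failure."""

    def __init__(
        self,
        resolve: Optional[Callable[[str], List[str]]] = None,
        ttl_seconds: float = 30.0,  # assumed value, tune for your environment
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve or _system_resolve
        self._ttl = ttl_seconds
        self._clock = clock
        # hostname -> (time the answer was stored, addresses)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, hostname: str) -> List[str]:
        now = self._clock()
        cached = self._entries.get(hostname)

        # Fresh entry: answer from the cache without touching DNS.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]

        try:
            addresses = self._resolve(hostname)
        except OSError:
            # socket.gaierror is a subclass of OSError. If DNS is down but we
            # have an older answer, prefer stale data over failing the call.
            if cached is not None:
                return cached[1]
            raise

        self._entries[hostname] = (now, addresses)
        return addresses
```

A caller would create one cache per process and use `cache.lookup("some-service.example")` in place of a direct resolver call. Serving stale addresses is a trade-off: it rides out a resolver outage, but it can also keep traffic pointed at endpoints that have since moved, so the TTL and how long stale entries are tolerated deserve some thought.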