Online fashion store downed by OOMkiller

EdS · June 30, 2019, 4:16pm

A Kubernetes cluster’s DNS infrastructure was killed by the kernel’s out-of-memory handling. With DNS down, monitoring failed as well. The result was over an hour of downtime. “From this incident we learned that our Kubernetes DNS infrastructure is far from resilient enough.” It’s a €5bn annual revenue company.

github.com

zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md

# Total DNS outage in Kubernetes cluster

## Summary

On Monday 7 January 2019, all web product and outfit pages of the [Zalando
Fashion Store][zalando_de] were returning a high number of errors to customers
for over 1 hour. The errors were coming from the main data aggregation layer
service running in one of our Kubernetes clusters and were ultimately caused by
an outage of the Kubernetes cluster DNS infrastructure.

## Timeline

The incident started when calls from the aggregation layer to one of its many
downstream services began to time out. The aggregation layer returned 404s for
the timed-out calls, which resulted in clients retrying, creating a spike in
requests to the aggregation layer.

![Ingress Traffic](images/jan-2019-dns-outage/ingress_traffic.png)

The spike in requests from the aggregation layer then resulted in an equivalent

This file has been truncated.

The CoreDNS pods did not recover from the initial OOM kill: they kept running out of memory and getting killed again. This led to a total DNS outage for the entire cluster.

Because of the DNS outage, the aggregation layer service opened circuit breakers to downstream services as it was unable to resolve any of the hostnames and had no DNS caching.
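To make the "no DNS caching" point concrete, here is a minimal sketch of a client-side cache that keeps serving the last known address when lookups start failing. It is only an illustration, not what the postmortem describes or recommends: the class name `CachingResolver`, the 30-second TTL, and the injectable `lookup` and `clock` parameters are all assumptions made for this example.

```python
import socket
import time


class CachingResolver:
    """Hypothetical helper: cache lookups and serve stale answers if DNS fails."""

    def __init__(self, ttl_seconds=30.0, lookup=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # assumed TTL, not taken from the postmortem
        self._lookup = lookup           # injectable so the sketch can run offline
        self._clock = clock
        self._cache = {}                # hostname -> (address, expiry time)

    def resolve(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is not None and now < entry[1]:
            return entry[0]             # fresh cache hit, no DNS query sent
        try:
            address = self._lookup(hostname)
        except OSError:                 # socket.gaierror is a subclass of OSError
            if entry is not None:
                return entry[0]         # DNS is down: fall back to the stale address
            raise                       # never resolved before, nothing to fall back to
        self._cache[hostname] = (address, now + self.ttl_seconds)
        return address
```

A service would call `resolver.resolve("some-service.example")` before opening a connection. The trade-off is that a stale address may point at a backend that has since moved, so this only softens a DNS outage rather than removing the dependency.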

As a side effect of the DNS outage, our internal monitoring for the cluster was also completely down, as it needs to talk to external services to trigger alerts and push metrics.

via Mastodon here and here.