Postmortem of the recent AWS/DynamoDB outage that affected quite a lot of things.

https://aws.amazon.com/message/5467D2/

TL;DR: usage patterns that developed around new features caused problems in unexpected ways. Growing pains?

Wide adoption of a new feature raises the possible peak load, followed by avalanche-effect failures when that peak eventually arrives. Good set of action points going forward, though.

Or, slightly more granularly: a service gets heavier (due to adoption of a new feature), but the timeout for its responses doesn't adjust, so some requests start timing out. Requestors respond to timeouts by retrying, which backs up the servers' queues and pushes previously-marginal requests past the timeout too, and everything snowballs. Fun!
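
The feedback loop is easy to reproduce in a toy queue model. Here's a minimal sketch (all the numbers and the scenario are invented for illustration, nothing to do with AWS's actual systems): a service that comfortably handles its steady load gets a short spike; if clients that time out simply give up, the backlog drains once the spike passes, but if they retry, the retries stack on top of the stale requests the server is still grinding through, and the queue keeps growing long after the original trigger is gone.

```python
# Toy discrete-time model of the snowball described above. The numbers and
# the scenario are invented for illustration; nothing here is AWS's design.

CAPACITY = 100      # requests the service can answer per tick
STEADY_LOAD = 90    # normal arrivals per tick (comfortably under capacity)
SPIKE_LOAD = 200    # arrivals per tick during a brief overload (ticks 10-12)
TIMEOUT = 2         # ticks a client waits before giving up on a request
TICKS = 60

def simulate(clients_retry: bool) -> list[int]:
    queue: list[int] = []   # enqueue tick of every request copy still waiting
    depths = []
    for tick in range(TICKS):
        arrivals = SPIKE_LOAD if 10 <= tick <= 12 else STEADY_LOAD
        queue.extend([tick] * arrivals)

        # The service answers the oldest CAPACITY requests. It can't tell
        # that some of those clients have already given up, so stale
        # requests still burn capacity.
        queue = queue[CAPACITY:]

        if clients_retry:
            # Every request that just crossed the timeout gets a fresh copy
            # put on the queue by its impatient client, while the abandoned
            # copy is still sitting there -- that's the feedback loop.
            just_timed_out = sum(1 for t in queue if tick - t == TIMEOUT)
            queue.extend([tick] * just_timed_out)

        depths.append(len(queue))
    return depths

print("clients give up: ", simulate(False)[::10])
print("clients retry:   ", simulate(True)[::10])
```

The only difference between the two runs is whether a timeout spawns a new copy of the request; that alone is enough to turn a three-tick blip into a backlog that keeps growing.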

It seems to me that timeouts which are too long can cause trouble, as can timeouts which are too short. I wonder if there’s a better way.
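
For what it's worth, one widely used partial answer (not something this postmortem prescribes, just the usual folklore) is to make the retries themselves better behaved: cap the number of attempts and add randomized, exponentially growing delays so a crowd of timed-out clients doesn't come back in lockstep. A sketch, with made-up names and parameters:

```python
import random
import time

def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call `request` (a zero-argument callable that raises on timeout or
    failure), retrying with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure instead of piling on
            # Sleep a random amount up to an exponentially growing, capped
            # bound, so retries from many clients spread out in time.
            bound = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, bound))
```

It doesn't resolve the too-long/too-short tension, but it at least keeps the retries from amplifying the overload they're reacting to.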