Postmortem of the recent AWS/DynamoDB outage that affected quite a lot of things.

https://aws.amazon.com/message/5467D2/

TL;DR: usage patterns that developed around new features caused problems in unexpected ways. Growing pains?

Wide adoption of a new feature raises the possible peak load, followed by avalanche-effect failures when that peak eventually arrives. Good set of action points going forward, though.

Or, slightly more granularly: a service gets heavier (due to adoption of a new feature), but the timeout for its responses doesn't adjust, so some requests start timing out. Requestors respond to timeouts by retrying, which backs up the servers' queues and pushes previously-marginal requests past the timeout too, and everything snowballs. Fun!
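
The feedback loop is easy to reproduce in a toy queue model. Here's a minimal sketch (all the numbers and the scenario are invented for illustration, nothing to do with AWS's actual systems): a service that comfortably handles its steady load gets a short spike; if clients that time out simply give up, the backlog drains once the spike passes, but if they retry, the retries stack on top of the stale requests the server is still grinding through, and the queue keeps growing long after the original trigger is gone.

```python
# Toy discrete-time model of the snowball described above. The numbers and
# the scenario are invented for illustration; nothing here is AWS's design.

CAPACITY = 100      # requests the service can answer per tick
STEADY_LOAD = 90    # normal arrivals per tick (comfortably under capacity)
SPIKE_LOAD = 200    # arrivals per tick during a brief overload (ticks 10-12)
TIMEOUT = 2         # ticks a client waits before giving up on a request
TICKS = 60

def simulate(clients_retry: bool) -> list[int]:
    queue: list[int] = []   # enqueue tick of every request copy still waiting
    depths = []
    for tick in range(TICKS):
        arrivals = SPIKE_LOAD if 10 <= tick <= 12 else STEADY_LOAD
        queue.extend([tick] * arrivals)

        # The service answers the oldest CAPACITY requests. It can't tell
        # that some of those clients have already given up, so stale
        # requests still burn capacity.
        queue = queue[CAPACITY:]

        if clients_retry:
            # Every request that just crossed the timeout gets a fresh copy
            # put on the queue by its impatient client, while the abandoned
            # copy is still sitting there -- that's the feedback loop.
            just_timed_out = sum(1 for t in queue if tick - t == TIMEOUT)
            queue.extend([tick] * just_timed_out)

        depths.append(len(queue))
    return depths

print("clients give up: ", simulate(False)[::10])
print("clients retry:   ", simulate(True)[::10])
```

The only difference between the two runs is whether a timeout spawns a new copy of the request; that alone is enough to turn a three-tick blip into a backlog that keeps growing.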

It seems to me that timeouts which are too long can cause trouble, as can timeouts which are too short. I wonder if there’s a better way.
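
For what it's worth, one widely used partial answer (not something this postmortem prescribes, just the usual folklore) is to make the retries themselves better behaved: cap the number of attempts and add randomized, exponentially growing delays so a crowd of timed-out clients doesn't come back in lockstep. A sketch, with made-up names and parameters:

```python
import random
import time

def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call `request` (a zero-argument callable that raises on timeout or
    failure), retrying with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure instead of piling on
            # Sleep a random amount up to an exponentially growing, capped
            # bound, so retries from many clients spread out in time.
            bound = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, bound))
```

It doesn't resolve the too-long/too-short tension, but it at least keeps the retries from amplifying the overload they're reacting to.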