Two week outage for 400 customers of Atlassian

An update on the 14th day says it’s all fixed

(Atlassian has $2 billion annual revenues, providing on-site and cloud-based software for software engineering. The 400 afffected customers will be businesses who will have had no software productivity for the duration.)

Causes given as:

  • Communication gap. First, there was a communication gap between the team that requested the deactivation and the team that ran the deactivation. Instead of providing the IDs of the intended app being marked for deactivation, the team provided the IDs of the entire cloud site where the apps were to be deactivated.
  • Faulty script. Second, the script we used provided both the “mark for deletion” capability used in normal day-to-day operations (where recoverability is desirable), and the “permanently delete” capability that is required to permanently remove data when required for compliance reasons. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.

via this discussion at HN which links to
Inside the longest Atlassian outage