Postmortems

Sharing some (not) fun we had the other day at Klout with the leap second
http://engineering.klout.com/2015/07/leap-second-induced-downtime-post-mortem

Thanks for posting this. I see (unrelated) reports of high CPU usage in various applications - possibly the afflicted parties are still running old kernels, as this looks similar to the 2012 issue. See

If you knew that this was coming, and knew that you had problems in the past, why didn’t you take a preventive approach this time? Smearing the change seems to have worked well for Google in the past.

Two observations on smearing. First, several financial exchanges implemented their own chosen methods of smearing, with the result that they were mutually adrift by variable amounts in the hours before and after the leap. Second, Google has publicly visible timeservers which provide Google time, which is indeed smeared across leaps, but it’s not clear whether and how these servers should be used by the public.

I have a very small team, so our capacity to plan and take preventative measures is correspondingly small. Given the inconsistent failure modes in the prior leap second events, “unknown unknowns” and competing priorities, it was difficult to give this the advance attention it was owed. However, I plan to take the damage done here into the conversation in advance of the next leap second; smearing will be one of the topics. Hopefully this will be our last “damn, that hurt” post mortem that’s leap second related.

@EdS, well, you’d roll your own.

@Ian_Kallen I’m working in IT operations, and I hate when people end up feeding The Machine their own blood. The aim of postmortems is to do better next time. If only to avoid people burning out by repeated stress.

@EdS See http://developerblog.redhat.com/2015/06/01/five-different-ways-handle-leap-seconds-ntp/ for how to do smearing without relying on an external entity.

Oh wow @Michael_Stapelberg reading the linked ntpd discussion at http://bugs.ntp.org/show_bug.cgi?id=2745 gives such a sinking feeling. See comment 23. It kind of looks like they had 3 months of debate then made a change the day before the leap second.
I like Google’s slew, but they haven’t fully specified what they do. I’d want as many systems as possible to use the same tactics - not some homegrown approximation of each other. There could be a http://slew.pool.ntp.org for example.
With leap seconds in play, and a time representation that excludes them, either time must be non-monotonic, or it must be up to half a second adrift and run at a false rate. A rock and a hard place.
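To make that trade-off concrete, here is a minimal sketch in Python. It is not Google's method (which, as noted, isn't fully specified) and not what ntpd does; it assumes a simple linear smear centred on the leap, and the function names, the `window` parameter and the idea of feeding in an elapsed count of true seconds `t` are all made up for illustration.

```python
def stepped_clock(t, leap_at):
    """Hypothetical clock that excludes leap seconds by stepping.

    t is elapsed true (SI) seconds; the inserted second starts at leap_at.
    The clock runs through the inserted second, then steps back by one
    second, so the same readings occur twice (non-monotonic).
    """
    return t if t < leap_at + 1.0 else t - 1.0


def smeared_clock(t, leap_at, window=86400.0):
    """Hypothetical linear smear centred on the leap (an assumption).

    The extra second is absorbed evenly over `window` seconds, so the
    clock stays monotonic but runs at a rate of (1 - 1/window) inside
    the window and is about half a second off at the leap itself.
    """
    start = leap_at - window / 2.0
    frac = min(1.0, max(0.0, (t - start) / window))
    return t - frac


if __name__ == "__main__":
    leap_at, window = 100000.0, 1000.0  # illustrative values only
    for dt in (-0.5, 0.5, 1.0, 1.5):
        t = leap_at + dt
        print(dt, stepped_clock(t, leap_at), smeared_clock(t, leap_at, window))
```

The stepped clock goes backwards at the end of the inserted second; the smeared clock never does, but it disagrees with the stepped one by roughly half a second around the leap and ticks slightly slow for the whole window. Two sites picking different windows or curves will also disagree with each other, which is the argument for everyone using the same tactic.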

I think we’d be better off with fewer events like this. For example, aggregate 10 leap-seconds into one 10-second leap.

There doesn’t seem to be a good reason to keep the nominal clocks within one second of the “ideal” and each clock adjustment event like this costs us a lot collectively.

Part of me wants to agree. But the less often we encounter a leap, the less experience we have and the more likely we are to neglect handling it - see for example the year 2000 leap year, which was a one in 400-year event. Also a bigger jump might well affect more systems, and more severely.
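For anyone who hasn’t looked at why 2000 was the rare case: the Gregorian rule has a century exception and a 400-year exception to that exception, and code that implements only part of the rule gets 2000 wrong. A short sketch in Python (the function name is just for illustration):

```python
def is_leap_year(year):
    """Gregorian rule: every 4th year, except centuries, except every 400th."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


if __name__ == "__main__":
    for y in (1900, 2000, 2012, 2015, 2100):
        print(y, is_leap_year(y))
```

A version that stops at the century exception treats 2000 as a common year, and nothing exercises that branch between 1600 and 2000.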

My rationale is that we cannot dedicate enough resources or be vigilant enough if this happens frequently, so it’s all wishy-washy, “hopefully I will not be impacted” (even if I’m aware a leap second is upcoming). By making it less frequent we can handle it more robustly. Currently it’s not frequent enough for all the kinks to be ironed out so that it becomes a non-event, and not rare enough for anyone to afford taking time to prepare for each event. I’m not insisting this is right, but it seems like a reasonable solution.

I could maybe be happy with a solution which says we wait until the difference is 15 mins. After all, the only thing we’re preserving with leap seconds is some approximation to local noon, and that’s almost always adrift because timezones are at least an hour wide.
The leaps would then be so rare that it’s akin to saying we just don’t need them.
In fact why not: by the time local noon has drifted by an hour, people will be used to the idea that it’s drifted.