Postmortems

Leap seconds: the most recent real one being a year ago

Here’s a collection of incidents from that time:
http://six.pairlist.net/pipermail/leapsecs/2012-July/004260.html

See particularly
http://blog.cloudbees.com/2012/07/cloudbees-postmortem-on-two-recent.html
and
http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix/
(“MySQL and Java servers had a huge spike in CPU”)

Previously we had a post about an illusory leap second causing a bug:
https://plus.google.com/u/0/108637477974399451867/posts/YQyBWToLvp5
which contains a link to a Google blog from 2011 explaining that in their network of datacenters they apply a smoothing function to their time service, avoiding leap seconds as jumps but not falling out of sync with world time:


Time, technology and leaping seconds
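To make the idea of a smoothing function concrete, here is a minimal sketch of one smooth shape that fits that description: a cosine ramp that applies the extra second gradually over a window instead of as a jump. The function name `smear_fraction` and the 20-hour window are assumptions for illustration, not a description of any production time service.

```python
import math

def smear_fraction(t: float, window: float) -> float:
    """Fraction of the leap second already applied, t seconds into the window.

    Hypothetical helper: ramps smoothly from 0.0 to 1.0 so the served time
    never jumps, yet ends the window in step with world time.
    """
    t = min(max(t, 0.0), window)  # clamp to the smear window
    return (1.0 - math.cos(math.pi * t / window)) / 2.0

WINDOW = 20 * 3600.0  # assumed window length, in seconds

for t in (0.0, WINDOW / 4, WINDOW / 2, WINDOW):
    print(f"{t:8.0f} s into the window: {smear_fraction(t, WINDOW):.3f} s applied")
```

The ramp is flat at both ends, so clients tracking the server see the rate of time change gently rather than abruptly.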

This is all so utterly stupid. If synchronized time at one-second resolution or better is required, then all interchange should happen against a fixed time reference and count from there, like the Unix epoch (seconds since 1.1.1970 00:00). Any kind of time zone or leap second stuff then becomes just a matter of display, right?

I think that’s UTC0, which GPS uses, but the thing is you drift by a second every leap second and go out of step with civil time: merely having midnight at a different moment could cause trouble. There’s a discussion ongoing as to whether to cancel leap seconds: after 40 years of experience there are those who think we should cancel them, and have a leap minute rather less often.

Civil time also uses the physical definition of a second, so a Unix epoch counter will “switch” between seconds at the very same instant as civil time. Count the seconds from the start of the epoch to civil-time midnight and store that somewhere so you can display it right, or, if need be, understand what other computers or users tell you in civil time.

However, it should NEVER ever have an effect on in-kernel stuff like mutexes or such. I’m really wondering what’s going on here.

No, civil time is periodically resynced to the Earth’s gradually and erratically slowing rotation. The seconds are the same length, yes, but the lengths of the days differ. Sometimes.

Yes, just update where in the epoch a civil point in time falls, whenever that changes. So the 1st of August next year will be at epoch second x. Then a leap second gets introduced; OK, then next year’s 1st of August starts at x+1. So what? No need for the kernel to worry about it.
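A minimal sketch of that proposal, assuming a hypothetical continuous counter that starts at 2012-01-01 00:00:00 UTC and a hand-written leap second table (`EPOCH`, `LEAP_AT` and `to_civil` are illustrative names, not a real API). The counter never jumps; only the display conversion knows about the leap second at the end of 30 June 2012.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical epoch for the continuous counter (an assumption for this sketch).
EPOCH = datetime(2012, 1, 1, tzinfo=timezone.utc)

# Counter values that ARE a leap second (displayed as 23:59:60).
# 182 days separate 2012-01-01 from 2012-07-01.
LEAP_AT = [182 * 86400]

def to_civil(counter: int) -> str:
    """Convert the continuous second counter to a UTC display string."""
    leaps_before = sum(1 for leap in LEAP_AT if leap < counter)
    if counter in LEAP_AT:
        # Inside the inserted second: show the previous minute with :60.
        prev = EPOCH + timedelta(seconds=counter - leaps_before - 1)
        return prev.strftime("%Y-%m-%dT%H:%M:") + "60Z"
    t = EPOCH + timedelta(seconds=counter - leaps_before)
    return t.strftime("%Y-%m-%dT%H:%M:%SZ")

x = 213 * 86400  # where 1 August would start without the leap second
for c in (182 * 86400 - 1, 182 * 86400, 182 * 86400 + 1, x + 1):
    print(c, to_civil(c))
```

With the leap second in the table, 1 August starts at `x + 1` rather than `x`, and nothing about the counter itself had to change.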

Oh, I see, run the kernel on ut0 and let userspace convert to utc.

Might be worth reading this essay:

There are basically three kinds of “time” in computer systems:

  1. time since an “epoch” as measured by the local system
  2. time since an “epoch” as agreed upon by the “network”
  3. “calendar” time as perceived by humans

The first is immune to issues around leap seconds, as @Andre_Fachat describes. Unfortunately, for a variety of reasons that actually make a fair bit of sense, most software systems work in terms of #2 & #3. The two main reasons are a) #1 is always calibrated against some notion of #2 & #3 at the moment the system started tracking time, so it is distorted not only by the inaccuracy of the local clock but also by when, and from what source, that initial calibration was performed, and b) it can produce surprising behaviour for humans when asked to follow calendar-based instructions like “do X exactly one year from now”.
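To illustrate point b), here is a small sketch of how “exactly one year from now” has two defensible readings that disagree. The start date is chosen for illustration because 2012 was a leap year.

```python
from datetime import datetime, timedelta, timezone

start = datetime(2012, 1, 1, tzinfo=timezone.utc)

# Reading 1: a fixed number of elapsed seconds (365 days' worth).
fixed = start + timedelta(seconds=365 * 86400)

# Reading 2: the same calendar date and time, one year later.
# (replace() raises ValueError for 29 February in a non-leap target year.)
calendar = start.replace(year=start.year + 1)

print(fixed.isoformat())     # 2012-12-31T00:00:00+00:00
print(calendar.isoformat())  # 2013-01-01T00:00:00+00:00
```

A system that only counts seconds gives the first answer; most humans expect the second.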

In the end, it is a mistake for a computer system to assume that its own sense of time is absolute and moves forward both monotonically and with perfect precision. You really should allow for the fact that time can “jitter” forward and backwards.
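One common way to allow for that, sketched here under the assumption that only elapsed time matters: measure timeouts against a monotonic clock rather than the wall clock, so a step in network or calendar time cannot stretch or shorten the wait. `wait_until` is a hypothetical helper; `time.monotonic` and `time.sleep` are the standard library calls.

```python
import time

def wait_until(predicate, timeout_s, clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns true or timeout_s seconds elapse.

    The deadline is computed from a monotonic clock, which is not affected
    when the wall clock is stepped forwards or backwards.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(0.01)  # brief pause between polls

print(wait_until(lambda: True, timeout_s=1.0))    # True
print(wait_until(lambda: False, timeout_s=0.05))  # False
```

The `clock` and `sleep` parameters are there only so the helper can be exercised with a fake clock.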

Interesting thoughts. Still, I’m not really convinced that this would be a kernel issue and not “just” a change in the value of the difference between the kernel timer and UTC (or the local version of UTC when the timer started). Thanks for commenting!

@Andre_Fachat I think you could credibly make the case that time type #2 doesn’t need to be in kernel space. In practice kernels usually keep both #1 & #2.

Even if you have #2 outside of kernel space you have some problems:

  1. The proper unit for “time since some epoch” is going to vary with your platform. So you’ll have fun translating back and forth with your API calls.

  2. Some parts of the POSIX API have notions of time that are very much framed by human perceptions of calendaring or network perceptions of time.

  3. Even if you dealt with #1 and expunged the relevant APIs in #2, you’ve not solved the problem, but just moved it to user space. You still have a single place that needs to provide notions of time type #2 to all processes (and possibly #3 as well), and both sides need to deal with the reality that time types #2 & #3 don’t follow the simple rules of time type #1.

It’s kind of the old trade-off: sure, you can maybe make the kernel simpler, but you haven’t solved the problem.