New Year Crash
Posted by endecotp on Fri 2 Jan 2009 at 12:21
My co-located server went down on the stroke of midnight on 31st December.

It came back up after power-cycling with nothing suspicious in its logs.

So, which "Y2K9" bug did I get bitten by?

 

Comments on this Entry

Leap Second?
Posted by Anonymous (72.11.xx.xx) on Fri 2 Jan 2009 at 18:58
Maybe it was the leap second that was added at the end of 2008.

For more info: http://lonesysadmin.net/2008/12/31/leap-second/comment-page-1/


Re: Leap Second?
Posted by endecotp (86.7.xx.xx) on Fri 2 Jan 2009 at 20:01
Yes, that's quite likely. I probably have code of my own that doesn't cope correctly when seconds=60. But that should only cause the affected process to crash. In this case the server went down. Have there been any kernel bugs in this area that I missed?
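
To illustrate the kind of userland problem meant by "seconds=60": Python's `datetime`, for example, rejects a seconds value of 60 outright, so naive timestamp parsing raises an exception on a leap second. Below is a minimal sketch of a parser that tolerates it. The function name `parse_utc_timestamp` and the `YYYY-MM-DD HH:MM:SS` input format are assumptions made for illustration, not code from the server discussed here.

```python
from datetime import datetime, timedelta, timezone


def parse_utc_timestamp(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' (UTC), tolerating a leap second (SS == 60).

    Hypothetical helper for illustration. Returns (moment, is_leap_second).
    datetime cannot represent second 60, so a leap second is folded onto
    the first second of the following minute and flagged separately.
    """
    date_part, time_part = text.split(" ")
    year, month, day = (int(x) for x in date_part.split("-"))
    hour, minute, second = (int(x) for x in time_part.split(":"))

    is_leap = second == 60
    if is_leap:
        # Build the timestamp at :59 first, then step forward one second.
        second = 59

    moment = datetime(year, month, day, hour, minute, second,
                      tzinfo=timezone.utc)
    if is_leap:
        moment += timedelta(seconds=1)
    return moment, is_leap


if __name__ == "__main__":
    print(parse_utc_timestamp("2008-12-31 23:59:60"))
```

Folding the leap second onto the next second loses the distinction between 23:59:60 and 00:00:00, which is why the sketch returns the flag as well. Whether that trade-off is acceptable depends on the application.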


Re: New Year Crash
Posted by Anonymous (217.216.xx.xx) on Mon 5 Jan 2009 at 17:58
Is your server running on a Zune-wintendo? :))))))


Re: New Year Crash
Posted by dkg (216.254.xx.xx) on Tue 6 Jan 2009 at 17:57
I saw the same thing on a machine that is running a modified older kernel. Fortunately, I had a timestamped, logged serial console attached to that machine, so I could log the bug:
2009-01-01_00:00:00.39399 BUG: at arch/i386/kernel/smp.c:546 smp_call_function()
2009-01-01_00:00:00.42740  [] smp_call_function+0x66/0x10b
2009-01-01_00:00:00.42741  [] find_busiest_group+0x1b4/0x4c5
2009-01-01_00:00:00.42742  [] retrigger_next_event+0x0/0x96
2009-01-01_00:00:00.42742  [] on_each_cpu+0x18/0x39
2009-01-01_00:00:00.42743  [] clock_was_set+0x18/0x1a
2009-01-01_00:00:00.42743  [] second_overflow+0xad/0x225
2009-01-01_00:00:00.42744  [] lapic_next_event+0xd/0x10
2009-01-01_00:00:00.42745  [] clockevents_program_event+0x9c/0xa3
2009-01-01_00:00:00.42746  [] do_timer+0xd9/0x6eb
2009-01-01_00:00:00.42747  [] tick_program_event+0x3a/0x59
2009-01-01_00:00:00.42747  [] read_tsc+0x6/0x7
2009-01-01_00:00:00.42748  [] clockevents_program_event+0x9c/0xa3
2009-01-01_00:00:00.42748  [] tick_do_update_jiffies64+0x93/0xa8
2009-01-01_00:00:00.42749  [] tick_nohz_update_jiffies+0x46/0x56
2009-01-01_00:00:00.42749  [] smp_apic_timer_interrupt+0x24/0x7d
2009-01-01_00:00:00.42750  [] hrtimer_start+0xf7/0x101
2009-01-01_00:00:00.42750  [] apic_timer_interrupt+0x28/0x30
2009-01-01_00:00:00.42751  [] default_idle+0x0/0x55
2009-01-01_00:00:00.42751  [] native_safe_halt+0x2/0x3
2009-01-01_00:00:00.42752  [] default_idle+0x3a/0x55
2009-01-01_00:00:00.42753  [] cpu_idle+0xb5/0xd6
2009-01-01_00:00:00.42753  =======================
The buggy kernel is of the 2.6.21 vintage, which is the same version reported in the comments on another related post. It looks like it was resolved in a patch from 2007.

You might also be interested in the lkml thread about leap second trouble.


Re: New Year Crash
Posted by endecotp (86.7.xx.xx) on Tue 6 Jan 2009 at 18:37
Thanks for that. Yes, it's a 2.6.21 kernel so this is probably the bug. Unfortunately I don't know what was on the console when it crashed: I asked the co-location provider if they had a known network issue, and they went and power-cycled my box without asking!

Of course I know that I could avoid this sort of thing by upgrading to newer kernels as they come out. But with a remote box like this, where getting KVM access costs me real money, I have to balance the risk and cost of getting a kernel upgrade wrong against the potential benefit.

I'm still a bit surprised that I hadn't heard about this particular bug in advance, though.
