Weblog entry #77 for dkg
It has two Intel(R) Xeon(R) CPU 5160 @ 3.00GHz processors (according to /proc/cpuinfo, 8 1GiB 667MHz DDR2 ECC modules (part number HYMP512F72CP8N3-Y5), according to dmidecode, and an Intel Corporation 5000X Chipset Memory Controller Hub (rev 12) according to lspci.
The machine has been running stably for many months.
On the morning of March 31st, i started getting the following messages from the kernel, on the order of one pair of lines every 3 seconds:
Mar 31 07:04:38 zamboni kernel: [16883514.141275] EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x800 Mar 31 07:04:38 zamboni kernel: [16883514.141278] EDAC i5000: NON-Retry Errors, bits= 0x800A bit of digging turned up a redhat bug report that seems to suggest that these warnings are just noise, and should be ignorable. Another link thinks it's a conflict with IPMI, though i don't think this model actually has an IPMI subsystem correction: this machine does have IPMI, though i am not making use of it.
However, i also notice from munin logs that at the same time the error messages started, the machine exhibited a marked change in CPU activity (including in-kernel activity) and local timer interrupts:
I also note that more rescheduling interrupts started happening, and fewer megasas interrupts at about the same time. I'm not sure what this means.
A review of other logs and graphs on the system turns up no other evidence of interaction that might cause this kind of elevated activity.
One thought was that the elevated activity was just due to writing out a bunch more logs. So i tried removing the i5000_edac module just to keep dmesg and /var/log/kern.log cleaner. Leaving that turned off doesn't lower the CPU utilization or change the interrupts, though.
Any suggestions on what might be going on, or further diagnostics i should run? The machine is in production, and I'd really rather not take down the machine for an extended period of time to do a lengthy memory test. But i also don't want to see this kind of extra CPU usage (more than double the machine's baseline).
Comments on this Entry