Weblog entry #77 for dkg
The machine has two Intel(R) Xeon(R) CPU 5160 @ 3.00GHz processors (according to /proc/cpuinfo), 8 1GiB 667MHz DDR2 ECC modules (part number HYMP512F72CP8N3-Y5) according to dmidecode, and an Intel Corporation 5000X Chipset Memory Controller Hub (rev 12) according to lspci.
The machine has been running stably for many months.
On the morning of March 31st, i started getting the following messages from the kernel, on the order of one pair of lines every 3 seconds:
Mar 31 07:04:38 zamboni kernel: [16883514.141275] EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x800
Mar 31 07:04:38 zamboni kernel: [16883514.141278] EDAC i5000: NON-Retry Errors, bits= 0x800
A bit of digging turned up a Red Hat bug report that seems to suggest that these warnings are just noise, and should be ignorable. Another report suggests it's a conflict with IPMI, though i didn't think this model actually had an IPMI subsystem. (Correction: this machine does have IPMI, though i am not making use of it.)
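The "one pair of lines every 3 seconds" figure can be checked against the log itself, since each line carries a kernel timestamp. A minimal sketch in Python, assuming the line format shown above; the function names and the `marker` parameter are made up for illustration:

```python
import re

# Bracketed kernel timestamp, then the EDAC i5000 prefix as it appears in the log.
EDAC_LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+EDAC i5000(?: MC\d+)?: (.*)")


def edac_events(lines):
    """Return (uptime_seconds, message) for every EDAC i5000 line."""
    events = []
    for line in lines:
        match = EDAC_LINE.search(line)
        if match:
            events.append((float(match.group(1)), match.group(2).strip()))
    return events


def mean_gap(events, marker="NON-FATAL ERRORS Found"):
    """Average seconds between successive messages containing `marker`."""
    stamps = [ts for ts, msg in events if marker in msg]
    if len(stamps) < 2:
        return None  # not enough data to compute an interval
    return (stamps[-1] - stamps[0]) / (len(stamps) - 1)
```

Something like `mean_gap(edac_events(open("/var/log/kern.log")))` would then give the average spacing between pairs, counting only the first line of each pair.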
However, i also notice from munin logs that at the same time the error messages started, the machine exhibited a marked change in CPU activity (including in-kernel activity) and local timer interrupts:
![[Individual interrupts - by month]](http://dkg.fifthhorseman.net/blog/i5000_edac/irqstats-month.png)
![[CPU Usage - by month]](http://dkg.fifthhorseman.net/blog/i5000_edac/cpu-month.png)
I also note that more rescheduling interrupts started happening, and fewer megasas interrupts at about the same time. I'm not sure what this means.
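One way to put numbers on these changes outside munin is to sample /proc/interrupts twice and compare. A minimal sketch in Python, assuming the usual layout (a header row of CPU names, then one labelled row per interrupt source); the function names are made up for illustration:

```python
import time


def parse_interrupts(text):
    """Map each interrupt label to (total count across CPUs, description)."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break  # rows such as ERR carry fewer counters than CPUs
            counts.append(int(field))
        totals[label.strip()] = (sum(counts), " ".join(fields[len(counts):]))
    return totals


def interrupt_rates(before, after, seconds):
    """Interrupts per second for every label present in both samples."""
    return {
        label: (after[label][0] - before[label][0]) / seconds
        for label in after
        if label in before
    }


def sample_rates(seconds=10, path="/proc/interrupts"):
    """Read the file twice, `seconds` apart, and return per-label rates."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return interrupt_rates(before, after, seconds)
```

Calling `sample_rates()` and looking at the LOC and RES rows (local timer and rescheduling interrupts on typical x86 kernels), plus whichever row mentions megasas in its description, would give rates to compare against the graphs.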
A review of other logs and graphs on the system turns up no other evidence of interaction that might cause this kind of elevated activity.
One thought was that the elevated activity was just due to writing out a bunch more logs. So i tried removing the i5000_edac module just to keep dmesg and /var/log/kern.log cleaner. Leaving that turned off doesn't lower the CPU utilization or change the interrupts, though.
Any suggestions on what might be going on, or further diagnostics i should run? The machine is in production, and I'd really rather not take down the machine for an extended period of time to do a lengthy memory test. But i also don't want to see this kind of extra CPU usage (more than double the machine's baseline).
Comments on this Entry
How would one "activate automated scrubbing"? I'd be happy with just a link or two to read up on, if you have any you recommend.
Scrubbing is an option on most non-joke BIOSes when the chipset supports it. Just hunt it down (and upgrade your Dell firmware if an update is available); I very much doubt Dell would not offer it.
I'm interested in how i would scrub it from the kernel. It seems like i'd need to know:
- which physical memory region needs scrubbing
- how that physical memory region is currently in use
- how to re-allocate that memory (e.g. if it is fscache, how to flush that part of the cache, ideally without flushing the entire cache)
- how to overwrite that memory
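For the first item, the EDAC driver (while loaded) exposes per-controller and per-csrow error counters in sysfs, which at least narrows down where errors are being reported. A sketch in Python, assuming the layout described in the kernel's EDAC documentation (/sys/devices/system/edac/mc/mcN/ containing ce_count, ue_count, sdram_scrub_rate and csrowN/ subdirectories). Which of these files exist depends on the kernel version and driver, so missing ones come back as None. The function names are made up for illustration:

```python
import glob
import os

EDAC_ROOT = "/sys/devices/system/edac/mc"  # assumed sysfs location


def read_value(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def edac_counters(root=EDAC_ROOT):
    """Collect error counters for each memory controller and csrow."""
    report = {}
    for mc in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {
            name: read_value(os.path.join(mc, name))
            for name in ("ce_count", "ue_count", "sdram_scrub_rate")
        }
        entry["csrows"] = {
            os.path.basename(row): {
                "ce_count": read_value(os.path.join(row, "ce_count")),
                "ue_count": read_value(os.path.join(row, "ue_count")),
            }
            for row in sorted(glob.glob(os.path.join(mc, "csrow[0-9]*")))
        }
        report[os.path.basename(mc)] = entry
    return report
```

This only reads counters. It does not answer the remaining items (how the region is in use, how to re-allocate it, or how to overwrite it).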
Have you tried loading (or unloading) the IPMI kernel modules?
> EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x1410010
> EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-": (Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x10)
> EDAC MC0: CE row 2, channel 0, label "": (Branch=0 DRAM-Bank=3 RDWR=Read RAS=5071 CAS=508, CE Err=0x10000)
> EDAC i5000: THERMAL Error, bits= 0x400000
> EDAC i5000: DIMM-Spare Error, bits= 0x1000000
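The quoted messages hang together arithmetically: the register value in the first line is the bitwise OR of the masks reported in the four lines after it. A quick check in Python (the helper name is made up for illustration):

```python
def set_bit_masks(value):
    """Split an integer into its individual set bits, lowest first."""
    return [1 << i for i in range(value.bit_length()) if (value >> i) & 1]


reg = 0x1410010  # "1st NON-FATAL Err Reg" from the first quoted line
masks = set_bit_masks(reg)

# Masks from the other four lines: UE Err, CE Err, THERMAL, DIMM-Spare.
reported = [0x10, 0x10000, 0x400000, 0x1000000]

print([hex(m) for m in masks])
print(masks == reported)
```

By the same arithmetic, the 0x800 in the original post is a single set bit (bit 11).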
After trying all the usual things (memtest, removing hardware parts and swapping hardware parts) with no success, I read www.nikhef.nl/pub/projects/grid/gridwiki/index.php?title=Valentine_memory&redirect=no and after that, I disabled (blacklisted) the i5000_edac kernel module.
As far as I understand the problem, the i5000_edac module tries to get health information about the hardware over IPMI (internally), and so does the BIOS. If both happen to access the same resource at the same time, you get this error. The error only means: couldn't get health information!
It's very confusing, and it took me several weeks to track it down.
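To see which of the modules mentioned in this thread are actually loaded on a given box, the contents of /proc/modules can be filtered by name. A small sketch in Python; the prefixes are just the names that come up here (ipmi_*, edac_*, i5000_edac), and the function names are made up for illustration:

```python
def loaded_modules(text, prefixes=("ipmi_", "edac_", "i5000_edac")):
    """Return module names from /proc/modules content matching any prefix."""
    # The module name is the first whitespace-separated field on each line.
    names = [line.split()[0] for line in text.splitlines() if line.strip()]
    return sorted(name for name in names if name.startswith(prefixes))


def read_loaded_modules(path="/proc/modules"):
    """Convenience wrapper that reads the live module list."""
    with open(path) as f:
        return loaded_modules(f.read())
```

Running `read_loaded_modules()` before and after blacklisting would confirm whether the EDAC driver and any IPMI drivers are loaded at the same time.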
The kernel documentation for the subsystems in question is: edac.txt and IPMI.txt.
However, loading or unloading the edac modules doesn't cause a difference in the CPU consumption or interrupts on this machine, so i don't think things like edac_mc_poll_msec (which appears to default to 1000) have any relevant effect for me. I'm not sure what i should try with the ipmi subsystem.
2. You have to make sure the SMBIOS is programmed to do what you want; it may be filling the logs with crap and stealing processor time through SMIs to do it. This is often visible indirectly, and the kernel might be accounting it as interrupts or something else.
Reboot that box. If the BIOS is not a useless piece of crap, it will scrub the RAM. While at it, activate automated scrubbing if it is not activated already: Linux does not do it (yet) and you'd have to be insane to run a server without hardware (or at least BIOS SMI-based) memory scrubbing.
If a reboot doesn't fix the issue, the repair sequence is: reseat the RAM module *and* air-blast-clean the motherboard to remove shorts; replace the RAM module; check the PSUs; replace the motherboard and/or PSU.