Weblog entry #77 for dkg

EDAC i5000 non-fatal errors
Posted by dkg on Sat 9 Apr 2011 at 00:00
I've got a Debian GNU/Linux lenny installation (2.6.26-2-vserver-amd64 kernel) running on a Dell Poweredge 2950 with BIOS 2.0.1 (2007-10-27).

It has two Intel(R) Xeon(R) CPU 5160 @ 3.00GHz processors (according to /proc/cpuinfo), 8 1GiB 667MHz DDR2 ECC modules (part number HYMP512F72CP8N3-Y5) according to dmidecode, and an Intel Corporation 5000X Chipset Memory Controller Hub (rev 12) according to lspci.

The machine has been running stably for many months.

On the morning of March 31st, i started getting the following messages from the kernel, on the order of one pair of lines every 3 seconds:

Mar 31 07:04:38 zamboni kernel: [16883514.141275] EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x800
Mar 31 07:04:38 zamboni kernel: [16883514.141278] EDAC i5000: 	NON-Retry  Errors, bits= 0x800
A bit of digging turned up a redhat bug report that seems to suggest that these warnings are just noise, and should be ignorable. Another link thinks it's a conflict with IPMI. (Correction: i originally doubted that this model has an IPMI subsystem, but this machine does have IPMI, though i am not making use of it.)
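
To put a number on the message rate rather than eyeballing it, here is a minimal Python sketch that pulls the kernel timestamp and the register value out of lines like the two quoted above. The function names (parse_edac_line, set_bits, mean_interval) are made up for illustration, and the regular expression assumes exactly the message format shown in this post. It only reports which bit positions are set; naming those bits would need the chipset datasheet, which is not reproduced here.

```python
import re

# Matches the bracketed kernel timestamp and the hex value after "Reg=" or "bits=".
# Assumes the "EDAC i5000" message format quoted in this post.
EDAC_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+EDAC i5000.*?(?:Reg|bits)=\s*(0x[0-9a-fA-F]+)"
)

def parse_edac_line(line):
    """Return (seconds_since_boot, register_value), or None if the line does not match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2), 16)

def set_bits(value):
    """List the positions of the bits set in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

def mean_interval(timestamps):
    """Average gap in seconds between consecutive timestamps (None if fewer than two)."""
    if len(timestamps) < 2:
        return None
    ordered = sorted(timestamps)
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)

if __name__ == "__main__":
    sample = ("Mar 31 07:04:38 zamboni kernel: [16883514.141275] EDAC i5000 MC0: "
              "NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x800")
    stamp, reg = parse_edac_line(sample)
    print(stamp, hex(reg), set_bits(reg))
```

Feeding every matching line of /var/log/kern.log through parse_edac_line and passing the collected timestamps to mean_interval would give the average spacing between messages.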

However, i also notice from munin logs that at the same time the error messages started, the machine exhibited a marked change in CPU activity (including in-kernel activity) and local timer interrupts:

[Individual interrupts - by month]

[CPU Usage - by month]

I also note that more rescheduling interrupts started happening, and fewer megasas interrupts at about the same time. I'm not sure what this means.
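
The interrupt rates munin graphs can also be sampled by hand. The sketch below assumes the usual /proc/interrupts layout (a CPU header line, then rows of "label: per-CPU counts description"); the helper names parse_interrupts and rates are hypothetical. Rows are keyed by their label, so the local timer is "LOC" and rescheduling interrupts are "RES", while a device such as megasas appears under whatever IRQ number it was assigned on the machine.

```python
import time

def parse_interrupts(text):
    """Map each /proc/interrupts row label (e.g. 'LOC', 'RES', '16') to its count summed over CPUs."""
    totals = {}
    for line in text.splitlines()[1:]:  # the first line is the CPU header
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        count = 0
        for token in rest.split():
            if not token.isdigit():
                break  # reached the controller/device description
            count += int(token)
        totals[label.strip()] = count
    return totals

def rates(before, after, seconds):
    """Per-second rate for every label present in both snapshots."""
    return {k: (after[k] - before[k]) / seconds for k in after if k in before}

if __name__ == "__main__":
    interval = 5.0
    with open("/proc/interrupts") as fh:
        first = parse_interrupts(fh.read())
    time.sleep(interval)
    with open("/proc/interrupts") as fh:
        second = parse_interrupts(fh.read())
    for label, rate in sorted(rates(first, second, interval).items()):
        if rate > 0:
            print(f"{label:>6} {rate:10.1f}/s")
```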

A review of other logs and graphs on the system turns up no other evidence of interaction that might cause this kind of elevated activity.

One thought was that the elevated activity was just due to writing out a bunch more logs. So i tried removing the i5000_edac module just to keep dmesg and /var/log/kern.log cleaner. Leaving that turned off doesn't lower the CPU utilization or change the interrupts, though.

Any suggestions on what might be going on, or further diagnostics i should run? The machine is in production, and I'd really rather not take down the machine for an extended period of time to do a lengthy memory test. But i also don't want to see this kind of extra CPU usage (more than double the machine's baseline).
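
One low-impact diagnostic that needs no downtime is to break the extra CPU time down by state from /proc/stat. This is only a sketch: the field order in FIELDS follows the documented layout of the aggregate "cpu" line, and the helper names read_cpu_ticks and cpu_shares are invented for this example. A rise in "irq" or "softirq" rather than "system" would at least narrow down where the kernel time is going.

```python
import time

# Leading fields of the aggregate "cpu" line in /proc/stat, in order.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def read_cpu_ticks(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict of field -> ticks."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")

def cpu_shares(before, after):
    """Fraction of elapsed ticks spent in each state between two snapshots."""
    deltas = {k: after[k] - before[k] for k in after if k in before}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in deltas}
    return {k: d / total for k, d in deltas.items()}

if __name__ == "__main__":
    with open("/proc/stat") as fh:
        first = read_cpu_ticks(fh.read())
    time.sleep(5)
    with open("/proc/stat") as fh:
        second = read_cpu_ticks(fh.read())
    for state, share in cpu_shares(first, second).items():
        print(f"{state:>8} {share:6.1%}")
```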


Comments on this Entry

Posted by Anonymous (201.82.xx.xx) on Sat 9 Apr 2011 at 01:19
1. You need to scrub that RAM. If you do not, it will keep generating MCEs and might even go from a corrected error (CE) to an uncorrected one (UC) due to a second error, and the kernel will panic and halt the box.

2. You have to make sure the SMBIOS is programmed to do what you want, it may be filling the logs with crap and stealing processor time through SMIs to do it. This is often visible indirectly, and the kernel might be accounting it as interrupts or something else.

Reboot that box. If the BIOS is not a useless piece of crap, it will scrub the RAM. While at it, activate automated scrubbing if it is not activated already: Linux does not do it (yet) and you'd have to be insane to run a server without hardware (or at least BIOS SMI-based) memory scrubbing.

If a reboot doesn't fix the issue, the repair sequence is: reseat the RAM module *and* air-blast-clean motherboard to remove shorts; replace the RAM module; check PSUs; replace the motherboard and/or PSU.

Posted by dkg (2001:0xx:0xx:0xxx:0xxx:0xxx:xx) on Sun 10 Apr 2011 at 21:45
Thanks for this feedback. Is there a way that i can "scrub the RAM" directly (without a reboot)? Are you talking about just re-initializing some specific section of physical RAM? How would i know what memory addresses to scrub?

How would one "activate automated scrubbing"? I'd be happy with just a link or two to read up on, if you have one you recommend.

Posted by Anonymous (201.82.xx.xx) on Mon 11 Apr 2011 at 01:16
You have to rewrite the RAM to scrub it. Can only be safely done by the kernel or the hardware, really. SMBIOS doing it in SMM would race DMA/bus-mastering if it is an IO page.

Scrubbing is an option on most non-joke BIOSes when the chipset supports it, just hunt it down (and upgrade your Dell firmware if an update is available), I very much doubt Dell would not offer it.

Posted by dkg (216.254.xx.xx) on Mon 11 Apr 2011 at 15:11
I'm assuming that by SMM you mean System Management Mode.

I'm interested in how i would scrub it from the kernel -- It seems like i'd need to know:

  • which physical memory region needs scrubbing
  • how that physical memory region is currently in use
  • how to re-allocate that memory (e.g. if it is fscache, how to flush that part of the cache, ideally without flushing the entire cache)
  • how to overwrite that memory
Any pointers?

Posted by Anonymous (81.106.xx.xx) on Sat 9 Apr 2011 at 13:09
Most Dell boxes have IPMI... and it can/does steal CPU like this.

Have you tried loading (or unloading) the IPMI kernel modules?

Posted by dkg (2001:0xx:0xx:0xxx:0xxx:0xxx:xx) on Tue 12 Apr 2011 at 04:38
I currently have no ipmi modules loaded. Do you think i should try to load them? Can you give me an example of which modules are worth trying to load and what i should expect to see from loading them? I'd be happy with a pointer to relevant documentation.

Posted by dkg (216.254.xx.xx) on Fri 15 Apr 2011 at 17:46
Actually, loading ipmi_devintf, ipmi_si, and ipmi_msghandler makes the local timer interrupts jump from around 430/second to about 510/second, but doesn't seem to affect the load on the machine or the delay i'm seeing with the standard transactions.

Posted by Anonymous (91.89.xx.xx) on Sat 9 Apr 2011 at 15:10
I got a similar error on a PowerEdge 2950 after upgrading to the lenny (2.6.26) kernel:
> EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x1410010
> EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-": (Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x10)
> EDAC MC0: CE row 2, channel 0, label "": (Branch=0 DRAM-Bank=3 RDWR=Read RAS=5071 CAS=508, CE Err=0x10000)
> EDAC i5000: THERMAL Error, bits= 0x400000
> EDAC i5000: DIMM-Spare Error, bits= 0x1000000

After trying all the usual things (memtest, removing hardware parts and switching hardware parts) with no success, I read www.nikhef.nl/pub/projects/grid/gridwiki/index.php?title=Valentine_memory&redirect=no and after that, I disabled (blacklisted) the i5000_edac kernel module.

As far as I understand the problem, the i5000_edac module tries to get health information about the hardware over IPMI (internally). The BIOS also tries to get health information over IPMI (internally). If both systems happen to access the same resource at the same time, you get this error. The error only means: Couldn't get health information!

It's very confusing and it took me several weeks to look into it.

Posted by Anonymous (91.89.xx.xx) on Sat 9 Apr 2011 at 15:13
Forgot to say you have to reboot your machine after the error occurred!

Posted by dkg (2001:0xx:0xx:0xxx:0xxx:0xxx:xx) on Tue 12 Apr 2011 at 15:57
Some followup after a bit of digging:

The kernel documentation for the subsystems in question is: edac.txt and IPMI.txt.

However, loading or unloading the edac modules doesn't cause a difference in the CPU consumption or interrupts on this machine, so i don't think things like edac_mc_poll_msec (which appears to default to 1000) have any relevant effect for me. I'm not sure what i should try with the ipmi subsystem.
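
While the edac modules are loaded, the error counters described in edac.txt can be read from sysfs instead of being inferred from the log. The sketch below assumes the memory controller shows up as /sys/devices/system/edac/mc/mc0 and exposes ce_count and ue_count files there; whether the i5000 driver on this kernel populates them is an assumption, so missing files are reported as None. The function name read_edac_counters is hypothetical.

```python
import os

def read_edac_counters(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Read ce_count and ue_count from an EDAC memory-controller directory.

    Files that are missing or unreadable map to None instead of raising.
    """
    counters = {}
    for name in ("ce_count", "ue_count"):
        path = os.path.join(mc_dir, name)
        try:
            with open(path) as fh:
                counters[name] = int(fh.read().strip())
        except (OSError, ValueError):
            counters[name] = None
    return counters

if __name__ == "__main__":
    print(read_edac_counters())
```

Comparing two readings a minute apart would show whether corrected errors are actually accumulating or whether the log lines are arriving without any counted errors.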
