This is a short rewrite of a post I wrote elsewhere, but which is no longer easily searchable or accessible.
If you’ve got a DIMM that’s going bad and your system supports Machine Check Architecture (MCA) / Machine Check Exceptions (MCEs), you might see alerts about memory errors popping up in your logs or console output. They typically look something like this:
MCA: Bank 9, Status 0x8c000047000800c0
MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID...
Memory DIMMs have a small flash memory chip (EEPROM) on them, containing an important descriptor table called the Serial Presence Detect (SPD). This data tells the system the size, speed, timings, operating voltage, manufacturer, part number, overclocking profiles, and all sorts of other information about each DIMM. The SPD chip is accessed using the SMBus protocol, which is based on I2C.
Tools such as CPU-Z, RAMMon, and RW Everything can be used to read the SPD data by talking to the flash chip over a...