Kernel panic - not syncing: Machine check

Problem

The Opteron system crashes and generates the following messages:

CPU 1: Machine Check Exception: 4 Bank 4: f60c200100000813
TSC 377366e1b8aa ADDR 1fef51800
Kernel panic - not syncing: Machine check

Here is a similar machine check exception, as reported by the mcelog utility:

MCE 0
CPU 1 4 northbridge TSC 511338368e676
ADDR 46a9f888
Northbridge ECC error
ECC syndrome = e3
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9471c00000000a13 MCGSTATUS 0

Here the Northbridge ECC error also indicates a bad memory module.
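
The status values in the two messages above can be decoded by hand. The following is a minimal sketch, not part of any ASL or kernel tool. The function name decode_mc_status is hypothetical. Bits 63 through 57 follow the general x86 machine check status layout, while the bit positions used for the ECC flags and the syndrome are assumed from the K8 northbridge (bank 4) layout; they agree with the mcelog output above, which reports bit 46 set and a syndrome of e3.

```python
def decode_mc_status(status_hex):
    """Decode a few fields of a 64-bit machine check status value (hex string)."""
    status = int(status_hex, 16)

    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # the register holds a logged error
        "overflow": bit(62),         # an earlier error was overwritten
        "uncorrected": bit(61),      # hardware could not correct the error
        "address_valid": bit(58),    # the ADDR value is meaningful
        "context_corrupt": bit(57),  # processor state may be corrupt
        # Assumed K8 northbridge (bank 4) specific fields:
        "corrected_ecc": bit(46),
        "uncorrected_ecc": bit(45),
        "syndrome": (status >> 47) & 0xFF,
    }


# Status value from the mcelog example (corrected error).
corrected = decode_mc_status("9471c00000000a13")
print(hex(corrected["syndrome"]))    # 0xe3
print(corrected["corrected_ecc"])    # True

# Status value from the kernel panic example (uncorrected error).
fatal = decode_mc_status("f60c200100000813")
print(fatal["uncorrected_ecc"])      # True
```

Under these assumptions, the panic message decodes as an uncorrected ECC error and the mcelog record as a corrected one, which fits the fact that only the first crashed the system.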

Besides the kernel panic, memory errors can also be flagged by the EDAC module:

EDAC k8 MC1: extended error code: ECC error
EDAC k8 MC1: general bus error: participating processor(local node
response), time-out(no timeout) memory transaction type(generic read),
mem or i/o(mem access), cache level(generic)
EDAC MC1: CE page 0xcc9db, offset 0xd38, grain 8, syndrome 0x80, row
0, channel 1, label "": k8_edac

In this example, MC1 is the equivalent of CPU 1.
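
When scanning a long log, the CPU ID can be pulled out of each kind of message with a few regular expressions. This is a rough sketch only: the patterns and the function name find_cpu_id are hypothetical and were written against the three sample messages above, so other kernel or mcelog versions may format their output differently. For EDAC lines it simply follows the statement above that MCn corresponds to CPU n.

```python
import re

# Hypothetical patterns, matched only against the sample messages in this article.
PATTERNS = [
    re.compile(r"^CPU (\d+): Machine Check Exception"),  # kernel panic output
    re.compile(r"^CPU (\d+) \d+ northbridge"),           # mcelog output
    re.compile(r"^EDAC (?:k8 )?MC(\d+):"),               # EDAC output (MCn treated as CPU n)
]


def find_cpu_id(line):
    """Return the CPU ID found in a log line, or None if the line has none."""
    for pattern in PATTERNS:
        match = pattern.search(line.strip())
        if match:
            return int(match.group(1))
    return None


samples = [
    "CPU 1: Machine Check Exception: 4 Bank 4: f60c200100000813",
    "CPU 1 4 northbridge TSC 511338368e676",
    "EDAC k8 MC1: extended error code: ECC error",
    "ADDR 46a9f888",
]
for sample in samples:
    print(find_cpu_id(sample), "<-", sample)
```

The first three sample lines yield CPU ID 1, and the last line, which carries no CPU ID, yields None.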

Solution

The machine check exception indicates that the system has a bad memory module. To locate the bad module, first determine which CPU bank contains it. Almost all ASL multi-processor Opteron systems have multiple CPU banks. For example, a dual-processor system has two CPU banks, while a quad-processor system has four. The exception is the Marquis K820, which has only one CPU bank.

When the kernel generates the machine check exception, it reports a CPU ID. In the examples above, the CPU ID is 1. To identify the CPU bank, refer to the following tables:

Opteron system configured with single-core processor(s)

CPU ID    CPU Bank
0         1
1         2
2         3
3         4

Opteron system configured with dual-core processor(s)

CPU ID    CPU Bank
0         1
1         1
2         2
3         2
4         3
5         3
6         4
7         4

Opteron system configured with quad-core processor(s)

CPU ID    CPU Bank
0         1
1         1
2         1
3         1
4         2
5         2
6         2
7         2
8         3
9         3
10        3
11        3
12        4
13        4
14        4
15        4
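
The three tables follow one rule: the CPU bank is the CPU ID divided by the number of cores per processor, rounded down, plus one. Here is a small sketch of that rule. The function name cpu_bank is hypothetical, and the rule assumes that the cores of each processor are numbered consecutively, as they are in the tables above.

```python
def cpu_bank(cpu_id, cores_per_processor):
    """Map a kernel CPU ID to a 1-based CPU bank number.

    Assumes the cores of each processor are numbered consecutively,
    which is how the tables in this article are laid out.
    """
    if cpu_id < 0 or cores_per_processor < 1:
        raise ValueError("cpu_id must be >= 0 and cores_per_processor >= 1")
    return cpu_id // cores_per_processor + 1


# CPU ID 1, as in the examples above, on each processor type.
print(cpu_bank(1, 1))  # single-core: 2
print(cpu_bank(1, 2))  # dual-core:   1
print(cpu_bank(1, 4))  # quad-core:   1
```

The same CPU ID therefore points to a different bank depending on the processor type, so confirm whether the system uses single-core, dual-core, or quad-core processors before pulling any memory.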

Once the CPU bank number is identified, locate that bank on the motherboard. Some motherboards, such as the Tyan S2885, label CPU bank 1 as CPU 0. Others, such as the Supermicro H8DCE, label CPU bank 1 as CPU 1. Depending on the motherboard model, a CPU bank contains two, four, or eight memory slots. The most common configuration is four memory slots per CPU bank.

Now that the bad CPU bank on the motherboard has been located, remove all the memory modules from that bank. Afterward, the system should boot and run properly.

The last step is to identify the bad memory module within that group by running memtest86+ on each module individually. This is possible because all Opteron motherboards will boot with only one memory module installed.

Memtest86+ can be downloaded here:

ftp://ftp.aslab.com/pub/utility/memtestp.img

ftp://ftp.aslab.com/pub/utility/memtestp.README

To obtain replacement memory modules, please send an email to techsupport@aslab.com and provide the following information:

  1. The serial number or the invoice number of the system
  2. The shipping address
  3. Brief description of the problem (bad memory)