The Opteron system crashes and generates the following messages:
CPU 1: Machine Check Exception: 4 Bank 4: f60c200100000813
TSC 377366e1b8aa ADDR 1fef51800
Kernel panic - not syncing: Machine check
Here is a similar machine check exception, as reported by the mcelog utility:
MCE 0
CPU 1 4 northbridge TSC 511338368e676
ADDR 46a9f888
Northbridge ECC error
ECC syndrome = e3
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9471c00000000a13 MCGSTATUS 0
Here the Northbridge ECC error also indicates a bad memory module.
Besides causing a kernel panic, memory errors can also be flagged by the EDAC module:
EDAC k8 MC1: extended error code: ECC error
EDAC k8 MC1: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
EDAC MC1: CE page 0xcc9db, offset 0xd38, grain 8, syndrome 0x80, row 0, channel 1, label "": k8_edac
In this example, MC1 is the equivalent of CPU 1.
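When many log lines have to be checked, the number of interest can be pulled out with a short script. The following is only a sketch: the regular expressions are inferred from the three sample messages quoted above and may not match other kernel or mcelog versions, and the function name `extract_source` is hypothetical.

```python
import re

# Patterns inferred from the sample messages in this article only;
# other kernel, mcelog, or EDAC versions may format these lines differently.
PATTERNS = [
    ("CPU", re.compile(r"CPU (\d+): Machine Check Exception")),  # kernel panic text
    ("CPU", re.compile(r"CPU (\d+) \d+ northbridge")),           # mcelog record
    ("MC", re.compile(r"EDAC (?:k8 )?MC(\d+):")),                # EDAC driver text
]


def extract_source(line):
    """Return ("CPU", n) or ("MC", n) for a recognized line, else None."""
    for label, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return label, int(match.group(1))
    return None


if __name__ == "__main__":
    samples = [
        "CPU 1: Machine Check Exception: 4 Bank 4: f60c200100000813",
        "CPU 1 4 northbridge TSC 511338368e676",
        "EDAC MC1: CE page 0xcc9db, offset 0xd38, grain 8, syndrome 0x80",
    ]
    for sample in samples:
        print(extract_source(sample))
```

The label is returned alongside the number so that an EDAC memory controller number (MC) is not silently confused with a kernel CPU ID.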
The machine check exception indicates that the system has a bad memory module. To locate the bad module, first determine which CPU bank contains it. Almost all ASL multi-processor Opteron systems have multiple CPU banks. For example, a dual-processor system has two CPU banks, while a quad-processor system has four. The exception is the Marquis K820, which has only one CPU bank.
When the kernel generates a machine check exception, it reports a CPU ID. In the examples above, the CPU ID is 1. To identify the CPU bank, refer to the following tables:
Opteron system configured with single-core processor(s)
CPU ID | CPU Bank |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
Opteron system configured with dual-core processor(s)
CPU ID | CPU Bank |
---|---|
0 | 1 |
1 | 1 |
2 | 2 |
3 | 2 |
4 | 3 |
5 | 3 |
6 | 4 |
7 | 4 |
Opteron system configured with quad-core processor(s)
CPU ID | CPU Bank |
---|---|
0 | 1 |
1 | 1 |
2 | 1 |
3 | 1 |
4 | 2 |
5 | 2 |
6 | 2 |
7 | 2 |
8 | 3 |
9 | 3 |
10 | 3 |
11 | 3 |
12 | 4 |
13 | 4 |
14 | 4 |
15 | 4 |
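The three tables above follow one pattern: CPU IDs are numbered consecutively across the cores of each processor, so the bank is the CPU ID divided by the number of cores per processor (rounded down), plus one. A minimal sketch of that arithmetic, assuming every processor in the system has the same core count (the function name `cpu_bank` is hypothetical):

```python
def cpu_bank(cpu_id, cores_per_processor):
    """Map a kernel CPU ID to a 1-based CPU bank number.

    Assumes all processors have the same number of cores and that CPU IDs
    are assigned consecutively per processor, as in the tables above.
    """
    if cpu_id < 0:
        raise ValueError("cpu_id must be non-negative")
    if cores_per_processor < 1:
        raise ValueError("cores_per_processor must be at least 1")
    return cpu_id // cores_per_processor + 1


if __name__ == "__main__":
    # CPU ID 1 from the examples above, for each processor type.
    print(cpu_bank(1, 1))  # single-core: bank 2
    print(cpu_bank(1, 2))  # dual-core:   bank 1
    print(cpu_bank(1, 4))  # quad-core:   bank 1
```

The result is the logical bank number used in this article; the label printed on the motherboard may differ from it.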
Once the CPU bank number is identified, locate that bank on the motherboard. Labeling varies by manufacturer: some motherboards, such as the Tyan S2885, label CPU bank 1 as CPU 0, while others, such as the Supermicro H8DCE, label CPU bank 1 as CPU 1. Depending on the motherboard model, a CPU bank contains two, four or eight memory slots. The most common configuration is four memory slots per CPU bank.
Now that the bad CPU bank on the motherboard has been located, remove all the memory modules from that bank. Afterward, the system should boot and run properly.
The last step is to find the bad memory module within the group that was removed. This can be done by running memtest86+ on each module individually, which is possible because all Opteron motherboards will boot with only one memory module installed.
Memtest86+ can be downloaded here:
ftp://ftp.aslab.com/pub/utility/memtestp.img
ftp://ftp.aslab.com/pub/utility/memtestp.README
To obtain replacement memory modules, please send an email to techsupport@aslab.com and provide the following information: