Northbridge GART error

Problem

The operating system periodically generates the following machine check exceptions (MCEs).

MCE 0
CPU 0 4 northbridge TSC 91347b1f412f
ADDR 101dc0000
   Northbridge GART error
       bit61 = error uncorrected
   TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0

CPU 1: Silent Northbridge MCE
Northbridge status a60000010005001b
     GART TLB error generic level generic
    extended error gart error
     link number 0
     err cpu1
     processor context corrupt
     error address valid
     error uncorrected
     previous error lost
     error address 00000000f7fe0008

Solution

This MCE is harmless. The 64-bit kernel uses the AGP aperture as an IOMMU, and a known, documented hardware bug causes spurious GART errors in this configuration. The BIOS and Linux disable them. Unfortunately, the Linux MCE handler is too thorough and still picks them up as corrected events. Because the 32-bit kernel does not use the AGP aperture as an IOMMU, the northbridge GART error does not occur there.
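
To make the STATUS values in the logs above easier to read, the following is a minimal sketch that decodes their top flag bits. It assumes the architectural machine check status layout (bits 63 down to 57), which agrees with the "bit61 = error uncorrected" line in the log. The names FLAG_BITS, decode_flags and error_code are hypothetical helpers for illustration and are not part of mcelog or the kernel.

```python
# Sketch: decode the flag bits of a machine check STATUS value.
# Assumption: the architectural MCi_STATUS bit layout applies to the
# STATUS / "Northbridge status" values printed in the logs.

FLAG_BITS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # reporting of this error was enabled
    (59, "MISCV"),  # MISC register contents are valid
    (58, "ADDRV"),  # ADDR register contents are valid
    (57, "PCC"),    # processor context may be corrupt
]


def decode_flags(status):
    """Return the names of the flag bits set in a 64-bit status value."""
    return [name for bit, name in FLAG_BITS if (status >> bit) & 1]


def error_code(status):
    """Return the low 16-bit error code field of the status value."""
    return status & 0xFFFF


if __name__ == "__main__":
    # The two status values taken from the logs in the Problem section.
    for raw in ("a40000000005001b", "a60000010005001b"):
        status = int(raw, 16)
        print(raw, decode_flags(status), hex(error_code(status)))
```

Under this assumed layout, EN (bit 60) is clear in both values, which would be consistent with the explanation that the BIOS and Linux disable these errors while the MCE handler logs them anyway.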

The northbridge GART errors commonly occur under the following conditions:

  1. One or more RAID arrays are under heavy I/O load while they are still being initialized in the background. Both SCSI (LSI MegaRAID) and SATA (3ware Escalade) RAID controllers exhibit the behavior.
  2. Heavy disk I/O occurs on the SATA bus.