
Analysis of a System Crash Caused by Memory MCE Errors

2013-10-31 16:42
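Today a server crashed because of a memory problem. Analysis with the mcelog tool showed an "Error overflow" condition on memory reads (the memory is ECC, but there were too many errors for it to absorb). This pointed to a memory hardware fault; if it happens again, replacing the memory will have to be considered.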

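Final cause: a hardware fault, most likely in the motherboard. Because this is a production server, the motherboard and the memory were replaced at the same time to keep planned downtime short, and that resolved the problem.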

# more /var/log/messages

Oct 31 14:19:36 pingu_fd kernel: sbridge: HANDLING MCE MEMORY ERROR

Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010092

Oct 31 14:19:36 pingu_fd kernel: TSC 0 ADDR 428fc8840 MISC 204808e886 PROCESSOR 0:206d6 TIME 1383200376 SOCKET 0 APIC 0

Oct 31 14:19:36 pingu_fd kernel: sbridge: HANDLING MCE MEMORY ERROR

Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 10: 8800004800800092

Oct 31 14:19:36 pingu_fd kernel: TSC 0 ADDR 0 MISC 4900030243025000 PROCESSOR 0:206d6 TIME 1383200376 SOCKET 0 APIC 0
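
When many such lines accumulate, it can help to pull the bank number and the raw status word out of the log mechanically before decoding them. The following is a minimal sketch that assumes the exact "Machine Check Exception" line format shown above; the regular expression and the helper name `extract_mce_events` are illustrative and not part of mcelog or the kernel. The number printed right after "Machine Check Exception:" is treated here as the global MCG status, which is an assumption based on the `MCGSTATUS 0` field in the decoded output.

```python
import re

# Pattern for the "Machine Check Exception" line format seen in the log above.
# Both the pattern and the helper name are illustrative assumptions.
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception:\s*(?P<mcgstatus>[0-9a-fA-F]+) "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]{16})"
)


def extract_mce_events(lines):
    """Return one dict per log line that reports a machine check bank status."""
    events = []
    for line in lines:
        m = MCE_LINE.search(line)
        if m:
            events.append({
                "cpu": int(m.group("cpu")),
                "bank": int(m.group("bank")),
                "mcgstatus": int(m.group("mcgstatus"), 16),
                "status": int(m.group("status"), 16),
            })
    return events


# Lines copied from the /var/log/messages excerpt above.
sample = [
    "Oct 31 14:19:36 pingu_fd kernel: sbridge: HANDLING MCE MEMORY ERROR",
    "Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010092",
    "Oct 31 14:19:36 pingu_fd kernel: CPU 0: Machine Check Exception: 0 Bank 10: 8800004800800092",
]

for ev in extract_mce_events(sample):
    print(f"cpu={ev['cpu']} bank={ev['bank']} status={ev['status']:#018x}")
```

Lines that do not match the pattern, such as the "sbridge: HANDLING MCE MEMORY ERROR" banner, are simply skipped.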

Decoding the messages with mcelog gives the following:

# mcelog sandybridge-ep --ascii < mcelog-manu.txt

sbridge: HANDLING MCE MEMORY ERROR

Hardware event. This is not a software error.

CPU 0 BANK 5

MISC 244076f686 ADDR 1a6bca040

TIME 1383200376 Thu Oct 31 14:19:36 2013

MCG status:

MCi status:

Error overflow

Corrected error

MCi_MISC register valid

MCi_ADDR register valid

MCA: MEMORY CONTROLLER RD_CHANNEL2_ERR

Transaction: Memory read error

STATUS cc0000c000010092 MCGSTATUS 0

CPUID Vendor Intel Family 6 Model 45

SOCKET 0 APIC 0
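
The flags mcelog prints can be cross-checked by hand from the STATUS word. The sketch below decodes the upper flag bits and the low 16-bit MCA error code using the architectural IA32_MCi_STATUS layout described in the Intel SDM; the function name `decode_mci_status` and the lookup table are illustrative, and the corrected-error count field (bits 52:38) is only meaningful on processors that implement it.

```python
# Bit positions follow the architectural IA32_MCi_STATUS layout (Intel SDM).
# The function and table names are illustrative, not part of mcelog.

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEMORY_TXN = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_mci_status(status):
    """Decode the main flags and the MCA error code of an MCi_STATUS value."""
    def bit(n):
        return bool((status >> n) & 1)

    info = {
        "valid": bit(63),            # VAL
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: clear means a corrected error
        "enabled": bit(60),          # EN
        "misc_valid": bit(59),       # MISCV
        "addr_valid": bit(58),       # ADDRV
        "context_corrupt": bit(57),  # PCC
        # Assumption: only defined where the CPU reports corrected counts.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": status & 0xFFFF,
    }

    code = info["mca_code"]
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        info["memory_txn"] = MEMORY_TXN.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        info["channel"] = None if channel == 0xF else channel  # 0xF: unspecified
    return info


# STATUS value taken from the mcelog output above.
for key, value in decode_mci_status(0xCC0000C000010092).items():
    print(f"{key}: {value}")
```

For the STATUS value above, this reports the overflow bit set, the uncorrected bit clear, both MISC and ADDR valid, and a memory read error on channel 2, which agrees with the "Error overflow", "Corrected error", and `RD_CHANNEL2_ERR` lines printed by mcelog.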