I have an older HP Z440 tower with 4x8GB ECC DDR4, running Proxmox VE 6.4. Recently, it started logging machine check events (MCEs) every few seconds. I installed rasdaemon, which reports them as corrected memory read errors on channel 1. However, edac-util shows no errors at all. Memtest passed, but I understand that's normal when the errors are correctable.

There is only one socket, and the DIMMs are installed in slots 1, 3, 6, and 8 (which seems to be preferred for this model).

Am I actually having memory errors? How can I troubleshoot this further?

dmesg:

root@pve:~# dmesg
...
[ 5729.899255] mce_notify_irq: 20 callbacks suppressed
[ 5729.899260] mce: [Hardware Error]: Machine check events logged
[ 5732.907207] mce: [Hardware Error]: Machine check events logged
[ 5792.907319] mce_notify_irq: 19 callbacks suppressed
[ 5792.907323] mce: [Hardware Error]: Machine check events logged
[ 5793.899247] mce: [Hardware Error]: Machine check events logged
[ 5852.911342] mce_notify_irq: 11 callbacks suppressed
[ 5852.911347] mce: [Hardware Error]: Machine check events logged
[ 5853.903354] mce: [Hardware Error]: Machine check events logged

Errors from rasdaemon:

root@pve:~# ras-mc-ctl --errors | tail
1435 2023-05-12 14:58:05 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=5, mcgcap=0x07000c16, status=0xcc00014000010091, addr=0x4ccdc28c0, misc=0x40484886, walltime=0x645e9a4e, cpuid=0x000306f2, bank=0x00000007
1436 2023-05-12 14:58:06 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=8, mcgcap=0x07000c16, status=0xcc00020000010091, addr=0x4d5c831c0, misc=0x140383886, walltime=0x645e9a4f, cpuid=0x000306f2, bank=0x00000007
1437 2023-05-12 14:58:09 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x4ccdc28c0, misc=0x403aba86, walltime=0x645e9a52, cpuid=0x000306f2, bank=0x00000007
1438 2023-05-12 14:58:11 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x6fd8eee80, misc=0x140282886, walltime=0x645e9a54, cpuid=0x000306f2, bank=0x00000007
1439 2023-05-12 14:58:12 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x510122800, misc=0x140282886, walltime=0x645e9a55, cpuid=0x000306f2, bank=0x00000007
1440 2023-05-12 14:58:13 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=4, mcgcap=0x07000c16, status=0xcc00010000010091, addr=0x4ea312a80, misc=0x1403c3c86, walltime=0x645e9a56, cpuid=0x000306f2, bank=0x00000007
1441 2023-05-12 14:58:16 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x07000c16, status=0x8c00004000010091, addr=0x4ea342a80, misc=0x1403aba86, walltime=0x645e9a59, cpuid=0x000306f2, bank=0x00000007
1442 2023-05-12 14:58:17 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x07000c16, status=0x8c00004000010091, addr=0x50abf2900, misc=0x1404c4c86, walltime=0x645e9a5a, cpuid=0x000306f2, bank=0x00000007
1443 2023-05-12 14:58:18 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=8, mcgcap=0x07000c16, status=0xcc00020000010091, addr=0x52676fbc0, misc=0x140585886, walltime=0x645e9a5b, cpuid=0x000306f2, bank=0x00000007
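
As a cross-check on the rasdaemon output, the status values can be decoded by hand. This is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM (bit 63 valid, bit 62 overflow, bit 61 uncorrected, bits 52:38 corrected error count, bits 15:0 MCA error code, with memory controller codes of the form 0000 0000 1MMM CCCC). The helper name decode_mci_status is made up for this example and is not part of rasdaemon or any library.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter;
# decode_mci_status is a hypothetical helper, not a rasdaemon API.

# MMM field of a memory-controller error code (0000 0000 1MMM CCCC)
MEM_TRANSACTION = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    """Return the main architectural fields of a 64-bit MCi_STATUS value."""
    mca_code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        # Bits 52:38 hold the corrected error count (15 bits).
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": mca_code,
    }
    # Memory-controller compound error code: 0000 0000 1MMM CCCC
    if mca_code & 0xFF80 == 0x0080:
        info["transaction"] = MEM_TRANSACTION.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        # CCCC == 1111 means the channel is not specified.
        info["channel"] = None if channel == 0xF else channel
    return info


if __name__ == "__main__":
    # Two status values copied from the ras-mc-ctl output above.
    for value in (0xCC00014000010091, 0x8C00004000010091):
        print(hex(value), decode_mci_status(value))
```

Decoded this way, the status values agree with what rasdaemon prints: the uncorrected bit is clear, the corrected error count matches n_errors, and the error code is a memory read error on channel 1.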

No errors reported by edac:

root@pve:~# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
edac-util: No errors to report.

root@pve:/sys/devices/system/edac/mc# tail -n +1 mc*/ce_* mc*/dimm*/dimm_ce_count
==> mc0/ce_count <==
0

==> mc0/ce_noinfo_count <==
0

==> mc0/dimm0/dimm_ce_count <==
0

==> mc0/dimm3/dimm_ce_count <==
0

==> mc0/dimm6/dimm_ce_count <==
0

==> mc0/dimm9/dimm_ce_count <==
0
