
We have a few Dell machines (running RHEL 7.6), and we replaced the DIMM modules on them because of the errors we saw in the kernel messages.

After some time we checked the kernel messages again and found the following, and we can see the errors about the RAM memory (also related to this RHEL case - https://access.redhat.com/solutions/6961932):

[Mon May  8 21:08:01 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683580080 SOCKET 0 APIC 0
[Mon May  8 21:08:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x6f3c77 offset:0xc80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:2 rank:4)
[Mon May  8 21:08:21 2023] mce: [Hardware Error]: Machine check events logged
[Tue May  9 05:30:29 2023] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue May  9 05:30:29 2023] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue May  9 05:30:29 2023] {13}[Hardware Error]: event severity: corrected
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:  Error 0, type: corrected
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:  fru_text: B6
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:   section_type: memory error
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:   error_status: 0x0000000000000400
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:   physical_address: 0x000000446e0d5f00
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:   node: 1 card: 1 module: 1 rank: 0 bank: 3 row: 64982 column: 888 
[Tue May  9 05:30:29 2023] {13}[Hardware Error]:   error_type: 2, single-bit ECC
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: TSC 30d2ef7e9bfda 
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: ADDR 446e0d5f00 
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: MISC 0 
[Tue May  9 05:30:29 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1683610228 SOCKET 0 APIC 0
[Tue May  9 05:30:29 2023] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x446e0d5 offset:0xf00 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:0 channel_mask:2 rank:4)
[Tue May  9 05:30:51 2023] mce: [Hardware Error]: Machine check events logged
[Tue May  9 17:52:21 2023] perf: interrupt took too long (380026 > 7861), lowering kernel.perf_event_max_sample_rate to 1000
[Wed May 10 06:27:17 2023] warning: `lshw' uses legacy ethtool link settings API, link modes are only partially reported

Just to be sure that the messages above are not random, we decided to reboot the machines and see whether the bad messages about memory are reproduced.

But the error messages about the RAM memory still remain.

So we are confused about the problem that we see in the kernel messages.

How can it be that we still get errors about the RAM even though we replaced the DIMM modules?

I must give some additional information here about what we see in iDRAC:

(screenshot of the iDRAC hardware status page)

As we can see above, iDRAC does not complain about the DIMM modules or about the RAM memory.

So the question is - how come dmesg (the kernel messages) complains about the RAM memory even though all the DIMMs were replaced?

Is it possible that something else is bad and not the DIMM modules? For example, the motherboard of the Dell machine?

1 Answer


The error you see is a single-bit ECC correctable memory error that was corrected by hardware. These do not cause a component to be listed as failed in iDRAC, at least not until their number exceeds some internally defined threshold, but you should see the memory error logged in the iDRAC SEL (system event log).
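To track these corrected errors without digging through dmesg, you can read the CE counters that the kernel's EDAC driver exports under sysfs (on RHEL, under /sys/devices/system/edac/mc). A minimal sketch; the helper name `sum_ce_counts` is just for illustration, and the optional directory argument only exists so the logic can be exercised against a copy of the tree:

```shell
#!/bin/sh
# Sum the corrected-error (CE) counters exported by the EDAC driver.
# With no argument it reads the live kernel sysfs tree; pass another
# base directory (e.g. a saved copy) to run it elsewhere.
sum_ce_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    total=0
    for f in "$base"/mc*/ce_count; do
        [ -r "$f" ] || continue        # skip if the glob did not match
        n=$(cat "$f")
        total=$((total + n))
    done
    echo "$total"
}
```

A steadily climbing total after the DIMM swap points at the slot, riser, or cooling rather than the modules themselves.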

It is not recommended to mix single- and dual-rank modules, but your mileage may vary depending on the processor/motherboard revision.

  • So do you recommend finding the faulty DIMM (according to the slot reported in dmesg) and replacing it?
    – King David
    May 10 at 14:40
  • If these slots were unpopulated before, you may get away with blasting the dust out of them with canned air. I recommend first rearranging the DIMMs to see whether the error is slot- or module-related, and monitoring DIMM temperature, as correctable errors can be caused by insufficient cooling airflow and overheating. May 10 at 15:18
  • About whether the DIMMs were there before - yes, we replaced only the existing DIMMs; we did not insert a DIMM into a new slot, for sure.
    – King David
    May 10 at 15:23
  • About insufficient cooling airflow - can we detect it from the Linux CLI, for example with dmidecode?
    – King David
    May 10 at 15:25
  • Then rearrange the DIMMs and see whether the error migrates with the DIMM or stays with the slot. We use ipmitool sdr list for that. May 10 at 16:42
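Building on the ipmitool suggestion above, a small filter can pull just the temperature sensors out of the sdr output. A sketch, assuming the common pipe-separated `ipmitool sdr` line format; sensor names and columns vary by BMC vendor and firmware, so treat the pattern as an assumption to adjust:

```shell
#!/bin/sh
# Filter `ipmitool sdr` output down to temperature sensors and print
# them as "name: reading". Feed it via stdin, e.g.:
#   ipmitool sdr | dimm_temps
dimm_temps() {
    grep -i 'temp' | awk -F'|' '{
        gsub(/^ +| +$/, "", $1);   # trim sensor name
        gsub(/^ +| +$/, "", $2);   # trim reading
        print $1": "$2
    }'
}
```

Watching these readings while the machine is under load helps confirm or rule out the overheating theory before swapping any more hardware.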
