Reserver i3-1125G4 unstable when using PCIe slot

I have a Reserver i3-1125G4 running ProxMox from a mirrored pair of WD Blue SN570 2TB SSDs: one in the M-key M.2 slot and the other connected via the PCIe 4.0 x4 slot.

Ever since I started using this setup I have had system lock-ups that appear to be caused by the PCIe slot. The lock-ups occur at irregular intervals and do not seem to be related to temperature or load. Sometimes the system hangs only minutes after a reboot, other times it happens after more than a week. If I remove the SSD that is connected to the PCIe slot, the system runs without problems.

I have tried the following steps:

  • upgraded to a 90W power supply
  • replaced the ADT R24SF PCIe to M.2 extension cable with the shielded ADT K24MF version
  • replaced the ADT R24SF PCIe to M.2 extension cable with a direct PCIe to M.2 adapter (no cable)
  • updated the BIOS and reset the settings to default
  • tried ProxMox with the default Debian kernel rather than the ProxMox one
  • tried FreeBSD with bhyve to exclude the possibility that it’s Linux specific
  • swapped the two SSDs to exclude the possibility that one of them is faulty
  • tried without the SSD in the M.2 slot

Am I the only one with these problems? Is there perhaps some BIOS setting that I could try? I have run out of ideas and can only conclude that it must be a hardware problem.

Below is the console output of ProxMox when it hangs:

[16400.651838] watchdog: Watchdog detected hard LOCKUP on cpu 0
[16400.651842] Modules linked in: hid_logitech_hidpp joydev input_leds hid_logitech_dj hid_generic usbkbd usbmouse usbhid hid rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs netconsole ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter softdog nf_tables bonding tls sunrpc nfnetlink_log binfmt_misc nfnetlink xe snd_hda_codec_hdmi drm_gpuvm drm_exec snd_hda_codec_realtek gpu_sched drm_suballoc_helper drm_ttm_helper snd_hda_codec_generic intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common x86_pkg_temp_thermal intel_powerclamp coretemp snd_sof_pci_intel_tgl kvm_intel snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink kvm soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof irqbypass crct10dif_pclmul polyval_clmulni polyval_generic snd_sof_utils ghash_clmulni_intel snd_soc_hdac_hda sha256_ssse3 snd_hda_ext_core sha1_ssse3 snd_soc_acpi_intel_match aesni_intel snd_soc_acpi soundwire_generic_allocation
[16400.651885]  crypto_simd soundwire_bus cryptd rapl snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine intel_cstate i915 mei_pxp mei_hdcp snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep drm_buddy cmdlinepart ttm snd_pcm spi_nor drm_display_helper snd_timer mtd wmi_bmof snd mei_me cec pcspkr ee1004 soundcore cdc_acm 8250_dw mei rc_core i2c_algo_bit intel_pmc_core intel_vsec pmt_telemetry intel_hid sparse_keymap pmt_class acpi_tad acpi_pad mac_hid vhost_net vhost vhost_iotlb tap it87(OE) hwmon_vid efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c spi_pxa2xx_platform dw_dmac dw_dmac_core ahci intel_lpss_pci xhci_pci nvme spi_intel_pci intel_lpss xhci_pci_renesas crc32_pclmul nvme_core libahci xhci_hcd igc i2c_i801 spi_intel i2c_smbus nvme_auth idma64 video wmi pinctrl_tigerlake
[16400.651936] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: P        W  OE      6.8.4-3-pve #1
[16400.651939] Hardware name: Default string Default string/ODYSSEY-TGL-A, BIOS ODYSSEY-TGL-A_v2.0a 05/19/2022
[16400.651940] RIP: 0010:native_queued_spin_lock_slowpath+0x284/0x2d0
[16400.651945] Code: 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 c0 59 03 00 48 03 04 d5 e0 ac aa b9 4c 89 20 41 8b 44 24 08 85 c0 75 0b f3 90 <41> 8b 44 24 08 85 c0 74 f5 49 8b 14 24 48 85 d2 74 8b 0f 0d 0a eb
[16400.651947] RSP: 0018:ffffaa6380003ca8 EFLAGS: 00000046
[16400.651949] RAX: 0000000000000000 RBX: ffff8abe402fc1c0 RCX: 0000000000040000
[16400.651950] RDX: 0000000000000006 RSI: 00000000001c0001 RDI: ffff8abe402fc1c0
[16400.651951] RBP: ffffaa6380003cc8 R08: 0000000000000000 R09: 0000000000000000
[16400.651952] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8acdf02359c0
[16400.651953] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8abe4036b800
[16400.651954] FS:  0000000000000000(0000) GS:ffff8acdf0200000(0000) knlGS:0000000000000000
[16400.651956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16400.651957] CR2: 000059e219a47c7c CR3: 00000007fce36006 CR4: 0000000000f72ef0
[16400.651958] PKRU: 55555554
[16400.651959] Call Trace:
[16400.651960]  <NMI>
[16400.651963]  ? show_regs+0x6d/0x80
[16400.651968]  ? watchdog_hardlockup_check+0x15a/0x250
[16400.651972]  ? watchdog_overflow_callback+0x6b/0x80
[16400.651974]  ? __perf_event_overflow+0xeb/0x2d0
[16400.651979]  ? perf_event_overflow+0x19/0x30
[16400.651982]  ? handle_pmi_common+0x175/0x3c0
[16400.651989]  ? intel_pmu_handle_irq+0x108/0x480
[16400.651991]  ? x2apic_send_IPI_self+0x15/0x20
[16400.651995]  ? arch_irq_work_raise+0x28/0x40
[16400.651998]  ? perf_event_nmi_handler+0x2b/0x50
[16400.652001]  ? nmi_handle+0x5d/0x160
[16400.652004]  ? default_do_nmi+0x47/0x130
[16400.652008]  ? exc_nmi+0x1c2/0x290
[16400.652011]  ? end_repeat_nmi+0xf/0x60
[16400.652016]  ? native_queued_spin_lock_slowpath+0x284/0x2d0
[16400.652018]  ? native_queued_spin_lock_slowpath+0x284/0x2d0
[16400.652020]  ? native_queued_spin_lock_slowpath+0x284/0x2d0
[16400.652022]  </NMI>
[16400.652023]  <IRQ>
[16400.652024]  _raw_spin_lock+0x3f/0x60
[16400.652026]  qi_submit_sync+0x34d/0x8b0
[16400.652031]  qi_flush_iotlb+0x84/0xb0
[16400.652034]  intel_flush_iotlb_all+0x78/0x150
[16400.652036]  fq_flush_iotlb+0x26/0x40
[16400.652040]  fq_flush_timeout+0x2d/0xe0
[16400.652042]  ? __pfx_fq_flush_timeout+0x10/0x10
[16400.652045]  call_timer_fn+0x27/0x160
[16400.652049]  ? __pfx_fq_flush_timeout+0x10/0x10
[16400.652052]  __run_timers+0x262/0x300
[16400.652056]  run_timer_softirq+0x1d/0x40
[16400.652058]  __do_softirq+0xd6/0x31c
[16400.652061]  __irq_exit_rcu+0xd7/0x100
[16400.652064]  irq_exit_rcu+0xe/0x20
[16400.652066]  sysvec_apic_timer_interrupt+0x92/0xd0
[16400.652067]  </IRQ>
[16400.652068]  <TASK>
[16400.652069]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
[16400.652071] RIP: 0010:cpuidle_enter_state+0xce/0x470
[16400.652074] Code: 17 03 ff e8 f4 ee ff ff 8b 53 04 49 89 c6 0f 1f 44 00 00 31 ff e8 02 07 02 ff 80 7d d7 00 0f 85 e7 01 00 00 fb 0f 1f 44 00 00 <45> 85 ff 0f 88 83 01 00 00 49 63 d7 4c 89 f1 48 8d 04 52 48 8d 04
[16400.652075] RSP: 0018:ffffffffba203db8 EFLAGS: 00000246
[16400.652077] RAX: 0000000000000000 RBX: ffffca637fa19a50 RCX: 0000000000000000
[16400.652078] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[16400.652079] RBP: ffffffffba203df0 R08: 0000000000000000 R09: 0000000000000000
[16400.652080] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
[16400.652081] R13: ffffffffba46fa00 R14: 00000ee7c133bacb R15: 0000000000000003
[16400.652083]  ? cpuidle_enter_state+0xbe/0x470
[16400.652087]  cpuidle_enter+0x2e/0x50
[16400.652093]  call_cpuidle+0x23/0x60
[16400.652096]  do_idle+0x207/0x260
[16400.652098]  cpu_startup_entry+0x2a/0x30
[16400.652101]  rest_init+0xd0/0xd0
[16400.652103]  arch_call_rest_init+0xe/0x30
[16400.652106]  start_kernel+0x71b/0xb00
[16400.652109]  x86_64_start_reservations+0x18/0x30
[16400.652112]  x86_64_start_kernel+0xbf/0x110
[16400.652115]  secondary_startup_64_no_verify+0x184/0x18b
[16400.652119]  </TASK>
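
For anyone skimming the trace above, here is a minimal sketch (not part of the original report) that reduces a saved copy of such a log to the reliable frames inside the `<IRQ>` section. The helper name `irq_frames` and the regular expression are illustrative assumptions; frames that the kernel prefixes with `?` are skipped because the unwinder marks them as unreliable.

```python
import re

# Matches a dmesg stack frame such as
#   "[16400.652026]  qi_submit_sync+0x34d/0x8b0"
# Group 1 captures the optional "? " marker, group 2 the function name.
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def irq_frames(log_text):
    """Return the reliable function names found between <IRQ> and </IRQ>."""
    frames = []
    in_irq = False
    for line in log_text.splitlines():
        if "</IRQ>" in line:
            in_irq = False
        elif "<IRQ>" in line:
            in_irq = True
        elif in_irq:
            match = FRAME.match(line.strip())
            # Keep only frames without the "?" (unreliable) marker.
            if match and not match.group(1):
                frames.append(match.group(2))
    return frames


# A few lines copied from the trace above, used as sample input.
sample = """\
[16400.652023]  <IRQ>
[16400.652024]  _raw_spin_lock+0x3f/0x60
[16400.652026]  qi_submit_sync+0x34d/0x8b0
[16400.652031]  qi_flush_iotlb+0x84/0xb0
[16400.652042]  ? __pfx_fq_flush_timeout+0x10/0x10
[16400.652067]  </IRQ>
"""

if __name__ == "__main__":
    for name in irq_frames(sample):
        print(name)
```

Applied to the full trace, the reliable frames run from the timer softirq through `fq_flush_timeout`, `intel_flush_iotlb_all` and `qi_submit_sync` into `_raw_spin_lock`. Those functions belong to the Intel IOMMU (VT-d) code in the Linux kernel, which suggests that CPU 0 was spinning on a lock in the IOTLB invalidation path when the watchdog fired.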

FreeBSD trace

nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: resetting controller
nvme0: failing queued i/o
nvme0: WRITE sqid:2 cid:0 nsid:1 lba:682079376 len:24
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 p:0 sqid:2 cid:0 cdw0:0
nvme0: failing outstanding i/o
nvme0: WRITE sqid:3 cid:121 nsid:1 lba:682784024 len:80
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:121 cdw0:0
(nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=28a7b490 0 17 0 0 0
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
(nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=28b27518 0 4f 0 0 0
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
nda0 at nvme0 bus 0 scbus2 target 0 lun 1
nda0: <WD Blue SN570 2TB 234200WD 23073J801495> s/n 23073J801495 detached
(nda0:nvme0:0:0:1): Periph destroyed
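
As a reading aid for the FreeBSD output above, the following sketch pulls the status fields out of the completion lines. The function name `parse_completion` and the regular expression are assumptions made for illustration; the pair in parentheses is read as status code type / status code in hexadecimal, and type 0x00 with code 0x07 is the generic "Command Abort Requested" status in the NVMe specification.

```python
import re

# Matches FreeBSD nvme completion lines such as
#   "nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:121 cdw0:0"
COMPLETION = re.compile(
    r"\((?P<sct>[0-9a-fA-F]{2})/(?P<sc>[0-9a-fA-F]{2})\).*?"
    r"dnr:(?P<dnr>\d+).*?sqid:(?P<sqid>\d+)\s+cid:(?P<cid>\d+)"
)


def parse_completion(line):
    """Return the status fields of one completion line, or None if it does not match."""
    match = COMPLETION.search(line)
    if match is None:
        return None
    return {
        "sct": int(match["sct"], 16),   # status code type (0 = generic)
        "sc": int(match["sc"], 16),     # status code (7 = command abort requested)
        "dnr": int(match["dnr"]),       # "do not retry" bit
        "sqid": int(match["sqid"]),     # submission queue id
        "cid": int(match["cid"]),       # command id within that queue
    }


if __name__ == "__main__":
    line = ("nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 "
            "sqid:3 cid:121 cdw0:0")
    print(parse_completion(line))
```

Both WRITE commands in the log carry this abort status and appear after "resetting controller", so they read as a consequence of the timeout and reset rather than as an error reported by the SSD itself.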

Hi there,
Are there any BIOS settings for the speed of the PCIe slot? If so, try turning it down.
If you remove that 2TB drive, does it still hang?
My $0.02.
GL :slight_smile: PJ

@PJ_Glasso thanks for your suggestion. I was able to take the machine down and triple-check the BIOS settings, but unfortunately I cannot find anything related to PCIe speeds.

Sorry for the wait. I will check your report. If there is a solution you will definitely be contacted first here. Thank you for your understanding and support!

Thank you, I am curious to hear your findings and I really hope I can get this working one way or another!

Hi there,
So have you swapped the M.2 drives? Did the behavior change? I have had a bad one before. It is rare with WD drives, but you indicate the system runs fine without it, which suggests it may be on the edge of spec.
I am curious whether a straight swap (A to B and B to A) changes its behavior.
Also check the bottom side of the board for defects near the socket.
GL :slight_smile: PJ

It should work and you are close, so hang in there.

Yes, swapping the SSDs was one of the first things I tried, to exclude the possibility that one of them was faulty:

  • I didn’t have problems with either SSD if it was in the M.2 slot
  • I had lockups with either SSD when installed in the PCIe slot (nothing in the M.2 slot)
  • when using the pair I had problems regardless of which SSD was in the PCIe slot and which in the M.2 slot

I visually checked the board and the PCIe slot for damage or dust, but couldn’t see anything. I am no expert in microelectronics, though, and I expect that any defect obvious enough for me to notice would cause problems immediately at boot time. :sweat_smile:

Hi there,
OK, very good.
Can you post a picture?
In that case I would assume both SSDs are good. Are there any swollen capacitors visible on inspection?
I also wonder whether a PCIe to M.2 adapter would behave any differently.
GL :slight_smile: PJ

Here are pictures of the front and back of the PCIe slot area

Actually, I have tried the following PCIe to M.2 adapter:

It took some time to find one small enough to fit in the enclosure (after removing one hard disk). My reasoning was that the extension cables, despite their shielding, might still be affected by electromagnetic interference. Unfortunately, after a day the system still locked up…

In any case, thanks for taking the time!

Hi there,
Can you lift the label covering the two transistors in the first picture and check that it is not a metal-backed label?
What about thermals: does it boot OK and hang later, or does it not even boot when cold?
Try blowing some cold air on it, for example with a hairdryer on its cold setting. Everything within spec should work.
GL :slight_smile: PJ

What is the voltage of your power supply?
Please note that the input voltage of Reserver i3-1125G4 must be 12V.

I feel that the issue is likely due to incompatibility between the PCIe expansion card slot and our interface.

Here is the schematic diagram for our PCIe 4.0 x4 slot:

I was unable to find the valid SCH files on the official product pages for the ADT R24SF and ADT K24MF. However, in the Chinese product description, I found that this expansion card might have compatibility problems.

I double-checked the label and it is an ordinary adhesive paper label. Luckily so, because the label came with the machine :sweat_smile:

The problem seems unrelated to thermals or power draw; it has often happened in the middle of the night when the load was practically zero and temperatures were well within the normal operating range (I checked with sysstat, and I know at what times my cron jobs are scheduled). Furthermore, for the last few months I have been running the machine with an additional 120x120mm exhaust fan on top, and the CPU/SSD/HDD/MB temperatures are now comfortable even under high load.

The power supply is a Leicke 12V/7.5A model, so the voltage is correct. Note that I also had the problem with the original 12V/5A power supply, but then the machine occasionally refused to boot after a lock-up. I measured the power draw at the wall and the peak load at startup was around 75W, presumably partly due to the spin-up of both hard drives. This is why I assumed that replacing the power supply would resolve my problems, but it only resolved the boot problem, not the hanging.
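
The arithmetic behind that reasoning, as a small sketch using only the figures quoted above. Note the assumption: the 75W was measured at the wall, so it includes the adapter's conversion losses and the DC-side draw would be somewhat lower; the helper name `rated_watts` is purely illustrative.

```python
def rated_watts(volts, amps):
    """Rated DC output power of a supply in watts (P = V * I)."""
    return volts * amps


PEAK_DRAW_W = 75  # measured at the wall during startup (includes adapter losses)

supplies = {
    "original 12V/5A": rated_watts(12, 5),
    "Leicke 12V/7.5A": rated_watts(12, 7.5),
}

if __name__ == "__main__":
    for name, rating in supplies.items():
        headroom = rating - PEAK_DRAW_W
        print(f"{name}: rated {rating:.0f} W, headroom at peak {headroom:+.0f} W")
```

The original supply is rated for 60W, which is below the measured startup peak, while the 90W replacement leaves about 15W of margin. That is consistent with the boot failures disappearing after the swap while the lock-ups remained.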

You are right about the disclaimer on the ADT website regarding their extension cables and therefore I initially blamed them for the problem. This is why I also tried the PCIe adapter that I posted above and I still have the same lockup.

Regarding the BIOS: am I correct that there is no option in the BIOS to fix the PCIe slot to PCIe 3.0 speeds? I think the suggestion by @PJ_Glasso would be worth a try, but I cannot find the option. I am running the ODYSSEY-TGL-A_v2.0a 7/7/2022 BIOS, which according to the website is the latest version.
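
While the BIOS question is open, the speed that each link actually negotiated can at least be read on Linux from the standard PCI sysfs attributes `current_link_speed`, `max_link_speed` and `current_link_width`. The sketch below only reports these values and does not change them. The function names are hypothetical, and it assumes that `/sys/class/nvme/nvmeN/device` points at the controller's PCI device, as it does on typical Linux systems.

```python
from pathlib import Path

# Per-lane transfer rate (GT/s) of each PCIe generation.
GEN_BY_GTS = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def pcie_generation(speed_text):
    """Map a sysfs string such as '8.0 GT/s PCIe' to a generation number, or None."""
    try:
        rate = float(speed_text.split()[0])
    except (IndexError, ValueError):
        return None
    return GEN_BY_GTS.get(rate)


def nvme_links(sysfs_root="/sys/class/nvme"):
    """Yield (controller, current speed, max speed, current width) per NVMe controller."""
    for ctrl in sorted(Path(sysfs_root).glob("nvme*")):
        dev = ctrl / "device"
        try:
            current = (dev / "current_link_speed").read_text().strip()
            maximum = (dev / "max_link_speed").read_text().strip()
            width = (dev / "current_link_width").read_text().strip()
        except OSError:
            continue  # attribute missing, e.g. not a PCIe-attached controller
        yield ctrl.name, current, maximum, width


if __name__ == "__main__":
    for name, current, maximum, width in nvme_links():
        print(f"{name}: Gen{pcie_generation(current)} x{width} "
              f"(device maximum: Gen{pcie_generation(maximum)})")
```

The WD Blue SN570 is a PCIe 3.0 x4 drive, so both SSDs would be expected to report Gen3 x4 in either slot. A lower current speed or width on the SSD in the PCIe slot would be a hint that the link is training down.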