I have a reServer i3-1125G4 running Proxmox from a mirrored pair of WD Blue SN570 2TB SSDs: one in the M-key M.2 slot and the other in the PCIe 4.0 x4 slot.
Ever since I started using this setup, I have had system lock-ups that appear to be tied to the PCIe slot. They occur at irregular intervals and do not seem to correlate with temperature or load. Sometimes the system hangs only minutes after a reboot; other times it takes more than a week. If I remove the SSD that is connected to the PCIe slot, the system runs without problems.
I have tried the following steps:
- upgraded to a 90W power supply
- replaced the ADT R24SF PCIe to M.2 extension cable with the shielded ADT K24MF version
- replaced the ADT R24SF PCIe to M.2 extension cable with a direct PCIe to M.2 adapter (no cable)
- updated the BIOS and reset the settings to default
- tried Proxmox with the default Debian kernel rather than the Proxmox one
- tried FreeBSD with bhyve to exclude the possibility that it’s Linux-specific
- swapped the two SSDs to exclude the possibility that one of them is faulty
- tried without the SSD in the M.2 slot
Am I the only one with these problems? Is there perhaps some BIOS setting that I could try? I have run out of ideas and can only conclude that it must be a hardware problem.
Below is the console output of Proxmox when it hangs:
[16400.651838] watchdog: Watchdog detected hard LOCKUP on cpu 0
[16400.651842] Modules linked in: hid_logitech_hidpp joydev input_leds hid_logitech_dj hid_generic usbkbd usbmouse usbhid hid rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs netconsole ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter softdog nf_tables bonding tls sunrpc nfnetlink_log binfmt_misc nfnetlink xe snd_hda_codec_hdmi drm_gpuvm drm_exec snd_hda_codec_realtek gpu_sched drm_suballoc_helper drm_ttm_helper snd_hda_codec_generic intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common x86_pkg_temp_thermal intel_powerclamp coretemp snd_sof_pci_intel_tgl kvm_intel snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink kvm soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof irqbypass crct10dif_pclmul polyval_clmulni polyval_generic snd_sof_utils ghash_clmulni_intel snd_soc_hdac_hda sha256_ssse3 snd_hda_ext_core sha1_ssse3 snd_soc_acpi_intel_match aesni_intel snd_soc_acpi soundwire_generic_allocation
[16400.651885] crypto_simd soundwire_bus cryptd rapl snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine intel_cstate i915 mei_pxp mei_hdcp snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep drm_buddy cmdlinepart ttm snd_pcm spi_nor drm_display_helper snd_timer mtd wmi_bmof snd mei_me cec pcspkr ee1004 soundcore cdc_acm 8250_dw mei rc_core i2c_algo_bit intel_pmc_core intel_vsec pmt_telemetry intel_hid sparse_keymap pmt_class acpi_tad acpi_pad mac_hid vhost_net vhost vhost_iotlb tap it87(OE) hwmon_vid efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c spi_pxa2xx_platform dw_dmac dw_dmac_core ahci intel_lpss_pci xhci_pci nvme spi_intel_pci intel_lpss xhci_pci_renesas crc32_pclmul nvme_core libahci xhci_hcd igc i2c_i801 spi_intel i2c_smbus nvme_auth idma64 video wmi pinctrl_tigerlake
[16400.651936] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: P W OE 6.8.4-3-pve #1
[16400.651939] Hardware name: Default string Default string/ODYSSEY-TGL-A, BIOS ODYSSEY-TGL-A_v2.0a 05/19/2022
[16400.651940] RIP: 0010:native_queued_spin_lock_slowpath+0x284/0x2d0
[16400.651945] Code: 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 c0 59 03 00 48 03 04 d5 e0 ac aa b9 4c 89 20 41 8b 44 24 08 85 c0 75 0b f3 90 <41> 8b 44 24 08 85 c0 74 f5 49 8b 14 24 48 85 d2 74 8b 0f 0d 0a eb
[16400.651947] RSP: 0018:ffffaa6380003ca8 EFLAGS: 00000046
[16400.651949] RAX: 0000000000000000 RBX: ffff8abe402fc1c0 RCX: 0000000000040000
[16400.651950] RDX: 0000000000000006 RSI: 00000000001c0001 RDI: ffff8abe402fc1c0
[16400.651951] RBP: ffffaa6380003cc8 R08: 0000000000000000 R09: 0000000000000000
[16400.651952] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8acdf02359c0
[16400.651953] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8abe4036b800
[16400.651954] FS: 0000000000000000(0000) GS:ffff8acdf0200000(0000) knlGS:0000000000000000
[16400.651956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16400.651957] CR2: 000059e219a47c7c CR3: 00000007fce36006 CR4: 0000000000f72ef0
[16400.651958] PKRU: 55555554
[16400.651959] Call Trace:
[16400.651960] <NMI>
[16400.651963] ? show_regs+0x6d/0x80
[16400.651968] ? watchdog_hardlockup_check+0x15a/0x250
[16400.651972] ? watchdog_overflow_callback+0x6b/0x80
[16400.651974] ? __perf_event_overflow+0xeb/0x2d0
[16400.651979] ? perf_event_overflow+0x19/0x30
[16400.651982] ? handle_pmi_common+0x175/0x3c0
[16400.651989] ? intel_pmu_handle_irq+0x108/0x480
[16400.651991] ? x2apic_send_IPI_self+0x15/0x20
[16400.651995] ? arch_irq_work_raise+0x28/0x40
[16400.651998] ? perf_event_nmi_handler+0x2b/0x50
[16400.652001] ? nmi_handle+0x5d/0x160
[16400.652004] ? default_do_nmi+0x47/0x130
[16400.652008] ? exc_nmi+0x1c2/0x290
[16400.652011] ? end_repeat_nmi+0xf/0x60
[16400.652016] ? native_queued_spin_lock_slowpath+0x284/0x2d0
[16400.652018] ? native_queued_spin_lock_slowpath+0x284/0x2d0
[16400.652020] ? native_queued_spin_lock_slowpath+0x284/0x2d0
[16400.652022] </NMI>
[16400.652023] <IRQ>
[16400.652024] _raw_spin_lock+0x3f/0x60
[16400.652026] qi_submit_sync+0x34d/0x8b0
[16400.652031] qi_flush_iotlb+0x84/0xb0
[16400.652034] intel_flush_iotlb_all+0x78/0x150
[16400.652036] fq_flush_iotlb+0x26/0x40
[16400.652040] fq_flush_timeout+0x2d/0xe0
[16400.652042] ? __pfx_fq_flush_timeout+0x10/0x10
[16400.652045] call_timer_fn+0x27/0x160
[16400.652049] ? __pfx_fq_flush_timeout+0x10/0x10
[16400.652052] __run_timers+0x262/0x300
[16400.652056] run_timer_softirq+0x1d/0x40
[16400.652058] __do_softirq+0xd6/0x31c
[16400.652061] __irq_exit_rcu+0xd7/0x100
[16400.652064] irq_exit_rcu+0xe/0x20
[16400.652066] sysvec_apic_timer_interrupt+0x92/0xd0
[16400.652067] </IRQ>
[16400.652068] <TASK>
[16400.652069] asm_sysvec_apic_timer_interrupt+0x1b/0x20
[16400.652071] RIP: 0010:cpuidle_enter_state+0xce/0x470
[16400.652074] Code: 17 03 ff e8 f4 ee ff ff 8b 53 04 49 89 c6 0f 1f 44 00 00 31 ff e8 02 07 02 ff 80 7d d7 00 0f 85 e7 01 00 00 fb 0f 1f 44 00 00 <45> 85 ff 0f 88 83 01 00 00 49 63 d7 4c 89 f1 48 8d 04 52 48 8d 04
[16400.652075] RSP: 0018:ffffffffba203db8 EFLAGS: 00000246
[16400.652077] RAX: 0000000000000000 RBX: ffffca637fa19a50 RCX: 0000000000000000
[16400.652078] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[16400.652079] RBP: ffffffffba203df0 R08: 0000000000000000 R09: 0000000000000000
[16400.652080] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
[16400.652081] R13: ffffffffba46fa00 R14: 00000ee7c133bacb R15: 0000000000000003
[16400.652083] ? cpuidle_enter_state+0xbe/0x470
[16400.652087] cpuidle_enter+0x2e/0x50
[16400.652093] call_cpuidle+0x23/0x60
[16400.652096] do_idle+0x207/0x260
[16400.652098] cpu_startup_entry+0x2a/0x30
[16400.652101] rest_init+0xd0/0xd0
[16400.652103] arch_call_rest_init+0xe/0x30
[16400.652106] start_kernel+0x71b/0xb00
[16400.652109] x86_64_start_reservations+0x18/0x30
[16400.652112] x86_64_start_kernel+0xbf/0x110
[16400.652115] secondary_startup_64_no_verify+0x184/0x18b
[16400.652119] </TASK>
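To make the trace above easier to read, here is a minimal Python sketch that pulls the call frames out of a dmesg-style trace and groups them by context (NMI, IRQ, TASK). It skips the frames prefixed with `?`, which the kernel marks as unreliable. The function name `reliable_frames` and the `SAMPLE` excerpt are only illustrative; the excerpt is copied from the IRQ section of the log above. On that section it leaves `_raw_spin_lock`, `qi_submit_sync`, `qi_flush_iotlb` and the timer frames, which as far as I can tell means CPU 0 was spinning on a lock in the Intel IOMMU invalidation path when the watchdog fired.

```python
import re
from collections import OrderedDict

# Matches the "[16400.652024] " timestamp prefix that dmesg puts on each line.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")
# Matches section markers such as <NMI>, </NMI>, <IRQ>, <TASK>.
MARKER = re.compile(r"^<(/?)(\w+)>$")
# Matches a frame such as "qi_submit_sync+0x34d/0x8b0" and captures the symbol.
FRAME = re.compile(r"^([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+$")


def reliable_frames(trace_text):
    """Return {context: [symbol, ...]} for frames not prefixed with '?'."""
    frames = OrderedDict()
    context = None
    for raw in trace_text.splitlines():
        line = TIMESTAMP.sub("", raw.strip())
        marker = MARKER.match(line)
        if marker:
            closing, name = marker.groups()
            # An opening marker starts a context, a closing marker ends it.
            context = None if closing else name
            if context:
                frames.setdefault(context, [])
            continue
        # Ignore lines outside a context and frames the kernel flags with '?'.
        if context is None or line.startswith("?"):
            continue
        frame = FRAME.match(line)
        if frame:
            frames[context].append(frame.group(1))
    return frames


# Excerpt from the IRQ section of the Proxmox log above (illustrative subset).
SAMPLE = """\
[16400.652022] </NMI>
[16400.652023] <IRQ>
[16400.652024] _raw_spin_lock+0x3f/0x60
[16400.652026] qi_submit_sync+0x34d/0x8b0
[16400.652031] qi_flush_iotlb+0x84/0xb0
[16400.652042] ? __pfx_fq_flush_timeout+0x10/0x10
[16400.652045] call_timer_fn+0x27/0x160
[16400.652067] </IRQ>
"""

if __name__ == "__main__":
    for ctx, symbols in reliable_frames(SAMPLE).items():
        # Innermost frame first, matching the order in the kernel log.
        print(ctx, "->", " <- ".join(symbols))
```

Feeding it the full log instead of `SAMPLE` (for example the text saved from netconsole) should give one list per context in the same way.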
FreeBSD trace:
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: resetting controller
nvme0: failing queued i/o
nvme0: WRITE sqid:2 cid:0 nsid:1 lba:682079376 len:24
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 p:0 sqid:2 cid:0 cdw0:0
nvme0: failing outstanding i/o
nvme0: WRITE sqid:3 cid:121 nsid:1 lba:682784024 len:80
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:121 cdw0:0
(nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=28a7b490 0 17 0 0 0
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
(nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=28b27518 0 4f 0 0 0
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
nda0 at nvme0 bus 0 scbus2 target 0 lun 1
nda0: <WD Blue SN570 2TB 234200WD 23073J801495> s/n 23073J801495 detached
(nda0:nvme0:0:0:1): Periph destroyed