Strange Odyssey X86J4125 ethernet dropouts

Issue

The network connection drops at random. These are not just a few lost packets but complete network outages lasting tens of seconds.

HW

ODYSSEY-X86J4125

General description

After months of searching for the right HW platform for my home NAS (fanless, powered by an external adapter, at least 3 SATA ports), I settled on the SeeedStudio Odyssey X86J4125 as the best fit for my idea. I ordered it together with an M.2 PCI 5xSATA board, an M.2 SATA SSD, 3x 2.5" SATA HDDs and a re_computer case, and put everything together.

TrueNAS was my first choice, but then I read that syncthing is not well supported on TrueNAS and switched to (my beloved) Debian. I installed Debian 11 and configured two ZFS pools (one mirrored for important data, one on a single disk for unimportant data).

Even during the initial configuration work over SSH I noticed my SSH session freezing, sometimes for a few seconds, sometimes for as long as a minute. I focused on tracking down these network issues. I also tried Debian 10 (ok, fine, 11 is too new, maybe there are some kernel bugs), but it made no difference. Every time I thought “YES!!! I made it!”, another network dropout occurred after a while. And even after spending a bambillion hours on troubleshooting, I still don’t know what’s wrong. No evidence in “dmesg”, no evidence in syslog, nothing strange in the “atop” logs,…
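When the logs stay silent, the per-interface counters that the Linux kernel exposes under /sys/class/net/<interface>/statistics are another place where drops sometimes show up. The following is only a minimal sketch of comparing those counters before and after a freeze. The interface name "eth0" in the usage note is an assumption (the post does not name the interface), and the helper names read_counters and counter_deltas are made up for illustration.

```python
from pathlib import Path

# Counter files commonly present under /sys/class/net/<iface>/statistics.
# Files that do not exist on a given system are simply skipped.
COUNTERS = (
    "rx_packets", "rx_errors", "rx_dropped", "rx_missed_errors",
    "tx_packets", "tx_errors", "tx_dropped",
)


def read_counters(iface, base="/sys/class/net"):
    """Return a dict of counter name -> integer value for one interface."""
    stats_dir = Path(base) / iface / "statistics"
    values = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.exists():
            values[name] = int(path.read_text().strip())
    return values


def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Taking one snapshot with read_counters("eth0") before a freeze and one after it, then passing both to counter_deltas, would show whether the error or drop counters moved during the outage or whether only the packet counters did.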

What I have tried already

Listed in the order you would probably ask about them, not in the order I tried them:

  1. used a much more powerful power source than the original one (16V/4.5A)
  2. passed memtest (free latest version) with zero errors
  3. upgraded BIOS to the latest one (SD-BS-CJ41G-300-101-H)
  4. upgraded Embedded Controller firmware to the latest one (SD-EC-CJ41G-M-101-Q)
  5. disabled wifi and Bluetooth adapters
  6. used USB based Ethernet adapter on USB3 port
  7. used the same USB Ethernet adapter on the USB2 port
  8. tried several different CAT5E UTP cables
  9. tuned Mikrotik switch to disable all possible STP checks, multicasts,…
  10. connected the Odyssey to an ordinary dumb 100Mbps switch together with my PC (to avoid any strange influence of the Mikrotik)
  11. disabled any power management in BIOS
  12. disabled any CPU freq management in BIOS
  13. disabled virtualization support in BIOS
  14. disabled “Energy Efficient Ethernet” in Debian
  15. changed eth speed to 100Mbps
  16. tuned any possible queues and buffers for network interface
  17. installed “tuned” and used “throughput-performance” profile
  18. disabled power management in Debian
  19. tried “powertop” to find some strange power consumption
  20. stopped all my services on the box that use the network (samba, minidlna, syncthing)
  21. stopped ZFS
  22. replaced SATA power+data cable
  23. disconnected one of the HDDs
  24. disconnected all HDDs
  25. disconnected the M.2 PCI 5xSATA card
  26. booted a live Linux distro from USB and ran it from a ramdisk

Many times, after one of the steps above, I let ping run against the server for a few hours and thought “Heureeeka!!!” Ping showed 0% lost packets. I won!

Then I just pressed a few keys in the SSH session and the 30-second freeze came back. I could live with the trouble in the SSH session, and syncthing is also somehow able to retransmit its data. But the primary use of the box, samba for videos and minidlna for music, is unusable. Streaming video from samba fails every few minutes.

Very strange detail

When I ping the Odyssey from any other device on my network (notebook, another home server, router,…), the ping losses are 10-40% over several hours.
BUT!!! When I run ping out from the Odyssey to one of my other devices, the ping losses are 0%. And now comes the magic: when I run the ingress ping at the same time, its losses are only about 2%.
Anyway, this can’t be used as a “dirty workaround”, because those 2% losses are not occasional packet drops but still outages of tens of seconds. They just happen less often.
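The difference between 2% of scattered drops and 2% lost in long bursts can be read from the sequence numbers in saved ping output. A minimal sketch, assuming Linux iputils-style reply lines ("... bytes from ...: icmp_seq=N ...") and a 1-second ping interval. The address 192.168.1.10 and the sample lines are made up for illustration and are not my real measurements.

```python
import re

# Match only successful replies; "Destination Host Unreachable" lines also
# carry icmp_seq but must not count as received packets.
REPLY_RE = re.compile(r"bytes from .*icmp_seq=(\d+)")


def find_outages(ping_lines, interval_s=1.0, min_lost=3):
    """Return (first_missing_seq, lost_count, approx_seconds) per gap.

    A gap is reported only when at least `min_lost` consecutive replies are
    missing, so isolated single drops are ignored. Sequence wrap-around
    (seq going back to 0) is not treated as a gap.
    """
    outages = []
    prev = None
    for line in ping_lines:
        match = REPLY_RE.search(line)
        if not match:
            continue
        seq = int(match.group(1))
        if prev is not None and seq > prev:
            lost = seq - prev - 1
            if lost >= min_lost:
                outages.append((prev + 1, lost, lost * interval_s))
        prev = seq
    return outages


# Hypothetical sample: one long gap (seq 3..32) and one single drop (seq 35).
SAMPLE = [
    "64 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.41 ms",
    "64 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.39 ms",
    "64 bytes from 192.168.1.10: icmp_seq=33 ttl=64 time=0.45 ms",
    "64 bytes from 192.168.1.10: icmp_seq=34 ttl=64 time=0.40 ms",
    "64 bytes from 192.168.1.10: icmp_seq=36 ttl=64 time=0.42 ms",
]

for first_seq, lost, seconds in find_outages(SAMPLE):
    print(f"outage starting at seq {first_seq}: {lost} replies lost, ~{seconds:.0f} s")
```

Feeding it the lines of a real ping log instead of SAMPLE would list each outage separately, which says more about this problem than the overall loss percentage does.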

Regarding the kernel tuning: to be honest, I don’t remember everything I tried. It was every recommendation I could find on discussion forums for network troubles.
But I would expect an untuned kernel to give just worse performance, not a malfunction.

Does any of you have an idea what else I could try?

Many thanks

Hi, thank you for such detailed testing! Can you reset the BIOS settings and run Windows for more tests? We need to determine whether this problem is a hardware issue.

Sorry, I don’t own a usable Windows license.
Also, I’m not as familiar with Windows as I am with Linux and could not run such deep tests on Windows.
What kind of tests are available on Windows but not on Linux?
Anyway, I could at least reset the BIOS.
Thanks.