Dual gigabit carrier board (router kit) refuses to power up after weeks of operation

I bought the router kit with the passive heatsink, installed an RPi 4 build of OPNsense, and it ran as a home firewall/router for weeks.

I experienced some link flapping on the LAN7800 interface, particularly when transitioning from mostly idle to busy, like in the mornings, and several times during the day. I suspected heat-induced clock jitter between the CM4 and the USB bus on the carrier board, so I got my infrared camera out. The CM4 MxL chip and the Broadcom chips were running hot, so I added two heatsinks. The Qualcomm chip was reporting a maximum of 35 °C, but I also laid a spare aluminum heatsink, slightly larger than the router case, flat on top to increase thermal mass and dissipation.

The link flapping nearly stopped. For a couple of weeks it happened only 2-4 times per week, always first thing in the morning, again when transitioning from idle to busy. I thought I was making progress, maybe even done. However, the hardware then crashed and, after rebooting successfully once, refused to boot again.

The carrier board power LED now refuses to light. The MOSFET between the USB-C jack and the I2C FPC connector gets up to 5.4 V (measured with a multimeter probe on the test pad on the back, just underneath this chip), but it behaves as if some protection is kicking in: the voltage drops out and cycles on and off.

I ordered another carrier board to see if maybe I’d got a lemon (I was suspicious because the original carrier board had a lot of flux left on it). It arrived today in good condition. The new carrier board's power LED lights when the USB-C power supply is plugged in, but not when the CM4 is seated. The original carrier board's power LED doesn’t light at all.

It seems like something on the CM4 might be shorting out, but I can’t get to the test pads on the CM4 while it’s seated, or even see whether anything is thermally “interesting” on the connector side.