Those who've switched to Linux in the last year, how is it going?

kiol@discuss.online · 1 day ago

LyD@lemmy.ca · 4 hours ago

Installed Linux Mint on my old personal laptop (Dell XPS 9560) and unfortunately ran into some issues that made me switch back to Windows. I really want to make it work.

It seems to have revealed either a hardware bug or failing hardware in the NVMe drive.

First problem was log spam that filled up the partition:

2025-12-29T12:15:46.439880-05:00 redacted kernel: pcieport 0000:00:1d.0: AER: Correctable error message received from 0000:04:00.0
2025-12-29T12:15:46.439934-05:00 redacted kernel: nvme 0000:04:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
2025-12-29T12:15:46.439936-05:00 redacted kernel: nvme 0000:04:00.0:   device [126f:2262] error status/mask=00000001/0000e000
2025-12-29T12:15:46.439938-05:00 redacted kernel: nvme 0000:04:00.0:    [ 0] RxErr                  (First)
2025-12-29T12:15:46.439939-05:00 redacted kernel: pcieport 0000:00:1d.0: AER: Multiple Correctable error message received from 0000:04:00.0
2025-12-29T12:15:46.439940-05:00 redacted kernel: pcieport 0000:00:1d.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
2025-12-29T12:15:46.439941-05:00 redacted kernel: pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00000000
2025-12-29T12:15:46.439943-05:00 redacted kernel: pcieport 0000:00:1d.0:    [12] Timeout               
2025-12-29T12:15:46.439944-05:00 redacted kernel: nvme 0000:04:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
2025-12-29T12:15:46.439945-05:00 redacted kernel: nvme 0000:04:00.0:   device [126f:2262] error status/mask=00000001/0000e000
2025-12-29T12:15:46.439946-05:00 redacted kernel: nvme 0000:04:00.0:    [ 0] RxErr                  (First)
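
For anyone reading along who wants to see how much of a log is this one error, here is a rough sketch that tallies the "PCIe Bus Error" lines by device and error type. It assumes the syslog-style format shown above (a "kernel: " prefix before the device name). summarize_aer is a made-up helper name, not an existing tool.

```shell
# Hypothetical helper: count "PCIe Bus Error" lines per device and error type.
# Reads log lines from stdin or from the files given as arguments.
summarize_aer() {
  awk -F'kernel: ' '/PCIe Bus Error/ {
      # $2 looks like "nvme 0000:04:00.0: PCIe Bus Error: severity=..., type=..."
      split($2, parts, ": PCIe Bus Error: ")
      count[parts[1] " | " parts[2]]++
  }
  END { for (k in count) printf "%6d  %s\n", count[k], k }' "$@" | sort -rn
}

# Usage (the log path is an assumption; it varies by distro):
#   summarize_aer /var/log/kern.log
#   journalctl -k | summarize_aer
```

Plain dmesg output has no "kernel: " prefix, so this sketch would need a different field separator to work on it.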

Some forum posts I found (example) suggested that this was a hardware bug and that I could set pcie_aspm=off in GRUB to work around it. This stopped the log spam and everything seemed to be working fine.
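
In case it helps anyone else, on Mint (and other Ubuntu-based installs) that parameter usually goes on the default kernel command line in /etc/default/grub. A sketch, assuming the stock "quiet splash" options are what is already on the line (yours may differ):

```shell
# /etc/default/grub (fragment)
# "quiet splash" is an assumed existing value; only pcie_aspm=off is the addition.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"
```

After saving, running sudo update-grub regenerates the boot config, and after a reboot cat /proc/cmdline should show pcie_aspm=off.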

Later, while I was doing some programming, everything froze for a while. When it came back, the partition was set to read-only. It wouldn’t boot on restart and loaded up BusyBox instead. I was able to set it to writable, but it happened again soon after.

I decided to switch back to Windows, where there don’t seem to be any issues.

I really want to make it work. If it’s failing hardware then I have no choice but to replace the drive, but if it’s just a bug then I want to find a fix without buying new hardware. That would kind of defeat the point for me and I don’t want to spend the money.

I would appreciate any help. I booted into Mint again to grab the logs and I really want to keep using it.

astro@leminal.space · 3 hours ago

I’d wager a toe from my left foot that if you look in the Event Viewer on Windows you will see similar-looking errors (though no doubt less descriptive; it might say something like “corrected read error” or something equally obtuse). This is a hardware issue that Linux tends to be more aggressive in handling.

These errors are on the physical layer and data link layer, so it is likely a communication problem between the drive and the motherboard. Interestingly, they are corrected on retry, so the data the system is reading from the drive is fine even if it sometimes fails to get there in time. This screams electrical connection to me: either thermal expansion is making the contacts wonky (and they might not be seated perfectly), there is a flaw in the traces somewhere, or there is some power management issue affecting your PCIe bus.

Can you try running it with one more kernel parameter? Alongside pcie_aspm=off, add nvme_core.default_ps_max_latency_us=0 and watch dmesg while running something heavy.
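
Before testing, it is worth confirming that both parameters actually made it onto the running kernel's command line. A minimal sketch, where check_params is a made-up helper name and the file argument is only there so it can be pointed at /proc/cmdline:

```shell
# Hypothetical helper: check_params FILE PARAM...
# Reports whether each PARAM appears as a whole word in the kernel command line in FILE.
check_params() {
  cmdline_file=$1
  shift
  for p in "$@"; do
    # Split the command line into one parameter per line, then match exactly.
    if tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$p"; then
      echo "present: $p"
    else
      echo "missing: $p"
    fi
  done
}

# Usage after rebooting:
#   check_params /proc/cmdline pcie_aspm=off nvme_core.default_ps_max_latency_us=0
```

If a parameter shows as missing, the GRUB config probably was not regenerated after the edit.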

LyD@lemmy.ca · 2 hours ago

Checked the logs in Windows, and you’re right! They show “A corrected hardware error has occurred” from the PCI Express Root Port. I reseated the drive with no change.

I should mention that this laptop has always had issues with what I assumed to be thermal throttling. It would play games fine for 10-15 minutes before becoming a slideshow. I eventually stopped trying.

I have set that option and I am currently downloading a GPU benchmark. Is that an appropriate test? What should I be looking for in dmesg?

astro@leminal.space · 2 hours ago

A GPU bench might raise temps in a way that would cause the problem to recur, but I’m not sure you’d see anything without getting data flowing to the drive at the same time. So maybe run the GPU bench and, at the same time, run sudo dd if=/dev/{your drive} of=/dev/null bs=1M status=progress. That just pulls data from the drive and writes it to nowhere, but be careful about the of and if or you might overwrite your whole drive. While those are going, run sudo dmesg -w in another terminal and watch for the same error you were getting before.

If you don’t get errors, the problem was probably just a power state issue that the kernel parameter fixed. But I have to tell you, unfortunately, that the presence of the error under Windows is a bad sign that points to a hardware problem, so I don’t feel very hopeful. Independent of all the other suggestions, could you try running sudo nvme smart-log /dev/{your drive}? That might give you some data.
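
To make the dmesg side of that easier to watch, here is a small sketch of a filter that only lets the AER messages through. aer_filter is a made-up name, and /dev/nvme0n1 in the comments is only an assumed device name (lsblk will show the real one).

```shell
# Hypothetical helper: pass through only PCIe AER lines from a kernel log stream.
# --line-buffered makes matches appear immediately when following a live log.
aer_filter() {
  grep --line-buffered -E 'AER:|PCIe Bus Error'
}

# Terminal 1: read the whole drive and discard the data (read-only as written).
#   sudo dd if=/dev/nvme0n1 of=/dev/null bs=1M status=progress
# Terminal 2: follow the kernel log and show only AER messages.
#   sudo dmesg -w | aer_filter
```

No output in the second terminal while the drive and GPU are both busy would suggest the errors are gone. Any new lines mean they are still happening.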

LyD@lemmy.ca · 1 hour ago

Thanks for the help. If it’s the hardware then it’s the hardware. I will try a few more things like booting from the drive in a different PC, but I may have to spend some money.

astro@leminal.space · 4 minutes ago

That seems like a solid next step to figure out if it is the drive or the board (or the whole thermal situation in the rig). Good luck, and sorry about the bad news. Thanks for humoring my troubleshooting compulsion.