cross-posted from: https://discuss.online/post/34247715

Curious about the experiences of those who recently migrated to Linux from Windows 10, Intel-based macOS, etc. How is it being on Linux? Did anything surprise or frustrate you?

  • LyD@lemmy.ca
    4 hours ago

    I installed Linux Mint on my old personal laptop (Dell XPS 9560) and unfortunately ran into some issues that made me switch back to Windows. I really want to make it work.

    It seems to have revealed either a hardware bug or failing hardware in the NVMe drive.

    The first problem was log spam that filled up the partition:

    2025-12-29T12:15:46.439880-05:00 redacted kernel: pcieport 0000:00:1d.0: AER: Correctable error message received from 0000:04:00.0
    2025-12-29T12:15:46.439934-05:00 redacted kernel: nvme 0000:04:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
    2025-12-29T12:15:46.439936-05:00 redacted kernel: nvme 0000:04:00.0:   device [126f:2262] error status/mask=00000001/0000e000
    2025-12-29T12:15:46.439938-05:00 redacted kernel: nvme 0000:04:00.0:    [ 0] RxErr                  (First)
    2025-12-29T12:15:46.439939-05:00 redacted kernel: pcieport 0000:00:1d.0: AER: Multiple Correctable error message received from 0000:04:00.0
    2025-12-29T12:15:46.439940-05:00 redacted kernel: pcieport 0000:00:1d.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
    2025-12-29T12:15:46.439941-05:00 redacted kernel: pcieport 0000:00:1d.0:   device [8086:a118] error status/mask=00001000/00000000
    2025-12-29T12:15:46.439943-05:00 redacted kernel: pcieport 0000:00:1d.0:    [12] Timeout               
    2025-12-29T12:15:46.439944-05:00 redacted kernel: nvme 0000:04:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
    2025-12-29T12:15:46.439945-05:00 redacted kernel: nvme 0000:04:00.0:   device [126f:2262] error status/mask=00000001/0000e000
    2025-12-29T12:15:46.439946-05:00 redacted kernel: nvme 0000:04:00.0:    [ 0] RxErr                  (First)
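When the same few messages repeat thousands of times, a tally per device and error layer is easier to read than the raw log. The following is a minimal sketch, assuming the kernel lines keep the `kernel: <driver> <address>: PCIe Bus Error: severity=..., type=...,` shape shown above; `summarize_aer` and `format_summary` are hypothetical helper names, not part of any existing tool.

```python
import re
from collections import Counter

# Matches kernel lines shaped like the ones in the log above, e.g.
#   "... kernel: nvme 0000:04:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)"
AER_LINE = re.compile(
    r"kernel: (?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def summarize_aer(lines):
    """Count "PCIe Bus Error" lines per (device address, severity, layer)."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            counts[(match["bdf"], match["severity"], match["layer"])] += 1
    return counts


def format_summary(counts):
    """Render the tally as text lines, most frequent combination first."""
    return [
        f"{n:6d}  {bdf}  {severity}  {layer}"
        for (bdf, severity, layer), n in counts.most_common()
    ]
```

To use it, pass any iterable of log lines (an open log file, for instance) to `summarize_aer` and print the result of `format_summary`. Lines that do not match the pattern are ignored, so the "AER: Correctable error message received" and "error status/mask" lines are not counted.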
    

    Some forum posts I found suggested that this was a hardware bug and that I could work around it by adding pcie_aspm=off to the kernel command line in GRUB. This stopped the log spam and everything seemed to be working fine.
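For anyone following along, the workaround amounts to appending a parameter to the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub` and then regenerating the GRUB config (`sudo update-grub` on Mint). Below is a minimal sketch of the text edit only, assuming the line is double-quoted as it is in a stock Mint install; `add_kernel_params` is a hypothetical helper, and it works on a string rather than touching the file.

```python
import re


def add_kernel_params(grub_text, params):
    """Return grub_text with every entry of params present in
    GRUB_CMDLINE_LINUX_DEFAULT. Parameters already listed are not duplicated."""
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def patch(match):
        current = match.group(2).split()
        for param in params:
            if param not in current:
                current.append(param)
        return match.group(1) + " ".join(current) + match.group(3)

    new_text, replaced = pattern.subn(patch, grub_text, count=1)
    if replaced == 0:
        # Assumption: the line exists and uses double quotes.
        raise ValueError("no double-quoted GRUB_CMDLINE_LINUX_DEFAULT line found")
    return new_text
```

Keeping the function pure makes it easy to check the result before writing anything back; a wrong edit to this file can make the next boot awkward, so a backup copy of `/etc/default/grub` is worth having first.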

    Later, while I was doing some programming, everything froze for a while. When it came back, the partition had been remounted read-only. It wouldn’t boot on restart and dropped to a BusyBox shell instead. I was able to make it writable again, but the same thing happened soon after.

    I decided to switch back to Windows, where there don’t seem to be any issues.

    I really want to make it work. If it’s failing hardware then I have no choice but to replace the drive, but if it’s just a bug then I want to find a fix without buying new hardware. That would kind of defeat the point for me and I don’t want to spend the money.

    I would appreciate any help. I booted into Mint again to grab the logs and I really want to keep using it.

    • astro@leminal.space
      3 hours ago

      I’d wager a toe from my left foot that if you look in the Event Viewer on Windows you will see similar-looking errors (though no doubt less descriptive; it might say something like “corrected read error” or something equally obscure). This is a hardware issue that Linux tends to be more aggressive in handling. The errors are on the physical layer and data link layer, so it is likely a communication problem between the drive and the motherboard. Interestingly, they are corrected on retry, so the data the system is reading from the drive is fine even if it sometimes fails to get there in time. This screams electrical connection to me: either thermal expansion is making the contacts wonky (and they might not be seated perfectly), there is a flaw in the traces somewhere, or there is some power management issue affecting your PCIe bus. Can you try running it with one more kernel parameter? Alongside pcie_aspm=off, add nvme_core.default_ps_max_latency_us=0 and watch dmesg while running something heavy.
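Before stress-testing, it is worth confirming that both parameters actually made it onto the running kernel's command line, which Linux exposes in `/proc/cmdline`. This is a minimal sketch of that check; `missing_kernel_params` and `read_cmdline` are hypothetical helper names.

```python
# The two parameters discussed in this thread.
WANTED = ["pcie_aspm=off", "nvme_core.default_ps_max_latency_us=0"]


def missing_kernel_params(cmdline, wanted=WANTED):
    """Return the entries of wanted that are absent from a kernel
    command line string (whitespace-separated parameters)."""
    active = set(cmdline.split())
    return [param for param in wanted if param not in active]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()
```

Calling `missing_kernel_params(read_cmdline())` after a reboot should return an empty list if both parameters took effect. A non-empty list usually means the GRUB config was edited but not regenerated.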

      • LyD@lemmy.ca
        2 hours ago

        I checked the logs in Windows, and you’re right! There is an entry saying a corrected hardware error has occurred, reported by the PCI Express Root Port. I reseated the drive with no change.

        I should mention that this laptop has always had issues with what I assumed to be thermal throttling. It would play games fine for 10-15 minutes before becoming a slideshow. I eventually stopped trying.

        I have set that option and I am currently downloading a GPU benchmark. Is that an appropriate test? What should I be looking for in dmesg?

        • astro@leminal.space
          2 hours ago

          A GPU bench might raise temps in a way that would cause the problem to recur, but I’m not sure you’d see anything without getting data flowing from the drive at the same time. So maybe run the GPU bench and, at the same time, run sudo dd if=/dev/{your drive} of=/dev/null bs=1M status=progress. That just pulls data from the drive and writes it to nowhere, but be careful not to swap if and of, or you might overwrite your whole drive. While those are going, run sudo dmesg -w in another terminal and watch for the same errors you were getting before. If you don’t get errors, the problem was probably just a power state issue that the kernel parameter fixed. But I have to tell you, unfortunately, that the presence of the error under Windows is a bad sign that points to a hardware problem, so I don’t feel very hopeful. Independent of all the other suggestions, could you try running sudo nvme smart-log /dev/{your drive}? That might give you some data.
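If mixing up the `dd` arguments is a worry, the same sequential read load can be generated by something that only ever opens the device for reading. This is a rough sketch of that idea, not a replacement for `dd`; `read_through` is a hypothetical helper, and the path you pass is whatever your drive is called, as with the `/dev/{your drive}` placeholder above.

```python
import time


def read_through(path, block_size=1024 * 1024, limit_bytes=None):
    """Read path sequentially in block_size chunks and discard the data.

    Returns (bytes_read, elapsed_seconds). If limit_bytes is given, reading
    stops once at least that many bytes have been read.
    """
    total = 0
    start = time.monotonic()
    # "rb" opens the path read-only, so nothing can be written to the drive.
    with open(path, "rb", buffering=0) as src:
        while limit_bytes is None or total < limit_bytes:
            chunk = src.read(block_size)
            if not chunk:  # end of the device or file
                break
            total += len(chunk)
    return total, time.monotonic() - start
```

Reading a raw block device normally requires root, same as the `dd` command. Running this alongside the GPU benchmark and watching `sudo dmesg -w` in another terminal gives the same kind of test as described above.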

          • LyD@lemmy.ca
            1 hour ago

            Thanks for the help. If it’s the hardware then it’s the hardware. I will try a few more things like booting from the drive in a different PC, but I may have to spend some money.

            • astro@leminal.space
              4 minutes ago

              That seems like a solid next step to figure out whether it is the drive or the board (or the whole thermal situation in the rig). Good luck, and sorry about the bad news. Thanks for humoring my troubleshooting compulsion.