cross-posted from: https://discuss.online/post/34247715

Curious about the experiences of those who recently migrated to Linux from Windows 10, Intel-based macOS, etc. How is it being on Linux? Did anything surprise or frustrate you?

  • astro@leminal.space
    3 hours ago

    I’d wager a toe from my left foot that if you look in the Event Viewer on Windows you will see similar-looking errors (though no doubt less descriptive; it might say something like “corrected read error” or something equally obtuse). This is a hardware issue that Linux tends to be more aggressive in handling. These errors are on the physical layer and data link layer, so it is likely a communication problem between the drive and the motherboard. Interestingly, they are corrected on retry, so the data the system requests from the drive is fine even if it sometimes fails to arrive in time. This screams electrical connection to me: either thermal expansion is making the contacts wonky (and they might not be seated perfectly), there is a flaw in the traces somewhere, or there is some power management issue affecting your PCIe bus. Can you try running it with one more kernel parameter? In addition to pcie_aspm=off, add nvme_core.default_ps_max_latency_us=0 and watch dmesg while running something heavy.
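One way to confirm that both parameters actually took effect after a reboot is to look at the kernel command line. The following is a minimal sketch, not something from the thread: the helper name has_param is made up for illustration, and it assumes a Linux system where /proc/cmdline and the nvme_core module parameter file under /sys are readable.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if a parameter appears as a whole word
# in a kernel command line string.
# Usage: has_param "<cmdline string>" "<param>"
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}

# The command line the running kernel was booted with (empty if unreadable).
cmdline=$(cat /proc/cmdline 2>/dev/null)

for p in pcie_aspm=off nvme_core.default_ps_max_latency_us=0; do
    if has_param "$cmdline" "$p"; then
        echo "active:  $p"
    else
        echo "missing: $p"
    fi
done

# If the nvme_core module is loaded, it also exposes the current value here.
f=/sys/module/nvme_core/parameters/default_ps_max_latency_us
if [ -r "$f" ]; then
    echo "nvme_core.default_ps_max_latency_us is currently $(cat "$f")"
fi
```

If either line reports "missing", the parameter probably did not make it into the bootloader configuration, and any test results would not reflect the change.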

    • LyD@lemmy.ca
      2 hours ago

      Checked the logs in Windows, you’re right! A corrected hardware error has occurred from PCI Express Root Port. I reseated the drive with no change.

      I should mention that this laptop has always had issues with what I assumed to be thermal throttling. It would play games fine for 10-15 minutes before becoming a slideshow. I eventually stopped trying.

      I have set that option and I am currently downloading a GPU benchmark. Is that an appropriate test? What should I be looking for in dmesg?

      • astro@leminal.space
        2 hours ago

        A GPU bench might raise temps in a way that would cause the problem to recur, but I’m not sure you’d see anything without getting data flowing to the drive at the same time. So maybe run the GPU bench and, at the same time, run sudo dd if=/dev/{your drive} of=/dev/null bs=1M status=progress. That just pulls data from the drive and writes it to nowhere, but be careful not to swap the if and of arguments or you might overwrite your whole drive. While those are going, run sudo dmesg -w in another terminal and watch for the same error you were getting before. If you don’t get errors, the problem was probably just a power state issue that the kernel parameter fixed. But I have to tell you, unfortunately, that the presence of the error under Windows is a bad sign that points to a hardware problem, so I don’t feel very hopeful. Independent of all the other suggestions, could you try running sudo nvme smart-log /dev/{your drive}? That might give you some data.
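The read test and the log check described above could be combined roughly like this. It is only a sketch under a few assumptions: the function name count_pcie_errors and the grep pattern are guesses at what is relevant (adjust the pattern to the error you saw earlier), the count=4096 limit is added here so the read stops after about 4 GiB, and the device path is passed as the first argument rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines (read from stdin) that look
# like PCIe/AER or NVMe error reports. The pattern is an assumption.
count_pcie_errors() {
    # grep -c prints the count; "|| true" hides its non-zero exit on 0 matches.
    grep -c -i -E 'AER:|PCIe Bus Error|nvme.*(error|timeout)' || true
}

# Only run the stress read when a device path is given as the first argument.
if [ -n "${1:-}" ]; then
    before=$(sudo dmesg | count_pcie_errors)

    # Read-only test: the drive is the input (if=), /dev/null is the output (of=).
    # Swapping the two would overwrite the drive.
    sudo dd if="$1" of=/dev/null bs=1M count=4096 status=progress

    after=$(sudo dmesg | count_pcie_errors)
    echo "matching kernel log lines before: $before, after: $after"
fi
```

Run it with the drive as the argument while the GPU benchmark is going. A higher "after" count suggests the errors recurred under load; the kernel log buffer can wrap, so treat the numbers as a rough signal and keep sudo dmesg -w open for the details.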

        • LyD@lemmy.ca
          1 hour ago

          Thanks for the help. If it’s the hardware then it’s the hardware. I will try a few more things like booting from the drive in a different PC, but I may have to spend some money.

          • astro@leminal.space
            8 minutes ago

            That seems like a solid next step to figure out whether it is the drive or the board (or the whole thermal situation in the rig). Good luck, and sorry about the bad news. Thanks for humoring my troubleshooting compulsion.