• Septimaeus@infosec.pub · 2 days ago

      This is correct. The popular misconception may arise from the marked difference between model use vs. model development: inference is far less demanding than training in terms of both time and energy.

      And you can still train on most consumer GPUs, but for really deep networks like LLMs, well, get ready to wait.
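
      To make that gap concrete, here's a minimal PyTorch sketch (not from the comment above; the layer sizes and batch size are arbitrary) contrasting a plain inference pass with a single training step, which also has to store activations, run a backward pass, and keep optimizer state:

      ```python
      # Minimal sketch contrasting inference with a single training step.
      # The layer sizes and batch size are arbitrary illustrative values.
      import torch
      import torch.nn as nn

      model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
      x = torch.randn(32, 1024)

      # Inference: one forward pass, no gradients or optimizer state kept around.
      with torch.no_grad():
          y = model(x)

      # Training step: forward pass with stored activations, a backward pass,
      # plus optimizer state (Adam keeps two extra tensors per parameter),
      # which is why it costs far more memory and compute per step.
      optimizer = torch.optim.Adam(model.parameters())
      loss = model(x).pow(2).mean()
      loss.backward()
      optimizer.step()
      optimizer.zero_grad()
      ```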

    • PlzGivHugs@sh.itjust.works · 3 days ago

      Really? When I was trying to get it to run a little while ago, I kept running out of memory with my 3060 12GB while running 20B models, but perhaps I had it configured wrong.

      • Arkthos@pawb.social · 3 days ago

        You can offload them into RAM. The response time gets way slower once that happens, but you can do it. I’ve run a 70B Llama model on my 3060 12GB at 2-bit quantisation (I do have plenty of RAM, so no offloading from RAM to disk at least lmao). It took like 6-7 minutes to generate replies, but it did work.
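
        For anyone wanting to try a similar setup, here's a rough sketch using llama-cpp-python; the GGUF filename and the number of offloaded layers are placeholders, not values from this thread:

        ```python
        # Rough sketch: load a heavily quantised GGUF model with llama-cpp-python
        # and keep only some layers in VRAM; the rest are served from system RAM.
        # The file name and layer count are placeholders, not tested values.
        from llama_cpp import Llama

        llm = Llama(
            model_path="llama-70b.Q2_K.gguf",  # hypothetical 2-bit quantised file
            n_gpu_layers=20,                   # layers kept in VRAM; lower this if you run out of memory
            n_ctx=2048,                        # context window size
        )

        out = llm("Write a short reply:", max_tokens=128)
        print(out["choices"][0]["text"])
        ```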

      • KeenFlame@feddit.nu · 2 days ago

        The pruned models work just as well, but they will be slow if you use RAM instead of VRAM.