• chrash0@lemmy.world · 7 months ago

you’d be surprised how fast a model can be if you narrow the scope, quantize, and target specific hardware, like the AI hardware features they’re announcing.

not a 1:1 comparison, but a quantized Mistral 7B runs at ~35 tokens/sec on my M2. that’s not even as optimized as it could be. it can write simple scripts and handle some decent writing prompts.
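for a rough idea of what that looks like, here’s a sketch using llama-cpp-python with a 4-bit GGUF quant and Metal offload. the model filename and params are just examples, nothing official:

```python
# sketch: 4-bit quantized Mistral 7B on Apple Silicon via llama-cpp-python.
# the GGUF filename below is an example quant, not a specific recommendation.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # any 4-bit GGUF quant
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the Metal backend
)

out = llm(
    "[INST] write a shell script that tars up ~/Documents with today's date in the filename [/INST]",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```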

they could get really narrow in scope (super simple RAG, limited responses, etc.), quantize down to even something like 4-bit, and run it on custom accelerated hardware. it doesn’t have to reproduce Shakespeare, but i can imagine a PoC that runs circles around Siri in semantic understanding and generated responses. being able to reach out on Slack to the engineers that built the NPU stack ain’t bad neither.
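and the “super simple RAG” part really can be that simple. a toy sketch of the shape of it, using sentence-transformers for embeddings and plain cosine similarity (embedding model and snippets are made up for illustration, not whatever Apple would actually ship):

```python
# toy RAG: embed a few local snippets, grab the closest ones for a query,
# and stuff them into the prompt for the small on-device model.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

docs = [
    "dentist appointment thursday at 3pm",
    "flight to denver leaves saturday at 7:40am",
    "mom's birthday is june 12th",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query, k=2):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                # cosine similarity (vectors are unit length)
    top = np.argsort(scores)[::-1][:k]   # indices of the k best matches
    return [docs[i] for i in top]

query = "when do i fly out?"
context = "\n".join(retrieve(query))
prompt = f"[INST] using only this context:\n{context}\n\nanswer: {query} [/INST]"
# prompt then goes to the quantized local model from the sketch above
```

keeping the retrieval and the allowed responses this narrow is exactly what lets a small quantized model feel fast and grounded instead of hallucinating.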