If you can make a useful MoE architecture where each expert has a small final layer, so you don’t need to move much data between cards, then running each expert on a different card might be viable, regardless of whether the GPU vendor wants to segment the gaming and AI markets.
I think that’s one of the biggest unknowns about where AI may wind up going. If you can get good results on gaming cards, then suddenly ordinary gaming hardware, run in parallel, may be quite capable of running the important models, and it’s going to be much harder for OpenAI or similar to establish much of a barrier to entry. That may have a dramatic impact on who has what degree of access to AI.
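To make the “small final layer per expert” idea concrete, here’s a toy sketch (my own illustration, not any shipping architecture; the class name, dimensions, and device layout are invented): each expert does its heavy compute on its own GPU, and only a small output vector crosses back to the home card. Note the input activation still has to reach each card, so in practice you’d also route each token to only a few experts.

```python
import torch
import torch.nn as nn

class ShardedMoE(nn.Module):
    """Toy MoE where each expert sits on its own GPU and only a small
    final-layer output (d_out << d_model) crosses back to the home card."""
    def __init__(self, d_model=4096, d_hidden=16384, d_out=256, n_experts=4):
        super().__init__()
        self.devices = [f"cuda:{i}" for i in range(n_experts)]
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_out),   # small final layer -> small transfer
            ).to(dev)
            for dev in self.devices
        )
        # router and recombination live on the "home" card
        self.router = nn.Linear(d_model, n_experts).to("cuda:0")
        self.combine = nn.Linear(d_out, d_model).to("cuda:0")

    def forward(self, x):                      # x: [batch, d_model] on cuda:0
        weights = torch.softmax(self.router(x), dim=-1)
        outs = []
        for i, (expert, dev) in enumerate(zip(self.experts, self.devices)):
            # the input still travels to each card; the saving is on the return path
            y = expert(x.to(dev))              # heavy matmuls stay on the expert's card
            outs.append(y.to(x.device) * weights[:, i:i + 1])  # only [batch, d_out] returns
        return self.combine(torch.stack(outs).sum(dim=0))
```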
Kinda already done:
https://arxiv.org/abs/2504.07866
Huawei’s model splits the experts into 8 groups, routed so that each group always has the same number of experts active. This means that (on an 8-NPU server) inter-device communication is minimized and the load is balanced.
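A rough sketch of what that routing constraint looks like, reconstructed from the description above rather than taken from the paper: split the router logits into groups (one group per device) and take the same top-k inside every group, so each device always activates the same number of experts.

```python
import torch

def group_balanced_topk(router_logits, n_groups=8, k_total=8):
    """router_logits: [tokens, n_experts]. Pick k_total experts per token such
    that every group (one group per device) activates the same number."""
    tokens, n_experts = router_logits.shape
    per_group = n_experts // n_groups
    k_per_group = k_total // n_groups                      # identical load per device
    grouped = router_logits.view(tokens, n_groups, per_group)
    local_idx = grouped.topk(k_per_group, dim=-1).indices  # [tokens, n_groups, k_per_group]
    offsets = (torch.arange(n_groups) * per_group).view(1, n_groups, 1)
    return (local_idx + offsets).view(tokens, -1)          # global expert ids, [tokens, k_total]

logits = torch.randn(4, 64)            # 4 tokens, 64 experts split into 8 groups of 8
print(group_balanced_topk(logits))     # exactly one active expert per group per token
```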
There’s another big MoE (ERNIE? Don’t quote me) that ships with native 2-bit QAT, too. It’s pretty much explicitly designed to cram into 8 gaming GPUs.
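The arithmetic behind “cram into 8 gaming GPUs”, with illustrative numbers rather than that model’s actual specs:

```python
# Illustrative only: rough size of a ~300B-parameter MoE at 2 bits per weight,
# versus the VRAM on a box with 8 consumer 24 GB cards.
total_params = 300e9
bits_per_weight = 2
weight_bytes = total_params * bits_per_weight / 8       # = 75 GB of weights
vram_bytes = 8 * 24e9                                    # = 192 GB across 8 gaming GPUs
print(f"weights: {weight_bytes / 1e9:.0f} GB, VRAM: {vram_bytes / 1e9:.0f} GB")
print(f"headroom for KV cache/activations: {(vram_bytes - weight_bytes) / 1e9:.0f} GB")
```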
I mean, I can run GLM 4.6 (350B) at 7 tokens/sec on a single 3090 plus a Ryzen CPU, with modest token-level agreement relative to the full-precision model. Most people can run GLM Air, which replaces base-tier ChatGPT.
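For what it’s worth, here’s a back-of-envelope view of why a single 24 GB card plus system RAM can hit single-digit tokens/sec on a model that size: a MoE only touches its active experts per token, so (with the dense parts on the GPU) generation is roughly bounded by RAM bandwidth over the active bytes per token. All the numbers below are my assumptions, not measurements.

```python
# Back-of-envelope only: assumed figures for a GLM-class MoE with CPU offload.
active_params = 32e9        # assumed active parameters per token
bits_per_weight = 4         # assumed quantization level
ram_bandwidth = 80e9        # bytes/s, typical dual-channel DDR5 (assumption)

bytes_per_token = active_params * bits_per_weight / 8
print(f"{bytes_per_token / 1e9:.0f} GB touched per token")              # ~16 GB
print(f"~{ram_bandwidth / bytes_per_token:.1f} tokens/s upper bound")   # ~5 tokens/s
```

Whatever fraction of the active weights stays in VRAM is read at GPU bandwidth instead, which is how the real number ends up above that bound.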
Some businesses are already serving models split across cheap GPUs. It can be done, but it’s not turnkey the way it is for NVLink-connected HBM cards.
Honestly the only thing keeping OpenAI in place is name recognition, a timing lead, SEO/convenience and… hype. Basically inertia + anticompetitiveness. The tech to displace them is there, it’s just inaccessible and unknown.