Thank you so much!! I have been putting it off because what I have works but a time will soon come when I’ll want to test new models.
I’m looking for a server, but not many parallel calls, because I would like to use as much context as I can. When making space for e.g. 4 parallel slots, the context is split and each slot only gets a quarter of it. With Llama 3.1 8B I managed to get 47104 tokens of context on the 16GB card (though actually using that much is pretty slow). That’s with the KV cache quantized to 8-bit too. But sometimes I just need that much.
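For anyone curious, the rough KV-cache math behind that number looks about like the sketch below (back-of-the-envelope Python; real usage also needs the model weights plus some overhead, and q8_0 adds a little per-block scale data on top of the ~1 byte/value).

```python
# Back-of-the-envelope KV-cache sizing for Llama 3.1 8B with an 8-bit cache.
# Estimates only: real memory use adds model weights, compute buffers, and
# a small per-block overhead for the q8_0 quantization scales.
n_layers = 32        # transformer layers in Llama 3.1 8B
n_kv_heads = 8       # GQA: 8 KV heads (vs. 32 query heads)
head_dim = 128       # dimension per head
bytes_per_value = 1  # ~1 byte/value for an 8-bit KV cache

# K and V each store n_kv_heads * head_dim values per layer, per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token / 1024, "KiB per token")           # 64.0 KiB

ctx = 47104
print(ctx * kv_bytes_per_token / 2**30, "GiB of KV cache")  # ~2.9 GiB

# Splitting that context across 4 parallel slots leaves each slot ctx/4 tokens.
print(ctx // 4, "tokens per slot if split 4 ways")          # 11776
```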
I’ve never tried llama.cpp directly, thanks for the tip!
Kobold sounds good too, but I have some scripts talking to it directly. I’ll read up on that to see if it can do that. I don’t have time now but I’ll do it in the coming days. Thank you!
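If KoboldCpp exposes an OpenAI-compatible endpoint like llama.cpp’s server does, it might just be a base-URL swap in something like the sketch below (the port and model name are placeholders; use whatever your server reports on startup).

```python
# Minimal OpenAI-style chat call against a local server.
# base_url, port, and model name are placeholders: llama-server commonly
# listens on 8080; check your own server's startup log for the right port.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # local endpoint, no real key needed
    api_key="sk-no-key-required",
)

resp = client.chat.completions.create(
    model="local-model",  # most local servers ignore or loosely match this
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```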
vLLM is a bit better with parallelization. All the KV cache sits in a single “pool”, and it uses as many slots as will fit. If it gets a bunch of short requests, it does many in parallel. If it gets a long-context request, it kinda just does that one.
You still have to specify a maximum context though, and it is best to set that as low as possible.
The catch is that it’s quite VRAM-inefficient. But it can split over multiple cards reasonably well, better than llama.cpp can, depending on your PCIe speeds.
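As a rough illustration of those knobs (model name and numbers are just examples, and the 8-bit KV cache option depends on your build and hardware, so treat this as a sketch):

```python
# Sketch of a vLLM setup: cap the context, and let the scheduler pack as many
# requests as fit into the shared KV-cache pool. Values are examples only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    max_model_len=32768,          # keep this as low as you can get away with
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM is allowed to claim
    # kv_cache_dtype="fp8",       # optional 8-bit KV cache, if your backend supports it
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain why KV cache size limits context length."], params)
print(outputs[0].outputs[0].text)
```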
You might try TabbyAPI with exl2 quants as well. It’s very good with parallel calls, though I’m not sure how well it supports MI50s.
Another thing to tweak is batch size. If you are actually making a bunch of 47K context calls, you can increase the prompt processing batch size a ton to load the MI50 better, and get it to process the prompt faster.
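With the llama-cpp-python bindings, for instance, that’s the n_batch argument; the values below are just illustrative, and a bigger batch costs some extra VRAM for compute buffers.

```python
# Sketch: raising the prompt-processing batch size in llama-cpp-python.
# Illustrative values only; a larger n_batch keeps the GPU busier while
# ingesting long prompts, at the cost of bigger compute buffers.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # example path
    n_ctx=47104,      # the big context from above
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_batch=2048,     # prompt-processing batch size, well above the default
)

out = llm("(long prompt goes here)", max_tokens=32)
print(out["choices"][0]["text"])
```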
EDIT: Also, now that I think about it, I’m pretty sure ollama is really dumb with parallelization. Does it even support paged attention batching?
The llama.cpp server should be much better, e.g. use less VRAM for each of the “slots” it can utilize.
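If you do try it, a launch along these lines is roughly what I mean; the flag names are from memory of recent llama.cpp builds, so double-check them against llama-server --help.

```python
# Sketch of launching llama-server with one shared context split across slots.
# Flag names are from memory of recent llama.cpp builds; verify with --help.
import subprocess

cmd = [
    "llama-server",
    "-m", "models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # example model path
    "-c", "47104",             # total context, divided among the slots
    "-np", "2",                # number of parallel slots (each gets c/np tokens)
    "--cache-type-k", "q8_0",  # 8-bit K cache
    "--cache-type-v", "q8_0",  # 8-bit V cache
    "-ngl", "99",              # offload all layers to the GPU
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```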