In case you missed it, 2-3 weeks ago, experimental tensor-parallelism support was merged into llama.cpp.
In a nutshell, in multi-GPU setups this lets you combine not only the cards' VRAM but also their compute. The results depend heavily on the specific setup and model, but on my 3x RTX 2000e Ada rig running Qwen3.6-35b it almost doubled generation throughput (these are low-powered cards that aren't very fast on their own).
The option to turn it on is `--split-mode tensor`.
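For example, a typical invocation might look like this (the model path and layer count are placeholders for illustration; adjust them to your own setup):

```shell
# Hypothetical example: serve a GGUF model with the experimental
# tensor split across all available GPUs.
llama-server \
  -m ./models/my-model.gguf \
  --split-mode tensor \
  --n-gpu-layers 99
```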
It’s not yet officially documented, presumably because it’s still experimental. But since #22362 was merged yesterday, in my case it now also works with the latest Qwen3.6 models.

