Fresh benchmark numbers are shining a spotlight on a big question for home AI enthusiasts: can you realistically run a massive 230B-parameter model on consumer hardware, and if so, what’s the smartest way to do it?
A recent test shared by stevibe on X explored exactly that using MiniMax M2.7, a 230B-parameter model, to see how different NVIDIA GPU setups handle large-model inference at home. The same core settings were used across the board: a 32k context size and a max token length of 4096. To fit the model within practical VRAM limits, the test used Unsloth’s UD-IQ3_XXS quantization in GGUF format. Notably, this was also the largest quantization that still fits inside the 96GB of VRAM on the RTX PRO 6000 Blackwell, which helps keep the comparison consistent across rigs.
Four systems were tested:
1) Four RTX 4090 GPUs (96GB total VRAM)
Performance: 71.52 tokens/second
TTFT (time to first token): 1045ms
2) Four RTX 5090 GPUs (128GB total VRAM)
Performance: 120.54 tokens/second
TTFT: 725ms
3) One RTX PRO 6000 Blackwell GPU (96GB VRAM)
Performance: 118.74 tokens/second
TTFT: 765ms
4) DGX Spark (128GB memory)
Performance: 24.41 tokens/second
TTFT: 741ms
On raw token generation speed, the standout detail is how close a single RTX PRO 6000 Blackwell gets to a four-card RTX 5090 setup. The quad RTX 5090 rig leads at 120.54 tokens/second, but the RTX PRO 6000 lands right behind it at 118.74 tokens/second—despite being a single-GPU configuration.
Why that matters: in real-world large language model inference, speed alone doesn’t tell the full story. Once you start stacking multiple consumer GPUs, you introduce overhead—inter-GPU data movement, synchronization, and other inefficiencies that eat into the expected scaling gains. These results highlight that bigger multi-GPU builds don’t always translate into proportionally better performance, especially when compared to a purpose-built workstation-class card with high VRAM on a single GPU.
Power consumption is where the gap becomes impossible to ignore. The multi-GPU builds demand extreme power budgets:
4x RTX 4090: 1800W peak (450W x 4)
4x RTX 5090: 2300W peak (575W x 4)
RTX PRO 6000 Blackwell: 600W peak
DGX Spark: 240W peak (whole system)
That means the RTX PRO 6000 delivers near-quad-RTX-5090 performance at roughly a quarter of the peak power draw. Compared to four RTX 4090 cards, it’s about one-third the power. For anyone trying to run large AI models at home without building a space-heater tower (or without upgrading electrical circuits), the efficiency advantage is a major point in favor of a single high-VRAM professional GPU.
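As a back-of-the-envelope check on those efficiency claims, the reported throughput and peak-power figures can be combined into tokens per second per peak watt. This is only illustrative arithmetic using the numbers quoted above (peak power, not measured draw under this workload):

```python
# Tokens/second per peak watt, using the article's reported figures.
setups = {
    "4x RTX 4090":  {"tok_s": 71.52,  "peak_w": 1800},
    "4x RTX 5090":  {"tok_s": 120.54, "peak_w": 2300},
    "RTX PRO 6000": {"tok_s": 118.74, "peak_w": 600},
    "DGX Spark":    {"tok_s": 24.41,  "peak_w": 240},
}

for name, s in setups.items():
    eff = s["tok_s"] / s["peak_w"]  # tokens per second per peak watt
    print(f"{name:13s} {eff:.4f} tok/s/W")

# Ratios behind the "quarter" and "one-third" power claims:
print(f"vs 4x 5090: {600 / 2300:.2f}")  # ~0.26 of the quad-5090 budget
print(f"vs 4x 4090: {600 / 1800:.2f}")  # ~0.33 of the quad-4090 budget
```

By this rough measure the RTX PRO 6000 lands near 0.20 tok/s/W, roughly four times the efficiency of either quad-GPU consumer build, with the DGX Spark second at about 0.10 tok/s/W.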
Pricing adds another layer to the decision. Based on the figures shared:
RTX 4090 average retail price: $3000 per GPU
RTX 5090 average retail price: $3500 per GPU
RTX PRO 6000 average retail price: $9500 per GPU
DGX Spark retail price: $4699
A four-card RTX 5090 setup lands around $14,000 just for the GPUs, while a single RTX PRO 6000 sits around $9,500. Even though the PRO card is expensive, the comparison suggests it can be a better value than building a multi-GPU consumer stack once you factor in performance per watt, multi-GPU overhead, and the simplicity of a single-GPU system.
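Those figures can be reduced to a rough dollars-per-token/second metric. This sketch counts GPU (or system) cost only, using the retail prices quoted above, and ignores motherboards, PSUs, cooling, and electricity:

```python
# Hardware cost per unit of throughput (illustrative arithmetic only).
rigs = {
    "4x RTX 4090":  {"cost": 4 * 3000, "tok_s": 71.52},
    "4x RTX 5090":  {"cost": 4 * 3500, "tok_s": 120.54},
    "RTX PRO 6000": {"cost": 9500,     "tok_s": 118.74},
    "DGX Spark":    {"cost": 4699,     "tok_s": 24.41},
}

for name, r in rigs.items():
    print(f"{name:13s} ${r['cost'] / r['tok_s']:.0f} per tok/s")
```

On this metric the RTX PRO 6000 comes out cheapest per unit of throughput (about $80 per token/second versus roughly $116 for the quad 5090 build), despite carrying the highest sticker price of any single card.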
The DGX Spark, on the other hand, looks far slower in tokens/second (24.41), but its low 240W full-system power draw makes it a more efficient, plug-and-play style option for people who prioritize convenience and energy use over maximum throughput.
The bigger takeaway is that large AI inference at home is becoming more feasible—but the “more GPUs equals better” idea has limits. These results show why a single high-VRAM, workstation-class GPU can outperform expectations versus mainstream multi-GPU builds, delivering similar real generation speeds while dramatically cutting complexity and power consumption. For running huge models like a 230B-parameter LLM, efficiency and overhead can matter just as much as peak specs.