Can I run Mixtral 8x22B (q3_k_m) on NVIDIA RTX 3090?

Result: Fail / out of memory. This GPU does not have enough VRAM.

GPU VRAM: 24.0 GB
Required: 56.4 GB
Headroom: -32.4 GB

VRAM usage: 100% of the 24.0 GB available.

Technical Analysis

The NVIDIA RTX 3090, with its 24GB of GDDR6X VRAM, falls short of the memory requirements for running the 141B-parameter Mixtral 8x22B model, even with quantization. The q3_k_m quantization brings the model's weights down to roughly 56.4GB, which still exceeds the RTX 3090's capacity by 32.4GB. The card's Ampere architecture provides substantial compute with its 10496 CUDA cores and 328 Tensor cores, but the bottleneck here is memory capacity, not compute. Memory bandwidth (0.94 TB/s) matters for inference speed, yet it is secondary when the model cannot fully reside in VRAM.
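
To see where these figures come from, the sketch below gives a rough, weights-only estimate. It assumes q3_k_m averages about 3.2 bits per weight (the value that reproduces the 56.4GB figure above) and ignores the KV cache and runtime overhead, which add several more GB in practice.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weights-only VRAM estimate in decimal GB.

    Ignores KV cache, activations, and framework overhead, which add
    several more GB on top of the weights.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# Mixtral 8x22B has ~141B total parameters. Treating q3_k_m as roughly
# 3.2 bits per weight (an assumption) reproduces the figures quoted above.
fp16_gb = estimate_vram_gb(141, 16)      # ~282 GB
q3_k_m_gb = estimate_vram_gb(141, 3.2)   # ~56.4 GB
headroom_gb = 24.0 - q3_k_m_gb           # ~-32.4 GB on an RTX 3090

print(f"FP16:    {fp16_gb:.1f} GB")
print(f"q3_k_m:  {q3_k_m_gb:.1f} GB")
print(f"Headroom on a 24 GB card: {headroom_gb:.1f} GB")
```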

Due to this VRAM shortfall, attempting to load and run the full Mixtral 8x22B model on the RTX 3090 will fail with an out-of-memory error. The model is a sparse mixture-of-experts with eight experts per layer; only two experts are active per token, which reduces compute per token but not memory, because every expert's weights must still be loaded. Even with aggressive quantization, the memory footprint therefore remains substantial. The RTX 3090's CUDA and Tensor cores could deliver reasonable inference speed if the model fit in memory, but the VRAM constraint prevents this. Without workarounds such as offloading layers to system RAM (which drastically reduces performance), direct inference is not feasible.

Recommendation

Given the VRAM limitation, running the full Mixtral 8x22B model directly on a single RTX 3090 is not practical. Consider model parallelism, splitting the model across multiple GPUs, if you have access to several RTX 3090s or other cards. Alternatively, offload some layers to system RAM, but expect a significant performance hit. Another option is a smaller or distilled model that fits within 24GB of VRAM. Finally, cloud-based inference services provide access to GPUs with much larger memory capacities.

If you proceed with offloading layers to system RAM, use an inference framework such as `llama.cpp`, which supports splitting layers between GPU and CPU. Monitor memory usage carefully and adjust the number of offloaded layers to balance performance and stability. Further quantization could shrink the footprint, although q3_k_m is already relatively aggressive. Even with these workarounds, expect performance to be significantly slower than running the model entirely on a GPU with sufficient VRAM.
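
If you do attempt partial offload, the sketch below shows one way to configure it with the `llama-cpp-python` bindings; the model path and layer count are placeholder assumptions, not tested values. With a roughly 56GB model and 24GB of VRAM, well under half of the layers will fit on the card, so start low and raise `n_gpu_layers` only while VRAM headroom remains.

```python
from llama_cpp import Llama

# Hypothetical local path to a q3_k_m GGUF build of Mixtral 8x22B.
MODEL_PATH = "models/mixtral-8x22b-q3_k_m.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=16,   # layers kept on the RTX 3090; raise cautiously, lower on OOM
    n_ctx=2048,        # modest context length keeps the KV cache small
    n_batch=128,       # smaller prompt batches reduce peak VRAM at the cost of speed
)

output = llm(
    "Explain mixture-of-experts models in one paragraph.",
    max_tokens=128,
)
print(output["choices"][0]["text"])
```

Watch `nvidia-smi` while the model loads to confirm the split leaves room for the KV cache at your chosen context length.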

Recommended Settings

Batch size: 1
Context length: 2048 (reduce further to minimize VRAM usage if offloading)
Other settings: offload layers to system RAM; monitor VRAM usage closely; experiment with different layer offloading strategies
Inference framework: llama.cpp
Suggested quantization: q3_k_m (already in use; consider an even lower quantization if available)

Frequently Asked Questions

Is Mixtral 8x22B (141.00B) compatible with NVIDIA RTX 3090?
No, the Mixtral 8x22B model, even with q3_k_m quantization, requires more VRAM (56.4GB) than the NVIDIA RTX 3090 offers (24GB).
What VRAM is needed for Mixtral 8x22B (141.00B)?
The Mixtral 8x22B model requires approximately 282GB of VRAM in FP16. With q3_k_m quantization, the VRAM requirement is reduced to 56.4GB.
How fast will Mixtral 8x22B (141.00B) run on NVIDIA RTX 3090?
The Mixtral 8x22B model is unlikely to run on the RTX 3090 due to insufficient VRAM. Attempting to run it by offloading layers to system RAM will result in significantly reduced performance, making it impractical for real-time applications.