Can I run Mixtral 8x22B (INT8, 8-bit integer) on NVIDIA RTX 3090 Ti?

Fail/OOM: This GPU doesn't have enough VRAM

GPU VRAM: 24.0GB
Required: 141.0GB
Headroom: -117.0GB

VRAM Usage: 24.0GB of 24.0GB (100% used)

Technical Analysis

The NVIDIA RTX 3090 Ti, while a powerful GPU, falls far short of the VRAM needed to run Mixtral 8x22B (141B). At INT8 each parameter occupies one byte, so the quantized weights alone demand approximately 141GB of VRAM, while the RTX 3090 Ti offers only 24GB. That 117GB shortfall means the model's weights cannot even be loaded onto the GPU, so the compatibility check fails outright. The card's high memory bandwidth (1.01 TB/s) is irrelevant here, since a model that cannot be loaded cannot benefit from it.
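As a sanity check on the 141GB figure, here is a minimal back-of-the-envelope estimate, assuming one byte per parameter at INT8 and ignoring KV cache and runtime overhead:

```python
# Weight-only VRAM estimate for Mixtral 8x22B at INT8 (1 byte per parameter).
# Assumption: the tool's 141.0GB "Required" figure is dominated by the weights;
# KV cache and CUDA context overhead are ignored here.
TOTAL_PARAMS = 141e9   # total parameters across all experts
BYTES_PER_PARAM = 1    # INT8 = 8 bits = 1 byte

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB vs 24 GB of VRAM on the RTX 3090 Ti")
# -> Weights alone: ~141 GB vs 24 GB of VRAM on the RTX 3090 Ti
```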

Even if some layers were offloaded to system RAM, performance would collapse, because every generated token would be bottlenecked by the comparatively slow transfers between system RAM and the GPU. With only a small fraction of the model resident in VRAM, meaningful inference is not possible, and the 10752 CUDA cores and 336 Tensor cores would sit largely idle because the weights they need cannot be held in the GPU's memory.
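A rough sketch of why offloading does not rescue the situation. The figures below are assumptions (roughly 39B active parameters per token for this MoE model, ~25 GB/s of effective PCIe 4.0 x16 throughput, INT8 weights); CPU compute and KV-cache traffic would slow things down further:

```python
# Crude lower bound on decode speed when most weights live in system RAM and
# must be streamed over PCIe for every generated token.
ACTIVE_PARAMS = 39e9     # assumed: ~39B parameters active per token (MoE routing)
BYTES_PER_PARAM = 1      # INT8
PCIE_BW = 25e9           # assumed effective PCIe 4.0 x16 throughput, bytes/s
VRAM_BW = 1.01e12        # RTX 3090 Ti memory bandwidth, bytes/s

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
print(f"PCIe-bound: ~{PCIE_BW / bytes_per_token:.2f} tokens/s")
print(f"VRAM-bound: ~{VRAM_BW / bytes_per_token:.0f} tokens/s (if the model fit)")
# -> PCIe-bound: ~0.64 tokens/s
# -> VRAM-bound: ~26 tokens/s (if the model fit)
```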

Recommendation

Running Mixtral 8x22B (141B) effectively requires either a GPU with substantially more VRAM or a multi-GPU setup in which the model is sharded across cards. Consider cloud GPU instances with sufficient VRAM, such as those offered by NelsaHost, or a distributed inference solution. Alternatively, choose a smaller model that fits within the RTX 3090 Ti's 24GB, or explore extreme quantization to shrink the footprint further; as the rough comparison below suggests, though, even aggressive quantization leaves the weights well above 24GB, and it comes with a significant accuracy and performance cost.
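For context, a back-of-the-envelope comparison of weight footprints at common quantization levels, ignoring per-block scales, zero-points, and runtime overhead:

```python
# Weight footprint at different quantization levels vs the card's 24GB of VRAM.
TOTAL_PARAMS = 141e9
VRAM_GB = 24

for name, bits in [("INT8", 8), ("4-bit", 4), ("2-bit", 2)]:
    size_gb = TOTAL_PARAMS * bits / 8 / 1e9
    verdict = "fits" if size_gb <= VRAM_GB else "does not fit"
    print(f"{name}: ~{size_gb:.0f} GB -> {verdict} in {VRAM_GB} GB of VRAM")
# -> INT8: ~141 GB -> does not fit in 24 GB of VRAM
# -> 4-bit: ~70 GB -> does not fit in 24 GB of VRAM
# -> 2-bit: ~35 GB -> does not fit in 24 GB of VRAM
```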

If you're determined to experiment, explore frameworks like `llama.cpp` with aggressive quantization techniques (e.g., 4-bit quantization) and offloading layers to system RAM. However, be prepared for extremely slow inference speeds and potential instability. The performance will likely be too low for practical use. A more realistic approach would be to use a smaller, more efficient model that is designed to run on consumer-grade hardware.
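A minimal sketch of that experimental path using the llama-cpp-python bindings. The model file name and layer count below are placeholders, not tested values; you would tune `n_gpu_layers` to whatever actually fits in 24GB and keep expectations very low:

```python
# Experimental-only sketch: run a heavily quantized GGUF of Mixtral 8x22B with
# llama-cpp-python, keeping only a few layers on the GPU and the rest in
# system RAM. Expect extremely slow output.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x22b.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=8,   # offload only what fits in 24GB; the rest stays in system RAM
    n_ctx=2048,       # small context keeps the KV cache manageable
)

out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```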

Recommended Settings

Batch Size: 1
Context Length: 2048 (or lower)
Inference Framework: llama.cpp (for experimentation only)
Suggested Quantization: Q4_K_M or lower (for experimentation only)
Other Settings:
- Offload as many layers as possible to system RAM
- Use CPU inference if the GPU is completely unusable
- Monitor system RAM usage closely to avoid crashes

Frequently Asked Questions

Is Mixtral 8x22B (141B) compatible with NVIDIA RTX 3090 Ti?
No, the RTX 3090 Ti does not have enough VRAM to run Mixtral 8x22B, even with quantization.
What VRAM is needed for Mixtral 8x22B (141B)?
Mixtral 8x22B requires approximately 141GB of VRAM when quantized to INT8.
How fast will Mixtral 8x22B (141B) run on NVIDIA RTX 3090 Ti?
Effectively, it won't: the RTX 3090 Ti does not have enough VRAM to load the model. If you do manage to start it with extreme quantization and heavy offloading to system RAM, expect extremely slow inference speeds and potential crashes.