Can I run Mixtral 8x7B (INT8, 8-bit integer) on NVIDIA RTX 3090 Ti?

Result: Fail/OOM. This GPU doesn't have enough VRAM.

GPU VRAM: 24.0GB
Required: 46.7GB
Headroom: -22.7GB

VRAM Usage: 100% of 24.0GB (model does not fit)

Technical Analysis

The NVIDIA RTX 3090 Ti, while a powerful GPU with 10752 CUDA cores and 24GB of GDDR6X VRAM, falls short of the VRAM requirements for running the Mixtral 8x7B model, even when quantized to INT8. Mixtral 8x7B is a sparse mixture-of-experts model with roughly 46.7B parameters, so at INT8 (about one byte per parameter) its weights alone require approximately 46.7GB of VRAM, before accounting for activations and the KV cache. The RTX 3090 Ti's 24GB of VRAM leaves a deficit of 22.7GB, so the model cannot be loaded entirely onto the GPU. Attempting direct inference will trigger out-of-memory errors; the only workarounds are reducing the memory footprint or offloading parts of the model to system RAM, which significantly hurts performance.
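The 46.7GB and 93.4GB figures follow directly from the parameter count: roughly one byte per parameter at INT8 and two at FP16. A minimal sketch of that arithmetic (weights only; activations and KV cache come on top and depend on batch size and context length):

    # Back-of-the-envelope estimate of the VRAM needed just to hold the weights.
    # Activations, KV cache, and framework buffers are extra on top of this figure.
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0}

    def weight_vram_gb(n_params_billion, precision):
        """Approximate weight footprint in (decimal) gigabytes."""
        return n_params_billion * BYTES_PER_PARAM[precision]

    for precision in ("fp16", "int8"):
        print(f"Mixtral 8x7B ({precision}): ~{weight_vram_gb(46.7, precision):.1f} GB")
    # fp16 -> ~93.4 GB, int8 -> ~46.7 GB; both exceed the 3090 Ti's 24 GB.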

Even with the RTX 3090 Ti's impressive memory bandwidth of 1.01 TB/s, the bottleneck is the insufficient VRAM. While high memory bandwidth is crucial for quickly transferring data between the GPU and its memory, it cannot compensate for the lack of capacity to hold the entire model. Running the model with insufficient VRAM would necessitate offloading layers to system RAM, which is significantly slower than GDDR6X, leading to a substantial performance decrease. The Ampere architecture's Tensor Cores would be underutilized, as the model cannot fully reside on the GPU for efficient computation. Therefore, while the RTX 3090 Ti possesses capable hardware, it's fundamentally limited by its VRAM capacity for this specific model.
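To make the offload penalty concrete, a common rule of thumb is that single-stream decode speed is capped by memory bandwidth divided by the bytes read per generated token. Because Mixtral is a mixture-of-experts model, only the active experts are read each step, roughly 12.9B of its 46.7B parameters per token. The sketch below uses that figure and an assumed dual-channel system-RAM bandwidth; both are illustrative estimates, not measurements:

    # Bandwidth-bound upper bound on single-stream decode speed:
    # tokens/s <= bandwidth / bytes read per generated token.
    ACTIVE_PARAMS = 12.9e9   # approx. active parameters per token (2 of 8 experts)
    BYTES_PER_PARAM = 1.0    # INT8

    def max_tokens_per_s(bandwidth_gb_per_s):
        return bandwidth_gb_per_s * 1e9 / (ACTIVE_PARAMS * BYTES_PER_PARAM)

    print(f"GDDR6X, 1010 GB/s:            ~{max_tokens_per_s(1010):.0f} tok/s ceiling")
    print(f"System RAM, ~80 GB/s assumed: ~{max_tokens_per_s(80):.0f} tok/s ceiling")
    # Real throughput lands well below both ceilings, but the ~12x bandwidth gap
    # is why layers offloaded to system RAM dominate the runtime.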

Recommendation

Due to the VRAM limitations, directly running Mixtral 8x7B on the RTX 3090 Ti is not feasible. Consider CPU offloading, or splitting the model across multiple GPUs if they are available; if offloading to the CPU, expect a significant performance hit. Another approach is more aggressive quantization, such as Q4 or even lower, which substantially reduces the VRAM footprint at the cost of some accuracy. Model distillation is a more involved option: a smaller model is trained to mimic the behavior of the larger Mixtral model, producing something that does fit on the RTX 3090 Ti.
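For the aggressive-quantization route, some rough arithmetic gives a sense of what fits. The bits-per-weight values below are ballpark averages I'm assuming for GGUF K-quant formats; real file sizes differ by a gigabyte or two, and the KV cache and runtime buffers still have to fit alongside the weights:

    # Approximate weight footprint at lower-bit quantization levels.
    # Bits-per-weight values are rough averages for GGUF K-quants, not exact specs.
    N_PARAMS = 46.7e9
    VRAM_GB = 24.0

    for name, bits_per_weight in [("Q5_K_M", 5.7), ("Q4_K_M", 4.85),
                                  ("Q3_K_M", 3.9), ("Q2_K", 2.6)]:
        gb = N_PARAMS * bits_per_weight / 8 / 1e9
        verdict = "under 24 GB" if gb < VRAM_GB else "over 24 GB, partial offload needed"
        print(f"{name}: ~{gb:.1f} GB of weights ({verdict})")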

Alternatively, consider using cloud-based GPU instances with higher VRAM capacities, such as those offered by NelsaHost, to run the model without these limitations. If sticking with local hardware, investigate using inference frameworks that support advanced memory management techniques like swapping layers between GPU and system memory, but be aware that this will severely impact inference speed. Prioritize minimizing context length to reduce memory usage as well.
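On the last point, context length matters because the KV cache grows linearly with it. A minimal sketch using Mixtral 8x7B's attention shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache; treat these as assumptions if your checkpoint differs):

    # KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * context * bytes/element.
    LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2   # FP16 cache assumed

    def kv_cache_gb(context_length):
        return 2 * LAYERS * KV_HEADS * HEAD_DIM * context_length * BYTES / 1e9

    for ctx in (2048, 8192, 32768):
        print(f"context {ctx:>5}: ~{kv_cache_gb(ctx):.2f} GB of KV cache")
    # 2048 -> ~0.27 GB, 8192 -> ~1.07 GB, 32768 -> ~4.29 GB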

Recommended Settings

Batch Size: 1
Context Length: 2048
Inference Framework: llama.cpp
Suggested Quantization: Q4_K_M or lower
Other Settings: enable CPU offloading, use a smaller model variant, reduce the number of layers
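If you go the llama.cpp route, the settings above map onto the llama-cpp-python bindings roughly as follows. The GGUF filename and the 18-layer GPU split are illustrative assumptions, not values from this report; raise or lower n_gpu_layers until the model loads without exhausting VRAM:

    # Sketch using llama-cpp-python (pip install llama-cpp-python).
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # hypothetical local file
        n_ctx=2048,       # recommended context length
        n_batch=512,      # prompt-processing chunk; concurrent sequences stay at 1
        n_gpu_layers=18,  # partial offload: remaining layers run on the CPU (assumed split)
    )

    out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=128)
    print(out["choices"][0]["text"])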

Frequently Asked Questions

Is Mixtral 8x7B (46.70B) compatible with NVIDIA RTX 3090 Ti?
No, the RTX 3090 Ti's 24GB VRAM is insufficient for the 46.7GB required by Mixtral 8x7B (INT8).
What VRAM is needed for Mixtral 8x7B (46.70B)?
Mixtral 8x7B (46.70B) requires approximately 93.4GB of VRAM in FP16 and 46.7GB in INT8.
How fast will Mixtral 8x7B (46.70B) run on NVIDIA RTX 3090 Ti?
It will likely not run due to VRAM limitations. If offloaded to CPU, performance will be significantly reduced. Expect very low tokens/sec.