Can I run Mistral Large 2 on NVIDIA RTX 4090?

Verdict: Fail/OOM. This GPU does not have enough VRAM.

GPU VRAM: 24.0 GB
Required: 246.0 GB
Headroom: -222.0 GB

Technical Analysis

The NVIDIA RTX 4090, while a powerful consumer GPU, falls short of the VRAM requirements for running Mistral Large 2 in its native FP16 (half-precision floating point) format. Mistral Large 2, with its 123 billion parameters, necessitates approximately 246GB of VRAM for FP16 inference. The RTX 4090 only provides 24GB of VRAM, leaving a significant deficit of 222GB. This discrepancy means that the entire model cannot be loaded onto the GPU at once, preventing direct inference.
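
The 246GB figure follows directly from the parameter count: at FP16 every parameter occupies 2 bytes, before counting the KV cache, activations, or CUDA overhead. A quick sketch of the arithmetic:

```python
# FP16 weight-memory estimate for Mistral Large 2 (weights only; the KV
# cache, activations, and CUDA context need additional VRAM on top).
params = 123e9               # 123 billion parameters
bytes_per_param = 2          # FP16 = 2 bytes per weight

weights_gb = params * bytes_per_param / 1e9
print(f"FP16 weights: {weights_gb:.0f} GB")                 # -> 246 GB

rtx_4090_vram_gb = 24.0
print(f"Headroom: {rtx_4090_vram_gb - weights_gb:.0f} GB")  # -> -222 GB
```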

Memory bandwidth also plays a role. While the RTX 4090 offers a respectable 1.01 TB/s of memory bandwidth, this is largely irrelevant when the model cannot reside in VRAM. Even with ample bandwidth, constantly swapping model layers between system RAM and GPU memory (offloading) introduces substantial latency and severely degrades performance. Without the ability to load the full model, reasonable tokens/second is not achievable, and a naive implementation will simply fail with an out-of-memory error before processing a single batch.
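
To see why offloading is so punishing, consider that each generated token requires streaming every offloaded weight to the GPU. A back-of-the-envelope estimate, where the quantized model size and effective PCIe throughput are illustrative assumptions rather than measurements:

```python
# Rough decode-speed ceiling when offloaded weights must cross PCIe every
# token. All figures below are illustrative assumptions, not benchmarks.
model_size_gb = 70.0         # assumed ~4-bit quantized Mistral Large 2
vram_for_weights_gb = 20.0   # 24 GB card minus KV cache and overhead
pcie_throughput_gbs = 25.0   # assumed effective PCIe 4.0 x16 bandwidth

offloaded_gb = max(model_size_gb - vram_for_weights_gb, 0.0)
seconds_per_token = offloaded_gb / pcie_throughput_gbs
print(f"~{1.0 / seconds_per_token:.2f} tokens/s ceiling")   # well below 1 tok/s
```

Frameworks such as llama.cpp instead execute the offloaded layers on the CPU, in which case system RAM bandwidth becomes the comparable bottleneck; either way, throughput lands far below an all-in-VRAM deployment.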

Recommendation

Due to the VRAM limitations, directly running Mistral Large 2 on a single RTX 4090 is impractical without significant modifications. The primary recommendation is to apply quantization, such as 4-bit or 8-bit quantization (using libraries like bitsandbytes or llama.cpp), to substantially reduce the model's memory footprint. Another option is to offload layers to system RAM, although this dramatically decreases inference speed. Alternatively, consider cloud-based inference services or distributed inference across multiple GPUs with sufficient combined VRAM. Fine-tuning a smaller model is also an option.
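
For the bitsandbytes route, a minimal sketch looks like the following. Note that even at 4 bits the weights alone come to roughly 62GB (123B parameters at 0.5 bytes each), so on a 24GB card the accelerate device map will still spill most layers to system RAM. The Hugging Face repo ID shown is an assumption, and the weights are gated.

```python
# Sketch: loading Mistral Large 2 in 4-bit with bitsandbytes and letting
# accelerate offload whatever does not fit onto the GPU. Requires the
# transformers, accelerate, and bitsandbytes packages. Even at 4 bits
# most layers will land in system RAM on a 24 GB card, so expect very
# slow generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-Large-Instruct-2407"  # assumed HF repo (gated)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",          # spill layers to CPU RAM when VRAM runs out
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```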

If you choose to proceed with quantization and offloading, prioritize fast system RAM and a high-bandwidth connection between the CPU and GPU (e.g., PCIe 4.0 or 5.0). Experiment with different quantization levels to balance memory usage and performance. Be prepared for significantly reduced inference speeds compared to running the model entirely in VRAM on a more capable GPU.
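
With llama.cpp, the equivalent workflow is to obtain a GGUF quantization of the model and offload only as many layers as actually fit in VRAM. A minimal sketch using the llama-cpp-python bindings, where the file name and layer count are placeholders to tune on your machine:

```python
# Sketch: partial GPU offload with llama-cpp-python. The model path and
# n_gpu_layers value are placeholders; raise n_gpu_layers until VRAM is
# nearly full, and keep n_ctx modest to leave room for the KV cache.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-large-2-q4_k_m.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,      # only a fraction of the layers fit in 24 GB
    n_ctx=4096,           # smaller context = smaller KV cache
    n_threads=16,         # CPU threads handle the layers left in RAM
)

out = llm("Explain VRAM offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```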

Recommended Settings

Batch size: 1 (adjust based on available VRAM after quantization)
Context length: reduce the context length if VRAM is still constrained
Inference framework: llama.cpp or vLLM
Suggested quantization: 4-bit or 8-bit (Q4_K_M or Q8_0)
Other settings:
- Enable GPU acceleration in llama.cpp (cuBLAS or CUDA)
- Experiment with layer offloading to system RAM
- Use a fast CPU with a high core count to minimize offloading latency

Frequently Asked Questions

Is Mistral Large 2 compatible with NVIDIA RTX 4090?
Not directly. The RTX 4090's 24GB of VRAM is insufficient for the 246GB required by Mistral Large 2 in FP16. Quantization or distributed inference is necessary.
What VRAM is needed for Mistral Large 2?
In FP16, Mistral Large 2 requires approximately 246GB of VRAM. Quantization can significantly reduce this requirement.
How fast will Mistral Large 2 run on NVIDIA RTX 4090?
Without quantization and offloading, it won't run. With aggressive quantization and offloading, expect significantly reduced inference speeds compared to running on hardware with sufficient VRAM. Performance will be highly dependent on the chosen quantization level and system RAM speed.