The NVIDIA RTX 4090, while a powerful consumer GPU, falls well short of the VRAM required to run Mistral Large 2 in its native FP16 (half-precision floating point) format. With 123 billion parameters at 2 bytes each, the model needs approximately 246GB of VRAM just to hold its weights, before accounting for the KV cache and activations. The RTX 4090 provides only 24GB of VRAM, a shortfall of roughly 222GB, so the entire model cannot be loaded onto the GPU at once and direct inference is not possible.
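A quick back-of-the-envelope calculation (weights only) shows where those figures come from:

```python
# FP16 weight footprint of a 123B-parameter model (weights only;
# the KV cache and activations require additional memory on top).
params = 123e9                 # Mistral Large 2 parameter count
bytes_per_param = 2            # FP16 = 2 bytes per parameter

weights_gb = params * bytes_per_param / 1e9
rtx_4090_vram_gb = 24

print(f"FP16 weights: ~{weights_gb:.0f} GB")                     # ~246 GB
print(f"Shortfall:    ~{weights_gb - rtx_4090_vram_gb:.0f} GB")  # ~222 GB
```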
Memory bandwidth also plays a role. The RTX 4090 offers a respectable 1.01 TB/s of VRAM bandwidth, but that figure is largely irrelevant when the model cannot reside in VRAM in the first place. Constantly swapping model layers between system RAM and GPU memory (offloading) forces weights across the far slower PCIe bus for every generated token, introducing substantial latency and severely degrading throughput. Without the whole model resident on the GPU, reasonable tokens/second is not achievable, and a naive implementation will simply fail to load the model at all.
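As a rough, hedged estimate, assume token generation is memory-bandwidth bound and that every weight must be read once per generated token; the PCIe figure below is a nominal spec, not a measurement:

```python
# Upper-bound decode speed under a simple bandwidth-bound model:
# tokens/s <= usable bandwidth / bytes read per token (~model size).
model_bytes = 123e9 * 2        # FP16 weights, ~246 GB

vram_bw = 1.01e12              # RTX 4090 VRAM bandwidth, ~1.01 TB/s
pcie4_bw = 32e9                # PCIe 4.0 x16, ~32 GB/s theoretical

print(f"Weights resident in VRAM:  <= {vram_bw / model_bytes:.1f} tok/s")   # ~4.1 tok/s
print(f"Weights streamed via PCIe: <= {pcie4_bw / model_bytes:.2f} tok/s")  # ~0.13 tok/s
```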
Due to the VRAM limitation, directly running Mistral Large 2 on a single RTX 4090 is impractical without significant modifications. The primary recommendation is to apply quantization, such as 4-bit or 8-bit quantization (using libraries like bitsandbytes or llama.cpp), to substantially reduce the model's memory footprint. Note, however, that even 4-bit quantization of 123 billion parameters amounts to roughly 62GB of weights, still far beyond 24GB, so quantization must be combined with offloading layers to system RAM, which dramatically decreases inference speed. Alternatively, use a cloud-based inference service, distribute inference across multiple GPUs with sufficient combined VRAM, or fall back to a smaller model (optionally fine-tuned for your task).
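As a minimal sketch of the quantization-plus-offload route using transformers and bitsandbytes: the Hugging Face repo name below is assumed, and whether the CPU-offloaded layers stay quantized depends on your bitsandbytes/transformers versions; a GGUF build with llama.cpp's partial GPU offload is the more commonly used path for this setup.

```python
# Sketch: load the model in 4-bit NF4 and let accelerate spill layers to
# system RAM once the RTX 4090's 24 GB fills up. Expect very low tokens/s.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-Large-Instruct-2407"  # assumed repo name

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",          # places layers on the GPU first, then CPU RAM
)

inputs = tokenizer("Explain KV caching in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```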
If you choose to proceed with quantization and offloading, prioritize fast system RAM and a high-bandwidth CPU-GPU link (the RTX 4090 uses a PCIe 4.0 x16 interface, so make sure it runs at the full x16 width). Experiment with different quantization levels to balance memory usage against quality and speed, and be prepared for inference far slower than running the model entirely in VRAM on a more capable GPU.
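To get a feel for how much each quantization level helps, here is a rough weights-only comparison (real quantized files carry extra overhead for scales and metadata, so treat these as lower bounds):

```python
# Approximate weight footprint of a 123B-parameter model at common
# quantization levels, and the fraction that would fit in 24 GB of VRAM.
params = 123e9
vram_gb = 24

for name, bits in [("FP16", 16), ("INT8", 8), ("~5-bit", 5), ("4-bit", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name:>7}: ~{gb:4.0f} GB total, ~{min(vram_gb / gb, 1.0):.0%} fits on the GPU")
```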