The NVIDIA RTX 5000 Ada, while a powerful GPU, falls well short of the VRAM required to run Mistral Large 2 at full FP16 precision. With 123 billion parameters at 2 bytes each, the model needs approximately 246GB of VRAM for its weights alone, while the RTX 5000 Ada offers only 32GB. This 214GB deficit means the model cannot be loaded onto the GPU for inference at all.
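The arithmetic behind those numbers is simple enough to sketch. The snippet below is a back-of-the-envelope estimate covering weights only; it deliberately ignores KV cache, activations, and framework overhead, which would push the real requirement even higher.

```python
# Back-of-the-envelope VRAM estimate for loading model weights only.
# Ignores KV cache, activations, and framework overhead, all of which
# add to the real requirement.

PARAMS = 123e9          # Mistral Large 2 parameter count
BYTES_PER_PARAM = 2     # FP16 = 2 bytes per parameter
GPU_VRAM_GB = 32        # RTX 5000 Ada

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
deficit_gb = weights_gb - GPU_VRAM_GB

print(f"Weights alone: {weights_gb:.0f} GB")   # ~246 GB
print(f"VRAM deficit:  {deficit_gb:.0f} GB")   # ~214 GB
```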
Furthermore, even if VRAM were sufficient, the RTX 5000 Ada's memory bandwidth of 0.58 TB/s would be a performance bottleneck for a model this large. Autoregressive decoding is memory-bandwidth-bound: each generated token requires streaming the model weights from memory, so limited bandwidth directly caps tokens processed per second, resulting in slow response times and a poor user experience. Since the model cannot be loaded at all, concrete tokens-per-second and optimal batch-size figures cannot be estimated.
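To make the bandwidth point concrete, here is a rough, purely hypothetical ceiling: assuming the FP16 weights somehow fit and that single-stream decoding reads the full weight set once per token, throughput is bounded by bandwidth divided by model size. The numbers are illustrative, not measurements.

```python
# Rough upper bound on single-stream decode throughput for a
# memory-bandwidth-bound workload: each generated token streams roughly
# the full weight set from memory once. Illustrative assumption only;
# real throughput would be lower due to KV-cache reads and overheads.

BANDWIDTH_TBPS = 0.58   # RTX 5000 Ada memory bandwidth, TB/s
MODEL_SIZE_GB = 246     # FP16 weights for a 123B-parameter model

max_tokens_per_s = (BANDWIDTH_TBPS * 1000) / MODEL_SIZE_GB
print(f"Bandwidth-bound ceiling: ~{max_tokens_per_s:.1f} tokens/s")  # ~2.4
```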
Given the substantial VRAM shortfall, running Mistral Large 2 directly on the RTX 5000 Ada without significant modifications is not feasible. Quantization to 4-bit or even lower precision drastically reduces the model's memory footprint, but does not close the gap on its own: at 4 bits per parameter, the weights still occupy roughly 62GB, about twice the available VRAM. Frameworks like `llama.cpp` or `text-generation-inference` are excellent for quantized inference and can offload layers that do not fit onto the GPU into system RAM, although CPU-resident layers will severely impact performance. Alternatively, explore cloud-based inference solutions or consider a multi-GPU setup if local hosting is essential. A sketch of partial GPU offload follows below.
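As a minimal sketch of the offload approach, the example below uses `llama-cpp-python` (the Python bindings for `llama.cpp`). The model filename and layer count are illustrative assumptions, not a tested configuration; you would tune `n_gpu_layers` until the offloaded portion fits within 32GB.

```python
# Minimal sketch of partial GPU offload with llama-cpp-python
# (pip install llama-cpp-python). The GGUF filename and n_gpu_layers
# value below are hypothetical; a 4-bit GGUF of a 123B model (~62 GB+)
# still exceeds 32 GB of VRAM, so the remaining layers stay in system
# RAM and are executed on the CPU, at a large throughput cost.

from llama_cpp import Llama

llm = Llama(
    model_path="mistral-large-2-q4_k_m.gguf",  # hypothetical local 4-bit GGUF
    n_gpu_layers=35,   # offload as many layers as fit in 32 GB; rest in RAM
    n_ctx=4096,        # context window; larger values cost more memory
)

out = llm(
    "Explain memory-bandwidth-bound inference in one sentence.",
    max_tokens=64,
)
print(out["choices"][0]["text"])
```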