The primary limiting factor for running Mistral Large 2 on an NVIDIA RTX A6000 is the gap between the model's memory footprint and the card's VRAM. Mistral Large 2, with its 123 billion parameters, requires roughly 246GB of VRAM for the weights alone at FP16 precision (2 bytes per parameter). The RTX A6000's 48GB falls drastically short, a shortfall of roughly 198GB before the KV cache and activations are even counted. The model in its native FP16 format therefore cannot be loaded onto the GPU at all, which rules out inference outright. Memory bandwidth, while substantial at 0.77 TB/s on the A6000, is only a secondary concern when the model cannot be loaded in the first place.
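A quick back-of-the-envelope check makes the gap concrete. The sketch below is purely arithmetic on the 123B parameter count and the bytes-per-parameter cost of each precision; it ignores KV cache, activations, and framework overhead, all of which add further gigabytes on top of the weights.

```python
# Back-of-the-envelope weight-memory estimate for a 123B-parameter model.
# Ignores KV cache, activations, and framework overhead, which add more on top.

PARAMS = 123e9          # Mistral Large 2 parameter count
A6000_VRAM_GB = 48      # RTX A6000 memory

bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    weights_gb = PARAMS * nbytes / 1e9
    verdict = "fits" if weights_gb <= A6000_VRAM_GB else "does not fit"
    print(f"{precision}: ~{weights_gb:.0f} GB of weights -> {verdict} in {A6000_VRAM_GB} GB")
```

Even at 4-bit, the weights alone come to roughly 62GB, which is worth keeping in mind for the quantization discussion below.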
Without sufficient VRAM, the runtime would have to offload parts of the model to system RAM, which is reached over a far slower PCIe link. The result would be extremely poor performance, likely unusable for real-time or even interactive applications. The A6000's 10,752 CUDA cores and 336 Tensor cores are powerful resources, but they sit idle while weights stream in from host memory, so they cannot be effectively utilized. Meaningful tokens/second and batch-size figures therefore cannot be given for this configuration; the VRAM constraint dominates everything else.
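To see why offloading is so punishing, a common rule of thumb is that single-stream decoding is memory-bandwidth bound: each generated token requires reading every weight once, so throughput is capped at roughly bandwidth divided by model size in bytes. The sketch below applies that heuristic with 4-bit weights; the PCIe 4.0 x16 figure is a nominal peak and real offloaded performance is usually worse.

```python
# Rough upper bound on single-stream decode speed: each generated token reads
# every weight once, so tokens/s <= effective bandwidth / model size in bytes.
# Assumes 4-bit weights (~62 GB); numbers are heuristic, not benchmarks.

MODEL_BYTES_4BIT = 123e9 * 0.5        # ~61.5 GB of 4-bit weights
GPU_BANDWIDTH = 768e9                 # RTX A6000 GDDR6: ~0.77 TB/s
PCIE_BANDWIDTH = 32e9                 # PCIe 4.0 x16 host-to-device, nominal peak

for name, bw in [("all weights in VRAM", GPU_BANDWIDTH),
                 ("weights streamed from system RAM", PCIE_BANDWIDTH)]:
    print(f"{name}: <= {bw / MODEL_BYTES_4BIT:.1f} tokens/s")
```

The ceiling drops from roughly a dozen tokens per second to well under one token per second once the weights have to cross the PCIe bus every step.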
To run Mistral Large 2 on an RTX A6000, you must employ aggressive quantization. Note, however, that standard 4-bit quantization (bitsandbytes NF4, GPTQ, or GGUF Q4) still leaves roughly 62GB of weights, which exceeds a single A6000's 48GB; on one card you would need roughly 3-bit or lower quantization, or partial offload of some layers to CPU, both of which cost quality and/or speed. A cleaner option is model parallelism across two A6000s (96GB combined), which comfortably holds a 4-bit model but adds setup complexity. Even then, expect slower inference than on GPUs with more VRAM. If neither quantization nor multi-GPU parallelism is feasible, consider cloud-based inference services offering Mistral Large 2, which handle the hardware requirements for you.
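For reference, here is a minimal sketch of 4-bit loading with Hugging Face transformers and bitsandbytes. It assumes the mistralai/Mistral-Large-Instruct-2407 checkpoint (gated on Hugging Face), recent transformers/accelerate/bitsandbytes versions, and enough system RAM to hold the layers that spill off the 48GB card; it is illustrative rather than a tuned deployment recipe, and the offloaded configuration will be slow for the reasons discussed above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-Large-Instruct-2407"  # assumed checkpoint name

# NF4 4-bit quantization; ~62 GB of weights still exceeds 48 GB of VRAM,
# so layers that do not fit are kept on the CPU (in fp32) and run slowly.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,  # allow overflow layers on CPU
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # splits layers across the GPU and CPU as needed
)

prompt = "Explain the VRAM requirements of a 123B-parameter model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

With two A6000s, the same `device_map="auto"` call can instead shard the quantized layers across both GPUs, which avoids the CPU offload path entirely.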