The primary limiting factor for running Mistral Large 2 on the NVIDIA Jetson AGX Orin 64GB is memory capacity. Mistral Large 2, with its 123 billion parameters, requires approximately 246GB for the weights alone at FP16 precision (2 bytes per parameter). The Jetson AGX Orin 64GB provides 64GB of unified LPDDR5 memory shared between the CPU and GPU, which serves as its effective VRAM. That leaves a shortfall of roughly 182GB, so the model in its full FP16 form cannot be loaded onto the device at all. The Ampere-architecture GPU, while capable, is ultimately constrained by this memory capacity for a model of this size. Memory bandwidth, at 0.21 TB/s, would also become a bottleneck even if sufficient memory were available, capping the achievable tokens/second generation rate.
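As a rough back-of-the-envelope check (weights only, ignoring KV cache, activations, and runtime overhead), the footprint at different precisions is simply parameters times bytes per parameter:

```python
# Rough weight-memory estimate for Mistral Large 2 (123B parameters).
# Weights only -- KV cache, activations, and runtime overhead are extra.
PARAMS = 123e9

def weight_gb(bits_per_param: float) -> float:
    """Approximate weight footprint in GB for a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4), ("3-bit", 3)]:
    print(f"{label:>5}: ~{weight_gb(bits):.0f} GB")

# FP16: ~246 GB, INT8: ~123 GB, 4-bit: ~62 GB, 3-bit: ~46 GB
# Only the 3-bit (and marginally the 4-bit) variants approach the Orin's
# 64 GB, and that is before any KV cache or runtime overhead.
```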
Because the Orin's 64GB is a single unified pool shared by the CPU and GPU, offloading layers to "system RAM" does not add capacity the way it does on a discrete-GPU workstation; anything beyond 64GB would have to be streamed from NVMe storage, and performance would be severely degraded by those transfer speeds. The 2048 CUDA cores and 64 Tensor cores would sit largely idle while the system swaps weights in and out. The large 128,000-token context length compounds the memory pressure, since the KV cache required by the attention mechanism grows linearly with context length. Running Mistral Large 2 in its original FP16 form on the Jetson AGX Orin 64GB is therefore not feasible.
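To see how the long context adds up, the KV-cache size can be estimated with the standard formula 2 × layers × kv_heads × head_dim × context_length × bytes_per_element. The layer and head counts below are illustrative assumptions, not the confirmed Mistral Large 2 configuration, so treat the result as order-of-magnitude only:

```python
# Rough KV-cache size at full context. The architecture numbers here are
# illustrative assumptions, not the confirmed Mistral Large 2 config.
N_LAYERS = 88        # assumed transformer layer count
N_KV_HEADS = 8       # assumed KV heads (grouped-query attention)
HEAD_DIM = 128       # assumed per-head dimension
CONTEXT = 128_000    # tokens
BYTES = 2            # FP16 cache entries

# Factor of 2 for keys and values, per token, per layer.
kv_cache_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * CONTEXT * BYTES
print(f"KV cache at 128k context: ~{kv_cache_bytes / 1e9:.0f} GB")
# ~46 GB on its own -- a large slice of the Orin's 64 GB unified memory
# before a single weight is loaded.
```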
To run Mistral Large 2 on the Jetson AGX Orin 64GB at all, aggressive quantization is essential. At 4-bit precision the weights alone are roughly 62GB, which barely fits in the 64GB of unified memory before the KV cache, runtime, and operating system take their share, so 3-bit or mixed-precision quantization is more realistic. Frameworks like `llama.cpp` are well suited for this, offering a range of GGUF quantization formats and fine-grained control over how many layers run on the GPU. However, expect a noticeable reduction in output quality compared to the FP16 model, particularly at 3-bit and below.
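As a sketch of what this looks like in practice, the snippet below uses the `llama-cpp-python` bindings to load a pre-quantized GGUF file. The file name is a placeholder, and whether a given quantization level actually fits must be verified on the device:

```python
# Sketch: loading a heavily quantized GGUF build of the model with the
# llama-cpp-python bindings (built with CUDA support for the Orin's GPU).
# The model path is a placeholder for a hypothetical 3-bit GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-large-2-q3_k_m.gguf",  # hypothetical quantized file
    n_gpu_layers=-1,  # run all layers on the GPU (unified memory)
    n_ctx=8192,       # keep context well below 128k to limit KV-cache size
)

out = llm(
    "Summarize the trade-offs of 3-bit quantization in two sentences.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```

Reducing `n_ctx` from the model's full 128,000-token window is one of the most effective levers here, since the KV cache scales linearly with context length.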
Alternatively, consider a smaller model from the same family, fine-tuned for the target task, that fits comfortably within the Jetson AGX Orin's resources. Another approach is a cloud-based inference service: the model is hosted remotely and the Jetson AGX Orin acts as a client that sends requests and receives responses. This removes the local memory and compute burden entirely, but requires a stable internet connection and adds per-request latency and cost.
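A minimal client sketch for that pattern, assuming the remote service exposes an OpenAI-style chat-completions endpoint (the URL, model identifier, and API-key variable are placeholders, not a specific provider's values):

```python
# Minimal sketch of the Orin acting as a thin client to a remote inference
# service. Assumes an OpenAI-compatible chat-completions endpoint; the URL,
# model name, and API key are placeholders.
import os
import requests

API_URL = "https://inference.example.com/v1/chat/completions"  # placeholder
API_KEY = os.environ.get("INFERENCE_API_KEY", "")

def remote_chat(prompt: str, timeout_s: float = 30.0) -> str:
    """Send one prompt to the hosted model and return the reply text."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "mistral-large-2",  # placeholder model identifier
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=timeout_s,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(remote_chat("Hello from the Jetson AGX Orin."))
```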