The NVIDIA Jetson AGX Orin 64GB, while a powerful embedded platform, falls short of the memory requirements for running LLaVA 1.6 34B in FP16 precision. At roughly 2 bytes per parameter, the 34B model needs approximately 68GB just for its weights in FP16 (half-precision floating point), whereas the Jetson AGX Orin provides 64GB of unified memory shared between the GPU, CPU, and operating system. This deficit of at least 4GB, before accounting for the KV cache, activations, and the OS itself, means the model in its default FP16 configuration cannot be loaded, leading to out-of-memory errors. Furthermore, even if the weights could be squeezed in, the Jetson AGX Orin's memory bandwidth of 204.8 GB/s can become a bottleneck, particularly during large-batch inference or when dealing with longer context lengths.
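To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch. The bits-per-weight figures for the quantized formats are approximate averages rather than exact GGUF file sizes, and the estimate covers only the language-model weights (no vision tower, KV cache, or activations):

```python
# Rough weight-memory estimate for a 34B-parameter model at different precisions.
# Bits-per-weight values for the quantized formats are approximations and exclude
# the vision tower, KV cache, and activation memory.
PARAMS = 34e9

formats = {
    "FP16":   16.0,   # 2 bytes per weight
    "Q8_0":    8.5,   # ~8.5 bits per weight (8-bit blocks plus per-block scale)
    "Q4_K_M":  4.85,  # ~4.85 bits per weight (approximate average)
}

for name, bits in formats.items():
    gigabytes = PARAMS * bits / 8 / 1e9
    verdict = "fits" if gigabytes < 64 else "does NOT fit"
    print(f"{name:>7}: ~{gigabytes:5.1f} GB of weights -> {verdict} in 64 GB unified memory")
```

Running this gives roughly 68GB for FP16, 36GB for Q8_0, and 21GB for Q4_K_M, which is the gap the next section closes.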
To run LLaVA 1.6 34B on the Jetson AGX Orin 64GB, you'll need to significantly reduce the model's memory footprint. The primary method is quantization. Consider Q4_K_M or an even lower quantization level available in llama.cpp or similar frameworks: Q4_K_M compresses the weights to roughly 20GB for a 34B model, leaving ample headroom within the 64GB of unified memory. Be aware that aggressive quantization can degrade model accuracy, so experiment to find a balance between memory savings and output quality. Additionally, keeping the batch size and context length modest reduces the KV cache and activation memory and improves inference speed. If these optimizations are insufficient, consider a smaller model variant or, if feasible, distributed inference across multiple devices.
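As a sketch of what this looks like in practice, the snippet below loads a Q4_K_M GGUF with llama-cpp-python (built with CUDA support for the Orin). The file path and the context/batch values are illustrative assumptions rather than tuned recommendations, and the multimodal wiring (the mmproj vision projector and image input) is omitted; only the language-model side is shown:

```python
from llama_cpp import Llama

# Illustrative path to a Q4_K_M GGUF conversion of LLaVA 1.6 34B's language model;
# adjust to wherever your quantized file actually lives.
MODEL_PATH = "/models/llava-v1.6-34b.Q4_K_M.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=-1,   # offload all layers to the GPU (unified memory on the Orin)
    n_ctx=2048,        # shorter context keeps the KV cache small
    n_batch=256,       # modest batch size limits activation memory
)

out = llm("Describe the scene in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Lowering n_ctx and n_batch is the quickest way to trade throughput and context length for memory headroom if the quantized model still runs close to the 64GB limit.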