The NVIDIA Jetson AGX Orin 64GB, with its 64GB of unified LPDDR5 memory shared between the CPU and GPU, is well-suited for running the LLaVA 1.6 13B model. LLaVA 1.6 13B requires approximately 26GB of memory at FP16 precision (13B parameters at 2 bytes each, plus the vision encoder). On paper that leaves about 38GB of headroom; because the OS and CPU-side processes draw from the same unified pool the usable margin is somewhat smaller, but the model and its associated data structures still fit comfortably. The Orin's 2048 CUDA cores and 64 Tensor Cores handle the matrix multiplications that dominate the model's workload, while its 204.8 GB/s memory bandwidth governs data movement between memory and the compute units. The headroom also accommodates larger batch sizes and longer context lengths without immediately hitting memory limits, though memory bandwidth is likely to become the bottleneck as batch size grows, since autoregressive decoding is bandwidth-bound.
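As a sanity check on this arithmetic, here is a minimal Python sketch of the memory budget. The layer count, hidden size, context length, and OS reserve are assumptions (a LLaMA-2-13B language backbone with FP16 weights and an FP16 KV cache), not figures from the text; adjust them to match your checkpoint and workload. With no OS reserve and a 4K context the estimate lands near the batch size of 14 cited below; with a more conservative reserve it comes out closer to 10.

```python
# Back-of-the-envelope memory math for LLaVA 1.6 13B on a 64GB Jetson AGX Orin.
# Assumed architecture: LLaMA-2-13B backbone (40 layers, hidden size 5120),
# FP16 weights and FP16 KV cache. Tweak these if your checkpoint differs.

GiB = 1024**3

params = 13e9                       # language-model parameters
bytes_per_param = 2                 # FP16
weights = params * bytes_per_param  # ~26 GB of weights

n_layers = 40
hidden = 5120
ctx_len = 4096                      # assumed context length per sequence

# KV cache per token: one key and one value vector per layer, FP16.
kv_per_token = 2 * n_layers * hidden * bytes_per_param  # ~0.8 MiB
kv_per_seq = kv_per_token * ctx_len                     # ~3.1 GiB

total_mem = 64 * GiB
reserved = 8 * GiB                  # assumed OS/CPU share of unified memory
headroom = total_mem - reserved - weights

print(f"weights:    {weights / GiB:.1f} GiB")
print(f"KV per seq: {kv_per_seq / GiB:.2f} GiB")
print(f"max batch:  {int(headroom // kv_per_seq)} sequences at ctx={ctx_len}")
```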
Given the substantial memory headroom, experiment with larger batch sizes (up to the estimated 14) to maximize throughput. Start with FP16 precision as a baseline, then consider quantization (e.g., Q4_K_M) to shrink the memory footprint; on bandwidth-bound hardware like the Orin this often improves decode speed as well, at some cost in accuracy. Monitor memory usage and token generation speed throughout. A framework such as `llama.cpp`, built with CUDA support enabled (the `GGML_CUDA` CMake option) so it targets the Orin's GPU, is a good fit here. If you hit performance bottlenecks, profile the image preprocessing and vision-encoder stages of the LLaVA pipeline as well as text decoding.
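One way to watch memory during these experiments is to tail `tegrastats`, NVIDIA's stock Jetson telemetry tool, while inference runs in another process. The sketch below is a minimal example under that assumption; the `RAM used/total` line format and the `--interval` flag match common JetPack/L4T releases but can vary (and some fields require `sudo`), so adjust the regex if it fails to match on your system. Token throughput itself is easiest to read from `llama.cpp`'s own end-of-run timing report.

```python
# Minimal unified-memory monitor for the Jetson AGX Orin: samples tegrastats
# output and prints the RAM used/total figure once per interval.

import re
import subprocess

# Matches e.g. "RAM 31250/62780MB" in a tegrastats line; format may vary
# across JetPack releases.
RAM_RE = re.compile(r"RAM (\d+)/(\d+)MB")

def watch_memory(interval_ms: int = 1000) -> None:
    """Print unified-memory usage once per interval until interrupted."""
    proc = subprocess.Popen(
        ["tegrastats", "--interval", str(interval_ms)],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        for line in proc.stdout:
            m = RAM_RE.search(line)
            if m:
                used, total = map(int, m.groups())
                print(f"RAM: {used} / {total} MB ({100 * used / total:.1f}%)")
    except KeyboardInterrupt:
        pass
    finally:
        proc.terminate()

if __name__ == "__main__":
    watch_memory()
```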