The NVIDIA Jetson AGX Orin 32GB, built around an Ampere-architecture GPU with CUDA and Tensor cores, is a capable platform for edge AI inference. However, its 32GB of LPDDR5 memory, shared between the CPU and GPU, is a hard ceiling when attempting to run LLaVA 1.6 34B, a large vision-language model. In FP16, the model's weights alone occupy roughly 68GB (34 billion parameters at 2 bytes each), before accounting for the KV cache and activations needed during inference. The Orin's memory bandwidth of about 0.21 TB/s, while respectable for an embedded device, compounds the problem by limiting how quickly weights can be streamed from memory to the GPU cores on every decoding step.
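As a quick sanity check, the weight footprint can be estimated directly from the parameter count and the bytes used per parameter. The sketch below is back-of-the-envelope only: it covers the language-model weights and ignores the vision encoder, the KV cache, and runtime overhead.

```python
# Rough weight-memory estimate for a ~34B-parameter model at several precisions.
# Back-of-the-envelope only: ignores the vision tower, KV cache, and runtime overhead.

PARAMS = 34e9  # ~34 billion parameters (language model only)

bytes_per_param = {
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

for precision, nbytes in bytes_per_param.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights")

# FP16: ~68 GB -> far beyond the Orin's 32GB shared pool
# INT8: ~34 GB -> still over the 32GB pool once the OS and runtime are counted
# INT4: ~17 GB -> fits, with headroom for the KV cache and vision encoder
```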
Because of this memory shortfall, loading and running LLaVA 1.6 34B in FP16 directly on the Jetson AGX Orin 32GB is not feasible: the weights alone are more than double the available memory, so any attempt to load the full-precision model ends in out-of-memory errors. Aggressive memory management cannot close a gap of that size. Even if the weights could somehow be paged in, the limited memory bandwidth would cap token generation at a crawl, since each decoded token requires streaming the active weights through the memory system.
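A rough roofline estimate illustrates why: during autoregressive decoding, each new token reads essentially all of the resident weights, so memory bandwidth divided by weight size gives an optimistic upper bound on tokens per second. The numbers below are illustrative assumptions and ignore caching, batching, and the vision encoder.

```python
# Optimistic decode-speed ceiling: memory bandwidth / bytes read per token.
# Assumes every token streams the full weight set once; real throughput is lower.

BANDWIDTH_GB_S = 204.8  # Jetson AGX Orin 32GB LPDDR5 bandwidth (~0.21 TB/s)

for label, weight_gb in [("FP16 (~68 GB)", 68.0), ("4-bit (~17 GB)", 17.0)]:
    ceiling = BANDWIDTH_GB_S / weight_gb
    print(f"{label}: <= ~{ceiling:.1f} tokens/s (theoretical ceiling)")

# FP16 (~68 GB):  <= ~3.0 tokens/s  -- and the weights do not even fit in memory
# 4-bit (~17 GB): <= ~12 tokens/s   -- reachable only with a well-optimized runtime
```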
Given these constraints, running LLaVA 1.6 34B on the Jetson AGX Orin 32GB is impractical without significant modifications. The most direct path is quantization: a 4-bit quantization shrinks the weights to roughly 17GB, which fits within the 32GB shared pool with room left for the KV cache and the vision encoder, whereas 8-bit (around 34GB) is still too large. Pairing quantization with an inference framework built for constrained hardware, such as llama.cpp with an appropriate quantization level, is essential.
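As an illustration, the sketch below loads a pre-quantized GGUF build of the model through the llama-cpp-python bindings. The file names, quantization level, and context size are assumptions, and LLaVA additionally needs its multimodal projector (the mmproj/CLIP file) supplied through a chat handler; the 1.5 handler is used here, and your llama-cpp-python version may expose a 1.6-specific variant.

```python
# Sketch: load a 4-bit GGUF quantization of LLaVA 1.6 34B via llama-cpp-python.
# File paths and the quantization level are assumptions; adjust to the files you have.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler  # 1.6 may have its own handler

chat_handler = Llava15ChatHandler(
    clip_model_path="mmproj-llava-v1.6-34b-f16.gguf"  # multimodal projector (assumed name)
)

llm = Llama(
    model_path="llava-v1.6-34b.Q4_K_M.gguf",  # ~20 GB 4-bit quantization (assumed name)
    chat_handler=chat_handler,
    n_ctx=4096,        # keep the context modest; the KV cache shares the same 32GB pool
    n_gpu_layers=-1,   # offload all transformer layers to the Orin's GPU
)

out = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///tmp/test.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

Note that the underlying llama.cpp library must be compiled with CUDA support on the Jetson for the GPU offload to take effect; otherwise inference silently falls back to the CPU.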
Alternatively, explore smaller vision-language models, or keep only part of the model on the GPU and run the remaining layers on the CPU, as sketched below. Keep in mind that the Orin's CPU and GPU share the same LPDDR5 pool, so partial offload eases GPU pressure rather than adding capacity, and it introduces significant performance degradation. As a last resort, cloud-based inference can run the full model on more capable hardware. Fine-tuning a smaller vision-language model to perform a similar task is another viable option and is far more resource-friendly for the Orin.
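If a given quantization is still tight, the same llama-cpp-python entry point can keep only some layers on the GPU. The layer split below is an illustrative assumption and would need tuning against the actual memory headroom on the device.

```python
# Sketch: partial GPU offload -- keep 40 of the model's ~60 transformer layers on the
# GPU and run the rest on the CPU. The split is an assumption to tune empirically;
# expect markedly slower decoding than a full GPU offload.
from llama_cpp import Llama

llm = Llama(
    model_path="llava-v1.6-34b.Q4_K_M.gguf",  # assumed file name, as above
    n_gpu_layers=40,   # layers kept on the GPU; the remainder execute on the CPU
    n_ctx=2048,
)
```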