Can I run LLaVA 1.6 34B on NVIDIA Jetson AGX Orin 64GB?

Result: Fail/OOM. This GPU doesn't have enough VRAM.
GPU VRAM: 64.0 GB
Required: 68.0 GB
Headroom: -4.0 GB

VRAM Usage: 100% used (64.0 GB of 64.0 GB)

Technical Analysis

The NVIDIA Jetson AGX Orin 64GB, while a powerful embedded platform, falls short of the memory required to run LLaVA 1.6 34B in FP16 precision. At 2 bytes per weight, the 34-billion-parameter model needs roughly 68 GB for its weights alone, while the Orin provides 64 GB of unified memory shared between the CPU and GPU. This 4 GB deficit means the model, in its default FP16 configuration, cannot be fully loaded, leading to out-of-memory errors. Even if the model could be squeezed in, the Orin's memory bandwidth of 204.8 GB/s would likely become a bottleneck, particularly with large batch sizes or long context lengths.
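The 68 GB figure follows directly from parameter count times bytes per weight. A quick back-of-envelope check (weights only; activations and the KV cache add more on top):

```python
# Rough FP16 footprint of a 34B-parameter model, weights only.
PARAMS = 34e9          # ~34 billion parameters
BYTES_PER_PARAM = 2    # FP16 = 2 bytes per weight

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"FP16 weights: {weights_gb:.0f} GB")  # 68 GB, vs 64 GB available
```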

Recommendation

To run LLaVA 1.6 34B on the Jetson AGX Orin 64GB, you need to shrink the model's memory footprint substantially, and the primary tool for that is quantization. Consider Q4_K_M or even lower quantization levels available in llama.cpp and similar frameworks; these compress the model's weights and drastically reduce memory usage. Be aware that aggressive quantization can hurt accuracy, so experiment to find the right balance between footprint and quality. Reducing batch size and context length further eases memory pressure and can improve inference speed. If these optimizations are insufficient, consider a smaller model variant, or distributed inference across multiple devices if that is feasible.
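As a rough illustration of what quantization buys, here is a sketch comparing footprints at different precisions. The bits-per-weight values are approximate figures commonly quoted for llama.cpp quant types; actual GGUF file sizes vary slightly by model.

```python
# Approximate weight footprint at different quantization levels.
# Bits/weight are estimates, not exact GGUF file sizes.
PARAMS = 34e9
BITS_PER_WEIGHT = {"FP16": 16, "Q8_0": 8.5, "Q4_K_M": 4.85}

for name, bpw in BITS_PER_WEIGHT.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{name:7s} ~{gb:.1f} GB")
# Q4_K_M lands around 20 GB, comfortably inside 64 GB of unified memory.
```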

Recommended Settings

Batch Size: 1
Context Length: 2048 or lower
Inference Framework: llama.cpp
Suggested Quantization: Q4_K_M or lower
Other Settings:
- Enable memory mapping (mmap) to reduce RAM usage
- Experiment with different quantization methods for the best balance
- Monitor memory usage closely during inference
- Consider CPU offloading for some layers if memory is still insufficient
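To see why a lower context length helps, here is a hedged KV-cache estimate. The architecture numbers are assumed, Yi-34B-like values (60 layers, grouped-query attention with 8 KV heads of dimension 128); verify them against your model's actual config before relying on this.

```python
# KV-cache size estimate for a 34B-class model.
# LAYERS, KV_HEADS, HEAD_DIM are assumptions, not confirmed specs.
LAYERS = 60
KV_HEADS = 8     # grouped-query attention
HEAD_DIM = 128
BYTES = 2        # FP16 cache entries

def kv_cache_gb(ctx_len, batch=1):
    # factor of 2 = one K tensor plus one V tensor per layer
    return 2 * LAYERS * ctx_len * KV_HEADS * HEAD_DIM * BYTES * batch / 1e9

print(f"ctx=2048: {kv_cache_gb(2048):.2f} GB")
print(f"ctx=8192: {kv_cache_gb(8192):.2f} GB")
```

The cache grows linearly with both context length and batch size, which is why the settings above pin batch size to 1 and cap context at 2048.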

Frequently Asked Questions

Is LLaVA 1.6 34B compatible with NVIDIA Jetson AGX Orin 64GB?
No, not without significant quantization and optimization due to VRAM limitations.
What VRAM is needed for LLaVA 1.6 34B?
Approximately 68GB of VRAM is needed for LLaVA 1.6 34B in FP16 precision. Quantization can reduce this significantly.
How fast will LLaVA 1.6 34B run on NVIDIA Jetson AGX Orin 64GB?
Performance will be limited by VRAM capacity and memory bandwidth. Expect significantly reduced tokens/second compared to higher-end GPUs, especially with quantization. Exact speed depends on the quantization level and optimization techniques employed.
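Single-stream decode on a memory-bound system has a simple upper bound: every generated token must read all model weights once, so tokens/second cannot exceed bandwidth divided by weight bytes. A rough sketch (the Q4_K_M size is the earlier estimate, not a measured file):

```python
# Bandwidth-bound ceiling on decode speed: tokens/s <= bandwidth / weights.
# Real throughput will be lower due to overheads; this is an upper bound.
BANDWIDTH_GBS = 204.8   # Jetson AGX Orin 64GB memory bandwidth
MODEL_GB = 20.6         # estimated Q4_K_M weights for a 34B model

ceiling_tps = BANDWIDTH_GBS / MODEL_GB
print(f"theoretical ceiling: ~{ceiling_tps:.0f} tokens/s")
```

In practice, expect real throughput to land well under this ceiling once attention, the vision encoder, and scheduling overheads are accounted for.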