Can I run LLaVA 1.6 34B on NVIDIA Jetson AGX Orin 32GB?

Result: Fail (OOM). This GPU doesn't have enough VRAM.
GPU VRAM: 32.0 GB
Required: 68.0 GB
Headroom: -36.0 GB

VRAM Usage: 100% (the 68.0 GB requirement exceeds the 32.0 GB available)

Technical Analysis

The NVIDIA Jetson AGX Orin 32GB, built on the Ampere architecture, pairs CUDA and Tensor cores well suited to AI inference. However, its 32GB of LPDDR5 is unified memory shared between the CPU, GPU, and operating system, so the full capacity is never available to a model, and it falls far short of what LLaVA 1.6 34B, a large vision-language model, demands. In FP16 precision the weights alone occupy roughly 68GB (34 billion parameters at 2 bytes each), before accounting for the KV cache and activations. The Orin's memory bandwidth of approximately 0.21 TB/s (204.8 GB/s), while respectable for an embedded platform, compounds the problem by capping how quickly weights can be streamed from memory to the GPU cores.
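As a rough sanity check on these numbers, here is a back-of-the-envelope estimate in Python of the weight footprint at common precisions (weights only; the KV cache, activations, and vision encoder add more on top, and the 0.56 bytes/parameter figure for Q4_K_M is an approximation):

```python
# Rough weight-memory estimate for LLaVA 1.6 34B at common precisions.
# Weights only -- KV cache, activations, and runtime overhead come on top.
PARAMS = 34e9        # ~34 billion parameters
MEM_GB = 32.0        # Jetson AGX Orin 32GB (unified, shared with the CPU/OS)

bytes_per_param = {
    "FP16": 2.0,
    "INT8 (Q8_0-like)": 1.0,
    "4-bit (Q4_K_M-like)": 0.56,  # ~4.5 bits/weight incl. scales; approximate
}

for precision, bpp in bytes_per_param.items():
    size_gb = PARAMS * bpp / 1e9
    verdict = "fits" if size_gb < MEM_GB else "does NOT fit"
    print(f"{precision:>20}: {size_gb:6.1f} GB -> {verdict} in {MEM_GB:.0f} GB")
```

FP16 lands at 68 GB and even INT8 at 34 GB, which is why only 4-bit or lower has a realistic chance on this board.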

Directly loading and running LLaVA 1.6 34B in FP16 on the Jetson AGX Orin 32GB is therefore not feasible: the weights alone are more than double the available memory, so any load attempt fails with an out-of-memory error, and no amount of runtime memory management can close a 36GB deficit. Even if the model could somehow be paged in, the limited memory bandwidth would throttle token generation to unusable speeds, because each generated token requires streaming essentially all of the model's weights from memory.
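To put a number on that constraint: single-stream decoding is memory-bandwidth-bound, so a hedged upper-bound estimate is simply bandwidth divided by model size (real throughput will be lower):

```python
# Bandwidth-bound ceiling for single-stream decode speed:
# tokens/s <= memory_bandwidth / bytes_read_per_token ~= bandwidth / model_size
BANDWIDTH_GBPS = 204.8  # Jetson AGX Orin 32GB LPDDR5 bandwidth (~0.21 TB/s)

for label, size_gb in [("FP16, 68 GB (hypothetical)", 68.0),
                       ("Q4_K_M, ~19 GB", 19.0)]:
    print(f"{label}: <= {BANDWIDTH_GBPS / size_gb:.1f} tokens/s")
```

Even if the FP16 model fit, it would top out around 3 tokens/s; a 4-bit build is capped near 10 tokens/s before any other overhead.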

Recommendation

Given the VRAM constraints, running LLaVA 1.6 34B directly on the Jetson AGX Orin 32GB is impractical without significant modifications. Quantization is essential, and it must be aggressive: an 8-bit quantization still needs roughly 34GB for weights alone, which exceeds the Orin's shared 32GB, so 4-bit (e.g. Q4_K_M, around 19GB) or lower is the realistic floor. An inference framework optimized for low-resource environments, such as llama.cpp with a GGUF-quantized model, is the most practical route; a hedged loading sketch follows the next paragraph.

Alternatively, explore smaller multimodal models. Note that classic CPU offloading buys little on this platform: the Orin's CPU and GPU share the same 32GB of LPDDR5, so moving layers "to the CPU" frees no additional capacity and only slows inference. As a last resort, consider a cloud-based inference service where the model can run on hardware with sufficient memory. Fine-tuning a smaller vision-language model for the target task is another viable option that is far more resource-friendly for the Orin.
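As a minimal sketch of the llama.cpp route via the llama-cpp-python bindings: the GGUF and mmproj file names and the image path below are hypothetical placeholders, a quantized LLaVA 1.6 GGUF plus its matching CLIP projector (mmproj) file must be obtained separately, and Llava16ChatHandler assumes a llama-cpp-python build that ships the LLaVA 1.6 chat handler.

```python
# Hedged sketch: load a 4-bit LLaVA 1.6 34B GGUF with llama-cpp-python.
# File names below are placeholders, not real download paths.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava16ChatHandler

# LLaVA-style models need a separate CLIP projector ("mmproj") file.
chat_handler = Llava16ChatHandler(clip_model_path="mmproj-llava-1.6-34b-f16.gguf")

llm = Llama(
    model_path="llava-1.6-34b.Q4_K_M.gguf",  # ~19 GB of weights at Q4_K_M
    chat_handler=chat_handler,
    n_gpu_layers=-1,  # offload all layers to the GPU (requires a CUDA build)
    n_ctx=512,        # per the settings below; raise if image tokens overflow
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "file:///path/to/image.jpg"}},  # placeholder
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```

Note that LLaVA 1.6 can encode a high-resolution image into a couple of thousand tokens, so the 512-token context recommended below may need to grow to accommodate the image embedding.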

Recommended Settings

Batch Size: 1
Context Length: 512
Inference Framework: llama.cpp
Suggested Quantization: Q4_K_M or lower
Other Settings: enable GPU acceleration in llama.cpp; reduce n_gpu_layers if memory pressure persists; monitor memory usage closely
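Because the Orin's GPU allocates from the same LPDDR5 as the CPU, overall memory pressure is what matters, and it can be watched from /proc/meminfo (NVIDIA's tegrastats utility reports similar figures). A minimal monitoring sketch:

```python
# Watch available unified memory on a Jetson while the model loads.
# The GPU shares system LPDDR5, so /proc/meminfo reflects GPU pressure too.
import time

def available_gb() -> float:
    """Return approximate available memory in GB from /proc/meminfo."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1e6  # value is reported in kB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

for _ in range(5):
    print(f"available: {available_gb():.1f} GB")
    time.sleep(2)
```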

Frequently Asked Questions

Is LLaVA 1.6 34B compatible with NVIDIA Jetson AGX Orin 32GB?
No. In FP16, LLaVA 1.6 34B needs roughly 68GB of memory, more than double the Orin's 32GB of shared LPDDR5; only an aggressively quantized (4-bit or lower) build has a chance of fitting.
What VRAM is needed for LLaVA 1.6 34B?
LLaVA 1.6 34B requires approximately 68GB of memory in FP16 precision (34 billion parameters at 2 bytes each, weights alone).
How fast will LLaVA 1.6 34B run on NVIDIA Jetson AGX Orin 32GB?
It will not run at all without aggressive quantization, because the FP16 model exceeds available memory. Even a 4-bit build is bound by the Orin's ~0.21 TB/s memory bandwidth, so expect roughly 10 tokens per second at best.