Can I run Qwen 2.5 72B on NVIDIA RTX 4090?

Fail/OOM: This GPU doesn't have enough VRAM

GPU VRAM: 24.0GB
Required: 144.0GB
Headroom: -120.0GB

VRAM Usage: 24.0GB of 24.0GB (100% used)

Technical Analysis

The Qwen 2.5 72B model has 72 billion parameters. In FP16 (half-precision floating point) each parameter occupies 2 bytes, so the weights alone require approximately 144GB of VRAM (72 billion x 2 bytes), before any KV cache or activation overhead. The NVIDIA RTX 4090, while a powerful consumer GPU, has only 24GB of VRAM, leaving a 120GB shortfall and making direct FP16 loading and inference on a single card impossible. Memory bandwidth, although high at roughly 1.01 TB/s on the RTX 4090, is a secondary concern when the model cannot fit into memory at all. Attempting to run it anyway will produce out-of-memory errors or, if weights are constantly swapped between system RAM and GPU VRAM, performance slow enough to make the setup effectively unusable.
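As a rough check on those figures, the arithmetic is simply parameter count times bytes per parameter. The sketch below reproduces the 144GB FP16 estimate and compares a few precisions; the 20% overhead factor for KV cache and activations is an assumption for illustration, not a value reported by this tool.

```python
# Back-of-envelope VRAM estimate: weights = parameters x bytes per parameter.
# The 20% overhead factor for KV cache / activations is an assumption,
# not a number reported by the tool above.

PARAMS = 72e9          # Qwen 2.5 72B
GPU_VRAM_GB = 24.0     # NVIDIA RTX 4090
OVERHEAD = 1.20        # assumed +20% for KV cache, activations, buffers

BYTES_PER_PARAM = {    # bytes per weight at each precision
    "FP16": 2.0,
    "INT8": 1.0,
    "4-bit": 0.5,
}

for precision, bytes_per_param in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * bytes_per_param / 1e9
    total_gb = weights_gb * OVERHEAD
    verdict = "fits" if total_gb <= GPU_VRAM_GB else "does not fit"
    print(f"{precision:>6}: ~{weights_gb:6.1f} GB weights, "
          f"~{total_gb:6.1f} GB with overhead -> {verdict} in {GPU_VRAM_GB} GB")

# FP16: 72e9 * 2 bytes = 144 GB of weights alone, far beyond 24 GB.
```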

Recommendation

Given the VRAM limitation, direct FP16 inference of Qwen 2.5 72B on a single RTX 4090 is not feasible. Aggressive quantization (4-bit, or even 2-bit) greatly reduces the memory footprint, but a 4-bit 72B model still occupies on the order of 40-50GB, so it does not fit entirely in 24GB either. The practical single-GPU path is a framework like llama.cpp, which keeps as many layers as possible in VRAM and offloads the rest to system RAM, at some cost in speed and accuracy. Alternatively, split the model across multiple GPUs with a distributed-inference setup, use a cloud inference service with enough memory, or switch to a smaller model (fine-tuned for the task if needed).
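To make the layer-offload idea concrete, here is a minimal sketch of the split arithmetic. The 45GB Q4_K_M file size, the 80-layer count for Qwen 2.5 72B, and the 4GB reservation for KV cache and buffers are assumptions for illustration, not values reported by this tool.

```python
# Rough layer-split estimate for llama.cpp-style partial GPU offload.
# All concrete numbers below are illustrative assumptions.

MODEL_SIZE_GB = 45.0      # assumed size of a Q4_K_M GGUF of a 72B model
N_LAYERS = 80             # assumed transformer block count for Qwen 2.5 72B
GPU_VRAM_GB = 24.0        # NVIDIA RTX 4090
KV_AND_BUFFERS_GB = 4.0   # assumed reservation for KV cache and scratch buffers

per_layer_gb = MODEL_SIZE_GB / N_LAYERS
vram_budget_gb = GPU_VRAM_GB - KV_AND_BUFFERS_GB
layers_on_gpu = int(vram_budget_gb // per_layer_gb)

print(f"~{per_layer_gb:.2f} GB per quantized layer")
print(f"~{layers_on_gpu} of {N_LAYERS} layers fit on the GPU "
      f"(the value to pass as n_gpu_layers / -ngl); the rest run from system RAM")
```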

Recommended Settings

Batch Size: 1
Context Length: Consider reducing context length to minimize VRAM usage
Inference Framework: llama.cpp
Quantization Suggested: 4-bit quantization (Q4_K_M or similar)
Other Settings:
- Use CPU offloading if VRAM is still insufficient, but expect significant performance degradation
- Experiment with different quantization methods to find the best balance between accuracy and VRAM usage
- Monitor VRAM usage closely to avoid out-of-memory errors
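If you take the llama.cpp route, the settings above map roughly onto the loader parameters shown below. This is a sketch using the llama-cpp-python binding; the GGUF file name is a placeholder, the n_gpu_layers value comes from the estimate in the Recommendation section, and defaults may differ between versions.

```python
# Sketch: loading a 4-bit GGUF of Qwen 2.5 72B with partial GPU offload
# via the llama-cpp-python binding. The file path and n_gpu_layers value
# are illustrative assumptions; tune them to your quant and VRAM headroom.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-72b-instruct-q4_k_m.gguf",  # placeholder file name
    n_gpu_layers=35,  # layers kept in VRAM; remaining layers run on the CPU
    n_ctx=4096,       # reduced context length to keep the KV cache small
    n_batch=128,      # prompt-eval batch; modest values limit scratch buffers
)

# Serve one request at a time (batch size 1), as recommended above.
out = llm("Summarize why a 72B model needs offloading on a 24GB GPU.",
          max_tokens=64)
print(out["choices"][0]["text"])
```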

Frequently Asked Questions

Is Qwen 2.5 72B (72.00B) compatible with the NVIDIA RTX 4090?
No, Qwen 2.5 72B requires 144GB of VRAM in FP16, while the RTX 4090 only has 24GB. It is incompatible without quantization.
How much VRAM does Qwen 2.5 72B (72.00B) need?
Qwen 2.5 72B needs approximately 144GB of VRAM for FP16 inference. Quantization can significantly reduce this requirement.
How fast will Qwen 2.5 72B (72.00B) run on an NVIDIA RTX 4090?
Without quantization it will not run at all; the FP16 weights exceed the 4090's VRAM several times over. Even with aggressive quantization (e.g., 4-bit), the model does not fit entirely in 24GB, so part of it must be offloaded to system RAM, and tokens/sec will be far lower than on GPUs with enough VRAM to hold the whole model. Expect a substantial performance decrease, typically down to a few tokens per second.
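To put a rough number on that slowdown: decoding is memory-bandwidth-bound, so per-token time is approximately the bytes streamed from VRAM plus the bytes streamed from system RAM, each divided by its bandwidth. The sketch below is a coarse upper-bound estimate, not a benchmark; the 45GB quantized size, the 20GB assumed resident on the GPU, and the 80 GB/s system-RAM bandwidth are all assumptions.

```python
# Coarse upper bound on decode speed for a partially offloaded model.
# Each generated token reads (roughly) every weight once, so:
#   per_token_time ~ gpu_resident_bytes/vram_bw + cpu_resident_bytes/ram_bw
# All figures below are illustrative assumptions, not measurements.

MODEL_SIZE_GB = 45.0   # assumed Q4_K_M size of a 72B model
GPU_SHARE_GB = 20.0    # assumed portion of weights resident in VRAM
CPU_SHARE_GB = MODEL_SIZE_GB - GPU_SHARE_GB

VRAM_BW_GBPS = 1008.0  # RTX 4090 memory bandwidth (~1.01 TB/s)
RAM_BW_GBPS = 80.0     # assumed dual-channel DDR5 system-RAM bandwidth

per_token_s = GPU_SHARE_GB / VRAM_BW_GBPS + CPU_SHARE_GB / RAM_BW_GBPS
print(f"~{1 / per_token_s:.1f} tokens/sec upper bound with offloading, "
      f"vs. ~{VRAM_BW_GBPS / MODEL_SIZE_GB:.0f} tokens/sec if it all fit in VRAM")
```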