Can I run Phi-3 Medium 14B (q3_k_m) on NVIDIA RTX 4090?

Perfect
Yes, you can run this model!
GPU VRAM: 24.0GB
Required: 5.6GB
Headroom: +18.4GB

VRAM Usage

23% used (5.6GB of 24.0GB)

Performance Estimate

Tokens/sec: ~60.0
Batch size: 6
Context: 128K (128,000 tokens)

Technical Analysis

The NVIDIA RTX 4090, with 24GB of GDDR6X VRAM and roughly 1.01 TB/s of memory bandwidth, is well suited to running Phi-3 Medium 14B once the model is quantized. At full FP16 precision the 14B weights would need about 28GB of VRAM, exceeding the 4090's capacity, but q3_k_m quantization reduces the weight footprint to approximately 5.6GB, leaving about 18.4GB of headroom. Note that this figure covers the weights only; the KV cache grows with context length and batch size and draws from that headroom. The RTX 4090's 16384 CUDA cores and 512 Tensor Cores further accelerate the computations required for inference, resulting in high throughput.
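As a quick sanity check on these numbers, the weight footprint can be approximated as parameters × bits per weight ÷ 8. The sketch below assumes an effective ~3.2 bits/weight for q3_k_m purely so the output matches the 5.6GB figure above; real GGUF files mix quantization types per tensor, so actual sizes vary, and the KV cache is not included.

```python
# Back-of-the-envelope weight footprint: parameters * bits-per-weight / 8 bytes.
# The 3.2 bits/weight for q3_k_m is an assumed effective value chosen to match
# this page's 5.6GB figure; real GGUF sizes vary, and the KV cache is excluded.

def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(f"FP16:   {weight_vram_gb(14e9, 16.0):.1f} GB")  # ~28.0 GB, over the 4090's 24 GB
print(f"q3_k_m: {weight_vram_gb(14e9, 3.2):.1f} GB")   # ~5.6 GB with the assumed bits/weight
```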

Recommendation

For optimal performance, leverage the RTX 4090's Tensor Cores by using an inference framework that supports them, such as `llama.cpp` built with CUDA, TensorRT, or `vLLM`. Experiment with different batch sizes to maximize throughput without exhausting VRAM or hurting latency. Because q3_k_m fits with plenty of room to spare, you can also move to a higher-precision quantization (for example q4_k_m or q5_k_m) for better output quality; conversely, a more aggressive (lower-bit) quantization frees additional headroom for larger batch sizes or longer context lengths. Always monitor GPU utilization and temperature to ensure stable operation, especially during extended inference sessions.
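A minimal sketch of loading the quantized model with the `llama-cpp-python` bindings on a CUDA build, assuming a locally downloaded q3_k_m GGUF (the filename below is a placeholder). Note that `n_batch` here is llama.cpp's prompt-processing batch, which is not the same as the concurrent-request batch size of 6 quoted above.

```python
from llama_cpp import Llama

# Placeholder filename; point this at your downloaded q3_k_m GGUF.
llm = Llama(
    model_path="Phi-3-medium-128k-instruct-Q3_K_M.gguf",
    n_gpu_layers=-1,  # offload every layer to the RTX 4090
    n_ctx=8192,       # raise toward 128K only as needed; KV cache VRAM grows with context
    n_batch=512,      # prompt-processing batch, tuned separately from request batching
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```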

Recommended Settings

Batch size: 6 (experiment to optimize)
Context length: 128,000 tokens
Inference framework: llama.cpp (with CUDA), vLLM, or TensorRT
Quantization suggested: q3_k_m (or a higher-precision quant, depending on performance and quality needs)
Other settings:
- Enable CUDA or TensorRT acceleration
- Monitor GPU utilization and temperature (see the monitoring sketch after this list)
- Optimize batch size for throughput and latency
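For the monitoring suggestion above, here is a minimal sketch using the NVML bindings (`pynvml`, installed as `nvidia-ml-py`) to sample utilization, VRAM, and temperature while inference runs; it assumes the RTX 4090 is GPU index 0.

```python
import time
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the RTX 4090 is GPU index 0

try:
    for _ in range(10):  # sample once per second while inference is running
        util = pynvml.nvmlDeviceGetUtilizationRates(gpu)
        mem = pynvml.nvmlDeviceGetMemoryInfo(gpu)
        temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)
        print(f"util {util.gpu}% | VRAM {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB | {temp} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```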

Frequently Asked Questions

Is Phi-3 Medium 14B (14.00B) compatible with NVIDIA RTX 4090?
Yes, Phi-3 Medium 14B is fully compatible with the NVIDIA RTX 4090, especially when using quantization techniques like q3_k_m.
What VRAM is needed for Phi-3 Medium 14B (14.00B)?
With q3_k_m quantization, Phi-3 Medium 14B requires approximately 5.6GB of VRAM.
How fast will Phi-3 Medium 14B (14.00B) run on NVIDIA RTX 4090?
You can expect approximately 60 tokens per second with the RTX 4090, though this can vary based on the specific inference framework and settings used.
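To check the estimate on your own hardware, a rough measurement can be taken by timing a single completion with `llama-cpp-python`, as in this sketch (again with a placeholder model path); reported speed will vary with prompt length, sampling settings, and build flags.

```python
import time
from llama_cpp import Llama

# Placeholder path; any q3_k_m GGUF of the model will do for a rough check.
llm = Llama(model_path="Phi-3-medium-128k-instruct-Q3_K_M.gguf",
            n_gpu_layers=-1, n_ctx=4096, verbose=False)

start = time.perf_counter()
out = llm("Summarize the benefits of model quantization.", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```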