Can I run Phi-3 Medium 14B (q3_k_m) on NVIDIA H100 PCIe?

Perfect
Yes, you can run this model!
GPU VRAM: 80.0 GB
Required: 5.6 GB
Headroom: +74.4 GB

VRAM Usage: 7% used (5.6 GB of 80.0 GB)

Performance Estimate

Tokens/sec: ~78.0
Batch size: 26
Context length: 128K tokens

Technical Analysis

The NVIDIA H100 PCIe, with its substantial 80GB of HBM2e VRAM and 2.0 TB/s memory bandwidth, is exceptionally well-suited for running the Phi-3 Medium 14B model. The model, when quantized to q3_k_m, requires only 5.6GB of VRAM, leaving a significant 74.4GB of headroom. This ample VRAM allows for large batch sizes and extended context lengths without encountering memory limitations. The H100's 14592 CUDA cores and 456 Tensor Cores provide substantial computational power for both inference and potential fine-tuning tasks.
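
As a rough sanity check on these figures, the arithmetic below reproduces the 5.6GB estimate under the assumption that q3_k_m averages about 3.2 bits per weight; real k-quant mixes vary by tensor, and runtime overhead (KV cache, activations, CUDA context) comes on top, so treat this as a back-of-the-envelope sketch rather than an exact accounting.

# Back-of-the-envelope VRAM estimate (illustrative only).
# Assumption: q3_k_m averages ~3.2 bits per weight; actual k-quant
# mixes vary by tensor, and KV cache/activations are not included.
PARAMS = 14e9            # Phi-3 Medium parameter count
BITS_PER_WEIGHT = 3.2    # assumed average for q3_k_m
GPU_VRAM_GB = 80.0       # NVIDIA H100 PCIe

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9   # bits -> bytes -> GB
headroom_gb = GPU_VRAM_GB - weights_gb

print(f"quantized weights: {weights_gb:.1f} GB")   # ~5.6 GB
print(f"headroom:          {headroom_gb:.1f} GB")  # ~74.4 GB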

Given the memory bandwidth and compute capabilities of the H100, the Phi-3 Medium 14B model can leverage the hardware effectively. The estimated tokens/second generation rate of 78 is a strong indicator of responsive performance. The large VRAM headroom means that users can experiment with larger batch sizes (estimated at 26) to maximize throughput, especially when serving multiple concurrent requests. The Hopper architecture's optimizations for transformer models further enhance the efficiency of the inference process.
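
To put the headroom in perspective, the sketch below sizes the KV cache per sequence. The layer and head counts are taken from the published Phi-3-medium configuration (40 layers, 10 KV heads via grouped-query attention, head dimension 128) and an FP16 cache is assumed; both are assumptions on my part, so the numbers are approximate.

# Approximate KV-cache footprint per sequence (illustrative).
# Assumed values from the public Phi-3-medium config: 40 layers,
# 10 KV heads (GQA), head_dim 128; FP16 cache entries (2 bytes each).
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 40, 10, 128, 2
CONTEXT_TOKENS = 128_000
HEADROOM_GB = 74.4

kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # K and V
kv_gb_full_context = kv_bytes_per_token * CONTEXT_TOKENS / 1e9

print(f"KV cache per token:      {kv_bytes_per_token / 1e6:.2f} MB")  # ~0.20 MB
print(f"KV cache at 128K tokens: {kv_gb_full_context:.1f} GB")        # ~26 GB
print(f"full-context sequences in headroom: "
      f"{int(HEADROOM_GB // kv_gb_full_context)}")                    # ~2

In other words, only a couple of sequences fit at the full 128K window, but because frameworks with PagedAttention allocate cache blocks on demand, a batch of 26 is realistic as long as most requests use a fraction of the maximum context.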

Recommendation

For optimal performance, use an inference framework such as vLLM or NVIDIA's TensorRT-LLM; both are designed to maximize throughput and minimize latency on NVIDIA GPUs. Experiment with different batch sizes to find the sweet spot between latency and throughput for your specific use case, and monitor GPU utilization to confirm the H100 is kept busy; if it is not, increase the batch size or the number of concurrent requests.
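
As a starting point, here is a minimal vLLM sketch. The model name microsoft/Phi-3-medium-128k-instruct and the parameter values are illustrative choices, not settings taken from the analysis above; note also that q3_k_m is a GGUF quantization usually served with llama.cpp, whereas vLLM would typically load the FP16 checkpoint, which at roughly 28GB still fits comfortably in 80GB.

# Minimal offline-batching sketch with vLLM (illustrative settings).
# Assumes the FP16 Hugging Face checkpoint rather than the GGUF quant.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-medium-128k-instruct",  # illustrative model id
    max_model_len=32_768,          # cap the context to bound KV-cache memory
    gpu_memory_utilization=0.90,   # leave a safety margin on the 80 GB card
)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Explain paged attention in two sentences."] * 26  # batch of 26

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)

Raising gpu_memory_utilization or max_model_len trades headroom for longer prompts and larger batches; profiling a few combinations against your latency target is usually the quickest way to find the sweet spot.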

While q3_k_m provides excellent memory savings, consider experimenting with a higher-precision quantization (e.g., q4_k_m) or even FP16 if memory allows, to potentially improve the model's accuracy, especially for tasks requiring high precision. However, be mindful of the increased VRAM usage and adjust batch sizes accordingly.
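
If you do serve the q3_k_m GGUF file directly, a llama-cpp-python sketch along these lines would apply; the file path is a placeholder and the context size is an arbitrary example, so adjust both to your setup.

# Minimal sketch for running the GGUF q3_k_m file with llama-cpp-python.
# The model path is a placeholder; n_gpu_layers=-1 offloads all layers
# to the GPU, and n_ctx should reflect the context you actually need.
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-medium-128k-instruct-Q3_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer to the H100
    n_ctx=32_768,      # context window to allocate
)

out = llm("Summarize the Hopper architecture in one paragraph.", max_tokens=200)
print(out["choices"][0]["text"])

Swapping in a q4_k_m file is a one-line change to model_path, which makes it easy to compare the accuracy difference described above.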

Recommended Settings

Batch size: 26 (adjust based on latency requirements)
Context length: 128,000 tokens
Other settings: enable CUDA graph capture; use PagedAttention; experiment with scheduling strategies such as continuous batching
Inference framework: vLLM
Suggested quantization: q3_k_m (consider q4_k_m for higher accuracy if VRAM allows)

Frequently Asked Questions

Is Phi-3 Medium 14B compatible with NVIDIA H100 PCIe?
Yes, Phi-3 Medium 14B is perfectly compatible with the NVIDIA H100 PCIe.
What VRAM is needed for Phi-3 Medium 14B?
With q3_k_m quantization, Phi-3 Medium 14B requires approximately 5.6GB of VRAM.
How fast will Phi-3 Medium 14B run on NVIDIA H100 PCIe?
You can expect approximately 78 tokens per second with the specified configuration.