Can I run Phi-3 Mini 3.8B (q3_k_m) on NVIDIA H100 SXM?

Perfect
Yes, you can run this model!
GPU VRAM: 80.0GB
Required: 1.5GB
Headroom: +78.5GB

VRAM Usage

1.5GB of 80.0GB used (~2%)

Performance Estimate

Tokens/sec: ~135.0
Batch size: 32
Context: 128K tokens

Technical Analysis

The NVIDIA H100 SXM, with its substantial 80GB of HBM3 memory and a memory bandwidth of 3.35 TB/s, is exceptionally well-suited for running the Phi-3 Mini 3.8B model. The q3_k_m quantization further reduces the model's VRAM footprint to a mere 1.5GB, leaving a significant 78.5GB of VRAM headroom. This ample VRAM allows for large batch sizes and extended context lengths without encountering memory constraints. The H100's 16896 CUDA cores and 528 Tensor Cores ensure rapid computation, critical for achieving high inference speeds.
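If you want to sanity-check the quoted figures, a rough back-of-envelope estimate is sketched below. The effective bits-per-weight value is an assumption chosen to illustrate the 1.5GB number (real GGUF q3_k_m files also carry metadata overhead), and the KV cache, which grows with context length and batch size, is deliberately not counted here.

```python
# Back-of-envelope VRAM check for Phi-3 Mini 3.8B at q3_k_m on an 80GB H100.
# BITS_PER_WEIGHT is an assumed effective rate, not a published spec; the
# KV cache (which scales with context length and batch size) is excluded.

PARAMS = 3.8e9            # parameter count
BITS_PER_WEIGHT = 3.2     # assumed effective rate for q3_k_m
GPU_VRAM_GB = 80.0        # H100 SXM HBM3 capacity

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
headroom_gb = GPU_VRAM_GB - weights_gb

print(f"Weights:  ~{weights_gb:.1f} GB")    # ~1.5 GB
print(f"Headroom: ~{headroom_gb:.1f} GB")   # ~78.5 GB
```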

Given the H100's architecture and specifications, the Phi-3 Mini 3.8B model can leverage its Tensor Cores for accelerated matrix multiplications, a core component of transformer-based language models. The high memory bandwidth minimizes data transfer bottlenecks, ensuring that the GPU cores are continuously fed with the necessary data. This results in optimal utilization of the H100's computational resources and translates to high throughput in terms of tokens generated per second. The estimated 135 tokens/sec is a testament to this efficient utilization.

Recommendation

For optimal performance, use an inference framework optimized for NVIDIA GPUs, such as `vLLM` or `text-generation-inference` (NVIDIA's `TensorRT-LLM` is the dedicated option if you want the TensorRT stack). Experiment with different batch sizes to find the sweet spot between latency and throughput; a batch size of 32 is a good starting point. Given the large VRAM headroom, consider increasing the context length toward the model's 128K window. Monitor GPU utilization to confirm the H100 is actually being saturated and adjust parameters accordingly; a minimal serving sketch follows.
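The sketch below is a minimal starting point using vLLM's offline API; the model path and prompt set are hypothetical, and it assumes your vLLM build can load a q3_k_m GGUF file (GGUF support in vLLM is comparatively recent; if it cannot, a llama.cpp-based server is the usual runner for this quantization). `max_model_len`, `max_num_seqs`, and `gpu_memory_utilization` are the knobs to sweep when trading latency against throughput.

```python
# Minimal vLLM sketch for Phi-3 Mini on an H100; paths and prompts are placeholders.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/phi-3-mini-q3_k_m.gguf",        # hypothetical local GGUF path
    tokenizer="microsoft/Phi-3-mini-128k-instruct",
    max_model_len=128_000,        # long context is affordable given the headroom
    max_num_seqs=32,              # matches the suggested batch size
    gpu_memory_utilization=0.90,  # leave a safety margin on the 80GB card
)

prompts = ["Explain quantization in one paragraph."] * 32
params = SamplingParams(temperature=0.7, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"~{generated / elapsed:.1f} tokens/sec across the batch")
```

Measuring tokens/sec around a real batch, as above, is the simplest way to check whether your configuration lands near the estimated throughput.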

If you encounter performance bottlenecks, profile your application to identify the source. Techniques such as speculative decoding or KV-cache quantization can improve inference speed further. Given the H100 SXM's 700W TDP, make sure your system has adequate cooling and power delivery to sustain stable clocks; the short monitoring sketch below is one way to keep an eye on utilization and power draw while you tune.
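This polling loop uses the NVML bindings from the `nvidia-ml-py` package (import name `pynvml`) and assumes the H100 is visible as device index 0; run it in a second terminal while your benchmark is executing.

```python
# Poll GPU utilization, VRAM, and power draw once per second for ~30 seconds.
# Requires nvidia-ml-py; assumes the H100 is device index 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    for _ in range(30):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # mW -> W
        print(f"GPU {util.gpu:3d}%  "
              f"VRAM {mem.used / 2**30:5.1f}/{mem.total / 2**30:.0f} GiB  "
              f"{power_w:5.0f} W")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```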

Recommended Settings

Batch size: 32
Context length: 128,000 tokens
Other settings: enable TensorRT, use CUDA graphs, experiment with speculative decoding
Inference framework: vLLM
Suggested quantization: q3_k_m

Frequently Asked Questions

Is Phi-3 Mini 3.8B (3.80B) compatible with NVIDIA H100 SXM?
Yes, Phi-3 Mini 3.8B is perfectly compatible with the NVIDIA H100 SXM, offering excellent performance due to the H100's ample VRAM and processing power.
What VRAM is needed for Phi-3 Mini 3.8B (3.80B)?
With q3_k_m quantization, Phi-3 Mini 3.8B requires approximately 1.5GB of VRAM.
How fast will Phi-3 Mini 3.8B (3.80B) run on NVIDIA H100 SXM?
You can expect an estimated inference speed of around 135 tokens per second on the NVIDIA H100 SXM.