Can I run Phi-3 Mini 3.8B (q3_k_m) on NVIDIA H100 PCIe?

Verdict: Perfect. Yes, you can run this model!
GPU VRAM: 80.0GB
Required: 1.5GB
Headroom: +78.5GB

VRAM Usage: 1.5GB of 80.0GB (~2% used)

Performance Estimate

Tokens/sec: ~117.0
Batch size: 32
Context: 128K tokens

Technical Analysis

The NVIDIA H100 PCIe, with 80GB of HBM2e VRAM and 2.0 TB/s of memory bandwidth, is exceptionally well suited to running Phi-3 Mini 3.8B. Quantized to q3_k_m, the model needs roughly 1.5GB of VRAM for its weights (the KV cache grows on top of that with context length and batch size), leaving about 78.5GB of headroom for large batch sizes, multiple concurrent model instances, or other GPU-intensive tasks. The H100's Hopper architecture, with 14,592 CUDA cores and 456 Tensor Cores, provides ample computational power for accelerating inference.
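As a rough sanity check on that figure, the weight footprint follows directly from the parameter count and the effective bits per weight of the quantization. The sketch below assumes an effective ~3.2 bits per weight for q3_k_m (an assumption, since k-quant block layouts vary) and ignores KV cache, activations, and runtime overhead:

```python
def estimate_weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only VRAM estimate in GB; excludes KV cache, activations, overhead."""
    return params_billions * bits_per_weight / 8  # 1e9 params cancels 1e9 bytes/GB

# Assumed effective ~3.2 bits/weight for q3_k_m:
print(f"{estimate_weight_vram_gb(3.8, 3.2):.2f} GB")  # -> 1.52 GB, i.e. the ~1.5GB above
```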

Given these resources, the primary bottleneck is unlikely to be memory capacity or compute capability. The achieved tokens/second will depend mostly on the efficiency of the chosen inference framework and the optimization strategies employed: single-stream decoding is memory-bandwidth-bound, and the H100's 2.0 TB/s keeps weight streaming fast, so the estimated 117 tokens/sec is a conservative starting point that careful configuration can improve significantly.
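Because each generated token streams the full weight set from VRAM, a crude upper bound on per-stream decode speed is bandwidth divided by bytes read per token. The sketch below is that roofline under the assumption that weight reads dominate traffic; KV-cache reads and kernel overheads push real numbers well below it:

```python
# Crude decode roofline: tokens/s <= memory bandwidth / bytes read per token.
BANDWIDTH_GB_S = 2000.0  # H100 PCIe, ~2.0 TB/s
WEIGHTS_GB = 1.5         # q3_k_m weight footprint from above

ceiling = BANDWIDTH_GB_S / WEIGHTS_GB
print(f"decode ceiling: ~{ceiling:.0f} tok/s per stream")  # ~1333 tok/s
# The ~117 tok/s estimate sits far below this bound, which is why framework
# efficiency and batching, not the hardware, dominate achieved throughput here.
```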

Recommendation

To maximize performance, use inference frameworks optimized for NVIDIA GPUs, such as vLLM or NVIDIA's TensorRT-LLM. Experiment with batch sizes to balance throughput against latency: 32 is a good starting point, but larger batches can raise aggregate tokens/second when per-request latency is not critical. Techniques such as speculative decoding can boost inference speed further. Profile the workload to identify bottlenecks and optimize accordingly.
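As one concrete starting point, the sketch below loads the model with vLLM and times a simple batch-size sweep. The Hugging Face model ID microsoft/Phi-3-mini-128k-instruct is an assumption, and the sketch loads unquantized weights, since vLLM's GGUF/q3_k_m support varies by version; treat it as a template, not a benchmark:

```python
import time
from vllm import LLM, SamplingParams

# Assumed model ID; substitute a local GGUF path if your vLLM build supports it.
llm = LLM(model="microsoft/Phi-3-mini-128k-instruct", max_model_len=8192)
params = SamplingParams(max_tokens=128, temperature=0.0)

for batch in (1, 8, 32, 64):
    prompts = ["Summarize the Hopper GPU architecture."] * batch
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch:3d}  {tokens / elapsed:8.1f} tok/s aggregate")
```

Aggregate throughput should climb with batch size until compute or scheduling saturates; the largest batch whose per-request latency remains acceptable is your operating point.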

Recommended Settings

Batch size: 32 (experiment with larger sizes)
Context length: 128K tokens
Other settings: enable CUDA graph capture, use Tensor Cores for FP16/BF16 acceleration, experiment with speculative decoding
Inference framework: vLLM or TensorRT-LLM
Quantization: q3_k_m (or higher precision if needed)
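Translated into vLLM engine arguments, the settings above would look roughly like the following (a hedged sketch, not an official configuration: enforce_eager=False keeps vLLM's default CUDA graph capture, and bfloat16 engages the Tensor Cores; speculative-decoding flags differ across vLLM releases, so they are left as a comment):

```python
from vllm import LLM

# Rough translation of the recommended settings into vLLM engine arguments.
llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",  # assumed model ID
    max_model_len=131072,   # 128K context window
    dtype="bfloat16",       # BF16 on Tensor Cores
    enforce_eager=False,    # leave CUDA graph capture enabled (vLLM's default)
    # Speculative decoding options vary by vLLM version; see your release docs.
)
```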

Frequently Asked Questions

Is Phi-3 Mini 3.8B (3.80B) compatible with NVIDIA H100 PCIe?
Yes, it is perfectly compatible. The H100 has more than enough resources to run the model efficiently.
What VRAM is needed for Phi-3 Mini 3.8B (3.80B)?
With q3_k_m quantization, Phi-3 Mini 3.8B requires approximately 1.5GB of VRAM.
How fast will Phi-3 Mini 3.8B (3.80B) run on NVIDIA H100 PCIe?
You can expect around 117 tokens/sec initially, but this can be significantly improved with optimization techniques.