Can I run Phi-3 Mini 3.8B (Q4_K_M (GGUF 4-bit)) on NVIDIA H100 SXM?

Perfect
Yes, you can run this model!
GPU VRAM: 80.0GB
Required: 1.9GB
Headroom: +78.1GB

VRAM Usage

~2% used (1.9GB of 80.0GB)

Performance Estimate

Tokens/sec: ~135
Batch size: 32
Context: 128K tokens

Technical Analysis

The NVIDIA H100 SXM, with 80GB of HBM3 VRAM and 3.35 TB/s of memory bandwidth, is more than sufficient for the Phi-3 Mini 3.8B model. In its Q4_K_M (4-bit) quantized form, the model requires only about 1.9GB of VRAM, leaving roughly 78.1GB of headroom, so memory capacity will not be a bottleneck. The H100's 16,896 CUDA cores and 528 Tensor Cores accelerate the model's computations for high throughput and low latency, and the high memory bandwidth matters most during token generation, which is dominated by streaming the model weights from HBM to the compute units on every decoding step.
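The 1.9GB figure above follows directly from the parameter count. A minimal sketch of the arithmetic, treating Q4_K_M as a flat 4 bits per weight as the analysis does (real Q4_K_M files are slightly larger, closer to 4.5–5 effective bits per weight):

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

weights_gb = estimate_vram_gb(3.8, 4.0)  # 3.8B params at 4 bits/weight
headroom_gb = 80.0 - weights_gb          # against the H100's 80GB
print(f"weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

This counts weights only; runtime allocations (KV cache, activations, CUDA context) come out of the remaining headroom.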

Recommendation

Given the substantial VRAM headroom, experiment with larger batch sizes to further increase throughput. Start with the suggested batch size of 32 and gradually increase it until you observe diminishing returns or encounter memory limitations. Consider using techniques like speculative decoding to boost token generation speed. Explore different inference frameworks such as `vLLM` or `text-generation-inference` to leverage optimized kernels and scheduling algorithms for the H100 architecture. Monitor GPU utilization to ensure you're maximizing the H100's capabilities.
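When experimenting with larger batch sizes, the KV cache is usually what fills the headroom first. A rough budget check, assuming Phi-3 Mini's commonly cited architecture (32 layers, hidden size 3072) and an FP16 cache — these numbers are assumptions, not taken from the analysis above:

```python
def kv_cache_gb(batch: int, ctx_tokens: int,
                layers: int = 32, hidden: int = 3072,
                bytes_per_elem: int = 2) -> float:
    """Worst-case K and V tensors: 2 * layers * hidden * bytes per token."""
    per_token = 2 * layers * hidden * bytes_per_elem
    return batch * ctx_tokens * per_token / 1e9

headroom = 78.1  # GB free after loading the Q4_K_M weights
for batch in (1, 8, 32):
    need = kv_cache_gb(batch, 4096)
    fits = "fits" if need < headroom else "too big"
    print(f"batch={batch:2d} @ 4k ctx: {need:5.1f} GB ({fits})")
```

Note that a single sequence at the full 128K context already needs ~50GB of FP16 cache under these assumptions, so large batches are realistic only at shorter contexts. With PagedAttention (as vLLM uses), cache is allocated on demand, so actual usage tracks real sequence lengths rather than this worst case.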

Recommended Settings

Batch size: 32 (experiment with higher values)
Context length: 128000
Other settings: enable CUDA graph capture, use PagedAttention, experiment with speculative decoding
Inference framework: vLLM or text-generation-inference
Quantization: Q4_K_M (or higher precision if desired)

Frequently Asked Questions

Is Phi-3 Mini 3.8B (3.80B) compatible with NVIDIA H100 SXM?
Yes, Phi-3 Mini 3.8B is perfectly compatible with the NVIDIA H100 SXM.
What VRAM is needed for Phi-3 Mini 3.8B (3.80B)?
When quantized to Q4_K_M (4-bit), Phi-3 Mini 3.8B requires approximately 1.9GB of VRAM.
How fast will Phi-3 Mini 3.8B (3.80B) run on NVIDIA H100 SXM?
You can expect an estimated token generation speed of around 135 tokens per second, potentially higher with optimizations.
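The ~135 tokens/sec estimate is conservative relative to the hardware ceiling. For single-stream decoding, every generated token must stream all the weight bytes through memory, so a roofline-style upper bound is simply bandwidth divided by model size, using the 3.35 TB/s and 1.9GB figures from the analysis above (this ceiling ignores KV-cache reads, dequantization, and kernel overheads, which is why measured speeds land well below it):

```python
def decode_ceiling_tps(bandwidth_tb_s: float, model_gb: float) -> float:
    """Memory-bandwidth ceiling on single-stream decode tokens/sec."""
    return bandwidth_tb_s * 1e12 / (model_gb * 1e9)

ceiling = decode_ceiling_tps(3.35, 1.9)  # ~1760 tokens/sec theoretical
print(f"theoretical ceiling: ~{ceiling:.0f} tokens/sec")
```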