Can I run Phi-3 Small 7B (INT8, 8-bit integer) on an NVIDIA H100 SXM?

Perfect
Yes, you can run this model!
GPU VRAM: 80.0 GB
Required: 7.0 GB
Headroom: +73.0 GB

VRAM Usage

7.0 GB of 80.0 GB used (~9%)

Performance Estimate

Tokens/sec: ~135
Batch size: 32
Context: 128K (128,000 tokens)

Technical Analysis

The NVIDIA H100 SXM, with 80GB of HBM3 VRAM and 3.35 TB/s of memory bandwidth, is exceptionally well-suited to running Phi-3 Small 7B. Quantized to INT8, the model's weights occupy only about 7GB of VRAM, leaving roughly 73GB of headroom on the H100. That headroom comfortably absorbs the KV cache and activations needed for large batch sizes and extended context lengths. The H100's Hopper architecture, with 16,896 CUDA cores and 528 fourth-generation Tensor Cores, is optimized for both training and inference, providing ample compute for demanding language models.
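
As a back-of-the-envelope check of these numbers: at INT8, weights cost roughly one byte per parameter. The short Python sketch below (our own illustration, not part of the tool) reproduces the 7GB and 73GB figures:

    # Rough VRAM estimate: weights only. KV cache and activations add
    # to this, which is what the headroom absorbs.
    params_billions = 7.0   # Phi-3 Small parameter count
    bytes_per_param = 1     # INT8 stores one byte per weight

    weight_gb = params_billions * bytes_per_param   # ~7.0 GB
    headroom_gb = 80.0 - weight_gb                  # ~73.0 GB on the H100 SXM

    print(f"weights: {weight_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")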

Recommendation

Given the abundant VRAM and high memory bandwidth, prioritize maximizing batch size to improve throughput, and experiment to find the best latency/throughput trade-off for your application; a sketch of such a sweep follows below. INT8 quantization is already efficient, but FP16 or BF16 is worth exploring if higher precision is required: at 16-bit precision the 7B weights occupy about 14GB, which still fits comfortably. Finally, use inference libraries that exploit the H100's Tensor Cores through mixed-precision computation to further accelerate performance.
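
A minimal batch-size sweep using vLLM's offline Python API. This is a sketch, not a definitive benchmark: it assumes vLLM is installed, the Hugging Face checkpoint microsoft/Phi-3-small-128k-instruct is reachable, and the single repeated prompt and 128-token completions stand in for your real workload:

    import time
    from vllm import LLM, SamplingParams

    # Phi-3 Small ships custom modeling code, hence trust_remote_code.
    llm = LLM(model="microsoft/Phi-3-small-128k-instruct", trust_remote_code=True)
    params = SamplingParams(temperature=0.0, max_tokens=128)

    for batch in (8, 16, 32, 64):
        prompts = ["Summarize the Hopper GPU architecture."] * batch
        start = time.perf_counter()
        outputs = llm.generate(prompts, params)
        elapsed = time.perf_counter() - start
        generated = sum(len(o.outputs[0].token_ids) for o in outputs)
        print(f"batch={batch}: {generated / elapsed:.1f} tokens/sec")

Throughput typically climbs with batch size until compute or KV-cache capacity saturates; the knee of that curve is your operating point.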

Recommended Settings

Batch size: 32 to start; experiment up to 64 or higher
Context length: 128,000 tokens (maximum supported)
Inference framework: vLLM or text-generation-inference
Quantization: INT8 (default)
Other settings:
- Enable TensorRT for further optimization
- Use CUDA graphs to reduce CPU overhead
- Profile performance to identify bottlenecks and fine-tune settings
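
One way to wire these settings into vLLM's Python API. Treat this as a sketch: parameter names reflect recent vLLM releases, and the INT8 loading path varies by vLLM version and checkpoint format, so verify both against your installation:

    from vllm import LLM

    llm = LLM(
        model="microsoft/Phi-3-small-128k-instruct",
        trust_remote_code=True,
        max_model_len=128000,         # full 128K context window
        gpu_memory_utilization=0.90,  # leave some HBM3 slack for spikes
        enforce_eager=False,          # False keeps CUDA graphs enabled
        # For INT8, load a pre-quantized checkpoint or pass the
        # quantization option your vLLM version supports.
    )

From there, raise the number of concurrent requests until throughput plateaus, per the recommendation above.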

Frequently Asked Questions

Is Phi-3 Small 7B (7.00B) compatible with NVIDIA H100 SXM?
Yes, Phi-3 Small 7B is perfectly compatible with the NVIDIA H100 SXM. The H100 offers significantly more VRAM and computational power than required to run this model.
What VRAM is needed for Phi-3 Small 7B (7.00B)?
With INT8 quantization, Phi-3 Small 7B requires approximately 7GB of VRAM.
How fast will Phi-3 Small 7B (7.00B) run on NVIDIA H100 SXM?
Expect approximately 135 tokens per second with optimal settings. Actual performance may vary depending on batch size, context length, and the specific inference framework used.