The NVIDIA H100 SXM, with its substantial 80GB of HBM3 VRAM and 3.35 TB/s memory bandwidth, is exceptionally well-suited for running the Phi-3 Small 7B model. Quantized to INT8, Phi-3 Small 7B needs roughly 7GB of VRAM for its weights alone. That leaves roughly 73GB of headroom on the H100 for the KV cache, activations, and framework overhead, so large batch sizes and extended context lengths fit comfortably without hitting memory limits. The H100's Hopper architecture, with 16896 CUDA cores and 528 Tensor Cores, is optimized for both training and inference workloads, providing significant computational power for demanding language models.
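To make that headroom concrete, here is a minimal back-of-the-envelope sketch of the memory budget. The layer count, hidden size, and FP16 KV-cache assumption are illustrative values for a 7B-class decoder, not published Phi-3 Small specifications, so treat the output as an order-of-magnitude estimate rather than a measured figure.

```python
# Rough VRAM budget for a 7B model at INT8 on an H100 SXM 80GB.
# All figures are back-of-the-envelope estimates, not measured values.

PARAMS_B = 7.0            # model parameters, in billions
BYTES_PER_PARAM = 1       # INT8 quantization -> 1 byte per weight
HBM_GB = 80.0             # H100 SXM HBM3 capacity

weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~7 GB of weights
headroom_gb = HBM_GB - weights_gb         # ~73 GB for KV cache, activations, runtime overhead

# Assumed per-token KV-cache cost for a 7B-class decoder:
# 32 layers, 4096 hidden size, FP16 cache (2 bytes per element), K and V per layer.
layers, hidden, kv_bytes = 32, 4096, 2
kv_per_token_mb = 2 * layers * hidden * kv_bytes / 1e6

print(f"Weights:          {weights_gb:.1f} GB")
print(f"Headroom:         {headroom_gb:.1f} GB")
print(f"KV cache / token: {kv_per_token_mb:.2f} MB")
print(f"Tokens that fit:  {headroom_gb * 1e3 / kv_per_token_mb:,.0f} (batch size x context length)")
```

Under these assumptions the KV cache costs about 0.5MB per token, so the spare 73GB can hold on the order of 140,000 tokens spread across batch size and context length, which is why aggressive batching is practical on this pairing.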
Given the abundant VRAM and high memory bandwidth, users should prioritize maximizing batch size to improve throughput. Experiment with different batch sizes to find the optimal balance between latency and throughput for your specific application. While INT8 quantization is already efficient, consider FP16 or BF16 if higher precision is required and the performance cost is acceptable. Pair the H100's Tensor Cores with an optimized inference engine (for example, vLLM or TensorRT-LLM) whose mixed-precision kernels can further accelerate inference.
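As one way to put these recommendations into practice, the sketch below serves the model with vLLM, which batches requests continuously and uses mixed-precision kernels on Hopper. The model id, memory fraction, batch limit, and context length shown are assumptions to adjust for your deployment, not prescribed settings.

```python
# Minimal sketch of serving Phi-3 Small with vLLM on an H100.
# The model id and tuning values below are assumptions; adjust for your setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-small-8k-instruct",  # assumed Hugging Face model id
    dtype="bfloat16",              # BF16 keeps Tensor Cores busy; swap in an INT8 build if preferred
    gpu_memory_utilization=0.90,   # let vLLM claim most of the 80 GB for weights + KV cache
    max_num_seqs=256,              # upper bound on concurrent sequences; tune against latency targets
    max_model_len=8192,            # context length to reserve KV-cache space for
    trust_remote_code=True,        # Phi-3 Small ships custom modeling code
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize the benefits of large batch sizes for LLM serving."], sampling)
print(outputs[0].outputs[0].text)
```

Raising `max_num_seqs` increases throughput at the cost of per-request latency; sweeping that value against your latency budget is the practical way to find the batch-size balance described above.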