The NVIDIA H100 PCIe, with 80GB of HBM2e VRAM and 2.0 TB/s of memory bandwidth, is more than sufficient for running the Phi-3 Mini 3.8B model. Quantized to q3_k_m, the model's weights occupy roughly 1.5GB of VRAM, leaving about 78.5GB of headroom (before KV cache and activation memory) for large batch sizes, long contexts, and concurrent execution of multiple model instances or other GPU workloads. The H100's Hopper architecture, with 14,592 CUDA cores and 456 fourth-generation Tensor Cores, provides ample compute for accelerating inference.
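As a quick sanity check on that headroom claim, here is a minimal back-of-the-envelope sketch. The 80GB and 1.5GB figures come from the text above; the per-instance KV-cache allowance is an illustrative assumption, not a measured value.

```python
# Rough VRAM budget for Phi-3 Mini (q3_k_m) on an H100 PCIe.
TOTAL_VRAM_GB = 80.0            # H100 PCIe capacity
MODEL_WEIGHTS_GB = 1.5          # q3_k_m weights (figure from the text)
KV_CACHE_PER_INSTANCE_GB = 2.0  # assumed allowance for context + batching overhead

headroom_gb = TOTAL_VRAM_GB - MODEL_WEIGHTS_GB
per_instance_gb = MODEL_WEIGHTS_GB + KV_CACHE_PER_INSTANCE_GB
max_instances = int(TOTAL_VRAM_GB // per_instance_gb)

print(f"Headroom after weights: {headroom_gb:.1f} GB")
print(f"Rough max concurrent instances: {max_instances}")
```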
Given these resources, the primary bottleneck is unlikely to be memory capacity or compute capability. Instead, the achieved tokens/second will largely depend on the efficiency of the chosen inference framework and the optimization strategies employed. The estimated 117 tokens/sec is a solid starting point, but it can be significantly improved through careful configuration. Because single-stream decoding is dominated by reading the model weights each step, the H100's high memory bandwidth sets a generous ceiling for a model this small; in practice, framework overhead and kernel efficiency will determine how much of that ceiling is reached.
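The following is a simplified, bandwidth-bound estimate of single-stream decode speed. It assumes each generated token reads the full weight set once and ignores KV-cache traffic and compute time, so it is an upper bound rather than a prediction; the 60% achieved-bandwidth factor is an assumption for illustration.

```python
# Memory-bandwidth roofline estimate for single-stream decoding.
MEM_BANDWIDTH_GBPS = 2000.0  # H100 PCIe, ~2.0 TB/s
MODEL_WEIGHTS_GB = 1.5       # q3_k_m weights (figure from the text)
EFFICIENCY = 0.6             # assumed fraction of peak bandwidth actually achieved

ceiling_tok_s = MEM_BANDWIDTH_GBPS / MODEL_WEIGHTS_GB
realistic_tok_s = ceiling_tok_s * EFFICIENCY

print(f"Bandwidth ceiling: ~{ceiling_tok_s:.0f} tokens/s per sequence")
print(f"At {EFFICIENCY:.0%} achieved bandwidth: ~{realistic_tok_s:.0f} tokens/s")
```

The gap between this ceiling (on the order of a thousand tokens/sec per sequence) and the estimated 117 tokens/sec is exactly why framework choice and configuration, rather than the hardware, dominate the result.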
To maximize performance, use an inference stack optimized for NVIDIA GPUs. Note that q3_k_m is a llama.cpp (GGUF) quantization format, so llama.cpp is the most direct way to serve that exact file; vLLM and NVIDIA's TensorRT-LLM typically deliver higher throughput on this hardware but rely on their own quantization paths (e.g. FP8, AWQ, GPTQ) or on GGUF support that is still maturing. Experiment with different batch sizes to balance throughput against latency: 32 is a reasonable starting point, and larger batches will usually raise aggregate tokens/second when per-request latency is not a primary concern. Techniques such as speculative decoding can further boost inference speed, and profiling the deployment will show where any remaining bottlenecks lie.
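Below is a minimal vLLM sketch for the batch-size experiment described above. It assumes the Hugging Face checkpoint "microsoft/Phi-3-mini-4k-instruct" rather than the q3_k_m GGUF file (which would instead be served with llama.cpp or vLLM's experimental GGUF loader); the model id, batch size, and memory fraction are illustrative choices, not measured optima.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-mini-4k-instruct",  # assumed model id
    max_num_seqs=32,               # concurrent sequences per scheduler step
    gpu_memory_utilization=0.90,   # leave some VRAM for other GPU work
)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Explain the Hopper architecture in one paragraph."] * 32

outputs = llm.generate(prompts, params)
for out in outputs[:2]:
    print(out.outputs[0].text[:120])
```

Raising `max_num_seqs` (and measuring tokens/second at each setting) is the simplest way to trade latency for throughput on a GPU with this much headroom.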