The NVIDIA H100 SXM, with 80 GB of HBM3 VRAM and 3.35 TB/s of memory bandwidth, offers ample resources for running Llama 3 8B. In FP16 the model's weights occupy roughly 16 GB, leaving about 64 GB of headroom on the H100. That surplus accommodates larger batch sizes and longer context lengths, which translate directly into higher throughput. The H100's 528 Tensor Cores accelerate the matrix multiplications that dominate transformer inference, while the high memory bandwidth keeps data moving between compute units and memory, a key factor since autoregressive decoding is typically memory-bandwidth-bound rather than compute-bound.
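As a rough illustration of the arithmetic above, the sketch below (plain Python, no framework required) estimates the weight footprint and remaining headroom. The figures are approximations: KV cache, activations, and framework overhead consume part of that headroom and are not modeled here.

```python
# Back-of-envelope VRAM estimate for Llama 3 8B weights in FP16.
# Real usage adds KV cache, activations, and framework overhead
# on top of the raw weight footprint.

PARAMS = 8.0e9          # ~8 billion parameters
BYTES_PER_PARAM = 2     # FP16 = 2 bytes per parameter
H100_VRAM_GB = 80       # H100 SXM HBM3 capacity

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
headroom_gb = H100_VRAM_GB - weights_gb

print(f"Weights:  ~{weights_gb:.0f} GB")   # ~16 GB
print(f"Headroom: ~{headroom_gb:.0f} GB")  # ~64 GB for KV cache, batching, etc.
```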
Given the Hopper architecture and its optimized Tensor Cores, Llama 3 8B should perform very well on the H100. The estimated 108 tokens/sec is a reasonable expectation, though actual numbers vary with the inference framework and the level of optimization applied. The large VRAM capacity also leaves room to experiment with larger batch sizes, which can raise aggregate throughput further. Note that the SXM module's 700 W TDP requires adequate cooling and power delivery.
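A quick roofline-style check makes the single-stream figure plausible. Because each decoded token requires streaming the full weight set from HBM, memory bandwidth sets the ceiling; the 50% efficiency factor below is an assumption standing in for KV-cache reads, kernel overhead, and scheduling gaps, not a measured value.

```python
# Rough bandwidth-bound ceiling for single-stream decode on an H100 SXM.
# Illustrative numbers only; the efficiency fraction is an assumption.

BANDWIDTH_GB_S = 3350   # H100 SXM HBM3 bandwidth
WEIGHTS_GB = 16         # Llama 3 8B weights in FP16
EFFICIENCY = 0.5        # assumed fraction of peak bandwidth actually achieved

ceiling_tok_s = BANDWIDTH_GB_S / WEIGHTS_GB
realistic_tok_s = ceiling_tok_s * EFFICIENCY

print(f"Theoretical ceiling: ~{ceiling_tok_s:.0f} tok/s")            # ~209 tok/s
print(f"At {EFFICIENCY:.0%} efficiency: ~{realistic_tok_s:.0f} tok/s")  # ~105 tok/s
```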
For optimal performance, use an optimized inference framework such as vLLM or NVIDIA's TensorRT-LLM; both are built to exploit the H100's Tensor Cores and memory bandwidth. Experiment with batch size to find the right trade-off between throughput and latency: start around 32 and increase until you hit diminishing returns or memory limits, as in the sketch below. FP16 is a good baseline, but quantization to INT8 or even FP8 can shrink the memory footprint and often raise throughput, provided accuracy stays acceptable.
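A minimal vLLM starting point might look like the following. The model identifier, batch size, and sampling settings are placeholders to adjust for your deployment; quantization can later be enabled through the framework's own options rather than the FP16 baseline shown here.

```python
# Minimal vLLM sketch for Llama 3 8B on a single H100.
# Model ID and parameter values are assumptions; tune max_num_seqs and
# gpu_memory_utilization for your workload.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed Hugging Face model ID
    dtype="float16",              # FP16 baseline; bfloat16 is also common
    max_num_seqs=32,              # starting batch size suggested above
    gpu_memory_utilization=0.90,  # leave some VRAM slack for spikes
)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain the Hopper architecture in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```

Sweeping `max_num_seqs` upward while watching throughput and per-request latency is the simplest way to locate the point of diminishing returns mentioned above.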