The NVIDIA RTX 4090, with its 24 GB of GDDR6X VRAM, is exceptionally well suited to running the Llama 3 8B model, especially with quantization. The provided Q3_K_M quantization brings the model's VRAM footprint down to roughly 3.2 GB, leaving about 20.8 GB of headroom. That headroom allows larger batch sizes and longer context lengths without running into memory limits. The RTX 4090's memory bandwidth of 1.01 TB/s also ensures rapid data movement between the GPU cores and VRAM, minimizing bottlenecks during inference, while its 16384 CUDA cores and 512 Tensor Cores accelerate the matrix multiplications at the heart of LLM inference, yielding high throughput.
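As a concrete illustration of fitting the quantized model entirely in VRAM, here is a minimal llama-cpp-python sketch that loads a Q3_K_M GGUF with every layer offloaded to the GPU. It is a sketch under assumptions, not taken from the text above: the model filename, context length, and prompt are placeholders to adapt to your setup.

```python
from llama_cpp import Llama

# Hypothetical filename; point this at wherever your Q3_K_M GGUF lives.
MODEL_PATH = "llama-3-8b-instruct.Q3_K_M.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=-1,   # offload all layers; ~3.2 GB of weights leaves ~20.8 GB free on a 24 GB card
    n_ctx=8192,        # a long context still fits comfortably in the remaining VRAM
    verbose=False,
)

out = llm("Explain K-quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

With all layers offloaded, nvidia-smi should show the expected few-gigabyte weight footprint, growing somewhat as the KV cache fills with longer contexts.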
Given the substantial VRAM headroom, experiment with increasing the batch size to further improve throughput: start with the suggested batch size of 13 and increase it gradually until tokens/sec stops improving or you hit VRAM limits (see the sweep sketch below). Also consider a higher-precision quantization (e.g., Q4_K_M), which can improve accuracy with little performance cost, since the RTX 4090 has ample resources for the slightly larger weights. Finally, make sure you are running the latest NVIDIA drivers and CUDA toolkit for optimal performance.
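One way to run that batch-size sweep is sketched below. It maps "batch size" onto llama.cpp's n_batch parameter, which is only one interpretation (in a serving stack it might instead mean concurrent requests), and the model path and prompt are again assumptions.

```python
import time
from llama_cpp import Llama

MODEL_PATH = "llama-3-8b-instruct.Q3_K_M.gguf"  # hypothetical path
PROMPT = "Summarize the trade-offs of weight quantization for LLM inference."

# Start at the suggested batch size of 13 and grow it until tokens/sec plateaus
# or VRAM runs out.
for n_batch in (13, 32, 64, 128, 256, 512):
    llm = Llama(
        model_path=MODEL_PATH,
        n_gpu_layers=-1,
        n_ctx=4096,
        n_batch=n_batch,
        verbose=False,
    )
    start = time.time()
    out = llm(PROMPT, max_tokens=256)
    elapsed = time.time() - start
    generated = out["usage"]["completion_tokens"]
    print(f"n_batch={n_batch:>4}: {generated / elapsed:.1f} tokens/sec")
    del llm  # free the model before loading the next configuration
```

Note that n_batch mainly affects prompt-processing throughput; if your workload is generation-heavy, the differences between settings may be modest, which is exactly the diminishing-returns point the sweep is meant to find.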