The primary bottleneck for running Llama 3.1 405B on an RTX 3090 is VRAM. In INT8 quantization, Llama 3.1 405B requires approximately 405GB of VRAM for the weights alone (roughly one byte per parameter). The RTX 3090, with only 24GB of VRAM, falls drastically short, leaving a shortfall of roughly 381GB. This gap makes it impossible to load the model onto the GPU for inference; even with quantization, the model's memory footprint far exceeds the card's capacity.
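As a quick sanity check, the footprint can be estimated as parameters × bytes per parameter. The sketch below uses rough round numbers and deliberately ignores KV cache, activations, and framework overhead:

```python
# Back-of-envelope weight footprint: parameters x bytes per parameter.
# Rough estimate only; ignores KV cache, activations, and framework overhead.
PARAMS_BILLION = 405                                  # Llama 3.1 405B
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}
GPU_VRAM_GB = 24                                      # RTX 3090

for precision, bytes_per_param in BYTES_PER_PARAM.items():
    weights_gb = PARAMS_BILLION * bytes_per_param     # ~1 GB per billion params per byte
    headroom_gb = GPU_VRAM_GB - weights_gb
    print(f"{precision}: weights ~{weights_gb:.0f} GB, headroom {headroom_gb:+.0f} GB")
```

At INT8 this reproduces the ~405GB figure and the roughly 381GB shortfall quoted above.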
Beyond VRAM, even if the model *could* fit, memory bandwidth would become the next constraint. Single-stream decoding is largely memory-bound: generating each token requires streaming essentially the full weight set from memory. The RTX 3090's 0.94 TB/s of memory bandwidth, while substantial, would therefore cap tokens/second well before compute does. The CUDA and Tensor core counts, while important for computational throughput, are rendered less relevant by the primary VRAM bottleneck; without sufficient VRAM, those cores sit largely idle.
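A rough, purely hypothetical upper bound on decode speed follows from dividing memory bandwidth by the bytes read per generated token:

```python
# Rough bandwidth ceiling for single-stream decoding: each generated token
# streams (approximately) the full weight set from memory once.
# Hypothetical illustration only; real throughput would be lower.
bandwidth_gb_per_s = 936       # RTX 3090, ~0.94 TB/s
weights_gb_int8 = 405          # INT8 weights for a 405B-parameter model

tokens_per_s = bandwidth_gb_per_s / weights_gb_int8
print(f"Theoretical ceiling: ~{tokens_per_s:.1f} tokens/s (if the weights fit, which they do not)")
```

Even in this impossible best case, the ceiling is only about 2 tokens/second.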
Due to the extreme VRAM deficit, the RTX 3090 cannot run Llama 3.1 405B, even in INT8. Attempting to do so would result in out-of-memory errors or the system crashing. Therefore, performance metrics like tokens/second and batch size are not applicable in this scenario.
Given the RTX 3090's VRAM limitations, running Llama 3.1 405B locally is not feasible. Consider using a smaller model that fits within 24GB of VRAM, or explore distributed inference across multiple GPUs, which requires significant infrastructure investment and specialized software (see the sizing sketch below). Cloud-based inference services are a practical alternative, letting you rent more powerful hardware on demand without the upfront cost. A heavily quantized version of the model (e.g., 4-bit quantization) is another option in principle, but even at 4 bits the weights occupy roughly 200GB, so the model still cannot fit into the 3090's VRAM.
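For a sense of scale, the sketch below estimates how many 24GB cards the weights alone would need at different quantization levels (rough round numbers; it ignores KV cache, activation memory, and parallelism overhead):

```python
import math

# Approximate weight footprints for a 405B-parameter model (rough round numbers).
weights_gb = {"int8": 405, "int4": 203}
vram_per_gpu_gb = 24                      # per RTX 3090

for precision, size_gb in weights_gb.items():
    gpus_needed = math.ceil(size_gb / vram_per_gpu_gb)
    print(f"{precision}: at least {gpus_needed} x 24GB GPUs for the weights alone")
```

That works out to roughly 17 cards at INT8, or 9 at 4-bit, before accounting for any runtime overhead.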
If local inference is a requirement, explore the smaller Llama 3.1 variants or other models with significantly fewer parameters that fit comfortably within the RTX 3090's 24GB. Fine-tuning a smaller model on a specific task can often match the results of a larger, more general model.
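As one concrete starting point, here is a minimal sketch that loads a Llama 3.1 8B checkpoint in 4-bit, which needs only around 5-6GB of weights and fits easily in 24GB. It assumes the Hugging Face transformers + bitsandbytes stack and access to the gated meta-llama/Llama-3.1-8B-Instruct repository; adjust the model ID to whatever checkpoint you actually use:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed model ID; requires accepting the Llama license on Hugging Face.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",          # place layers on the single RTX 3090
)

inputs = tokenizer("Explain VRAM in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With the remaining headroom, a single 3090 can also afford a reasonable context length and modest batch sizes.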