The AMD RX 7900 XTX, with its 24GB of GDDR6 VRAM, falls short of the VRAM requirements for running Llama 3 70B, even in its INT8-quantized form. At INT8 (one byte per weight), the 70B parameters alone occupy roughly 70GB, before accounting for the KV cache, activation buffers, and framework overhead. That leaves a deficit of roughly 46GB, so the full model cannot reside in GPU memory at once, and inference cannot run without significant workarounds. While the RX 7900 XTX offers a substantial 0.96 TB/s of memory bandwidth, bandwidth cannot compensate for insufficient on-device VRAM. The card also lacks dedicated matrix-multiply units comparable to NVIDIA's Tensor Cores, so the matrix multiplications that dominate LLM inference run on general-purpose compute units, yielding slower processing than on GPUs with dedicated tensor hardware.
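A rough back-of-the-envelope sketch of that memory math (the figures are illustrative approximations; real deployments add KV cache, activations, and runtime overhead on top of the weights):

```python
# Back-of-the-envelope VRAM estimate for INT8-quantized Llama 3 70B.
params = 70e9          # Llama 3 70B parameter count
bytes_per_param = 1    # INT8 -> 1 byte per weight

weight_gb = params * bytes_per_param / 1e9   # weights alone, no KV cache/overhead
gpu_vram_gb = 24                             # RX 7900 XTX VRAM

print(f"Weights alone: ~{weight_gb:.0f} GB")
print(f"Deficit vs {gpu_vram_gb} GB card: ~{weight_gb - gpu_vram_gb:.0f} GB")
```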
Due to this VRAM shortfall, running Llama 3 70B directly on the RX 7900 XTX is not feasible without workarounds such as offloading layers to system RAM or applying very aggressive quantization. Layer offloading severely degrades performance, because transferring data between system RAM and the GPU is far slower than reading from VRAM; a sketch of partial offloading follows below. Consider a smaller variant such as Llama 3 8B, which has much lower VRAM requirements and fits comfortably within the RX 7900 XTX's 24GB. Alternatively, if high performance on the 70B model is critical, explore distributed inference solutions that split the model across multiple GPUs or machines.
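As a minimal sketch of the layer-offloading approach, assuming llama-cpp-python built with ROCm/HIP support for the AMD GPU and a GGUF-quantized model file (the file path and layer count below are hypothetical and would need tuning to what actually fits in 24GB):

```python
# Partial GPU offload sketch with llama-cpp-python (ROCm/HIP build assumed).
# Layers that do not fit in VRAM remain in system RAM, at a large cost in tokens/s.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3-70b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=40,   # offload only as many layers as fit in the 24GB of VRAM
    n_ctx=4096,        # context window; longer contexts increase memory use
)

out = llm("Summarize why VRAM capacity limits 70B inference.", max_tokens=64)
print(out["choices"][0]["text"])
```

Lowering `n_gpu_layers` trades GPU residency for system RAM usage; expect throughput to drop sharply as more layers spill off the card.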