The primary limiting factor in running large language models (LLMs) such as Mixtral 8x22B on consumer GPUs is VRAM capacity. Even when quantized to q3_k_m, Mixtral 8x22B requires approximately 56.4GB of VRAM to load the weights and perform inference. The NVIDIA RTX 4090, while a powerful GPU, is equipped with only 24GB of VRAM, leaving a 32.4GB deficit: the entire model cannot reside in GPU memory at once. Memory bandwidth, although substantial at 1.01 TB/s on the RTX 4090, matters little when the model does not fit within the available VRAM, because data would have to be constantly swapped between system RAM and GPU VRAM, which is extremely slow.
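The deficit above follows from simple arithmetic. A minimal sketch of that calculation is shown below; the ~3.2 bits-per-weight figure for q3_k_m is an assumption chosen to match the ~56.4GB estimate, and real runtimes need additional memory for the KV cache and activation buffers on top of the weights.

```python
# Back-of-the-envelope check: estimate the footprint of the quantized weights
# and compare it to the GPU's VRAM. Bits-per-weight for q3_k_m (~3.2) is an
# assumed average; KV cache and buffers are not included.

def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

model_gb = quantized_weight_gb(params_billion=141, bits_per_weight=3.2)
vram_gb = 24.0  # NVIDIA RTX 4090

print(f"Estimated weights: {model_gb:.1f} GB")            # ~56.4 GB
print(f"Available VRAM:    {vram_gb:.1f} GB")
print(f"Deficit:           {model_gb - vram_gb:.1f} GB")  # ~32.4 GB
```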
Due to the VRAM limitations of the RTX 4090, directly running Mixtral 8x22B (141B parameters) is not feasible even in its q3_k_m quantized form. Consider offloading part of the model to the CPU or splitting it across multiple GPUs if possible. Alternatively, use a smaller model that fits within the RTX 4090's 24GB of VRAM, or a cloud-based inference service with sufficient GPU resources. If CPU offloading is the only option, expect significantly reduced performance compared to running the model fully on the GPU, as in the sketch below.
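One way to apply partial CPU offloading is via llama-cpp-python, sketched below under assumptions: the GGUF file path is a placeholder, and the number of GPU layers must be tuned experimentally until the model loads without running out of the 24GB of VRAM. Layers not placed on the GPU run on the CPU from system RAM, which is why throughput drops sharply.

```python
# A minimal sketch of CPU offloading with llama-cpp-python.
# The model path and layer count are placeholders, not verified settings.
from llama_cpp import Llama

llm = Llama(
    model_path="./mixtral-8x22b-instruct.Q3_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # tune downward until loading succeeds within 24GB
    n_ctx=4096,        # smaller context reduces KV-cache memory pressure
)

output = llm(
    "Explain mixture-of-experts routing in one paragraph.",
    max_tokens=256,
)
print(output["choices"][0]["text"])
```

Even with a working layer split, generation speed is bounded by system RAM bandwidth rather than the GPU's 1.01 TB/s, so this setup is best suited to experimentation rather than production use.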