The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM and Ampere architecture, is well-suited to running the Phi-3 Medium 14B model, particularly with quantization. In FP16 (half precision), the model's weights alone occupy roughly 28GB, which exceeds the 3090 Ti's capacity. With q3_k_m quantization, the weight footprint drops to approximately 5.6GB, leaving around 18.4GB of VRAM headroom for the KV cache and activations. That headroom comfortably accommodates larger batch sizes or longer context lengths without running out of memory. The 3090 Ti's roughly 1.01 TB/s of memory bandwidth also matters: single-stream LLM inference is largely memory-bandwidth bound, since each generated token requires streaming the model weights from VRAM, so higher bandwidth translates directly into higher token throughput.
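The arithmetic behind those figures is simple to reproduce. Below is a rough back-of-envelope sketch in Python; the bits-per-weight values are assumptions chosen to match the sizes quoted above (the exact size of any particular GGUF file depends on its specific quantization mix), and the estimate covers weights only, not KV cache or activations.

```python
# Rough VRAM estimate for a 14B-parameter model at different precisions.
# Bits-per-weight values are illustrative assumptions; real GGUF file
# sizes vary with the quantization mix used for each tensor.

PARAMS = 14e9        # approximate Phi-3 Medium parameter count
GPU_VRAM_GB = 24.0   # RTX 3090 Ti

def weights_gb(bits_per_weight: float) -> float:
    """Estimate weight memory in GB for a given average bits per weight."""
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bpw in [("FP16", 16.0), ("q3_k_m (approx.)", 3.2), ("Q2_K (approx.)", 2.6)]:
    gb = weights_gb(bpw)
    headroom = GPU_VRAM_GB - gb
    print(f"{label:>18}: ~{gb:5.1f} GB weights, {headroom:+6.1f} GB headroom on a 24 GB card")
```

Running this shows FP16 overshooting the card by about 4GB, while q3_k_m leaves the ~18GB of headroom discussed above.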
For good performance with Phi-3 Medium 14B on your RTX 3090 Ti, stay with the q3_k_m quantization, since it keeps VRAM usage well within the 24GB budget. Experiment with batch sizes of up to 6 concurrent sequences to improve throughput, watching VRAM usage closely so you do not exceed available memory. A runtime such as `llama.cpp` handles the k-quant GGUF formats natively and manages GPU offload and memory for you; `vLLM` is an alternative aimed at high-throughput serving, though it is primarily built around other quantization formats (e.g., GPTQ and AWQ). If you need still larger batch sizes or longer context lengths, keep monitoring VRAM and consider a more aggressive quantization such as Q2_K, keeping in mind that dropping below roughly 3 bits per weight noticeably degrades output quality.
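One way to put this into practice is through the `llama-cpp-python` bindings for `llama.cpp`. The sketch below is a minimal example, assuming you have already downloaded a q3_k_m GGUF build of Phi-3 Medium; the file path, prompt, and parameter values shown are hypothetical placeholders to adjust for your setup.

```python
# Minimal sketch using the llama-cpp-python bindings for llama.cpp.
# The model filename below is hypothetical; substitute the actual
# q3_k_m GGUF file you downloaded for Phi-3 Medium.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/phi-3-medium-14b-q3_k_m.gguf",  # hypothetical path
    n_gpu_layers=-1,   # offload all layers to the RTX 3090 Ti
    n_ctx=8192,        # context window; raise cautiously and watch VRAM
    n_batch=512,       # prompt-processing batch size, in tokens per pass
)

out = llm(
    "Explain the difference between GDDR6X and HBM2 memory in two sentences.",
    max_tokens=128,
    temperature=0.7,
)
print(out["choices"][0]["text"])

# While this runs, `nvidia-smi -l 1` in another terminal shows live VRAM
# usage, so you can confirm you stay within the 24 GB budget.
```

Note that `n_batch` here controls how many prompt tokens are processed per pass, which is distinct from the number of concurrent sequences mentioned above; both affect VRAM, so adjust one at a time while watching `nvidia-smi`.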