The NVIDIA RTX 6000 Ada, with its 48GB of GDDR6 VRAM and Ada Lovelace architecture, is exceptionally well suited to running the BGE-Small-EN embedding model. BGE-Small-EN is a small model of roughly 33 million (0.033 billion) parameters, so its weights occupy well under 0.1GB of VRAM in FP16 precision. That leaves roughly 47.9GB of headroom, enough for very large batch sizes, multiple concurrent instances of the model, or other larger models running alongside it.
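The headroom figure is simple back-of-envelope arithmetic: parameter count times bytes per FP16 value. A quick sketch (the ~33M parameter count is the published size of BAAI/bge-small-en; the weights alone come to about 0.06GB, with the rest of the 0.1GB estimate covering activations and workspace):

```python
# Back-of-envelope VRAM math for BGE-Small-EN on a 48 GB card.
PARAMS = 33_000_000          # approximate parameter count of BGE-Small-EN
BYTES_PER_PARAM_FP16 = 2     # FP16 = 16 bits = 2 bytes per parameter
GPU_VRAM_GB = 48.0           # RTX 6000 Ada

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1024**3
headroom_gb = GPU_VRAM_GB - weights_gb

print(f"weights:  {weights_gb:.3f} GB")   # ~0.06 GB
print(f"headroom: {headroom_gb:.1f} GB")  # ~47.9 GB
```

Activations scale with batch size and sequence length rather than with the weights, so even aggressive batching leaves the card mostly idle memory-wise.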
Furthermore, the RTX 6000 Ada's memory bandwidth of 0.96 TB/s ensures rapid data transfer between the GPU and its memory, minimizing potential bottlenecks. Its 18,176 CUDA cores and 568 Tensor Cores provide ample computational power for the matrix multiplications that dominate BGE-Small-EN's workload, yielding high throughput, and the Ada Lovelace architecture improves Tensor Core utilization and overall efficiency over previous generations.
Given the abundant VRAM and compute, you can maximize throughput by increasing the batch size: start with a batch size of 32, as estimated, and experiment with larger values until throughput plateaus or you hit memory limits. Consider a serving framework such as vLLM (which supports embedding models) or Hugging Face's Text Embeddings Inference (TEI) to further optimize inference speed and memory utilization; note that text-generation-inference targets generative models rather than embedders. If the RTX 6000 Ada is dedicated to BGE-Small-EN, you can also run multiple instances of the model in parallel to fully utilize the GPU's resources.
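The batch-size search can be sketched as a simple timed sweep. The snippet below uses a hypothetical `TinyEncoder` stand-in (a single linear layer at BGE-Small's 384-dim hidden size) so it runs anywhere; in practice you would replace it with `SentenceTransformer("BAAI/bge-small-en").encode(...)` and real text batches:

```python
import time
import torch

# Hypothetical stand-in for BGE-Small-EN's forward pass, sized to its
# 384-dim embedding space. Swap in the real model for actual numbers.
class TinyEncoder(torch.nn.Module):
    def __init__(self, dim: int = 384):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyEncoder().to(device).eval()

results = {}
for batch_size in (32, 64, 128, 256):
    x = torch.randn(batch_size, 384, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # GPU kernels are async; sync before timing
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(20):
            model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    results[batch_size] = 20 * batch_size / elapsed  # inputs per second

for bs, throughput in results.items():
    print(f"batch {bs}: {throughput:,.0f} inputs/s")
```

The pattern to look for is throughput rising with batch size until the GPU saturates, after which larger batches only add latency.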