The NVIDIA Jetson AGX Orin 32GB is well suited to running the BGE-M3 embedding model. Its 32GB of LPDDR5 is unified memory shared between the CPU and GPU rather than dedicated VRAM, and at roughly 205 GB/s of bandwidth it comfortably accommodates BGE-M3's modest ~1.1GB FP16 footprint. The Ampere-architecture GPU provides 1792 CUDA cores and 56 Tensor Cores, which accelerate the matrix multiplications that dominate embedding generation. The substantial headroom (roughly 30GB before accounting for the OS and other processes sharing the same pool) means the model can be deployed without memory constraints, even with large batch sizes or alongside other applications running concurrently on the device.
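The memory claim is easy to sanity-check with back-of-the-envelope arithmetic. In this sketch, the 568M parameter count is BGE-M3's published size; the OS reservation and activation-overhead factor are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope memory estimate for BGE-M3 on the AGX Orin 32GB.
PARAMS = 568_000_000          # BGE-M3 (XLM-RoBERTa-large backbone), approx.
BYTES_PER_PARAM_FP16 = 2
TOTAL_MEMORY_GB = 32.0
OS_RESERVED_GB = 4.0          # assumed: JetPack OS, desktop, and services

def weights_gb(params: int, bytes_per_param: int) -> float:
    """Model weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

def headroom_gb(activation_overhead: float = 1.5) -> float:
    """Free memory after weights plus an assumed activation margin."""
    used = weights_gb(PARAMS, BYTES_PER_PARAM_FP16) * activation_overhead
    return TOTAL_MEMORY_GB - OS_RESERVED_GB - used

print(f"FP16 weights: {weights_gb(PARAMS, BYTES_PER_PARAM_FP16):.2f} GB")
print(f"Estimated headroom: {headroom_gb():.1f} GB")
```

Even with a generous activation margin, well over 25GB remains free, which supports the large-batch and multi-application scenarios above.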
The Orin's memory bandwidth, while well below that of a dedicated desktop GPU, is adequate for a model of BGE-M3's size: streaming the ~0.57B parameters in FP16 from memory takes only a few milliseconds per forward pass, so bandwidth is not the bottleneck at typical batch sizes. The Tensor Cores are designed specifically for the FP16 matrix math that dominates transformer inference and contribute most of the throughput. A configurable TDP of up to 40W makes the Orin an energy-efficient option for edge deployments where power consumption is a critical factor, and the Ampere architecture delivers reasonable performance without excessive draw.
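The "few milliseconds" claim follows from a simple bandwidth-bound lower bound: if weight streaming were the only cost, latency would be weight size divided by bandwidth. The figures below are the Orin 32GB's published bandwidth and BGE-M3's approximate FP16 size:

```python
# Lower-bound latency if a forward pass were purely memory-bound.
BANDWIDTH_GBPS = 204.8        # AGX Orin 32GB LPDDR5 bandwidth, GB/s
WEIGHTS_GB = 1.14             # BGE-M3 FP16 weights, approx.

def bandwidth_bound_latency_ms(weights_gb: float, bw_gbps: float) -> float:
    """Time to stream the full weight set from memory once, in ms."""
    return weights_gb / bw_gbps * 1000.0

print(f"{bandwidth_bound_latency_ms(WEIGHTS_GB, BANDWIDTH_GBPS):.2f} ms")
```

Real latency adds compute and activation traffic on top of this ~5.6 ms floor, but the exercise shows the memory system is not the limiting factor.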
Based on our estimates, the Jetson AGX Orin 32GB can achieve approximately 90 tokens/second with the BGE-M3 model. This estimate takes into account the model's size, the GPU's specifications, and typical performance characteristics of similar models on comparable hardware. A batch size of 32 is likely optimal for maximizing throughput without exceeding the GPU's memory capacity or significantly increasing latency. Actual performance may vary depending on the specific implementation, optimization techniques used, and other system factors.
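Since actual performance varies by implementation, the estimate is best treated as a baseline to verify against a real deployment. A minimal measurement sketch, framework-agnostic (`encode_fn` is a stand-in for the actual model call, and the batch contents are whatever your pipeline produces):

```python
# Measure tokens/second for a pluggable encode function, so the
# ~90 tok/s estimate can be checked against a real deployment.
import time

def measure_tokens_per_second(encode_fn, batches, tokens_per_batch: int) -> float:
    """Run encode_fn over all batches and report aggregate tokens/second."""
    start = time.perf_counter()
    for batch in batches:
        encode_fn(batch)
    elapsed = time.perf_counter() - start
    return len(batches) * tokens_per_batch / elapsed
```

Run at least a few warm-up batches before timing; the first pass typically includes kernel compilation and memory allocation that would skew the result.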
For optimal performance, we recommend TensorRT, or ONNX Runtime with its CUDA or TensorRT execution providers, so the Tensor Cores are exercised effectively. Experiment with different batch sizes to find the sweet spot between throughput and latency, and monitor GPU utilization and memory usage to confirm the system is not bottlenecked by other processes. If latency is a concern, reduce the batch size or use a lower-precision format such as INT8, though quantization may slightly degrade embedding quality.
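The batch-size experiment above can be sketched as a small sweep harness. This is a framework-agnostic sketch: `encode` stands in for a real ONNX Runtime or TensorRT inference call, and the candidate sizes and latency budget are assumptions to be tuned per application:

```python
# Sweep batch sizes to find the highest-throughput size that still
# meets a latency budget. `encode` is a stand-in for a real
# inference call; all numbers passed in are illustrative.
import time

def sweep_batch_sizes(encode, make_batch, sizes, latency_budget_s: float):
    """Return (best_size, results) where results maps size -> (latency_s, items_per_s)."""
    results = {}
    best = None
    for size in sizes:
        batch = make_batch(size)
        start = time.perf_counter()
        encode(batch)
        latency = time.perf_counter() - start
        throughput = size / latency
        results[size] = (latency, throughput)
        if latency <= latency_budget_s and (best is None or throughput > results[best][1]):
            best = size
    return best, results
```

Throughput usually rises with batch size until compute saturates, while latency grows roughly linearly; the budget parameter encodes which side of that trade-off your application sits on.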
Since the memory headroom is significant, explore running multiple instances of the model concurrently to increase overall throughput, especially if the application involves processing a large number of embeddings. Ensure that the Jetson AGX Orin is properly cooled to prevent thermal throttling, which can significantly impact performance; `tegrastats` reports temperatures and clock speeds, and `nvpmodel` selects the power mode. Regularly update JetPack and the NVIDIA software stack to benefit from the latest optimizations and bug fixes.
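The multi-instance idea can be sketched with a simple fan-out over independent workers. This is a structural sketch only: `encode` stands in for a per-worker model session (ONNX Runtime sessions, for example, can serve requests from multiple threads), and the worker count is an assumption to be tuned against memory and thermal limits:

```python
# Fan batches of texts out across concurrent model workers to raise
# aggregate throughput. `encode` is a stand-in for a real per-worker
# inference session; the worker count is an illustrative assumption.
from concurrent.futures import ThreadPoolExecutor

def encode(texts):
    """Stand-in for a model call; returns one value per input text."""
    return [len(t) for t in texts]

def embed_all(chunks, workers: int = 2):
    """Process independent batches concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode, chunks))
```

Watch `tegrastats` while scaling the worker count: aggregate throughput stops improving once the GPU is saturated, and adding workers past that point only increases memory use and latency.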