The NVIDIA Jetson AGX Orin 32GB is well suited to running the BGE-M3 embedding model. Its 32GB of LPDDR5 is unified memory shared between the CPU and GPU rather than dedicated VRAM, and at roughly 205 GB/s of bandwidth it comfortably accommodates BGE-M3's modest ~1.1GB FP16 footprint. The Ampere-architecture GPU provides 1792 CUDA cores and 56 Tensor Cores, which accelerate the matrix multiplications that dominate embedding generation. The substantial headroom (roughly 30GB before accounting for the OS and other processes sharing the same pool) means the model can be deployed without memory constraints, even with large batch sizes or alongside other applications running concurrently on the device.
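The memory claim is easy to sanity-check with back-of-the-envelope arithmetic. In this sketch, the 568M parameter count is BGE-M3's published size; the OS reservation and activation-overhead factor are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope memory estimate for BGE-M3 on the AGX Orin 32GB.
PARAMS = 568_000_000          # BGE-M3 (XLM-RoBERTa-large backbone), approx.
BYTES_PER_PARAM_FP16 = 2
TOTAL_MEMORY_GB = 32.0
OS_RESERVED_GB = 4.0          # assumed: JetPack OS, desktop, and services

def weights_gb(params: int, bytes_per_param: int) -> float:
    """Model weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

def headroom_gb(activation_overhead: float = 1.5) -> float:
    """Free memory after weights plus an assumed activation margin."""
    used = weights_gb(PARAMS, BYTES_PER_PARAM_FP16) * activation_overhead
    return TOTAL_MEMORY_GB - OS_RESERVED_GB - used

print(f"FP16 weights: {weights_gb(PARAMS, BYTES_PER_PARAM_FP16):.2f} GB")
print(f"Estimated headroom: {headroom_gb():.1f} GB")
```

Even with a generous activation margin, well over 25GB remains free, which supports the large-batch and multi-application scenarios above.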
The Orin's memory bandwidth, while well below that of a dedicated desktop GPU, is adequate for a model of BGE-M3's size: streaming the ~0.57B parameters in FP16 from memory takes only a few milliseconds per forward pass, so bandwidth is not the bottleneck at typical batch sizes. The Tensor Cores are designed specifically for the FP16 matrix math that dominates transformer inference and contribute most of the throughput. A configurable TDP of up to 40W makes the Orin an energy-efficient option for edge deployments where power consumption is a critical factor, and the Ampere architecture delivers reasonable performance without excessive draw.
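The "few milliseconds" claim follows from a simple bandwidth-bound lower bound: if weight streaming were the only cost, latency would be weight size divided by bandwidth. The figures below are the Orin 32GB's published bandwidth and BGE-M3's approximate FP16 size:

```python
# Lower-bound latency if a forward pass were purely memory-bound.
BANDWIDTH_GBPS = 204.8        # AGX Orin 32GB LPDDR5 bandwidth, GB/s
WEIGHTS_GB = 1.14             # BGE-M3 FP16 weights, approx.

def bandwidth_bound_latency_ms(weights_gb: float, bw_gbps: float) -> float:
    """Time to stream the full weight set from memory once, in ms."""
    return weights_gb / bw_gbps * 1000.0

print(f"{bandwidth_bound_latency_ms(WEIGHTS_GB, BANDWIDTH_GBPS):.2f} ms")
```

Real latency adds compute and activation traffic on top of this ~5.6 ms floor, but the exercise shows the memory system is not the limiting factor.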
Based on our estimates, the Jetson AGX Orin 32GB can achieve approximately 90 tokens/second with the BGE-M3 model. This estimate takes into account the model's size, the GPU's specifications, and typical performance characteristics of similar models on comparable hardware. A batch size of 32 is likely optimal for maximizing throughput without exceeding the GPU's memory capacity or significantly increasing latency. Actual performance may vary depending on the specific implementation, optimization techniques used, and other system factors.
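Since actual performance varies by implementation, the estimate is best treated as a baseline to verify against a real deployment. A minimal measurement sketch, framework-agnostic (`encode_fn` is a stand-in for the actual model call, and the batch contents are whatever your pipeline produces):

```python
# Measure tokens/second for a pluggable encode function, so the
# ~90 tok/s estimate can be checked against a real deployment.
import time

def measure_tokens_per_second(encode_fn, batches, tokens_per_batch: int) -> float:
    """Run encode_fn over all batches and report aggregate tokens/second."""
    start = time.perf_counter()
    for batch in batches:
        encode_fn(batch)
    elapsed = time.perf_counter() - start
    return len(batches) * tokens_per_batch / elapsed
```

Run at least a few warm-up batches before timing; the first pass typically includes kernel compilation and memory allocation that would skew the result.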
For optimal performance, we recommend TensorRT, or ONNX Runtime with its CUDA or TensorRT execution providers, so the Tensor Cores are exercised effectively. Experiment with different batch sizes to find the sweet spot between throughput and latency, and monitor GPU utilization and memory usage to confirm the system is not bottlenecked by other processes. If latency is a concern, reduce the batch size or use a lower-precision format such as INT8, though quantization may slightly degrade embedding quality.
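The batch-size experiment above can be sketched as a small sweep harness. This is a framework-agnostic sketch: `encode` stands in for a real ONNX Runtime or TensorRT inference call, and the candidate sizes and latency budget are assumptions to be tuned per application:

```python
# Sweep batch sizes to find the highest-throughput size that still
# meets a latency budget. `encode` is a stand-in for a real
# inference call; all numbers passed in are illustrative.
import time

def sweep_batch_sizes(encode, make_batch, sizes, latency_budget_s: float):
    """Return (best_size, results) where results maps size -> (latency_s, items_per_s)."""
    results = {}
    best = None
    for size in sizes:
        batch = make_batch(size)
        start = time.perf_counter()
        encode(batch)
        latency = time.perf_counter() - start
        throughput = size / latency
        results[size] = (latency, throughput)
        if latency <= latency_budget_s and (best is None or throughput > results[best][1]):
            best = size
    return best, results
```

Throughput usually rises with batch size until compute saturates, while latency grows roughly linearly; the budget parameter encodes which side of that trade-off your application sits on.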
Since the memory headroom is significant, explore running multiple instances of the model concurrently to increase overall throughput, especially if the application involves processing a large number of embeddings. Ensure that the Jetson AGX Orin is properly cooled to prevent thermal throttling, which can significantly impact performance; `tegrastats` reports temperatures and clock speeds, and `nvpmodel` selects the power mode. Regularly update JetPack and the NVIDIA software stack to benefit from the latest optimizations and bug fixes.
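The multi-instance idea can be sketched with a simple fan-out over independent workers. This is a structural sketch only: `encode` stands in for a per-worker model session (ONNX Runtime sessions, for example, can serve requests from multiple threads), and the worker count is an assumption to be tuned against memory and thermal limits:

```python
# Fan batches of texts out across concurrent model workers to raise
# aggregate throughput. `encode` is a stand-in for a real per-worker
# inference session; the worker count is an illustrative assumption.
from concurrent.futures import ThreadPoolExecutor

def encode(texts):
    """Stand-in for a model call; returns one value per input text."""
    return [len(t) for t in texts]

def embed_all(chunks, workers: int = 2):
    """Process independent batches concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode, chunks))
```

Watch `tegrastats` while scaling the worker count: aggregate throughput stops improving once the GPU is saturated, and adding workers past that point only increases memory use and latency.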