The NVIDIA Jetson AGX Orin 32GB, with its Ampere-architecture GPU, 1792 CUDA cores, and 32GB of LPDDR5 unified memory (shared between the CPU and GPU), provides ample resources for running the BGE-Small-EN embedding model. BGE-Small-EN is a small model with roughly 33 million (0.03B) parameters, requiring only about 0.1GB of memory in FP16 precision, so it fits comfortably within the Orin's capacity. The Orin's 204.8 GB/s memory bandwidth is more than sufficient for this model's data-transfer demands, and its 56 Tensor Cores accelerate the matrix multiplications that dominate transformer inference, reducing latency.
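As a sanity check on the footprint figure, a quick back-of-envelope calculation (parameter count as stated above; the 0.1GB budget leaves headroom for activations and runtime overhead):

```python
# FP16 weight footprint for BGE-Small-EN (~33M parameters).
PARAMS = 33_000_000    # ~0.03B parameters
BYTES_PER_PARAM = 2    # FP16 = 2 bytes per parameter

weights_gb = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"FP16 weights: {weights_gb:.3f} GB")  # ~0.061 GB, well under the 0.1GB budget
```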
Given the substantial memory headroom (roughly 31.9GB on paper, though the OS and CPU processes share the same pool), the Jetson AGX Orin can easily accommodate larger batch sizes and can even run multiple instances of the BGE-Small-EN model concurrently. The Ampere architecture's improvements in memory management and compute efficiency further contribute to the model's performance. With an estimated throughput of 90 tokens/sec, the Orin provides a responsive and practical platform for embedding tasks using BGE-Small-EN, and its configurable 15W-40W power envelope makes it suitable for edge deployment scenarios where power consumption is a concern.
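To put the headroom in perspective, here is a rough capacity estimate; the 0.5GB per-instance figure is an assumption covering weights, activations, and runtime overhead, not a measured value:

```python
# Rough upper bound on concurrent BGE-Small-EN instances in unified memory.
HEADROOM_GB = 31.9       # nominal free memory from the text
PER_INSTANCE_GB = 0.5    # assumed weights + activations + runtime overhead

instances = int(HEADROOM_GB // PER_INSTANCE_GB)
print(instances)  # 63 instances in principle; compute contention will bind first
```

In practice, GPU compute and memory-bandwidth contention will limit useful concurrency long before memory does.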
For a starting point, use a batch size of 32 and a context length of 512 tokens; both are well within the Jetson AGX Orin's capabilities. ONNX Runtime or TensorRT can further optimize the model for the Orin's architecture. Consider quantizing the model to INT8 or even INT4 to reduce the memory footprint and potentially increase inference speed, though this may come at a slight cost to accuracy.
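A minimal sketch of applying those parameters when preparing inputs: the whitespace split below is a stand-in for the model's real tokenizer, and the helper names are illustrative, not part of any library API.

```python
from typing import Iterator, List

BATCH_SIZE = 32    # recommended starting batch size
MAX_TOKENS = 512   # recommended context length

def batched(texts: List[str], batch_size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield fixed-size chunks of the input corpus."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

def truncate(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Crude token cap; a real pipeline would use the model's tokenizer."""
    return " ".join(text.split()[:max_tokens])

docs = [f"document {i}" for i in range(100)]
batches = [[truncate(t) for t in chunk] for chunk in batched(docs)]
print(len(batches), len(batches[0]), len(batches[-1]))  # 4 32 4
```

Each batch would then be passed to the embedding model's encode call in a single forward pass.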
Monitor memory usage (e.g. with tegrastats) and inference latency during initial testing to fine-tune the batch size and other parameters for your specific application. If you encounter performance bottlenecks, profile the application to identify areas for further optimization, such as kernel fusion or memory-access patterns. If still higher throughput is required, consider parallelizing inference across multiple Orin devices where your use case allows.
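The latency-monitoring step can be sketched as a small timing harness; `fake_encode` is a placeholder to be swapped for the real embedding call on the device:

```python
import statistics
import time

def median_latency_ms(fn, batch, warmup=2, runs=10):
    """Median wall-clock latency of fn(batch) in milliseconds."""
    for _ in range(warmup):       # warm caches / JIT before measuring
        fn(batch)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(batch)
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)

# Placeholder workload; replace with the model's encode call on device.
def fake_encode(batch):
    return [sum(map(ord, text)) for text in batch]

ms = median_latency_ms(fake_encode, ["example query"] * 32)
print(f"median latency: {ms:.3f} ms per batch of 32")
```

Tracking the median (rather than the mean) makes the measurement robust to occasional scheduling hiccups on a busy edge device.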