The NVIDIA Jetson AGX Orin 64GB, with its Ampere-architecture GPU, 2048 CUDA cores, and 64GB of LPDDR5 memory, is well-suited to running the BGE-M3 embedding model. Note that the Orin has no dedicated VRAM: its memory is unified, shared between the CPU and GPU. BGE-M3 is a relatively small model at roughly 0.5B parameters, requiring only about 1.0GB in FP16 precision, which leaves on the order of 63GB of headroom (minus whatever the OS and other processes consume). This ample memory allows large batch sizes and leaves room to load multiple model instances or run other AI workloads concurrently. The Orin's roughly 0.21 TB/s (204.8 GB/s) memory bandwidth, while not class-leading, is sufficient for BGE-M3's memory access patterns, so bandwidth is unlikely to become a bottleneck.
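The footprint figure above is simple back-of-envelope arithmetic: parameter count times bytes per parameter. A minimal sketch (weights only; activations, tokenizer buffers, and framework overhead add more, so treat it as a lower bound):

```python
def fp16_weight_footprint_gb(params_billions: float) -> float:
    """Approximate FP16 weight memory: 2 bytes per parameter."""
    bytes_per_param = 2
    return params_billions * 1e9 * bytes_per_param / 1e9

weights_gb = fp16_weight_footprint_gb(0.5)  # ~1.0 GB for BGE-M3's ~0.5B params
headroom_gb = 64 - weights_gb               # ~63 GB of the Orin's unified memory
print(f"weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

The same function also shows why quantization helps on smaller Jetson modules: swapping `bytes_per_param` to 1 (INT8) halves the estimate.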
Given the Jetson AGX Orin's abundant memory and compute, prioritize throughput by experimenting with larger batch sizes: start at 32 and increase incrementally until throughput plateaus or you hit memory limits. Consider TensorRT for optimized inference, which can significantly improve performance on NVIDIA hardware. And because BGE-M3 is small, explore running multiple instances in parallel to improve overall system utilization.
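The batch-size sweep described above can be sketched as a small harness. Here `encode(batch)` is a placeholder for your actual embedding call (for example, FlagEmbedding's `BGEM3FlagModel.encode`, batched however your serving stack batches); the harness itself is framework-agnostic, and treating `RuntimeError` as out-of-memory is an assumption that matches PyTorch's CUDA OOM behavior:

```python
import time

def find_best_batch_size(encode, sentences, candidates=(32, 64, 128, 256)):
    """Sweep candidate batch sizes, return (best_size, sentences_per_sec).

    Stops growing the batch when encode() raises RuntimeError,
    which is how a CUDA out-of-memory error typically surfaces.
    """
    best_bs, best_tput = None, 0.0
    for bs in candidates:
        try:
            start = time.perf_counter()
            for i in range(0, len(sentences), bs):
                encode(sentences[i:i + bs])
            tput = len(sentences) / (time.perf_counter() - start)
        except RuntimeError:  # out of memory: larger sizes won't fit either
            break
        if tput > best_tput:
            best_bs, best_tput = bs, tput
    return best_bs, best_tput
```

In practice you would run this once at startup against a representative sample of your corpus, then pin the winning batch size for the serving loop.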