Can I run BGE-Large-EN on NVIDIA Jetson AGX Orin 32GB?

Perfect
Yes, you can run this model!
GPU VRAM
32.0GB
Required
0.7GB
Headroom
+31.3GB

VRAM Usage

0.7GB of 32.0GB used (~2%)

Performance Estimate

Tokens/sec ~90.0
Batch size 32

Technical Analysis

The NVIDIA Jetson AGX Orin 32GB, with its Ampere architecture GPU, 1792 CUDA cores, and 32GB of LPDDR5 unified memory (shared between the CPU and GPU), provides ample resources for running the BGE-Large-EN embedding model. BGE-Large-EN is a relatively small model at 0.33B parameters and requires approximately 0.7GB of memory in FP16 precision. That leaves roughly 31.3GB of headroom, so the model and its working buffers load without memory constraints, with room for larger batch sizes and for other workloads running concurrently on the device.
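
The 0.7GB figure follows from simple arithmetic. A minimal sketch in Python, assuming FP16 weights (2 bytes per parameter) and the model's roughly 335M parameter count:

params = 335_000_000        # approximate parameter count of BGE-Large-EN
bytes_per_param = 2         # FP16 stores each weight in 2 bytes
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.2f} GB")  # ~0.67 GB
# Activations, tokenizer state, and framework overhead add a little more,
# which is why the estimate rounds up to ~0.7 GB.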

Furthermore, the Orin's memory bandwidth of 0.21 TB/s comfortably covers the data-transfer demands of BGE-Large-EN. Embedding models are not as computationally intensive as large language models, but they still benefit from high memory bandwidth, especially when processing large batches of text. The GPU's 56 Tensor Cores accelerate the matrix multiplications that dominate the embedding forward pass. With the Orin's configurable power envelope topping out at 40W, power consumption is manageable, making the device well suited to edge deployments. The expected throughput is around 90 tokens/sec, a reasonable speed for embedding workloads on an edge device.
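
To see why bandwidth is not the bottleneck, here is a back-of-envelope ceiling (an assumption-laden sketch, not a measured number): every forward pass has to stream the FP16 weights from memory at least once, so bandwidth alone caps the pass rate.

bandwidth_gbs = 210.0   # AGX Orin 32GB memory bandwidth in GB/s (0.21 TB/s)
weights_gb = 0.67       # BGE-Large-EN weights in FP16
ceiling = bandwidth_gbs / weights_gb
print(f"Weight-streaming ceiling: ~{ceiling:.0f} forward passes/sec")
# ~313 passes/sec, before counting activations, attention (quadratic in
# sequence length), and kernel launch overhead -- far more than needed.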

Recommendation

Given the substantial VRAM headroom, users can experiment with larger batch sizes to maximize throughput. Start with the estimated batch size of 32 and incrementally increase it until performance plateaus or memory errors occur. Consider using ONNX Runtime for optimized inference, which can target the Tensor Cores through its CUDA and TensorRT execution providers. Additionally, monitor the Orin's temperature during extended use, especially in thermally constrained enclosures, to prevent throttling. If performance is critical, explore quantization techniques like INT8 to further reduce the memory footprint and potentially increase processing speed, although this may come with a slight trade-off in accuracy.
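
A minimal version of that batch-size sweep, assuming the public BAAI/bge-large-en-v1.5 checkpoint, the sentence-transformers package, and a CUDA-enabled PyTorch build for Jetson:

import time
from sentence_transformers import SentenceTransformer

# Encode a fixed workload at increasing batch sizes and watch for the
# throughput plateau (or a CUDA out-of-memory error).
model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
sentences = ["A short example sentence for benchmarking."] * 2048

for batch_size in (32, 64, 128, 256):
    start = time.perf_counter()
    model.encode(sentences, batch_size=batch_size, show_progress_bar=False)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {len(sentences) / elapsed:.0f} sentences/sec")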

For deployment, keep memory management tight and minimize unnecessary copies between CPU and GPU buffers; even with the Orin's unified memory, framework-level copies still cost time. Pre-process inputs in parallel so the tokenizer never starves the GPU, and prioritize low-latency inference for real-time edge applications. If resources are constrained, offload pre-processing steps such as tokenization to the CPU.
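
A sketch of that CPU/GPU split with ONNX Runtime, assuming onnxruntime-gpu and transformers are installed and the model has already been exported to a local model.onnx (a hypothetical path; the graph's input and output names depend on how the export was produced):

import onnxruntime as ort
from transformers import AutoTokenizer

# Tokenize on the CPU; run the transformer on the GPU via the CUDA provider.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
session = ort.InferenceSession(
    "model.onnx",  # hypothetical path to an exported BGE-Large-EN graph
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

texts = ["Embed this sentence on the Jetson AGX Orin."]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512,
                return_tensors="np")
outputs = session.run(None, dict(enc))  # assumes the export kept the tokenizer's input names
# BGE models use the [CLS] token's final hidden state as the embedding.
embedding = outputs[0][:, 0, :]
print(embedding.shape)  # (1, 1024) for BGE-Large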

Recommended Settings

Batch size
32 (start value, tune upwards)
Context length
512
Other settings
Enable CUDA graph capture; use asynchronous data loading; optimize input data format
Inference framework
ONNX Runtime or TensorRT
Quantization suggested
INT8 (optional, for further optimization)
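
If you take the INT8 suggestion, ONNX Runtime's dynamic quantizer is a low-effort starting point. A minimal sketch reusing the hypothetical model.onnx export; re-check embedding quality afterwards, since INT8 can cost a little accuracy:

from onnxruntime.quantization import QuantType, quantize_dynamic

# Rewrite the graph with INT8 weights; activations are quantized on the fly.
quantize_dynamic(
    model_input="model.onnx",        # hypothetical full-precision export
    model_output="model-int8.onnx",  # quantized copy with much smaller weights
    weight_type=QuantType.QInt8,
)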

Frequently Asked Questions

Is BGE-Large-EN compatible with NVIDIA Jetson AGX Orin 32GB?
Yes, BGE-Large-EN is fully compatible with the NVIDIA Jetson AGX Orin 32GB.
What VRAM is needed for BGE-Large-EN?
BGE-Large-EN requires approximately 0.7GB of VRAM in FP16 precision.
How fast will BGE-Large-EN run on NVIDIA Jetson AGX Orin 32GB?
You can expect approximately 90 tokens/sec on the NVIDIA Jetson AGX Orin 32GB.