Since VRAM isn't a concern here, throughput is the metric to optimize. The Jetson AGX Orin's configurable TDP tops out at 60W, so power efficiency matters: choose an efficient inference framework and an appropriate quantization level. The estimated 90 tokens/sec is a reasonable starting point that optimization should improve on, and the estimated batch size of 32 is sensible for keeping the GPU's compute units saturated. Keep in mind that BGE-Small-EN is an embedding model, so 'tokens/sec' here measures how fast input text can be embedded; it does not translate to language-generation speed.
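To make that metric concrete, here is a minimal throughput check using sentence-transformers. It assumes the `BAAI/bge-small-en-v1.5` checkpoint from the Hugging Face Hub and a synthetic workload, and the average token count used to convert sentences/sec into tokens/sec is a rough assumption:

```python
# Minimal embedding-throughput check for BGE-Small-EN. The workload is
# synthetic and the checkpoint name assumes the v1.5 release on the Hub.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", device="cuda")
texts = ["a short benchmark sentence"] * 512  # synthetic workload

model.encode(texts[:32])  # warm-up pass so CUDA init isn't timed

start = time.perf_counter()
embeddings = model.encode(texts, batch_size=32)
elapsed = time.perf_counter() - start

# Approximate tokens/sec as sentences/sec times an assumed average
# token count per sentence (a rough figure for this synthetic input).
avg_tokens = 8
print(f"{len(texts) / elapsed:.1f} sentences/sec, "
      f"~{len(texts) * avg_tokens / elapsed:.0f} tokens/sec")
```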
Start with a high-performance inference framework such as ONNX Runtime or TensorRT to take advantage of the Orin's hardware acceleration; a minimal ONNX Runtime setup is sketched below. Because the model is so small, experiment with batch size to find the right balance between latency and throughput: larger batches generally raise throughput at the cost of per-request latency, and with the ample VRAM headroom you can likely push well beyond the initial estimate of 32 (see the sweep sketch below). Quantizing the model to INT8, or even INT4 where the tooling supports it, can further improve performance and cut memory-bandwidth requirements, even though VRAM usage is already minimal. Finally, profile the application to find bottlenecks and optimize accordingly.
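Here is one way the ONNX Runtime path might look, preferring the TensorRT execution provider and falling back to CUDA and then CPU. The model file path is a placeholder; the export itself (e.g. produced with Hugging Face Optimum) and a Jetson-compatible onnxruntime-gpu build are assumed to exist already:

```python
# Sketch: serving a pre-exported BGE-Small-EN ONNX model via ONNX Runtime.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
session = ort.InferenceSession(
    "bge-small-en-v1.5.onnx",  # hypothetical export path
    providers=[
        "TensorrtExecutionProvider",  # TensorRT acceleration where available
        "CUDAExecutionProvider",      # plain CUDA fallback
        "CPUExecutionProvider",
    ],
)

texts = ["an example sentence to embed"] * 64
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="np")

# Feed only the inputs the exported graph actually declares.
input_names = {i.name for i in session.get_inputs()}
feeds = {k: v for k, v in batch.items() if k in input_names}

last_hidden = session.run(None, feeds)[0]  # (batch, seq_len, hidden)
embeddings = last_hidden[:, 0]             # CLS-token pooling, as BGE uses
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
```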
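To find the latency/throughput knee empirically, a simple sweep like the following can help. It reuses the sentence-transformers setup from earlier, and the batch sizes tried are arbitrary choices:

```python
# Sketch: sweeping batch sizes to chart the latency/throughput trade-off.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5", device="cuda")
texts = ["a short benchmark sentence"] * 1024

for batch_size in (8, 16, 32, 64, 128, 256):
    model.encode(texts[:batch_size])  # warm-up at this batch size
    start = time.perf_counter()
    model.encode(texts, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:4d}  "
          f"{len(texts) / elapsed:7.1f} sentences/sec  "
          f"{elapsed / (len(texts) / batch_size) * 1e3:6.1f} ms/batch")
```

Expect throughput to plateau once the GPU is saturated; past that point, larger batches only add latency.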
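As one concrete quantization path, ONNX Runtime ships post-training dynamic quantization; the sketch below assumes the ONNX export from earlier, and the file names are placeholders. Note that dynamic quantization chiefly benefits CPU execution, so for GPU INT8 on the Orin the usual route is building a TensorRT engine with INT8 calibration instead:

```python
# Sketch: post-training dynamic INT8 quantization of the exported model
# using ONNX Runtime's quantization tools. File names are assumptions.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="bge-small-en-v1.5.onnx",
    model_output="bge-small-en-v1.5-int8.onnx",
    weight_type=QuantType.QInt8,  # quantize weights to signed 8-bit
)
```

Whichever path you take, validate embedding quality after quantization (e.g. by comparing cosine similarity against the FP16 model's outputs on a held-out sample).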
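For profiling, ONNX Runtime includes an operator-level profiler that writes a Chrome-trace JSON file; a minimal sketch, again assuming the same hypothetical model file. For system-level analysis on Jetson, NVIDIA's Nsight Systems is another option:

```python
# Sketch: enabling ONNX Runtime's built-in profiler for a session.
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # record per-operator timings
session = ort.InferenceSession(
    "bge-small-en-v1.5.onnx",  # hypothetical export path
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# ... run inference as usual ...

trace_path = session.end_profiling()  # path to the Chrome-trace JSON
print(f"profile written to {trace_path}")
```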