The NVIDIA Jetson AGX Orin 64GB, with its Ampere-architecture GPU (2048 CUDA cores and 64 Tensor Cores), provides ample resources for running the CLIP ViT-L/14 model. At roughly 0.4 billion parameters, the model needs only about 1.5GB of memory in FP16, making it an excellent fit for the Orin's 64GB of LPDDR5. One caveat: on Jetson this memory is unified, shared between the CPU and GPU rather than dedicated VRAM, so the nominal 62.5GB of headroom shrinks by whatever the OS and host-side processes consume. Even so, there is more than enough room to load and execute the model without memory constraints, even with larger batch sizes or more complex image-processing pipelines.
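As a concrete starting point, here is a minimal loading sketch. It assumes the Hugging Face `transformers` library and the `openai/clip-vit-large-patch14` checkpoint; adjust names to match your own setup.

```python
# Minimal sketch: load CLIP ViT-L/14 in FP16 on the Orin's integrated GPU
# and report memory use. Assumes the Hugging Face "transformers" package
# and the openai/clip-vit-large-patch14 checkpoint.
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda"  # the Orin's Ampere GPU
model = CLIPModel.from_pretrained(
    "openai/clip-vit-large-patch14",
    torch_dtype=torch.float16,  # halves the footprint relative to FP32
).to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# ~0.4B params x 2 bytes (FP16) ~= 0.8GB of weights; activations and the
# CUDA context account for the rest of the ~1.5GB figure cited above.
print(f"allocated: {torch.cuda.memory_allocated(device) / 1e9:.2f} GB")
```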
While memory capacity is not a limiting factor, the 0.21 TB/s (204.8 GB/s) of memory bandwidth will bound inference speed: for a model this small, performance is governed less by raw compute than by how quickly weights and activations can be streamed through memory. The 64 Tensor Cores accelerate the matrix multiplications that dominate the CLIP forward pass, so compute itself is rarely the bottleneck. The estimated 90 tokens/sec at a batch size of 32 is a reasonable starting expectation, but the actual figure will fluctuate with the specific implementation, runtime, and optimization techniques used.
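To check how close a given setup comes to that estimate, a rough throughput probe (reusing `model` and `device` from the loading sketch above) might look like this; absolute numbers will vary with the JetPack version, clock settings, and power mode.

```python
# Rough throughput probe at the quoted batch size of 32. The input shape
# matches ViT-L/14's 224x224 resolution; a real pipeline would use the
# processor's preprocessing instead of random tensors.
import time
import torch

batch = torch.randn(32, 3, 224, 224, dtype=torch.float16, device=device)
with torch.inference_mode():
    for _ in range(5):                       # warm-up iterations
        model.get_image_features(pixel_values=batch)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    iters = 20
    for _ in range(iters):
        model.get_image_features(pixel_values=batch)
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0

print(f"{iters * batch.shape[0] / dt:.1f} images/sec at batch size 32")
```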
Given the Jetson AGX Orin's limited power budget (60W TDP at its maximum power mode), optimizing for energy efficiency is crucial. Quantization (e.g., to INT8) can further reduce the memory footprint and accelerate inference, potentially improving tokens/sec, and is especially attractive given the bandwidth-bound profile described above. Experiment with different batch sizes to find the sweet spot between throughput and latency within the memory-bandwidth limits.
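One route to INT8 on Jetson is to export the vision encoder to ONNX and build a quantized engine with TensorRT's `trtexec` tool. The sketch below (reusing `model` and `device` from above) wraps the encoder so the export returns a plain tensor; the file names are placeholders, and a production INT8 build would need a representative calibration dataset rather than default activation ranges.

```python
# Export the CLIP vision encoder to ONNX as a first step toward an INT8
# TensorRT engine. File names here are illustrative.
import copy
import torch

class VisionEncoder(torch.nn.Module):
    """Wrapper so the export returns a plain tensor instead of a dataclass."""
    def __init__(self, clip_model):
        super().__init__()
        self.vision_model = clip_model.vision_model

    def forward(self, pixel_values):
        return self.vision_model(pixel_values).pooler_output

# Export in FP32 (from a deep copy, so the FP16 model above is untouched);
# TensorRT applies reduced precision at engine-build time.
encoder = VisionEncoder(copy.deepcopy(model).float()).eval()
dummy = torch.randn(32, 3, 224, 224, device=device)
torch.onnx.export(
    encoder, dummy, "clip_visual.onnx",
    input_names=["pixel_values"], output_names=["pooled"],
    opset_version=17,
)
# Then build the quantized engine on the Jetson (real INT8 deployments
# need calibration data for accurate activation ranges):
#   trtexec --onnx=clip_visual.onnx --int8 --saveEngine=clip_visual_int8.plan
```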
For deployment, consider NVIDIA's TensorRT for optimized inference. It performs graph optimizations and kernel fusion that typically yield significant gains over a stock PyTorch pipeline. Monitor GPU utilization and power consumption (for example, with the tegrastats utility that ships with JetPack) to fine-tune the model's configuration and ensure stable operation within the Jetson's thermal constraints. If higher throughput is needed and the workload can be partitioned, distributing inference across multiple devices is also an option.
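As one possible deployment path, the ONNX file from the previous sketch can be compiled into a serialized engine with TensorRT's Python bindings. This is a sketch against the TensorRT 8.x API that ships with JetPack 5; the file paths are assumptions carried over from above.

```python
# Build an FP16 TensorRT engine from the ONNX export above.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("clip_visual.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable Tensor Core FP16 kernels

engine_bytes = builder.build_serialized_network(network, config)
with open("clip_visual_fp16.plan", "wb") as f:
    f.write(engine_bytes)
```

The saved `.plan` file can then be deserialized with `trt.Runtime` and executed through an execution context; running tegrastats alongside a batch-size sweep gives a direct read on the throughput/power trade-off.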