The primary limiting factor for running large language models (LLMs) like Llama 3.3 70B is memory. In FP16 precision, the model's weights alone occupy roughly 140GB (70 billion parameters at 2 bytes each), before accounting for the KV cache and activations needed during inference. The NVIDIA Jetson AGX Orin 64GB, while a capable embedded system, provides 64GB of LPDDR5 memory that is shared between the CPU and GPU, so the budget actually available for model weights is somewhat less than 64GB. With a shortfall of at least 76GB, the model cannot be loaded in FP16 at all. Even if the weights were squeezed into memory through quantization, the roughly 205 GB/s (about 0.2 TB/s) of memory bandwidth on the Jetson AGX Orin would become the bottleneck, since token generation is largely memory-bound. The Ampere-generation tensor cores help accelerate the matrix multiplications, but limited memory capacity and bandwidth remain the dominant constraints.
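A quick back-of-the-envelope sketch makes the capacity gap concrete. The bytes-per-parameter figures below are approximations (the Q4 value includes a rough allowance for quantization scales), and the totals ignore runtime overhead such as the KV cache and the operating system:

```python
# Rough estimate of weight memory for a 70B-parameter model at common precisions.
# Figures are approximate and ignore runtime overhead (KV cache, activations, OS).

PARAMS = 70e9          # Llama 3.3 70B parameter count
JETSON_MEMORY_GB = 64  # unified memory on the Jetson AGX Orin 64GB

bytes_per_param = {
    "FP16": 2.0,
    "INT8": 1.0,
    "Q4 (~4.5 bits incl. scales)": 0.5625,
}

for precision, nbytes in bytes_per_param.items():
    size_gb = PARAMS * nbytes / 1e9
    verdict = "fits" if size_gb < JETSON_MEMORY_GB else "does not fit"
    print(f"{precision:<28} ~{size_gb:6.0f} GB -> {verdict} in {JETSON_MEMORY_GB} GB")
```

Only the 4-bit row comes in under the 64GB ceiling, and even then with limited headroom once the cache and system memory are accounted for.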
Beyond memory capacity, the 128,000-token context window of Llama 3.3 70B presents another challenge. Although the model supports this window, the KV cache grows linearly with sequence length, and prefilling a very long prompt is computationally intensive. The Jetson AGX Orin, with its 2048 CUDA cores, would struggle to process very long contexts in real time, especially since the model would already be operating at the edge of the hardware's capabilities because of the memory limitations.
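To see why long contexts are so demanding, a rough KV-cache estimate helps. The sketch below assumes the published Llama 3 70B architecture (80 layers, 8 key/value heads via grouped-query attention, head dimension 128) and FP16 cache entries; treat the numbers as approximate:

```python
# Approximate KV-cache size versus context length, assuming the Llama 3 70B
# architecture: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.

N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_FP16 = 2

def kv_cache_gb(context_tokens: int) -> float:
    # Factor of 2 covers the key and value tensors stored at every layer.
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16
    return context_tokens * bytes_per_token / 1e9

for ctx in (4_096, 32_768, 128_000):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):5.1f} GB of KV cache")
```

At the full 128K window the cache alone approaches 42GB on top of the (quantized) weights, which is why anything close to the maximum context is effectively out of reach on this device.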
Given the significant memory shortfall, running Llama 3.3 70B directly on the Jetson AGX Orin 64GB is not feasible without substantial compromises. The most practical approach is aggressive quantization, such as a 4-bit (Q4) format or lower, which brings the 70B weights down to roughly 40GB, using a framework like llama.cpp; ExLlama may also be an option if it can be built for the Jetson's ARM/CUDA platform. Model parallelism, where the model is split across multiple devices, is not an option with a single Jetson AGX Orin. A simpler alternative is a smaller model from the same family, such as Llama 3.1 8B or Llama 3.2 3B, which fit comfortably within the Jetson AGX Orin's memory.
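As a sketch of what the quantized route might look like, the snippet below uses the llama-cpp-python bindings with a hypothetical Q4_K_M GGUF file; the model path, context size, and prompt are placeholders, and a CUDA-enabled build of llama.cpp is assumed so that layers actually run on the Jetson's GPU:

```python
# Sketch: loading a 4-bit GGUF quantization of Llama 3.3 70B via llama-cpp-python.
# Model path and settings are illustrative, not a tested configuration.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.3-70b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=4096,        # keep the context modest to limit KV-cache memory
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the Jetson AGX Orin in one sentence."}],
    max_tokens=128,
)
print(output["choices"][0]["message"]["content"])
```

Keeping `n_ctx` small is deliberate: as the estimate above shows, the KV cache is the other large consumer of the shared 64GB pool.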
Alternatively, consider offloading inference to a more powerful server with sufficient GPU resources. The Jetson AGX Orin then acts as a thin client, sending requests to the server and receiving the generated text. If you must run a large model locally, be aware that the usual fallback of offloading some layers to the CPU buys little on this platform: the Jetson's CPU and GPU share a single unified memory pool, so CPU offloading frees no additional memory and mainly trades GPU compute for slower CPU compute. Focus instead on quantization, a modest context length, and minimizing memory usage throughout the inference pipeline.
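If inference is offloaded to a remote server (for example one running vLLM or llama.cpp's server, both of which expose an OpenAI-compatible HTTP endpoint), the Jetson-side client can be very small. The host name, port, and model name below are placeholders for your own deployment:

```python
# Sketch: Jetson acting as a thin client against a remote OpenAI-compatible server.
# The server URL and model name are placeholders, not real endpoints.
import json
import urllib.request

SERVER_URL = "http://inference-server.local:8000/v1/chat/completions"  # hypothetical host

payload = {
    "model": "llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Hello from the Jetson AGX Orin."}],
    "max_tokens": 128,
}

request = urllib.request.Request(
    SERVER_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    reply = json.loads(response.read())
print(reply["choices"][0]["message"]["content"])
```

This split keeps the 70B model on hardware that can hold it while the Jetson handles local sensing, pre-processing, or whatever edge workload it was chosen for.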