arXiv

Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference

June 2, 2026 · Zifan He, Rui Ma, Yizhou Sun, Jason Cong

Title: Streamlining the Memory Processing Pipeline for Efficient Large Language Model Inference

Abstract: Contemporary large language models (LLMs) rely heavily on sophisticated mechanisms for handling long contexts and generating output, such as sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to facilitate complex reasoning tasks. This study demonstrates that these diverse optimizations can be consolidated into a unified four-stage memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Systematic profiling reveals that memory processing constitutes a significant bottleneck, introducing an overhead ranging from 22% to 97% during LLM inference, with considerable variation in computational traits. Leveraging these findings, we propose that heterogeneous architectures are ideally positioned to accelerate memory processing, thereby enhancing overall inference performance. We validate this hypothesis on a GPU-FPGA hybrid system, where memory-bound, sparse, and irregular operations are offloaded to FPGAs, while compute-heavy tasks remain on the GPU. Benchmarks conducted on an AMD MI210 GPU paired with an Alveo U55C FPGA show that our approach delivers speedups of up to 2.2x and reduces energy consumption by as much as 4.7x compared to a GPU-only baseline across various LLM inference optimizations. Similar performance gains were observed on NVIDIA A100 hardware. These outcomes highlight the viability of heterogeneous systems for efficient LLM memory processing and provide critical insights for the design of future heterogeneous hardware.
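To make the four stages concrete, the following is a minimal sketch of how Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference might compose for a top-k sparse-attention-style lookup. It is an illustration under stated assumptions, not the authors' implementation: every function name is hypothetical, keys and values are plain copies of the inputs, relevancy is a dot product, and everything runs on the CPU in pure Python with no GPU or FPGA offload.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def prepare_memory(tokens: Sequence[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Stage 1 (Prepare Memory): build the memory to search over.

    Assumption: keys and values are plain copies of the inputs. A real
    system might project, compress, or index them here instead.
    """
    keys = [list(t) for t in tokens]
    values = [list(t) for t in tokens]
    return keys, values


def compute_relevancy(query: Vector, keys: Sequence[Vector]) -> List[float]:
    """Stage 2 (Compute Relevancy): score each entry with a dot product."""
    return [sum(q * k for q, k in zip(query, key)) for key in keys]


def retrieve(scores: Sequence[float], top_k: int) -> List[int]:
    """Stage 3 (Retrieval): indices of the top_k highest-scoring entries."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])  # restore positional order


def apply_to_inference(
    scores: Sequence[float], values: Sequence[Vector], selected: Sequence[int]
) -> Vector:
    """Stage 4 (Apply to Inference): softmax-weighted sum of selected values."""
    peak = max(scores[i] for i in selected)  # subtract max for stability
    weights = [math.exp(scores[i] - peak) for i in selected]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]


def run_pipeline(
    query: Vector, tokens: Sequence[Vector], top_k: int
) -> Tuple[List[int], Vector]:
    """Chain the four stages and return (selected indices, output vector)."""
    keys, values = prepare_memory(tokens)
    scores = compute_relevancy(query, keys)
    selected = retrieve(scores, top_k)
    return selected, apply_to_inference(scores, values, selected)


if __name__ == "__main__":
    demo_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    print(run_pipeline([1.0, 0.0], demo_tokens, top_k=2))
```

In this toy, stage 2 is a dense pass over every key, while stages 3 and 4 touch only the few selected entries. That contrast is a loose analogue of the split the abstract describes, where compute-heavy work stays on the GPU and sparse, irregular, memory-bound work is offloaded to the FPGA. Which stage falls on which device in the actual system is not stated in the abstract.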

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC