Transformers have revolutionized natural language processing and computer vision, but their enormous compute and memory demands pose significant challenges for hardware acceleration. In practice, end-to-end throughput is often limited by paged data movement and interconnect bandwidth, not just raw MAC count. This work proposes a unified system-accelerator co-design approach that accelerates transformer inference by jointly optimizing a novel hardware matrix accelerator and its system integration, using paged, streaming dataflows with explicit overlap of compute and data transfer. On the hardware side, we introduce MatrixFlow, a loosely coupled 16×16 systolic-array accelerator built around a block-based matrix-multiplication method: tiles are page-aligned (4 KB), only a small (≈20 KB) on-chip buffer is required, and a pipelined schedule of DMA-in, compute, and DMA-out keeps the interconnect fully utilized, emphasizing standard DMA-driven streaming over large on-chip reuse. On the system side, we develop Gem5-AcceSys, an extension of the gem5 full-system simulator that enables exploration of standard interconnects (PCIe) and configurable memory hierarchies, including Direct-Memory (DM), Direct-Cache (DC), and Device-Memory (DevMem) modes with SMMU/TLB effects. Through co-design, MatrixFlow's dataflow and the Gem5-AcceSys platform are tuned in tandem to alleviate data-movement bottlenecks without requiring specialized CPU instruction-set modifications. We validate our approach with gem5 simulations of representative transformer models (BERT and ViT) across multiple data types and system setups. Results demonstrate up to 22× speed-up in end-to-end inference over a CPU-only baseline and gains of 5×–8× over state-of-the-art loosely and tightly coupled accelerators. Furthermore, we show that a standard PCIe-based host-memory design achieves ~80% of the performance of on-device HBM.
Overall, paged streaming and pipeline overlap, not large local SRAMs, emerge as the most effective knobs for efficient transformer inference under realistic system constraints.
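To make the block-based schedule concrete, the following is a minimal functional sketch, not MatrixFlow's actual implementation: it mimics the tile-by-tile loop order in which the accelerator streams page-sized operand tiles in, accumulates them on the 16×16 array using only a small tile-sized buffer, and streams each finished output tile back out. The tile size matches the 16×16 array from the abstract; everything else (buffer layout, loop order) is an illustrative assumption.

```python
import numpy as np

TILE = 16  # matches the 16x16 systolic array described above

def blocked_matmul(A, B):
    """Block-based matmul mimicking a DMA-in / compute / DMA-out schedule.

    Each (i, j) output tile is accumulated in a small TILE x TILE buffer
    (standing in for the ~20 KB on-chip buffer); operand tiles are fetched
    one at a time, as a DMA engine would stream page-aligned tiles.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % TILE == 0 and m % TILE == 0 and k % TILE == 0
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, TILE):
        for j in range(0, m, TILE):
            acc = np.zeros((TILE, TILE), dtype=A.dtype)  # on-chip accumulator
            for p in range(0, k, TILE):
                # "DMA-in": fetch one tile each of A and B (page-aligned in HW)
                a = A[i:i + TILE, p:p + TILE]
                b = B[p:p + TILE, j:j + TILE]
                # "compute": one pass through the 16x16 MAC array
                acc += a @ b
            # "DMA-out": write back the completed C tile
            C[i:i + TILE, j:j + TILE] = acc
    return C
```

In hardware, the three labeled stages run concurrently on different tiles (double buffering), which is what lets a small buffer sustain full interconnect bandwidth; the sequential loop above shows only the data movement pattern, not the overlap.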