Large Language Models (LLMs) deliver state-of-the-art performance but demand substantial computation and memory, making deployment in resource-limited settings challenging. Field-Programmable Gate Arrays (FPGAs) offer parallelism and energy efficiency, yet most prior FPGA accelerators rely on low-level, platform-specific flows that hinder portability. This work presents oneLLM, to our knowledge the first FPGA-based LLM inference design built on Intel's oneAPI, enabling a unified high-level programming model across CPUs, GPUs, and FPGAs. Our deeply pipelined, multi-kernel hardware architecture connects specialized kernels via oneAPI pipes for on-chip streaming, reducing host–device communication. Implemented on an Intel Agilex 7 FPGA, the design runs 3× faster than a CPU implementation and 8.8× faster than a non-pipelined baseline while meeting resource constraints, demonstrating the potential of portable FPGA development for LLM acceleration. Code is available at https://github.com/custom-computing-ic/llm-oneapi-fpga.