Retrieval-Augmented Generation (RAG) systems increasingly operate on edge devices to satisfy emerging requirements for low latency, privacy preservation, and offline or bandwidth-constrained operation. While RAG architectures have matured in accuracy and efficiency, their inference pathways remain opaque, particularly on resource-limited edge deployments where conventional explainability tools are too computationally expensive. This paper introduces a lightweight, real-time explainability layer that exposes token-level and document-level attribution signals throughout the retrieval and generation stages of a RAG pipeline, without modifying model weights or requiring multi-pass decoding. The proposed layer integrates three components: (1) efficient retriever attribution using vector influence scoring over compressed embedding spaces, (2) generator-side token attribution derived from incremental logit differentials during streaming decoding, and (3) a cross-stage alignment mechanism that links retrieved evidence segments to generative token contributions under tight latency and memory budgets. We implement the framework on ARM- and CUDA-capable edge devices using an optimized operator pipeline that confines explainability computation to at most 8% additional latency overhead. Experiments across multiple retriever-generator configurations (bge-small, MiniLM, Llama-3-Instruct quantized variants) demonstrate that the method achieves attribution fidelity comparable to gradient-based and perturbation-based techniques while reducing compute cost by up to 4.2x. Evaluation under constrained settings (mobile GPU, 8–16 GB RAM, intermittent network) shows that the explainability layer remains stable even when retrieval granularity, token streaming rate, and memory pressure vary dynamically. The results establish that interpretable RAG is feasible on edge devices without sacrificing responsiveness or energy efficiency.
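The generator-side signal in component (2) can be illustrated with a minimal numpy sketch (an assumption-laden toy, not the paper's implementation; `token_attribution` and both logit vectors are hypothetical names): the attribution of one retrieved evidence segment to one streamed token is taken as the change in that token's logit when the segment is ablated from the context, so the score is available as soon as the two logit vectors for the current decode step are.

```python
import numpy as np

def token_attribution(full_logits, ablated_logits, token_id):
    """Logit differential for a single streamed token.

    full_logits:    logits at this decode step with the full retrieved context
    ablated_logits: logits at the same step with one evidence segment removed
    token_id:       index of the token actually emitted

    Returns how much the segment raised (positive) or lowered (negative)
    the logit of the emitted token.
    """
    return float(full_logits[token_id] - ablated_logits[token_id])

# Toy example: a 5-token vocabulary at one streaming decode step.
full = np.array([0.1, 2.3, -0.5, 0.9, 0.0])     # with the evidence segment
ablated = np.array([0.1, 1.1, -0.5, 0.9, 0.0])  # segment removed
print(round(token_attribution(full, ablated, token_id=1), 6))  # -> 1.2
```

Because the score reuses logits the decoder already produced, no extra backward pass or repeated full decode is needed, which is consistent with the single-pass, streaming constraint stated above.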
Beyond transparency, the proposed layer enables on-device trust assessment, debuggability, mis-retrieval detection, and provenance verification, which are critical for RAG-driven assistants in privacy-sensitive or infrastructure-limited environments. This work positions explainability as a first-class systems primitive for next-generation edge AI deployments.
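As a rough intuition for the retriever-attribution component (1), influence scoring over a compressed embedding space can be sketched as softmax-normalized similarities between a projected query and projected document vectors. Everything below is a hypothetical toy construction (`compress`, `influence_scores`, and the fixed projection matrix are illustrative assumptions, not the paper's method):

```python
import numpy as np

def compress(emb, proj):
    """Project full embeddings into a smaller space (a cheap stand-in
    for the compressed embedding store described in the abstract)."""
    return emb @ proj

def influence_scores(query_emb, doc_embs):
    """Softmax-normalized dot products: each retrieved document's share
    of influence on the retrieval decision."""
    sims = doc_embs @ query_emb
    sims = sims - sims.max()           # numerical stability
    w = np.exp(sims)
    return w / w.sum()

# Deterministic toy setup: 4-dim embeddings compressed to 2 dims.
proj = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])          # fixed projection for the demo
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.5, 0.2]])
query = np.array([0.0, 0.0, 1.0, 0.0])

scores = influence_scores(compress(query, proj), compress(docs, proj))
print(scores.argmax())  # -> 1 (the document most aligned with the query)
```

In a real pipeline the projection would come from the compressed vector store itself, and the resulting weights provide a per-document provenance signal at a cost of one small matrix-vector product per query, which fits the tight latency and memory budgets described above.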