Retrieval-Augmented Generation (RAG) systems increasingly operate on edge devices to satisfy emerging requirements for low latency, privacy preservation, and offline or bandwidth-constrained operation. While RAG architectures have matured in accuracy and efficiency, their inference pathways remain opaque, particularly on resource-limited edge deployments where conventional explainability tools are too computationally expensive. This paper introduces a lightweight, real-time explainability layer that exposes token-level and document-level attribution signals throughout the retrieval and generation stages of a RAG pipeline, without modifying model weights or requiring multi-pass decoding. The proposed layer integrates three components: (1) efficient retriever attribution using vector influence scoring over compressed embedding spaces, (2) generator-side token attribution derived from incremental logit differentials during streaming decoding, and (3) a cross-stage alignment mechanism that links retrieved evidence segments to generative token contributions under tight latency and memory budgets. We implement the framework on ARM- and CUDA-capable edge devices using an optimized operator pipeline that confines explainability computation to at most 8% additional latency overhead. Experiments across multiple retriever-generator configurations (bge-small, MiniLM, Llama-3-Instruct quantized variants) demonstrate that the method achieves attribution fidelity comparable to gradient-based and perturbation-based techniques while reducing compute cost by up to 4.2x. Evaluation under constrained settings (mobile GPU, 8–16 GB RAM, intermittent network) shows that the explainability layer remains stable even when retrieval granularity, token streaming rate, and memory pressure vary dynamically. The results establish that interpretable RAG is feasible on edge devices without sacrificing responsiveness or energy efficiency.
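The generator-side signal in component (2) can be illustrated with a minimal numpy sketch (an assumption-laden toy, not the paper's implementation; `token_attribution` and both logit vectors are hypothetical names): the attribution of one retrieved evidence segment to one streamed token is taken as the change in that token's logit when the segment is ablated from the context, so the score is available as soon as the two logit vectors for the current decode step are.

```python
import numpy as np

def token_attribution(full_logits, ablated_logits, token_id):
    """Logit differential for a single streamed token.

    full_logits:    logits at this decode step with the full retrieved context
    ablated_logits: logits at the same step with one evidence segment removed
    token_id:       index of the token actually emitted

    Returns how much the segment raised (positive) or lowered (negative)
    the logit of the emitted token.
    """
    return float(full_logits[token_id] - ablated_logits[token_id])

# Toy example: a 5-token vocabulary at one streaming decode step.
full = np.array([0.1, 2.3, -0.5, 0.9, 0.0])     # with the evidence segment
ablated = np.array([0.1, 1.1, -0.5, 0.9, 0.0])  # segment removed
print(round(token_attribution(full, ablated, token_id=1), 6))  # -> 1.2
```

Because the score reuses logits the decoder already produced, no extra backward pass or repeated full decode is needed, which is consistent with the single-pass, streaming constraint stated above.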
Beyond transparency, the proposed layer enables on-device trust assessment, debuggability, mis-retrieval detection, and provenance verification, which are critical for RAG-driven assistants in privacy-sensitive or infrastructure-limited environments. This work positions explainability as a first-class systems primitive for next-generation edge AI deployments.
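As a rough intuition for the retriever-attribution component (1), influence scoring over a compressed embedding space can be sketched as softmax-normalized similarities between a projected query and projected document vectors. Everything below is a hypothetical toy construction (`compress`, `influence_scores`, and the fixed projection matrix are illustrative assumptions, not the paper's method):

```python
import numpy as np

def compress(emb, proj):
    """Project full embeddings into a smaller space (a cheap stand-in
    for the compressed embedding store described in the abstract)."""
    return emb @ proj

def influence_scores(query_emb, doc_embs):
    """Softmax-normalized dot products: each retrieved document's share
    of influence on the retrieval decision."""
    sims = doc_embs @ query_emb
    sims = sims - sims.max()           # numerical stability
    w = np.exp(sims)
    return w / w.sum()

# Deterministic toy setup: 4-dim embeddings compressed to 2 dims.
proj = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])          # fixed projection for the demo
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.5, 0.2]])
query = np.array([0.0, 0.0, 1.0, 0.0])

scores = influence_scores(compress(query, proj), compress(docs, proj))
print(scores.argmax())  # -> 1 (the document most aligned with the query)
```

In a real pipeline the projection would come from the compressed vector store itself, and the resulting weights provide a per-document provenance signal at a cost of one small matrix-vector product per query, which fits the tight latency and memory budgets described above.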