We present SwiftInference, a distributed edge AI inference platform that achieves consistent sub-125ms P90 latency through strategic GPU placement at telecommunications sites. Across 1000 trials under mobile-realistic WiFi conditions, edge deployment demonstrates 36% faster P90 latency (125ms vs 194ms) and 3.7× lower variance (σ=27ms vs σ=100ms) than cloud infrastructure, despite using GPU hardware with 3× slower raw compute performance. While cloud providers achieve 70ms median latency through aggressive caching, this optimization creates bimodal behavior with high variance: 50% of requests experience 180-256ms latency. Edge placement delivers unimodal consistency with a 111ms median and a tight 27ms standard deviation, enabling strict P99 SLA guarantees (<185ms) that cloud providers cannot economically match. Our architecture separates the control and data planes, enabling towers to remain inbound-dark for management while accepting inference traffic via carrier on-net paths. For SLA-driven workloads that require predictable latency (autonomous vehicles, real-time voice AI, industrial robotics), variance reduction and tail-latency optimization matter more than median speed. Production deployment with matching GPU hardware (RTX PRO 6000 Blackwell) projects a 60% end-to-end latency advantage while preserving the architectural variance benefits, positioning edge inference as both faster and more consistent than cloud alternatives.
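To make the distribution argument concrete, the sketch below contrasts a bimodal (cache hit/miss) latency profile with a unimodal one and computes the same tail statistics quoted above. It is a minimal illustration, not the benchmark harness: the mixture weight, mode shapes, and NumPy-based sampling are assumptions chosen only to echo the reported figures.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000  # sample count, mirroring the 1000-trial benchmark

# Hypothetical bimodal cloud profile: roughly half the requests hit a
# cache near the 70ms median, the rest land in the 180-256ms band.
# The 50/50 weight and the mode shapes are illustrative assumptions.
cache_hit = rng.random(N) < 0.5
cloud_ms = np.where(cache_hit,
                    rng.normal(70, 10, N),
                    rng.uniform(180, 256, N))

# Hypothetical unimodal edge profile: 111ms center, 27ms spread.
edge_ms = rng.normal(111, 27, N)

for name, s in (("cloud", cloud_ms), ("edge", edge_ms)):
    print(f"{name:5s} median={np.median(s):6.1f}ms  "
          f"P90={np.percentile(s, 90):6.1f}ms  "
          f"P99={np.percentile(s, 99):6.1f}ms  "
          f"sigma={np.std(s):5.1f}ms")
```

Under these assumed distributions, the bimodal P99 lands near the top of the 180-256ms miss band, while the unimodal P99 sits at roughly 111 + 2.33·27 ≈ 174ms, comfortably under the <185ms guarantee cited above; this is the sense in which tail latency and variance, not median speed, are the metrics SLA-driven workloads should price.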