Edge intelligence deploys artificial intelligence models on edge nodes close to data sources, delivering real-time inference support for resource-constrained devices. To realize this vision, inference offloading differs from conventional computation offloading by tailoring offloading strategies to the intrinsic characteristics of AI inference tasks. However, existing research in this field generally lacks fine-grained model partitioning capabilities and long-term resource adaptability, failing to optimize resource utilization and sustain stable performance in mobile environments. To address these issues, we propose an adaptive inference acceleration framework that dynamically partitions inference models into hierarchical subtasks and offloads them to heterogeneous edge servers. We formulate a joint optimization problem over task partitioning, offloading, and resource allocation, which takes queue stability as a constraint and aims to minimize the long-term average task completion time. To achieve the optimal trade-off between latency and stability without predicting future states, we adopt Lyapunov optimization to decompose the long-term stochastic optimization into deterministic subproblems solvable slot by slot. For these per-slot subproblems, we design a Q-network Mixing (QMIX)-based multi-agent reinforcement learning method that enables collaborative strategy selection across edge servers. Simulation results show that, compared with baseline algorithms including the greedy, genetic, and MAD2RL methods, the proposed framework substantially reduces task completion time while preserving inference accuracy and queue stability.
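To make the Lyapunov decomposition concrete, the sketch below shows the standard drift-plus-penalty pattern on a deliberately simplified single-queue, two-action system. The queue dynamics Q(t+1) = max(Q(t) - b(t), 0) + a(t), the trade-off parameter V, and the specific costs and service rates here are illustrative assumptions, not the paper's actual multi-queue, multi-server formulation.

```python
# Hedged sketch of Lyapunov drift-plus-penalty control for a single queue.
# All numbers (costs, service rates, arrivals, V) are hypothetical.

def queue_update(q, arrival, service):
    """Standard queue dynamics: Q(t+1) = max(Q(t) - b(t), 0) + a(t)."""
    return max(q - service, 0.0) + arrival

def drift_plus_penalty_action(q, actions, cost, service, arrival, V):
    """Per-slot deterministic subproblem after Lyapunov decomposition:
    pick the action x minimizing V * cost[x] + Q(t) * (a(t) - b[x])."""
    return min(actions, key=lambda x: V * cost[x] + q * (arrival - service[x]))

# Toy example: "local" is cheap but slow; "edge" is costlier but drains faster.
actions = ["local", "edge"]
cost = {"local": 1.0, "edge": 4.0}     # hypothetical completion-time penalty
service = {"local": 1.0, "edge": 3.0}  # hypothetical tasks drained per slot

q = 0.0
chosen = []
for arrival in [2.0, 2.0, 2.0]:        # constant arrivals for illustration
    x = drift_plus_penalty_action(q, actions, cost, service, arrival, V=1.0)
    chosen.append(x)
    q = queue_update(q, arrival, service[x])

print(chosen)  # ['local', 'edge', 'edge']
print(q)       # 2.0
```

Note how the controller starts with the cheap local action, then switches to the faster edge server once backlog builds: the Q(t) term weights queue stability against the V-scaled latency penalty, which is exactly the trade-off the framework tunes without predicting future states.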