Deploying large object detectors on embedded edge platforms is governed by a joint trade-off among detection accuracy, end-to-end throughput, and total system power. This paper benchmarks the large variants YOLOv8l and RT-DETR-l across heterogeneous deployment runtimes on two edge platforms: the Raspberry Pi 5 with CPU and NPU offload, and the NVIDIA Jetson Orin NX with GPU acceleration. Accuracy is evaluated on COCO val2017 using mAP50-95, while throughput and energy efficiency (FPS/W) are measured on a realistic end-to-end video pipeline that includes decoding, preprocessing, inference, and post-processing. Model-execution latency is analyzed separately from pipeline throughput to avoid ambiguity between inference time and end-to-end processing rate. On the Raspberry Pi 5, CPU-only execution of large models is impractical due to multi-second per-frame latency, whereas NPU acceleration substantially improves energy efficiency for YOLOv8l, albeit with deployment constraints that can reduce accuracy. On the Jetson Orin NX, TensorRT provides the strongest deployment path for both architectures; however, the relative ranking of YOLOv8l and RT-DETR-l depends on runtime realization rather than on nominal FLOPs alone. A generalized interpretation based on normalized latency and energy per nominal GFLOP, together with a decomposition of conversion sensitivity and quantization sensitivity, shows that deployment efficiency is jointly determined by nominal compute demand, memory-system behavior, runtime overhead, and accuracy retention after export and quantization. Under the tested conditions, none of the evaluated large-model configurations reaches the strict real-time target of 25 FPS for the full pipeline, indicating that further hardware-specific optimization, smaller model variants, or both are still required for real-time edge deployment.
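The normalized metrics named above (FPS/W, and latency or energy per nominal GFLOP) follow directly from measured throughput, average board power, and a model's nominal compute demand. A minimal sketch of these relations, with hypothetical placeholder numbers rather than the paper's measured results:

```python
# Illustrative sketch (not the paper's benchmarking tooling): deriving the
# normalized efficiency metrics from raw measurements. All numeric inputs
# below are hypothetical placeholders, not values reported in the paper.

def fps_per_watt(pipeline_fps: float, avg_power_w: float) -> float:
    """Energy efficiency of the end-to-end pipeline (frames per joule)."""
    return pipeline_fps / avg_power_w

def latency_per_gflop(inference_latency_ms: float, nominal_gflops: float) -> float:
    """Model-execution latency normalized by nominal compute demand (ms/GFLOP)."""
    return inference_latency_ms / nominal_gflops

def energy_per_gflop(avg_power_w: float, inference_latency_ms: float,
                     nominal_gflops: float) -> float:
    """Energy per nominal GFLOP in millijoules: power * time / compute."""
    return avg_power_w * inference_latency_ms / nominal_gflops

# Hypothetical example: a model with 165 nominal GFLOPs per frame,
# 120 ms inference latency, 12 W average board power, 6.5 FPS pipeline rate.
print(f"FPS/W:     {fps_per_watt(6.5, 12.0):.3f}")
print(f"ms/GFLOP:  {latency_per_gflop(120.0, 165.0):.3f}")
print(f"mJ/GFLOP:  {energy_per_gflop(12.0, 120.0, 165.0):.3f}")
```

Separating inference latency (used in the per-GFLOP metrics) from pipeline throughput (used in FPS/W) mirrors the paper's distinction between model-execution time and end-to-end processing rate.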