Event cameras have gained increasing popularity in computer vision due to their ultra-high dynamic range and temporal resolution. However, event networks rely heavily on task-specific designs because of the unstructured data distribution and spatial-temporal (S-T) inhomogeneity of event data, making it hard to reuse existing architectures for new tasks. We propose OmniEvent, a unified event representation learning framework that achieves SOTA performance across diverse tasks and fully removes the need for task-specific designs. Unlike previous methods that treat event data as 3D point clouds with manually tuned S-T scaling weights, OmniEvent adopts a decouple-enhance-fuse paradigm: local feature aggregation and enhancement are performed independently in the spatial and temporal domains to avoid inhomogeneity issues. Space-filling curves are applied to enable large receptive fields while improving memory and compute efficiency. The features from the individual domains are then fused by attention to learn S-T interactions. The output of OmniEvent is a grid-shaped tensor, which enables standard vision models to process event data without architectural changes. With a unified framework and similar hyperparameters, OmniEvent outperforms task-specific SOTA methods by up to 68.2% across 3 representative tasks and 10 datasets (Fig. 1).
Published in: Proceedings of the AAAI Conference on Artificial Intelligence
Volume 40, Issue 14, pp. 11568-11576
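The abstract's use of space-filling curves for spatially local aggregation, and the "decouple" step of the decouple-enhance-fuse paradigm, can be pictured with a small sketch. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes events arrive as (x, y, t, polarity) rows, uses a Z-order (Morton) curve as one example of a space-filling curve (the paper may use a different curve, e.g. Hilbert), and the function name, variable names, and toy data are all hypothetical.

```python
import numpy as np

def morton_key(x, y, bits=10):
    """Interleave the bits of integer (x, y) pixel coordinates into a
    Z-order (Morton) key -- one simple kind of space-filling curve."""
    key = np.zeros_like(x, dtype=np.int64)
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)      # bit b of x -> even bit position
        key |= ((y >> b) & 1) << (2 * b + 1)  # bit b of y -> odd bit position
    return key

# Hypothetical toy event stream: one row per event, columns (x, y, t, polarity).
events = np.array([
    [12,    3, 0.001, 1],
    [13,    3, 0.004, 0],
    [200, 150, 0.002, 1],
    [12,    4, 0.009, 1],
])

x = events[:, 0].astype(np.int64)
y = events[:, 1].astype(np.int64)

# Spatial branch: order events along the space-filling curve, so a 1D window
# over the sorted sequence covers a spatially local neighbourhood.
spatial_order = np.argsort(morton_key(x, y))

# Temporal branch: order the same events purely by timestamp, keeping the two
# domains decoupled before any later attention-based fusion.
temporal_order = np.argsort(events[:, 2])

print(events[spatial_order])
print(events[temporal_order])
```

Sorting along the curve lets a fixed-size 1D window over the sorted sequence act as a cheap, cache-friendly spatial neighbourhood, which is the usual way space-filling curves deliver large receptive fields at modest memory and compute cost.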