Abstract

The exponential growth of multimedia data in today's highly digitized, mobile-centric society poses significant challenges for efficient data storage and rapid information retrieval. To address these bottlenecks, we propose ViT-LSH, an image-matching algorithm that integrates a Vision Transformer (ViT) with locality-sensitive hashing (LSH) to improve both matching accuracy and retrieval speed. ViT-LSH rests on two components. First, the ViT module extracts features by dividing each image into fixed-size patches and projecting them into a token sequence without down-sampling; this preserves the original image resolution and provides strong global information modeling for semantic segmentation. Second, the LSH module reduces space consumption and accelerates query processing by mapping nearby data points to identical hash values. Compared against five existing baseline schemes, our LSH approach shows markedly better time and space efficiency while maintaining high accuracy for approximate nearest-neighbor queries. The proposed algorithm follows a straightforward pipeline: image preprocessing (e.g., denoising) and segmentation, ViT-based feature extraction, dimensionality reduction, and LSH-based hash coding. Because similarity is computed directly with the Hamming distance, sorting and matching are fast. Extensive experimental results confirm that ViT-LSH substantially improves both the computational efficiency and the precision of large-scale image matching.
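The patch-tokenization step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the image size (224), patch size (16), embedding dimension (768), and the random projection weights are all hypothetical stand-ins for a trained ViT's learned patch embedding.

```python
import numpy as np

# Hypothetical configuration for illustration only; the paper does not
# specify its patch size or embedding dimension here.
IMG, PATCH, DIM = 224, 16, 768
rng = np.random.default_rng(0)

def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an HxWx3 image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    return (image.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(-1, patch * patch * c))  # (num_patches, patch*patch*3)

# Linear projection of flattened patches into token embeddings, as in ViT.
# The weights are random placeholders, not trained parameters.
W = rng.normal(size=(PATCH * PATCH * 3, DIM))
image = rng.random((IMG, IMG, 3))
tokens = patchify(image) @ W
print(tokens.shape)  # (196, 768): one token per 16x16 patch, no down-sampling
```

Note that the tokenization keeps every pixel: each of the 14 x 14 = 196 patches is projected in full, which is what lets the ViT stage preserve the original image resolution.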
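The LSH hash-coding and Hamming-distance matching steps can be sketched with random-hyperplane LSH, a standard scheme in which the sign of each random projection contributes one bit, so nearby feature vectors tend to share hash bits. The code length (64), feature dimension (128), and test vectors below are hypothetical; the paper's actual LSH construction may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
N_BITS, DIM = 64, 128  # hypothetical code length and feature dimension

# One random hyperplane per output bit; sign of the projection = bit value.
planes = rng.normal(size=(N_BITS, DIM))

def lsh_code(vec: np.ndarray) -> np.ndarray:
    """Map a feature vector to a binary hash code."""
    return (planes @ vec > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two hash codes."""
    return int(np.count_nonzero(a != b))

# A small perturbation of x should land near x in Hamming space,
# while an unrelated vector should not.
x = rng.normal(size=DIM)
near = x + 0.05 * rng.normal(size=DIM)
far = rng.normal(size=DIM)

d_near = hamming(lsh_code(x), lsh_code(near))
d_far = hamming(lsh_code(x), lsh_code(far))
print(d_near, d_far)
```

Because the codes are short binary strings, candidate images can be sorted by Hamming distance with cheap bitwise operations, which is the source of the fast sorting and matching the abstract describes.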