Scalable Visual Search

Efficient and scalable image matching using local descriptors and projective hashing

Vector quantization free (VQF) image matching using binary-valued local image descriptors and projective hashing with multiple index tables.

Over the last decade or so, visual search has converged on a basic processing pipeline built on the “bag of visual words” representation of local image descriptors.  There has been a steady stream of work on image matching that combines this representation with approximate nearest neighbor search and various downstream geometric verification strategies.  Such approaches have been applied with considerable success in the Embedded Media Marker (EMM) project.

In practice, the most computationally daunting stage can be the construction of the visual codebook, which is usually accomplished via k-means or tree-structured vector quantization.  The problem is to cluster (possibly billions of) local descriptors, and this offline clustering may need to be repeated whenever the image database changes significantly.  Each descriptor cluster is represented by one element of a visual vocabulary (codebook), and each image is in turn represented by a bag (vector) of these visual words (quantized descriptors).
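
For concreteness, the conventional pipeline can be sketched as follows, assuming the local descriptors (e.g. SIFT-like float vectors) have already been extracted and pooled. The vocabulary size, batch size, and helper names (build_codebook, bag_of_words) are illustrative assumptions, not values used by the project.

```python
# Sketch of conventional bag-of-visual-words construction (illustrative only).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(all_descriptors, n_words=10_000, seed=0):
    """Cluster the pooled (n x d) float descriptors into a visual vocabulary."""
    kmeans = MiniBatchKMeans(n_clusters=n_words, batch_size=10_000, random_state=seed)
    kmeans.fit(all_descriptors)
    return kmeans  # cluster centers play the role of visual words

def bag_of_words(kmeans, image_descriptors):
    """Quantize one image's descriptors and return an L1-normalized histogram."""
    words = kmeans.predict(image_descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(np.float32)
    return hist / max(hist.sum(), 1.0)
```

This offline clustering, and the re-clustering required when the database changes, is exactly the step that VQF removes.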

Vector Quantization Free (VQF) image search uses projective hashing in combination with binary-valued local image descriptors such as ORB to improve efficiency with negligible loss of accuracy in various matching scenarios.  Rather than clustering descriptors harvested globally from the image database, VQF defines the codebook implicitly via projective hashing: a subset of the bits of each ORB descriptor is retained (i.e. all but a small number of bits are discarded), and the resulting value serves as an address into an index table.
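
A minimal sketch of this projection step, assuming 256-bit ORB descriptors stored as 32 uint8 bytes (as OpenCV emits them); the 16-bit key width and the helper names (project, index_descriptors) are illustrative assumptions rather than parameters from the papers.

```python
# Projective hashing of a binary descriptor: keep a small fixed subset of bits
# and use them directly as an address into an index table (illustrative sketch).
import numpy as np
from collections import defaultdict

# Bit positions that survive the projection; everything else is discarded.
KEEP_BITS = np.random.default_rng(0).choice(256, size=16, replace=False)

def project(orb_descriptor):
    """Map a 32-byte ORB descriptor to a 16-bit table address."""
    bits = np.unpackbits(np.asarray(orb_descriptor, dtype=np.uint8))
    key = 0
    for b in bits[KEEP_BITS]:
        key = (key << 1) | int(b)
    return key

index_table = defaultdict(list)  # projected key -> ids of database images

def index_descriptors(image_id, descriptors):
    """Insert every ORB descriptor of a database image under its projected key."""
    for d in descriptors:
        index_table[project(d)].append(image_id)
```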

Multiple such tables, each built from a different bit subset, are created, and image matching is implemented by a voting scheme based on the number of collisions (i.e. partial matches) between the descriptors of a test image and those of a database image. Experimental results on image databases validate the expected significant increase in efficiency and scalability of the VQF approach.  The results also show improved performance over some competitive baselines in near-duplicate image search.
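
Under the same assumptions as the sketch above, the multi-table voting scheme might look like the following; the number of tables and the key width are illustrative, and the functions (keys_for, add_image, query) are hypothetical names, not the project’s API.

```python
# Multiple index tables, each with its own random bit-subset projection; a
# database image earns one vote whenever a query descriptor collides with one
# of its descriptors in some table (illustrative sketch only).
import numpy as np
from collections import Counter, defaultdict

N_TABLES, KEY_BITS = 8, 16
_rng = np.random.default_rng(0)
PROJECTIONS = [_rng.choice(256, size=KEY_BITS, replace=False) for _ in range(N_TABLES)]
TABLES = [defaultdict(list) for _ in range(N_TABLES)]  # key -> list of image ids

def keys_for(descriptors):
    """Projected table keys for an (n x 32) uint8 array of ORB descriptors."""
    bits = np.unpackbits(np.asarray(descriptors, dtype=np.uint8), axis=1)  # (n, 256)
    weights = 1 << np.arange(KEY_BITS)[::-1]
    return [bits[:, p] @ weights for p in PROJECTIONS]  # one key array per table

def add_image(image_id, descriptors):
    """Index all descriptors of a database image in every table."""
    for table, keys in zip(TABLES, keys_for(descriptors)):
        for k in keys:
            table[int(k)].append(image_id)

def query(descriptors, top_k=5):
    """Rank database images by the number of partial-match collisions (votes)."""
    votes = Counter()
    for table, keys in zip(TABLES, keys_for(descriptors)):
        for k in keys:
            votes.update(table.get(int(k), ()))
    return votes.most_common(top_k)
```

In practice a ranked shortlist like this would typically be followed by the downstream geometric verification mentioned earlier.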

Technical Contact

Related Publications

2014
Publication Details
  • ACM ICMR 2014
  • Apr 1, 2014

Abstract

Motivated by scalable partial-duplicate visual search, there has been growing interest in a wealth of compact and efficient binary feature descriptors (e.g. ORB, FREAK, BRISK). Typically, following the conventional bag-of-words strategy, binary descriptors are clustered into codewords and quantized with Hamming distance. However, such codewords formulated in Hamming space have not shown clear improvements in indexing and search performance compared to their Euclidean counterparts. In this paper, without explicit codeword construction, we explore using binary descriptors directly as codebook indices (addresses). We propose a novel approach that builds multiple index tables and checks collisions of hash values across the tables in parallel. The evaluation is performed on two public image datasets: DupImage and Holidays. The experimental results demonstrate the index efficiency and retrieval accuracy of our approach.
2011
Publication Details
  • ACM International Conference on Multimedia Retrieval (ICMR) 2011
  • Apr 17, 2011

Abstract

The Embedded Media Marker (EMM) identification system allows users to retrieve relevant dynamic media associated with a static paper document via camera phones. The user supplies a query image by capturing an EMM-signified patch of a paper document through a camera phone; the system recognizes the query and in turn retrieves and plays the corresponding media on the phone. Accurate image matching is crucial for a positive user experience in this application. To address the challenges posed by large datasets and variations in camera-phone-captured query images, we introduce a novel image matching scheme based on geometrically consistent correspondences. Two matching constraints - "injection" and "approximate global geometric consistency" (AGGC), which are unique to EMM identification, are presented. A hierarchical scheme, combined with two constraining functions, is designed to detect the "injective-AGGC" correspondences between images. A spatial neighborhood search approach is further proposed to address challenging cases with large translational shift. Experimental results on a 100k+ dataset show that our solution achieves high accuracy with low memory and time complexity and outperforms the standard bag-of-words approach.
2009
Publication Details
  • ACM Multimedia 2009 Workshop on Large-Scale Multimedia Retrieval and Mining
  • Oct 23, 2009

Abstract

We describe an efficient and scalable system for automatic image categorization. Our approach seeks to marry scalable “model-free” neighborhood-based annotation with accurate boosting-based per-tag modeling. For accelerated neighborhood-based classification, we use a set of spatial data structures as weak classifiers for an arbitrary number of categories. We employ standard edge and color features and an approximation scheme that scales to large training sets. The weak classifier outputs are combined in a tag-dependent fashion via boosting to improve accuracy. The method performs competitively with standard SVM-based per-tag classification with substantially reduced computational requirements. We present multi-label image annotation experiments using data sets of more than two million photos.
Publication Details
  • 2009 IEEE International Conference on Multimedia and Expo (ICME)
  • Jun 30, 2009

Abstract

This paper presents a tool and a novel Fast Invariant Transform (FIT) algorithm for language independent e-documents access. The tool enables a person to access an e-document through an informal camera capture of a document hardcopy. It can save people from remembering/exploring numerous directories and file names, or even going through many pages/paragraphs in one document. It can also facilitate people’s manipulation of a document or people’s interactions through documents. Additionally, the algorithm is useful for binding multimedia data to language independent paper documents. Our document recognition algorithm is inspired by the widely known SIFT descriptor [4] but can be computed much more efficiently for both descriptor construction and search. It also uses much less storage space than the SIFT approach. By testing our algorithm with randomly scaled and rotated document pages, we can achieve a 99.73% page recognition rate on the 2188-page ICME06 proceedings and 99.9% page recognition rate on a 504-page Japanese math book.