Figure 1. Inverted index maps terms onto documents that contain them.
Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, a list is created that contains the documents in which that term occurs and its frequency within each document. Retrieval algorithms then use these term frequencies alongside other collection statistics to identify matching documents for a query.
Term-based search, however, is just one example of interactive information seeking. Other examples include offering suggestions of documents similar to ones already found, or identifying effective query expansion terms that the user might wish to use. More generally, these fall into several categories: query term suggestion, relevance feedback, and pseudo-relevance feedback.
We can combine the inverted index with the notion of retrievability to create an efficient query expansion algorithm that is useful for a number of applications, such as query expansion and relevance (and pseudo-relevance) feedback. We call this kind of index a reverted index because rather than mapping terms onto documents, it maps document ids onto queries that retrieved the associated documents.
Here’s how it works:
Figure 2. Reverted index maps document ids onto terms effective at retrieving the documents.
- First, we build a regular inverted index. (Figure 1)
- Then we create a collection of queries, which we call basis queries. Basis queries can be created from query logs, or they can be synthesized in some manner. One naive approach that works reasonably well is to use terms from the inverted index that have a minimum DF.
- We evaluate each basis query against the inverted index using some ranking algorithm. In general, high-precision algorithms should work well for this, although we haven't tested this intuition formally. For each query, we record the ranked list of retrieved document ids in the reverted index. In other words, we create a virtual document with id=basis query that contains as its terms the ids of documents that were retrieved by that query. Putting all this into the reverted index (which is just a second instance of an inverted index) creates a new structure that can be queried with document ids of documents from the inverted index, and will create a ranked list of queries that are effective at retrieving the specified documents.
Once you’ve identified these basis queries, they can be used for any other purpose such as offering term suggestions, expanding queries through pseudo-relevance feedback or through user-controlled relevance feedback, etc.
How well does this work? Read our CIKM 2010 paper for the full details, but the short answer is that our query expansion technique outperforms PL2 and Bose-Einstein algorithms (as implemented in Terrier) by 15-20% on several TREC collections. This is just a first stab at implementing and evaluating this indexing, but we are quite excited by the results.
We are integrating this technique into a number of systems, including TalkMiner and Querium.
The following Java source code that defines the Reverted Indexing framework and offers an
implementation using the Lucene search engine is now available as open source under the New BSD license.
Feedback is always welcome!
Technical Contact: Gene Golovchinsky.