Document Genre Identification

Genre has been used to categorize a variety of art and composition forms, including movies, songs, and literature. While entering query terms is the traditional method of searching for documents, genre can provide a complementary, non-topical means to characterize documents and web pages, and can serve as useful metadata when indexing, organizing, and searching for documents. For example, search results can be grouped by genre, or topical search queries can be augmented by web page genre.

In addition to HTML-based documents on the Web, it is now commonplace to search for and distribute documents in formats associated with office software, such as PowerPoint or Word. PDF, a portable format, is even more widely used than these authoring formats. However, the genre of a PDF document is often unknown, and a single authoring program can be used to create documents in many different genres.

We have developed a system that identifies the genre or genres of a document based on image features. Example genres are shown in the upper figure. The system has been used to tag a corpus by genre for the DocuBrowse system. Results when the genre facet is set to 'Tech paper' are shown in the lower figure.
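
To make the idea of a genre facet concrete, the following is a minimal sketch of how genre tags might be used to filter and group a tagged corpus. It is an illustration only, not the DocuBrowse implementation: the file names, the genres other than 'Tech paper', and the helpers `filter_by_genre` and `group_by_genre` are all hypothetical. Each document carries a set of tags, since a document may belong to more than one genre.

```python
from collections import defaultdict

# Hypothetical tagged corpus: document name -> set of genre tags.
# Only 'Tech paper' comes from the text above; the rest is made up.
corpus = {
    "report_2009.pdf": {"Tech paper"},
    "kickoff.ppt": {"Slides"},
    "poster_draft.pdf": {"Poster", "Tech paper"},
}

def filter_by_genre(docs, genre):
    """Return the sorted names of documents tagged with the given genre."""
    return sorted(name for name, genres in docs.items() if genre in genres)

def group_by_genre(docs):
    """Map each genre to the sorted list of documents that carry it."""
    groups = defaultdict(list)
    for name in sorted(docs):
        for genre in sorted(docs[name]):
            groups[genre].append(name)
    return dict(groups)

if __name__ == "__main__":
    # Setting the facet to 'Tech paper' keeps only documents with that tag.
    print(filter_by_genre(corpus, "Tech paper"))
    # Grouping lists a multi-genre document under each of its genres.
    print(group_by_genre(corpus))
```

Because tags are stored as sets, a document tagged with several genres appears under each matching facet value rather than being forced into a single category.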

Technical Contact: Francine Chen.
