Document Genre Identification
Genre has been used to categorize a variety of art and composition
forms, including movies, songs, and literature. While entering query terms is the traditional method of
searching for documents, genre can provide a complementary, non-topical
means to characterize documents and web pages, and can serve as useful
metadata when indexing, organizing, and searching for documents.
For example, search results can be grouped by genre, or topical search queries can
be augmented by web page genre.
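The grouping idea can be illustrated with a minimal sketch. It assumes each search result already carries a list of genre tags. The field names (`title`, `genres`), the helper names, and every genre label other than 'Tech paper' are hypothetical and chosen only for illustration; this is not the system's actual interface.

```python
from collections import defaultdict

def group_by_genre(results):
    """Group result titles by genre; a document may belong to several genres."""
    groups = defaultdict(list)
    for doc in results:
        # Documents without any genre tag go into an "unknown" bucket.
        for genre in doc.get("genres") or ["unknown"]:
            groups[genre].append(doc["title"])
    return dict(groups)

def filter_by_genre(results, genre):
    """Return the titles of results tagged with the given genre (a genre facet)."""
    return [doc["title"] for doc in results if genre in doc.get("genres", [])]

# Hypothetical search results with made-up titles and genre tags.
results = [
    {"title": "Image-based genre identification", "genres": ["tech paper"]},
    {"title": "Quarterly update", "genres": ["slides"]},
    {"title": "Product overview", "genres": ["brochure", "slides"]},
    {"title": "Scanned memo", "genres": []},
]

grouped = group_by_genre(results)
tech_papers = filter_by_genre(results, "tech paper")
print(grouped)
print(tech_papers)
```

Because a document can carry more than one genre, the same title may appear in several groups, which is why the sketch maps each genre to a list rather than assigning one genre per document.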
In addition to HTML-based documents on the Web, it is now commonplace
to search for and distribute documents in formats associated with office software, such as
PowerPoint or Word. PDF, a portable format, is even more widely used than these document creation formats. However, the genre of a PDF document is often unknown, and a single document creation program can produce documents in many genres, so the file format alone does not reveal the genre.
We have developed a system that identifies the genre(s) of a document based on image features. Example genres are shown in the upper figure. The system has been used to tag a corpus by genre for the DocuBrowse system. The lower figure shows the results when the genre facet is set to 'Tech paper'.
Technical Contact: Francine Chen.