Genre identification for office document search and browsing

Abstract

When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve genre identification performance. In the open-set identification of four office document genres, our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to genre identification of office documents. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections.