Matthew Cooper, D.Sc.

Senior Research Scientist


Matt Cooper is a research scientist at FXPAL and currently leads the social and enterprise media group. He develops automatic content analysis technologies for interactive applications including the exploration, retrieval, and reuse of multimedia information.

At FXPAL, he has worked on a variety of projects including LoCo, MixMeet, TalkMiner, Interactive Video Search, ProjectorBox, and the FXPAL Photo Application.

He earned his B.S., M.S., and D.Sc. in Electrical Engineering at Washington University in St. Louis, and is a senior member of the IEEE and a distinguished member of the ACM.


Publications

2017

Abstract

For tourists, interactions with digital public displays often depend on specific technologies that users may not be familiar with (QR codes, NFC, Bluetooth); may not have access to because of networking issues (SMS); may lack a required app (QR codes) or device technology (NFC); may not want to use because of time constraints (WiFi, Bluetooth); or may not want to use because they are worried about sharing their data with a third-party service (text, WiFi). In this demonstration, we introduce ItineraryScanner, a system that allows users to seamlessly share content with a public travel kiosk system.
Publication Details
  • ACM Document Engineering 2017
  • Aug 30, 2017

Abstract

In this paper, we describe DocHandles, a novel system that allows users to link to specific document parts in their chat applications. As users type a message, they can invoke the tool by referring to a specific part of a document, e.g., “@fig1 needs revision”. By combining text parsing and document layout analysis, DocHandles can find and present all the figures “1” inside previously shared documents, allowing users to explicitly link to the relevant “document handle”. Documents become first-class citizens inside the conversation stream where users can seamlessly integrate documents in their text-centric messaging application.
Publication Details
  • TRECVID Workshop
  • Mar 1, 2017

Abstract

This is a summary of our participation in the TRECVID 2016 video hyperlinking task (LNK). We submitted four runs in total. A baseline system combined established vector-space text indexing and cosine similarity. Our other runs explored the use of distributed word representations in combination with fine-grained inter-segment text similarity measures.
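The vector-space baseline can be sketched compactly. The snippet below is an illustrative tf-idf/cosine implementation in Python; it is not the established indexing system actually used in the submitted runs.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute tf-idf vectors (as sparse dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Ranking segments against a query then reduces to vectorizing the query the same way and sorting segments by cosine score.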
2016
Publication Details
  • ACM MM
  • Oct 15, 2016

Abstract

The proliferation of workplace multimedia collaboration applications has meant, on one hand, more opportunities for group work but, on the other, more data locked away in proprietary interfaces. We are developing new tools to capture and access multimedia content from any source. In this demo, we focus primarily on new methods that allow users to rapidly reconstitute, enhance, and share document-based information.
Publication Details
  • Document Engineering DocEng 2016
  • Sep 13, 2016

Abstract

In this paper we describe DocuGram, a novel tool to capture and share documents from any application. As users scroll through pages of their document inside the native application (Word, Google Docs, web browser), the system captures and analyzes the video frames in real time and reconstitutes the original document pages into an easy-to-view HTML-based representation. In addition to regenerating the document pages, a DocuGram also includes the interactions users had over them, e.g. mouse motions and voice comments. A DocuGram acts as a modern copy machine, allowing users to copy and share any document from any application.
Publication Details
  • ICME 2016
  • Jul 11, 2016

Abstract

Captions are a central component in image posts that communicate the background story behind photos. Captions can enhance the engagement with audiences and are therefore critical to campaigns or advertisement. Previous studies in image captioning either rely solely on image content or summarize multiple web documents related to an image's location; both neglect users' activities. We propose business-aware latent topics as a new contextual cue for image captioning that represent user activities. The idea is to learn the typical activities of people who posted images from business venues with similar categories (e.g., fast food restaurants) to provide appropriate context for similar topics (e.g., burger) in new posts. User activities are modeled via a latent topic representation. In turn, the image captioning model can generate sentences that better reflect user activities at business venues. In our experiments, the business-aware latent topics are more effective than existing baselines for adapting captions to images captured at various businesses. Moreover, they complement other contextual cues (image, time) in a multi-modal framework.
Publication Details
  • ACM International Conference on Multimedia Retrieval (ICMR)
  • Jun 6, 2016

Abstract

We propose a method for extractive summarization of audiovisual recordings focusing on topic-level segments. We first build a content similarity graph between all segments of all documents in the collection, using word vectors from the transcripts, and then select the most central segments for the summaries. We evaluate the method quantitatively on the AMI Meeting Corpus using gold standard reference summaries and the Rouge metric, and qualitatively on lecture recordings using a novel two-tiered approach with human judges. The results show that our method compares favorably with others in terms of Rouge, and outperforms the baselines for human scores, thus also validating our evaluation protocol.
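A minimal sketch of the centrality-based selection step follows, with a simple word-overlap measure standing in for the word-vector similarity used in the paper; segment texts and the similarity function are placeholders.

```python
def select_central_segments(segments, similarity, k=2):
    """Return indices of the k segments most similar to all others."""
    n = len(segments)
    # Degree-style centrality: total similarity of each segment to the rest.
    centrality = [
        sum(similarity(segments[i], segments[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    ranked = sorted(range(n), key=lambda i: centrality[i], reverse=True)
    return sorted(ranked[:k])           # summary keeps document order

def jaccard(a, b):
    """Word-overlap similarity; a stand-in for word-vector similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

In the paper the graph spans all segments of all documents in the collection; here a single small list stands in for that collection.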
Publication Details
  • Personal and Ubiquitous Computing (Springer)
  • Feb 19, 2016

Abstract

In recent years, there has been an explosion of services that leverage location to provide users novel and engaging experiences. However, many applications fail to realize their full potential because of limitations in current location technologies. Current frameworks work well outdoors but fare poorly indoors. In this paper we present LoCo, a new framework that can provide highly accurate room-level indoor location. LoCo does not require users to carry specialized location hardware—it uses radios that are present in most contemporary devices and, combined with a boosting classification technique, provides a significant runtime performance improvement. We provide experiments that show the combined radio technique can achieve accuracy that improves on current state-of-the-art Wi-Fi only techniques. LoCo is designed to be easily deployed within an environment and readily leveraged by application developers. We believe LoCo’s high accuracy and accessibility can drive a new wave of location-driven applications and services.
2015
Publication Details
  • MM Commons Workshop co-located with ACM Multimedia 2015.
  • Oct 30, 2015

Abstract

In this paper, we analyze the association between a social media user's photo content and their interests. Visual content of photos is analyzed using state-of-the-art deep learning based automatic concept recognition. An aggregate visual concept signature is thereby computed for each user. User tags manually applied to their photos are also used to construct a tf-idf based signature per user. We also obtain social groups that users join to represent their social interests. In an effort to compare the visual-based versus tag-based user profiles with social interests, we compare corresponding similarity matrices with a reference similarity matrix based on users' group memberships. A random baseline is also included that groups users by random sampling while preserving the actual group sizes. A difference metric is proposed and it is shown that the combination of visual and text features better approximates the group-based similarity matrix than either modality individually. We also validate the visual analysis against the reference inter-user similarity using the Spearman rank correlation coefficient. Finally we cluster users by their visual signatures and rank clusters using a cluster uniqueness criteria.
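The aggregate visual signature can be illustrated as follows. This is a toy sketch: the concept names and scores are invented, and the paper's deep-learning concept detector is replaced by precomputed per-photo score dicts.

```python
import math

def user_signature(photo_concepts):
    """Average per-photo concept-score dicts into one signature per user."""
    sig = {}
    for scores in photo_concepts:
        for concept, s in scores.items():
            sig[concept] = sig.get(concept, 0.0) + s
    return {c: v / len(photo_concepts) for c, v in sig.items()}

def similarity_matrix(signatures):
    """Pairwise cosine similarities between user signatures."""
    def cos(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return [[cos(a, b) for b in signatures] for a in signatures]
```

The resulting matrix is what would then be compared against the group-membership reference matrix.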
Publication Details
  • ACM MM
  • Oct 26, 2015

Abstract

Establishing common ground is one of the key problems for any form of communication. The problem is particularly pronounced in remote meetings, in which participants can easily lose track of the details of dialogue for any number of reasons. In this demo we present a web-based tool, MixMeet, that allows teleconferencing participants to search the contents of live meetings so they can rapidly retrieve previously shared content to get on the same page, correct a misunderstanding, or discuss a new idea.
Publication Details
  • DocEng 2015
  • Sep 8, 2015

Abstract

Web-based tools for remote collaboration are quickly becoming an established element of the modern workplace. During live meetings, people share web sites, edit presentation slides, and share code editors. It is common for participants to refer to previously spoken or shared content in the course of synchronous distributed collaboration. A simple approach is to index the shared video frames, or key-frames, with Optical Character Recognition (OCR) and let users retrieve them with text queries. Here we show that a complementary approach is to look at the actions users take inside the live document streams. Based on observations of real meetings, we focus on two important signals: text editing and mouse cursor motion. We describe the detection of text and cursor motion, their implementation in our WebRTC-based system, and how users are better able to search live documents during a meeting based on these detected and indexed actions.

Abstract

Location-enabled applications now permeate the mobile computing landscape. As technologies like Bluetooth Low Energy (BLE) and Apple's iBeacon protocols begin to see widespread adoption, we will no doubt see a proliferation of indoor location enabled application experiences. While not essential to each of these applications, many will require that the location of the device be true and verifiable. In this paper, we present LocAssure, a new framework for trusted indoor location estimation. The system leverages existing technologies like BLE and iBeacons, making the solution practical and compatible with technologies that are already in use today. In this work, we describe our system, situate it within a broad location assurance taxonomy, describe the protocols that enable trusted localization in our system, and provide an analysis of early deployment and use characteristics. Through developer APIs, LocAssure can provide critical security support for a broad range of indoor location applications.
Publication Details
  • IEEE Pervasive Computing
  • Jul 1, 2015

Abstract

Tutorials are one of the most fundamental means of conveying knowledge. In this paper, we present a suite of applications that allow users to combine different types of media captured from handheld, standalone, or wearable devices to create multimedia tutorials. We conducted a study comparing standalone (camera on tripod) versus wearable capture (Google Glass). The results show that tutorial authors have a slight preference for wearable capture devices, especially when recording activities involving larger objects.
Publication Details
  • Presented in "Everyday Telepresence" workshop at CHI 2015
  • Apr 18, 2015

Abstract

As video-mediated communication reaches broad adoption, improving immersion and social interaction are important areas of focus in the design of tools for exploration and work-based communication. Here we present three threads of research focused on developing new ways of enabling exploration of a remote environment and interacting with the people and artifacts therein.
2014

Multi-modal Language Models for Lecture Video Retrieval

Publication Details
  • ACM Multimedia 2014
  • Nov 2, 2014

Abstract

We propose Multi-modal Language Models (MLMs), which adapt latent variable models for text document analysis to modeling co-occurrence relationships in multi-modal data. In this paper, we focus on the application of MLMs to indexing slide and spoken text associated with lecture videos, and subsequently employ a multi-modal probabilistic ranking function for lecture video retrieval. The MLM achieves highly competitive results against well established retrieval methods such as the Vector Space Model and Probabilistic Latent Semantic Analysis. Retrieval performance with MLMs is also shown to improve with the quality of the available extracted spoken text.
Publication Details
  • UIST 2014
  • Oct 5, 2014

Abstract

Video Text Retouch is a technique for retouching textual content found in many online videos, such as screencasts, recorded presentations, and e-learning videos. Viewed through our special, HTML5-based player, users can edit in real-time the textual content of the video frames, such as correcting typos or inserting new words between existing characters. Edits are overlaid and tracked at the desired position for as long as the original video content remains similar. We describe the interaction techniques, image processing algorithms and give implementation details of the system.
Publication Details
  • DocEng 2014
  • Sep 16, 2014

Abstract

Distributed teams must coordinate a variety of tasks. To do so, they need to be able to create, share, and annotate documents as well as discuss plans and goals. Many workflow tools support document sharing, while other tools support videoconferencing; however, there is little support for connecting the two. In this work we describe a system that allows users to share and markup content during web meetings. This shared content can provide important conversational props within the context of a meeting; it can also help users review archived meetings. Users can also extract shared content from meetings directly into other workflow tools.
Publication Details
  • Ubicomp 2014
  • Sep 9, 2014

Abstract

In recent years, there has been an explosion of social and collaborative applications that leverage location to provide users novel and engaging experiences. Current location technologies work well outdoors but fare poorly indoors. In this paper we present LoCo, a new framework that can provide highly accurate room-level location using a supervised classification scheme. We provide experiments that show this technique is orders of magnitude more efficient than current state-of-the-art Wi-Fi localization techniques. Low classification overhead and computational footprint make classification practical and efficient even on mobile devices. Our framework has also been designed to be easily deployed and leveraged by developers to help create a new wave of location-driven applications and services.

Supporting media bricoleurs

Publication Details
  • ACM interactions
  • Jul 1, 2014

Abstract

Online video is incredibly rich. A 15-minute home improvement YouTube tutorial might include 1500 words of narration, 100 or more significant keyframes showing a visual change from multiple perspectives, several animated objects, references to other examples, a tool list, comments from viewers and a host of other metadata. Furthermore, video accounts for 90% of worldwide Internet traffic. However, it is our observation that video is not widely seen as a full-fledged document; it is dismissed as a medium that, at worst, gilds over substance and, at best, simply augments text-based communications. In this piece, we suggest that negative attitudes toward multimedia documents that include audio and video are largely unfounded and arise mostly because we lack the necessary tools to treat video content as first-order media or to support seamlessly mixing media.
Publication Details
  • Fuji Xerox Technical Report, No. 23, 2014, pp. 34-42
  • Feb 20, 2014

Abstract

Video content creators invest enormous effort creating work that is in turn typically viewed passively. However, learning tasks using video requires users not only to consume the content but also to engage, interact with, and repurpose it. Furthermore, to promote learning with video in domains where content creators are not necessarily videographers, it is important that capture tools facilitate creation of interactive content. In this paper, we describe some early experiments toward this goal. A literature review coupled with formative field studies led to a system design that can incorporate a broad set of video-creation and interaction styles.
2013
Publication Details
  • Education and Information Technologies journal
  • Oct 11, 2013

Abstract

Video tends to be imbalanced as a medium. Typically, content creators invest enormous effort creating work that is then watched passively. However, learning tasks require that users not only consume video but also engage, interact with, and repurpose content. Furthermore, to promote learning across domains where content creators are not necessarily videographers, it is important that capture tools facilitate creation of interactive content. In this paper, we describe some early experiments toward this goal. Specifically, we describe a needfinding study involving interviews with amateur video creators as well as our experience with an early prototype to support expository capture and access. Our findings led to a system redesign that can incorporate a broad set of video-creation and interaction styles.
Publication Details
  • DocEng 2013
  • Sep 10, 2013

Abstract

Unlike text, copying and pasting parts of video documents is challenging. Yet, the huge amount of video documents now available in the form of how-to tutorials begs for simpler techniques that allow users to easily copy and paste fragments of video materials into new documents. We describe new direct video manipulation techniques that allow users to quickly copy and paste content from video documents such as how-to tutorials into a new document. While the video plays, users interact with the video canvas to select text regions, scrollable regions, slide sequences built up across many frames, or semantically meaningful regions such as dialog boxes. Instead of relying on the timeline to accurately select sub-parts of the video document, users navigate using familiar selection techniques such as mouse-wheel to scroll back and forward over a video region where content scrolls, double-clicks over rectangular regions to select them, or clicks and drags over textual regions of the video canvas to select them. We describe the video processing techniques that run in real-time in modern web browsers using HTML5 and JavaScript; and show how they help users quickly copy and paste video fragments into new documents, allowing them to efficiently reuse video documents for authoring or note-taking.
Publication Details
  • IUI 2013
  • Mar 19, 2013

Abstract

We describe direct video manipulation interactions applied to screen-based tutorials. In addition to using the video timeline, users of our system can quickly navigate within the video by mouse-wheel, double-click over a rectangular region to zoom in and out, or drag a box over the video canvas to select text and scrub the video until the end of a text line, even if it is not shown in the current frame. We describe the video processing techniques developed to implement these direct video manipulation techniques, and show how they are implemented to run in most modern web browsers using HTML5's CANVAS element and JavaScript.
Publication Details
  • SPIE Electronic Imaging 2013
  • Feb 3, 2013

Abstract

Video is becoming a prevalent medium for e-learning. Lecture videos contain useful information in both the visual and aural channels: the presentation slides and lecturer's speech respectively. To extract the visual information, we apply video content analysis to detect slides and optical character recognition (OCR) to obtain their text. Automatic speech recognition (ASR) is used similarly to extract spoken text from the recorded audio. These two text sources have distinct characteristics and relative strengths for video retrieval. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Experiments reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Additional experiments demonstrate higher precision video retrieval using automatically extracted slide text.  
2012
Publication Details
  • IPIN2012
  • Nov 13, 2012

Abstract

We describe Explorer, a system utilizing mirror worlds - dynamic 3D virtual models of physical spaces that reflect the structure and activities of those spaces to help support navigation, context awareness and tasks such as planning and recollection of events. A rich sensor network dynamically updates the models, determining the position of people, status of rooms, or updating textures to reflect displays or bulletin boards. Through views on web pages, portable devices, or on 'magic window' displays located in the physical space, remote people may 'look in' to the space, while people within the space are provided with augmented views showing information not physically apparent. For example, by looking at a mirror display, people can learn how long others have been present, or where they have been. People in one part of a building can get a sense of activities in the rest of the building, know who is present in their office, and look in to presentations in other rooms. A spatial graph is derived from the 3D models which is used both for navigational paths and for fusion of the acoustic, WiFi, motion and image sensors used for positioning. We describe usage scenarios for the system as deployed in two research labs, and a conference venue.
Publication Details
  • International Journal on Document Analysis and Recognition (IJDAR): Volume 15, Issue 3 (2012), pp. 167-182.
  • Sep 1, 2012

Abstract

When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve genre identification performance. In the open-set identification of four office document genres, our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to genre identification of office documents. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections.

TalkMiner: A Lecture Video Search Engine

Publication Details
  • Fuji Xerox Technical Report, No. 21, 2012, pp. 118-128
  • Feb 3, 2012

Abstract

The design and implementation of a search engine for lecture webcasts is described. A searchable text index is created allowing users to locate material within lecture videos found on a variety of websites such as YouTube and Berkeley webcasts. The searchable index is built from the text of presentation slides appearing in the video along with other associated metadata such as the title and abstract when available. The automatic identification of distinct slides within the video stream presents several challenges. For example, picture-in-picture compositing of a speaker and a presentation slide, switching cameras, and slide builds confuse basic algorithms for extracting keyframe slide images. Enhanced algorithms are described that improve slide identification. A public system was deployed to test the algorithms and the utility of the search engine at www.talkminer.com. To date, over 17,000 lecture videos have been indexed from a variety of public sources.
2011
Publication Details
  • ACM Multimedia 2011
  • Nov 28, 2011

Abstract

This paper describes methods for clustering photos that include both time stamps and location coordinates. We present versions of a two part method that first detects clusters using time and location information independently. These candidate clusters partition the set of time-ordered photos. A subset of the candidate clusters is selected by an efficient dynamic programming procedure to optimize a cost function. We propose several cost functions to design clusterings that are coherent in space, time, or both. One set of cost functions minimizes inter-photo distances directly. A second set maximizes an information measure to select clusterings for consistency in both time and space across scale.
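The selection step can be sketched as a standard interval-partition dynamic program. The sketch below uses placeholder costs; the paper's actual cost functions over space, time, or both are not reproduced here.

```python
def best_partition(n, candidates):
    """Select a minimum-cost subset of candidate clusters that partitions
    photos 0..n-1. candidates: list of (start, end, cost), end exclusive.
    Returns (total_cost, chosen) or (inf, []) if no partition exists."""
    INF = float("inf")
    best = [INF] * (n + 1)      # best[i] = min cost to cover photos [0, i)
    back = [None] * (n + 1)     # back[i] = last candidate used to reach i
    best[0] = 0.0
    for i in range(1, n + 1):
        for (s, e, c) in candidates:
            if e == i and best[s] + c < best[i]:
                best[i] = best[s] + c
                back[i] = (s, e, c)
    # Trace back the chosen chain of candidate clusters.
    chosen, i = [], n
    while i > 0 and back[i] is not None:
        chosen.append(back[i][:2])
        i = back[i][0]
    return best[n], list(reversed(chosen))
```

Because each candidate ends exactly where the next begins, any chain reaching index n is a valid partition of the time-ordered photos.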
Publication Details
  • IS&T and SPIE International Conference on Multimedia Content Access: Algorithms and Systems
  • Jan 23, 2011

Abstract

This paper describes research activities at FX Palo Alto Laboratory (FXPAL) in the area of multimedia browsing, search, and retrieval. We first consider interfaces for organization and management of personal photo collections. We then survey our work on interactive video search and retrieval. Throughout we discuss the evolution of both the research challenges in these areas and our proposed solutions.
2010

Reverted Indexing for Feedback and Expansion

Publication Details
  • ACM Conference on Information and Knowledge Management (CIKM 2010)
  • Oct 26, 2010

Abstract

Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, a list is created that contains the documents in which that term occurs and its relative frequency within each document. Retrieval algorithms then use these term frequencies alongside other collection statistics to identify the matching documents for a query. In this paper, we turn the process around: instead of indexing documents, we index query result sets. First, queries are run through a chosen retrieval system. For each query, the resulting document IDs are treated as terms and the score or rank of the document is used as the frequency statistic. An index of documents retrieved by basis queries is created. We call this index a reverted index. Finally, with reverted indexes, standard retrieval algorithms can retrieve the matching queries (as results) for a set of documents (used as queries). These recovered queries can then be used to identify additional documents, or to aid the user in query formulation, selection, and feedback.
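A toy version of the reverted-index construction, assuming some retrieval system has already produced scored result lists for a set of basis queries:

```python
from collections import defaultdict

def build_reverted_index(results):
    """results: {query: [(doc_id, score), ...]} from a retrieval system.
    Returns {doc_id: [(query, score), ...]}, each list sorted by score,
    i.e. documents are 'indexed' by the queries that retrieve them."""
    reverted = defaultdict(list)
    for query, hits in results.items():
        for doc_id, score in hits:
            reverted[doc_id].append((query, score))
    return {d: sorted(qs, key=lambda x: -x[1]) for d, qs in reverted.items()}
```

Looking up a document (or set of documents) in this structure returns the matching basis queries ranked by score, which is exactly the output used for expansion and feedback.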

TalkMiner: A Search Engine for Online Lecture Video

Publication Details
  • ACM Multimedia 2010 - Industrial Exhibits
  • Oct 25, 2010

Abstract

TalkMiner is a search engine for lecture webcasts. Lecture videos are processed to recover a set of distinct slide images and OCR is used to generate a list of indexable terms from the slides. On our prototype system, users can search and browse lists of lectures, slides in a specific lecture, and play the lecture video. Over 10,000 lecture videos have been indexed from a variety of sources. A public website now allows users to experiment with the search engine.

TalkMiner: A Lecture Webcast Search Engine

Publication Details
  • ACM Multimedia 2010
  • Oct 25, 2010

Abstract

The design and implementation of a search engine for lecture webcasts is described. A searchable text index is created allowing users to locate material within lecture videos found on a variety of websites such as YouTube and Berkeley webcasts. The index is created from words on the presentation slides appearing in the video along with any associated metadata such as the title and abstract when available. The video is analyzed to identify a set of distinct slide images, to which OCR and lexical processes are applied which in turn generate a list of indexable terms. Several problems were discovered when trying to identify distinct slides in the video stream. For example, picture-in-picture compositing of a speaker and a presentation slide, switching cameras, and slide builds confuse basic frame-differencing algorithms for extracting keyframe slide images. Algorithms are described that improve slide identification. A prototype system was built to test the algorithms and the utility of the search engine. Users can browse lists of lectures, slides in a specific lecture, or play the lecture video. Over 10,000 lecture videos have been indexed from a variety of sources. A public website will be published in mid 2010 that allows users to experiment with the search engine.
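The basic frame-differencing pass that the improved algorithms build on can be sketched as follows; frames are reduced here to plain grayscale grids and the threshold is arbitrary, so this is illustrative rather than the deployed detector.

```python
def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two equal-sized frames."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))

def extract_keyframes(frames, threshold=10.0):
    """Keep a frame as a new slide keyframe when it differs from the last
    kept frame by more than the threshold. Returns indices of kept frames."""
    kept, last = [], None
    for i, frame in enumerate(frames):
        if last is None or mean_abs_diff(frame, last) > threshold:
            kept.append(i)
            last = frame
    return kept
```

As the abstract notes, this naive rule is confused by picture-in-picture compositing, camera switches, and slide builds, which is what motivates the enhanced slide-identification algorithms.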
2009
Publication Details
  • ACM Multimedia 2009 Workshop on Large-Scale Multimedia Retrieval and Mining
  • Oct 23, 2009

Abstract

We describe an efficient and scalable system for automatic image categorization. Our approach seeks to marry scalable “model-free” neighborhood-based annotation with accurate boosting-based per-tag modeling. For accelerated neighborhood-based classification, we use a set of spatial data structures as weak classifiers for an arbitrary number of categories. We employ standard edge and color features and an approximation scheme that scales to large training sets. The weak classifier outputs are combined in a tag-dependent fashion via boosting to improve accuracy. The method performs competitively with standard SVM-based per-tag classification with substantially reduced computational requirements. We present multi-label image annotation experiments using data sets of more than two million photos.
Publication Details
  • Proceedings of TRECVID 2008 Workshop
  • Mar 1, 2009

Abstract

In 2008 FXPAL submitted results for two tasks: rushes summarization and interactive search. The rushes summarization task has been described at the ACM Multimedia workshop [1]. Interested readers are referred to that publication for details. We describe our interactive search experiments in this notebook paper.
Publication Details
  • IUI '09
  • Feb 8, 2009

Abstract

We designed an interactive visual workspace, MediaGLOW, that supports users in organizing personal and shared photo collections. The system interactively places photos with a spring layout algorithm using similarity measures based on visual, temporal, and geographic features. These similarity measures are also used for the retrieval of additional photos. Unlike traditional spring-based algorithms, our approach provides users with several means to adapt the layout to their tasks. Users can group photos in stacks that in turn attract neighborhoods of similar photos. Neighborhoods partition the workspace by severing connections outside the neighborhood. By placing photos into the same stack, users can express a desired organization that the system can use to learn a neighborhood-specific combination of distances.
2008

Interactive Multimedia Search: Systems for Exploration and Collaboration

Publication Details
  • Fuji Xerox Technical Report
  • Dec 15, 2008

Abstract

We have developed an interactive video search system that allows the searcher to rapidly assess query results and easily pivot off those results to form new queries. The system is intended to maximize the use of the discriminative power of the human searcher. The typical video search scenario we consider has a single searcher with the ability to search with text and content-based queries. In this paper, we evaluate a new collaborative modification of our search system. Using our system, two or more users with a common information need search together, simultaneously. The collaborative system provides tools, user interfaces and, most importantly, algorithmically-mediated retrieval to focus, enhance and augment the team's search and communication activities. In our evaluations, algorithmic mediation improved the collaborative performance of both retrieval (allowing a team of searchers to find relevant information more efficiently and effectively), and exploration (allowing the searchers to find relevant information that cannot be found while working individually). We present analysis and conclusions from comparative evaluations of the search system.
Publication Details
  • ACM Multimedia 2008 Workshop: TrecVid Summarization 2008 (TVS'08)
  • Oct 26, 2008

Abstract

In this paper we describe methods for video summarization in the context of the TRECVID 2008 BBC Rushes Summarization task. Color, motion, and audio features are used to segment, filter, and cluster the video. We experiment with varying the segment similarity measure to improve the joint clustering of segments with and without camera motion. Compared to our previous effort for TRECVID 2007 we have reduced the complexity of the summarization process as well as the visual complexity of the summaries themselves. We find our objective (inclusion) performance to be competitive with systems exhibiting similar subjective performance.
Publication Details
  • ACM Conf. on Image and Video Retrieval (CIVR) 2008
  • Jul 7, 2008

Abstract

We have developed an interactive video search system that allows the searcher to rapidly assess query results and easily pivot on those results to form new queries. The system is intended to maximize the use of the discriminative power of the human searcher. This is accomplished by providing a hierarchical segmentation, streamlined interface, and redundant visual cues throughout. The typical search scenario includes a single searcher with the ability to search with text and content-based queries. In this paper, we evaluate new variations on our basic search system. In particular we test the system using only visual content-based search capabilities, and using paired searchers in a realtime collaboration. We present analysis and conclusions from these experiments.

FXPAL Interactive Search Experiments for TRECVID 2007

Publication Details
  • TRECVid 2007
  • Mar 1, 2008

Abstract

In 2007 FXPAL submitted results for two tasks: rushes summarization and interactive search. The rushes summarization task has been described at the ACM Multimedia workshop. Interested readers are referred to that publication for details. We describe our interactive search experiments in this notebook paper.
2007
Publication Details
  • TRECVID Video Summarization Workshop at ACM Multimedia 2007
  • Sep 28, 2007

Abstract

This paper describes a system for selecting excerpts from unedited video and presenting the excerpts in a short summary video for efficiently understanding the video contents. Color and motion features are used to divide the video into segments where the color distribution and camera motion are similar. Segments with and without camera motion are clustered separately to identify redundant video. Audio features are used to identify clapboard appearances for exclusion. Representative segments from each cluster are selected for presentation. To increase the original material contained within the summary and reduce the time required to view the summary, selected segments are played back at a higher rate based on the amount of detected camera motion in the segment. Pitch-preserving audio processing is used to better capture the sense of the original audio. Metadata about each segment is overlaid on the summary to help the viewer understand the context of the summary segments in the original video.
Publication Details
  • IEEE Intl. Conf. on Semantic Computing
  • Sep 17, 2007

Abstract

We present methods for semantic annotation of multimedia data. The goal is to detect semantic attributes (also referred to as concepts) in clips of video via analysis of a single keyframe or set of frames. The proposed methods integrate high performance discriminative single concept detectors in a random field model for collective multiple concept detection. Furthermore, we describe a generic framework for semantic media classification capable of capturing arbitrary complex dependencies between the semantic concepts. Finally, we present initial experimental results comparing the proposed approach to existing methods.
Publication Details
  • ACM Conf. on Image and Video Retrieval 2007
  • Jul 29, 2007

Abstract

This paper describes FXPAL's interactive video search application, "MediaMagic". FXPAL has participated in the TRECVID interactive search task since 2004. In our search application we employ a rich set of redundant visual cues to help the searcher quickly sift through the video collection. A central element of the interface and underlying search engine is a segmentation of the video into stories, which allows the user to quickly navigate and evaluate the relevance of moderately-sized, semantically-related chunks.
Publication Details
  • IEEE Transactions on Multimedia
  • Apr 1, 2007

Abstract

We present a general approach to temporal media segmentation using supervised classification. Given standard low-level features representing each time sample, we build intermediate features via pairwise similarity. The intermediate features comprehensively characterize local temporal structure, and are input to an efficient supervised classifier to identify shot boundaries. We integrate discriminative feature selection based on mutual information to enhance performance and reduce processing requirements. Experimental results using large-scale test sets provided by the TRECVID evaluations for abrupt and gradual shot boundary detection are presented, demonstrating excellent performance.
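The intermediate-feature idea above can be roughly illustrated as follows. This hedged sketch computes cross-boundary pairwise similarities in a small window around each frame; a simple argmin stands in for the paper's supervised classifier and mutual-information feature selection, and all function names and toy data are hypothetical:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two low-level feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def boundary_features(frames, w=2):
    """Intermediate features: pairwise similarities spanning each candidate cut.

    For frame i, average sim(frame[i+j], frame[i+k]) over offsets j before and
    k at-or-after the candidate boundary. A real system would feed the full
    window of similarities to a trained classifier.
    """
    feats = []
    for i in range(w, len(frames) - w):
        window = [cosine_sim(frames[i + j], frames[i + k])
                  for j in range(-w, 0) for k in range(0, w)]
        feats.append(sum(window) / len(window))  # mean cross-boundary similarity
    return feats

# Toy frames: an abrupt cut between two constant color histograms.
shot_a = [[1.0, 0.0, 0.0]] * 4
shot_b = [[0.0, 1.0, 0.0]] * 4
feats = boundary_features(shot_a + shot_b)
# Low cross-boundary similarity flags a candidate shot boundary.
cut_index = min(range(len(feats)), key=lambda i: feats[i])
```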
2006
Publication Details
  • Interactive Video; Algorithms and Technologies Hammoud, Riad (Ed.) 2006, XVI, 250 p., 109 illus., Hardcover.
  • Jun 7, 2006

Abstract

This chapter describes tools for browsing and searching through video to enable users to quickly locate video passages of interest. Digital video databases containing large numbers of video programs ranging from several minutes to several hours in length are becoming increasingly common. In many cases, it is not sufficient to search for relevant videos, but rather to identify relevant clips, typically less than one minute in length, within the videos. We offer two approaches for finding information in videos. The first approach provides an automatically generated interactive multi-level summary in the form of a hypervideo. When viewing a sequence of short video clips, the user can obtain more detail on the clip being watched. For situations where browsing is impractical, we present a video search system with a flexible user interface that incorporates dynamic visualizations of the underlying multimedia objects. The system employs automatic story segmentation, and displays the results of text and image-based queries in ranked sets of story summaries. Both approaches help users to quickly drill down to potentially relevant video clips and to determine the relevance by visually inspecting the material.

Visualization in Audio-Based Music Information Retrieval

Publication Details
  • Computer Music Journal Vol. 30, Issue 2, pp. 42-62, 2006.
  • Jun 6, 2006

Abstract

Music Information Retrieval (MIR) is an emerging research area that explores how digitally stored music can be effectively organized, searched, retrieved, and browsed. The explosive growth of online music distribution, portable music players, and the falling cost of recording indicate that in the near future most of the recorded music in human history will be available digitally. MIR is steadily growing as a research area, as evidenced by the International Conference on Music Information Retrieval (ISMIR) series, soon in its sixth year, and the increasing number of MIR-related publications in the Computer Music Journal as well as other journals and conferences.
2005

Seamless presentation capture, indexing, and management

Publication Details
  • Internet Multimedia Management Systems VI (SPIE Optics East 2005)
  • Oct 26, 2005

Abstract

Technology abounds for capturing presentations. However, no simple solution exists that is completely automatic. ProjectorBox is a "zero user interaction" appliance that automatically captures, indexes, and manages presentation multimedia. It operates continuously to record the RGB information sent from presentation devices, such as a presenter's laptop, to display devices, such as a projector. It seamlessly captures high-resolution slide images, text and audio. It requires no operator, specialized software, or changes to current presentation practice. Automatic media analysis is used to detect presentation content and segment presentations. The analysis substantially enhances the web-based user interface for browsing, searching, and exporting captured presentations. ProjectorBox has been in use for over a year in our corporate conference room, and has been deployed in two universities. Our goal is to develop automatic capture services that address both corporate and educational needs.
Publication Details
  • World Conference on E-Learning in Corporate, Government, Healthcare, & Higher Education (E-Learn 2005)
  • Oct 24, 2005

Abstract

Automatic lecture capture can help students, instructors, and educational institutions. Students can focus less on note-taking and more on what the instructor is saying. Instructors can provide access to lecture archives to help students study for exams and make-up missed classes. And online lecture recordings can be used to support distance learning. For these and other reasons, there has been great interest in automatically capturing classroom presentations. However, there is no simple solution that is completely automatic. ProjectorBox is our attempt to create a "zero user interaction" appliance that automatically captures, indexes, and manages presentation multimedia. It operates continuously to record the RGB information sent from presentation devices, such as an instructor's laptop, to display devices such as a projector. It seamlessly captures high-resolution slide images, text, and audio. A web-based user interface allows students to browse, search, replay, and export captured presentations.
Publication Details
  • INTERACT 2005, LNCS 3585, pp. 781-794
  • Sep 12, 2005

Abstract

A video database can contain a large number of videos ranging from several minutes to several hours in length. Typically, it is not sufficient to search just for relevant videos, because the task still remains to find the relevant clip, typically less than one minute in length, within the video. This makes it important to direct the user's attention to the most promising material and to indicate what material they have already investigated. Based on this premise, we created a video search system with a powerful and flexible user interface that incorporates dynamic visualizations of the underlying multimedia objects. The system employs automatic story segmentation, combines text and visual search, and displays search results in ranked sets of story keyframe collages. By adapting the keyframe collages based on query relevance and indicating which portions of the video have already been explored, we enable users to quickly find relevant sections. We tested our system as part of the NIST TRECVID interactive search evaluation, and found that our user interface enabled users to find more relevant results within the allotted time than other systems employing more sophisticated analysis techniques but less helpful user interfaces.
Publication Details
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Aug 8, 2005

Abstract

Organizing digital photograph collections according to events such as holiday gatherings or vacations is a common practice among photographers. To support photographers in this task, we present similarity-based methods to cluster digital photos by time and image content. The approach is general, unsupervised, and makes minimal assumptions regarding the structure or statistics of the photo collection. We present several variants of an automatic unsupervised algorithm to partition a collection of digital photographs based either on temporal similarity alone, or on temporal and content-based similarity. First, inter-photo similarity is quantified at multiple temporal scales to identify likely event clusters. Second, the final clusters are determined according to one of three clustering goodness criteria. The clustering criteria trade off computational complexity and performance. We also describe a supervised clustering method based on learning vector quantization. Finally, we review the results of an experimental evaluation of the proposed algorithms and existing approaches on two test collections.
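A much-simplified sketch of temporal event clustering follows. It replaces the paper's multi-scale similarity analysis and clustering goodness criteria with a single fixed gap threshold; the function name and example timestamps are hypothetical:

```python
def cluster_by_time(timestamps, scale=3600.0):
    """Greedy temporal event clustering (simplified sketch).

    Photos are sorted by timestamp; a new event starts wherever the gap to
    the previous photo exceeds the scale. The paper instead scores candidate
    boundaries at multiple temporal scales against goodness criteria; a single
    fixed scale stands in for that machinery here.
    """
    ts = sorted(timestamps)
    clusters, current = [], [ts[0]]
    for prev, t in zip(ts, ts[1:]):
        if t - prev > scale:
            clusters.append(current)
            current = []
        current.append(t)
    clusters.append(current)
    return clusters

# Two bursts of photos taken roughly five hours apart (seconds since an epoch).
events = cluster_by_time([0, 60, 120, 18000, 18060], scale=3600)
```

Content-based similarity would enter by replacing the scalar time gap with a combined temporal-plus-visual distance between adjacent photos.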
Publication Details
  • International Conference on Image and Video Retrieval 2005
  • Jul 21, 2005

Abstract

Large video collections present a unique set of challenges to the search system designer. Text transcripts do not always provide an accurate index to the visual content, and the performance of visually based semantic extraction techniques is often inadequate for search tasks. The searcher must be relied upon to provide detailed judgment of the relevance of specific video segments. We describe a video search system that facilitates this user task by efficiently presenting search results in semantically meaningful units to simplify exploration of query results and query reformulation. We employ a story segmentation system and supporting user interface elements to effectively present query results at the story level. The system was tested in the 2004 TRECVID interactive search evaluations with very positive results.
Publication Details
  • 2005 IEEE International Conference on Multimedia & Expo
  • Jul 6, 2005

Abstract

A convenient representation of a video segment is a single keyframe. Keyframes are widely used in applications such as non-linear browsing and video editing. With existing methods of keyframe selection, similar video segments result in very similar keyframes, with the drawback that actual differences between the segments may be obscured. We present methods for keyframe selection based on two criteria: capturing the similarity to the represented segment, and preserving the differences from other segment keyframes, so that different segments will have visually distinct representations. We present two discriminative keyframe selection methods, and an example of experimental results.
Publication Details
  • CHI 2005 Extended Abstracts, ACM Press, pp. 1395-1398
  • Apr 1, 2005

Abstract

We present a search interface for large video collections with time-aligned text transcripts. The system is designed for users such as intelligence analysts that need to quickly find video clips relevant to a topic expressed in text and images. A key component of the system is a powerful and flexible user interface that incorporates dynamic visualizations of the underlying multimedia objects. The interface displays search results in ranked sets of story keyframe collages, and lets users explore the shots in a story. By adapting the keyframe collages based on query relevance and indicating which portions of the video have already been explored, we enable users to quickly find relevant sections. We tested our system as part of the NIST TRECVID interactive search evaluation, and found that our user interface enabled users to find more relevant results within the allotted time than those of many systems employing more sophisticated analysis techniques.
2004
Publication Details
  • ACM Multimedia 2004
  • Oct 28, 2004

Abstract

In this paper, we compare several recent approaches to video segmentation using pairwise similarity. We first review and contrast the approaches within the common framework of similarity analysis and kernel correlation. We then combine these approaches with non-parametric supervised classification for shot boundary detection. Finally, we discuss comparative experimental results using the 2002 TRECVID shot boundary detection test collection.

Shot boundary detection via similarity analysis

Publication Details
  • Proceedings TRECVID 2003
  • Mar 1, 2004

Abstract

In this paper, we present a framework for analyzing video using self-similarity. Video scenes are located by analyzing inter-frame similarity matrices. The approach is flexible to the choice of both feature parametrization and similarity measure and it is robust because the data is used to model itself. We present the approach and its application to shot boundary detection.
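The self-similarity analysis can be sketched as correlating a checkerboard kernel along the diagonal of the inter-frame similarity matrix, a common formulation in this family of methods. The code below is an illustrative toy, not the paper's implementation; the kernel size and test matrix are hypothetical:

```python
def novelty(sim, L=2):
    """Correlate a checkerboard kernel along the diagonal of matrix `sim`.

    High scores mark points where within-segment similarity is high and
    cross-segment similarity is low -- candidate shot boundaries.
    """
    n = len(sim)
    scores = [0.0] * n
    for i in range(L, n - L):
        s = 0.0
        for u in range(-L, L):
            for v in range(-L, L):
                # +1 on the two same-side blocks, -1 on the cross blocks.
                sign = 1.0 if (u < 0) == (v < 0) else -1.0
                s += sign * sim[i + u][i + v]
        scores[i] = s
    return scores

# Block-diagonal similarity matrix: two homogeneous 4-frame segments.
S = [[1.0 if (i < 4) == (j < 4) else 0.0 for j in range(8)] for i in range(8)]
scores = novelty(S)
peak = max(range(len(scores)), key=lambda i: scores[i])
```

Peaks in the novelty score are then thresholded (or picked by a classifier) to produce the boundary list.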
2003
Publication Details
  • Proc. ACM Multimedia 2003. pp. 364-373
  • Nov 1, 2003

Abstract

We present similarity-based methods to cluster digital photos by time and image content. The approach is general, unsupervised, and makes minimal assumptions regarding the structure or statistics of the photo collection. We present results for the algorithm based solely on temporal similarity, and jointly on temporal and content-based similarity. We also describe a supervised algorithm based on learning vector quantization. Finally, we include experimental results for the proposed algorithms and several competing approaches on two test collections.
Publication Details
  • 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
  • Oct 19, 2003

Abstract

We present a framework for summarizing digital media based on structural analysis. Though these methods are applicable to general media, we concentrate here on characterizing repetitive structure in popular music. In the first step, a similarity matrix is calculated from inter-frame spectral similarity. Segment boundaries, such as verse-chorus transitions, are found by correlating a kernel along the diagonal of the matrix. Once segmented, spectral statistics of each segment are computed. In the second step, segments are clustered based on the pairwise similarity of their statistics, using a matrix decomposition approach. Finally, the audio is summarized by combining segments representing the clusters most frequently repeated throughout the piece. We present results on a small corpus showing more than 90% correct detection of verse and chorus segments.
Publication Details
  • Proc. IEEE Intl. Conf. on Image Processing
  • Sep 14, 2003

Abstract

We present similarity-based methods to cluster digital photos by time and image content. This approach is general, unsupervised, and makes minimal assumptions regarding the structure or statistics of the photo collection. We describe versions of the algorithm using temporal similarity with and without content-based similarity, and compare the algorithms with existing techniques, measured against ground-truth clusters created by humans.
Publication Details
  • Human-Computer Interaction INTERACT '03, IOS Press, pp. 196-203
  • Sep 1, 2003

Abstract

With digital still cameras, users can easily collect thousands of photos. Our goal is to make organizing and browsing photos simple and quick, while retaining scalability to large collections. To that end, we created a photo management application concentrating on areas that improve the overall experience without neglecting the mundane components of such an application. Our application automatically divides photos into meaningful events such as birthdays or trips. Several user interaction mechanisms enhance the user experience when organizing photos. Our application combines a light table for showing thumbnails of the entire photo collection with a tree view that supports navigating, sorting, and filtering photos by categories such as dates, events, people, and locations. A calendar view visualizes photos over time and allows for the quick assignment of dates to scanned photos. We fine-tuned our application by using it with large personal photo collections provided by several users.
Publication Details
  • Proc. SPIE Storage and Retrieval for Multimedia Databases, Vol. 5021, pp. 167-75
  • Jan 20, 2003

Abstract

We present a framework for analyzing the structure of digital media streams. Though our methods work for video, text, and audio, we concentrate on detecting the structure of digital music files. In the first step, spectral data is used to construct a similarity matrix calculated from inter-frame spectral similarity. The digital audio can be robustly segmented by correlating a kernel along the diagonal of the similarity matrix. Once segmented, spectral statistics of each segment are computed. In the second step, segments are clustered based on the self-similarity of their statistics. This reveals the structure of the digital music in a set of segment boundaries and labels. Finally, the music can be summarized by selecting clusters with repeated segments throughout the piece. The summaries can be customized for various applications based on the structure of the original music.
2002
Publication Details
  • IEEE Multimedia Signal Processing Workshop
  • Dec 11, 2002

Abstract

We present a novel approach to automatically extracting summary excerpts from audio and video. Our approach is to maximize the average similarity between the excerpt and the source. We first calculate a similarity matrix by comparing each pair of time samples using a quantitative similarity measure. To determine the segment with highest average similarity, we maximize the summation of the self-similarity matrix over the support of the segment. To select multiple excerpts while avoiding redundancy, we compute the non-negative matrix factorization (NMF) of the similarity matrix into its essential structural components. We then build a summary comprised of excerpts from the main components, selecting the excerpts for maximum average similarity within each component. Variations integrating segmentation and other information are also discussed, and experimental results are presented.
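The maximum-average-similarity criterion can be sketched directly: sum the similarity matrix over a candidate excerpt's support and keep the best start position. The NMF-based multi-excerpt selection is omitted, and the function name and toy data below are hypothetical:

```python
def best_excerpt(sim, length):
    """Find the excerpt whose average similarity to the whole source is maximal.

    sim[i][j] is the similarity of time samples i and j. The score of an
    excerpt starting at `start` is the sum of sim over its rows, i.e. the
    excerpt's total similarity to the entire source; with fixed length,
    maximizing the sum maximizes the average.
    """
    n = len(sim)
    row_sums = [sum(row) for row in sim]
    best_start, best_score = 0, float("-inf")
    for start in range(n - length + 1):
        score = sum(row_sums[start:start + length])
        if score > best_score:
            best_start, best_score = start, score
    return best_start

# Toy source: samples 2-5 form a large self-similar block (e.g. a repeated chorus).
S = [[1.0 if 2 <= i <= 5 and 2 <= j <= 5 else (1.0 if i == j else 0.1)
      for j in range(8)] for i in range(8)]
start = best_excerpt(S, 3)
```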
Publication Details
  • ACM Multimedia 2002
  • Dec 1, 2002

Abstract

We present methods for automatic and semi-automatic creation of music videos, given an arbitrary audio soundtrack and source video. Significant audio changes are automatically detected; similarly, the source video is automatically segmented and analyzed for suitability based on camera motion and exposure. Video with excessive camera motion or poor contrast is penalized with a high unsuitability score, and is more likely to be discarded in the final edit. High quality video clips are then automatically selected and aligned in time with significant audio changes. Video clips are adjusted to match the audio segments by selecting the most suitable region of the desired length. Besides a fully automated solution, our system can also start with clips manually selected and ordered using a graphical interface. The video is then created by truncating the selected clips (preserving the high quality portions) to produce a video digest that is synchronized with the soundtrack music, thus enhancing the impact of both.
Publication Details
  • 2002 International Symposium on Music Information Retrieval
  • Oct 13, 2002

Abstract

We present methods for automatically producing summary excerpts or thumbnails of music. To find the most representative excerpt, we maximize the average segment similarity to the entire work. After window-based audio parameterization, a quantitative similarity measure is calculated between every pair of windows, and the results are embedded in a 2-D similarity matrix. Summing the similarity matrix over the support of a segment results in a measure of how similar that segment is to the whole. This measure is maximized to find the segment that best represents the entire work. We discuss variations on the method, and present experimental results for orchestral music, popular songs, and jazz. These results demonstrate that the method finds significantly representative excerpts, using very few assumptions about the source audio.

Audio Retrieval by Rhythmic Similarity

Publication Details
  • 2002 International Symposium on Music Information Retrieval
  • Oct 13, 2002

Abstract

We present a method for characterizing both the rhythm and tempo of music. We also present ways to quantitatively measure the rhythmic similarity between two or more works of music. This allows rhythmically similar works to be retrieved from a large collection. A related application is to sequence music by rhythmic similarity, thus providing an automatic "disc jockey" function for musical libraries. Besides specific analysis and retrieval methods, we present small-scale experiments that demonstrate ranking and retrieving musical audio by rhythmic similarity.
Publication Details
  • Proceedings IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, August 2002
  • Aug 26, 2002

Abstract

We present a method for rapidly and robustly extracting audio excerpts without the overhead of speech recognition or speaker segmentation. An immediate application is to automatically augment keyframe-based video summaries with informative audio excerpts associated with the video segments represented by the keyframes. Short audio clips combined with keyframes comprise an extremely lightweight and Web-browsable interface for auditioning video or similar media, without using bandwidth-intensive streaming video or audio.
Publication Details
  • IEEE International Conference on Multimedia and Expo 2002
  • Aug 26, 2002

Abstract

This paper presents a camera system called FlySPEC. In contrast to a traditional camera system that provides the same video stream to every user, FlySPEC can simultaneously serve different video-viewing requests. This flexibility allows users to conveniently participate in a seminar or meeting at their own pace. Meanwhile, the FlySPEC system provides a seamless blend of manual control and automation. With this control mix, users can easily make tradeoffs between video capture effort and video quality. The FlySPEC camera is constructed by installing a set of Pan/Tilt/Zoom (PTZ) cameras near a high-resolution panoramic camera. While the panoramic camera provides the basic functionality of serving different viewing requests, the PTZ cameras are managed by our algorithm to improve the overall video quality for users watching details. The video resolution improvements from using different camera management strategies are compared in the experimental section.
2001
Publication Details
  • In Workshop on Identifying Objects Across Variations in Lighting: Psychophysics & Computation, Proc. IEEE Intl. Conf. on Computer Vision & Pattern Recognition 2001.
  • Dec 12, 2001

Abstract

In this paper, we document an extension to traditional pattern-theoretic object templates to jointly accommodate variations in object pose and in the radiant appearance of the object surface. We first review classical object templates accommodating pose variation. We then develop an efficient subspace representation for the object radiance indexed on the surface of the three dimensional object template. We integrate the low-dimensional representation for the object radiance, or signature, into the pattern-theoretic template, and present the results of orientation estimation experiments. The experiments demonstrate both estimation performance fluctuations under varying illumination conditions and performance degradations associated with unknown scene illumination. We also present a Bayesian approach for estimation accommodating illumination variability.
Publication Details
  • In Proceedings of the International Conference on Image Processing, Thessaloniki, Greece. October 7-10, 2001.
  • Oct 7, 2001

Abstract

In this paper, we present a novel framework for analyzing video using self-similarity. Video scenes are located by analyzing inter-frame similarity matrices. The approach is flexible to the choice of similarity measure and is robust and data-independent because the data is used to model itself. We present the approach and its application to scene boundary detection. This is shown to dramatically outperform a conventional scene-boundary detector that uses a histogram-based measure of frame difference.
Publication Details
  • Proc. International Conference on Computer Music (ICMC), Habana, Cuba, September 2001.
  • Sep 12, 2001

Abstract

This paper presents a novel approach to visualizing the time structure of musical waveforms. The acoustic similarity between any two instants of an audio recording is displayed in a static 2D representation, which makes structural and rhythmic characteristics visible. Unlike practically all prior work, this method characterizes self-similarity rather than specific audio attributes such as pitch or spectral features. Examples are presented for classical and popular music.