Publications

By Lynn Wilcox

2002
Publication Details
  • ACM Multimedia 2002
  • Dec 1, 2002

Abstract

FlySPEC is a video camera system designed for real-time remote operation. Its hybrid design combines the high resolution of an optomechanical video camera with the wide field of view always available from a panoramic camera. The control system integrates requests from multiple users, so that each effectively controls a virtual camera, and it seamlessly blends manual and fully automatic operation: it supports a range of options from untended automatic to full manual control, and it can learn control strategies from user requests. Additionally, the panoramic view is always available for an intuitive interface, and objects are never out of view regardless of the zoom factor. We present the system architecture, an information-theoretic approach to combining panoramic and zoomed images to optimally satisfy user requests, and experimental results showing that FlySPEC significantly assists users in a remote inspection task.
Publication Details
  • Proceedings IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, August 2002
  • Aug 26, 2002

Abstract

We present a method for rapidly and robustly extracting audio excerpts without the overhead of speech recognition or speaker segmentation. An immediate application is to automatically augment keyframe-based video summaries with informative audio excerpts associated with the video segments represented by the keyframes. Short audio clips combined with keyframes comprise an extremely lightweight and Web-browsable interface for auditioning video or similar media, without using bandwidth-intensive streaming video or audio.
Publication Details
  • IEEE International Conference on Multimedia and Expo 2002
  • Aug 26, 2002

Abstract

This paper presents a camera system called FlySPEC. In contrast to a traditional camera system that provides the same video stream to every user, FlySPEC can serve different video-viewing requests simultaneously. This flexibility allows users to conveniently participate in a seminar or meeting at their own pace. Meanwhile, the FlySPEC system provides a seamless blend of manual control and automation. With this control mix, users can easily make tradeoffs between video capture effort and video quality. The FlySPEC camera is constructed by installing a set of Pan/Tilt/Zoom (PTZ) cameras near a high-resolution panoramic camera. While the panoramic camera provides the basic functionality of serving different viewing requests, the PTZ cameras are managed by our algorithm to improve overall video quality, particularly for users watching fine details. The experimental section compares the video resolution improvements obtained with different camera management strategies.
Publication Details
  • SPIE ITCOM 2002
  • Jul 31, 2002

Abstract

We present a framework, motivated by rate-distortion theory and the human visual system, for optimally representing the real world given limited video resolution. To provide users with high fidelity views, we built a hybrid video camera system that combines a fixed wide-field panoramic camera with a controllable pan/tilt/zoom (PTZ) camera. In our framework, a video frame is viewed as a limited-frequency representation of some "true" image function. Our system combines outputs from both cameras to construct the highest fidelity views possible, and controls the PTZ camera to maximize information gain available from higher spatial frequencies. In operation, each remote viewer is presented with a small panoramic view of the entire scene, and a larger close-up view of a selected region. Users may select a region by marking the panoramic view. The system operates the PTZ camera to best satisfy requests from multiple users. When no regions are selected, the system automatically operates the PTZ camera to minimize predicted video distortion. High-resolution images are cached and sent if a previously recorded region has not changed and the PTZ camera is pointed elsewhere. We present experiments demonstrating that the panoramic image can effectively predict where to gain the most information, and also that the system provides better images to multiple users than conventional camera systems.
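
A minimal Python sketch of the control idea in this abstract, under the assumption that high-spatial-frequency energy in the panoramic image is a workable proxy for the information a zoomed view would add (hypothetical names; a sketch of the idea, not the authors' implementation):

    import numpy as np

    def highpass_energy(gray):
        """Per-pixel high-frequency energy via a simple Laplacian filter."""
        lap = (-4.0 * gray
               + np.roll(gray, 1, axis=0) + np.roll(gray, -1, axis=0)
               + np.roll(gray, 1, axis=1) + np.roll(gray, -1, axis=1))
        return lap ** 2

    def best_ptz_region(panorama, candidate_regions):
        """Pick the candidate (x, y, w, h) region whose panoramic image
        content promises the largest information gain. User-requested
        regions would be scored the same way and take priority when present."""
        energy = highpass_energy(panorama.astype(float))
        def gain(region):
            x, y, w, h = region
            return energy[y:y + h, x:x + w].sum()
        return max(candidate_regions, key=gain)

When no user requests are pending, scanning candidate regions this way approximates the automatic mode of minimizing predicted distortion.
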
2001
Publication Details
  • IEEE Computer, 34(9), pp. 61-67
  • Sep 1, 2001

Abstract

To meet the diverse needs of business, education, and personal video users, the authors developed three visual interfaces that help identify potentially useful or relevant video segments. In such interfaces, keyframes (still images automatically extracted from video footage) can distinguish videos, summarize them, and provide access points. Well-chosen keyframes enhance a listing's visual appeal and help users select videos. Keyframe selection can vary depending on the application's requirements: a visual summary of a video-captured meeting may require only a few highlight keyframes; a video editing system might need a keyframe for every clip; and a browsing interface requires an even distribution of keyframes over the video's full length. The authors conducted user studies for each of their three interfaces, gathering input for subsequent interface improvements. The studies revealed that finding a similarity measure for collecting video clips into groups that more closely match human perception poses a challenge. Another challenge is to further improve the video-segmentation algorithm used for selecting keyframes. A new version will provide users with more information and control without sacrificing the interface's ease of use.
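
As one illustration of the "even distribution" policy mentioned above, a small hedged sketch (hypothetical data layout, not the authors' algorithm): pick k target times spread over the video and snap each to the nearest shot-representative frame.

    def even_keyframes(shot_frame_times, duration, k):
        """shot_frame_times: sorted timestamps of candidate keyframes,
        roughly one per shot. Returns k of them spread evenly over
        [0, duration]; sparse shots can yield duplicates."""
        targets = [duration * (i + 0.5) / k for i in range(k)]
        return [min(shot_frame_times, key=lambda t: abs(t - target))
                for target in targets]
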

Publication Details
  • In Proceedings of Human-Computer Interaction (INTERACT '01), IOS Press, Tokyo, Japan, pp. 464-471
  • Jul 9, 2001

Abstract

Hitchcock is a system to simplify the process of editing video. Its key features are the use of automatic analysis to find the best quality video clips, an algorithm to cluster those clips into meaningful piles, and an intuitive user interface for combining the desired clips into a final video. We conducted a user study to determine how the automatic clip creation and pile navigation support users in the editing process. The study showed that users liked the ease-of-use afforded by automation, but occasionally had problems navigating and overriding the automated editing decisions. These findings demonstrate the need for a proper balance between automation and user control. Thus, we built a new version of Hitchcock that retains the automatic editing features, but provides additional controls for navigation and for allowing users to modify the system decisions.
2000
Publication Details
  • In Proceedings of UIST '00, ACM Press, pp. 81-89, 2000.
  • Nov 4, 2000

Abstract

Hitchcock is a system that allows users to easily create custom videos from raw video shot with a standard video camera. In contrast to other video editing systems, Hitchcock uses automatic analysis to determine the suitability of portions of the raw video. Unsuitable video typically has fast or erratic camera motion. Hitchcock first analyzes video to identify the type and amount of camera motion: fast pan, slow zoom, etc. Based on this analysis, a numerical "unsuitability" score is computed for each frame of the video. Combined with standard editing rules, this score is used to identify clips for inclusion in the final video and to select their start and end points. To create a custom video, the user drags keyframes corresponding to the desired clips into a storyboard. Users can lengthen or shorten the clip without specifying the start and end frames explicitly. Clip lengths are balanced automatically using a spring-based algorithm.
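
The spring metaphor for balancing clip lengths admits a simple closed-form reading: treat each clip as a spring with a rest length (its preferred duration) and a stiffness, and minimize total spring energy subject to a fixed total duration. A hedged sketch of that reading (an illustrative formulation, not necessarily Hitchcock's exact algorithm):

    def balance_clips(rest_lengths, stiffness, total):
        """Minimize sum_i k_i * (l_i - r_i)**2 subject to sum_i l_i == total.
        The Lagrange conditions give l_i = r_i + lam / k_i, with lam chosen
        so the lengths add up to the target."""
        slack = total - sum(rest_lengths)
        lam = slack / sum(1.0 / k for k in stiffness)
        return [r + lam / k for r, k in zip(rest_lengths, stiffness)]

A production version would also clamp each length to the clip's usable range and redistribute iteratively; the sketch omits that step.
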
Publication Details
  • In Proceedings of IEEE International Conference on Multimedia and Expo, vol. III, pp. 1329-1332, 2000.
  • Jul 30, 2000

Abstract

We describe a genetic segmentation algorithm for video. This algorithm operates on segments of a string representation. It is similar to both classical genetic algorithms that operate on bits of a string and genetic grouping algorithms that operate on subsets of a set. For evaluating segmentations, we define similarity adjacency functions, which are extremely expensive to optimize with traditional methods. The evolutionary nature of genetic algorithms offers a further advantage by enabling incremental segmentation. Applications include video summarization and indexing for browsing, plus adapting to user access patterns.
Publication Details
  • In Proceedings of the Genetic and Evolutionary Computation Conference, Morgan Kaufmann Publishers, pp. 666-673, 2000.
  • Jul 8, 2000

Abstract

We describe a genetic segmentation algorithm for image data streams and video. This algorithm operates on segments of a string representation. It is similar to both classical genetic algorithms that operate on bits of a string and genetic grouping algorithms that operate on subsets of a set. It employs a segment fair crossover operation. For evaluating segmentations, we define similarity adjacency functions, which are extremely expensive to optimize with traditional methods. The evolutionary nature of genetic algorithms offers a further advantage by enabling incremental segmentation. Applications include browsing and summarizing video and collections of visually rich documents, plus a way of adapting to user access patterns.
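
A compact sketch of the kind of genetic loop the paper describes: an individual is a set of boundary indices into the frame string, crossover mixes the parents' boundaries, and fitness is a stand-in similarity-adjacency score that rewards high within-segment similarity and penalizes similarity across cuts. The fitness function and parameters here are illustrative assumptions, not the paper's definitions.

    import random

    def fitness(bounds, sim):
        """Stand-in similarity-adjacency score. sim[i][j] is the similarity
        of frames i and j; bounds are sorted interior boundary indices."""
        cuts = [0] + list(bounds) + [len(sim)]
        score = 0.0
        for a, b in zip(cuts, cuts[1:]):
            score += sum(sim[i][j] for i in range(a, b) for j in range(a, b))
            if b < len(sim):  # penalize similarity straddling the cut
                score -= 2.0 * sum(sim[i][b] for i in range(a, b))
        return score

    def evolve(sim, n_bounds, pop_size=30, generations=50):
        """Evolve boundary sets for a sequence of len(sim) frames."""
        n = len(sim)
        pop = [sorted(random.sample(range(1, n), n_bounds)) for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=lambda b: fitness(b, sim), reverse=True)
            parents = pop[:pop_size // 2]
            children = []
            while len(parents) + len(children) < pop_size:
                p1, p2 = random.sample(parents, 2)
                pool = sorted(set(p1) | set(p2))        # mix parent boundaries
                child = set(random.sample(pool, n_bounds))
                if random.random() < 0.2:               # mutation: nudge one boundary
                    b = random.choice(sorted(child))
                    child.discard(b)
                    child.add(max(1, min(n - 1, b + random.choice((-1, 1)))))
                while len(child) < n_bounds:            # repair collisions
                    child.add(random.randrange(1, n))
                children.append(sorted(child))
            pop = parents + children
        return max(pop, key=lambda b: fitness(b, sim))

Because survivors carry over between generations, the same loop supports the incremental segmentation the abstract mentions: the previous population can be reused when new frames arrive.
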
Publication Details
  • In RIAO'2000 Conference Proceedings, Content-Based Multimedia Information Access, C.I.D., pp. 637-648, 2000.
  • Apr 12, 2000

Abstract

We present an interactive system that allows a user to locate regions of video that are similar to a video query. Segments of video can thus be found simply by providing an example of the video of interest. The user selects a video segment for the query from either a static frame-based interface or a video player. A statistical model of the query is calculated on-the-fly, and is used to find similar regions of video. The similarity measure is based on a Gaussian model of reduced frame image transform coefficients. Similarity in a single video is displayed in the Metadata Media Player. The player can be used to navigate through the video by jumping between regions of similarity. Similarity can be rapidly calculated for multiple video files as well. These results are displayed in MBase, a Web-based video browser that allows similarity in multiple video files to be visualized simultaneously.
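
A hedged sketch of the similarity measure outlined above: reduce each frame with a 2-D DCT (a plausible stand-in for the frame image transform), fit a Gaussian to the query frames' coefficients, and score every frame of the target video by log-likelihood. Names and sizes are illustrative.

    import numpy as np
    from scipy.fft import dctn
    from scipy.stats import multivariate_normal

    def frame_features(frames, keep=4):
        """Reduce each 2-D grayscale frame to its keep x keep lowest-order
        DCT coefficients, flattened."""
        return np.array([dctn(f, norm='ortho')[:keep, :keep].ravel()
                         for f in frames])

    def similarity_scores(query_frames, video_frames):
        """Log-likelihood of each video frame under a Gaussian model of
        the query; higher means more similar."""
        q = frame_features(query_frames)
        mean, cov = q.mean(axis=0), np.cov(q, rowvar=False)
        cov += 1e-6 * np.eye(cov.shape[0])   # regularize short queries
        model = multivariate_normal(mean, cov)
        return model.logpdf(frame_features(video_frames))
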
1999
Publication Details
  • In Proceedings of ACM Multimedia '99, Orlando, Florida, November 1999.
  • Oct 30, 1999

Abstract

NoteLook is a client-server system designed and built to support multimedia note taking in meetings with digital video and ink. It is integrated into a conference room equipped with computer-controllable video cameras, a video conference camera, and a large-screen rear video projector. The NoteLook client application runs on wireless pen-based notebook computers. Video channels containing images of the room activity and presentation material are transmitted by the NoteLook servers to the clients, and the images can be interactively and automatically incorporated into the note pages. Users can select channels, snap in large background images and sequences of thumbnails, and write freeform ink notes. A smart video source management component enables the capture of high quality images of the presentation material from a variety of sources. For accessing and browsing the notes and recorded video, NoteLook generates Web pages with links from the images and ink strokes correlated to the video.
Publication Details
  • In Proceedings of the Second International Workshop on Cooperative Buildings (CoBuild'99). Lecture Notes in Computer Science, Vol. 1670 Springer-Verlag, pp. 79-88, 1999.
  • Oct 1, 1999

Abstract

We describe a media-enriched conference room designed for capturing meetings. Our goal is to do this in a flexible, seamless, and unobtrusive manner in a public conference room that is used for everyday work. Room activity is captured by computer-controllable video cameras, video conference cameras, and ceiling microphones. Presentation material displayed on a large-screen rear video projector is captured by a smart video source management component that automatically locates the highest fidelity image source. Wireless pen-based notebook computers are used to take notes, which provide indexes to the captured meeting. Images can be interactively and automatically incorporated into the notes. Captured meetings may be browsed on the Web with links to recorded video.
Publication Details
  • In Human-Computer Interaction INTERACT '99, IOS Press, pp. 205-212, 1999.
  • Aug 30, 1999

Abstract

When reviewing collections of video such as recorded meetings or presentations, users are often interested only in an overview or short segments of these documents. We present techniques that use automatic feature analysis, such as slide detection and applause detection, to help locate the desired video and to navigate to regions of interest within it. We built a web-based interface that graphically presents information about the contents of each video in a collection such as its keyframes and the distribution of a particular feature over time. A media player is tightly integrated with the web interface. It supports navigation within a selected file by visualizing confidence scores for the presence of features and by using them as index points. We conducted a user study to refine the usability of these tools.
Publication Details
  • In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (Phoenix, AZ), vol. 6, pp. 3029-3032, 1999.
  • Mar 14, 1999

Abstract

This paper describes a method for finding segments in video-recorded meetings that correspond to presentations. These segments serve as indexes into the recorded meeting. The system automatically detects intervals of video that correspond to presentation slides. We assume that only one person speaks during an interval when slides are detected. Thus these intervals can be used as training data for a speaker spotting system. An HMM is automatically constructed and trained on the audio data from each slide interval. A Viterbi alignment then resegments the audio according to speaker. Since the same speaker may talk across multiple slide intervals, the acoustic data from these intervals is clustered to yield an estimate of the number of distinct speakers and their order. This allows the individual presentations in the video to be identified from the location of each presenter's speech. Results are presented for a corpus of six meeting videos.
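
A minimal sketch of the clustering step this abstract describes: represent each slide interval by a mean acoustic feature vector, merge intervals agglomeratively until no pair is closer than a threshold, and read the resulting labels as the speakers and their order. The features and threshold are assumptions for illustration, not the paper's parameters.

    import numpy as np

    def cluster_intervals(interval_features, threshold):
        """interval_features: one mean feature vector per slide interval,
        in time order. Returns a speaker label per interval."""
        clusters = [[i] for i in range(len(interval_features))]
        centroids = [np.asarray(f, float) for f in interval_features]
        while len(clusters) > 1:
            dist, (a, b) = min(
                (np.linalg.norm(centroids[i] - centroids[j]), (i, j))
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters)))
            if dist > threshold:
                break
            na, nb = len(clusters[a]), len(clusters[b])
            centroids[a] = (na * centroids[a] + nb * centroids[b]) / (na + nb)
            clusters[a] += clusters[b]
            del clusters[b], centroids[b]
        labels = [0] * len(interval_features)
        for speaker, members in enumerate(clusters):
            for i in members:
                labels[i] = speaker
        return labels  # number of clusters estimates the number of speakers
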
Publication Details
  • In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (Phoenix, AZ), vol. 6, pp. 3453-3456, 1999.
  • Mar 14, 1999

Abstract

This paper describes a technique for automatically creating an index for handwritten notes captured as digital ink. No text recognition is performed. Rather, a dictionary of possible index terms is built by clustering groups of ink strokes corresponding roughly to words. Terms whose distribution varies significantly across note pages are selected for the index. An index page containing the index terms is created, and terms are hyperlinked back to their original location in the notes. Further, index terms occurring in a note page are highlighted to aid browsing.
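
One way to read the term-selection criterion above is as a variance test over per-page occurrence counts: terms that appear uniformly everywhere discriminate poorly, while terms concentrated on a few pages make good index entries. A hedged sketch (illustrative statistic, not the paper's exact one):

    def select_index_terms(counts_per_page, top_k=20):
        """counts_per_page: {term: [count on page 0, count on page 1, ...]}.
        Returns the top_k terms whose counts vary most across pages."""
        def variance(xs):
            mean = sum(xs) / len(xs)
            return sum((x - mean) ** 2 for x in xs) / len(xs)
        ranked = sorted(counts_per_page,
                        key=lambda term: variance(counts_per_page[term]),
                        reverse=True)
        return ranked[:top_k]
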
1998
Publication Details
  • UIST '98, ACM Press, 1998, pp. 195-202.
  • Oct 31, 1998

Abstract

In this paper, we describe a technique for dynamically grouping digital ink and audio to support user interaction in freeform note-taking systems. For ink, groups of strokes might correspond to words, lines, or paragraphs of handwritten text. For audio, groups might be a complete spoken phrase or a speaker turn in a conversation. Ink and audio grouping is important for editing operations such as deleting or moving chunks of ink and audio notes. The grouping technique is based on hierarchical agglomerative clustering. This clustering algorithm yields groups of ink or audio in a range of sizes, depending on the level in the hierarchy, and thus provides structure for simple interactive selection and rapid non-linear expansion of a selection. Ink and audio grouping is also important for marking portions of notes for subsequent browsing and retrieval. Integration of the ink and audio clusters provides a flexible way to browse the notes by selecting the ink cluster and playing the corresponding audio cluster.
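
A small sketch of the clustering machinery in Python, assuming each ink stroke is summarized by its spatial centroid and start time (the feature choice and the time/space weighting are assumptions): cutting the cluster tree at increasing heights yields word-, line-, and paragraph-sized groups, which is the expanding-selection behavior described above.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def group_strokes(strokes, cut_height):
        """strokes: dicts with 'points' (Nx2 array) and 'time' (seconds).
        Returns one group label per stroke; a larger cut_height yields
        larger groups (words -> lines -> paragraphs)."""
        feats = np.array([
            [*np.mean(s['points'], axis=0), 10.0 * s['time']]  # weight time vs. space
            for s in strokes])
        tree = linkage(feats, method='average')  # agglomerative clustering
        return fcluster(tree, t=cut_height, criterion='distance')

Audio could be grouped the same way with per-chunk acoustic features, and selections expanded by re-cutting the same tree at the next level up.
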
Publication Details
  • MULTIMEDIA '98, ACM Press, 1998, pp. 375-380.
  • Sep 14, 1998

Abstract

Many techniques can extract information from a multimedia stream, such as speaker identity or shot boundaries. We present a browser that uses this information to navigate through stored media. Because automatically derived information is not wholly reliable, it is transformed into a time-dependent "confidence score." When presented graphically, confidence scores enable users to make informed decisions about regions of interest in the media, so that uninteresting areas may be skipped. Additionally, index points may be determined automatically for easy navigation, selection, editing, and annotation; the approach also supports analysis types other than the speaker identification and shot detection used here.
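
As a concrete example of the index-point idea, a hedged sketch: emit an index wherever the confidence score rises through a threshold. The threshold value is an assumption; the paper presents scores graphically rather than fixing one.

    def index_points(times, confidence, threshold=0.5):
        """times, confidence: parallel per-sample sequences. Returns the
        times at which the confidence score crosses the threshold upward."""
        points = []
        for i in range(1, len(confidence)):
            if confidence[i - 1] < threshold <= confidence[i]:
                points.append(times[i])
        return points
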
Publication Details
  • Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (Seattle, WA), Vol. 6, 1998, pp. 3741-3744.
  • May 12, 1998

Abstract

This paper describes a technique for segmenting video using hidden Markov models (HMM). Video is segmented into regions defined by shots, shot boundaries, and camera movement within shots. Features for segmentation include an image-based distance between adjacent video frames, an audio distance based on the acoustic difference in intervals just before and after the frames, and an estimate of motion between the two frames. Typical video segmentation algorithms classify shot boundaries by computing an image-based distance between adjacent frames and comparing this distance to fixed, manually determined thresholds, using motion and audio information separately. In contrast, our segmentation technique allows features to be combined within the HMM framework. Further, thresholds are not required since automatically trained HMMs take their place. This algorithm has been tested on a video database, and has been shown to improve the accuracy of video segmentation over standard threshold-based systems.
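
A compact sketch of the decoding half of this approach: states such as shot, boundary, pan, and zoom emit a per-frame feature vector (image distance, audio distance, motion estimate), and a Viterbi decode replaces fixed thresholds. Diagonal-Gaussian emissions and caller-supplied parameters below are illustrative; in the paper the HMM is trained from data.

    import numpy as np

    def viterbi(features, means, variances, trans, prior):
        """features: (T, d) per-frame features; means, variances: (S, d)
        diagonal-Gaussian emission parameters; trans: (S, S) transition
        matrix; prior: (S,). Returns the most likely state per frame."""
        T, S = len(features), len(means)
        logem = -0.5 * (((features[:, None, :] - means) ** 2 / variances)
                        + np.log(2 * np.pi * variances)).sum(axis=2)  # (T, S)
        logtr = np.log(trans)
        delta = np.log(prior) + logem[0]
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            cand = delta[:, None] + logtr       # rows index the previous state
            back[t] = cand.argmax(axis=0)
            delta = cand.max(axis=0) + logem[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

Segment boundaries then fall wherever the decoded state sequence changes.
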
Publication Details
  • In Proceedings of the Thirty-first Annual Hawaii International Conference on System Sciences (Wailea, Hawaii, January 1998), Volume II, pp. 259-267.
  • Feb 6, 1998

Abstract

In this paper we describe a method for indexing and retrieval of multimedia data based on annotation and segmentation. Our goal is the retrieval of segments of audio and video suitable for inclusion in multimedia documents. Annotation refers to the association of text data with particular time locations of the media. Segmentation is the partitioning of continuous media into homogeneous regions. Retrieval is performed over segments of the media using the annotations associated with the segments. We present two scenarios that describe how these techniques might be applied. In the first, we describe how excerpts from a videotaped usage study of a new device are located for inclusion in a report on the utility of the device. In the second, we show how sound bites from a recorded meeting are obtained for use in authoring a summary of the meeting.
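
A minimal sketch of the retrieval model this abstract sets up, under an assumed data layout (timestamped text annotations; segments as time ranges): a query returns the segments whose annotations mention a query term.

    def retrieve_segments(segments, annotations, query_terms):
        """segments: list of (start, end) times; annotations: list of
        (time, text). Returns segments with a matching annotation."""
        terms = {t.lower() for t in query_terms}
        hits = []
        for start, end in segments:
            words = {w for ts, txt in annotations if start <= ts < end
                     for w in txt.lower().split()}
            if terms & words:
                hits.append((start, end))
        return hits
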
1997
Publication Details
  • In Proceedings: VISual'97; Second International Conference on Visual Information Systems (San Diego, CA), 1997, pp. 53-60.
  • Dec 15, 1997

Abstract

As the concept of what constitutes a "content-based" search grows more mature, it becomes valuable for us to establish a clear sense of just what is meant by "content." Recent multimedia document retrieval systems have dealt with this problem by indexing across multiple indexes; but it is important to identify how such multiple indexes are dealing with multiple dimensions of a description space, rather than simply providing the user with more descriptors. In this paper we consider a description space for multimedia documents based on three "dimensions" of a document, namely context, form, and content. We analyze the nature of this space with respect to three challenging examples of multimedia search tasks, and we address the nature of the index structures that would facilitate how these tasks may be achieved. These examples then lead us to some general conclusions on the nature of multimedia indexing and the intuitions we have inherited from the tradition of books and libraries.
Publication Details
  • In CHI 97 Extended Abstracts, ACM Press, 1997, pp. 22-23.
  • Mar 21, 1997

Abstract

Dynomite is a portable electronic notebook that merges the benefits of paper note-taking with the organizational capabilities of computers. Dynomite incorporates four complementary features that combine to form an easy-to-use system for the capture and retrieval of handwritten and audio notes. First, Dynomite has a paper-like user interface. Second, Dynomite uses ink properties and keywords for content indexing of notes. Third, Dynomite's properties and keywords allow retrieval of specific ink and notes: the user is shown a view, or a subset of the notebook content, that dynamically changes as new information is added. Finally, Dynomite continuously records audio but permanently stores only highlighted portions, making it possible to augment handwritten notes with audio on devices with limited storage.
Publication Details
  • In CHI 97 Conference Proceedings, ACM Press, 1997, pp. 186-193.
  • Mar 21, 1997

Abstract

Dynomite is a portable electronic notebook for the capture and retrieval of handwritten and audio notes. The goal of Dynomite is to merge the organization, search, and data acquisition capabilities of a computer with the benefits of a paper-based notebook. Dynomite provides novel solutions in four key problem areas. First, Dynomite uses a casual, low cognitive overhead interface. Second, for content indexing of notes, Dynomite uses ink properties and keywords. Third, to assist organization, Dynomite's properties and keywords define views, presenting a subset of the notebook content that dynamically changes as users add new information. Finally, to augment handwritten notes with audio on devices with limited storage, Dynomite continuously records audio, but only permanently stores those parts highlighted by the user.
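
The audio-storage policy described above can be sketched as a bounded ring buffer from which only user-highlighted ranges are copied out for permanent storage. Buffer layout and sizes are illustrative assumptions, not Dynomite's implementation:

    from collections import deque

    class HighlightRecorder:
        def __init__(self, sample_rate, buffer_seconds):
            self.rate = sample_rate
            self.buffer = deque(maxlen=int(sample_rate * buffer_seconds))
            self.samples_seen = 0      # absolute index of the next sample
            self.kept = []             # permanently stored (start_index, samples)

        def record(self, samples):
            """Append new audio; the deque silently drops the oldest."""
            self.buffer.extend(samples)
            self.samples_seen += len(samples)

        def highlight(self, start_sec, end_sec):
            """Copy a time range out of the transient buffer, if still held."""
            start = int(start_sec * self.rate)
            end = int(end_sec * self.rate)
            oldest = self.samples_seen - len(self.buffer)
            if start < oldest:
                return False           # range already discarded
            held = list(self.buffer)
            self.kept.append((start, held[start - oldest:end - oldest]))
            return True
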

Metadata for Mixed Media Access.

Publication Details
  • In Managing Multimedia Data: Using Metadata to Integrate and Apply Digital Data. A. Sheth and W. Klas (eds.), McGraw Hill, 1997.
  • Feb 1, 1997

Abstract

In this chapter, we discuss mixed-media access, an information access paradigm for multimedia data in which the media type of a query may differ from that of the data. This allows a single query to be used to retrieve information from data consisting of multiple types of media. In addition, multiple queries formulated in different media types can be used to more accurately specify the data to be retrieved. The types of media considered in this chapter are speech, images of text, and full-length text. Some examples of metadata for mixed-media access are locations of keywords in speech and images, identification of speakers, locations of emphasized regions in speech, and locations of topic boundaries in text. Algorithms for automatically generating this metadata are described, including word spotting, speaker segmentation, emphatic speech detection, and subtopic boundary location. We illustrate the use of mixed-media access with an example of information access from multimedia data surrounding a formal presentation.
1996
Publication Details
  • Proceedings Interface Conference (Sydney, Australia, July 1996).
  • Jul 1, 1996

Abstract

Online digital audio is a rapidly growing resource that can be accessed in rich new ways not previously possible. For example, it is possible to listen to just those portions of a long discussion that involve a given subset of people, or to instantly skip ahead to the next speaker. Providing this capability to users, however, requires generating the necessary indices, as well as an interface that uses those indices to aid navigation. We describe algorithms that generate indices from automatic acoustic segmentation. These algorithms use hidden Markov models to segment audio into regions corresponding to different speakers or acoustic classes (e.g., music). Unsupervised model initialization using agglomerative clustering is described, and shown to work as well in most cases as supervised initialization. We also describe a user interface that displays the segmentation in the form of a timeline with tracks for the different acoustic classes. The interface can be used for direct navigation through the audio.
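
A small sketch of turning the per-window class labels produced by such a segmentation into the timeline tracks the interface displays: consecutive windows with the same acoustic class merge into (start, end) spans, one track per class. The window length is an illustrative assumption.

    def timeline_tracks(labels, window_sec=1.0):
        """labels: per-window acoustic-class names in time order.
        Returns {class: [(start_sec, end_sec), ...]} for timeline display."""
        tracks = {}
        start = 0
        for i in range(1, len(labels) + 1):
            if i == len(labels) or labels[i] != labels[start]:
                span = (start * window_sec, i * window_sec)
                tracks.setdefault(labels[start], []).append(span)
                start = i
        return tracks

Skipping to the next speaker is then a lookup of the next span on that speaker's track.
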