Publications

FXPAL publishes in top scientific conferences and journals.

2014
Publication Details
  • International Journal of Multimedia Information Retrieval Special Issue on Cross-Media Analysis
  • Sep 4, 2014

Abstract

Media Embedded Target, or MET, is an iconic mark printed in a blank margin of a page that indicates a media link is associated with a nearby region of the page. It guides the user to capture the region and thus retrieve the associated link through visual search within indexed content. The target also serves to separate page regions with media links from other regions of the page. The capture application on the cell phone displays a sight having the same shape as the target near the edge of a camera-view display. The user moves the phone to align the sight with the target printed on the page. Once the system detects correct sight-target alignment, the region in the camera view is captured and sent to the recognition engine which identifies the image and causes the associated media to be displayed on the phone. Since target and sight alignment defines a capture region, this approach saves storage by only indexing visual features in the predefined capture region, rather than indexing the entire page. Target-sight alignment assures that the indexed region is fully captured. We compare the use of MET for guiding capture with two standard methods: one that uses a logo to indicate that media content is available and text to define the capture region and another that explicitly indicates the capture region using a visible boundary mark.
Publication Details
  • SPIE optics + photonics (SPIE)
  • Aug 17, 2014

Abstract

Live 3D reconstruction of a human as a 3D mesh with commodity electronics is becoming a reality. Immersive applications (e.g., cloud gaming and tele-presence) benefit from effective transmission of such content over a bandwidth-limited link. In this paper we outline different approaches for compressing live reconstructed mesh geometry based on distributing mesh reconstruction functions between sender and receiver. We evaluate the rate-performance-complexity of different configurations. First, we investigate 3D mesh compression methods (dynamic and static) from MPEG-4. Second, we evaluate the option of using octree-based point cloud compression and receiver-side surface reconstruction.
Publication Details
  • ICME 2014, Best Demo Award
  • Jul 14, 2014

Abstract

In this paper, we describe Gesture Viewport, a projector-camera system that enables finger gesture interactions with media content on any surface. We propose a novel and computationally very efficient finger localization method based on the detection of occlusion patterns inside a virtual sensor grid rendered in a layer on top of a viewport widget. We develop several robust interaction techniques to prevent unintentional gestures from occurring, to provide visual feedback to the user, and to minimize the interference of the sensor grid with the media content. We show the effectiveness of the system through three scenarios: viewing photos, navigating Google Maps, and controlling Google Street View.
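The occlusion-based localization idea lends itself to a compact sketch. The Python below is an illustrative assumption, not the paper's implementation: grid cells whose observed brightness falls well below the expected (projected) brightness are marked occluded, and the finger position is estimated as the centroid of the occluded cells.

```python
# Hypothetical sketch of occlusion-pattern finger localization on a
# virtual sensor grid; thresholds and data layout are assumptions.

def occluded_cells(expected, observed, threshold=0.5):
    """expected/observed: dicts mapping (row, col) -> brightness.
    A cell is occluded when its observed brightness drops below a
    fraction of its expected (projected) brightness."""
    return [cell for cell in expected
            if observed.get(cell, 0.0) < threshold * expected[cell]]

def finger_position(cells):
    """Estimate the finger position as the centroid of occluded cells."""
    if not cells:
        return None
    row = sum(r for r, _ in cells) / len(cells)
    col = sum(c for _, c in cells) / len(cells)
    return (row, col)
```

In practice the grid would be rendered only along the widget border to reduce interference with the media content, as the abstract suggests.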
Publication Details
  • ACM SIGIR International Workshop on Social Media Retrieval and Analysis
  • Jul 11, 2014

Abstract

We examine the use of clustering to identify selfies in a social media user's photos for use in estimating demographic information such as age, gender, and race. Faces are first detected within a user's photos followed by clustering using visual similarity. We define a cluster scoring scheme that uses a combination of within-cluster visual similarity and average face size in a cluster to rank potential selfie-clusters. Finally, we evaluate this ranking approach over a collection of Twitter users and discuss methods that can be used for improving performance in the future.
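The cluster-scoring scheme can be sketched as a weighted combination of within-cluster visual similarity and average face size. The function and weight names below are assumptions for illustration, not the paper's code:

```python
# Illustrative sketch of ranking face clusters as selfie candidates by
# combining within-cluster similarity with average face size. The equal
# default weights are an assumption.

def score_cluster(similarities, face_sizes, w_sim=0.5, w_size=0.5):
    """similarities: pairwise visual-similarity values in [0, 1];
    face_sizes: face areas normalized to [0, 1]."""
    avg_sim = sum(similarities) / len(similarities)
    avg_size = sum(face_sizes) / len(face_sizes)
    return w_sim * avg_sim + w_size * avg_size

def rank_selfie_clusters(clusters):
    """clusters: list of (similarities, face_sizes) tuples.
    Returns cluster indices ranked from most to least selfie-like."""
    return sorted(range(len(clusters)),
                  key=lambda i: score_cluster(*clusters[i]),
                  reverse=True)
```

The intuition: a user's selfies tend to form a tight cluster (high mutual similarity) with large faces, since the camera is held at arm's length.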

SearchPanel: Framing Complex Search Needs

Publication Details
  • SIGIR 2014
  • Jul 6, 2014
  • pp. 495-504

Abstract

People often use more than one query when searching for information. They revisit search results to re-find information and build an understanding of their search need through iterative explorations of query formulation. These tasks are not well-supported by search interfaces and web browsers. We designed and built SearchPanel, a Chrome browser extension that helps people manage their ongoing information seeking. This extension combines document and process metadata into an interactive representation of the retrieved documents that can be used for sense-making, navigation, and re-finding documents. In a real-world deployment spanning over two months, results show that SearchPanel appears to have been primarily used for complex information needs, in search sessions with long durations and high numbers of queries. The process metadata features in SearchPanel seem to be of particular importance when working on complex information needs.

Supporting media bricoleurs

Publication Details
  • ACM interactions
  • Jul 1, 2014

Abstract

Online video is incredibly rich. A 15-minute home improvement YouTube tutorial might include 1500 words of narration, 100 or more significant keyframes showing a visual change from multiple perspectives, several animated objects, references to other examples, a tool list, comments from viewers and a host of other metadata. Furthermore, video accounts for 90% of worldwide Internet traffic. However, it is our observation that video is not widely seen as a full-fledged document; it is dismissed as a medium that, at worst, gilds over substance and, at best, simply augments text-based communications. In this piece, we suggest that negative attitudes toward multimedia documents that include audio and video are largely unfounded and arise mostly because we lack the necessary tools to treat video content as first-order media or to support seamlessly mixing media.
Publication Details
  • ACM TVX 2014
  • Jun 25, 2014

Abstract

Creating compelling multimedia content is a difficult task. It involves not only the creative process of developing a compelling media-based story, but also significant technical support for content editing, management and distribution. This has been true for printed, audio and visual presentations for centuries. It is certainly true for broadcast media such as radio and television. The talk will survey several approaches to describing and managing media interactions. We will focus on the temporal modeling of context-sensitive personalized interactions of complex collections of independent media objects. Using the concepts of ‘togetherness’ employed in the EU’s FP-7 project TA2: Together Anywhere, Together Anytime, we will follow the process of media capture, profiling, composition, sharing and end-user manipulation. We will consider the promise of using automated tools and contrast this with the reality of letting real users manipulate presentation semantics in real time. The talk will not present a closed-form solution, but rather a series of topics and problems that can stimulate the development of a new generation of systems for social media interaction.
Publication Details
  • IEEE Transactions on Multimedia
  • Jun 18, 2014

Abstract

3D Tele-immersion enables participants in remote locations to share an activity in real time. It offers users interactive and immersive experiences, but it challenges current media streaming solutions. Work in the past has mainly focused on the efficient delivery of image-based 3D videos and on realistic rendering and reconstruction of geometry-based 3D objects. The contribution of this paper is a real-time streaming component for 3D Tele-Immersion with dynamically reconstructed geometry. This component includes both a novel fast compression method and a rateless packet protection scheme specifically designed for the requirements imposed by real-time transmission of live-reconstructed mesh geometry. Tests on a large dataset show an encoding speed-up of up to 10 times at comparable compression ratio and quality, compared to the high-end MPEG-4 SC3DMC mesh encoders. The implemented rateless code ensures complete packet loss protection of the triangle mesh object and a delivery delay within interactive bounds. Contrary to most linear fountain codes, the designed codec enables real-time progressive decoding, allowing partial decoding each time a packet is received. This approach is compared to transmission over TCP at packet loss rates and latencies typical of managed WAN and MAN networks, and heavily outperforms it in terms of end-to-end delay. The streaming component has been integrated into a larger 3D Tele-Immersive environment that includes state-of-the-art 3D reconstruction and rendering modules. The result is a prototype that can capture, compress, transmit, and render triangle mesh geometry in real time under realistic internet conditions, as shown in experiments. Compared to alternative methods, it achieves lower interactive end-to-end delay and frame rates over 3 times higher.
Publication Details
  • ICWSM (The 8th International AAAI Conference on Weblogs and Social Media)
  • Jun 1, 2014

Abstract

A topic-independent sentiment model is commonly used to estimate sentiment in microblogs. But for movie and product reviews, domain adaptation has been shown to improve sentiment estimation performance. We investigated the utility of topic-dependent polarity estimation models for microblogs. We examined both a model trained on Twitter tweets containing a target keyword and a model trained on an enlarged set of tweets containing terms related to a topic. Comparing the performance of the topic-dependent models to a topic-independent model trained on a general sample of tweets, we noted that for some topics, topic-dependent models performed better. We then propose a method for predicting which topics are likely to have better sentiment estimation performance when a topic-dependent sentiment model is used.
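The data-selection step behind a topic-dependent model can be illustrated simply: train only on tweets that contain the target keyword, or an enlarged set of topic-related terms. This is a hedged sketch of that filtering step alone; the sentiment model training and term-expansion details are assumptions not covered here:

```python
# Hypothetical sketch of selecting a topic-dependent training set from a
# general tweet sample by keyword/related-term matching.

def topic_training_set(tweets, topic_terms):
    """Keep tweets containing any topic term (simple word match)."""
    terms = {t.lower() for t in topic_terms}
    return [tw for tw in tweets
            if terms & set(tw.lower().split())]
```

A topic-independent model would instead be trained on the unfiltered sample; the paper's contribution is predicting for which topics the filtered (topic-dependent) model is worth using.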
Publication Details
  • IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
  • May 3, 2014

Abstract

Geometry-based 3D Tele-Immersion is an emerging media application that involves on-the-fly reconstructed 3D mesh geometry. To enable real-time communication of such live reconstructed mesh geometry over a bandwidth-limited link, fast dynamic geometry compression is needed. However, most tools and methods have been developed for compressing synthetically generated graphics content. These methods achieve good compression rates by exploiting topological and geometric properties that typically do not hold for reconstructed mesh geometry. The live reconstructed dynamic geometry is causal and often non-manifold, open, non-oriented and time-inconsistent. Based on our experience developing a prototype for 3D Tele-Immersion based on live reconstructed geometry, we discuss currently available tools. We then present our approach for dynamic compression that better exploits the fact that the 3D geometry is reconstructed, and achieves state-of-the-art rate-distortion performance under stringent real-time constraints.
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6854788&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6854788
Publication Details
  • CHI 2014 (Interactivity)
  • Apr 26, 2014

Abstract

AirAuth is a biometric authentication technique that uses in-air hand gestures to authenticate users tracked through a short-range depth sensor. Our method tracks multiple distinct points on the user's hand simultaneously that act as a biometric to further enhance security. We describe the details of our mobile demonstrator that will give Interactivity attendees an opportunity to enroll and verify our system's authentication method. We also wish to encourage users to design their own gestures for use with the system. Apart from engaging with the CHI community, a demonstration of AirAuth would also yield useful gesture data input by the attendees which we intend to use to further improve the prototype and, more importantly, make available publicly as a resource for further research into gesture-based user interfaces.
Publication Details
  • CHI Extended Abstracts 2014
  • Apr 26, 2014

Abstract

AirAuth is a biometric, gesture-based authentication system based on in-air gesture input. We describe the operations necessary to sample enrollment gestures and to perform matching for authentication, using data from a short-range depth sensor. We present the results of two initial user studies. A first study was conducted to crowdsource a simple gesture set for use in further evaluations. The results of our second study indicate that AirAuth achieves a very high Equal Error Rate (EER)-based accuracy of 96.6% for the simple gesture set and 100% for user-specific gestures. Future work will encompass the evaluation of possible attack scenarios and obtaining qualitative user feedback on the usability advantages of gesture-based authentication.
Publication Details
  • ACM ICMR 2014
  • Apr 1, 2014

Abstract

Motivated by scalable partial-duplicate visual search, there has been growing interest in a wealth of compact and efficient binary feature descriptors (e.g. ORB, FREAK, BRISK). Typically, binary descriptors are clustered into codewords and quantized with Hamming distance, following the conventional bag-of-words strategy. However, such codewords formulated in Hamming space have not shown obvious indexing and search performance improvements compared to their Euclidean counterparts. In this paper, without explicit codeword construction, we explore using binary descriptors directly as codebook indices (addresses). We propose a novel approach that builds multiple index tables which check collisions of identical hash values in parallel. The evaluation is performed on two public image datasets: DupImage and Holidays. The experimental results demonstrate the index efficiency and retrieval accuracy of our approach.
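The idea of using binary descriptor bits as direct table addresses, with several tables checked in parallel for collisions, might be sketched as follows. The bit-slicing scheme, table sizes, and vote-counting are assumptions for illustration, not the paper's method:

```python
# Hypothetical sketch: disjoint bit slices of a binary descriptor serve
# as direct addresses into multiple index tables; query-time collisions
# across tables vote for candidate images.

def build_tables(descriptors, num_tables=4, bits_per_table=8):
    """descriptors: iterable of (image_id, int_descriptor) pairs."""
    tables = [dict() for _ in range(num_tables)]
    mask = (1 << bits_per_table) - 1
    for img_id, desc in descriptors:
        for t in range(num_tables):
            key = (desc >> (t * bits_per_table)) & mask
            tables[t].setdefault(key, set()).add(img_id)
    return tables

def query(tables, desc, bits_per_table=8):
    """Rank candidate images by how many tables they collide in."""
    mask = (1 << bits_per_table) - 1
    votes = {}
    for t, table in enumerate(tables):
        key = (desc >> (t * bits_per_table)) & mask
        for img_id in table.get(key, ()):
            votes[img_id] = votes.get(img_id, 0) + 1
    return sorted(votes, key=votes.get, reverse=True)
```

This avoids codeword clustering entirely: the descriptor bits themselves address the index, and requiring collisions in several tables filters false matches.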

The Optimiser: monitoring and improving switching delays in video conferencing

Publication Details
  • ACM Workshop on Mobile Video (ACM MoVid)
  • Mar 18, 2014

Abstract

With the growing popularity of video communication systems, more people are using group video chat, rather than only one-to-one video calls. In such multi-party sessions, remote participants compete for the available screen space and bandwidth. A common solution is showing the current speaker prominently. Bandwidth limitations may not allow all streams to be sent at a high resolution at all times, especially with many participants in a call. This can be mitigated by only switching on higher resolutions when they are required. This switching encounters delays due to latency and the properties of encoded video streams. In this paper, we analyse and improve the switching delay of our video conferencing system. Our server-centric system offers a next-generation video chat solution, providing end-to-end video encoding. To evaluate our system we use a testbed that allows us to emulate different network conditions. We measure the video switching delay between three clients, each connected via different network profiles. Our results show that missing Intra-Frames in the transmission has a strong influence on the switching delay. Based on this, we provide an optimization mechanism that improves those delays by resending Intra-Frames.
http://dl.acm.org/citation.cfm?id=2579472

Multimedia Authoring and Annotation

Publication Details
  • International Journal on Multimedia Tools and Applications
  • Feb 28, 2014

Abstract

With the massive amount of captured multimedia, authoring is more relevant than ever. Multimedia content is available in many settings including the web, mobile devices, desktop applications, as well as games and interactive TV. The authoring and production of multimedia documents demands attention to many issues related to the structure and to the synchronization of the media components, to the specification of the document and of the interaction, to the roles of authors and end users, as well as issues concerning reuse and digital rights management. Several complementary approaches to support the authoring of multimedia documents have been reported in the literature, and in many cases they have been studied via authoring tools and applications. One aim of this special issue is to assess current approaches, tools and applications, discussing how they tackle the main issues relative to the process of authoring, as well as their limitations.
Publication Details
  • HotMobile 2014
  • Feb 26, 2014

Abstract

In this paper, we propose HiFi, a system that enables users to interact with surrounding physical objects. It uses coded light to encode position in an environment. By attaching a tiny light sensor to a user’s mobile device, the user can attach digital info to arbitrary static physical objects, or retrieve and modify the info anchored to these objects. With this system, a family member may attach a digital maintenance schedule to a fish tank or indoor plants. In a store, a manager may attach price tags, discount info, and multimedia content to any product, and customers can get the attached info by moving their phone close to the product of interest. Similarly, a museum can use the system to provide extra info about displayed items to visitors. Unlike computer vision based systems, HiFi does not impose requirements on texture, bright illumination, etc. Unlike regular barcode approaches, HiFi does not require extra physical attachments that may change an object’s native appearance. HiFi has much higher spatial resolution for distinguishing nearby objects or attached parts of the same object. As HiFi can track a mobile device at 80 positions per second, it also responds much faster than any of the systems listed above.
Publication Details
  • Fuji Xerox Technical Report, No. 23, 2014, pp. 34-42
  • Feb 20, 2014

Abstract

Video content creators invest enormous effort creating work that is in turn typically viewed passively. However, learning tasks using video requires users not only to consume the content but also to engage, interact with, and repurpose it. Furthermore, to promote learning with video in domains where content creators are not necessarily videographers, it is important that capture tools facilitate creation of interactive content. In this paper, we describe some early experiments toward this goal. A literature review coupled with formative field studies led to a system design that can incorporate a broad set of video-creation and interaction styles.
2013
Publication Details
  • IEEE ISM 2013
  • Dec 9, 2013

Abstract

Real-time tele-immersion requires low-latency, synchronized multi-camera capture. Prior high definition (HD) capture systems were bulky. We investigate the suitability of using flocks of smartphone cameras for tele-immersion. Smartphones can potentially integrate HD capture and streaming into a single portable package. However, they are designed for archiving the captured video into a movie. Hence, we create a sequence of H.264 movies and stream them. We lower the capture delay by reducing the number of frames in each movie segment. Increasing the number of movie segments adds compression overhead. Smartphone video encoders do not sacrifice video quality to lower the compression latency or the stream size. On an iPhone 4S, our application, which uses published APIs, streams 1920x1080 videos at 16.5 fps with a delay of 712 msec between a real-life event and the display of an uncompressed bitmap of this event on a local laptop. For comparison, the bulky Cisco Tandberg required a 300 msec delay. Stereoscopic video from two unsynchronized smartphones showed minimal visual artifacts in an indoor teleconference setting.
Publication Details
  • Education and Information Technologies journal
  • Oct 11, 2013

Abstract

Video tends to be imbalanced as a medium. Typically, content creators invest enormous effort creating work that is then watched passively. However, learning tasks require that users not only consume video but also engage, interact with, and repurpose content. Furthermore, to promote learning across domains where content creators are not necessarily videographers, it is important that capture tools facilitate creation of interactive content. In this paper, we describe some early experiments toward this goal. Specifically, we describe a needfinding study involving interviews with amateur video creators as well as our experience with an early prototype to support expository capture and access. Our findings led to a system redesign that can incorporate a broad set of video-creation and interaction styles.
Publication Details
  • Interactive Tabletops and Surfaces (ITS) 2013
  • Oct 6, 2013

Abstract

The expressiveness of touch input can be increased by detecting additional finger pose information at the point of touch such as finger rotation and tilt. PointPose is a prototype that performs finger pose estimation at the location of touch using a short-range depth sensor viewing the touch screen of a mobile device. We present an algorithm that extracts finger rotation and tilt from a point cloud generated by a depth sensor oriented towards the device's touchscreen. The results of two user studies we conducted show that finger pose information can be extracted reliably using our proposed method. We show this for controlling rotation and tilt axes separately and also for combined input tasks using both axes. With the exception of the depth sensor, which is mounted directly on the mobile device, our approach does not require complex external tracking hardware, and, furthermore, external computation is unnecessary as the finger pose extraction algorithm can run directly on the mobile device. This makes PointPose ideal for prototyping and developing novel mobile user interfaces that use finger pose estimation.
Publication Details
  • ACM Trans. On Multimedia Computing, Communications and Applications (TOMCCAP)
  • Oct 1, 2013

Abstract

A panel at ACM Multimedia 2012 addressed research successes in the past 20 years. While the panel focused on the past, this article discusses successes since the ACM SIGMM 2003 Retreat and suggests research directions in the next ten years. While significant progress has been made, more research is required to allow multimedia to impact our everyday computing environment. The importance of hardware changes on future research directions is discussed. We believe ubiquitous computing—meaning abundant computation and network bandwidth—should be applied in novel ways to solve multimedia grand challenges and continue the IT revolution of the past century.
Publication Details
  • DocEng 2013
  • Sep 10, 2013

Abstract

Unlike text, copying and pasting parts of video documents is challenging. Yet, the huge amount of video documents now available in the form of how-to tutorials begs for simpler techniques that allow users to easily copy and paste fragments of video materials into new documents. We describe new direct video manipulation techniques that allow users to quickly copy and paste content from video documents such as how-to tutorials into a new document. While the video plays, users interact with the video canvas to select text regions, scrollable regions, slide sequences built up across many frames, or semantically meaningful regions such as dialog boxes. Instead of relying on the timeline to accurately select sub-parts of the video document, users navigate using familiar selection techniques such as mouse-wheel to scroll back and forward over a video region where content scrolls, double-clicks over rectangular regions to select them, or clicks and drags over textual regions of the video canvas to select them. We describe the video processing techniques that run in real-time in modern web browsers using HTML5 and JavaScript; and show how they help users quickly copy and paste video fragments into new documents, allowing them to efficiently reuse video documents for authoring or note-taking.
Publication Details
  • CBDAR 2013
  • Aug 23, 2013

Abstract

Capturing book images is more convenient with a mobile phone camera than with more specialized flat-bed scanners or 3D capture devices. We built an application for the iPhone 4S that captures a sequence of hi-res (8 MP) images of a page spread as the user sweeps the device across the book. To do the 3D dewarping, we implemented two algorithms: optical flow (OF) and structure from motion (SfM). Making further use of the image sequence, we examined the potential of multi-frame OCR. Preliminary evaluation on a small set of data shows that OF and SfM had comparable OCR performance for both single-frame and multi-frame techniques, and that multi-frame was substantially better than single-frame. The computation time was much less for OF than for SfM.

SearchPanel: A browser extension for managing search activity

Publication Details
  • EuroHCIR 2013
  • Aug 1, 2013

Abstract

People often use more than one query when searching for information; they also revisit search results to re-find information. These tasks are not well-supported by search interfaces and web browsers. We designed and built a Chrome browser extension that helps people manage their ongoing information seeking. The extension combines document and process metadata into an interactive representation of the retrieved documents that can be used for sense-making, for navigation, and for re-finding documents.

Looking Ahead: Query Preview in Exploratory Search

Publication Details
  • SIGIR 2013
  • Jul 28, 2013

Abstract

Exploratory search is a complex, iterative information seeking activity that involves running multiple queries and finding and examining many documents. We introduced a query preview interface that visualizes the distribution of newly-retrieved and re-retrieved documents prior to showing the detailed query results. When evaluating the preview against a control condition, we found effects on people’s information seeking behavior as well as improved retrieval performance. With the preview, people spent more time formulating a query, were more likely to explore search results more deeply, retrieved a more diverse set of documents, and found more different relevant documents. With more time spent on query formulation, higher quality queries were produced and, as a consequence, the retrieval results improved: both average residual precision and recall were higher with the query preview present.
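The core bookkeeping behind such a preview, counting newly retrieved versus re-retrieved documents before rendering the full result list, could look like this minimal sketch (function and key names are assumptions):

```python
# Hypothetical sketch of the query-preview summary: partition a new
# result list into documents not yet seen in the session vs. documents
# retrieved by an earlier query.

def preview(retrieved_ids, seen_ids):
    """retrieved_ids: ranked doc ids for the new query;
    seen_ids: doc ids retrieved earlier in the session."""
    seen = set(seen_ids)
    new = [d for d in retrieved_ids if d not in seen]
    old = [d for d in retrieved_ids if d in seen]
    return {"new": len(new), "re-retrieved": len(old)}
```

Showing these counts before the results lets a searcher judge, at a glance, whether a reformulated query is surfacing genuinely new material.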
Publication Details
  • The International Symposium on Pervasive Displays
  • Jun 4, 2013

Abstract

Existing user interfaces for the configuration of large shared displays with multiple inputs and outputs usually do not allow users easy and direct configuration of the display's properties such as window arrangement or scaling. To address this problem, we are exploring a gesture-based technique for manipulating display windows on shared display systems. To aid target selection under noisy tracking conditions, we propose VoroPoint, a modified Voronoi tessellation approach that increases the selectable target area of the display windows. By maximizing the available target area, users can select and interact with display windows with greater ease and precision.
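Selecting the window whose Voronoi cell contains a (noisy) pointing position reduces to a nearest-seed test, since a point lies in the Voronoi cell of its nearest seed. A minimal sketch with window centers as seeds follows; this is an assumption for illustration and does not reproduce the paper's modified tessellation:

```python
# Illustrative sketch of Voronoi-cell target selection: every screen
# point selects some window, so the selectable area of each window
# extends to its entire Voronoi cell rather than its visible bounds.
import math

def select_window(point, window_centers):
    """Return the index of the window whose Voronoi cell contains point,
    i.e. the window with the nearest center (seed)."""
    px, py = point
    return min(range(len(window_centers)),
               key=lambda i: math.hypot(px - window_centers[i][0],
                                        py - window_centers[i][1]))
```

Because the tessellation covers the whole display, a gesture that lands slightly outside a window under noisy tracking still selects the intended target.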

Private Aggregation for Presence Streams

Publication Details
  • Future Generation Computer Systems
  • May 28, 2013

Abstract


Collaboration technologies must support information sharing between collaborators, but must also take care not to share too much information or share information too widely. Systems that share information without requiring an explicit action by a user to initiate the sharing must be particularly cautious in this respect. Presence systems are an emerging class of applications that support collaboration. Through the use of pervasive sensors, these systems estimate user location, activities, and available communication channels. Because such presence data are sensitive, to achieve widespread adoption, sharing models must reflect the privacy and sharing preferences of their users. This paper looks at the role that privacy-preserving aggregation can play in addressing certain user sharing and privacy concerns with respect to presence data. We define conditions to achieve CollaPSE (Collaboration Presence Sharing Encryption) security, in which (i) an individual has full access to her own data, (ii) a third party performs computation on the data without learning anything about the data values, and (iii) people with special privileges called “analysts” can learn statistical information about groups of individuals, but nothing about the individual values contributing to the statistic other than what can be deduced from the statistic. More specifically, analysts can decrypt aggregates without being able to decrypt the individual values contributing to the aggregate. Based in part on studies we carried out that illustrate the need for the conditions encapsulated by CollaPSE security, we designed and implemented a family of CollaPSE protocols. We analyze their security, discuss efficiency tradeoffs, describe extensions, and review more recent privacy-preserving aggregation work.
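As a toy illustration of the aggregation property, distinct from the actual CollaPSE encryption protocols, additive masks that cancel across users let an aggregator compute a sum without learning any individual value:

```python
# Toy illustration (NOT the CollaPSE protocols) of privacy-preserving
# aggregation: each user's value is blinded by a random mask, the masks
# sum to zero mod the modulus, so only the aggregate is recoverable.
import random

def mask_values(values, modulus=2**31):
    """Blind each value with a random additive mask; masks cancel."""
    n = len(values)
    masks = [random.randrange(modulus) for _ in range(n - 1)]
    masks.append((-sum(masks)) % modulus)  # force masks to sum to zero
    return [(v + m) % modulus for v, m in zip(values, masks)]

def aggregate(masked, modulus=2**31):
    """Each masked share looks random, but their sum is the true total
    (assuming the total is less than the modulus)."""
    return sum(masked) % modulus
```

CollaPSE itself uses encryption so that only privileged analysts can decrypt the aggregate; the sketch above only conveys why individual contributions stay hidden while the statistic survives.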

Leading People to Longer Queries

Publication Details
  • CHI 2013
  • Apr 27, 2013

Abstract

Although longer queries can produce better results for information seeking tasks, people tend to type short queries. We created an interface designed to encourage people to type longer queries, and evaluated it in two Mechanical Turk experiments. Results suggest that our interface manipulation may be effective for eliciting longer queries.
Publication Details
  • IUI 2013
  • Mar 19, 2013

Abstract

People frequently capture photos with their smartphones, and some are starting to capture images of documents. However, the quality of captured document images is often lower than expected, even when applications that perform post-processing to improve the image are used. To improve the quality of captured images before post-processing, we developed a Smart Document Capture (SmartDCap) application that provides real-time feedback to users about the likely quality of a captured image. The quality measures capture the sharpness and framing of a page or regions on a page, such as a set of one or more columns, a part of a column, a figure, or a table. Using our approach, while users adjust the camera position, the application automatically determines when to take a picture of a document to produce a good quality result. We performed a subjective evaluation comparing SmartDCap and the Android Ice Cream Sandwich (ICS) camera application; we also used raters to evaluate the quality of the captured images. Our results indicate that users find SmartDCap to be as easy to use as the standard ICS camera application. Additionally, images captured using SmartDCap are sharper and better framed on average than images using the ICS camera application.

Abstract

Motivated by the addition of gyroscopes to a large number of new smart phones, we study the effects of combining accelerometer and gyroscope data on the recognition rate of motion gesture recognizers with dimensionality constraints. Using a large data set of motion gestures, we analyze results for the following algorithms: Protractor3D, Dynamic Time Warping (DTW) and Regularized Logistic Regression (LR). We chose to study these algorithms because they are relatively easy to implement and thus well suited for rapid prototyping or early deployment during prototyping stages. For use in our analysis, we contribute a method to extend Protractor3D to work with the 6D data obtained by combining accelerometer and gyroscope data. Our results show that combining accelerometer and gyroscope data is also beneficial for algorithms with dimensionality constraints, improving the gesture recognition rate on our data set by up to 4%.
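Of the algorithms compared, DTW is the most straightforward to sketch over 6-D samples (three accelerometer plus three gyroscope axes). The nearest-template classifier below is a hedged illustration, not the study's implementation:

```python
# Illustrative sketch of DTW-based gesture recognition over multi-axis
# sensor samples (e.g. 6-D accelerometer + gyroscope tuples).

def dtw_distance(a, b):
    """Classic DTW: a and b are sequences of equal-length tuples."""
    inf = float('inf')
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between two multi-axis samples
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

def classify(sample, templates):
    """templates: list of (label, sequence); nearest label under DTW."""
    return min(templates, key=lambda t: dtw_distance(sample, t[1]))[0]
```

Combining accelerometer and gyroscope data here simply means widening each per-frame tuple from 3 to 6 values; the DTW recurrence is unchanged, which is part of why such algorithms benefit directly from the extra sensor.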
Publication Details
  • IUI 2013
  • Mar 19, 2013

Abstract

Close
We describe direct video manipulation interactions applied to screen-based tutorials. In addition to using the video timeline, users of our system can quickly navigate within the video with the mouse wheel, double-click over a rectangular region to zoom in and out, or drag a box over the video canvas to select text and scrub the video to the end of a text line, even if it is not shown in the current frame. We describe the video processing techniques developed to implement these direct video manipulation techniques, and show how they are implemented to run in most modern web browsers using HTML5's canvas element and JavaScript.
Publication Details
  • SPIE Electronic Imaging 2013
  • Feb 3, 2013

Abstract

Close
Video is becoming a prevalent medium for e-learning. Lecture videos contain useful information in both the visual and aural channels: the presentation slides and lecturer's speech respectively. To extract the visual information, we apply video content analysis to detect slides and optical character recognition (OCR) to obtain their text. Automatic speech recognition (ASR) is used similarly to extract spoken text from the recorded audio. These two text sources have distinct characteristics and relative strengths for video retrieval. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Experiments reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Additional experiments demonstrate higher precision video retrieval using automatically extracted slide text.  
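The retrieval trade-off between the two channels can be illustrated with a toy fielded-retrieval scorer (this is our invention, not the paper's system): each lecture segment indexes both its slide OCR text and its ASR transcript, and a query is scored by weighted term overlap, letting the typically more accurate slide text count for more.

```python
# Toy sketch of fielded retrieval over slide (OCR) and spoken (ASR)
# text channels; the field weights are hypothetical defaults.

def score(query, segment, slide_weight=2.0, speech_weight=1.0):
    """Weighted count of query terms found in each text channel."""
    q = set(query.lower().split())
    slide = set(segment["slide_text"].lower().split())
    speech = set(segment["asr_text"].lower().split())
    return slide_weight * len(q & slide) + speech_weight * len(q & speech)

def search(query, segments):
    """Return the segment with the highest combined score."""
    return max(segments, key=lambda s: score(query, s))
```

Tuning the two weights is one simple way to exploit the finding that slide text yields higher retrieval precision than spoken text.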
2012
Publication Details
  • IPIN2012
  • Nov 13, 2012

Abstract

Close
We describe Explorer, a system utilizing mirror worlds - dynamic 3D virtual models of physical spaces that reflect the structure and activities of those spaces to help support navigation, context awareness, and tasks such as planning and recollection of events. A rich sensor network dynamically updates the models, determining the position of people, the status of rooms, or updating textures to reflect displays or bulletin boards. Through views on web pages, portable devices, or on 'magic window' displays located in the physical space, remote people may 'look in' to the space, while people within the space are provided with augmented views showing information not physically apparent. For example, by looking at a mirror display, people can learn how long others have been present, or where they have been. People in one part of a building can get a sense of activities in the rest of the building, know who is present in their office, and look in to presentations in other rooms. A spatial graph derived from the 3D models is used both for computing navigational paths and for fusing the acoustic, WiFi, motion, and image sensors used for positioning. We describe usage scenarios for the system as deployed in two research labs and a conference venue.
Publication Details
  • IPIN2012
  • Nov 13, 2012

Abstract

Close
Audio-based receiver localization in indoor environments has multiple applications including indoor navigation, location tagging, and tracking. Public places like shopping malls and consumer stores often have loudspeakers installed to play music for public entertainment. Similarly, office spaces may have sound conditioning speakers installed to soften other environmental noises. We discuss an approach to leverage this infrastructure to perform audio-based localization of devices requesting localization in such environments, by playing barely audible controlled sounds from multiple speakers at known positions. Our approach can be used to localize devices such as smartphones, tablets, and laptops to sub-meter accuracy. The user does not need to carry any specialized hardware. Unlike acoustic approaches that use high-energy ultrasound waves, the use of barely audible (low energy) signals in our approach poses very different challenges. We discuss these challenges, how we addressed them, and experimental results on two prototypical implementations: a request-play-record localizer, and a continuous tracker. We evaluated our approach in a real-world meeting room and report promising initial results with localization accuracy within half a meter 94% of the time. The system has been deployed in multiple zones of our office building and is now part of a location service in constant operation in our lab.
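The geometric core of such a system can be sketched as time-difference-of-arrival (TDOA) multilateration. The code below is a hypothetical illustration, not the paper's pipeline (which additionally involves signal design, correlation, and tracking): given the relative arrival times of signals from loudspeakers at known positions, a brute-force grid search finds the position whose predicted TDOAs best match the measurements.

```python
# Hypothetical TDOA multilateration sketch for audio-based localization.
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def tdoas(pos, speakers):
    """Arrival-time differences at `pos`, relative to the first speaker."""
    d = [math.dist(pos, s) for s in speakers]
    return [(di - d[0]) / SPEED_OF_SOUND for di in d[1:]]

def locate(measured, speakers, step=0.05, size=5.0):
    """Grid search over a size x size room for the position whose
    predicted TDOAs best match the measured ones (least squares)."""
    best, best_err = None, float("inf")
    n = int(size / step)
    for i in range(n + 1):
        for j in range(n + 1):
            p = (i * step, j * step)
            err = sum((a - b) ** 2
                      for a, b in zip(tdoas(p, speakers), measured))
            if err < best_err:
                best, best_err = p, err
    return best
```

With four speakers at the corners of a room, three arrival-time differences are enough to pin down a 2D position to within the grid resolution.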
Publication Details
  • ICPR 2012
  • Nov 11, 2012

Abstract

Close
Images of document pages have different characteristics than images of natural scenes, and so the sharpness measures developed for natural scene images do not necessarily extend to document images primarily composed of text. We present an efficient and simple method for effectively estimating the sharpness/blurriness of document images that also performs well on natural scenes. Our method can be used to predict the sharpness in scenarios where images are blurred due to camera motion (or hand-shake), defocus, or inherent properties of the imaging system. The proposed method outperforms the perceptually-based, no-reference sharpness work of [1] and [4], which was shown to perform better than 14 other no-reference sharpness measures on the LIVE dataset.
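The intuition behind gradient-based sharpness scoring can be shown with a minimal sketch. This is in the spirit of, but not identical to, the measure described above: blur weakens the intensity transitions at character edges, so the mean gradient magnitude drops.

```python
# Minimal sketch: mean absolute gradient as a no-reference sharpness
# score, with a box blur used only to simulate defocus.

def sharpness(img):
    """Mean |horizontal| + |vertical| gradient of a grayscale image
    given as a list of rows of pixel intensities (0-255)."""
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(h - 1):
        for x in range(w - 1):
            gx = abs(img[y][x + 1] - img[y][x])
            gy = abs(img[y + 1][x] - img[y][x])
            total += gx + gy
            count += 1
    return total / count

def box_blur(img):
    """3x3 box blur, used here only to simulate a defocused capture."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out
```

On a synthetic "text-like" image of hard black/white stripes, the blurred copy scores measurably lower, which is the behaviour a capture application can gate on.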
Publication Details
  • ACM Multimedia 2012
  • Oct 29, 2012

Abstract

Close
Paper and computers have complementary advantages and are used side by side in many scenarios. Interactive paper systems aim to combine the two media. However, most such systems only allow fingers and pens to interact with content on paper. This finger-pen-only input suffers from low precision, lag, instability, and occlusion. Moreover, it forces frequent device switching (e.g. pen vs. mouse) during cross-media interactions, which is inefficient and interrupts the continuity of the document workspace. To address these limitations, we propose MixPad, a novel interactive paper system that incorporates mice and keyboards to enhance conventional pen- and finger-based paper interaction. Like many other systems, MixPad adopts a mobile camera-projector unit to recognize paper documents, detect pen and finger gestures, and provide visual feedback. Unlike these systems, MixPad lets users use mice and keyboards to select fine-grained content and create annotations on paper, and supports bimanual operations for more efficient and smoother cross-media interaction. This novel interaction style combines the advantages of mice, keyboards, pens, and fingers, enabling richer digital functions on paper.
Publication Details
  • ACM Multimedia 2012
  • Oct 29, 2012

Abstract

Close
Faithful sharing of screen contents is an important collaboration feature. Prior systems were designed to operate over constrained networks, yet they performed poorly even without such bottlenecks. To build a high-performance screen sharing system, we empirically analyzed screen contents for a variety of scenarios. We showed that screen updates were sporadic with long periods of inactivity. When active, screens were updated at far higher rates than earlier systems supported. The mismatch was pronounced for interactive scenarios. Even during active screen updates, the number of updated pixels was frequently small. We showed that crucial information can be lost if individual updates are merged. When the available system resources could not support high capture rates, we showed ways in which updates can be effectively collapsed. We showed that Zlib lossless compression performed poorly for screen updates. By analyzing the screen pixels, we developed a practical transformation that significantly improved compression rates. Our system captured 240 updates per second while only using 4.6 Mbps for interactive scenarios. Still, while playing movies in fullscreen mode, our approach could not achieve higher capture rates than prior systems; the CPU remains the bottleneck. A system that incorporates our findings is deployed within the lab.
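One reason a pre-compression transform helps is easy to demonstrate. The sketch below is a hedged illustration of the general idea, not the paper's actual transform: consecutive screen captures are nearly identical, so compressing the per-pixel difference (here a simple byte-wise subtraction) gives Zlib long runs of zeros to exploit, where the raw frame would not.

```python
# Illustrative sketch: inter-frame delta as a pre-compression transform
# for screen content, compared against compressing the raw frame.
import zlib

def delta(prev, cur):
    """Byte-wise difference of two equal-length framebuffers."""
    return bytes((c - p) & 0xFF for p, c in zip(prev, cur))

def compressed_size(data):
    """Size of the Zlib-compressed payload at the default-ish level 6."""
    return len(zlib.compress(data, 6))
```

When only a small screen region changes between captures, the compressed delta is a tiny fraction of the compressed raw frame, which is consistent with the observation above that most active updates touch few pixels.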
Publication Details
  • ACM Multimedia '12
  • Oct 29, 2012

Abstract

Close
DisplayCast is a many-to-many screen sharing system targeted at Intranet scenarios. The capture software runs on all computers whose screens need to be shared. It uses an application-agnostic screen capture mechanism that creates a sequence of pixmap images of the screen updates. It transforms these pixmaps to vastly improve lossless Zlib compression performance. These algorithms were developed after an extensive analysis of typical screen contents. DisplayCast shares the processor and network resources required for screen capture, compression, and transmission with the host applications whose output needs to be shared. It balances the need for high-performance screen capture with reducing its resource interference with user applications. DisplayCast uses Zeroconf for naming and asynchronous location. It provides support for Cisco WiFi- and Bluetooth-based localization. It also includes an HTTP/REST-based controller for remote session initiation and control. DisplayCast supports screen capture and playback on computers running the Windows 7 and Mac OS X operating systems. Remote screens can be archived into an H.264-encoded movie on a Mac. They can also be played back in real time on Apple iPhones and iPads. The software is released under a New BSD license.
Publication Details
  • CIKM 2012 Books Online Workshop Keynote Address
  • Oct 29, 2012

Abstract

Close
Reading is part of how we understand the world, how we share knowledge, how we play, and even how we think. Although reading text is the dominant form of reading, most of the text we read (letters, numbers, words, and sentences) is surrounded by illustrations, photographs, and other kinds of symbols that we include as we read. As dynamic displays migrate into the real world at many scales, whether personal devices, handhelds, or large screens in both interior and exterior spaces, opportunities for reading migrate as well. As has happened continually throughout the history of reading, new technologies, physical forms and social patterns create new genres, which themselves may then combine or collide to morph into something new. At PARC, the RED (Research in Experimental Design) group examined emerging technologies for impact on media and the human relationship to information, especially reading. We explored new ways of experiencing text: new genres, new styles of interaction, and unusual media. Among the questions we considered: how might "the book" change? More particularly, how does the experience of reading change with the introduction of new technologies, and how does it remain the same? In this talk, we'll discuss the ideas behind the design and research process that led to creating eleven different experiences of new forms of reading. We'll also consider how our technological context for reading has changed in recent years, and what influence the lessons from XFR may have on our ever-developing online reading experiences.

Through the Looking-Glass: Mirror Worlds for Augmented Awareness & Capability

Publication Details
  • ACM MM 2012
  • Oct 29, 2012

Abstract

Close
We describe a system for supporting mirror worlds - 3D virtual models of physical spaces that reflect the structure and activities of those spaces to help support context awareness and tasks such as planning and recollection of events. Through views on web pages, portable devices, or on 'magic window' displays located in the physical space, remote people may 'look in' to the space, while people within the space are provided information not apparent through unaided perception. For example, by looking at a mirror display, people can learn how long others have been present, or where they have been. People in one part of a building can get a sense of activities in the rest of the building, know who is present in their office, and look in to presentations in other rooms. The system can be used to bridge across sites and help provide different parts of an organization with a shared awareness of each other's space and activities. We describe deployments of our mirror world system at several locations.
Publication Details
  • Mobile HCI 2012 demo track
  • Sep 21, 2012

Abstract

Close
In this demonstration we will show a mobile remote control and monitoring application for a recipe development laboratory at a local chocolate production company. In collaboration with TCHO, a chocolate maker in San Francisco, we built a mobile Web app designed to allow chocolate makers to control their laboratory's machines. Sensor data is imported into the app from each machine in the lab. The mobile Web app is used for control, monitoring, and collaboration. We have tested and deployed this system at the real-world factory and it is now in daily use. This project is designed as part of a research exploration into enhanced collaboration in industrial settings between physically remote people and places, e.g. factories in China with clients in the US.
Publication Details
  • Workshop on Social Mobile Video and Panoramic Video
  • Sep 20, 2012

Abstract

Close
The ways in which we come to know and share what we know with others are deeply entwined with the technologies that enable us to capture and share information. As face-to-face communication has been supplemented with ever-richer media (textual books, illustrations and photographs, audio, film and video, and more), the possibilities for knowledge transfer have only expanded. One of the latest trends to emerge amidst the growth of Internet sharing and pervasive mobile devices is the mass creation of online instructional videos. We are interested in exploring how smart phones shape this sort of mobile, rich media documentation and sharing.
Publication Details
  • USENIX/ACM/IFIP Middleware
  • Sep 19, 2012

Abstract

Close
Faunus addresses the challenge of specifying and managing complex collaboration sessions. Many entities from various administrative domains orchestrate such sessions. Faunus decouples the entities that specify a session from the entities that activate and manage it. It restricts operations to specific agents using capabilities. It unifies the specification and management operations through its naming system. Each Faunus name is persistent and globally unique, and a collection of attributes is attached to each name; together, they represent the collection of services that form a collaboration session. Anyone can create a name; the creator has full read and write privileges that can be delegated to others. With the proper capability, anyone can modify session attributes and toggle a session between its active and inactive states. Though the system is designed for Internet-scale deployments, the security model for providing and revoking capabilities currently assumes an Intranet-style deployment. We have incorporated Faunus into a DisplayCast system that originally used Zeroconf. We are incorporating Faunus into another project that will fully exercise the power of Faunus.
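The capability model described above can be sketched as a toy name service. Everything here is invented for illustration (class names, the token scheme, the rights vocabulary); Faunus's real interface and security model are not reproduced.

```python
# Toy model of capability-scoped session names: each globally unique
# name carries session attributes; holders of a write capability may
# update them or delegate a subset of their rights to others.
import uuid

class NameService:
    def __init__(self):
        self._names = {}  # name -> attribute dict
        self._caps = {}   # capability token -> (name, set of rights)

    def create(self, attributes):
        """Anyone can create a name; the creator gets read+write."""
        name = str(uuid.uuid4())  # persistent, globally unique
        self._names[name] = dict(attributes, active=False)
        cap = str(uuid.uuid4())
        self._caps[cap] = (name, {"read", "write"})
        return name, cap

    def delegate(self, cap, rights):
        """Derive a new capability carrying a subset of held rights."""
        name, held = self._caps[cap]
        if not set(rights) <= held:
            raise PermissionError("cannot delegate rights not held")
        new_cap = str(uuid.uuid4())
        self._caps[new_cap] = (name, set(rights))
        return new_cap

    def update(self, cap, **attrs):
        """Modify session attributes, e.g. toggling `active`."""
        name, rights = self._caps[cap]
        if "write" not in rights:
            raise PermissionError("write capability required")
        self._names[name].update(attrs)

    def read(self, cap):
        name, rights = self._caps[cap]
        if "read" not in rights:
            raise PermissionError("read capability required")
        return dict(self._names[name])
```

The key property is that a read-only delegate can observe a session but can neither change its state nor mint broader capabilities.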
Publication Details
  • International Journal on Document Analysis and Recognition (IJDAR): Volume 15, Issue 3 (2012), pp. 167-182.
  • Sep 1, 2012

Abstract

Close
When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve genre identification performance. In the open-set identification of four office document genres, our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to genre identification of office documents. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections.
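The general shape of such an ensemble can be sketched with majority voting over several base classifiers. This is a hedged illustration of the generic idea only; the paper's actual ensemble method, features, and base SVMs are not reproduced here.

```python
# Generic majority-vote ensemble sketch: each base classifier maps an
# input to a genre label, and the most common prediction wins.
from collections import Counter

def majority_vote(classifiers, x):
    """Label predicted by the most base classifiers; ties resolve in
    favor of the label that first reached the top count."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

Even when each base classifier makes independent errors, the vote tends to recover the correct label, which is the usual motivation for ensembling a single SVM with other learners.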
Publication Details
  • IIiX 2012
  • Aug 21, 2012

Abstract

Close
Exploratory search activities tend to span multiple sessions and involve finding, analyzing, and evaluating information and collaborating with others. Typical search systems, on the other hand, are designed to support a single searcher performing precision-oriented search tasks. We describe a search interface and system design of a multi-session exploratory search system, discuss design challenges encountered, and chronicle the evolution of our design. Our design includes novel displays for visualizing retrieval history information, and introduces ambient displays and persuasive elements to interactive information retrieval.
Publication Details
  • DIS (Designing Interactive Systems) 2012 Demos track
  • Jun 11, 2012

Abstract

Close
We will demonstrate successive and final stages in the iterative design of a complex mixed reality system in a real-world factory setting. In collaboration with TCHO, a chocolate maker in San Francisco, we built a virtual “mirror” world of a real-world chocolate factory and its processes. Sensor data is imported into the multi-user 3D environment from hundreds of sensors and a number of cameras on the factory floor. The resulting virtual factory is used for simulation, visualization, and collaboration, using a set of interlinked, real-time layers of information. It can be a stand-alone or a web-based application, and also works on iOS and Android cell phones and tablet computers. A unique aspect of our system is that it is designed to enable the incorporation of lightweight social media-style interactions with co-workers along with factory data. Through this mixture of mobile, social, mixed and virtual technologies, we hope to create systems for enhanced collaboration in industrial settings between physically remote people and places, such as factories in China with managers in the US.
Publication Details
  • CHI 2012
  • May 7, 2012

Abstract

Close
Affect influences workplace collaboration and thereby impacts a workplace's productivity. Participants in face-to-face interactions have many cues to each other's affect, but work is increasingly carried out via computer-mediated channels that lack many of these cues. Current presence systems enable users to estimate the availability of other users, but not their affect states or communication preferences. This work investigates relationships between affect state and communication preferences and demonstrates the feasibility of estimating affect state and communication preferences from a presence state stream.
Publication Details
  • CHI 2012
  • May 5, 2012

Abstract

Close
Pico projectors have lately been investigated as mobile display and interaction devices. We propose to use them as 'light beams': everyday objects sojourning in a beam are turned into dedicated projection surfaces and tangible interaction devices. While this has been explored for large projectors, the affordances of pico projectors are fundamentally different: they have a very small and strictly limited projection ray and can be carried around in a nomadic way during the day. Thus it is unclear how this could actually be leveraged for tangible interaction with physical, real-world objects. We have investigated this in an exploratory field study and contribute the results. Based upon these, we present exemplary interaction techniques and early user feedback.

Designing a tool for exploratory information seeking

Publication Details
  • CHI 2012
  • May 5, 2012

Abstract

Close
In this paper we describe our ongoing design process in building a search system designed to support people's multi-session exploratory search tasks. The system, called Querium, allows people to run queries and to examine results as conventional search engines do, but it also integrates a sophisticated search history that helps people make sense of their search activity over time. Information seeking is a cognitively demanding process that can benefit from many kinds of information, if that information is presented appropriately. Our design process has focused on creating displays that facilitate ongoing sense-making while keeping the interaction efficient, fluid, and enjoyable.

Querium: A Session-Based Collaborative Search System

Publication Details
  • European Conference on Information Retrieval 2012
  • Apr 1, 2012

Abstract

Close
People's information-seeking can span multiple sessions, and can be collaborative in nature. Existing commercial offerings do not effectively support searchers in sharing, saving, revisiting, or collaborating on their information. In this demo paper we present Querium: a novel session-based collaborative search system that lets users search, share, resume, and collaborate with other users. Querium provides a number of novel search features in a collaborative setting, including relevance feedback, query fusion, faceted search, and search histories.