Publications

FXPAL publishes in top scientific conferences and journals.

2014
Publication Details
  • International Journal of Multimedia Information Retrieval Special Issue on Cross-Media Analysis
  • Sep 4, 2014

Abstract

Media Embedded Target, or MET, is an iconic mark printed in a blank margin of a page that indicates a media link is associated with a nearby region of the page. It guides the user to capture the region and thus retrieve the associated link through visual search within indexed content. The target also serves to separate page regions with media links from other regions of the page. The capture application on the cell phone displays a sight having the same shape as the target near the edge of a camera-view display. The user moves the phone to align the sight with the target printed on the page. Once the system detects correct sight-target alignment, the region in the camera view is captured and sent to the recognition engine which identifies the image and causes the associated media to be displayed on the phone. Since target and sight alignment defines a capture region, this approach saves storage by only indexing visual features in the predefined capture region, rather than indexing the entire page. Target-sight alignment assures that the indexed region is fully captured. We compare the use of MET for guiding capture with two standard methods: one that uses a logo to indicate that media content is available and text to define the capture region and another that explicitly indicates the capture region using a visible boundary mark.
Publication Details
  • SPIE optics + photonics (SPIE)
  • Aug 17, 2014

Abstract

Live 3D reconstruction of a human as a 3D mesh with commodity electronics is becoming a reality. Immersive applications (e.g., cloud gaming, tele-presence) benefit from effective transmission of such content over a bandwidth-limited link. In this paper we outline different approaches for compressing live reconstructed mesh geometry based on distributing mesh reconstruction functions between sender and receiver. We evaluate the rate-performance-complexity of different configurations. First, we investigate 3D mesh compression methods (i.e., dynamic and static) from MPEG-4. Second, we evaluate the option of using octree-based point cloud compression and receiver-side surface reconstruction.
Publication Details
  • ICME 2014, Best Demo Award
  • Jul 14, 2014

Abstract

In this paper, we describe Gesture Viewport, a projector-camera system that enables finger gesture interactions with media content on any surface. We propose a novel and computationally very efficient finger localization method based on the detection of occlusion patterns inside a virtual sensor grid rendered in a layer on top of a viewport widget. We develop several robust interaction techniques to prevent unintentional gestures from occurring, to provide visual feedback to a user, and to minimize the interference of the sensor grid with the media content. We show the effectiveness of the system through three scenarios: viewing photos, navigating Google Maps, and controlling Google Street View.
Publication Details
  • ACM SIGIR International Workshop on Social Media Retrieval and Analysis
  • Jul 11, 2014

Abstract

We examine the use of clustering to identify selfies in a social media user's photos for use in estimating demographic information such as age, gender, and race. Faces are first detected within a user's photos followed by clustering using visual similarity. We define a cluster scoring scheme that uses a combination of within-cluster visual similarity and average face size in a cluster to rank potential selfie-clusters. Finally, we evaluate this ranking approach over a collection of Twitter users and discuss methods that can be used for improving performance in the future.
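
The cluster scoring idea in this abstract can be sketched as follows. This is only an illustrative sketch: the abstract does not say how the two signals are combined, so the linear weighting `alpha`, the cosine similarity measure, the feature vectors, and the function names are all assumptions rather than the authors' method.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_cluster(faces, alpha=0.5):
    """Score one face cluster (hypothetical scoring rule).

    faces: list of (feature_vector, face_size) pairs, where face_size is the
    fraction of the photo covered by the face (0..1).
    alpha: assumed weight between visual similarity and average face size.
    """
    pairs = list(combinations([f for f, _ in faces], 2))
    # A single-face cluster has no pairs; treat its similarity as 0 (assumption).
    similarity = sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    avg_size = sum(size for _, size in faces) / len(faces)
    return alpha * similarity + (1 - alpha) * avg_size

def rank_clusters(clusters):
    """Return cluster names ordered from most to least selfie-like."""
    return sorted(clusters, key=lambda name: score_cluster(clusters[name]), reverse=True)
```

A tight cluster of large faces ranks above a loose cluster of small faces, which matches the intuition that selfies show the same person prominently and repeatedly.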

SearchPanel: Framing Complex Search Needs

Publication Details
  • SIGIR 2014
  • Jul 6, 2014
  • pp. 495-504

Abstract

People often use more than one query when searching for information. They revisit search results to re-find information and build an understanding of their search need through iterative explorations of query formulation. These tasks are not well-supported by search interfaces and web browsers. We designed and built SearchPanel, a Chrome browser extension that helps people manage their ongoing information seeking. This extension combines document and process metadata into an interactive representation of the retrieved documents that can be used for sense-making, navigation, and re-finding documents. In a real-world deployment spanning over two months, results show that SearchPanel appears to have been primarily used for complex information needs, in search sessions with long durations and high numbers of queries. The process metadata features in SearchPanel seem to be of particular importance when working on complex information needs.

Supporting media bricoleurs

Publication Details
  • ACM interactions
  • Jul 1, 2014

Abstract

Online video is incredibly rich. A 15-minute home improvement YouTube tutorial might include 1500 words of narration, 100 or more significant keyframes showing a visual change from multiple perspectives, several animated objects, references to other examples, a tool list, comments from viewers and a host of other metadata. Furthermore, video accounts for 90% of worldwide Internet traffic. However, it is our observation that video is not widely seen as a full-fledged document; it is dismissed as a medium that, at worst, gilds over substance and, at best, simply augments text-based communications. In this piece, we suggest that negative attitudes toward multimedia documents that include audio and video are largely unfounded and arise mostly because we lack the necessary tools to treat video content as first-order media or to support seamlessly mixing media.
Publication Details
  • ACM TVX 2014
  • Jun 25, 2014

Abstract

Creating compelling multimedia content is a difficult task. It involves not only the creative process of developing a compelling media-based story, but it also requires significant technical support for content editing, management and distribution. This has been true for printed, audio and visual presentations for centuries. It is certainly true for broadcast media such as radio and television. The talk will survey several approaches to describe and manage media interactions. We will focus on the temporal modeling of context-sensitive personalized interactions of complex collections of independent media objects. Using the concepts of ‘togetherness’ being employed in the EU’s FP-7 project TA2: Together Anywhere, Together Anytime, we will follow the process of media capture, profiling, composition, sharing and end-user manipulation. We will consider the promise of using automated tools and contrast this with the reality of letting real users manipulate presentation semantics in real time. The talk will not present a closed form solution, but will present a series of topics and problems that can stimulate the development of a new generation of systems to stimulate social media interaction.
Publication Details
  • IEEE Transactions on Multimedia
  • Jun 18, 2014

Abstract

3D Tele-immersion enables participants in remote locations to share, in real-time, an activity. It offers users interactive and immersive experiences, but it challenges current media streaming solutions. Work in the past has mainly focused on the efficient delivery of image-based 3D videos and on realistic rendering and reconstruction of geometry-based 3D objects. The contribution of this paper is a real-time streaming component for 3D Tele-Immersion with dynamic reconstructed geometry. This component includes both a novel fast compression method and a rateless packet protection scheme specifically designed for the requirements imposed by real-time transmission of live-reconstructed mesh geometry. Tests on a large dataset show an encoding speed-up of up to 10 times at comparable compression ratio and quality, when compared to the high-end MPEG-4 SC3DMC mesh encoders. The implemented rateless code ensures complete packet loss protection of the triangle mesh object and a delivery delay within interactive bounds. Contrary to most linear fountain codes, the designed codec enables real-time progressive decoding, allowing partial decoding each time a packet is received. This approach is compared to transmission over TCP at packet loss rates and latencies typical of managed WAN and MAN networks, and heavily outperforms it in terms of end-to-end delay. The streaming component has been integrated into a larger 3D Tele-Immersive environment that includes state-of-the-art 3D reconstruction and rendering modules. This resulted in a prototype that can capture, compress, transmit, and render triangle mesh geometry in real-time in realistic internet conditions, as shown in experiments. Compared to alternative methods, lower interactive end-to-end delay and frame rates over 3 times higher are achieved.
Publication Details
  • ICWSM (The 8th International AAAI Conference on Weblogs and Social Media)
  • Jun 1, 2014

Abstract

A topic-independent sentiment model is commonly used to estimate sentiment in microblogs. But for movie and product reviews, domain adaptation has been shown to improve sentiment estimation performance. We investigated the utility of topic-dependent polarity estimation models for microblogs. We examined both a model trained on Twitter tweets containing a target keyword and a model trained on an enlarged set of tweets containing terms related to a topic. Comparing the performance of the topic-dependent models to a topic-independent model trained on a general sample of tweets, we noted that for some topics, topic-dependent models performed better. We then propose a method for predicting which topics are likely to have better sentiment estimation performance when a topic-dependent sentiment model is used.
Publication Details
  • IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
  • May 3, 2014

Abstract

Geometry-based 3D Tele-Immersion is a novel emerging media application that involves on-the-fly reconstructed 3D mesh geometry. To enable real-time communication of such live reconstructed mesh geometry over a bandwidth-limited link, fast dynamic geometry compression is needed. However, most tools and methods have been developed for compressing synthetically generated graphics content. These methods achieve good compression rates by exploiting topological and geometric properties that typically do not hold for reconstructed mesh geometry. The live reconstructed dynamic geometry is causal and often non-manifold, open, non-oriented and time-inconsistent. Based on our experience developing a prototype for 3D Tele-Immersion based on live reconstructed geometry, we discuss currently available tools. We then present our approach for dynamic compression that better exploits the fact that the 3D geometry is reconstructed and achieves state-of-the-art rate-distortion performance under stringent real-time constraints.
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6854788&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6854788
Publication Details
  • CHI 2014 (Interactivity)
  • Apr 26, 2014

Abstract

AirAuth is a biometric authentication technique that uses in-air hand gestures to authenticate users tracked through a short-range depth sensor. Our method tracks multiple distinct points on the user's hand simultaneously that act as a biometric to further enhance security. We describe the details of our mobile demonstrator that will give Interactivity attendees an opportunity to enroll and verify our system's authentication method. We also wish to encourage users to design their own gestures for use with the system. Apart from engaging with the CHI community, a demonstration of AirAuth would also yield useful gesture data input by the attendees which we intend to use to further improve the prototype and, more importantly, make available publicly as a resource for further research into gesture-based user interfaces.
Publication Details
  • CHI Extended Abstracts 2014
  • Apr 26, 2014

Abstract

AirAuth is a biometric, gesture-based authentication system based on in-air gesture input. We describe the operations necessary to sample enrollment gestures and to perform matching for authentication, using data from a short-range depth sensor. We present the results of two initial user studies. The first study was conducted to crowdsource a simple gesture set for use in further evaluations. The results of our second study indicate that AirAuth achieves a very high Equal Error Rate (EER)-based accuracy of 96.6% for the simple gesture set and 100% for user-specific gestures. Future work will encompass the evaluation of possible attack scenarios and obtaining qualitative user feedback on the usability advantages of gesture-based authentication.
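
For readers unfamiliar with the metric, the following minimal sketch shows one common way to estimate an equal error rate from match scores. It assumes distance-style scores (lower means a closer match) and reads "EER-based accuracy" as 1 minus the EER; both are assumptions, since the abstract does not spell out the scoring or the exact definition. The function name and sample data are hypothetical.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from distance scores (lower = better match).

    genuine:  distances between a user's attempts and their own template.
    impostor: distances between other users' attempts and that template.
    Scans every observed score as a threshold and returns the error rate at
    the threshold where false accepts and false rejects are closest.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        frr = sum(g > t for g in genuine) / len(genuine)     # genuine rejected
        far = sum(i <= t for i in impostor) / len(impostor)  # impostor accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

genuine_scores = [0.1, 0.2, 0.3, 0.6]   # made-up example data
impostor_scores = [0.4, 0.7, 0.8, 0.9]  # made-up example data
eer = equal_error_rate(genuine_scores, impostor_scores)
accuracy = 1.0 - eer  # assumed reading of "EER-based accuracy"
```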
Publication Details
  • ACM ICMR 2014
  • Apr 1, 2014

Abstract

Motivated by scalable partial-duplicate visual search, there has been growing interest in a wealth of compact and efficient binary feature descriptors (e.g., ORB, FREAK, BRISK). Typically, binary descriptors are clustered into codewords and quantized with Hamming distance, following the conventional bag-of-words strategy. However, codewords formulated in Hamming space have not shown an obvious improvement in indexing and search performance over Euclidean ones. In this paper, without explicit codeword construction, we explore using binary descriptors as direct codebook indices (addresses). We propose a novel approach that builds multiple index tables, which are checked in parallel for collisions of identical hash values. The evaluation is performed on two public image datasets: DupImage and Holidays. The experimental results demonstrate the index efficiency and retrieval accuracy of our approach.
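
The idea of using descriptor bits directly as table addresses can be illustrated with a small multi-table index. This is a sketch under stated assumptions, not the paper's method: it assumes each descriptor is split into equal-width chunks, with one table per chunk, and that candidates colliding in any table are verified by full Hamming distance. The class and function names are hypothetical, and the tables are probed sequentially here rather than in parallel.

```python
from collections import defaultdict

def split_descriptor(desc, n_bits, n_tables):
    """Split an n_bits-wide binary descriptor (as an int) into equal chunks."""
    chunk_bits = n_bits // n_tables
    mask = (1 << chunk_bits) - 1
    return [(desc >> (i * chunk_bits)) & mask for i in range(n_tables)]

class MultiTableIndex:
    """One hash table per descriptor chunk; the chunk value is the address."""

    def __init__(self, n_bits=32, n_tables=4):
        self.n_bits = n_bits
        self.n_tables = n_tables
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.descriptors = {}

    def add(self, item_id, desc):
        self.descriptors[item_id] = desc
        keys = split_descriptor(desc, self.n_bits, self.n_tables)
        for table, key in zip(self.tables, keys):
            table[key].append(item_id)

    def query(self, desc, max_hamming):
        """Return (distance, item_id) pairs within max_hamming, nearest first."""
        candidates = set()
        keys = split_descriptor(desc, self.n_bits, self.n_tables)
        for table, key in zip(self.tables, keys):
            candidates.update(table.get(key, ()))  # collision in this table
        results = []
        for item_id in candidates:
            dist = bin(self.descriptors[item_id] ^ desc).count("1")
            if dist <= max_hamming:
                results.append((dist, item_id))
        return sorted(results)

index = MultiTableIndex(n_bits=16, n_tables=4)
index.add("a", 0b1010110000110101)
index.add("b", 0b1010110000110100)  # one bit away from "a"
index.add("c", 0b0101001111001010)  # differs from "a" in every chunk
matches = index.query(0b1010110000110101, max_hamming=2)
```

With several tables, a near-duplicate that differs in a few bits still collides with the query in at least one unaffected chunk, so no codeword clustering step is needed to find it.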

The Optimiser: monitoring and improving switching delays in video conferencing

Publication Details
  • ACM Workshop on Mobile Video (ACM MoVid)
  • Mar 18, 2014

Abstract

With the growing popularity of video communication systems, more people are using group video chat, rather than only one-to-one video calls. In such multi-party sessions, remote participants compete for the available screen space and bandwidth. A common solution is showing the current speaker prominently. Bandwidth limitations may not allow all streams to be sent at a high resolution at all times, especially with many participants in a call. This can be mitigated by only switching on higher resolutions when they are required. This switching encounters delays due to latency and the properties of encoded video streams. In this paper, we analyse and improve the switching delay of our video conferencing system. Our server-centric system offers a next-generation video chat solution, providing end-to-end video encoding. To evaluate our system we use a testbed that allows us to emulate different network conditions. We measure the video switching delay between three clients, each connected via different network profiles. Our results show that missing Intra-Frames in the transmission has a strong influence on the switching delay. Based on this, we provide an optimization mechanism that improves those delays by resending Intra-Frames.
http://dl.acm.org/citation.cfm?id=2579472

Multimedia Authoring and Annotation

Publication Details
  • International Journal on Multimedia Tools and Applications
  • Feb 28, 2014

Abstract

With the massive amount of captured multimedia, authoring is more relevant than ever. Multimedia content is available in many settings including the web, mobile devices, desktop applications, as well as games and interactive TV. The authoring and production of multimedia documents demands attention to many issues related to the structure and to the synchronization of the media components, to the specification of the document and of the interaction, to the roles of authors and end users, as well as issues concerning reuse and digital rights management. Several complementary approaches to support the authoring of multimedia documents have been reported in the literature, and in many cases they have been studied via authoring tools and applications. One aim of this special issue is to assess current approaches, tools and applications, discussing how they tackle the main issues relative to the process of authoring, as well as their limitations.
Publication Details
  • HotMobile 2014
  • Feb 26, 2014

Abstract

In this paper, we propose HiFi, a system that enables users to interact with surrounding physical objects. It uses coded light to encode position in an environment. By attaching a tiny light sensor to a mobile device, a user can attach digital information to arbitrary static physical objects, or retrieve and modify the information anchored to these objects. With this system, a family member may attach a digital maintenance schedule to a fish tank or to indoor plants. In a store, a manager may use the system to attach price tags, discount information, and multimedia content to any product, and customers can get the attached information by moving their phone close to the product of interest. Similarly, a museum can use this system to provide visitors with extra information about displayed items. Unlike computer-vision-based systems, HiFi places no requirements on texture, bright illumination, etc. Unlike regular barcode approaches, HiFi does not require extra physical attachments that may change an object’s native appearance. HiFi also has much higher spatial resolution for distinguishing close objects or attached parts of the same object. Because HiFi can track a mobile device at 80 positions per second, it also responds much faster than any of the systems listed above.
Publication Details
  • Fuji Xerox Technical Report, No. 23, 2014, pp. 34-42
  • Feb 20, 2014

Abstract

Video content creators invest enormous effort creating work that is in turn typically viewed passively. However, learning tasks using video require users not only to consume the content but also to engage, interact with, and repurpose it. Furthermore, to promote learning with video in domains where content creators are not necessarily videographers, it is important that capture tools facilitate creation of interactive content. In this paper, we describe some early experiments toward this goal. A literature review coupled with formative field studies led to a system design that can incorporate a broad set of video-creation and interaction styles.
2013
Publication Details
  • IEEE ISM 2013
  • Dec 9, 2013

Abstract

Real-time tele-immersion requires low latency, synchronized multi-camera capture. Prior high definition (HD) capture systems were bulky. We investigate the suitability of using flocks of smartphone cameras for tele-immersion. Smartphones can potentially integrate HD capture and streaming into a single portable package. However, they are designed for archiving the captured video into a movie. Hence, we create a sequence of H.264 movies and stream them. We lower the capture delay by reducing the number of frames in each movie segment. Increasing the number of movie segments adds compression overhead. Smartphone video encoders do not sacrifice video quality to lower the compression latency or the stream size. On an iPhone 4S, our application that uses published APIs streams 1920x1080 videos at 16.5 fps with a delay of 712 msec between a real-life event and displaying an uncompressed bitmap of this event on a local laptop. For comparison, the bulky Cisco Tandberg required 300 msec delay. Stereoscopic video from two unsynchronized smartphones showed minimal visual artifacts in an indoor teleconference setting.
Publication Details
  • Education and Information Technologies journal
  • Oct 11, 2013

Abstract

Video tends to be imbalanced as a medium. Typically, content creators invest enormous effort creating work that is then watched passively. However, learning tasks require that users not only consume video but also engage, interact with, and repurpose content. Furthermore, to promote learning across domains where content creators are not necessarily videographers, it is important that capture tools facilitate creation of interactive content. In this paper, we describe some early experiments toward this goal. Specifically, we describe a needfinding study involving interviews with amateur video creators as well as our experience with an early prototype to support expository capture and access. Our findings led to a system redesign that can incorporate a broad set of video-creation and interaction styles.
Publication Details
  • Interactive Tabletops and Surfaces (ITS) 2013
  • Oct 6, 2013

Abstract

The expressiveness of touch input can be increased by detecting additional finger pose information at the point of touch such as finger rotation and tilt. PointPose is a prototype that performs finger pose estimation at the location of touch using a short-range depth sensor viewing the touch screen of a mobile device. We present an algorithm that extracts finger rotation and tilt from a point cloud generated by a depth sensor oriented towards the device's touchscreen. The results of two user studies we conducted show that finger pose information can be extracted reliably using our proposed method. We show this for controlling rotation and tilt axes separately and also for combined input tasks using both axes. With the exception of the depth sensor, which is mounted directly on the mobile device, our approach does not require complex external tracking hardware, and, furthermore, external computation is unnecessary as the finger pose extraction algorithm can run directly on the mobile device. This makes PointPose ideal for prototyping and developing novel mobile user interfaces that use finger pose estimation.
Publication Details
  • ACM Trans. On Multimedia Computing, Communications and Applications (TOMCCAP)
  • Oct 1, 2013

Abstract

A panel at ACM Multimedia 2012 addressed research successes in the past 20 years. While the panel focused on the past, this article discusses successes since the ACM SIGMM 2003 Retreat and suggests research directions in the next ten years. While significant progress has been made, more research is required to allow multimedia to impact our everyday computing environment. The importance of hardware changes on future research directions is discussed. We believe ubiquitous computing—meaning abundant computation and network bandwidth—should be applied in novel ways to solve multimedia grand challenges and continue the IT revolution of the past century.
Publication Details
  • DocEng 2013
  • Sep 10, 2013

Abstract

Unlike text, copying and pasting parts of video documents is challenging. Yet, the huge amount of video documents now available in the form of how-to tutorials begs for simpler techniques that allow users to easily copy and paste fragments of video materials into new documents. We describe new direct video manipulation techniques that allow users to quickly copy and paste content from video documents such as how-to tutorials into a new document. While the video plays, users interact with the video canvas to select text regions, scrollable regions, slide sequences built up across many frames, or semantically meaningful regions such as dialog boxes. Instead of relying on the timeline to accurately select sub-parts of the video document, users navigate using familiar selection techniques such as mouse-wheel to scroll back and forward over a video region where content scrolls, double-clicks over rectangular regions to select them, or clicks and drags over textual regions of the video canvas to select them. We describe the video processing techniques that run in real-time in modern web browsers using HTML5 and JavaScript; and show how they help users quickly copy and paste video fragments into new documents, allowing them to efficiently reuse video documents for authoring or note-taking.
Publication Details
  • CBDAR 2013
  • Aug 23, 2013

Abstract

Capturing book images is more convenient with a mobile phone camera than with more specialized flat-bed scanners or 3D capture devices. We built an application for the iPhone 4S that captures a sequence of hi-res (8 MP) images of a page spread as the user sweeps the device across the book. To do the 3D dewarping, we implemented two algorithms: optical flow (OF) and structure from motion (SfM). Making further use of the image sequence, we examined the potential of multi-frame OCR. Preliminary evaluation on a small set of data shows that OF and SfM had comparable OCR performance for both single-frame and multi-frame techniques, and that multi-frame was substantially better than single-frame. The computation time was much less for OF than for SfM.

SearchPanel: A browser extension for managing search activity

Publication Details
  • EuroHCIR 2013
  • Aug 1, 2013

Abstract

People often use more than one query when searching for information; they also revisit search results to re-find information. These tasks are not well-supported by search interfaces and web browsers. We designed and built a Chrome browser extension that helps people manage their ongoing information seeking. The extension combines document and process metadata into an interactive representation of the retrieved documents that can be used for sense-making, for navigation, and for re-finding documents.

Looking Ahead: Query Preview in Exploratory Search

Publication Details
  • SIGIR 2013
  • Jul 28, 2013

Abstract

Exploratory search is a complex, iterative information seeking activity that involves running multiple queries and finding and examining many documents. We introduced a query preview interface that visualizes the distribution of newly-retrieved and re-retrieved documents prior to showing the detailed query results. When comparing the preview interface against a control condition, we found effects on people’s information seeking behavior as well as improved retrieval performance. People spent more time formulating a query, were more likely to explore search results more deeply, retrieved a more diverse set of documents, and found more distinct relevant documents when using the preview. With more time spent on query formulation, higher quality queries were produced and, as a consequence, the retrieval results improved; both average residual precision and recall were higher with the query preview present.
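
As a rough illustration of the residual measures mentioned in this abstract, the sketch below scores a query only on documents the searcher has not already seen in the session. This follows the usual residual-collection idea and is an assumption about how the measures are defined here; the abstract gives no formula. The function name and document ids are hypothetical, and the retrieved list is assumed to contain no duplicates.

```python
def residual_precision_recall(retrieved, relevant, seen):
    """Precision and recall over documents not yet seen in the session.

    retrieved: ordered list of document ids returned by the current query.
    relevant:  set of all relevant document ids for the task.
    seen:      set of document ids retrieved by earlier queries.
    """
    new_docs = [d for d in retrieved if d not in seen]
    remaining_relevant = relevant - seen
    hits = sum(d in remaining_relevant for d in new_docs)
    precision = hits / len(new_docs) if new_docs else 0.0
    recall = hits / len(remaining_relevant) if remaining_relevant else 0.0
    return precision, recall

precision, recall = residual_precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant={"d1", "d3", "d5"},
    seen={"d1", "d2"},
)
```

Under this reading, a query that mostly re-retrieves known documents earns little credit, which is the kind of outcome a preview of newly-retrieved versus re-retrieved documents is meant to help searchers avoid.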