Publications

FXPAL publishes in top scientific conferences and journals.

2015
Publication Details
  • Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
  • Apr 18, 2015

Abstract

Edge targets, such as buttons or menus along the edge of a screen, are known to afford fast acquisition performance in desktop mousing environments. As the popularity of touch-based devices continues to grow, understanding the affordances of edge targets on touchscreens is needed. This paper describes results from two controlled experiments that examine in detail the effect of edge targets on performance in touch devices. Our results show that on touch devices, a target's proximity to the edge has a significant negative effect on reaction time. We examine the effect in detail and explore mitigating factors. We discuss potential explanations for the effect and propose implications for the design of efficient interfaces for touch devices.
Publication Details
  • CHI 2015 (Extended Abstracts)
  • Apr 18, 2015

Abstract

We present our ongoing research on automatic segmentation of motion gestures tracked by IMUs. We postulate that by recognizing gesture execution phases from motion data, we may be able to auto-delimit user gesture entries. We demonstrate that machine learning classifiers can be trained to recognize three distinct phases of gesture entry: the start, middle and end of a gesture motion. We further demonstrate that this type of classification can be done at the level of individual gestures. Furthermore, we describe how we captured a new data set for data exploration and discuss a tool we developed to allow manual annotations of gesture phase information. Initial results we obtained using the new data set annotated with our tool show a precision of 0.95 for recognition of the gesture phase and a precision of 0.93 for simultaneous recognition of the gesture phase and the gesture type.
Publication Details
  • CSCW 2015
  • Mar 14, 2015

Abstract

Collaboration Map (CoMap) is an interactive visualization tool showing temporal changes of small group collaborations. As dynamic entities, collaboration groups have flexible features such as people involved, areas of work, and timings. CoMap shows a graph of collaborations during user-adjustable periods, providing overviews of collaborations' dynamic features. We demonstrate CoMap with a co-authorship dataset extracted from DBLP to visualize 587 publications by 29 researchers at a research organization.

Abstract

In this paper, we report findings from a study that compared basic video-conferencing, emergent kinetic video-conferencing techniques, and face-to-face meetings. In our study, remote and co-located participants worked together in groups of three. We show, in agreement with prior literature, the strong adverse impact of being remote on participation levels. We also show that local and remote participants perceived their own contributions and those of others differently. Extending prior work, we also show that local participants exhibited significantly more overlapping speech with remote participants who used an embodied proxy than with remote participants in basic video-conferencing (and at a rate similar to overlapping speech for co-located groups). We also describe differences in how the technologies were used to follow conversation. We discuss how these findings extend our understanding of the promise and potential limitations of embodied video-conferencing solutions.

Abstract

In a variety of peer production settings, from Wikipedia to open source software development to crowdsourcing, individuals may encounter, edit, or review the work of unknown others. Typically this is done without much context to the person's past behavior or performance. To understand how exposure to an unknown individual's activity history influences attitudes and behaviors, we conducted an online experiment on Mechanical Turk varying the content, quality, and presentation of information about another Turker's work history. Surprisingly, negative work history did not lead to negative outcomes, but in contrast, a positive work history led to positive initial impressions that persisted in the face of contrary information. This work provides insight into the impact of activity history design factors on psychological and behavioral outcomes that can be of use in other related settings.
Publication Details
  • Presented in "Everyday Telepresence" workshop at CHI 2015 on Apr 18, 2015
  • Mar 3, 2015

Abstract

As video-mediated communication reaches broad adoption, improving immersion and social interaction are important areas of focus in the design of tools for exploration and work-based communication. Here we present three threads of research focused on developing new ways of enabling exploration of a remote environment and interacting with the people and artifacts therein.
Publication Details
  • IEEE Pervasive Computing (In press)
  • Mar 3, 2015

Abstract

Tutorials are one of the most fundamental means of conveying knowledge. In this paper, we present a suite of applications that allow users to combine different types of media captured from handheld, standalone, or wearable devices to create multimedia tutorials. We conducted a study comparing standalone (camera on tripod) versus wearable capture (Google Glass). The results show that tutorial authors have a slight preference for wearable capture devices, especially when recording activities involving larger objects.

Abstract

Our research focuses on improving the effectiveness and usability of driving mobile telepresence robots by increasing the user's sense of immersion during the navigation task. To this end we developed a robot platform that allows immersive navigation using head-tracked stereoscopic video and an HMD. We present the results of an initial user study that compares System Usability Scale (SUS) ratings of a robot teleoperation task using head-tracked stereo vision with a baseline fixed video feed and the effect of a low or high placement of the camera(s). Our results show significantly higher ratings for the fixed video condition and no effect of the camera placement. Future work will focus on examining the reasons for the lower ratings of stereo video and also exploring further visual navigation interfaces.
Publication Details
  • The Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15)
  • Jan 25, 2015

Abstract

A person's name is strongly influenced by his/her cultural background, such as gender and ethnicity, both vital attributes for user profiling, attribute-based retrieval, etc. Typically, the associations between names and attributes (e.g., people named "Amy" are mostly females) are annotated manually or provided by the census data of governments. We propose to associate a name and its likely demographic attributes by exploiting click-throughs between name queries and images with automatically detected facial attributes. This is the first work attempting to translate an abstract name to demographic attributes in a visual-data-driven manner, and it is adaptive to incremental data, more countries and even unseen names (the names outside the click-through data) without additional manual labels. In the experiments, the automatic name-attribute associations can help gender inference with accuracy competitive with that of manual labeling. It also benefits profiling social media users and keyword-based face image retrieval, especially by contributing a 12% relative improvement in accuracy when adapting to unseen names.
2014

Synchronizing Web Documents with Style

Publication Details
  • ACM Brazilian Symposium on Multimedia and the Web
  • Nov 17, 2014

Abstract

In this paper we report on our efforts to define a set of document extensions to Cascading Style Sheets (CSS) that allow for structured timing and synchronization of elements within a Web page. Our work considers the scenario in which the temporal structure can be decoupled from the content of the Web page in a similar way that CSS does with the layout, colors and fonts. Based on the SMIL (Synchronized Multimedia Integration Language) temporal model we propose CSS document extensions and discuss the design and implementation of a proof of concept that realizes our contributions. As HTML5 seems to move away from technologies like Flash and XML (eXtensible Markup Language), we believe our approach provides a flexible declarative solution to specify rich media experiences that is more aligned with current Web practices.
Publication Details
  • ACM International Workshop on Understanding and Modeling Multiparty, Multimodal Interactions (UMMMI)
  • Nov 15, 2014

Abstract

In this paper we discuss communication problems in video-mediated small group discussions. We present results from a study in which ad-hoc groups of five people, with a moderator, solved a quiz question-select answer style task over a video-conferencing system. The task was performed under different delay conditions, of up to 2000ms additional one-way delay. Even with a delay up to 2000ms, we could not observe any effect on the achieved quiz scores. In contrast, the subjective satisfaction was severely negatively affected. While we would have suspected a clear conversational breakdown with such a high delay, groups adapted their communication style and thus still managed to solve the task. That is, most groups decided to switch to a more explicit turn-taking scheme. We argue that future video-conferencing systems can provide a better experience if they are aware of the current conversational situation and can provide compensation mechanisms. Thus we provide an overview of what cues are relevant, how they are affected by the video-conferencing system, and how recent advancements in computational social science can be leveraged. Further, we provide an analysis of the suitability of normal webcam data for such cue recognition. Based on our observations, we suggest strategies that can be implemented to alleviate the problems.
Publication Details
  • ACM International Workshop on Socially-aware Multimedia (SAM)
  • Nov 6, 2014

Abstract

As commercial, off-the-shelf services enable people to easily connect with friends and relatives, video-mediated communication is filtering into our daily activities. With the proliferation of broadband and powerful devices, multi-party gatherings are becoming a reality in home environments. With the technical infrastructure in place and accepted by a large user base, researchers and system designers are concentrating on understanding and optimizing the Quality of Experience (QoE) for participants. Theoretical foundations for QoE have identified three crucial factors for understanding the impact on the individual’s perception: system, context, and user. While most of the current research tends to focus on the system factors (delay, bandwidth, resolution), in this paper we offer a more complete analysis that takes into consideration context and user factors. In particular, we investigate the influence of delay (constant system factor) on the QoE of multi-party conversations. Regarding the context, we extend the typical one-to-one condition to explore conversations between small groups (up to five people). In terms of user factors, we take into account conversation analysis, turn-taking and role-theory, to better understand the impact of different user profiles. Our investigation allows us to report a detailed analysis of how delay influences the QoE, concluding that the actual interactivity pattern of each participant in the conversation results in different noticeability thresholds of delays. Such results have a direct impact on how we should design and construct video-communication services for multi-party conversations, where user activity should be considered as a prime adaptation and optimization parameter.
Publication Details
  • ACM Multimedia 2014
  • Nov 2, 2014

Abstract

We propose Multi-modal Language Models (MLMs), which adapt latent variable models for text document analysis to modeling co-occurrence relationships in multi-modal data. In this paper, we focus on the application of MLMs to indexing slide and spoken text associated with lecture videos, and subsequently employ a multi-modal probabilistic ranking function for lecture video retrieval. The MLM achieves highly competitive results against well-established retrieval methods such as the Vector Space Model and Probabilistic Latent Semantic Analysis. Retrieval performance with MLMs is also shown to improve with the quality of the available extracted spoken text.
Publication Details
  • ACM Multimedia Workshop on Geotagging and Its Applications in Multimedia
  • Nov 2, 2014

Abstract

We present a method for profiling businesses at specific locations that is based on mining information from social media. The method matches geo-tagged tweets from Twitter against venues from Foursquare to identify the specific business mentioned in a tweet. By linking geo-coordinates to places, the tweets associated with a business, such as a store, can then be used to profile that business. We used a sentiment estimator developed for tweets to create sentiment profiles of the stores in a chain, computing the average sentiment of tweets associated with each store. We present the results as heatmaps which show how sentiment differs across stores in the same chain and how some chains have more positive sentiment than other chains. We also created profiles of social group size for businesses and show sample heatmaps illustrating how the size of a social group can vary.
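The store-level sentiment profile described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes each tweet already carries a sentiment score, and it links a tweet to a venue purely by nearest distance within a cutoff. The names `haversine_m`, `nearest_venue`, `sentiment_profiles`, the dictionary keys, and the 50 m default radius are all hypothetical.

```python
import math
from collections import defaultdict


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_venue(tweet, venues, max_dist_m=50.0):
    """Return the id of the closest venue within max_dist_m, or None.

    Assumption: distance alone decides the match; a real system might
    also compare the tweet text with the venue name.
    """
    best_id, best_d = None, max_dist_m
    for v in venues:
        d = haversine_m(tweet["lat"], tweet["lon"], v["lat"], v["lon"])
        if d <= best_d:
            best_id, best_d = v["id"], d
    return best_id


def sentiment_profiles(tweets, venues, max_dist_m=50.0):
    """Average the precomputed sentiment of the tweets linked to each venue."""
    scores = defaultdict(list)
    for t in tweets:
        vid = nearest_venue(t, venues, max_dist_m)
        if vid is not None:  # tweets far from every venue are ignored
            scores[vid].append(t["sentiment"])
    return {vid: sum(s) / len(s) for vid, s in scores.items()}
```

The resulting per-venue averages could then be plotted on a map, one value per store, to obtain a heatmap like the ones the abstract mentions.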

On Aesthetics and Emotions in Scene Images: A Computational Perspective.

Publication Details
  • Book: Scene Vision, MIT Press, (Editors Kestas Kveraga and Moshe Bar).
  • Nov 1, 2014

Abstract

In this chapter, we discuss the problem of computational inference of aesthetics and emotions from images. We draw inspiration from diverse disciplines such as philosophy, photography, art, and psychology to define and understand the key concepts of aesthetics and emotions. We introduce the primary computational problems that the research community has been striving to solve and the computational framework required for solving them. We also describe datasets available for performing assessment and outline several real-world applications where research in this domain can be employed. This chapter discusses the contributions of a significant number of research articles that have attempted to solve problems in aesthetics and emotion inference in the last several years. We conclude the chapter with directions for future research. Here’s a link to the book.
http://mitpress.mit.edu/books/scene-vision
Publication Details
  • UIST 2014
  • Oct 5, 2014

Abstract

Video Text Retouch is a technique for retouching textual content found in many online videos such as screencasts, recorded presentations and many online e-learning videos. Viewed through our special, HTML5-based player, users can edit in real-time the textual content of the video frames, such as correcting typos or inserting new words between existing characters. Edits are overlaid and tracked at the desired position for as long as the original video content remains similar. We describe the interaction techniques, image processing algorithms and give implementation details of the system.

Abstract

It is now possible to develop head-mounted devices (HMDs) that allow for ego-centric sensing of mid-air gestural input. Therefore, we explore the use of HMD-based gestural input techniques in smart space environments. We developed a usage scenario to evaluate HMD-based gestural interactions and conducted a user study to elicit qualitative feedback on several HMD-based gestural input techniques. Our results show that for the proposed scenario, mid-air hand gestures are preferred to head gestures for input and rated more favorably compared to non-gestural input techniques available on existing HMDs. Informed by these study results, we developed a prototype HMD system that supports gestural interactions as proposed in our scenario. We conducted a second user study to quantitatively evaluate our prototype comparing several gestural and non-gestural input techniques. The results of this study show no clear advantage or disadvantage of gestural inputs vs. non-gestural input techniques on HMDs. We did find that voice control as (sole) input modality performed worst compared to the other input techniques we evaluated. Lastly, we present two further applications implemented with our system, demonstrating 3D scene viewing and ambient light control. We conclude by briefly discussing the implications of ego-centric vs. exo-centric tracking for interaction in smart spaces.
Publication Details
  • IEEE Transactions on Multimedia
  • Sep 30, 2014

Abstract

3D Tele-immersion enables participants in remote locations to share, in real-time, an activity. It offers users interactive and immersive experiences, but it challenges current media streaming solutions. Work in the past has mainly focused on the efficient delivery of image-based 3D videos and on realistic rendering and reconstruction of geometry-based 3D objects. The contribution of this paper is a real-time streaming component for 3D Tele-Immersion with dynamic reconstructed geometry. This component includes both a novel fast compression method and a rateless packet protection scheme specifically designed towards the requirements imposed by real-time transmission of live-reconstructed mesh geometry. Tests on a large dataset show an encoding speed-up of up to 10 times at comparable compression ratio and quality, when compared to the high-end MPEG-4 SC3DMC mesh encoders. The implemented rateless code ensures complete packet loss protection of the triangle mesh object and a delivery delay within interactive bounds. Contrary to most linear fountain codes, the designed codec enables real-time progressive decoding, allowing partial decoding each time a packet is received. This approach is compared to transmission over TCP in packet loss rates and latencies, typical in managed WAN and MAN networks, and heavily outperforms it in terms of end-to-end delay. The streaming component has been integrated into a larger 3D Tele-Immersive environment that includes state-of-the-art 3D reconstruction and rendering modules. This resulted in a prototype that can capture, compress, transmit, and render triangle mesh geometry in real-time in realistic internet conditions as shown in experiments. Compared to alternative methods, lower interactive end-to-end delay and frame rates over 3 times higher are achieved.
Publication Details
  • MobileHCI 2014 (Industrial Case Study)
  • Sep 23, 2014

Abstract

Telepresence systems usually lack mobility. Polly, a wearable telepresence device, allows users to explore remote locations or experience events remotely by means of a person that serves as a mobile "guide". We built a series of hardware prototypes and our current, most promising embodiment consists of a smartphone mounted on a stabilized gimbal that is wearable. The gimbal enables remote control of the viewing angle as well as providing active image stabilization while the guide is walking. We present qualitative findings from a series of 8 field tests using either Polly or only a mobile phone. We found that guides felt more physical comfort when using Polly vs. a phone and that Polly was accepted by other persons at the remote location. Remote participants appreciated the stabilized video and ability to control camera view. Connection and bandwidth issues appear to be the most challenging issues for Polly-like systems.
Publication Details
  • MobileHCI 2014 (Full Paper)
  • Sep 23, 2014

Abstract

Secure authentication with devices or services that store sensitive and personal information is highly important. However, traditional password and PIN-based authentication methods compromise between the level of security and user experience. AirAuth is a biometric authentication technique that uses in-air gesture input to authenticate users. We evaluated our technique on a predefined (simple) gesture set and our classifier achieved an average accuracy of 96.6% in an equal error rate (EER-)based study. We obtained an accuracy of 100% when exclusively using personal (complex) user gestures. In a further user study, we found that AirAuth is highly resilient to video-based shoulder surfing attacks, with a measured false acceptance rate of just 2.2%. Furthermore, a longitudinal study demonstrates AirAuth’s repeatability and accuracy over time. AirAuth is relatively simple, robust and requires only a low amount of computational power and is hence deployable on embedded or mobile hardware. Unlike traditional authentication methods, our system’s security is positively aligned with user-rated pleasure and excitement levels. In addition, AirAuth attained acceptability ratings in personal, office, and public spaces that are comparable to an existing stroke-based on-screen authentication technique. Based on the results presented in this paper, we believe that AirAuth shows great promise as a novel, secure, ubiquitous, and highly usable authentication method.
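For readers unfamiliar with the metrics named in this abstract, the sketch below shows one generic way to compute a false acceptance rate (FAR), a false rejection rate (FRR), and an approximate equal error rate operating point from match distances. It is not AirAuth's classifier, which the abstract does not specify. It assumes each attempt yields a distance to the enrolled template and is accepted when that distance is at or below a threshold; the function names and the sample numbers are hypothetical.

```python
def far_frr(genuine, impostor, threshold):
    """Error rates when an attempt is accepted if its distance <= threshold.

    genuine:  distances from the legitimate user's attempts
    impostor: distances from attackers' attempts
    Both lists are assumed to be non-empty.
    """
    far = sum(d <= threshold for d in impostor) / len(impostor)  # impostors accepted
    frr = sum(d > threshold for d in genuine) / len(genuine)     # genuine users rejected
    return far, frr


def equal_error_rate(genuine, impostor):
    """Scan every observed distance as a candidate threshold.

    Returns (threshold, far, frr) for the threshold where FAR and FRR
    are closest; with finite samples they may not be exactly equal.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, far, frr)
    _, t, far, frr = best
    return t, far, frr


if __name__ == "__main__":
    genuine = [0.1, 0.2, 0.3, 0.6]   # made-up sample distances
    impostor = [0.4, 0.5, 0.7, 0.8]
    print(equal_error_rate(genuine, impostor))
```

Lowering the threshold trades a lower FAR for a higher FRR, which is the security versus usability tension the abstract refers to.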

Asymmetric Delay in Video-Mediated Group Discussions

Publication Details
  • International Workshop on Quality of Multimedia Experience (QoMEX)
  • Sep 18, 2014

Abstract

Delay has been found to be one of the most crucial factors determining the Quality of Experience (QoE) in synchronous video-mediated communication. The effect has been extensively studied for dyadic conversations, and recently the study of small group communications has become the focus of the research community. Contrary to dyads, in which the delay is symmetrically perceived, this is not the case for groups. Due to the heterogeneous structure of the internet, asymmetric delays between participants are likely to occur.
Publication Details
  • DocEng 2014
  • Sep 16, 2014

Abstract

Distributed teams must co-ordinate a variety of tasks. To do so they need to be able to create, share, and annotate documents as well as discuss plans and goals. Many workflow tools support document sharing, while other tools support videoconferencing; however, there exists little support for connecting the two. In this work we describe a system that allows users to share and mark up content during web meetings. This shared content can provide important conversational props within the context of a meeting; it can also help users review archived meetings. Users can also extract shared content from meetings directly into other workflow tools.
Publication Details
  • Assistive Computer Vision and Robotics Workshop of ECCV
  • Sep 12, 2014

Abstract

Polly is an inexpensive, portable telepresence device based on the metaphor of a parrot riding a guide's shoulder and acting as proxy for remote participants. Although remote users may be anyone with a desire for 'tele-visits', we focus on limited mobility users. We present a series of prototypes and field tests that informed design iterations. Our current implementations utilize a smartphone on a stabilized, remotely controlled gimbal that can be hand held, placed on perches or carried by a wearable frame. We describe findings from trials at campus, museum and faire tours with remote users, including quadriplegics. We found guides were more comfortable using Polly than a phone and that Polly was accepted by other people. Remote participants appreciated stabilized video and having control of the camera. One challenge is negotiation of movement and view control. Our tests suggest Polly is an effective alternative to telepresence robots, phones or fixed cameras.

Abstract

In recent years, there has been an explosion of social and collaborative applications that leverage location to provide users novel and engaging experiences. Current location technologies work well outdoors but fare poorly indoors. In this paper we present LoCo, a new framework that can provide highly accurate room-level location using a supervised classification scheme. We provide experiments that show this technique is orders of magnitude more efficient than current state-of-the-art Wi-Fi localization techniques. Low classification overhead and computational footprint make classification practical and efficient even on mobile devices. Our framework has also been designed to be easily deployed and leveraged by developers to help create a new wave of location-driven applications and services.
Publication Details
  • International Journal of Multimedia Information Retrieval Special Issue on Cross-Media Analysis
  • Sep 4, 2014

Abstract

Media Embedded Target, or MET, is an iconic mark printed in a blank margin of a page that indicates a media link is associated with a nearby region of the page. It guides the user to capture the region and thus retrieve the associated link through visual search within indexed content. The target also serves to separate page regions with media links from other regions of the page. The capture application on the cell phone displays a sight having the same shape as the target near the edge of a camera-view display. The user moves the phone to align the sight with the target printed on the page. Once the system detects correct sight-target alignment, the region in the camera view is captured and sent to the recognition engine which identifies the image and causes the associated media to be displayed on the phone. Since target and sight alignment defines a capture region, this approach saves storage by only indexing visual features in the predefined capture region, rather than indexing the entire page. Target-sight alignment assures that the indexed region is fully captured. We compare the use of MET for guiding capture with two standard methods: one that uses a logo to indicate that media content is available and text to define the capture region and another that explicitly indicates the capture region using a visible boundary mark.