Image and Video Captioning

Reviewing and navigating image and video content is generally time-consuming. The lack of semantic structure and compactness is especially apparent for long, continuous videos and greatly increases the effort of exploring video content. To alleviate this inefficiency, we developed a method for targeted video captioning, which captions the highlights of a video. A general neural network architecture jointly considers two supervisory signals (i.e., an image-based video summary and text-based video captions) in the training phase and generates both a video summary and corresponding captions for a given video in the test phase. Jointly modeling the video summarization and video captioning tasks offers a novel end-to-end solution that generates a captioned video summary, enabling users to index and navigate the highlights in a video. Example applications that could be built on this method include captioning abnormal events in a surveillance video (e.g., generating text alerts for crashes or fights) and captioning a target person's activities in a crowd, sports, or family video (e.g., featuring a kid or a couple, and allowing each person in the video to be featured separately).
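
As a rough illustration of this joint setup, the sketch below pairs a shared temporal encoder with two heads: one predicting per-frame importance scores (the summary signal) and one decoding a caption (the text signal), trained with a combined loss. The module layout, dimensions, and loss weighting are illustrative assumptions, not the published architecture.

# Minimal sketch of the joint summarization + captioning idea described above.
# Module names, dimensions, and the simple 0.5 loss weight are assumptions.
import torch
import torch.nn as nn

class JointSummaryCaptionModel(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Shared temporal encoder over per-frame CNN features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Head 1: per-frame importance scores (video summary supervision).
        self.summary_head = nn.Linear(hidden_dim, 1)
        # Head 2: caption decoder over the encoded video (caption supervision).
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, caption_in):
        enc_out, (h, c) = self.encoder(frame_feats)             # (B, T, H)
        frame_scores = self.summary_head(enc_out).squeeze(-1)   # (B, T)
        dec_out, _ = self.decoder(self.embed(caption_in), (h, c))
        word_logits = self.word_head(dec_out)                   # (B, L, V)
        return frame_scores, word_logits

# Training combines both supervisory signals on dummy data.
model = JointSummaryCaptionModel()
frame_feats = torch.randn(2, 30, 2048)           # per-frame features
caption_in = torch.randint(0, 10000, (2, 12))    # caption tokens (input)
caption_out = torch.randint(0, 10000, (2, 12))   # caption tokens (target)
summary_labels = torch.rand(2, 30)               # frame-importance labels

scores, logits = model(frame_feats, caption_in)
loss = nn.functional.binary_cross_entropy_with_logits(scores, summary_labels) \
     + 0.5 * nn.functional.cross_entropy(logits.reshape(-1, 10000), caption_out.reshape(-1))
loss.backward()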

Related Publications

2017
Publication Details
  • British Machine Vision Conference (BMVC) 2017
  • Sep 4, 2017

Abstract

Video summarization and video captioning are considered two separate tasks in existing studies. For longer videos, automatically identifying the important parts of video content and annotating them with captions will enable a richer and more concise condensation of the video. We propose a general neural network architecture that jointly considers two supervisory signals (i.e., an image-based video summary and text-based video captions) in the training phase and generates both a video summary and corresponding captions for a given video in the test phase. Our main idea is that the summary signals can help a video captioning model learn to focus on important frames. On the other hand, caption signals can help a video summarization model to learn better semantic representations. Jointly modeling both the video summarization and the video captioning tasks offers a novel end-to-end solution that generates a captioned video summary enabling users to index and navigate through the highlights in a video. Moreover, our experiments show the joint model can achieve better performance than state-of-the-art approaches in both individual tasks.
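
One concrete way the summary signal could help the captioning side, as the abstract describes, is to let predicted frame-importance scores weight the frame features that the caption decoder conditions on. The pooling scheme below is a hedged sketch of that idea, not the paper's exact mechanism.

# Sketch: pool frame features into a context vector weighted by the
# summary head's importance scores, so important frames dominate what
# the caption decoder sees. The softmax weighting is an assumption.
import torch

def importance_weighted_context(frame_feats, frame_scores):
    """frame_feats: (B, T, D) per-frame features
    frame_scores: (B, T) unnormalized importance logits from the summary head
    """
    weights = torch.softmax(frame_scores, dim=1)             # (B, T)
    context = (weights.unsqueeze(-1) * frame_feats).sum(1)   # (B, D)
    return context

frame_feats = torch.randn(2, 30, 512)
frame_scores = torch.randn(2, 30)
context = importance_weighted_context(frame_feats, frame_scores)
print(context.shape)  # torch.Size([2, 512])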
2016
Publication Details
  • ICME 2016
  • Jul 11, 2016

Abstract

Captions are a central component of image posts that communicate the background story behind photos. Captions can enhance engagement with audiences and are therefore critical to campaigns and advertisements. Previous studies in image captioning either rely solely on image content or summarize multiple web documents related to the image's location; both neglect users' activities. We propose business-aware latent topics, which represent user activities, as a new contextual cue for image captioning. The idea is to learn the typical activities of people who posted images from business venues with similar categories (e.g., fast food restaurants) to provide appropriate context for similar topics (e.g., burgers) in new posts. User activities are modeled via a latent topic representation. In turn, the image captioning model can generate sentences that better reflect user activities at business venues. In our experiments, the business-aware latent topics are more effective than existing baselines at adapting captions to images captured at various business venues. Moreover, they complement other contextual cues (image, time) in a multi-modal framework.
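
A minimal sketch of what business-aware latent topics could look like in practice: captions from venues sharing a business category are pooled, a standard topic model is fit over them, and the topic distribution of a new post becomes an extra contextual feature for the captioning model. The toy data, category names, and topic count below are assumptions for illustration, not the paper's pipeline.

# Assumed input: past captions grouped by business-venue category.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

captions_by_category = {
    "fast_food": ["best burger and fries in town", "late night burger run"],
    "coffee_shop": ["latte art and a quiet morning", "espresso before work"],
}
docs = [" ".join(caps) for caps in captions_by_category.values()]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Learn latent topics capturing typical activities per business category.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# For a new post, its topic distribution serves as an extra feature vector
# fed to the captioning model alongside other cues (image, time).
new_post = vectorizer.transform(["grabbing a burger with friends"])
topic_features = lda.transform(new_post)   # shape: (1, n_components)
print(topic_features)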