Presentation Video Retrieval using Automatically Recovered Slide and Spoken Text


Video is becoming a prevalent medium for e-learning. Lecture videos contain useful information in both the visual and aural channels: the presentation slides and the lecturer’s speech, respectively. To extract the visual information, we apply video content analysis to detect slides and optical character recognition (OCR) to obtain their text. Similarly, we use automatic speech recognition (ASR) to extract spoken text from the recorded audio. These two text sources have distinct characteristics and relative strengths for video retrieval. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Experiments reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Additional experiments demonstrate higher-precision video retrieval using automatically extracted slide text.
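As a rough illustration of the comparisons described above, the following sketch computes term-level accuracy of extracted text against ground truth and the term overlap between two text sources. It is a minimal sketch under stated assumptions, not the evaluation procedure used in the experiments: the tokenizer, the set-based recall and precision measures, the Jaccard overlap, and all function names and sample strings are hypothetical choices made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_recall(extracted, ground_truth):
    """Fraction of distinct ground-truth terms found in the extracted text."""
    truth = set(tokenize(ground_truth))
    if not truth:
        return 0.0
    return len(truth & set(tokenize(extracted))) / len(truth)


def term_precision(extracted, ground_truth):
    """Fraction of distinct extracted terms that occur in the ground truth."""
    found = set(tokenize(extracted))
    if not found:
        return 0.0
    return len(found & set(tokenize(ground_truth))) / len(found)


def jaccard_overlap(text_a, text_b):
    """Jaccard similarity between the term sets of two text sources."""
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sample data: an OCR result with one misrecognized word,
# its manual transcription, and an ASR transcript of the same segment.
slide_truth = "Hidden Markov Models for speech"
slide_ocr = "Hidden Markov Modes for speech"
spoken_asr = "today we talk about hidden markov models"

print(term_recall(slide_ocr, slide_truth))
print(term_precision(slide_ocr, slide_truth))
print(jaccard_overlap(slide_ocr, spoken_asr))
```

Set-based measures ignore term frequency and word order; a word error rate based on edit distance would be a natural alternative for measuring ASR accuracy.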