TalkMiner: A Lecture Webcast Search Engine

Abstract

This paper describes the design and implementation of a search engine for lecture webcasts. A searchable text index lets users locate material within lecture videos found on a variety of websites, such as YouTube and Berkeley webcasts. The index is built from the words on the presentation slides that appear in the video, along with any associated metadata, such as the title and abstract, when available. Each video is analyzed to identify a set of distinct slide images; OCR and lexical processing are then applied to these images to generate a list of indexable terms.
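
As a rough illustration of how slide text could become a searchable index, the following minimal sketch assumes the OCR step has already produced a text string for each slide. The names `extract_terms`, `build_index`, and `search`, the stopword list, and the minimum token length are hypothetical choices made for this example, not details taken from the system described here.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; the actual lexical filtering rules are not specified.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on"}


def extract_terms(ocr_text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]


def build_index(slides):
    """Map each term to the set of (video_id, time_sec) slides that contain it.

    `slides` is an iterable of (video_id, time_sec, ocr_text) tuples.
    """
    index = defaultdict(set)
    for video_id, time_sec, ocr_text in slides:
        for term in extract_terms(ocr_text):
            index[term].add((video_id, time_sec))
    return index


def search(index, query):
    """Return the slides that contain every term in the query."""
    terms = extract_terms(query)
    if not terms:
        return set()
    hits = [index.get(t, set()) for t in terms]
    return set.intersection(*hits)
```

Because each index entry carries the time at which the slide appears, a query result can point to a position inside the video rather than to the video as a whole.
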
Several problems were discovered when trying to identify distinct slides in the video stream. For example, picture-in-picture compositing of a speaker and a presentation slide, switching cameras, and slide builds confuse the basic frame-differencing algorithms used to extract keyframe slide images. Algorithms are described that improve slide identification. A prototype system was built to test the algorithms and the utility of the search engine. Users can browse lists of lectures, view the slides in a specific lecture, or play the lecture video. Over 10,000 lecture videos have been indexed from a variety of sources. A public website will be published in mid-2010 that allows users to experiment with the search engine.
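
For readers unfamiliar with the baseline, the sketch below shows one simple form of frame differencing for keyframe selection. It illustrates only the basic approach that the abstract says is easily confused, not the improved algorithms. Frames are assumed to be equal-sized 2D lists of grayscale values (0 to 255); the function names and both thresholds are hypothetical.

```python
def changed_fraction(a, b, pixel_delta=30):
    """Fraction of pixels whose grayscale value differs by more than pixel_delta."""
    total = 0
    changed = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            if abs(pa - pb) > pixel_delta:
                changed += 1
    return changed / total if total else 0.0


def select_keyframes(frames, change_threshold=0.1):
    """Return indices of frames that differ enough from the last kept keyframe."""
    keyframes = []
    last = None
    for i, frame in enumerate(frames):
        if last is None or changed_fraction(last, frame) > change_threshold:
            keyframes.append(i)
            last = frame
    return keyframes


# Tiny synthetic example: two distinct "slides" and a small build on the second.
slide_a = [[0] * 4 for _ in range(4)]
slide_b = [[255] * 4 for _ in range(4)]
build = [row[:] for row in slide_b]
build[0][0] = 0  # one pixel changes, below the 10% threshold

print(select_keyframes([slide_a, slide_a, slide_b, build, slide_a]))  # [0, 2, 4]
```

A single global threshold like this has the weaknesses named above: a moving speaker inset or a camera switch can change many pixels without any change of slide, while a small slide build can fall below the threshold and be missed.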