The results expected from the project are outlined below.

Audio-visual information

An essential ingredient of the information society is the understanding of audio-visual information. Information gives access to an infinitely complex world. Of all the forms of information on offer, the audio-visual form is on the rise and is unlikely ever to lose its leading position. Television is already the most important carrier of our culture, and the importance of video as the dominant form of information and communication will only increase. Even knowledge may come to be represented in audio-visual form.

The role of semantics

The all-important step forward in access to information is providing access at the semantic level. The recipient of the information wants nothing but an understanding of the content. As the means of communication multiply, the user is swamped in an ocean of information, and the situation in which information was scarce will not return. Getting access to the semantic content is therefore critical. In this proposal we contribute to the quest for semantic access to information by building a semantic audio-visual search engine.

The core of semantics is a thesaurus

The key component of the engine is a large thesaurus of detectors, which plays the same role as the word list at the core of a dictionary. The elements of such a thesaurus, individually or in combination within one shot, provide a semantic understanding of the shot. In effect, semantic understanding is the composition of the detected items. The items may range from a pure format, like a detected split screen, to a style, like an interview, an object, like a horse, or an audio event, like a loud bang. Any one of those contributes to an understanding of the current content. Hence, we aim to build an audio, visual and mixed-media thesaurus.
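As a minimal sketch of this idea, the thesaurus can be pictured as a mapping from concept names to detectors, each of which scores a shot; the description of the shot is then the set of concepts whose score passes a threshold. The `Shot` container, the toy detectors that simply read precomputed scores, and the threshold of 0.5 are illustrative assumptions and not part of the proposed engine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Shot:
    """Hypothetical container for precomputed per-shot scores."""
    features: Dict[str, float] = field(default_factory=dict)


# A detector maps a shot to a confidence in [0, 1].
Detector = Callable[[Shot], float]


def feature_detector(name: str) -> Detector:
    """Toy detector: reads a precomputed score, defaulting to 0.0."""
    return lambda shot: shot.features.get(name, 0.0)


# Thesaurus: concept name -> detector (format, style, object, audio event).
THESAURUS: Dict[str, Detector] = {
    "split screen": feature_detector("split_screen"),
    "interview": feature_detector("interview"),
    "horse": feature_detector("horse"),
    "loud bang": feature_detector("loud_bang"),
}


def describe(shot: Shot, threshold: float = 0.5) -> List[str]:
    """Return the concepts detected in the shot, in alphabetical order."""
    return sorted(
        concept
        for concept, detector in THESAURUS.items()
        if detector(shot) >= threshold
    )
```

A real detector would compute its score from the audio and video signal; here the scores are supplied by hand only to show how individual detections compose into a description of the shot.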

Machine learning from annotated examples

One way to build a thesaurus of concepts is to model them. A model of a chair would start along these lines: a chair has one, three or, most often, four legs, where a leg is usually a thin vertical structure, and so on. That is not the best way. A better way to build a detector for a chair is to learn it from examples, or to learn it from context. Learning it from examples requires quite a few annotated example images or sounds, as well as good features capable of discriminating the real essence of the concept from the background irrelevant to it. Learning it from context requires examples of pictures or sounds as well as annotations of the pictorial content. For example, a contextual chair detector would typically detect an office, a home, a table, or persons sitting. We are looking for a large thesaurus of weak object and context detectors rather than a few carefully modelled strong detectors, because the combined result of weak detectors can yield a specific answer to the semantics of the scene.
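To illustrate how weak detectors might combine into a more specific answer, the sketch below sums the log-odds of several contextual detector scores and maps the total back to a probability. This is one simple combination rule, chosen here for illustration and assuming roughly independent detectors; the detector names, the weights, and the rule itself are assumptions rather than the method of the proposal. In practice the weights would be learned from annotated examples.

```python
import math
from typing import Dict


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def combine(scores: Dict[str, float],
            weights: Dict[str, float],
            bias: float = 0.0) -> float:
    """Weighted sum of detector log-odds, squashed back to a probability.

    A detector missing from `scores` is treated as uninformative (0.5).
    """
    z = bias
    for name, weight in weights.items():
        # Clamp to avoid infinite log-odds at exactly 0 or 1.
        p = min(max(scores.get(name, 0.5), 1e-6), 1.0 - 1e-6)
        z += weight * logit(p)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for a contextual "chair" detector.
CHAIR_CONTEXT = {"office": 1.0, "table": 1.0, "person sitting": 1.0}
```

With these weights, three weak cues that each score 0.7 yield a combined score above 0.7, which is the sense in which several weak detectors can together be more decisive than any one of them.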

Efficiently using annotations in machine learning

We use advanced machine learning tools in a generic scheme for learning to detect audio event types, visual object types, context and scene types, and motion and behaviour types from examples. It is critical to use the annotated examples of all concepts as efficiently as possible. Therefore we develop and use modern one-class classifiers and active learning techniques. The proposal integrates audio and visual information where it is most effective.
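One common way to spend annotation effort efficiently is uncertainty sampling, an active learning strategy in which the annotator is asked to label the examples the current classifier is least sure about. The sketch below shows only the selection step; the function name `select_queries`, the `predict` callback and the `budget` parameter are hypothetical, and the proposal does not state which active learning strategy is used.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_queries(pool: Sequence[T],
                   predict: Callable[[T], float],
                   budget: int) -> List[T]:
    """Pick the `budget` unlabeled items whose predicted probability
    of containing the concept is closest to 0.5 (most uncertain)."""
    ranked = sorted(pool, key=lambda item: abs(predict(item) - 0.5))
    return ranked[:budget]
```

In a full loop, the selected items would be annotated, added to the training set, and the detector retrained before the next round of selection.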

The objectives of this proposal

Building on the success of the TRECvideo competitions of 2004 and 2005, we aim to make a significant contribution to the quest for semantic video search engines. This proposal will provide a semantic video search engine containing the most crucial components for video retrieval: video information management, semantic audio-visual analysis and access, learning and mining the video data, machine interaction, and visualization/streaming of video information. To integrate these components, a machine learning system is built as well as a runtime system. The two systems are in part identical, but the learning system is aimed at learning the semantic concept detectors that are to be implemented in the runtime system. Indices are concise and semantically rich, and structured in a visual and audio thesaurus for semantic retrieval. The validation platform consists of:
(1) Video news
(2) Documentaries
Mobile video applications may be part of a future scenario for the current technology.
