Audio-visual information
An essential ingredient of the information society
is the understanding of audio-visual information.
Information gives access to an infinitely complex
world. Among all forms of information on offer,
the audio-visual form is on the rise and is set to
keep the number one position. While
television is already the most important carrier of
our culture, the importance of video as the
dominant form of information and communication will
only increase. Even knowledge may become
represented in audio-visual form.
The role of semantics
The all-important step forward in the access to
information is providing access at the semantic
level. The recipient of the information wants
nothing but an understanding of the content. With
the growing means of communication, the user is
swamped in an ocean of information, and the
situation where information was scarce will not
return. So getting access to the semantic content
is critical. In this proposal we contribute to the
quest for semantic access to information by
building a semantic audio-visual search engine.
The core of semantics is a thesaurus
The key to the engine is a large thesaurus of
detectors, much like the entries at the core of a
dictionary. The elements in such a thesaurus,
individually or
in combination in one shot, provide a semantic
understanding of the shot. In effect, semantic
understanding is the composition of the detected
items. The items may vary from pure format, like a
detected split screen, to a style like an
interview, an object like a horse, or an audio
event like a loud bang. Any one of those brings an
understanding of the current content. Hence, we aim
to build an audio, visual and mixed media
thesaurus.
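As a rough illustration of the idea above, a thesaurus can be pictured as a list of named detectors whose scores are combined into a description of one shot. This is only a minimal sketch: the `Detector` class, the `describe` function, the feature names and the 0.5 threshold are all hypothetical and not part of the proposal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a shot is summarised by a dict of precomputed feature scores.
Shot = Dict[str, float]


@dataclass(frozen=True)
class Detector:
    name: str                       # concept name, e.g. "horse"
    modality: str                   # "audio", "visual" or "mixed"
    score: Callable[[Shot], float]  # confidence in [0, 1] for one shot


def describe(shot: Shot, thesaurus: List[Detector], threshold: float = 0.5) -> List[str]:
    """Return the names of the concepts detected in the shot, strongest first."""
    scored = [(d.score(shot), d.name) for d in thesaurus]
    return [name for s, name in sorted(scored, reverse=True) if s >= threshold]


# Hypothetical toy thesaurus using the example concepts from the text.
thesaurus = [
    Detector("split screen", "visual", lambda s: s.get("vertical_edge", 0.0)),
    Detector("interview", "mixed", lambda s: min(s.get("face", 0.0), s.get("speech", 0.0))),
    Detector("horse", "visual", lambda s: s.get("horse_like", 0.0)),
    Detector("loud bang", "audio", lambda s: s.get("energy_peak", 0.0)),
]

shot = {"face": 0.9, "speech": 0.8, "energy_peak": 0.1}
print(describe(shot, thesaurus))
```

In this toy setting the description of a shot is simply the set of concepts whose detectors fire; a real system would replace the lambda functions with learned detectors.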
Towards machine learning from annotated examples
One way to build a thesaurus of concepts is to
model them explicitly. A model of a chair would
start like this: a chair has one, three or, mostly,
four legs, where a leg is usually a thin vertical
structure, and so on. That is not the best way. A
better way to build a detector for a chair is to
learn it from examples, or to learn it from
context. Learning it from examples
requires quite a few annotated example images or
sounds as well as good features capable of
discriminating the background irrelevant to the
concept from the real essence of the concept.
Learning it from context requires examples of
pictures or sounds as well as annotations of the
pictorial content. As an example, a contextual chair
detector would typically detect an office, a home,
a table, or persons sitting. We are looking for a
large thesaurus of weak object and context
detectors rather than a few carefully modelled
strong detectors, because a specific answer to the
semantics of the scene can be found in the combined
result of weak detectors.
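To make the point about combining weak detectors concrete, the sketch below merges a weak object score with weak context scores for the chair example. The text does not name a combination rule; a noisy-OR is used here purely as an assumption, and the detector names, weights and scores are invented for illustration.

```python
from typing import Dict


def noisy_or(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine weak detector scores: P = 1 - prod(1 - w_i * s_i).

    Each weight w_i in [0, 1] says how strongly detector i supports the
    target concept; detectors missing from `scores` contribute nothing.
    """
    p_absent = 1.0
    for name, w in weights.items():
        p_absent *= 1.0 - w * scores.get(name, 0.0)
    return 1.0 - p_absent


# Hypothetical weights: how much each weak detector supports "chair".
chair_weights = {"chair": 0.6, "office": 0.3, "table": 0.4, "person sitting": 0.5}

# Hypothetical weak detector outputs for one shot.
scores = {"chair": 0.4, "office": 0.5, "table": 0.5, "person sitting": 0.6}

combined = noisy_or(scores, chair_weights)
print(round(combined, 3))
```

With these invented numbers, no single detector is confident, yet the combined evidence is stronger than any individual score, which is the effect the paragraph above relies on.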
Efficiently using annotations in machine learning
We use advanced machine learning tools in a generic
scheme of learning to detect audio event types,
visual object types, context and scene types, and
motion and behaviour types from examples. It is
critical to use the annotated examples of all
concepts as efficiently as possible. Therefore we
develop and use modern one-class classifiers
and active learning techniques.
The proposal integrates audio and visual
information where it is most effective.
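The two techniques named above can be sketched in a few lines. The following is a toy illustration under stated assumptions, not the classifiers the proposal will develop: `CentroidOneClass` is a hypothetical one-class model that learns only from positive examples (a centroid plus a radius), and `most_uncertain` is a simple uncertainty-sampling step that picks the unlabelled examples closest to the decision boundary for annotation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


class CentroidOneClass:
    """Toy one-class classifier: accept points within a radius of the positive centroid."""

    def fit(self, positives: List[Vector]) -> "CentroidOneClass":
        dim = len(positives[0])
        n = len(positives)
        self.centroid = [sum(p[i] for p in positives) / n for i in range(dim)]
        # Radius chosen so that every positive training example is accepted.
        self.radius = max(math.dist(p, self.centroid) for p in positives)
        return self

    def margin(self, x: Vector) -> float:
        """Positive inside the boundary, negative outside."""
        return self.radius - math.dist(x, self.centroid)

    def predict(self, x: Vector) -> bool:
        return self.margin(x) >= 0.0


def most_uncertain(model: CentroidOneClass, pool: List[Vector], k: int = 1) -> List[Vector]:
    """Uncertainty sampling: the k pool items closest to the decision boundary."""
    return sorted(pool, key=lambda x: abs(model.margin(x)))[:k]


# Hypothetical 2-D feature vectors of annotated positive examples.
positives = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
model = CentroidOneClass().fit(positives)

# Unlabelled pool: the item nearest the boundary is the most useful to annotate next.
pool = [(1.0, 1.0), (2.1, 2.1), (5.0, 5.0)]
print(most_uncertain(model, pool))
```

The design choice illustrated here is that annotation effort goes to the examples the current model is least sure about, so each new label is used as efficiently as possible.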
The objectives of this proposal
Building on the success of the TRECvideo
competitions of 2004 and 2005, we aim to make a
significant contribution in the quest for semantic
video search engines. This proposal will provide a
semantic video search engine containing the
most crucial components for video retrieval: video
information management, semantic audio-visual
analysis and access, learning and mining the video
data, machine interaction, and
visualization/streaming of video information. To
integrate these components, a machine learning
system is built as well as a runtime system. The
two systems are in part identical, but the learning
system is aimed at learning the semantic concept
detectors to be implemented in the runtime system.
Indices
are concise and semantically rich, and structured
in a visual and audio thesaurus for semantic
retrieval. The validation platform consists of:
(1) Video news
(2) Documentaries
Mobile video applications can be part of a future scenario of the current technology.