Project details

VIDI-VIDEO
Interactive semantic video search with a large thesaurus
of machine-learned audio-visual concepts

February 2007 – January 2010
Specific Targeted Research Project
Framework Programme 6

The VIDI-Video Project consists of 9 work packages

0 Fact Sheet & dissemination material

Click here to open a downloadable fact sheet (.pdf document)

Click here to open a downloadable poster (656 Kb, reduced to A4 format, .pdf document)

Click here to open a downloadable brochure (756 Kb, .pdf document)

1 Management

OBJECTIVE
To manage the work in the project by planning, monitoring and evaluating intermediate results, and to capitalize on the experience of the project for the exploitation of results.

Technical coordination
To manage the project and assure smooth realization of the key deliverables, based on timely delivery of all deliverables, the director and all WP-leaders will check the technical progress of the project against the planned schedule on a regular basis. They will ensure that the deliverables are properly produced in due time and in accordance with the quality plan. Risks, as identified in Tables 3 and 4, will be tracked and assessed at specific milestones. To assure that risks can be identified early, a number of feedback mechanisms have been built into the work plan. These are listed in Table 3. If the work plan or the project objectives could be affected at such a milestone, the executive committee will take the appropriate measures to re-define the work plan or project objectives accordingly, with the approval of EC-representatives and the governing board, and in conformity with contractual aspects. To solve technical issues that may affect the whole project or specific work packages, technical meetings may be organized.

Synthesis of the project
To capitalize on the project experience and knowledge, so that they can be re-used after the project, a number of different points of view will be considered: management, technology studies, dissemination, and post-project planning. These will increase the value of the VIDI-Video project.

Exploitation strategy development
The exploitation opportunities identified in Section 6.4, above, will be evaluated in the course of the project and a final exploitation strategy will be developed towards the end of the project.

2 Video processing

OBJECTIVE
The objective of this work package is to pre-process the video stream so as to transform it from a single stream to a set of elementary pieces of audio and visual information suitable for a fine granularity approach to video processing, analysis and retrieval. Specific objectives include:

(i) Development of a robust and efficient method for the temporal segmentation of video into elementary image sequences (shots).
(ii) Development of a method for shot summarization by means of eliminating intra-shot temporal redundancy and building a model of the background scene.
(iii) Development of a method for shot grouping by exploiting inter-shot temporal redundancy and audio cues.
(iv) Development of methods for the segmentation of the audio stream.

TASKS
To reach the objective, three main pre-processing tasks are integrated in this work package:

(i) Temporal video segmentation to elementary image sequences (shots), and
(ii) Summarization of the latter, by means of eliminating intra-shot temporal redundancy, and grouping, by exploiting inter-shot temporal redundancy and audio cues.
(iii) Audio segmentation builds an acoustic representation creating a set of homogeneous segments with a complete description of their contents, in terms of acoustic background, presence/absence of speech, speaker gender, and speaker identification.

TASK 2.1 VIDEO SEGMENTATION - YIANNIS KOMPATSIARIS, CERTH

Shots
Temporal video segmentation aims to partition the video into elementary image sequences termed shots. A shot is defined as a set of consecutive frames taken without interruption by a single camera.

Shot segmentation
Temporal segmentation into shots is usually performed on uncompressed video, typically by means of pair-wise pixel comparisons between successive or distant frames or by comparing the colour histograms corresponding to different frames. Other approaches to shot segmentation include block-wise comparisons, edge-based methods, and methods comparing motion features at different time instances [Lienhart 1999]. Probabilistic models such as HMMs, data clustering techniques such as fuzzy c-means and deterministic annealing, and representations of edge and texture information as a function of compressed video DCT coefficients are likely to serve as building blocks of the method to be developed. Once abrupt shot transitions are detected, the evaluation of the intra-shot temporal redundancy (by defining a distortion function in a suitable feature space) will iteratively serve as feedback to the shot segmentation algorithm, allowing the detection of gradual transitions as well and thus the refinement of the shot segmentation result. In this task we will develop and evaluate effective shot segmentation algorithms suited for our video collection.
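The histogram comparison mentioned above can be illustrated with a minimal sketch. It is not the method to be developed in this task: it only detects abrupt cuts, it assumes each frame is given as a flat sequence of grey levels in 0..255, and the function names, the number of bins and the threshold are all hypothetical choices made for the example.

```python
from typing import List, Sequence

Frame = Sequence[int]  # assumption: a frame is a flat sequence of grey levels in 0..255


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Normalised grey-level histogram of one frame."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = float(len(frame))
    return [c / total for c in counts]


def histogram_distance(a: Frame, b: Frame, bins: int = 8) -> float:
    """Half the L1 distance between the two histograms; lies in [0, 1]."""
    ha, hb = histogram(a, bins), histogram(b, bins)
    return 0.5 * sum(abs(x - y) for x, y in zip(ha, hb))


def detect_cuts(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Return every index i at which frame i is taken to start a new shot."""
    return [
        i
        for i in range(1, len(frames))
        if histogram_distance(frames[i - 1], frames[i]) > threshold
    ]
```

Gradual transitions spread the change over many frames and would stay below a single-pair threshold, which is why the text proposes feedback from the intra-shot redundancy to refine the result.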

TASK 2.2 SHOT REPRESENTATION, KEY FRAMES - YIANNIS KOMPATSIARIS, CERTH; THEO GEVERS, UVA

Shot representation
Following temporal segmentation, the shots need to be represented in a more compact form. They also need to be grouped into sets of similar shots even if they are temporally separated. A shot is represented by key frame(s) plus the motion pattern in the shot.

Key frame selection
The amount of motion determines the number of key frames per shot. Shots which have panning, tilting, or zooming camera motion are represented by a mosaic key frame if needed to understand the background. The key frame is selected either as the frame midway through the shot or as the frame that is visually most similar to the rest. To realize a compact shot representation, we will adopt a hybrid method combining temporal, visual, and motion information. This will be based on initially selecting a key frame, based on temporal and visual information, and subsequently extending the initial frame to a mosaic that contains all entities of the scene.
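The two simple selection rules named above (the midway frame, or the frame most similar to the rest) can be sketched as follows. This is an illustration under assumptions: each frame is assumed to be summarised by a feature vector such as a colour histogram, similarity is measured with an L1 distance, and the function names are hypothetical. The mosaic extension is not covered.

```python
from typing import Sequence

Feature = Sequence[float]  # assumption: one feature vector (e.g. a histogram) per frame


def middle_key_frame(num_frames: int) -> int:
    """Index of the frame midway through the shot."""
    return num_frames // 2


def l1_distance(a: Feature, b: Feature) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def most_similar_key_frame(features: Sequence[Feature]) -> int:
    """Index of the frame with the smallest total distance to all other frames."""
    totals = [sum(l1_distance(f, g) for g in features) for f in features]
    return totals.index(min(totals))
```

The second rule costs a quadratic number of comparisons per shot, so for long shots it would probably be applied to a temporally subsampled set of frames.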

Grouping
Similar shots are grouped into sets with the aim of achieving semantic coherence of shots. This is achieved by clustering shots within a chosen temporal distance on the basis of visual features of the shots. We will use a fuzzy c-means algorithm augmented with a process for evaluating the entropy of the resulting clusters to estimate the best number of clusters. A key element here is the quality of the features. To this end, we will use visual and audio cues.
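A minimal sketch of fuzzy c-means with an entropy-based choice of the number of clusters is given below. It rests on several assumptions that are not in the text: each shot is represented by one feature vector, the entropy used is the standard partition entropy of the membership matrix, the initial centres are taken deterministically from the input, and the temporal-distance constraint is ignored. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]  # assumption: one feature vector per shot


def _sqdist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _memberships(points, centres, m: float) -> List[List[float]]:
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    rows = []
    for p in points:
        # Floor the squared distance so a point lying on a centre cannot divide by zero.
        inv = [max(_sqdist(p, c), 1e-12) ** (-1.0 / (m - 1.0)) for c in centres]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def fuzzy_c_means(points, c: int, m: float = 2.0,
                  iterations: int = 50) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (centres, memberships) after a fixed number of update rounds."""
    n, dim = len(points), len(points[0])
    # Deterministic start: c points spread evenly over the input order.
    centres = [list(points[i * (n - 1) // max(c - 1, 1)]) for i in range(c)]
    for _ in range(iterations):
        u = _memberships(points, centres, m)
        for k in range(c):
            w = [row[k] ** m for row in u]
            total = sum(w)
            centres[k] = [sum(wi * p[j] for wi, p in zip(w, points)) / total
                          for j in range(dim)]
    return centres, _memberships(points, centres, m)


def partition_entropy(u: List[List[float]]) -> float:
    """Average entropy of the membership rows; 0 means a crisp partition."""
    return -sum(v * math.log(v) for row in u for v in row if v > 0.0) / len(u)


def choose_cluster_count(points, candidates, m: float = 2.0) -> int:
    """Pick the candidate number of clusters with the lowest partition entropy."""
    return min(candidates,
               key=lambda c: partition_entropy(fuzzy_c_means(points, c, m)[1]))
```

Raw partition entropy tends to grow with the number of clusters, so a practical version would probably normalise it or combine it with another validity index before comparing candidates.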

Motion pattern in the shot
After determining the global panning and zooming, standard methods for motion sequence analysis are used to obtain the motion in the shot. In the absence of a general algorithm for segmenting objects in a still (key) frame, motion is critical for the recognition of objects in a video. Many motion detection algorithms exist. We will employ and expand the methods for motion detection provided in [Hieu 2003, 2004, 2005, 2006]. Most of these start from an interactive initialisation, so we cannot employ them as they are, since we aim for automatic analysis. We will therefore adapt these methods to include self-starting algorithms.

TASK 2.3 AUDIO SEGMENTATION - ISABEL TRANCOSO, INESC-ID

Objective
In this task, we provide a segmentation of audio into homogeneous regions according to background conditions, and according to speaker gender and id. The segmentation algorithm tries to detect changes in the acoustic conditions and marks those time instants as segment boundaries.

Parameter extraction
This task can be accomplished by evaluating the similarity between two fixed-length, contiguous windows shifted in time. The symmetric Kullback-Leibler distance [Siegler et al. 1997] is an example of a distance measure typically used to evaluate acoustic similarity. Each window is modelled by a Gaussian distribution. Large values for this measure imply that the distributions of the windows are more dissimilar. We will use the measure derived from cepstral coefficients (or PLP coefficients – Perceptual Linear Prediction) extracted from the audio signal. Segment boundaries are positioned at window boundaries where the distance reaches a maximum above a threshold.
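The windowed comparison can be sketched as follows. The sketch makes assumptions beyond the text: each window is modelled by a Gaussian with diagonal covariance, for which the symmetric Kullback-Leibler distance has a closed form per dimension; variances are floored to avoid division by zero; and the function names, window length and threshold are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # assumption: one feature vector (e.g. cepstral) per audio frame


def fit_gaussian(window: Sequence[Frame],
                 floor: float = 1e-6) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (floored) variance of the frames in a window."""
    means, variances = [], []
    for column in zip(*window):
        mu = sum(column) / len(column)
        var = sum((x - mu) ** 2 for x in column) / len(column)
        means.append(mu)
        variances.append(max(var, floor))
    return means, variances


def symmetric_kl(left: Sequence[Frame], right: Sequence[Frame]) -> float:
    """KL(left||right) + KL(right||left) for diagonal Gaussians fitted to the windows."""
    mu_l, var_l = fit_gaussian(left)
    mu_r, var_r = fit_gaussian(right)
    total = 0.0
    for ml, vl, mr, vr in zip(mu_l, var_l, mu_r, var_r):
        d2 = (ml - mr) ** 2
        total += 0.5 * ((vl + d2) / vr + (vr + d2) / vl) - 1.0
    return total


def segment_boundaries(frames: Sequence[Frame], window: int,
                       threshold: float) -> List[int]:
    """Frame indices where the distance has a local maximum above the threshold."""
    positions = range(window, len(frames) - window + 1)
    dist = [symmetric_kl(frames[t - window:t], frames[t:t + window]) for t in positions]
    boundaries = []
    for i, t in enumerate(positions):
        before = dist[i - 1] if i > 0 else float("-inf")
        after = dist[i + 1] if i + 1 < len(dist) else float("-inf")
        if dist[i] > threshold and dist[i] > before and dist[i] >= after:
            boundaries.append(t)
    return boundaries
```

Requiring a local maximum, rather than any value above the threshold, keeps a single boundary per acoustic change even though several neighbouring window positions straddle it.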

Classification
In a second stage, each homogeneous audio segment is passed through a classifier that identifies the background and tags non-speech segments [Meinedo et al. 2003]. Segments that are marked as containing speech are also classified according to gender and are subdivided into “sentence-like units” by an endpoint detector. This segmentation can provide useful information about the background such as silence, music, noise or speech.

Speaker detection
In the third stage, speaker detection is performed, attempting to identify those speaker clusters that were produced by one of the pre-defined speakers [Chen et al. 1998]. The division into speaker turns and speaker identities creates a full description of contents, allowing automatic indexing and retrieval of all occurrences of a particular speaker. We aim to group all segments assigned to the same speaker together in order to achieve an adaptive speech recognition model. To that end, various algorithms such as BIC, MLP or SVM will be evaluated.

3 Audio analysis

OBJECTIVE
The objective of this WP is to fill the audio part of the thesaurus. This is done by extracting a set of acoustic cues that represent the information presented in the audio signal in order to enrich the semantic description of the multimedia data. The WP aims at detecting audio events. We integrate the resulting audio events with speech recognition to arrive at a complete transcription of the audio content. In order to accomplish these goals, the WP is structured into two tasks.

TASK 3.1 DETECTION OF AUDIO EVENTS - JOÃO P. NETO, INESC-ID

Objective
The objective of the task is to perform audio event detection by machine learning techniques. Acoustic events such as goals in sport matches, shooting, explosions, car or helicopter noises, cries, screams, or laughter are detected by learning from examples.

Features and models
The set of acoustic features used in the detection of audio events is not that different from the one used in automated speech recognition systems. We plan to use the mel-frequency cepstral coefficients (and their time derivatives), which are common in audio event detection as well. Features such as volume, band energy ratios, zero crossing rate, and bandwidth are also good candidates for event detection. HMMs will be considered to model audio events. From the machine learning side, SVMs and GMMs [Chu et al. 2004] may be used to fuse the characteristics of various audio events of one specific semantic concept.
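Two of the simpler candidate features, volume and zero crossing rate, can be computed directly from the samples of an audio frame. The sketch below assumes that volume is measured as root-mean-square amplitude and that a zero crossing is a sign change between adjacent samples; both definitions and the function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def volume(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of one audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(samples) - 1)
```

Band energy ratios and bandwidth require a spectral transform of the frame and are left out of this sketch.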

TASK 3.2 SPEECH RECOGNITION - JOÃO P. NETO, INESC-ID

Objective
In this task we aim to meet the demand for high-performing speech recognition systems in news broadcasts [Nguyen 2005]. We will build on a large-vocabulary continuous speech recognition system tailored to broadcast news. Systems tuned in this way have achieved lower word error rates for anchor speakers (< 10%) than the global rates (< 25%). In this project we will generate an adaptive system, tuned to the various tasks. With this adaptation process it is possible to increase both the accuracy and the speed.
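For reference, the word error rate quoted above is conventionally the minimum number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. A minimal sketch, assuming whitespace tokenisation and a hypothetical function name:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the transcripts, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, the rate can exceed 100% when the hypothesis is much longer than the reference.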

4 Visual analysis

OBJECTIVE
The objective of this work package is to contribute to the visual part of the thesaurus. The goal is achieved by analyzing the context of images and videos, characterizing scenes by type, frame composition, and directors’ style elements. Object types will also be classified when possible. We will aim to achieve this characterization of the scene by machine learning of invariant features derived from annotated examples.

TASK 4.1 TYPES OF MOTION, ACTIONS AND BEHAVIOUR - JORDI GONZÀLEZ, CVC

Motion characterisation

The objective of this task is to analyse the motion pattern from the motion estimation in WP 2. The main interest here is in characterizing the motion patterns of people. The aim is to infer behaviour relative to other people and objects in the scene. Previous approaches have used either appearance-based models or local features. In sparse scenes it is reasonable to assume that reliable trajectories can be extracted and that inferences about individuals are possible. Detecting people in crowded scenes is a challenging problem due to the various styles of clothing and occluding accessories. In designing these low-level descriptors, we will use descriptors of trajectories and characteristics of trajectories.

TASK 4.2 TYPES OF SCENES - ARNOLD SMEULDERS, UVA; MARCO BERTINI, UNIFI

Background and scene

The background yields important information as it defines the context against which all other feature detectors operate. In this task, the scene is characterized by type. At this point the scene can be discriminated along several important types:
1. scenery
2. frame composition
3. directors’ style elements

Scenery types
The scenery can be divided into indoor {office, news-studio, house, sports arena, etc.} or outdoor {beach, mountain, forest, meadows, road, city, sea, etc.}. We will learn to discriminate among these types from example images of surface materials and depth patterns [Hollink 2004, Smeulders 2002]. When sampled with scene-invariant features as discussed above, they provide a semantic anchor in the image regardless of context.

Scene types
Detection of scene types provides both the context of video action and an access unit to browse and retrieve video. In the first case the context can be exploited by other feature detectors to detect events and objects, or can be used directly to derive high-level indexing of videos; in the particular cases of TV news and documentary videos it is useful to label scenes with meaningful classifications such as anchorman shot or hunting scene. In the second case the scene can be used later to formulate queries or as a unit for visualizing results. We will use the methods for scene type classification in [Snoek 2006] to discriminate among them.

Frame types
A frame may consist of: fill type {graphics, cartoons, real-world, home video, etc.}, and edits {subscripts, split screens, frames, inlays, etc.}. Each of these subtypes holds information that is essential for understanding the scene. Graphics can be detected by the saturation of colours [Gevers 2000]; split screens are indicative of interviews and news and can be detected by machine learning from examples [Snoek 2006, Amir 2005]; and subscripts hold critical semantic information and can be detected by video OCR [Lienhart 2002, Doermann 2000]. We will use these methods for fill type here as well. Edits are common in news videos, where they provide high-level information about places and persons involved in the represented event [Hauptmann 2004]. Detection and recognition of scene text is particularly difficult when the text font is not optimized for video usage. We will adapt methods suited to the high variability in appearance and types of captions [Jain 2004], possibly supported by a standard dictionary.

Director style types

Our semantic indexing philosophy starts from the premise that any video production, be it a news broadcast or a concert recording, originates from the mind of the director. We aim to reconstruct this semantic intent. In contrast to the state of the art in video analysis, which emphasizes analysis of content and context [Amir 2005] only, we adhere to an integrated approach that also covers the notion of production style. We will adopt methods for detection of production style based on shot tempo, overlaid text, voice-overs, and the distance, angle, and motion of cameras and microphones [Snoek 2006b].

Edit effects

The use of certain edit effects, the length of shots, the juxtaposition of shots characterized by certain visual content, and the use of so-called L- and J-edits (created by trimming the video and audio tracks differently, to extend the audio track into the following visual event or to anticipate the audio event in the previous shot, respectively) have well-defined and established meanings that can be exploited to determine high-level or abstract concepts. This has been used, for example, for the semantic characterization of sensations induced in commercials [DelBimbo 1999], and is considered for use in the analysis of documentaries, where directors’ style is highly significant. In addition, methods will have to be developed that cope with all the possible variations due to different broadcasters, available visual effects, and styles that may change over the years.

TASK 4.3 TYPES OF OBJECTS - THEO GEVERS, UVA

Recognition of object types

In machine vision, the use of invariant features has made object recognition much simpler by closing the sensory gap between an object and the millions of its slightly different appearances due to differences in illumination, viewing angle and scene. Retrieval of known objects has shown good progress; see, for example, [Schmid 2002, Geusebroek 2005]. These results are all based on invariant and complete feature sets [Gevers 2006, Lowe 2004, Burghouts 2005]. Further, there is broad agreement that local features are an efficient tool for object recognition due to their robustness with respect to occlusion and geometrical transformations. The success of local features depends greatly on the information content of the selected features. Salient features with high information content are characterized by a low frequency of occurrence. Further, colour invariants have been proposed which have an increased discriminative power for object recognition when compared to the visual measurements of which the invariants were composed [Smeulders 2000]. Combining local invariant features and incorporating colour are known to be nontrivial problems in computer vision, for which solutions are required. We have developed descriptors which have both photometric invariance (e.g. to illumination and shading) and geometric invariance (e.g. to translation, rotation, scale and affine transformations). We will employ and evaluate the use of salient (colour) features [Geusebroek 2002, Gevers 2006, Weijer 2006]. Selection and weighting of the appropriate features will be considered for the task at hand. The selection will be based on context-specific machine learning of object types.

Human emotion types
One of the most important pieces of information is the human emotion in the video. Despite huge efforts from the computer vision and pattern recognition communities, currently available techniques perform reliably only in constrained settings, that is, where the person is facing the camera, under controlled illumination conditions, and possibly without a cluttered background. In these cases, faces can be reliably matched against databases of faces taken under similar filming conditions. Comprehensive surveys on this type of approach, usually adopted for the purpose of biometric identification, can be found in [Kriegman 2002] for face detection and [Zhao 2003] for face recognition. Since direct face recognition methods cannot be applied, we need to find different visual cues that can lead to the identification of a person. For TV news, there is a rich source of information given by the text captions that are in use today by every broadcaster. Associating the information obtained from the caption stream with the faces extracted from the video stream has already been proposed in [Satoh 1999] and [Forsyth 2004]. Several aspects still need to be investigated, in particular: non-trivial methods of name/face association, a description of facial expression that is robust to all sorts of circumstances, and online learning to recognize the expressions of a specific face. We will investigate these issues by exploiting recent results obtained in the field of object recognition using local image features [Lowe 2004], [Matas 2002], [Sivic 2005].

5 Learning integrated feature detectors

OBJECTIVE
The objective of this WP is to develop modern machine learning techniques, with less data to annotate and similar or better performance of the classifiers, for learning the elements of the thesaurus by combining audio, textual, and pictorial descriptions of shots.

TASK 5.1 ONE-CLASS ACTIVE LEARNING - KRYSTIAN MIKOLAJCZYK, UNIS

One class active learning
The objective of this task is to evaluate and develop methods for one-class classifiers suited for large-scale thesauri. At the core of the modern approach to video interpretation are efficient methods for annotation of data and effective methods for machine learning of classifiers. Efficiency in annotation is achieved by active learning, that is, computational methods that select the critical object to label next, in order to gain as much insight as possible into which of the objects matter in defining the boundary. Our application, with an envisioned 1000 different classes of objects, scenes and concepts, poses a highly demanding list of constraints on effective methods for learning classifiers. The typical classifiers to use here are one-class classifiers based on dissimilarity measures [Tax 2001, Roli 2004]. Both positive-sample-based learning and joint positive and negative sample learning of one-class models will be considered, using generative and discriminative methods. For the former category, kernel-based pdf estimators that are insensitive to outliers will be considered. For the discriminative approach we shall initially adopt the Support Vector Data Description (SVDD) technique [Tax 1999], which forces object class models of minimum volume occupancy. The final goal is to develop its Relevance Vector equivalent, based on the idea of the Relevance Vector Machine [Tipping 2001], which favours sparse data representation. We will employ these methods to facilitate computational efficiency. Moreover, the Bayesian framework of RVM learning supports the use of multiple kernels, and this apparatus can be exploited for simultaneous feature selection [Mottl 2005] with the benefit of enhancing object class compactness as well as discriminability.

TASK 5.2 LEARNING SEMANTIC INTEGRATED FEATURE DETECTOR - KRYSTIAN MIKOLAJCZYK, UNIS; MARCEL WORRING, UVA

Mixed audio and visual types
The objective of this task is to evaluate and develop methods for semantic integration. The semantics of a scene will usually be carried in a describing text. The combination of visual and textual signs may indicate whether a concept is actually in the scene or whether the text is just referring to an abstract concept. The text information will enhance the precision and accuracy of mixed-media retrieval. Classification of texts for video sequences with humans will incorporate motion verbs and details with respect to the motion of agents. An integrated approach combining text and image information at a low level and the use of machine learning techniques is an essential ingredient of most successful systems. In this project we focus on combining the audio, visual and text features under the guidance of machine learning from examples and explicit knowledge where necessary. Experience from previous research [Snoek 2005a] has shown that the optimal fusion strategy depends on the type of concept. For some concepts early fusion, where the features of different modalities are fused before learning the semantics, is optimal. For other concepts, the best approach is late fusion, where one first learns the semantics in each individual modality, after which a separate learning stage is used to learn the multimodal semantics. We aim to expand the principles of early and late fusion of [Snoek 2006] to find the optimal fusion strategy per concept to automatically learn the mixed media types of the thesaurus.
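The difference between the two fusion strategies can be made concrete with a toy sketch. It is not the scheme of [Snoek 2006]: the learner is a deliberately simple nearest-centroid scorer chosen only to keep the example self-contained, only two modalities are shown, and all class and function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def _centroid(rows: Sequence[Vector]) -> List[float]:
    return [sum(col) / len(rows) for col in zip(*rows)]


def _sqdist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidScorer:
    """Toy concept detector: a positive score means closer to the positive centroid."""

    def fit(self, X: Sequence[Vector], y: Sequence[int]) -> "CentroidScorer":
        self.pos = _centroid([x for x, label in zip(X, y) if label == 1])
        self.neg = _centroid([x for x, label in zip(X, y) if label == 0])
        return self

    def score(self, x: Vector) -> float:
        return _sqdist(x, self.neg) - _sqdist(x, self.pos)


def fit_early_fusion(audio, visual, labels) -> CentroidScorer:
    """Early fusion: concatenate the modality features, then learn one detector."""
    joint = [list(a) + list(v) for a, v in zip(audio, visual)]
    return CentroidScorer().fit(joint, labels)


def fit_late_fusion(audio, visual, labels):
    """Late fusion: learn one detector per modality, then a second stage on their scores."""
    audio_model = CentroidScorer().fit(audio, labels)
    visual_model = CentroidScorer().fit(visual, labels)
    scores = [[audio_model.score(a), visual_model.score(v)]
              for a, v in zip(audio, visual)]
    meta = CentroidScorer().fit(scores, labels)
    return audio_model, visual_model, meta


def late_fusion_score(models, a: Vector, v: Vector) -> float:
    audio_model, visual_model, meta = models
    return meta.score([audio_model.score(a), visual_model.score(v)])
```

Choosing a strategy per concept, as proposed above, would then amount to training both variants for each concept and keeping the one that scores better on held-out annotated shots.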

6 Technological software development

OBJECTIVE
The objective of the work package is to consolidate and validate the textual, audio, speech, motion and visual features in a common representation and indexing structure to enable wide and simple usage by the other system components. The consolidation is split into two parts: the first part (see Task 6.1) covers the processing of video into shots, audio segments, and speech recognition; the second part (see Task 6.2) is the thesaurus of audiovisual detectors of semantic concepts.

TASK 6.1 SOFTWARE PROCESSING CONSOLIDATION - JUANJO VILLANUEVA, CVC; THEO GEVERS, UVA; JOAO P. NETO, INESC-ID; YIANNIS KOMPATSIARIS, CERTH

Task 6.1 aims to consolidate the software of the processing phase (Tasks 2.1, 2.2, and 3.2), insofar as the software is not part of the set of semantic detectors. The video and speech processing is delivered as solid software suitable for the runtime version. The video processing software will be subject to good programming practices, proper documentation, and external reviewing before it is implemented in the runtime system. All labs participating in this task have experience with delivering robust software to the outside world in their own field of expertise. Hence, we can make a flying start with building two generations of complete systems. Nevertheless, a considerable effort is projected to define a data interface between the components as well as the overall data architecture of the components, including the univocal definition of the parameters and formats. The software will be tested on a predefined sequence of video recordings, and semantic primitives will be processed for interpretation and retrieval. Validation techniques of the final software version, together with a series of illustrative test sequences, will be provided.

TASK 6.2 CONSOLIDATION OF SEMANTIC DETECTORS IN THESAURUS - MARCEL WORRING, UVA

Thesaurus
Task 6.2 assembles all visual, audio and mixed-media semantic detectors as learned in Tasks 3.1, 4.1, 4.2, 4.3, and 5.2. A concept can be based on a shot or still, and/or a point. The results obtained from Tasks 3.1, 4.1, 4.2, 4.3, and 5.2 will be an MPEG-7-based description of the video, which will be stored in the database. The integrated system will be tested in the labs and evaluated with respect to the accuracy of both interpretation and retrieval, and the degree of interaction. A predefined sequence of video recordings and semantic primitives will be processed for interpretation and retrieval. Validation techniques of the final software version, together with a series of illustrative test sequences, will be provided.

To extend the existing 101-concept thesaurus to a thesaurus of 1000 concepts, we start off from the query log analysis from task 1.2. The most relevant semantic concepts will be related to their associated concept in WordNet [Hollink 2005]. The hierarchical structure of WordNet will then be used to derive the set of most discriminant direct observables in the audio and video stream. Where needed, additional audio and visual properties of WordNet concepts will be defined. The list will be continuously compared to the LSCOM set of concepts [Kennedy 2006] to open the way for future integration.

MPEG-7
For the semantic representation of the features in a common representation, the MPEG-7 standard is the prime candidate. Since the representation of audio-visual features has to be used in the query ontology, the most effective way is to translate MPEG-7 descriptors into OWL so that they can easily be included and referred to in the query ontology. MPEG-7 is a standard that has been built to define entities and their properties with respect to the specific domain of multimedia content, while RDFS and OWL are languages that can define an ontology in terms of concepts and their relationships regardless of the domain of interest. The advantage of using MPEG-7 in the multimedia domain is that it has been designed to fully describe multimedia document structure, but at the same time it reflects the “structural” lack of semantic expressiveness of XML. Knowledge representation languages extend the capability and the expressiveness of XML. RDFS can define an ontology in terms of concepts, properties and relationships of concepts without any restriction. OWL adds to RDFS the capability to refine concept definitions and class restrictions. Both are flexible and extensible because they are not standards for a specific domain but have been designed as general-purpose languages for domain-independent knowledge description. Moreover, knowledge representation languages can support the use of inference engines that can enrich the knowledge of a domain with inferred knowledge. We will use MPEG-7 where feasible.

OWL
The possibility of translating MPEG-7 into an ontology language such as RDF or OWL has been exploited to overcome the lack of formal semantics in the MPEG-7 standard, extending the traditional text descriptions into machine-understandable ones. The first attempts to bridge the gap between the MPEG-7 standard and the ontology standards were presented in [Hunter 2001]. In these works the first translation of the MPEG-7 MDS into RDFS was shown. The resulting ontology has also been converted into DAML+OIL, and is now available in OWL. The ontology, expressed using OWL Full, covers the upper part of the Multimedia Description Schema part of the MPEG-7 standard. It consists of about 60 classes and 40 properties. A methodology and a software implementation for the interoperability of MPEG-7 MDS and OWL have been presented in [Tsinaraki 2004], developing from the previous work and using OWL DL. Another MPEG-7 ontology is the one provided by the DMAG group at the Pompeu Fabra University. This latter MPEG-7 ontology is an OWL Full ontology that aims to cover the whole standard and is thus the most complete one. In addition to the translation of MPEG-7 visual descriptors into OWL, this task will investigate methodologies to establish links and relationships between concepts defined in the query ontology and the visual descriptors, so that a query can be performed both on high-level semantic concepts and on low-level visual features.

TASK 6.3 BENCHMARK EVALUATION - MARCEL WORRING, UVA

Participate in TRECVID competition 2008 & 2009
To allow for a quantitative evaluation of progress, we aim to participate in the TRECVID benchmark or a similar competition in the second and third year of the project. Participation will be in the concept detection task, to see how the performance of individual detectors compares with other state-of-the-art systems, and in the interactive retrieval task, where the aim is to have users find the best result for a set of 24 information needs, with the interaction time for each need limited to 15 minutes. The latter will be a stepping stone to effective performance in the runtime interactive system. A baseline will be set with the current system.

7 Demonstrators and applications

OBJECTIVE
The objective of the work package is to study and develop methods, demonstrators and applications that allow natural querying and exploration of the annotated multimedia content in different application fields. Various methods for querying will be considered, as certain concepts can be expressed more easily using natural language (e.g. abstract concepts), whereas other concepts are more easily expressed through visual examples (e.g. a certain temporal behaviour like a sport highlight). Moreover, very often there is a need to use both types of queries at the same time, e.g. “retrieve all the scenes of a happy crowd like those in shot X”, where the abstract concept of “happy” has to be linked to the visual concept of similarity to a specific video sequence. Different interaction modalities can be envisioned: a graphical interface to support the user in formulating complex queries to exploit the full potential of the query language (i.e. RDQL or SPARQL), and a multimodal interface that combines category browsing (via an enriched ontology that adds multimedia data to the linguistic terms) and simple text search. In particular, support for complex query formulation that exploits the full potential of the query languages, simple text search, and category browsing (via the available ontologies) are foreseen.

TASK 7.1 ONTOLOGY OF QUERIES - MARCO BERTINI, UNIFI

Visual-text ontology
The task will investigate the integration of the visual ontology with a text-based ontology. In a dictionary only a fraction of the words has a visual counterpart. The other words are abstractions. Older formal systems such as ICONCLASS and AAT are useful in art history, but they cannot be applied here in automatic analysis as these methods require semantic understanding and a feel for context that a computer simply lacks. Hence, it is interesting to see how visual descriptors relate to general-purpose lexical resources such as WordNet [Hoogs 2003] for text retrieval and where they require extension. The basic idea is to link the linguistic concept expressed by a lexical resource with the related “visual concept”, which can be expressed in the visual ontology by means of the visual descriptors or detector values used in the concept learning process. Visual concepts, once added to the ontology, will complement the semantics described through linguistic terms, yielding a more detailed representation of the context domain. This WP will develop multimedia enriched ontologies, using in particular the OWL standard (see WP 6), in which concepts that cannot be expressed in linguistic terms are represented by prototypes of different media like video, audio, speech and text. The ontology of queries defines the functionality of the system seen from a user’s perspective. It will therefore provide the blueprint for the whole demonstrator to be developed.

Text-based ontology

Depending on the learned objects and the learned types, the same lexical resource can have different meanings related to different visual descriptors. The text-based ontology will be structured so that it can be used to express and define relationships between resources, so that even abstract concepts based on spatial or temporal relations between learned concepts can be expressed. Reasoning on the ontology may be used in the project in order to infer relations and complex concepts.

TASK 7.2 MULTIMEDIA QUERY TOOL - MARCO BERTINI, UNIFI

Query tool

In [Hollink 2004] a framework is provided for describing queries for users looking for visual information. They distinguish abstraction levels, such as abstract, general and specific descriptions, as well as characteristics of the image users search for {object, scene, spatial relations, etc.}. The project will use this as a way of identifying query types in the query logs where the system will work. The query process will deal with both the visual and the textual ontology so that high-level linguistic concepts and low-level visual descriptions can be retrieved. The multimedia query can be performed by means of both free text and learned concepts. The capability to express temporal and spatial relations between different high-level concepts has to be achieved. We will investigate techniques to map queries expressed in natural language into a proper query language for ontologies (e.g. RDQL and SPARQL).
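
As a rough illustration of mapping a free-text query onto an ontology query language, the sketch below looks up known concept labels in the text and assembles a SPARQL query from them. It is only a minimal sketch under stated assumptions: the namespace http://example.org/vidi#, the property vidi:depicts, the concept names and both helper functions are hypothetical and do not come from the project, and simple keyword lookup stands in for the natural language techniques the task will actually investigate.

```python
def concepts_from_text(text, vocabulary):
    """Naive keyword lookup (an assumption, not the project's method):
    keep every ontology concept whose label occurs as a word in the text."""
    words = set(text.lower().split())
    return [concept for label, concept in sorted(vocabulary.items())
            if label in words]


def build_sparql(concepts, namespace="http://example.org/vidi#"):
    """Build a SPARQL SELECT query for shots annotated with all concepts.

    The namespace and the 'depicts' property are hypothetical placeholders.
    """
    if not concepts:
        raise ValueError("at least one concept is required")
    patterns = "\n".join(f"  ?shot vidi:depicts vidi:{c} ." for c in concepts)
    return (
        f"PREFIX vidi: <{namespace}>\n"
        "SELECT DISTINCT ?shot\n"
        "WHERE {\n"
        f"{patterns}\n"
        "}"
    )


# Hypothetical vocabulary linking lexical labels to ontology concepts.
vocabulary = {"crowd": "Crowd", "happy": "Happy"}
found = concepts_from_text("retrieve all scenes of a happy crowd", vocabulary)
query = build_sparql(found)
print(query)
```

Requiring every triple pattern to match means the query returns only shots annotated with all recognised concepts; temporal or spatial relations between concepts would need additional patterns.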

User understanding
A secondary goal of this task will be to understand how users would naturally interact with semantically enriched cross-media material and how a user interface should be designed to better support effective interaction. How much of a search system’s internal mechanisms (e.g. internal query language, reasoning, OWL formalism, see WP 6) should be exposed to the user will also be investigated here. The work will focus on: i) query formulation and reformulation; ii) result display and exploration. The investigation in this work package will be carried out in partnership with the users and/or their representatives, who are expected to participate actively in the design and evaluation of the emerging solutions.

TASK 7.3 INTERACTION AND VISUALISATION - JUANJO VILLANUEVA, CVC

User interface

Task 7.2 will find and rank the videos most relevant to a given query. At that point, images and video are already indexed by pixel-level image attributes like colour, texture, and shape, and classified with higher-level semantic features. These semantic features will be organized into the video ontology defined in Task 7.1. As a result, the most relevant videos will be ranked based on the matching score defined during Task 7.2. Subsequently, in this Task 7.3, a user interface will be specified and designed for use by both technical and non-expert end-users. This interface will provide a component in which semantic concepts of task-based dialogs will be processed, based on either visual or text information. These semantic concepts will constitute the basis for video retrieval, thus allowing the visualization of those video sequences which best match the learned concepts.

Queries
We distinguish between three modes of interaction for queries, based on text, images and predefined visual features. First, text-based queries will take advantage of linguistic cues to index and retrieve non-linguistic visual imagery. We will study to what extent the text may fail to document the aspect of the video that interests the user performing the search. We will also investigate the role natural language processing can play in the conceptual indexing of videos by semantic categories instead of keywords. Secondly, a thumbnail in the visual interface can be dragged into the query window to initiate an image-based search using that image as a search key. The results of this search will also be provided by Task 7.2. The main aim is to establish whether there is a class of topics for which a visual-only system might perform better than a text-driven system. A third means of visual browsing will make use of the concepts defined in WP2 – WP5.

Evaluation

This task will compare standard precision-recall techniques with automatic meta-scoring functions. The traditional evaluation method is through precision and recall, which requires human judgement about the relevance of the videos to the queries. To achieve this goal, a user study will be carried out to analyse the usability of interactive video search for novices. This study will involve recruiting 30 subjects who are inexperienced digital video searchers. Thus, we will be able to develop a procedure for analysing the interactions of novice users. Furthermore, this may allow the search results for predefined abstract concepts of the ontology to be evaluated, thus establishing the most relevant videos for concepts such as silence or peace.
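
For reference, precision is the fraction of retrieved videos that are relevant, and recall is the fraction of relevant videos that are retrieved. The sketch below shows these standard definitions on a toy example; the video identifiers and the set of human relevance judgements are made up for illustration, and the function names are ours rather than part of any project tool.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of video ids returned by the system
    relevant:  iterable of video ids judged relevant by humans
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def precision_at_k(ranked, relevant, k):
    """Precision over the top-k items of a ranked result list."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    top_k = ranked[:k]
    if not top_k:
        return 0.0
    return sum(1 for video in top_k if video in relevant_set) / len(top_k)


# Toy data: four ranked results, three videos judged relevant.
ranked = ["v1", "v2", "v3", "v4"]
relevant = {"v1", "v3", "v5"}
precision, recall = precision_recall(ranked, relevant)
print(precision, recall, precision_at_k(ranked, relevant, 2))
```

Precision at a cut-off is included because an interactive user usually inspects only the first few results of a ranked list.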

Image to words
The inversion of this image-to-words approach will also be investigated. The aim is to generate synthetic query-frames which reproduce an input semantic text and which will constitute the visual query. An end-user will interact with a virtual environment by including or removing synthetic objects and agents (i.e., visual information), and by defining a set of semantic primitives (i.e., textual information). Thus, the ontology will be visualized, and the complexity of the synthetic world used for the query will depend on the semantic knowledge embedded in the textual description. As a result of this process, a synthetic shot will be created and, since it embeds the textual description entered by the user, it will be used for video retrieval. Comparing the synthetic shot with the retrieved video will reveal possible weaknesses in the interpretation processes of the whole system caused by the semantic gap.

TASK 7.4 USER GROUPS - PAOLO GALLUZZI, FRD

Objective
To keep the project concrete and to meet the requirements and opportunities of the market, it is important to link the research activity and software development in the project to a methodological definition that suits the user requirements. This will be done by establishing a User Group for all three fields of application. The working method foresees three phases:

* users’ requirements definition as input to the project;
* pilot test and validation of the interfaces and searching modalities;
* dissemination and consensus building on the final products of the project.

The first two phases are included in the WP7 activity; the third phase is related to the WP9 work plan. For the User Group on scientific and cultural contents, some specific dissemination strategies will be implemented under the coordination of the FRD, owing to the nature of the target community. For the other 3 fields of demonstration foreseen within VIDI-Video, small expert user groups will be set up for all three phases of the activity.

Scientific and cultural contents
The aim is to set up a User Group on scientific and cultural contents that is composed of two types of users: the producers of documentaries and short films, and the final users. The FRD is the promoter and organizer of the user group of the project and can guarantee liaisons with the culture community and the provision of a small amount of digital content (documentaries and short films) for research, software development and the assessment of annotation tools. Specifically, the user group, coordinated by FRD, deals with the definition of the specific requirements for information retrieval and searching modalities that characterise the cycle of production and distribution of this category of video products (documentaries and short films).

Testing
The user group collaborates in defining and verifying the automatic metadata-extraction algorithms and the sector-specific ontology definition used to describe the information in documentaries and short films related to scientific and cultural contents. The user group also has a fundamental role in approaching the market, through the validation of the user interface of the application (specifically of the automatic and semi-automatic annotation tool) and with regard to the legal issues of copyright and authors' rights.

Collaboration
In this activity the FRD has the collaboration of important institutions, which will join the project at a later stage as collaborators. Contacts have been established with the following institutions: Mediateca Regionale Toscana, Istituto Luce in Rome, Fox Video-National Geographic Italia and Short Village, the most important distributor of independent short films in Italy.

TASK 7.5 CULTURAL HERITAGE DOCUMENTARIES - PAOLO GALLUZZI, FRD

To evaluate and assess the technology produced in the VIDI-Video project in the context of cultural heritage documentary archives a field trial will be executed in the final stage of the project. To do so the run-time interactive system will be embedded in the workflow of the FRD archive. The trial will take into account different aspects, showing different modes of use of the system, and address different target audiences. The field trial will define the challenges for future generation semantic video search engines.

TASK 7.6 BROADCAST ARCHIVE FIELD TRIAL - JOHAN OOMEN, B&G

To evaluate and assess the technology produced in the VIDI-Video project in the context of broadcast archives, a field trial will be executed in the final stage of the project. To do so, the run-time interactive system will be embedded in (a proxy of) the workflow of the B&G asset management system. The trial will take into account different aspects, showing different modes of use of the system, and address different target audiences. The field trial will define the challenges for future generation semantic video search engines. The findings of the field trials in Tasks 7.5 and 7.6 will be integrated during the production of the final publication and showcases for dissemination; see Task 9.1.

TASK 7.7 VIDEO SURVEILLANCE - RITA CUCCHIARA, UOM; SUBHASIS CHAUDHURI, IIT

Surveillance video collection
An important element of the task is to establish contact with surveillance user groups, enabling a higher impact of the results of VIDI-Video. Different sources of surveillance video data will be available, such as fixed indoor and outdoor cameras mounted at high positions with a large field of view, moving cameras with pan, tilt and zoom capabilities, fixed indoor cameras, and mobile cameras, such as those mounted on board the cars of some private surveillance companies. At UoM, tools and video analysis techniques have been developed that could be used to provide further annotation for a posteriori logging. They could also be used as a search system for activity detection in case the VIDI-Video system is too general in its capabilities. The task will perform the following activities:

- Providing a large collection of security and surveillance videos, in order to create a complete set of views of a significantly wide area, covering a 24-hour time frame, with different views, including non-overlapping ones. Videos will be provided of outdoor and indoor scenes, such as roads, public parks, offices and university campuses. This allows potential queries such as “find me all sequences that contain a person pushing a stretcher from 6.00am to 6.30am” or “give me all the clips of video acquired in this area containing a person with a red coat”.
- Annotating metadata in MPEG-7 to ensure interoperability with Tasks 6.2 and 4.1, allowing us to provide additional features and metadata to the query engine.
- Testing the capability of the concept detection techniques developed in the project by means of a subset of the thesaurus, such as people, face, car and bicycles, all of which provide insight in a surveillance setting.
- Comparing the results obtained with the general-purpose feature extractors and invariants, as defined in Tasks 4.1, 4.2 and 4.3, with specific surveillance techniques that take into account additional information such as camera calibration data.
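
The kind of query quoted in the first activity combines a time window with a detected concept. The sketch below shows such a filter over annotated clips. It is purely illustrative: the Clip record, its fields, the camera names and the concept labels are assumptions for this example and do not reflect the MPEG-7 metadata layout used in the project.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Clip:
    """Hypothetical annotation record for one surveillance clip."""
    clip_id: str
    camera: str
    start: time
    concepts: frozenset


def find_clips(clips, concept, earliest, latest):
    """Return ids of clips that show `concept` and start within the window."""
    return [
        clip.clip_id
        for clip in clips
        if concept in clip.concepts and earliest <= clip.start <= latest
    ]


# Made-up annotations for three clips.
clips = [
    Clip("c1", "park-east", time(6, 5), frozenset({"person", "stretcher"})),
    Clip("c2", "park-east", time(6, 45), frozenset({"person", "stretcher"})),
    Clip("c3", "campus-gate", time(6, 10), frozenset({"person", "red_coat"})),
]
matches = find_clips(clips, "stretcher", time(6, 0), time(6, 30))
print(matches)
```

A filter on the camera field could be added in the same way to restrict the search to clips acquired in a given area.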

Surveillance User group
This serves as preparation for developing a forum for the surveillance community that can attract user groups able to provide new requests and use cases, share knowledge and annotated video, and test different approaches in the video surveillance application field.

8 Data, annotation & queries

OBJECTIVE
The aim is to provide a realistic dataset, from various sources, with manual annotations and an analysis of queries that have been posed on this dataset to be used throughout VIDI-Video.

TASK 8.1 DATA, ANNOTATION & QUERIES - JOHAN OOMEN, B&G

Every broadcaster has a tremendous archive of material which has been broadcast in the past. By law or by commercial interest the data is manually annotated so that it can be reused later, possibly in different settings and contexts. For the project, we will benefit from the wealth of annotations that are already available in the archives. Where needed, the data will be manually annotated with other relevant descriptions. It should be noted here that B&G is already planning to donate 300 hours of video to TRECVID to be used in the benchmark. This material will form part of the data to work on.

TASK 8.2 QUERY LOG ANALYSIS - JOHAN OOMEN, B&G

The archives are frequently queried by professionals, semi-professionals, and the general public to extract relevant audiovisual fragments. These query logs will be analyzed to derive the set of most relevant concepts which need to be in the thesaurus. In addition, the structure of the queries will be used to define the different query interfaces to be developed for the run-time interactive system.
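
One simple way to derive candidate concepts from query logs is to count in how many queries each term occurs. The sketch below illustrates this under strong assumptions: the log is taken to be a plain list of query strings, the example queries and the stopword list are invented, and terms are split on whitespace. The actual archive logs and the analysis used in the project may look quite different.

```python
from collections import Counter


def top_concepts(queries, stopwords=frozenset(), n=10):
    """Rank terms by the number of queries in which they occur.

    Each term is counted once per query, so a term repeated inside a
    single query does not inflate its score. Ties are broken
    alphabetically to keep the output deterministic.
    """
    counts = Counter()
    for query in queries:
        terms = {t for t in query.lower().split() if t not in stopwords}
        counts.update(terms)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Made-up query log for illustration only.
log = [
    "queen speech",
    "football goal",
    "the queen visit",
    "football match",
    "queen",
]
ranking = top_concepts(log, stopwords=frozenset({"the"}), n=2)
print(ranking)
```

The highest-ranked terms are candidates for inclusion in the thesaurus; mapping them onto actual audio-visual concepts would still require manual review.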

9 Dissemination

OBJECTIVE
The objective of this WP is to cluster activities and merge the efforts of the research community and other initiatives (other projects, fairs, and conferences), while avoiding information overload. In particular, strong cooperation is foreseen with thematic Networks of Excellence such as DELOS and MUSCLE, thereby facilitating collaboration amongst their members, increasing the efficiency of each partner's activities, and extending the scope, reach, and impact of the coordinated actions.

Dissemination
The dissemination of project results covers the three fields of demonstration carried out in WP7. A Dissemination Manager will be appointed at the beginning of the project to coordinate the consensus building and dissemination activities. The project results will be presented through an appropriate dissemination campaign from the beginning of the project. At month six, the Dissemination Manager will produce a Dissemination Plan that describes the contents and timing of specific dissemination activities to take place nationally as well as internationally. The Dissemination Plan will be updated every six months with contributions from all partners of the Consortium. Dissemination actions will address scientific communities through participation in the principal conferences and workshops in Europe. They will also address the community of end users through participation in selected events on behalf of the consortium. Finally, the Dissemination Plan will identify the most important professional or public events for the private sector to which we are invited, in order to show the progress of the project and to interact with media IT companies and other players in selected vertical markets on the validity of the results. As an alternative to visiting events, we consider visits to potential private sector partners to check the interest and needs of media producers in the emerging markets (for example Bollywood, India). The budget for dissemination is approx. €16,000. Dissemination products mentioned in Appendix X (leaflet, website, presentation and showcase) are included in this work package description.

TASK 9.1 PUBLIC AWARENESS AND DISSEMINATION

The project web site will be a fundamental instrument for providing information on the project’s activities and fostering the exchange of ideas during the whole project lifespan. The public access site will be mainly targeted at the wider research community, broadcasters, developers and content owners, but it also aims to be a public window for every citizen interested in broadcast and cultural topics. Dedicated web tools will be developed to give easy access to all the documents related to the project activities through a public download area containing:

1. project presentation and objectives,
2. Consortium presentation and Partner aims,
3. contact details,
4. used methodologies, guidelines,
5. publications and scientific articles,
6. events calendar,
7. meetings documentation,
8. product demos and user manual,
9. links to other relevant web sites.

To foster the exchange of ideas between the site users, a Forum will also be developed. A Newsletter or similar publication will provide project news (project developments, conferences, events, references to regulatory and policy developments) through a mailing list. The objective is to stimulate interest and inform target groups while avoiding information overload. The site will also be linked to other targeted web sites such as (1) partners' web sites, (2) Networks of Excellence, (3) Cordis – IST, e-Contentplus projects, (4) National Ministries and Minerva EC, (5) Scientific communities and other RTD projects. A dissemination campaign will be planned to reach the broadest dissemination among target groups and citizens. This task will be carried out through scientific publications, articles in scientific journals, leaflets, direct e-mailing and traditional mailings. We aim to cluster with other initiatives and share the participation in the IST Conferences Networking Sessions and Workshops organized yearly by European Commission services as an open forum for exchanging views and ideas. Local intermediate focus workshops are planned at the end of the first and second year, and we aim for the production of a final publication (printed and on-line versions). On the basis of the experiences gained within the project and the feedback obtained in the local intermediate workshops, an end-of-project showcase will be developed (drawing also from results of Tasks 7.5 and 7.6). Dissemination will be achieved through a series of local focus workshops organized by each partner at the end of the 1st and 2nd years. These workshops will be open to restricted technical groups. The events calendar will be decided during the first partners’ meeting.
A public conference will be organized at the end of the project. Partner testimonials will present the achieved results. European and national stakeholders will be invited to the event (European Commission and National Authorities officers, representatives of Networks of Excellence, project managers of other initiatives). The final publication will be produced in printed and on-line versions. The publication will contain all the achieved results, the adopted methodology and all the relevant project documents. The paper publication will be mailed to a targeted mailing list. An on-line version will be downloadable from the web site.

TASK 9.2 NETWORKING AND CLUSTERING OF THE PROJECT

The Consortium will join, from the beginning, the already established Networks of Excellence DELOS and MUSCLE and, during the project, will cluster its activities with other initiatives and projects (PRESTOSPACE, DELOS, MINERVA, TAPES, BRICKS etc.) and other IST Scientific and Cultural Communities. This will lead to the synchronisation and coordination of activities in order to increase the efficiency of each partner's activities and extend the scope, reach, and impact of the coordinated dissemination actions. It will also enhance co-operation among broadcasting RTD projects and stakeholders. It will deliver the results of past and current research to potential users in this important sector with the greatest dissemination potential, with the overall objective of improving video retrieval efficiency in Europe. Finally, it will minimise overlap and facilitate communications between national and EC-funded activities. Networking and clustering will thus contribute to an economical use of available resources. The networking and clustering activities will be carried out at two levels:

Internal
Internally, we aim to exploit synergies and to re-use and valorise results previously achieved by other thematic projects; to evaluate the effectiveness of different strategies and media in disseminating RTD results and supporting innovation in the European broadcasting sector; to encourage the formation of new RTD partnerships between stakeholders in broadcasting, including industry, content providers, developers and researchers; to find new users for the user groups’ activities; and to devise joint training courses.

External
For the external world, we aim to share an events calendar to maximize the efficiency of participation in network dissemination events and meetings; to exploit the existing dissemination channels of the Networks of Excellence and Communities; to sponsor the exchange of researchers; to inform the wider research community of the results of past and current research projects in this thematic area, thereby avoiding duplication; and to produce Common Recommendations to be submitted to European and national policy-makers.
