During the last month, Ramon and I have been working hard on writing about Digimatge, a Rich Internet Application for searching videos in the CCMA Multimedia Asset Manager. He is planning to defend his master's thesis in a month, while I have been writing a paper for the demo session at ACM MIR next March in Philadelphia. During this period we have studied many image search engines based on textual descriptors, visual descriptors, or both at the same time. I would like to comment on a few that we found interesting.
Searching for images with text is the most common method nowadays: firstly, because humans are really good at explaining what we are looking for with words and, secondly, because database technology for text retrieval is pretty mature and fast... at least more so than its multimedia-based counterpart. Nevertheless, this approach requires generating tags associated with the images. A first option is manual annotation of the content, a very time-consuming task when performed by a human. The solution is to automate the process. The most popular search engines nowadays, such as Google Images or Microsoft's Bing Images, find the tags in the image filenames or in the surrounding web text. Alternatively, other solutions try to extract textual data by looking at the images themselves. The first option is simply to look for text appearing in the image and then read it, as in the DIRS system. A second option is to teach the computer to recognize certain objects or concepts that appear in images. As this second option is my PhD thesis topic, I am really excited about it!
A growing trend in commercial search engines is the introduction of visual similarity criteria. In these cases, signal processing algorithms automatically generate the metadata, so the annotation and tagging problem goes away. There are different options for users to express their queries visually. On the one hand, they can simply choose which visual features (colors or textures) they are looking for, as is the case with Idée Multicolr. On the other hand, the user can provide examples or sketches of what he/she wants, as proposed by the versatile Simplicity from Stanford University or by GOS, the tool implemented at our lab, which already supports similarity at a global scale and will very soon do the same at a region scale.
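To make the idea of query by example at a global scale a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how GOS, Simplicity or Multicolr actually work: each image is reduced to a normalized global RGB histogram, and images are ranked by histogram intersection with the example. The function names (`color_histogram`, `histogram_intersection`, `rank_by_example`) and the representation of an image as a plain list of pixels are hypothetical choices made for the example.

```python
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Return a normalized global RGB histogram with bins**3 cells."""
    if not pixels:
        raise ValueError("image has no pixels")
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins-1.
        ri, gi, bi = (r * bins) // 256, (g * bins) // 256, (b * bins) // 256
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_by_example(
    query: Sequence[Pixel],
    collection: Dict[str, Sequence[Pixel]],
    bins: int = 4,
) -> List[Tuple[str, float]]:
    """Rank the images in `collection` by global color similarity to `query`."""
    query_hist = color_histogram(query, bins)
    scored = [
        (name, histogram_intersection(query_hist, color_histogram(px, bins)))
        for name, px in collection.items()
    ]
    # Most similar image first.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A global descriptor like this ignores where the colors appear in the image, which is exactly the limitation that similarity at a region scale tries to address.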
Two modalities to search with. Which is the best one? We have text, which lets you express your ideas pretty precisely, but there are also visual cues, which are immediately available because they do not require annotation. The trend right now is to combine both in multimodal interfaces ready to process text and visual queries. In most cases, text is used for a fast initial search, and the results are later filtered according to visual similarity. This is the enhancement that Google Similar Images brought a few months ago, and also the strategy of Sapir, Xcavator and Picitup Shop.
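The two-stage strategy described above (text first, visual similarity afterwards) can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the actual implementation of any of the engines mentioned: the `Asset` class, its precomputed `features` vector and the functions `text_search`, `visual_distance` and `hybrid_search` are all hypothetical names, and the visual distance is a plain Euclidean distance between descriptors.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass
class Asset:
    name: str
    tags: Set[str]
    features: Sequence[float]  # hypothetical precomputed visual descriptor


def text_search(assets: Sequence[Asset], query: str) -> List[Asset]:
    """Stage 1: keep the assets that share at least one tag with the query."""
    terms = set(query.lower().split())
    return [a for a in assets if terms & {t.lower() for t in a.tags}]


def visual_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two descriptors of equal length."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def hybrid_search(
    assets: Sequence[Asset],
    query: str,
    example_features: Sequence[float],
    top_k: int = 10,
) -> List[Asset]:
    """Stage 2: re-rank the text matches by visual distance to the example."""
    candidates = text_search(assets, query)
    candidates.sort(key=lambda a: visual_distance(a.features, example_features))
    return candidates[:top_k]
```

The point of this ordering is cost: the cheap text filter shrinks the candidate set, so the more expensive visual comparison only runs on the assets that already match the words.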
This last category is where Digimatge falls. Let's cross our fingers and hope the conference reviewers are interested in learning how these hybrid videotextual queries will be applied in a broadcaster domain such as CCMA.