Boosting Image Retrieval through Aggregating Search Results based on Visual Annotations
Ximena Olivares¹, Massimiliano Ciaramita², Roelof van Zwol²
¹Universitat Pompeu Fabra   ²Yahoo! Research
Introduction
Very large, continuously growing online collections of human-annotated content
Annotations are essential to make photos easily retrievable
However, retrieval models that are effective in text retrieval do not work as well for text-based image retrieval
Introduction: 1. Text-based queries
Textual annotations are sparse and short [Marlow et al.]
Tagging has diverse motivations: spatial, temporal, social [Ames & Naaman]
Events, personalities, media tagging [Dubinko et al.]
Example tags of a photo: Illinois, Chicago, autumn, farmers market, Thanksgiving, holiday, thoughts, prayer, thankful, thanks
Introduction: 2. Query by Image Content
Requires the user to begin with a sample image
Introduction: 3. Detectors
Specific to each topic (e.g. face, skin, anchorman, ...)
Good results in narrow domains
Introduction: What about broad domains? (e.g. the Web, Flickr)
Billions of photos
Billions of users
Large quantities of annotations
Problems:
The retrieval performance of content-based image systems is lower than that of keyword-based systems
Results of a search system using only tags are noisy and suboptimal
Introduction: Flickr allows another kind of annotation (notes)
A note associates text with a visual area of the photo
Highly relevant to the content
→ Visual annotation
Valuable for learning the different visual representations of an object
Motivation
The main objective of our research is to improve retrieval performance, specifically precision at early recall, by combining textual and visual information.
Main idea: use visual annotations (text & image) and rank aggregation to improve retrieval.
High-level outline
1. The user performs a query (e.g. "coke can")
2. Visual annotations matching the query are selected
3. For each annotation, the top-k similar images are retrieved using content-based image retrieval
4. The result lists are aggregated to obtain the final result ranking (a minimal end-to-end sketch follows)
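To make the pipeline concrete, here is a minimal Python sketch of the four steps. The `text_index` and `cbir_index` interfaces and their method names are hypothetical placeholders, not components described in this work; `borda_aggregate` is sketched later in the aggregation section.

```python
def retrieve(query, text_index, cbir_index, n_annotations=10, k=25):
    # Steps 1-2: the user's keyword query selects visual annotations
    # (Flickr notes) whose text matches the query.
    annotations = text_index.matching_visual_annotations(query)[:n_annotations]

    # Step 3: for each annotation region, content-based image retrieval
    # returns the top-k visually similar images.
    ranked_lists = [cbir_index.top_k_similar(a.region, k) for a in annotations]

    # Step 4: the per-annotation rankings are aggregated (e.g. with
    # Borda count, sketched later) into the final result list.
    return borda_aggregate(ranked_lists)
```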
Content-based Image Retrieval
Research in this area is extensive. Desired characteristics of the content-based image retrieval component:
State-of-the-art object retrieval
Scalable to a Web domain
We adopted the framework proposed by [Sivic and Zisserman], which has been successfully applied to a collection of Flickr images for detecting buildings [Philbin et al.], with promising results in scalability and performance. It serves as our baseline.
Content-based Image Retrieval
1. Extract and describe visual features
Processed 12,000 images; computed Harris and Hessian interest points, described using SIFT
2. Build a visual vocabulary (see the sketch below)
Clustered the SIFT descriptors with an approximate k-means algorithm to create a vocabulary of 10,000 visual words
Three resulting vocabularies: based on Harris features, on Hessian features, and on the combination of the two
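A minimal sketch of the vocabulary-building step, assuming the SIFT descriptors of all images are pooled into one array. The paper uses an approximate k-means implementation; scikit-learn's `MiniBatchKMeans` stands in for it here.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors: np.ndarray, n_words: int = 10_000):
    # descriptors: (N, 128) array of SIFT descriptors pooled over the corpus.
    # Each resulting cluster centre acts as one visual word.
    kmeans = MiniBatchKMeans(n_clusters=n_words, random_state=0)
    kmeans.fit(descriptors)
    return kmeans

def quantize(kmeans, image_descriptors: np.ndarray) -> np.ndarray:
    # Assign each descriptor of a single image to its nearest visual word.
    return kmeans.predict(image_descriptors)
```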
Content-based Image Retrieval
3. Represent an image as a set of visual words
By analogy with text retrieval, images can be represented in the vector space model, and similarity can be measured with cosine similarity (sketched below)
4. Take the spatial distribution of words into account
The spatial distribution of words is significantly more important in image retrieval than in text retrieval
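The text-retrieval analogy can be sketched as follows: each image becomes a tf-idf-weighted bag of visual words, and ranking is by cosine similarity. The spatial-consistency check mentioned above is omitted from this sketch.

```python
import numpy as np

def tf_idf(histograms: np.ndarray) -> np.ndarray:
    # histograms: (n_images, vocab_size) counts of visual words per image.
    df = (histograms > 0).sum(axis=0) + 1                  # document frequency
    idf = np.log(histograms.shape[0] / df)
    tf = histograms / np.maximum(histograms.sum(axis=1, keepdims=True), 1)
    return tf * idf

def cosine_rank(query_vec: np.ndarray, image_vecs: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(image_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = image_vecs @ query_vec / np.maximum(norms, 1e-12)
    return np.argsort(-sims)                               # best matches first
```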
Aggregating visual annotations: Borda count
Aslam & Montague compared different aggregation methods in a Web scenario; Borda count proved to be a simple and efficient algorithm (sketched below)
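A minimal Borda-count sketch: the top item of a k-item list receives k points, the next k-1, and so on. Items absent from a list receive nothing here; Aslam & Montague also discuss variants that distribute the remaining points among unranked candidates.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        k = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += k - position   # rank 0 earns k points
    return sorted(scores, key=scores.get, reverse=True)

# e.g. borda_aggregate([["a", "b", "c"], ["b", "d", "a"]])
# -> ["b", "a", "d", "c"] with scores 5, 4, 2, 1
```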
Evaluation Task
Hypotheses:
H1: Rank aggregation using visual annotations will significantly improve retrieval performance in terms of precision
H2: Tag-based search combined with CBIR using visual annotations will improve retrieval in terms of precision
Experimental setup: Image Collection (Flickr)
Images with high variability
Extracted from the Web and annotated by Web users
Crawled 12,000 images (via the Flickr API), based on their tags
No restriction on relevance to the surrounding tags, or on whether the object actually appears in the image
Contains 59,693 unique tags (out of a total of 229,672)
Medium image size: 500×333 pixels
Topics
A set of 30 topics, derived from Flickr search logs
Filtered the most frequent queries for objects
Defined four categories: 1. Fruits and flowers; 2. Monuments and buildings; 3. Brands and logos; 4. General objects
List of topics: Strawberry, Daisy, Moai, Sunflower, Sushi roll, Golden Gate, McDonald logo, Taj Mahal, Hot air balloon, Petronas Twin Towers, Telephone booth, Butterfly, Converse, Watermelon, American flag, Big Ben clock tower, Arc de Triomphe, Clock, Coke can, CN tower, Dice, Eiffel tower, Engagement ring, Guitar, Soccer ball, Statue of Liberty, Apple logo, Rose, Parthenon
Systems
Each system takes a keyword-based query as input and produces a ranked list of image results as output.
S1: Text-based retrieval
Based on the vector space model for text retrieval; images are retrieved for a given query using their textual annotations
S2: Content-based image retrieval using visual annotations
The keyword-based query is used to select, at random, one of the visual annotations that matches the query. We constructed 25 random runs, for which we report the average performance
S3: Aggregated ranking over the results of CBIR using visual annotations
Search using 10 visual annotations per topic and retrieve the top-25 results for each annotation; apply rank aggregation over the result lists
S4: CBIR using visual annotations and a tag filter
Similar to S2, with an additional filter over the image annotations: every term of the query must match the image's tags (a sketch of this filter follows)
S5: Aggregated ranking over the results of CBIR using visual annotations and tag filters
Similar to S3, with the additional filter over the image annotations
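A sketch of the tag filter in S4/S5, under the assumption (from the description above) that an image passes only if every term of the keyword query occurs among its tags; `image_tags` is a hypothetical mapping from image id to its tag collection.

```python
def tag_filter(ranked_images, query, image_tags):
    # Keep a CBIR result only if all query terms appear among its tags.
    terms = set(query.lower().split())
    return [img for img in ranked_images
            if terms <= {t.lower() for t in image_tags.get(img, ())}]
```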
Pooling and Assessments
Topic pools are based on the top-25 results retrieved by each system for each topic
Assessors judged the results for a given topic as relevant or not relevant
Evaluation measures: we are mainly interested in achieving high precision at the top of the ranking, so we focus on P@N with N ranging from 1 to 25 (sketched below)
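The evaluation measure itself is straightforward; a small sketch, given the binary relevance judgments from the assessors:

```python
def precision_at_n(ranked_images, relevant_set, n):
    # Fraction of the top-n results judged relevant.
    return sum(1 for img in ranked_images[:n] if img in relevant_set) / n

# e.g. the full P@1..P@25 curve for one topic:
# [precision_at_n(results, relevant, n) for n in range(1, 26)]
```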
Results: Feature selection & summary statistics
Feature selection (P@10 per system and visual vocabulary):

System   COM    HAR    HES
S2       0.31   0.23   0.27
S3       0.48   0.49   0.48
S4       0.71   0.69   0.72
S5       0.80   0.77   0.79

The differences between the vocabularies are not significant; the combined (COM) vocabulary is used in the remaining experiments.
Summary statistics: S5 outperforms all other systems.

System   Images Retrieved   Relevant Retrieved   P@5    P@10
S1       750                393                  0.53   0.49
S2       750                149                  0.34   0.31
S3       750                301                  0.55   0.48
S4       742                494                  0.72   0.71
S5       748                562                  0.82   0.80
Results: Systems comparison
[Figure: precision at N for the five systems: tags only (S1), visual annotations (S2), aggregated visual annotations (S3), visual annotations + tags (S4), aggregated visual annotations + tags (S5); the marked gaps correspond to hypotheses H1 and H2.]
Results: Topic analysis
To check whether the overall results are caused by abnormalities in a few topics, we examine per-topic performance.
[Figure: histograms of topic frequency over P@10 for S5, aggregated annotations and tags (avg 0.80, std 0.19), and S1, tags only (avg 0.49, std 0.24).]
We observe a significant and uniform increase in retrieval performance across all topics.
Results: Topic analysis, MAP per topic
[Figure: MAP histogram per topic for the five systems (tags only, visual annotations, aggregated visual annotations, visual annotations + tags, aggregated visual annotations + tags); the Butterfly topic is highlighted.]
Conclusions and Future Work
We have proposed to use rank aggregation to combine the result sets of a content-based image retrieval system that uses visual annotations to retrieve similar images.
Our results clearly show that the quality of the results improves significantly when applying the aggregation.
Future work:
Aggregation strategies that can be applied in a pre-retrieval fashion, rather than post-retrieval
Learning to select good visual annotations
Detecting different senses of the same keyword using visual analysis