Semantic Scene Modeling and Retrieval

DISS. ETH NO. 15751

Semantic Scene Modeling and Retrieval

A dissertation submitted to the SWISS FEDERAL INSTITUTE OF TECHNOLOGY ZURICH

for the degree of Doctor of Technical Sciences

presented by

JULIA VOGEL
Dipl.-Ing. Elektrotechnik, M.S. Electrical and Computer Engineering

born November 23, 1973
citizen of Germany

accepted on the recommendation of
Prof. Dr. Bernt Schiele, examiner
Prof. Dr. Andrew Zisserman, co-examiner

2004

Abstract

Semantics-based image retrieval has gained increasing interest in recent years. As an area of linguistics, semantics deals with the sense and the meaning of language. In the context of content-based image retrieval, the research goal is to access the meaning of images by naming or describing the most important image regions and their relationships. The topic of this dissertation is the semantic description, understanding, and modeling of natural scenes. The primary objective is to develop a computational image representation that reduces the semantic gap between the image understanding of humans and that of the computer. For humans, the most intuitive means of communication about images is image description. Image semantics and image description are thus closely interconnected.

We propose a semantic modeling of natural scenes that is based on the classification of local semantic concepts. Image regions are extracted on a regular 10x10 grid. The resulting patches are classified into nine concept classes that subsume the main semantic content of the database images. Images are represented through the frequency of occurrence of the semantic concepts. This semantic modeling constitutes a compact semantic image representation that makes it possible to describe or search for specific image content or, on a higher level, to model the semantic content of natural scene categories.

The semantic modeling has been studied intensively for the categorization and retrieval of natural scenes. Depending on the classification method and on the quality of the concept detectors, good to very good categorization and retrieval performance has been obtained. In particular, it is shown that the semantic modeling leads to considerably better categorization and retrieval performance than directly employing low-level features. Nevertheless, the analysis of the mis-categorized scenes reveals that the frequent semantic ambiguity of the database images calls for a typicality ranking of images rather than for hard-decision categorization. This hypothesis is supported by two psychophysical experiments. Humans are able to categorize images consistently, but the employed database consists to a large degree of images that can be assigned to several scene categories. However, the human participants were very consistent in ranking the database images according to their semantic typicality. It is shown visually and quantitatively that the proposed semantic modeling is also well suited for the semantic ranking of images. In particular, the typicality transition between two scene categories can be modeled. In addition, we propose a perceptually plausible distance measure that represents the most discriminant semantic concepts of each scene category. The typicality ranking obtained with this distance measure correlates highly with the human rankings.

Finally, this thesis discusses the problem of performance evaluation in content-based image retrieval systems. When searching for specific local semantic content, the retrieval results can be modeled statistically. We develop closed-form expressions for the prediction of precision and recall in our vocabulary-supported retrieval system. In addition, these expressions make it possible to improve precision and recall by up to 60%.


Zusammenfassung

Content-based image retrieval is concerned with the question of the content similarity of images. For several years, there has been particular interest in methods that extract specifically semantic information from images. Semantics is a subfield of linguistics and deals with the sense and the meaning of language. Applied to images, the goal is accordingly to capture the sense and the meaning of images. One step in this direction is the naming and description of the most important subregions of an image and of their relationships.

This dissertation is concerned with image understanding, in particular with the semantic description and modeling of images. A main goal is the development of an image representation that reduces the so-called "semantic gap", that is, the difference between the image understanding of humans and that of the computer. While images are represented in the computer as a matrix of pixel values, for humans description is the most intuitive form of communication about images. Image semantics and image description are therefore closely related.

In this work, a semantic modeling of images is presented that is based on the classification of local semantic concepts. Images are divided into a regular grid of 10x10 image regions, and the resulting image patches are classified as one of nine semantic concepts. These nine concept classes cover 99.5% of the semantic content of the images used. Subsequently, the frequency distribution of the semantic concepts is determined for each image. With the help of these so-called "concept-occurrence vectors", images can be represented very compactly. In particular, the image representation through semantic modeling makes it possible to describe images, to search for images with particular semantic concepts, or even to model the semantic content of scene categories.

As application areas for the proposed image representation, the categorization and retrieval of natural scenes were tested. Categorization and retrieval results are good to very good, depending on the chosen classification method and the quality of the corresponding concept detectors. In particular, it is shown that both image categorization and image retrieval lead to better results when the semantic modeling is employed than when image features are extracted directly. Nevertheless, the analysis of the mis-categorized images shows that these are semantically ambiguous and cannot be firmly assigned to a single category. For this reason, a ranking of images according to their semantic content appears more sensible. This hypothesis is supported by two psychophysical experiments. Although the participants were able to categorize images consistently, it turns out that the database used consists to a large degree of images that can be assigned to several categories. The second study shows that the participants are also very consistent in ranking the images according to their semantic similarity. The proposed method of semantic modeling is very well suited for the semantic ranking of outdoor scenes. This is shown both visually and quantitatively. In particular, the typicality transitions between two scene categories can be modeled. Furthermore, a distance measure is proposed that learns, in a psychophysically plausible way, the discriminative concepts of each category. The similarity ranking obtained with this measure correlates highly with the human rankings.

This dissertation also discusses the question of performance evaluation of content-based image retrieval systems. When the semantic modeling is used to search for particular image content, the retrieval results can be modeled statistically. Based on the statistical model, precision and recall of our vocabulary-supported image retrieval can be expressed in closed form. This permits the prediction of precision and recall before the actual retrieval. Moreover, internal parameters can be adjusted in an iterative procedure to optimize precision and recall, increasing these values by up to 60%.

Acknowledgments

I would like to sincerely thank all the people who contributed in various ways to the successful completion of this dissertation.

My sincere gratitude goes to my advisor, Prof. Bernt Schiele, for his guidance and motivation throughout this thesis. In numerous meetings and discussions, he managed both to free me when I was stuck in too many details and to keep me focused when I was trying to solve several major challenges in computer vision simultaneously.

I am grateful to Prof. Andrew Zisserman for his interest in my work and for agreeing to be the co-examiner of this dissertation. The discussions with him were enlightening as well as encouraging.

Special thanks go to my collaborators from the Department of Psychology at the University of Zurich, Adrian Schwaninger and Franziska Hofer. With their psychophysical expertise, they helped to develop ideas and settings for the human studies. I would like to thank especially Franziska for the organization and supervision of the experiments.

The four years at ETH would have been boring without the crew of the PCCV (Perceptual Computing and Computer Vision) Group. Thanks to Bastian Leibe, Florian Michahelles, Hannes Kruppa, Martin Spengler, Nicky Kern, and Stavros Antifakos for fun during group retreats and for interesting, not solely research-related discussions during numerous lunch, coffee, and dinner breaks. Special thanks to Bastian and Martin for sharing their code, and to Bastian for the stimulating discussions, especially during the last months.

This work was part of the CogVis (Cognitive Vision) project, funded by the Commission of the European Union (IST-2000-29375) and the Swiss Federal Office for Education and Science (BBW 00.0617). Thanks to all CogVis members from KTH Stockholm, Hamburg University, the Max Planck Institute for Biological Cybernetics, Leeds University, DIST Genova, ETH, and the University of Ljubljana for stimulating research meetings.

Last, but not least, I would like to thank my family for their help, encouragement, and emotional support. In addition, many friends provided assistance in difficult as well as joyous moments. Special thanks to Marcel, but also to Gudrun, Nicky, Manuel, Salvo, Claudia, and others.


Contents

1 Introduction and Motivation
  1.1 Contribution
  1.2 Outline of the Thesis

2 Related Work
  2.1 Semantic Scene Understanding in Content-Based Image Retrieval
    2.1.1 General Image Retrieval
    2.1.2 Scene Classification
    2.1.3 Image Description and Annotation
    2.1.4 Annotation of Image Regions
    2.1.5 Visual Dictionary Approaches
    2.1.6 Finding Objects in Images
    2.1.7 Video Retrieval
  2.2 Psychophysics of Natural Scene Perception
  2.3 Performance Evaluation in Computer Vision

3 Semantic Scene Modeling
  3.1 Semantic Modeling
  3.2 Natural Scene Retrieval
    3.2.1 Search for Image Content
    3.2.2 Scene Categorization
      Selection of Scene Categories
  3.3 Databases

I Performance Prediction and Optimization

4 Performance Prediction
  4.1 Performance of the Concept Detectors
  4.2 Mathematical Framework
  4.3 Experiments and Discussion

5 Performance Optimization
  5.1 Performance Optimization in Stage II
    5.1.1 Optimization Algorithm
    5.1.2 Results: Stage II Performance Optimization
    5.1.3 Approximate Performance Optimization
  5.2 Performance Optimization in Stage I
    5.2.1 Training of the Concept Detectors
    5.2.2 Results: Stage I Performance Optimization
  5.3 Joint Two-Stage Performance Optimization
  5.4 Performance Optimization by Query Mapping
  5.5 Summary of Part I

II Categorization and Retrieval of Natural Scenes

6 Concept Detectors
  6.1 Local Semantic Classification
  6.2 Features
  6.3 Classification Methods
    6.3.1 k-Nearest Neighbor
      Location Prior
    6.3.2 Support Vector Machines
  6.4 Experiments
    6.4.1 Results of the kNN classification
    6.4.2 Results of the SVM classification
  6.5 Discussion

7 Scene Categorization and Retrieval
  7.1 Categorization Approaches
    7.1.1 Representative Approach: Category Prototypes
    7.1.2 Discriminative Approach: Multi-Class SVM
  7.2 Scene Categorization
    7.2.1 Categorization based on annotated image regions
    7.2.2 Categorization based on classified image regions
    7.2.3 Categorization without semantic modeling step
  7.3 Scene Retrieval
    7.3.1 Retrieval based on annotated image regions
    7.3.2 Retrieval based on classified image regions
    7.3.3 Retrieval without semantic modeling step
  7.4 Discussion
    7.4.1 Semantic Typicality Transitions between Categories

8 Semantic Typicality of Natural Scenes: Human Studies
  8.1 Psychophysical Experiment 1: Categorization
    8.1.1 Participants
    8.1.2 Materials and Procedure
    8.1.3 Results
  8.2 Psychophysical Experiment 2: Typicality
    8.2.1 Participants
    8.2.2 Materials and Procedure
    8.2.3 Results
  8.3 Summary of Psychophysical Studies

9 Perceptually Plausible Ranking of Natural Scenes
  9.1 Typicality Ranking using the Prototype Approach and the SSD
  9.2 Typicality Ranking using a Perceptually Plausible Distance Measure
  9.3 Comparison SSD vs. PPD
  9.4 Typicality Ranking with the SVM Approach
  9.5 Discussion

10 Conclusion and Perspective
  10.1 Conclusion
  10.2 Perspective

A Additional Results of the Semantic Typicality Ranking

List of Figures

List of Tables

Bibliography

1 Introduction and Motivation

Last week we were hiking in the mountains. Stepping out of the forest, a wonderful view: “In front of us a small lake, behind it a rugged, snow-capped mountain, and on the left the dark forest.” Most humans have a clear idea of the scenery after listening to this semantic description because human verbal communication consists of many similar descriptions. Humans use them for illustrating experiences as above, for giving directions (e.g. “Go through the forest. As soon as you catch sight of the large mountain, turn onto the path on the right.”), for describing a particular setting in a movie, or for searching for an image of the view they encountered during their hike.

The field of research that deals with the search for images based on automatically extracted features is content-based image retrieval (CBIR). The main challenge lies in the fact that for a computer, the image content is a matrix of pixel values that are summarized by low-level color, texture, or shape features, whereas for humans the content of an image refers to what is seen in the image, e.g. “a forest, a house, a lake”, or “a solemn lake scene”, or “a summer residence at a lake”, or “our summer residence at lake X”. As exemplified here, depending on experience and memories, on mood, or on the application, the same image content can have very different meanings to different people. One of the research issues in content-based image retrieval is to overcome this semantic gap between the image understanding of humans and the image understanding of the computer.

Especially since the multimedia PC and the digital camera entered people's homes, there is new interest in automatically searching and categorizing large collections of pictures. There is the personal photo album with the pictures of holidays, family reunions, and outdoor activities; there are the little films of the same events; and there are all these interesting pictures on the Internet.


Due to the cheap and easy way of obtaining images, average users have large amounts of pictorial data on their personal computers that they would like to organize, search, and compare. For those requirements in personal homes, an image retrieval system should, for example, provide an automatic categorization of images. But most notably, it should enable the user to describe images semantically because, for humans, description is the most intuitive means of communication about pictures.

The topic of this dissertation is the semantic description, understanding, and modeling of natural scenes. The goal is to reduce the semantic gap between the image understanding of the human and the computer by developing an image representation that supports the description of images. The work focuses especially on the semantics of natural images. Although recently often used in the context of content-based image description, semantics is actually an area of linguistics that deals with the sense and the meaning of language and with the question of how to deduce the meaning of complex concepts from the meaning of simple concepts. Because of this linguistic background, for us, semantic description implies verbal description, and we thus aim for a description of natural scenes based on keywords.

The main idea is to represent images through histograms of local semantic concepts (see the sketch at the end of this section). Semantic concepts are descriptors for the semantic content of small image subregions that, when accumulated into contiguous regions, might also form a larger semantic entity. For example, several grass regions might actually make up a meadow. We show in this thesis that the proposed image representation based on semantic concepts, that is, the semantic modeling, is well suited for capturing the main semantic content of natural scenes. Database images can be searched for by description using the vocabulary composed of the semantic concepts. In addition, the image representation through semantic modeling makes it possible to learn the relevant semantic content of full natural scene categories. The experiments in this thesis illustrate that the image and category representations are capable of categorizing, retrieving, and even typicality-ranking natural scenes. Furthermore, the modeling has been validated from a human point of view, showing that the computational typicality ranking correlates highly with human rankings of the same natural scenes.

The term “natural scene” often refers to photographic images as opposed to line drawings or paintings. In the context of this thesis, a natural scene is understood as a photographic image of nature without any additional artificial or human objects such as buildings, cars, or people. So far, images of purely natural content have not been studied in detail, either from a semantic image understanding or from a human perception perspective. One reason might be that those images have a semantic variety that is both hard to model computationally and difficult to study psychophysically. More attention has been directed towards research on the identification or categorization of rigid or semi-rigid objects such as cars, bikes, animals, people, or faces, which often convey the main meaning of an image. However, the recognition and modeling of the nature parts of a scene opens up interesting applications in context priming for object-based recognition systems. Other applications that rely on understanding the context, such as color correction or segmentation algorithms, might also profit from the modeling of nature scenes.
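To make the representation concrete, the following is a minimal sketch of a concept-occurrence vector computed from a grid of classified patches. The concept names and the toy label grid are illustrative assumptions; the actual vocabulary and the concept detectors are defined in Chapters 3 and 6.

```python
import numpy as np

# Nine illustrative local semantic concepts (assumed names; the actual
# vocabulary is introduced in Chapter 3).
CONCEPTS = ["sky", "water", "grass", "trunks", "foliage",
            "field", "rocks", "flowers", "sand"]

def concept_occurrence_vector(patch_labels):
    """Relative frequency of each concept over the regular image grid."""
    counts = np.array([(patch_labels == c).sum() for c in CONCEPTS], dtype=float)
    return counts / counts.sum()

# Toy 10x10 grid: top four rows classified as sky, the rest as water
# (a stand-in for the output of real concept detectors).
grid = np.array([["sky"] * 10] * 4 + [["water"] * 10] * 6)
print(concept_occurrence_vector(grid))  # 0.4 for sky, 0.6 for water
```

Two images can then be compared simply through a distance between their concept-occurrence vectors.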


The thesis also addresses a second challenge in content-based image retrieval: the performance evaluation of CBIR systems. Performance evaluation of CBIR systems refers to the benchmarking of retrieval systems with respect to precision, recall, accuracy, or other evaluation measures. It is indispensable for the comparison of different algorithms, and it permits the integration of CBIR algorithms into larger systems. However, performance evaluation of image retrieval systems is especially challenging. The ultimate notion of performance should be related to user satisfaction. Since the desired outcome of the retrieval varies greatly between individuals, tasks, and applications, and even between sessions with the same user, it is difficult to obtain the consistent, yet necessary, ground truth. In addition, the semantic complexity of full images renders the generation of ground truth very difficult. The semantic image representation proposed in this thesis splits the image analysis into two steps, thereby making the acquisition of ground truth manageable. Together with an appropriate model of the image classification process, the performance of the retrieval system, that is, its precision and recall, can be predicted as well as optimized.

An important concern of this work is the evaluation and thorough analysis of all computational steps. The goal is especially to understand both the success and the failure of the employed methods from a semantic point of view. For that reason, many exemplary images have been selected in all chapters for illustrating the system's behavior and for drawing conclusions.

1.1 Contribution

This dissertation makes four contributions to the field of scene modeling and content-based image retrieval.

Semantic Modeling of Natural Scenes. In this thesis, an image representation is developed that provides access to scenes through local semantic image description. In particular, it is shown that this semantic image representation is well suited for modeling the semantic content of scene categories. Various methods for the categorization and retrieval of scenes based on the proposed image and category representation are discussed. Furthermore, the image representation based on semantic modeling lends itself to ranking natural scenes according to their semantic similarity. This application is of special interest for content-based image retrieval systems that rely on the correct ordering of the returned images.

Psychophysical Experiments. Two psychophysical experiments were conducted that study human perception of the scenes employed in this thesis. The results give evidence that some of the natural scenes employed are in fact so ambiguous that a hard-decision scene categorization is futile from a human point of view. Nevertheless, the human participants were able to consistently categorize the natural scenes and to rate their semantic typicality with respect to five scene categories.

Perceptually Plausible Distance Measure. The human typicality rankings are used to learn a psychophysically plausible distance measure. With this distance measure, the natural scenes are ordered according to their semantic typicality. The typicality ranking thus obtained correlates highly with the human typicality ranking.

Performance Prediction and Optimization. With the proposed semantic modeling, images can also be searched for by specifying local region content. For this so-called vocabulary-supported image retrieval, the retrieval performance can not only be predicted but also optimized. In this thesis, closed-form expressions for the prediction of precision and recall in the vocabulary-supported retrieval system are developed. In addition, these expressions are employed for various forms of performance optimization of the proposed retrieval system.

Parts of this dissertation have been published in refereed conference papers. The local semantic concepts and the basic two-stage retrieval system were first proposed in [Schiele and Vogel, 2000]. [Vogel and Schiele, 2001] covers the performance prediction of the vocabulary-supported retrieval system. In [Vogel and Schiele, 2002a], the performance prediction was extended to performance optimization, and [Vogel and Schiele, 2002b] compares several techniques for performance optimization. [Vogel and Schiele, 2004a] addresses the topic of natural scene categorization and retrieval based on the semantic modeling. In [Vogel and Schiele, 2004b], the futility of hard-decision scene categorization is discussed and the semantic typicality ranking of natural scenes is introduced.

4

Chapter 1. Introduction and Motivation

Perceptually Plausible Distance Measure. The human typicality rankings are used to learn a psychophysically plausible distance measure. With this distance measure, the natural scenes are ordered according to their semantic typicality. The thus obtained typicality ranking correlates highly with the human typicality ranking. Performance Prediction and Optimization. With the proposed semantic modeling, images can also be searched for by specifying local region content. For this so-called vocabulary-supported image retrieval, the retrieval performance can not only be predicted, but also optimized. In this thesis, closed-form expressions for the prediction of precision and recall in the vocabulary-supported retrieval system are developed. In addition, these expressions are employed for various forms of performance optimization of the proposed retrieval system. Parts of this dissertation have been published in refereed conference papers. The local semantic concepts and the basic two-stage retrieval system were first proposed in [Schiele and Vogel, 2000]. [Vogel and Schiele, 2001] covers the performance prediction of the vocabulary-supported retrieval system. In [Vogel and Schiele, 2002a], the performance prediction was extended to performance optimization. In addition, [Vogel and Schiele, 2002b] compares several techniques for performance optimization. [Vogel and Schiele, 2004a] addresses the topic of natural scene categorization and retrieval based on the semantic modeling. In [Vogel and Schiele, 2004b], the futility of hard-decision scene categorization is discussed and the semantic typicality ranking of natural scenes is introduced.

1.2 Outline of the Thesis

This thesis is divided into two parts. Both parts build upon the semantic modeling defined in Chapter 3.

Part I comprises Chapters 4 and 5 and deals with performance prediction and optimization of our vocabulary-supported retrieval system. In vocabulary-supported image retrieval, a set of local semantic concepts allows the user to search for particular image content. The errors of the concept detectors are modeled statistically and can thus be predicted before the retrieval takes place (a toy illustration follows below). In addition, the detector errors can be compensated for during the retrieval, thus improving the retrieval performance.

Part II consists of Chapters 6 – 9. It covers the semantic description of natural scenes and natural scene categories. The semantic image and category representation is used for the categorization and retrieval of images. Furthermore, motivated by psychophysical experiments, a typicality ranking of natural scenes is favored that correlates highly with human perception.
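As a toy illustration of the statistical modeling in Part I (a sketch under the simplifying independence assumption also used in Chapter 4, with invented numbers, not the thesis's closed-form expressions): if each patch that truly shows a concept is detected correctly with probability p, the number of correctly detected patches follows a binomial distribution, from which the chance of an image satisfying a query threshold can be computed.

```python
from math import comb

def prob_at_least(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p): the probability that at least k
    of n concept patches are classified correctly, assuming independent
    per-patch errors."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Invented numbers: 30 of the 100 grid patches truly show 'water', each is
# detected with probability 0.85, and a query asks for at least 20% water.
print(prob_at_least(20, 30, 0.85))  # chance that the image is retrieved
```

In the following, the content of each chapter is briefly summarized.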


Chapter 2 reviews research that is relevant in the context of this thesis. The goal is especially to highlight the main research directions concerning the semantic understanding, modeling, and retrieval of images. In addition, we discuss work in psychophysics that provides insight into human perception of scenes.

Chapter 3 gives an overview of the requirements that we consider crucial for a semantic image retrieval system. Subsequently, an image representation based on semantic modeling is introduced that fulfills these requirements. Both parts of the thesis build upon this image representation. The core of the semantic modeling is the labeling of local image regions with semantic concepts. This makes it possible to search for images using a given vocabulary or to fully describe natural images.

Chapter 4 presents closed-form expressions for the prediction of precision and recall in the proposed retrieval system. Based on the assumption that the local region classifiers work independently, the main idea is to model the classification of all image regions as a binomial distribution.

Chapter 5 uses the performance prediction developed in Chapter 4 to modify internal parameters for the performance optimization of the retrieval system. Several methods for the optimization of only precision, only recall, or both precision and recall are developed and compared.

Chapter 6 compares the k-Nearest Neighbor and the SVM method for the classification of local semantic concepts. The classification problem is especially challenging because the concept classes are of very unequal size and because, despite their semantic coherence, the concept classes exhibit a strong visual dissimilarity. Several low-level feature combinations and parameter settings are evaluated experimentally. In addition, the use of a location prior for this particular classification problem is discussed.

Chapter 7 covers the categorization and retrieval of six semantic scene categories based on the semantic modeling introduced in Chapter 3. The main goal is to evaluate the applicability of the semantic modeling for learning a high-level category representation. For the categorization, a prototype approach and an SVM approach are compared in detail. The categorization and retrieval experiments in the chapter suggest that a hard-decision scene categorization is actually not desirable from a user's point of view. Instead, images should be ranked semantically. Thus, the chapter concludes with experiments on the semantic typicality ranking of natural scenes.

Chapter 8 summarizes two psychophysical experiments concerning human perception of the natural scenes used in this work. The first experiment shows that humans are able to consistently categorize the database images, although the majority of the database is semantically ambiguous with respect to the given scene categories. In the second experiment, the human subjects were asked to grade the semantic typicality of the presented scenes with respect to the five given scene categories. This task, too, was performed consistently by the humans. From the experiment, a set of human typicality rankings per database image was obtained.


Chapter 9 proposes a perceptually plausible distance measure with which natural scenes can be ranked such that the ranking correlates highly with the human ranking of these scenes. The new distance measure is learned through optimization. Several methods for typicality ranking based on the semantic modeling are compared.

Chapter 10 summarizes this dissertation, draws conclusions, and gives an outlook on further research.

2 Related Work

In this thesis, we propose an image representation that makes it possible to semantically access local image regions and, based on those local semantic concepts, full images. Since this image representation can be used for semantic image retrieval, the work is situated in the context of content-based image retrieval. Furthermore, techniques developed in the fields of local semantic annotation and description of images, as well as of image classification, relate to our work. In Section 2.1, we will thus review relevant systems and techniques for the content-based access and the automated annotation of images, image regions, and scenes. Since the result of any image description, classification, or retrieval will always be presented to a human, it is imperative not to neglect the human way of perceiving natural scenes. For that reason, Section 2.2 gives a short overview of the most prominent work in the psychophysics of natural scene perception. A third area covered in this thesis is the performance evaluation of content-based image retrieval systems. Section 2.3 reviews some approaches and thoughts on the topic of performance evaluation of computer vision systems.

2.1 Semantic Scene Understanding in Content-Based Image Retrieval

Semantic scene understanding implies the ability to describe the content of an image in a “human way”, that is, usually by using natural language and keywords. Thus, scene understanding is closely connected with content-based image retrieval: if we were able to describe images using words, we could also retrieve images by these words, be it at the global image level or at the region level.


In the following, we will review the major steps in semantics-based image retrieval. These steps are not necessarily presented in chronological order because many research threads proceeded, and still proceed, in parallel. The attempt is rather to disentangle these threads and to highlight the different approaches to accessing semantic information in images. The terms image retrieval and access to image information are used here interchangeably.

While reviewing the state of the art, we identified seven related approaches to image understanding and retrieval that differ in methodology and objective; they are summarized in the following sections. The factors taken into account for their differentiation are:

Semantics vs. low-level features. This point addresses the question of how and to what extent semantic information is extracted from the images. Is semantic image information extracted, or is the method based solely on low-level features? In our opinion, the extraction of semantics is clearly the approach to be preferred. How is the semantic information obtained: through expert knowledge or through learning? In the case of expert knowledge, what is it based on, e.g. psychophysical findings? In the case of learning, does the data in fact exhibit the claimed automatically extracted semantics?

Local vs. global feature extraction. Here, the locality of the feature extraction process is concerned. Is the image information based on globally or locally extracted low-level features? In the case of a region-based approach, how are the local regions obtained? Is there any combination of local and global features? Local features make it possible to model local details, whereas global features, or the combination of local features, can be used to extract global image information.

Local vs. global image annotation. This point regards the goal of the application. Is the aim to attach one or multiple global labels to the images, to assign labels to image regions, or even to do both? How detailed are the labels? Is there any information propagation from global to local labels or vice versa?

Supervised vs. unsupervised learning. Finally, the learning method is concerned. Is the learning performed in a supervised or an unsupervised manner? How does that affect the resulting retrieval or annotation accuracy? In the supervised case, how is the training data obtained?

When evaluated based on these aspects, the ultimate goal of an image retrieval system would be to have a strong semantic component both at the region level and at the global image level. This semantic image representation could, for example, be reached through automatic annotation of both local image regions and full images. Features should be extracted from local image regions in order to represent as many local details as possible, but should be combined into a global image representation. In addition, the learning should be as unsupervised as possible without sacrificing too much retrieval accuracy.

2.1.1 General Image Retrieval

In the early years, image retrieval was in most cases based solely on global, low-level feature information such as color, texture, or structure (for overviews see [Rui et al., 1999] [Eakins and Graham, 1999] [Smeulders et al., 2000] [Veltkamp and Tanase, 2001]). The assumption is that images with similar color or texture distributions are also semantically close to each other (e.g. [Flickner et al., 1995]). So if the user presents a beach scene during sunset as query image, the retrieval system will return many sunset images even if the user is actually interested in the beach.

The next step towards more semantics is the introduction of region-based retrieval systems (e.g. [Ma and Manjunath, 1999] [Moghaddam et al., 2001]). Users select regions of interest in the query image, and these are matched to regions in the database images. The matching itself is again based on low-level features. This difference between the image representation of the user and that of the machine, commonly known as the semantic gap, is combated by many systems through relevance feedback methods [Zhang, 2001]. The idea is that the user selects positive and negative examples among the returned images, through which the retrieval process is narrowed down to what the user really wants. The downside is that the user often has to select a large number of images or that the retrieval does not converge to the image of interest.

Tieu and Viola present a novel approach to finding discriminant low-level features for modeling a particular user query [Tieu and Viola, 2004]. During the query, the user selects a small set of positive examples, from which more than 45,000 highly selective features are extracted. AdaBoost [Freund and Schapire, 1997] is used to learn a classification function that selects 20-50 features for distinguishing the images. This is a very elegant solution for reducing the feature set. However, semantics are defined only through the selection of the examples, leading to low precision in many cases. Also, the features only represent color and, to some extent, texture, while no spatial information is modeled.
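The relevance-feedback loop mentioned above can be illustrated with a Rocchio-style query update, a technique borrowed from text retrieval; this is a generic sketch with invented toy features, not a reconstruction of any of the systems cited here.

```python
import numpy as np

def rocchio_update(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query feature vector toward the user-marked relevant
    images and away from the irrelevant ones."""
    q = alpha * query
    if len(positives) > 0:
        q = q + beta * np.mean(positives, axis=0)
    if len(negatives) > 0:
        q = q - gamma * np.mean(negatives, axis=0)
    return q

# Toy 3-bin color histograms standing in for low-level image features.
query = np.array([0.5, 0.3, 0.2])
positives = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]])  # marked relevant
negatives = np.array([[0.1, 0.1, 0.8]])                   # marked irrelevant
print(rocchio_update(query, positives, negatives))
```

After each feedback round, the database would be re-ranked by the distance to the updated query vector.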

2.1.2 Scene Classification

Scene classification is a special case of global image annotation where only one label, namely the respective image class, is attached to an image. In most of the earlier work on scene or image classification, semantics are often found only in the definition of the scene classes, e.g. indoor vs. outdoor, or waterfalls vs. mountains. The classification itself is bottom-up and based on low-level image features.

Lipson et al. [Lipson et al., 1997] hand-crafted flexible templates that describe the relative color and spatial properties of image categories and showed that these templates can be used to classify natural scenes of the four classes fields, waterfalls, snowy mountains, and snowy mountains with lakes. In contrast to other approaches, [Lipson et al., 1997] explicitly models the spatial configuration of the selected scene classes. Going a step further, Maron and Ratan [Maron and Ratan, 1998] model images as bags of multiple instances (e.g. subregions) and learn scene templates from a small set of positive and negative examples. Using the probabilistic diverse density algorithm, which learns from ambiguous examples, the classification method is evaluated based on RGB color features for the classes waterfalls, mountains, and fields.


Szummer and Picard [Szummer and Picard, 1998] compare combinations of color and texture features computed on subregions for indoor-outdoor classification using a k-nearest neighbor classifier. In a more recent work, Feng et al. [Feng et al., 2003] investigate several image descriptors based on global color histograms, use PCA and SVMs for the classification of ten categories (lions, elephants, tigers, horses, sky scenes, cheetahs, eagles, night scenes, sunsets, roses), and report good classification rates. Vailaya et al. [Vailaya et al., 2001] classify images in a hierarchical manner employing binary Bayesian classifiers. Based on color and texture features, images are successively divided into indoor vs. outdoor, outdoor images into city vs. landscape, landscape images into mountains vs. non-mountains, etc.

Oliva et al. [Oliva et al., 1999] are among the first to bring a more semantic component into the field of scene classification by proposing to organize images along three semantic axes between three pairs of scene classes: artificial to natural, open to closed, and expanded to enclosed. This continuous organization between two categories is achieved by globally computing the images' power spectra and using Discriminant Analysis to maximize the distance between classes and to minimize the in-class dispersion. The images are sorted according to their distance to the classes. Serrano et al. employ semantic features, in particular sky and grass, in addition to low-level color and texture features in order to increase indoor-outdoor classification performance [Serrano et al., 2004]. The two feature types are combined using a Bayesian network. Unfortunately, the classification accuracies for the detection of sky and grass are not reported, but the overall indoor-outdoor classification rate increases by 2% when the semantic features are used in addition to the low-level features.

A general problem in image classification is the assumption that the chosen categories are mutually exclusive. In reality, most scene categories overlap semantically. For that reason, Boutell et al. [Boutell et al., 2004] propose a framework for handling natural scenes that can be described by multiple labels, e.g. a mountain image that also depicts a beach. The authors discuss several training methods and testing criteria for multi-label data and show the suitability of their approach experimentally.
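To make the hierarchical scheme of [Vailaya et al., 2001] concrete, here is a minimal sketch of such a decision cascade. The predicates and thresholds are invented placeholders standing in for trained binary Bayesian classifiers over color and texture features.

```python
def classify_hierarchically(image, is_outdoor, is_landscape, has_mountains):
    """Cascade of binary decisions: indoor/outdoor, then city/landscape,
    then mountains/non-mountains."""
    if not is_outdoor(image):
        return "indoor"
    if not is_landscape(image):
        return "outdoor/city"
    if has_mountains(image):
        return "outdoor/landscape/mountains"
    return "outdoor/landscape/non-mountains"

# Toy predicates operating on a dict of precomputed (invented) features.
img = {"sky_fraction": 0.4, "edge_straightness": 0.1, "ruggedness": 0.8}
print(classify_hierarchically(
    img,
    is_outdoor=lambda im: im["sky_fraction"] > 0.2,
    is_landscape=lambda im: im["edge_straightness"] < 0.3,
    has_mountains=lambda im: im["ruggedness"] > 0.5,
))  # -> outdoor/landscape/mountains
```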

2.1.3 Image Description and Annotation

In this section, systems for global and/or local image labeling are reviewed. Concerning the labeling of local image regions, only systems that infer the region labels from the global annotation are discussed here. In the last three or four years, several systems addressing global image annotation have been proposed. At first sight, this task may seem similar to scene classification. In contrast to image classification systems, however, the goal here is to obtain a more detailed description of images using multiple keywords that often relate to only a part of the image.


Oliva and Torralba [Oliva and Torralba, 2002] [Oliva and Torralba, 2001] developed a scene-centered representation of natural images that is holistic and free of segmentation. Through psychophysical experiments (see Section 2.2), they identified global scene descriptors (e.g. depth range, openness, ruggedness, expansion, busyness, naturalness) quantized into subgroups such as close-up, small space, large space, panorama (depth), or open, semi-open, semi-closed, closed, enclosed (openness). These spatial envelope properties are estimated by analyzing the images' localized energy spectra and solving a regression problem. The method requires annotated training data.

Barnard and Forsyth [Barnard and Forsyth, 2001] discuss a generative hierarchical clustering model based on [Hofmann and Puzicha, 1998] for attaching words to images. Images are represented as sets of associated, that is, globally annotated, words and sets of image regions. The model learns the joint statistics of words and image regions. The method makes it possible either to access the clusters found by analyzing the word-region co-occurrences or to attach words to an image given its image regions. In [Barnard et al., 2002], several modifications of the hierarchical clustering model are presented and compared. In addition, the approach is extended to region-based word prediction, where words are associated not with the full image but with the assumed corresponding image region. The work of Duygulu et al. [Duygulu et al., 2002] also deals with the problem of modeling the correspondence between global keywords and image regions. Instead of using a hierarchical model, the proposed method adapts a concept from statistical machine translation. Furthermore, the authors cluster indistinguishable words like cat and tiger using the KL divergence between the posterior probabilities of regions given words. Lavrenko et al. propose a continuous generative model, influenced by relevance models in information retrieval, for learning the correspondence between global annotations and images represented by image regions [Lavrenko et al., 2003]. Here, words are modeled by multinomial distributions and images by continuous-valued feature vectors. In all methods mentioned in this paragraph, regions are extracted with the normalized cuts algorithm [Shi and Malik, 1997].

In [Li and Wang, 2003], a complete category of 40 images, annotated with 5-10 words, is viewed as one concept, and 600 categories (= concepts) are used to train a dictionary of concepts. Images in one category are regarded as instances of a stochastic process that characterizes the category. For each category, a two-dimensional multiresolution hidden Markov model is trained. The extent of association between an image and the description of a category is measured by computing the likelihood of the image under the corresponding stochastic process.

In general, the approaches that learn the correspondence between global annotations and images or image regions have established a promising trend in image understanding. Nevertheless, it is often neglected that the global annotations are more general than a pure region naming, and hence a semantic correspondence between keywords and image regions does not necessarily exist. This is especially true for the correspondence between category labels and category members. In addition, the mean per-word precision and recall of the methods for automated region labeling lie only between 4% and 19% for precision and between 6% and 16% for recall. Feng et al. [Feng et al., 2004] show that the use of rectangular image regions instead of segmented image regions increases the precision of the continuous relevance model [Lavrenko et al., 2003] from 19% to 23% and the recall from 16% to 22%.
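The association-by-likelihood idea of [Li and Wang, 2003] can be sketched as follows. For brevity, diagonal Gaussians over toy two-dimensional region features stand in for the two-dimensional multiresolution hidden Markov models; all numbers are invented.

```python
import numpy as np

def region_log_likelihood(region, mean, var):
    """Log-likelihood of one region feature vector under a diagonal
    Gaussian category model (a stand-in for a 2-D multiresolution HMM)."""
    diff = region - mean
    return float(-0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var)))

# Two toy category models over 2-D region features.
models = {"mountains": (np.array([0.8, 0.2]), np.array([0.05, 0.05])),
          "beach":     (np.array([0.2, 0.9]), np.array([0.05, 0.05]))}

regions = np.array([[0.75, 0.25], [0.85, 0.15]])  # regions of a query image
scores = {cat: sum(region_log_likelihood(r, mean, var) for r in regions)
          for cat, (mean, var) in models.items()}
print(max(scores, key=scores.get))  # category whose process best explains the image
```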

2.1.4 Annotation of Image Regions

Another way to access image information is to automatically attach a set of manually selected, semantically meaningful labels to local image regions, which can then be searched for in a subsequent retrieval step. Picard and Minka describe a method that uses a multitude of texture models for the annotation of image regions [Picard and Minka, 1995]. Users label rectangular parts of an image semantically (e.g. leaves, building); the system selects the best-fitting model and propagates the label to visually similar regions. The FourEyes system [Minka and Picard, 1997] finds groupings within and across images in a similar way and learns weights on the groupings. A given semantic label is thus propagated within an image and across images. Kumar and Hebert find man-made structure, that is, houses, fences, etc., in natural images [Kumar and Hebert, 2003]. The approach uses a causal multiscale random field as a prior model on the class labels and models the differing local spatial dependencies of structured and non-structured data via a multiscale histogram over gradient orientations. In the system of Town and Sinclair [Town and Sinclair, 2000], neural networks are trained to classify previously segmented image regions into one of eleven semantic classes such as brick, cloud, fur, or sand. The image regions are represented by color and texture features, and images are retrieved using a visual thesaurus. Although all of the presented region-labeling approaches require annotated training data, the effort often pays off in high classification rates and thus high precision and recall.

In a set of psychophysical experiments described in Section 2.2, Mojsilovic et al. [Mojsilovic et al., 2004] obtain not only 20 semantic categories relevant for humans but also verbal descriptions of these categories. The verbal descriptions cover, for example, the presence of people or sky in the image, the color composition, the spatial organization, etc., and are mapped to computable image processing features (e.g. skin = yes/no, number of edges, number of regions, central object = yes/no, corners, straight lines). The authors also developed a query language based on the verbal descriptions. Unfortunately, only qualitative retrieval results are presented.
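A minimal sketch of this kind of supervised region labeling is given below: a nearest-neighbor classifier assigns a semantic label to a region based on its low-level feature vector. The two-dimensional features and the labels are invented for illustration; none of the cited systems is reproduced here.

```python
import numpy as np

def nearest_neighbor_label(region_feature, train_features, train_labels):
    """Label an image region with the semantic class of its nearest
    annotated training region in low-level feature space."""
    dists = np.linalg.norm(train_features - region_feature, axis=1)
    return train_labels[int(np.argmin(dists))]

# Toy training set: 2-D color/texture features of manually labeled regions.
train_x = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
train_y = ["sky", "sky", "grass", "grass"]

print(nearest_neighbor_label(np.array([0.85, 0.15]), train_x, train_y))  # sky
```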

2.1.5 Visual Dictionary Approaches

The previous sections covered approaches for either labeling images globally or labeling image regions. In this section, some approaches that build visual dictionaries for classification and retrieval are reviewed. The general idea of a visual dictionary or visual thesaurus is to collect, or even learn, a set of visual keywords, that is, a set of visually or semantically similar local image regions. These visual keywords are employed to represent, describe, and retrieve images. Concerning semantic image representation, the challenge lies in the construction of the visual thesaurus. On the one hand, a visual dictionary can be learned based on a given set of semantically meaningful concepts. This yields an image representation that is indeed semantic but restricts the approach to a subset of scenes due to the supervised learning stage. On the other hand, a visual thesaurus can be built automatically by grouping visually, but not necessarily semantically, similar image regions. This makes it possible to describe a wider set of images, but the semantic relevance of the visual keywords or of the retrieved images often remains unclear.


Picard is among the first to develop the general concept of a visual thesaurus by transferring the main ideas of a text thesaurus to the visual domain [Picard, 1995]. The FourEyes system [Minka and Picard, 1997] mentioned in Section 2.1.4 serves as an example of how such a visual thesaurus can be assembled. Lim [Lim, 2001] manually builds a vocabulary of visual keywords such as face, crowd, and building and proposes a query formulation method based on visual constraints; the accuracy of the visual keyword extraction is not reported. In [Lim and Jin, 2004], SVMs are trained on image regions of a small number of images belonging to seven semantic categories. The regions that lead to a high SVM output are clustered and form a visual dictionary. Unseen images are represented through histograms over this dictionary. Whether the visual dictionary in fact contains semantically meaningful clusters is not evaluated.

Zhang and Zhang [Zhang and Zhang, 2004] build a visual dictionary by learning a self-organizing map on the feature vectors of segmented image regions. Viewing the two-dimensional map plane as a binary image and performing an erosion operation, they obtain connected components, each of which represents a visual code word. The semantic content of those code words depends primarily on the quality of the region segmentation. Fauqueur and Boujemaa build a visual dictionary through clustering and grouping of segmented image regions [Fauqueur and Boujemaa, 2003]. Both the segmentation and the grouping step are based on color. Users can select positive and negative examples from the visual keywords in order to represent their “mental images”. It is not clear, though, how well humans can recognize the semantic content of the visual keywords, given their small size and the missing context.

Inspired by psychophysics, the computational part of the work of Walker Renninger and Malik is essentially a visual thesaurus approach [Walker Renninger and Malik, 2004]. Here, the claim is not to learn semantic visual keywords. Instead, the model learns universal textons from the training set and represents images through histograms of these textons. The method is used for scene identification and is further discussed in Section 2.2. Sivic and Zisserman [Sivic and Zisserman, 2003] also build a visual dictionary for their video object retrieval (for details see Section 2.1.7).
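The unsupervised flavor of visual-dictionary construction can be sketched with plain k-means over region features, followed by a histogram representation of each image. This generic sketch, with random toy features, does not reproduce any of the cited systems.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_dictionary(region_features, k, iters=20):
    """Cluster low-level region features into k visual code words (k-means)."""
    codebook = region_features[rng.choice(len(region_features), k, replace=False)]
    for _ in range(iters):
        # Assign every region to its nearest code word, then re-estimate.
        dists = np.linalg.norm(region_features[:, None] - codebook[None], axis=2)
        assign = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(assign == j):
                codebook[j] = region_features[assign == j].mean(axis=0)
    return codebook

def keyword_histogram(image_regions, codebook):
    """Represent an image as a normalized histogram over the code words."""
    dists = np.linalg.norm(image_regions[:, None] - codebook[None], axis=2)
    assign = np.argmin(dists, axis=1)
    return np.bincount(assign, minlength=len(codebook)) / len(assign)

all_regions = rng.random((200, 8))              # toy 8-D region features
codebook = build_dictionary(all_regions, k=16)
print(keyword_histogram(rng.random((100, 8)), codebook))
```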

2.1.6 Finding Objects in Images

The task of detecting specific objects in images and retrieving the images that contain those objects is certainly also a specialized branch of content-based image retrieval (e.g. [Hoiem et al., 2004]). Schmid [Schmid, 2004] proposed a method for weakly supervised learning of visual models using rotationally invariant features. By evaluating the spatial neighborhood relationships of these features and automatically selecting the most distinctive neighborhood descriptors, texture-like visual structure can be modeled without manual extraction of objects. The visual models can be employed for retrieving and localizing textured objects. Nevertheless, object-centered image retrieval is not the focus of this thesis. For an extensive review of object recognition methods, refer to [Leibe, 2004].

2.1.7 Video Retrieval

In recent years, attention has shifted more and more from content-based image retrieval to content-based video retrieval, video annotation, and video indexing (e.g. [CIVR, 2004]). The introduction of a video track in the TREC series [TREC, Video Track, 2001] [TREC, 2003] also supported this development. The main difference between image and video retrieval is that semantic information in videos is conveyed to a larger extent across frames than within a frame. This influences the definition of semantically “interesting” information and leads to a more object-centered analysis of moving pictures.

Early approaches to semantic video retrieval include [Naphade and Huang, 2001], who proposed a probabilistic framework for semantic video indexing. By defining probabilistic multimedia objects for capturing high-level semantics such as explosion, music, mountain, or beach, and a graphical network of these concepts, scene context is modeled by discovering inter- and intra-frame relations between the concepts. Adams et al. [Adams et al., 2003] combine audio and visual content analysis with textual information retrieval for semantic modeling of multimedia content. A more recent approach is the Video Google system by Sivic and Zisserman [Sivic and Zisserman, 2003]. The goal is to perform video object retrieval similarly to text retrieval by using weighted visual words based on viewpoint-invariant regions, a stop list composed of too frequent and too infrequent visual words, and, in addition, by attending to the spatial consistency of the visual words. The system is extended in [Sivic and Zisserman, 2004] and [Sivic et al., 2004] by modeling in particular the spatial configurations of the visual words and by finding object-level groupings based on appearance and motion. Within this framework, multiple independent appearances of objects can be marked and tracked throughout a full movie.
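The text-retrieval machinery behind such systems can be sketched as tf-idf weighting of visual-word histograms combined with a stop list. The thresholds and toy counts below are invented; the sketch is only in the spirit of [Sivic and Zisserman, 2003], not a reconstruction of it.

```python
import numpy as np

def tfidf_with_stoplist(word_counts, min_df=0.05, max_df=0.5):
    """Weight per-frame visual-word counts by tf-idf and zero out words
    that occur in too many or too few frames (the 'stop list')."""
    n_frames = word_counts.shape[0]
    df = np.count_nonzero(word_counts, axis=0) / n_frames   # document frequency
    keep = (df >= min_df) & (df <= max_df)                  # stop-list filter
    tf = word_counts / np.maximum(word_counts.sum(axis=1, keepdims=True), 1)
    idf = np.log(n_frames / np.maximum(df * n_frames, 1))
    return tf * idf * keep

# Toy data: visual-word counts for 6 frames over a 10-word vocabulary.
counts = np.random.default_rng(1).integers(0, 5, size=(6, 10))
print(tfidf_with_stoplist(counts))
```

Frames would then be ranked by, for example, the cosine similarity between their weighted vectors and the weighted query vector.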

2.2 Psychophysics of Natural Scene Perception

Another way to analyze the semantics of images is to find out how humans perceive natural scenes. Psychophysics, or cognitive psychology, is a branch of psychology dealing with quantitative relations between physical stimuli and their psychological effects. In the context of human vision, the stimuli are usually still images, and the research is concerned with the human perception of scenes or objects. Several questions concerning human vision provide insight into where computer vision might be guided: Which scene categories are relevant for humans? How do we describe them, and which scene attributes stand out? Which image features do humans possibly evaluate? The question of what humans fixate on in an image might also be informative.

Since the visual attributes of scenes are more difficult to subsume than those of objects, only little psychophysical research has been performed on scene categorization in comparison to object categorization (see [Murphy, 2002] [Lakoff, 1987]). Tversky and Hemenway were the first to build a taxonomy of environmental scene categories [Tversky and Hemenway, 1983]. Influenced by the work of Rosch [Rosch, 1978] on object categorization, the goal was to provide evidence for a basic or preferred level of categorization for scenes.

2.2. Psychophysics of Natural Scene Perception

natural scenes

15 forest/farmland mountain beach pathway

landscapes sunset/sunrise Images city shots

street/building long distance city shots towers Washington DC misc. face

Figure 2.1: From [Vailaya et al.,1998]: Hierarchy of categories

categorization for scenes. The authors selected indoors and outdoors as superordinate categories and found home, school, store, and restaurant, and park, city, beach, and mountains, respectively, as basic level of categorization. As a byproduct, they obtained perceived attributes for each category, such as {sun, bird, towel, water} for beach. In a simple experiment, Vailaya et al. [Vailaya et al., 1996] [Vailaya et al., 1998] asked eight human subjects to group 171 outdoor images into semantically meaningful clusters. The result was a hierarchically organized set of eleven categories: city shots (Washington DC, towers, long distance city shots, street/building), landscapes composed of sunset/sunrise and natural scenes (forest/farmland, mountain, beach, pathway), face, and misc. (see also Figure 2.1). Although the category names differ, the resulting categories of [Tversky and Hemenway, 1983] and [Vailaya et al., 1998] are similar. Rogowitz, Mojsilovic, and collaborators did a whole series of experiments in order to determine semantic categories of photographs and verbal descriptors for these categories [Rogowitz et al., 1997] [Mojsilovic and Rogowitz, 2001] [Mojsilovic et al., 2004]. Their first result on a database of 97 images was that humans categorize images along two main axes: man-made vs. natural and human vs. non-human. In a subsequent experiment with 196 images, they identified four major categories and 20 subcategories. In addition, for each semantic category, a set of descriptors that humans use to describe these categories was found. For example, “water”, “sky/clouds”, “snow” and “mountains” emerged as very important cues for the nature categories. Also, humans seem to be very sensitive to the presence of people in the image. Color composition and color features are important when comparing natural scenes, but are seldomly used in describing images with people or man-made environments. Straight lines, straight boundaries and sharp edges are characteristics of man-made images while images of natural scenes have rigid boundaries and a random distribution of edges.


Oliva and Torralba [Oliva and Torralba, 2001] pursued a slightly different approach by asking 17 subjects to split 81 pictures into groups that have a similar global aspect, a similar global structure, or similar elements, and not to use any criteria related to objects. The experiment resulted in five spatial envelope properties that relate to the global composition of an image: naturalness, openness, roughness, expansion, and ruggedness.

Research in human perception is also concerned with the questions of how humans perceive scenes in general, where they look, and which image information we most probably use. Henderson and Ferreira provide a comprehensive overview of the basic facts about the visual cognition of scene viewing and eye movements [Henderson and Ferreira, 2004]. Humans seem to be able to determine the gist of a scene, that is, the main scene category (e.g. city vs. landscape), very rapidly [Potter, 1976]. Concerning the reason for this rapid scene comprehension, several hypotheses exist, be it a diagnostic object [Friedman, 1979], important scene-level features [Biedermann, 1995], low-frequency spatial information [Schyns and Oliva, 1994], or color [Oliva and Schyns, 2000]. In addition, Buswell demonstrated in an early study that viewers do not fixate on empty, uniform, and uninformative scene regions; "interesting" scene regions often correspond to areas with high spatial-frequency content, high edge density, and high local contrast [Buswell, 1935] [Henderson, 2003]. This suggests that humans would fixate quickly on any man-made object in a scene, since man-made objects usually exhibit sharp edges and corners and thus high spatial frequencies.

Li et al. found through experiments that humans are able to categorize natural scenes in the near absence of attention [Li et al., 2002]. When posed as a single task, humans managed well to decide whether or not an animal is present in the test scene. Quite surprisingly, the categorization performance did not drop when attention was bound to a second, primary task. The same is true for the detection of vehicles, even when the distractor images contain animals and vice versa. In addition to these experiments, it would be interesting to test the influence of context on the recognition performance, that is, the detection of animals and vehicles in a natural vs. a metropolitan context.

Further research has addressed the question of which image features guide human scene recognition. Walker Renninger and Malik compared the recognition performance of human subjects on ten scene categories (beach, forest, mountain, city, farm, street, bathroom, bedroom, kitchen, living room) with that of a simple texture recognition model. They conclude from their results that a model based on histograms of clustered textons explains early scene identification [Walker Renninger and Malik, 2004]. McCotter et al. [McCotter et al., 2004] performed similar recognition experiments on eight scene categories (highway, street, town center/house, tall building, coast/beach, open landscape, forest, mountain). By analyzing the phase spectra of the scene categories, they found that category-specific diagnostic regions in the phase spectra lead to correct categorizations. The authors suggest that these diagnostic orientations and bandwidths contain the scene information with maximum in-class overlap and minimal between-class overlap.
Also, Oliva and Torralba [Oliva and Torralba, 2001] examined the energy spectra of the employed images and found that the average slope and the dominant orientations of the amplitude spectra correlate with the above-mentioned envelope properties.

In summary, much research in human perception has been done on the classification and perception of general scenes, that is, indoor and outdoor, with and without humans, with and without human influence, natural and artificial, but not for image sets composed only of nature scenes as in this thesis. The basic-level scene categories selected for the psychophysical experiments seem to converge to commonly used categories such as mountains or city. For the recognition of these scene categories, pure texture features seem to model human performance well [Walker Renninger and Malik, 2004] [McCotter et al., 2004]. Only Oliva and Torralba [Oliva and Torralba, 2001], by modeling semantic axes, and Mojsilovic et al. [Mojsilovic et al., 2004], by determining a descriptive vocabulary per category, aim for a semantic model of human perception.

2.3 Performance Evaluation in Computer Vision

Performance evaluation in computer vision addresses the efficiency or computational speed of the developed algorithms only to a small extent. Rather, the goal is to evaluate vision algorithms with respect to their accuracy in segmentation, detection, or retrieval, using precision, recall, and similar measures. It has long been recognized that performance evaluation is of utmost importance for computer vision [Haralick, 1985] [Price, 1985], and several dedicated workshops and discussions underline the interest in the topic (see for instance [Courtney and Thacker, 2001] [Haralick et al., 1998] [Clark and Courtney, 1999] [Bowyer and Phillips, 1998] [Christensen et al., 1996] [Förstner, 1996]). Nevertheless, performance evaluation of vision algorithms in general is a complex and challenging problem, and relatively little research has been done on the more general performance evaluation of vision algorithms, vision modules, or vision systems. Quite often, performance evaluation requires the generation of benchmark sets with hand-labeled or synthetically produced ground truth. Due to its importance, this effort has been carried out for several computer vision applications, such as tracking and surveillance with the PETS workshops started in 2000 [Ferryman and Crowley, 2004], document analysis with the TREC series carried out for the first time in 1992 [TREC, 2003], face recognition [Phillips et al., 2000], and image flow, vehicle detection, and symbol and shape recognition [Aksoy et al., 2000]. Only recently, as part of TREC 2001, a video track devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video was put together [TREC, Video Track, 2001]. Apart from the Benchathlon Network [The Benchathlon Network], which never gained wide acceptance, no benchmark sets for content-based image retrieval systems have established themselves in a wider community. Even so, performance evaluation is of prime relevance also for CBIR systems, as argued for example by Smith [Smith, 1998] and Müller et al. [Müller et al., 2001].

Chapter 3. Semantic Scene Modeling

The goal of this thesis is to develop and evaluate an image representation that provides a handle on a semantic description of natural scenes. We especially investigate nature scenes, where images with objects such as people, man-made structures, cars, etc. are explicitly excluded. The reason is that, on the one hand, humans quickly fixate on such objects [Henderson, 2003], thus making specialized object detection algorithms necessary, and, on the other hand, object detection is a field of its own (see e.g. [Leibe, 2004]) and is in our opinion complementary to scene understanding and description. The review of the relevant literature in the fields of content-based image retrieval, image understanding, scene classification, and human visual perception in the previous chapter revealed a set of requirements that are described in the following. The envisioned image representation shall be:

semantic: The reduction of the semantic gap between the image representation of the human and the image representation of the machine is of prime importance. We especially aim for an image representation that is more intuitive for the user. Retrieval systems often require the user to choose the settings of multiple parameters that are only meaningful to experienced users; this should be avoided.

descriptive: As mentioned in the introduction, image description is the most intuitive means of communication for humans. Therefore, the goal is a vocabulary-supported access to images that replaces the common query-by-example paradigm with a query-by-keyword paradigm. In addition, the use of a descriptive vocabulary allows for keyword-based relevance feedback.

region-based: Natural scenes contain a large amount of semantic detail that in our opinion can only be modeled by a region-based approach. This entails that the features are extracted from local image regions, and that the images are semantically annotated on region level to supply the descriptive vocabulary for querying.

segmentation-free: Image segmentation algorithms such as the mean-shift algorithm [Comaniciu and Meer, 2002] or the NCuts algorithm [Shi and Malik, 1997] still lead to uncontrollable over- and undersegmentation of semantically contiguous regions. For that reason, automatic image segmentation should be avoided and substituted by a regular subdivision of the images.

global from local: In addition to the local image description, we aim for a global image representation based on the local information. This global representation allows for a global, semantic comparison of natural scenes.

inspired by human perception: The result of any image retrieval or image description system will usually be presented to a human user. In our opinion, it is therefore important to support decisions on the system design with knowledge about the human perception of natural scenes. For that reason, we refer at many points throughout this thesis to human scene perception. Especially, the final retrieval results based on the proposed image representation should be substantiated through psychophysical experiments.

evaluated quantitatively: Finally, the proposed image representation has to be evaluated quantitatively, especially with respect to its semantic representativeness. On the one hand, this refers to the evaluation concerning human perception as mentioned before. On the other hand, the goal is to assess the semantic applicability, the robustness, the strengths, and the weaknesses of the image representation through clearly defined and quantifiable tasks.

The last requirement is closely connected with the question of whether to employ supervised or unsupervised learning methods. As already mentioned in Section 2.1, the drawback of unsupervised or semi-supervised methods is that the extraction of semantics can be incidental, as in the case of some visual dictionary approaches. Also, the annotation accuracies are undesirably low, as in the approaches modeling word-region co-occurrences. For these reasons, we employ supervised learning methods for the image modeling in order to obtain good modeling performance and in order to be able to evaluate the representation thoroughly. Certainly, the long-term goal is to extend the supervised approach through semi-supervised or unsupervised learning methods.

In the following section, we propose an image representation, the semantic modeling, that fulfills the above-mentioned requirements. The semantic modeling is tested and evaluated in two applications. The search for specific image content illustrates the descriptive properties of the semantic modeling. In addition, it shows that the image representation allows for assessing, that is, predicting and optimizing, the retrieval performance. Through the categorization of natural scenes, it is shown that the image representation is also suited for modeling the characteristic content of scene categories.

Figure 3.1: Image representation through semantic modeling (a database image is subdivided into a 10x10 grid, each region is annotated with one of the nine semantic concepts, and the annotations are summarized in the concept-occurrence vector, here 47.5% sky, 23.5% water, 20.0% rocks, 9.0% sand)

Section 3.2 gives an overview of the two applications. In Section 3.3, the databases used throughout this thesis are introduced.

3.1 Semantic Modeling

As argued above, we aim for a region-based, that is, local, semantic image description. For that reason, the image analysis proceeds in two stages. In the first stage, local image regions are classified by concept detectors into semantic concept classes. In order not to depend on an unsatisfying image segmentation, the local image regions are extracted on a regular grid of 10x10 regions. Influenced by the psychophysical studies of Mojsilovic and Rogowitz (e.g. [Mojsilovic et al., 2004], see also Section 2.2) and through the analysis of the semantic similarities and dissimilarities of the images used in this thesis, nine local semantic concepts $s_i$, $i = 1 \dots M$, $M = 9$, were determined as being discriminant for the desired retrieval tasks. These local semantic concepts are s = [sky, water, grass, trunks, foliage, field, rocks, flowers, sand]. With these nine semantic concepts, 99.5% of the database image regions can be annotated; that is, on average only half an image region per image cannot be assigned to one of the nine concept classes. Figure 3.1 depicts on the right an exemplary annotation of an image with the concepts sky, water, rocks, and sand.


Note that image regions that contain two semantic concepts in about equal amounts have been doubly annotated with both concepts. In Chapter 6, the concept annotation and classification is discussed in more detail.

In the second stage, the region-wise information of the concept detectors is combined into a global image representation. For each local semantic concept, its frequency of occurrence is determined. This allows a global statement about the amount of a particular concept present in the image, e.g. "This image contains 9% sand." In addition, the local image information is summarized in a semantics-based feature vector. The so-called concept-occurrence vector (COV) is essentially a normalized histogram of the concept occurrences in an image (see Figure 3.1). A strength of the image representation through COVs is that they can also be computed on several overlapping or non-overlapping image regions, thus increasing their descriptive content. The information about which concepts appear at the top or bottom of an image is important for the image representation, whereas right/left modeling is of minor interest because a mirrored image contains little additional semantic information. We obtain a semi-local, spatial image representation by computing and concatenating the COVs of $r = [1, 2, 3, 5, 10]$ horizontally layered image regions, resulting in a feature vector of length $N(r) = [9, 18, 27, 45, 90]$. When using $r = 2$ image areas (top-bottom), the COV of the image in Figure 3.1 is $COV = [44.5, 0, 0, 0, 0, 0, 5.5, 0, 0, 3, 23.5, 0, 0, 0, 0, 14.5, 0, 9]^T$.

The advantages of the semantic modeling are manifold. Only through the use of named concept classes in the first stage of the retrieval system can the semantic detail of nature images effectively be modeled and used for description. Furthermore, ground truth is necessary both for the training and for the evaluation of image understanding and image retrieval systems; the semantic content of the local image regions is far less complex than that of full images, making the acquisition of the required ground truth much easier. Since the local semantic concepts correspond to "real-world" concepts, one way of searching images is to include or exclude a particular concept, e.g. "At least x% of rocks" or "No images with water." This makes the system well suited for a pre-filtering stage: searching for concepts that describe the main parts of the desired image limits the search space for succeeding example-based retrievals. The analysis through the concept detectors is also capable of constraining the context of the depicted image scenes. Many content-based image retrieval systems incorporate a relevance feedback step in order to model the user query more accurately (e.g. [Zhang, 2001]). The vocabulary-based image description simplifies a potential relevance feedback: it enables the user to refine the query more verbally (e.g. "More grass, less sand").
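To make the construction concrete, the following is a minimal sketch of the COV computation, assuming the annotation grid is given as a 10x10 array of concept names; the function name is hypothetical, doubly annotated regions (0.5 weight per concept) are not handled, and the number of bands is assumed to divide the grid rows evenly:

```python
import numpy as np

CONCEPTS = ["sky", "water", "grass", "trunks", "foliage",
            "field", "rocks", "flowers", "sand"]

def concept_occurrence_vector(grid_labels, areas=2):
    """Concatenate the COVs of `areas` horizontal bands of the label grid.

    Each entry is the percentage of the whole image covered by a concept
    within one band, so the full vector sums to 100."""
    grid = np.asarray(grid_labels)   # shape (10, 10), one concept name per region
    rows = grid.shape[0] // areas    # even split; r = 3 would need uneven bands
    cov = []
    for b in range(areas):
        band = grid[b * rows:(b + 1) * rows]
        cov += [100.0 * np.sum(band == c) / grid.size for c in CONCEPTS]
    return np.array(cov)             # length 9 * areas
```

For the image of Figure 3.1 with areas=2, this would return the top-bottom vector given above, up to the half regions contributed by double annotations.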

3.2 Natural Scene Retrieval

The image representation based on semantic modeling enables several ways to access natural scenes. Images can be described on a medium semantic level: "The scene contains a large amount of rocks, sky, and some water." This description can be used to search for specific image content: "Find images with at least x% of water." The image description can also be employed for modeling semantic scene categories by learning which amounts of each concept are typically present in a given scene category. Scene categorization and the search for image content have been implemented in this thesis and are further explained in the following two sections.

3.2.1 Search for Image Content

The search for specific image content illustrates the descriptive properties of the proposed image representation. In the search for image content, queries are posed in the form "Search images with x% of concept y." The primary goal is to provide a more advanced query mode than query-by-example. An additional aim is to unburden the user from any complicated parameter settings. Since the query mode is closer to the way humans describe images, the semantic gap between the image understanding of the user and that of the computer decreases. This vocabulary-supported retrieval system is also well suited as a pre-filtering step for subsequent example-based image retrieval systems. A valid user query consists of the concept being searched for (e.g. sky) and a user interval $U = [U_{low}\%, U_{up}\%]$ (e.g. $U = [20\%, 40\%]$) that specifies the percentage of the image to be covered by the concept. The query mode search for image content is covered in Part I of this thesis. In Chapter 4, it is shown that the image representation in combination with this query mode allows the retrieval performance to be modeled statistically. Given the performance characteristics of the employed semantic concept detector and the user interval, closed-form expressions for the probability of precision and the probability of recall in the search for image content are derived. The approach is extended in Chapter 5, where several methods for optimizing the retrieval performance, that is, precision, recall, or both, are developed and compared. The experimental results show that, depending on the prior knowledge of concept detector and concept distribution, the precision of the retrieval can be increased by up to 60% and the recall by up to 25%.
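As a minimal illustration of this query mode (hypothetical function and data layout; the actual system operates on the output of the Stage I concept detectors), a query reduces to an interval test on the stored concept coverage of each image:

```python
def search_image_content(database, concept, U_low, U_up):
    """Return the ids of all images whose coverage of `concept`, in percent
    of the image area, lies inside the user interval [U_low, U_up].

    database: dict mapping image id -> dict of concept -> percent coverage
    """
    return [img for img, coverage in database.items()
            if U_low <= coverage.get(concept, 0.0) <= U_up]

# Example query: "Search images with 20-40% of sky."
# hits = search_image_content(db, "sky", 20, 40)
```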

3.2.2 Scene Categorization

Scene categorization is a special case of image retrieval where the query consists of the scene category being searched for. Since scenes, that is, full images, contain very complex semantic detail, scene categorization is an appropriate task for testing the semantic representativeness of the proposed image representation. Part II of this thesis covers the modeling, retrieval, and ranking of natural scene categories. A special emphasis has been placed on evaluating in which semantic situations the approaches succeed or fail. In Chapter 7, it is shown that the semantic content of scene categories can be modeled effectively using the concept-occurrence vectors. We extensively study the categorization and category retrieval performance of the proposed image representation by comparing a representative approach based on prototypes with a discriminative approach based on SVMs. Experimental results illustrate in addition that the semantic modeling is suited for ranking the database images based on their semantic typicality.

This idea is further extended in Chapter 8, where the human performance in categorizing and ranking the scenes of our database is evaluated. Finally, in Chapter 9, a perceptually plausible distance measure is developed. The distance measure in combination with the semantic modeling leads to an automatic typicality ranking of natural scenes that correlates highly with the human ranking of those scenes.

Figure 3.2: From [Tversky and Hemenway, 1983]: Taxonomy of environmental scenes (superordinate level: Indoors, Outdoors; basic level: Restaurant, Store, School, Home and Park, City, Beach, Mountains, each with two subordinate examples such as fastfood/fancy restaurant or lake/ocean beach)

Selection of Scene Categories

The selection of semantic scene categories for the categorization experiments has been strongly influenced by work in psychophysics. There, a hierarchical structure from very general to specific, e.g. animal, mammal, dog, German shepherd, has been suggested as a particularly important way of organizing concepts. The most important level of such a taxonomy of categories is the so-called basic level. It has been found through psychophysical experiments (especially [Rosch et al., 1976, Rosch, 1978]) that the basic level, a middle level of specificity, is the most natural, preferred level, for example when naming particular concepts. Also, basic-level categories are easier and faster to learn. There are a number of additional advantages [Murphy, 2002] that motivate the use of basic-level categories for categorization experiments. For that reason, we reviewed the relevant literature to define a set of basic-level scene categories for our categorization experiments (see also Section 2.2). Tversky and Hemenway were the first to construct a taxonomy of abstract categories such as environments or scenes [Tversky and Hemenway, 1983]. Up to that point, research had mainly covered more concrete entities, such as objects with an identifiable shape.

Through psychophysical experiments, they reported in their seminal work indoors and outdoors to be superordinate-level categories, with the outdoors category being composed of the basic-level categories city, park, beach, and mountains, and the indoors category being composed of restaurant, store, school, and home (Figure 3.2). The experiments of Rogowitz et al. [Rogowitz et al., 1997] revealed two main axes along which humans sort photographic images: human vs. non-human and natural vs. artificial. Note here the different impact of employing categories, which require a clear decision for one of the categories, and axes, which connect categories but allow a, here semantic, transition between two or more categories. These semantic axes were further extended in [Mojsilovic et al., 2004] and resulted in the 20 scene categories shown in Figure 3.3.

Figure 3.3: From [Mojsilovic et al., 2004]: Semantic scene categories (main branches Humans, Manmade, Natural Scenes, Natural Objects, and Texture, comprising 20 subcategories such as portraits, crowds of people, cityscapes, indoor scenes with objects, waterscapes, landscapes with mountains, green landscapes and greenery, plants/flowers/fruits/vegetables, animals and wildlife, and textures and patterns)

Since we decided not to employ any images with artificial objects or humans, we selected the non-human/natural coordinate as superordinate for our experiments. In addition, the natural basic-level categories of [Tversky and Hemenway, 1983] and the natural scene categories of [Mojsilovic et al., 2004] were combined and extended to coasts, rivers/lakes, forests, plains, mountains, and sky/clouds. The scene taxonomy employed in this thesis is illustrated in Figure 3.4.

Figure 3.4: Employed scene taxonomy (photographic images split into indoors and outdoors; outdoors into artificial and natural; natural into coasts, rivers/lakes, forests, plains, mountains, and sky/clouds)

3.3 Databases

In this thesis, several image databases were employed for computational and psychophysical experiments. In general, the goal was to use images and categories of high semantic diversity. Every scene category is characterized by a high degree of diversity and potential ambiguity, since category membership depends strongly on the subjective perception of the viewer. Thus, it is very important that the databases also include images that are ambiguous in this sense, rather than only unambiguous images. The employed databases are summarized in the following.

natural700: Database with 700 natural scenes from the Corel database, consisting of 144 coasts, 111 rivers/lakes, 103 forests, 131 plains, 179 mountains, and 34 sky/clouds images. Images are present both in landscape and in portrait format (size 720x480/480x720 pixels). Sample images for each category are shown in Figure 3.5. The three columns of images on the left correspond to typical examples of each category, illustrating the diversity of these categories. The right column of Figure 3.5 shows images which are far less typical but which are, arguably, still part of the respective category. Obviously, these examples are more difficult to classify and literally correspond to borderline cases.

natural660: Subset of the natural700 database without the sky/clouds category, with 660 images both in landscape and in portrait format (144 coasts, 110 rivers/lakes, 103 forests, 131 plains, 174 mountains).

Figure 3.5: natural700 database: examples for each category (coasts, rivers/lakes, forests, plains, mountains, sky/clouds). Three columns on the left: typical images. Rightmost column: less typical images.

PP250: Subset of the natural660 database used for the psychophysical experiments, with 250 images (50 coasts, 46 rivers/lakes, 50 forests, 53 plains, 51 mountains).

mixed1073: Mixed database of 1073 images including natural and artificial, indoor and outdoor scenes, used for the experiments in Chapters 4 and 5.

Part I: Performance Prediction and Optimization

Chapter 4. Performance Prediction

Performance evaluation of content-based image retrieval (CBIR) systems with respect to precision and recall allows different systems to be compared and their application-dependent behavior to be analyzed. Knowing the performance of different algorithms and systems also permits their combination and their integration into larger and more powerful CBIR systems. Some systems, for example, are better at coarsely limiting the search space and therefore might be used as the front end of a larger system; other systems might work well on small, preprocessed subsets of the database. In the following two chapters, we present methods for the performance prediction and performance optimization of the search for specific image content. Since the query mode enables users to describe the image they are looking for in a "human way" through a vocabulary of keywords, the retrieval has been named vocabulary-supported image retrieval. As mentioned in Chapter 3 and depicted in Figure 4.1, the retrieval proceeds in two stages, with the local semantic concept detectors located in Stage I of the system and the information processing and actual retrieval located in Stage II. In the current chapter, closed-form expressions for the prediction of precision and recall in our retrieval system are derived. The goal of performance prediction is to forecast the performance of the retrieval depending on certain parameters. Often, performance prediction of image retrieval systems is not feasible because no sufficient and consistent ground truth is available. Thanks to the semantic modeling and the resulting splitting of the retrieval process into two stages, ground truth becomes obtainable in our system. The reason is that the semantic content of image regions containing sky, water, rocks, etc. is less ambiguous than the semantic content of whole images, so that it becomes much easier to acquire a sufficiently large amount of semantically consistent annotations. In the following, the retrieval performance is evaluated using precision and recall, where precision is defined as the fraction of the retrieved images that are relevant, and recall as the fraction of the relevant images that are retrieved.

The performance of the vocabulary-supported retrieval depends on the performance of the concept detectors in Stage I of the system and on the local semantic concept distribution in the database. In the next section, the performance of the concept detectors in the first stage of the retrieval system is modeled. In Section 4.2, the probabilities of precision and recall are derived based on a probabilistic formulation. The expressions are discussed through experiments in Section 4.3.

Figure 4.1: Two-stage image retrieval system (input: user query "concept" with user interval $[U_{low}\%, U_{up}\%]$, the concept detectors' precision/recall, and the image database; Stage I: image analysis with the set of concept detectors; Stage II: search for image content producing the retrieval results)

4.1 Performance of the Concept Detectors

A valid user query consists of the concept being searched for and the user interval $U = [U_{low}\%, U_{up}\%]$ specifying the amount of the image covered by the desired concept. Since we divide the image into a grid of 10x10 image regions, $U_{low}$ and $U_{up}$ also correspond to the number of image regions covered by the desired concept. In Stage I of the retrieval system, a multitude of detectors for various concepts exists. In particular, there may be multiple concept detectors for one single concept, with different performance characteristics. The concept detectors make a binary decision on whether an image region contains the particular concept; if an image region is covered by sky, the region is said to be "positive" relative to sky. According to the user query, the appropriate concept detector is selected, the image regions are analyzed, and the classification results per image region are passed to Stage II. This analysis stage can be performed offline.

The performance characteristics of the concept detectors are modeled by the probability p of correctly detecting a positive image region (true positives) and the probability q of correctly detecting a negative image region (true negatives); see Table 4.1.

Table 4.1: Possible outcomes of the concept detection per image region

                        Detected                 Not Detected
  Concept present       True Positive: p         False Negative: 1 - p
  Concept not present   False Positive: 1 - q    True Negative: q

The concept detectors are usually trained offline. In Section 5.2, we will discuss the learning of the concept detectors in more detail. As will be argued in Chapter 5, the goal is to have multiple concept detectors with varying performance characteristics for each concept. During the optimization step, the best detector of this set of concept detectors will be selected. Multiple detectors per concept can be obtained by using one classifier with different confidence thresholds or by using different classifiers. In general, any classifier with known performance characteristics can be employed, such as the semantic classifiers of [Town and Sinclair, 2000] or the texture models for automatic annotation of [Picard and Minka, 1995].
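The following is a minimal sketch of this error model (hypothetical function; it mirrors Table 4.1 and the way the hand-labeled annotations are randomly falsified in the experiments of Section 4.3):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_detector(true_mask, p, q):
    """Apply the error model of Table 4.1 to a boolean 10x10 ground-truth
    mask: positives are kept with probability p (true positives), negatives
    fire erroneously with probability 1 - q (false positives)."""
    true_mask = np.asarray(true_mask, dtype=bool)
    u = rng.random(true_mask.shape)
    return np.where(true_mask, u < p, u < 1 - q)
```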

4.2 Mathematical Framework

In this section, closed-form expressions for the probabilities of retrieval precision and recall in our two-stage system are derived. The prerequisite for the performance prediction is that the performance of the employed concept detector is known in the form of the detector parameters p and q. For now, we assume the concept distribution $P(N_P)$ to be known; this assumption will be relaxed in Section 5.1.3. The derivations are based on the assumption that the concept detectors decide independently on each image region. Thus, the probability $p_{true}(k)$ that k positive image regions are correctly returned for an image that in fact has $N_P$ positive image regions is binomially distributed. Likewise, the probability $p_{false}(k)$ that k image regions out of the $(N - N_P)$ negative image regions are retrieved is binomially distributed. Here, N denotes the total number of image regions and $N_P$ the number of positive image regions in an image:

$$p_{true}(k) = \binom{N_P}{k} p^k (1-p)^{N_P-k} \quad (4.1)$$

$$p_{false}(k) = \binom{N-N_P}{k} (1-q)^k q^{N-N_P-k} \quad (4.2)$$

$p_{true}(k)$ is in fact the probability of the so-called true positives, that is, of correctly retrieved positive image regions, while $p_{false}(k)$ is the probability of false positives. False positives are image regions that do not contain the concept but are retrieved.


If a total of i positive image regions is to be retrieved, the true positives and the false positives together add up to the total number of detected image regions. Thus, the probability of retrieving i positive image regions, given that a particular image has in fact $N_P$ true positive image regions, is:

$$P(N_{retr} = i \mid N_P) = \sum_{j=0}^{i} p_{true}(i-j) \cdot p_{false}(j) \quad (4.3)$$

Similarly, if the interval $U = [U_{low}\%, U_{up}\%]$ of positive image regions is searched for, Eq. (4.3) has to be summed over this interval to obtain the probability $P(N_{retr} \in U \mid N_P)$:¹

$$P(N_{retr} \in U \mid N_P) = \sum_{i=U_{low}}^{U_{up}} \sum_{j=0}^{i} p_{true}(i-j) \cdot p_{false}(j) \quad (4.4)$$

$P_{retr}(U)$ is the probability of retrieving images (not image regions!) that satisfy the query U, relative to the image database. Thus, the probability $P(N_{retr} \in U \mid N_P)$ has to be weighted with the concept distribution $P(N_P)$ in order to obtain $P_{retr}(U)$:

$$P_{retr}(U) = \sum_{N_P=0}^{N} P(N_{retr} \in U \mid N_P) \cdot P(N_P) \quad (4.5)$$

For the computation of precision and recall, the probability of relevant images $P_{relevant}(U)$ and of true-positive images $P_{true\_pos}(U)$ is necessary. Both probabilities depend on the user interval U. $P_{true\_pos}(U)$ is the probability that the retrieved images are in fact relevant for the user interval U. For its computation, $P(N_{retr} \in U \mid N_P)$ is weighted only with the part of the concept distribution $P(N_P)$ that lies inside the user interval $U = [U_{low}\%, U_{up}\%]$:

$$P_{true\_pos}(U) = \sum_{N_P=U_{low}}^{U_{up}} P(N_{retr} \in U \mid N_P) \cdot P(N_P) \quad (4.6)$$

The probability that images satisfy the user query depends on the user interval U and the concept distribution:

$$P_{relevant}(U) = \sum_{N_P=U_{low}}^{U_{up}} P(N_P) \quad (4.7)$$

Finally, Eq. (4.5)-Eq. (4.7) lead to closed-form expressions for the probability of precision and the probability of recall:

$$P_{precision}(U) = \frac{P_{true\_pos}(U)}{P_{retr}(U)} \quad (4.8)$$

$$P_{recall}(U) = \frac{P_{true\_pos}(U)}{P_{relevant}(U)} \quad (4.9)$$

¹ Since N = 100 in our experiments, the percentages $U_{low}\%$ and $U_{up}\%$ can be treated as integers in the summations. Otherwise, a normalizing constant would be necessary.


Figure 4.2: Multiple retrievals for the query: “20-40% of sky.”, p = 0.90, q = 0.80

and hence

$$P_{precision}(U) = \frac{\sum_{N_P=U_{low}}^{U_{up}} P(N_{retr} \in U \mid N_P)\, P(N_P)}{\sum_{N_P=0}^{N} P(N_{retr} \in U \mid N_P)\, P(N_P)} \quad (4.10)$$

$$P_{recall}(U) = \frac{\sum_{N_P=U_{low}}^{U_{up}} P(N_{retr} \in U \mid N_P)\, P(N_P)}{\sum_{N_P=U_{low}}^{U_{up}} P(N_P)} \quad (4.11)$$

Thus, with Eq. (4.10) and Eq. (4.11), precision and recall of the retrieval can be predicted. The inputs to the performance prediction are a user interval U, the performance characteristics p and q of the employed concept detector, and an estimate of the concept distribution $P(N_P)$.
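A minimal numerical sketch of this prediction, assuming the concept distribution is available as an array P_NP of length N+1 (estimated, for instance, from the annotated database); the function names are hypothetical, and error handling (e.g. an empty retrieval set) is omitted:

```python
import numpy as np
from scipy.stats import binom

def p_retr_in_interval(p, q, lo, up, N=100):
    """P(N_retr in [lo, up] | N_P) for every N_P = 0..N (Eqs. 4.3/4.4)."""
    probs = np.zeros(N + 1)
    ks = np.arange(N + 1)
    for NP in range(N + 1):
        p_true = binom.pmf(ks, NP, p)           # Eq. (4.1)
        p_false = binom.pmf(ks, N - NP, 1 - q)  # Eq. (4.2)
        retr = np.convolve(p_true, p_false)     # Eq. (4.3) for all i
        probs[NP] = retr[lo:up + 1].sum()       # Eq. (4.4)
    return probs

def predict_precision_recall(p, q, P_NP, U_low, U_up, N=100):
    """Predicted probabilities of precision and recall (Eqs. 4.10, 4.11)."""
    w = p_retr_in_interval(p, q, U_low, U_up, N)
    retrieved = w @ P_NP                                   # Eq. (4.5)
    true_pos = w[U_low:U_up + 1] @ P_NP[U_low:U_up + 1]    # Eq. (4.6)
    relevant = P_NP[U_low:U_up + 1].sum()                  # Eq. (4.7)
    return true_pos / retrieved, true_pos / relevant       # Eqs. (4.8)/(4.9)
```

Given the sky concept distribution of the mixed1073 database, predict_precision_recall(0.90, 0.80, P_NP, 20, 40) should reproduce the predicted 22.7% precision and 54.5% recall reported for the sky-query in Section 4.3.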


Figure 4.3: Multiple retrievals for the query: “20-35% of water.”, p = q = 0.90

4.3 Experiments and Discussion

The expressions for precision and recall have been validated on the mixed1073 database. The images in the database were divided into a regular grid of 10x10 image regions, and the image regions were manually annotated with the concepts sky, water, grass, buildings, face, and cars. All simulations in the following are based on this ground truth: depending on the selected detector parameters p and q, the annotations are randomly falsified. Figures 4.2 and 4.3 show the results for two region-based queries: "Find images with 20-40% sky" with p = 0.90, q = 0.80 (Figure 4.2) and "Find images with 20-35% water" with p = q = 0.90 (Figure 4.3). The prediction for the sky-query is 22.7% precision and 54.5% recall, and the prediction for the water-query is 56.4% precision and 82.4% recall. The retrieval was performed 133 times for the sky-query and 152 times for the water-query. Each point in the two plots corresponds to the performance achieved in one of these iterations. The statistics in the lower left-hand corner summarize the mean and the standard deviation of the two sets of retrievals.


The means of both precision and recall are very close to the predicted values. The higher variance of the water-query results from the fact that fewer images containing water are in the database. We also observe a slightly higher variance of the predicted recall in both experiments. The reason is that, due to the smaller number of images, the estimation of $P_{relevant}(U)$ has a higher variance than the estimation of $P_{retr}(U)$ (compare Eq. (4.8) and Eq. (4.9)). The experiments thus show that the predicted values of precision and recall closely match those achieved in practice. Note that performance prediction goes a step further than performance evaluation, since it makes a statement about the expected success of the retrieval before it takes place, whereas performance evaluation based on benchmark sets only allows the retrieval to be analyzed afterwards. This means that performance prediction opens up additional application opportunities; for example, it can be decided beforehand whether certain queries will reach the precision or recall required by an application. Being able to predict the system performance raises the question of whether we can improve the performance of the system by changing the way of retrieving or by changing some parameter. This performance optimization is the topic of the next chapter.

Chapter 5. Performance Optimization

In our context, performance optimization refers to the increase of precision, recall, or both. In the following four sections, we introduce several methods for performance optimization in our two-stage retrieval system. As will be shown, it is possible to optimize precision and recall separately as well as jointly, depending on the user's request. Figure 5.1 shows a schematic view of how the performance optimization fits into our retrieval system. In comparison with Figure 4.1, the system has been extended by the query optimization. The query optimization includes the selection of the best concept detector in the first stage of the system and, in Stage II, the optimization of the internal system interval, which becomes an additional input to the retrieval system. Note that the user input, that is, the user query, does not change compared to Figure 4.1. Different methods exist to increase the retrieval performance at both stages of the retrieval system. In the next section, we introduce methods that concern only Stage II of the retrieval system. In Section 5.2, we discuss the optimization potential of the concept detectors in the first stage. In Section 5.3, it will be shown that a joint optimization of both stages is most beneficial. In addition, it is possible to improve the retrieval performance by compensating in a probabilistic manner for the expected errors of the concept detectors; this so-called query mapping is introduced in Section 5.4. All optimization methods are evaluated by experiments at the end of each section, and the features and prerequisites of the methods are compared in Section 5.5.

Figure 5.1: Two-stage retrieval system with query optimization (the query optimization comprises the detector selection for Stage I and the optimization of the internal system interval $[S_{low}\%, S_{up}\%]$ used by the image retrieval in Stage II; the user query and user interval remain unchanged)

5.1 Performance Optimization in Stage II

The derivation of the performance optimization for Stage II of the retrieval system requires the introduction of an internal parameter: the system interval $S = [S_{low}\%, S_{up}\%]$. Since the detectors' decisions are only correct with a certain probability (modeled by p and q), the retrieval performance will vary if the system is queried internally with an interval S that differs from the user interval $U = [U_{low}\%, U_{up}\%]$. Intuitively, if the probability is high that the detector makes a false positive decision, it is sensible to internally raise the lower limit of the user interval $U_{low}\%$ to $S_{low}\% = U_{low}\% + X\%$, $X > 0$, in order to retrieve more images that correctly lie inside the user interval. The following formalizes this intuition and determines a system interval $S^*$ for internal use, typically different from U, that optimizes the retrieval performance. Note that the evaluation of the relevant images $P_{relevant}(U)$ is always computed relative to the user interval. Firstly, Equations (4.4)-(4.11) have to be extended by the internal parameter S. From now on, the probability $P_{retr}$ to retrieve images depends only on the system interval S instead of the user interval U, because the actual retrieval of the images in the database is governed only by S:

$$P_{retr}(S) = \sum_{N_P=0}^{N} P(N_{retr} \in S \mid N_P) \cdot P(N_P) \quad (5.1)$$

where

$$P(N_{retr} \in S \mid N_P) = \sum_{i=S_{low}}^{S_{up}} \sum_{j=0}^{i} p_{true}(i-j) \cdot p_{false}(j) \quad (5.2)$$

The probability of true-positive images $P_{true\_pos}$ depends on both S and U. The retrieval is performed according to the system interval S, but is evaluated according to the user interval U. Although internally a different interval is used, the measure for the success of the retrieval is still the user interval. Thus, Eq. (4.6) becomes:

$$P_{true\_pos}(U, S) = \sum_{N_P=U_{low}}^{U_{up}} P(N_{retr} \in S \mid N_P) \cdot P(N_P) \quad (5.3)$$

Figure 5.2: Prediction of precision and recall for "10-30% sky" when varying the detectors' performance p and q and the system interval S ($P_{recall}$ vs. $P_{precision}$; curves for p/q settings from 100%/100% down to 80%/92%)

A similar reasoning holds for the probability of relevant images $P_{relevant}$. Here, only the user interval U decides whether an image is relevant for the retrieval. Thus, Eq. (4.7) is still valid:

$$P_{relevant}(U) = \sum_{N_P=U_{low}}^{U_{up}} P(N_P) \quad (5.4)$$

In summary, the probabilities of retrieval precision and recall become (compare Eq. (4.10) and Eq. (4.11)):

$$P_{precision}(U, S) = \frac{\sum_{N_P=U_{low}}^{U_{up}} P(N_{retr} \in S \mid N_P)\, P(N_P)}{\sum_{N_P=0}^{N} P(N_{retr} \in S \mid N_P)\, P(N_P)} \quad (5.5)$$

$$P_{recall}(U, S) = \frac{\sum_{N_P=U_{low}}^{U_{up}} P(N_{retr} \in S \mid N_P)\, P(N_P)}{\sum_{N_P=U_{low}}^{U_{up}} P(N_P)} \quad (5.6)$$

Figure 5.2 illustrates the influence of the system interval S. The tested query is "Find images with 10-30% sky". The five curves correspond to five different settings of p and q, as indicated in the legend of the figure. As before, the manual annotations were randomly falsified depending on p and q. Each curve point corresponds to a different system interval. From left to right, $S_{low}$ and $S_{up}$ are varied in the following way: $S = [S_{low}\%, S_{up}\%] \in \{[18\%, 22\%], [14\%, 26\%], [10\%, 30\%], [6\%, 34\%], [2\%, 38\%]\}$, while the user interval is $U = [10\%, 30\%]$ in all cases. As expected, the precision is very high when the system interval is narrow, whereas the recall is low. By increasing the width of the system interval, the recall can be increased at the cost of precision. The decrease in precision is much faster for smaller values of p and q. This behavior is due to the fact that the user interval covers only about 20% of the image; thus, the probability of detecting false positives is much higher than the probability of detecting false negatives. As a result, many of the retrieved images are not relevant and the precision drops.

Figure 5.3: Predicted search space for "20-40% of sky", p = 0.90, q = 0.80 (precision in % vs. recall in %; marked: the user query, maximum precision, joint optimization, and maximum recall solutions; gray lines indicate the 50% minimum performance)
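Extending the Chapter 4 prediction sketch to a separate system interval is then a small change: retrieval is governed by S while relevance is still judged by U. A hypothetical sketch reusing p_retr_in_interval from that example:

```python
def predict_with_system_interval(p, q, P_NP, U, S, N=100):
    """P_precision(U, S) and P_recall(U, S) as in Eqs. (5.5), (5.6):
    retrieval uses the system interval S = (S_low, S_up), while the
    evaluation is relative to the user interval U = (U_low, U_up)."""
    w = p_retr_in_interval(p, q, S[0], S[1], N)        # P(N_retr in S | N_P)
    retrieved = w @ P_NP                               # Eq. (5.1)
    true_pos = w[U[0]:U[1] + 1] @ P_NP[U[0]:U[1] + 1]  # Eq. (5.3)
    relevant = P_NP[U[0]:U[1] + 1].sum()               # Eq. (5.4)
    return true_pos / retrieved, true_pos / relevant
```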

5.1.1 Optimization Algorithm

Eq. (5.5) and Eq. (5.6) are closed-form expressions for precision and recall depending on the user interval and the system interval. This implies that the equations can be evaluated prior to retrieval, allowing the expected performance to be optimized before the retrieval takes place. Because the equations do not admit a closed-form expression for the system interval $S = [S_{low}\%, S_{up}\%]$ as a function of the user interval and the desired performance, we use a recursive algorithm to obtain the system interval that optimizes the retrieval performance. The algorithm allows an optimization constraint to be chosen: the user interval can be optimized for maximum recall, maximum precision, or a joint maximization of precision and recall. It is also possible to specify a minimum value for precision and recall. During joint optimization of precision and recall, the distance to the optimum, that is, the point with 100% precision and 100% recall, is minimized:

$$d_{min} = \min_S \,(1 - P_{precision}(U, S))^2 (1 - P_{recall}(U, S))^2 \quad (5.7)$$

where the system interval S is varied and the user interval stays constant. The algorithm proceeds in two steps. In the first step, a set of system intervals is generated that are most probably of interest to the user. Starting from the user interval $U = [U_{low}\%, U_{up}\%]$, precision and recall of that point and of its four neighbors $[U_{low}\% \pm 1\%, U_{up}\% \pm 1\%]$ are calculated and stored in a hash table. Recursively, those of the four neighbors that improve the current performance are used as new starting points, and the hash table is updated. Figure 5.3 depicts the complete search space, that is, the precision-recall pairs for all possible system intervals, given the user query "20-40% of sky" and the detector parameters p = 0.90, q = 0.80. Each point in the graph corresponds to a different system interval $S = [S_{low}\%, S_{up}\%]$. Note that two points that are close to each other in the plot do not necessarily have similar system intervals. In the second step, the algorithm selects the point in the search space that meets the user's constraints. In Figure 5.3, the two gray lines identify the desired minimum performance of 50%. The predicted performance of the user interval is marked by a black circle, while the possible solutions are marked by gray circles in the enlarged part of Figure 5.3. From left to right, these are: "Maximum Precision", "Joint Optimization (of precision and recall)", and "Maximum Recall".
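A simplified greedy variant of this search is sketched below (hypothetical code: the thesis algorithm recursively expands all improving neighbors and stores every candidate in a hash table, whereas this sketch follows only the best neighbor while optimizing the joint criterion of Eq. (5.7)):

```python
def joint_optimize(p, q, P_NP, U, N=100):
    """Greedy neighbor search for a system interval minimizing Eq. (5.7)."""
    cache = {}
    def cost(S):
        if S not in cache:
            prec, rec = predict_with_system_interval(p, q, P_NP, U, S, N)
            cache[S] = (1 - prec) ** 2 * (1 - rec) ** 2   # Eq. (5.7)
        return cache[S]
    best = (U[0], U[1])
    improved = True
    while improved:
        improved = False
        lo, up = best
        for S in [(lo - 1, up - 1), (lo - 1, up + 1),
                  (lo + 1, up - 1), (lo + 1, up + 1)]:
            if 0 <= S[0] <= S[1] <= N and cost(S) < cost(best):
                best, improved = S, True
    return best, cache   # optimized interval and the explored candidates
```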

5.1.2 Results: Stage II Performance Optimization

Figure 5.4 shows the retrieval results corresponding to the query "20-40% of sky" with p = 0.90, q = 0.80, where the user selected the joint optimization of precision and recall. Note the difference to most other retrieval systems: since the semantic concept sky is searched for, the retrieved images are visually very diverse but satisfy the user query perfectly. Only the first nine retrieved images are displayed. At the top of the display, some statistics are summarized: the precision was predicted to increase from 22.7% to 80.9%, and the actual retrieval resulted in a precision of 80.2%. The prediction for the optimized recall was 80.8%, which is 25% higher than the recall of the original user interval; in the actual retrieval, the recall reached 83.2%. Thus, for this particular query, the precision could be improved by about 60% and the recall by 25%. The bars visualize the relation between non-relevant (dark gray/red), relevant (medium gray/green, left of the non-relevant ones), and retrieved (light gray/yellow) images; the length of each bar corresponds to the number of images. From left to right, the light-gray bar of the retrieved images overlaps more with the medium-gray bar of the relevant images and less with the dark-gray bar of the non-relevant images, depicting the increase in precision and recall graphically.

Figure 5.5 visualizes some optimization results for the optimization constraint "maximize recall with precision > 50%". The queries are:

1. 10-30%, grass, p = q = 0.9
2. 30-50%, grass, p = q = 0.8
3. 10-30%, buildings, p = q = 0.9.


Figure 5.4: Retrieval results for "20-40% of sky", p = 0.90, q = 0.80.

The retrieval using the user interval is marked in black, whereas the optimized retrieval is marked in gray, with the arrows pointing from the non-optimized to the optimized case. The optimized system intervals are S = [17%, 58%] for query 1, S = [37%, 56%] for query 2, and S = [15%, 42%] for query 3. Figure 5.5 shows that the optimization constraints have been met. The precision increased significantly for all three queries; the recall can only increase if the stronger constraint "precision > 50%" is already met. In summary, the experiments on Stage II performance optimization have two main results. Firstly, depending on the user query, a gain of up to 60% in precision and up to 25% in recall can be reached. These results are obtained by probabilistic analysis and the resetting of an internal parameter, the system interval; the performance gain did not require the use of better classifiers. Secondly, the experiments show that the predicted performance is closely met by the true retrieval performance.


Figure 5.5: Retrieval with optimization constraint: "Maximize recall with precision > 50%" (precision in % vs. recall in % for the three queries, non-optimized and optimized)

5.1.3 Approximate Performance Optimization

Up to now, the assumption was that the concept distribution $P(N_P)$ is known; the results of the previous sections were obtained with the complete knowledge of the concept distribution used in Eq. (5.5) and Eq. (5.6). Figure 5.6a shows the concept distribution $P(N_P)$ of the concept sky. However, it is not realistic to have the entire distribution at hand. The dependence of the performance prediction and the performance optimization on this knowledge was therefore tested using two approximate distributions. In the first test, the actual distribution of the concepts was completely neglected. Instead, it was assumed that the number of image regions per image containing the particular concept, that is, the positive image regions, is uniformly distributed: $P_{uniform}(N_P) = \frac{1}{N+1}$, where N is the maximum number of positive image regions. The distribution is depicted in Figure 5.6b. In the second test, it was assumed that the a-priori probability of whether a particular concept is present in an image is available. That leads to a two-class frequency distribution: class A is the number of images that do not contain the concept at all, and class B is the number of images that contain one or more image regions of the desired concept. The uniform distribution of the previous paragraph has been weighted with this two-class distribution. That is, $P_{two\_class}(0)$ contains the information of class A, and the information of class B has been divided equally among $P_{two\_class}(N_P)$, $N_P = 1 \dots 100$ (see Figure 5.6c). In Table 5.1, the performance optimization based on the two approximate distributions is compared to the benchmark optimization results obtained with the complete distribution. The comparison includes the system interval that resulted from the different optimizations, the predicted performances based on the three distributions, and the actual retrieval that was carried out with the optimized system intervals.

The goal is to jointly maximize precision and recall. The exemplary queries are 10-30% sky with p = q = 0.90, 20-40% water with p = q = 0.85, 30-50% sky with p = q = 0.75, and 30-50% grass with p = q = 0.75.

Figure 5.6: Complete concept distribution $P(N_P)$ for sky and its approximations (a: complete concept distribution "sky"; b: uniform approximation of a; c: two-class approximation of a)

The main observations of the experiments are the following:

• The system intervals in the third column, which resulted from the performance optimizations based on either of the two approximate distributions, are always close to the reference result based on the full distribution; $S_{low}$ and $S_{up}$ differ by only 1% or 2% from the reference. Accordingly, the results of the actual retrieval are similarly close to the reference retrieval.

• The performance prediction based on the approximate distributions differs from the actual retrieval, especially when the detectors' performance specifiers p and q are small.

• The prediction of precision is more sensitive to approximations in the concept distribution than the prediction of recall. In part, the difference between prediction and actual retrieval exceeds 20%.

• Although the performance prediction based on the approximate distributions is not always correct, the results of the actual retrieval are close to the reference results. The reason is that the optimized system intervals are very close to the reference; the correctly estimated system intervals thus lead to a certain robustness with respect to the prediction.

• Over all 120 experiments, the optimized system interval, and thus the actual retrieval, are slightly better for the two-class than for the uniform distribution.

It can be concluded that the optimized system intervals are so close to the benchmark that the actual retrieval results nearly reach the reference values. This is the case even though the performance prediction based on the approximate concept distributions is worse than in the reference cases. Also, the difference between the reference values and the retrieval based on the approximate distributions is often smaller than the standard deviation of the retrieval. In the case that the true distribution is 'sparse', the two-class distribution produces better system intervals.


User Interval           Employed distribution    System       Prediction          Retrieval Mean
                        for Optimization         Interval     Precision  Recall   Precision  Recall
sky 10-30%, p=q=0.90    complete distribution    [18%,35%]    85.4%      88.9%    85.2%      88.9%
                        uniform distribution     [18%,34%]    85.5%      86.4%    86.9%      86.3%
                        two-class distribution   [18%,34%]    82.8%      86.4%    86.7%      86.6%
water 20-40%, p=q=0.85  complete distribution    [29%,44%]    78.6%      80.8%    78.7%      81.2%
                        uniform distribution     [28%,43%]    77.8%      84.6%    76.6%      83.6%
                        two-class distribution   [29%,43%]    79.6%      81.7%    79.8%      79.1%
sky 10-30%, p=q=0.75    complete distribution    [31%,47%]    62.9%      74.6%    62.9%      74.6%
                        uniform distribution     [28%,43%]    64.0%      81.0%    53.6%      81.3%
                        two-class distribution   [31%,46%]    49.4%      71.9%    64.1%      71.8%
grass 30-50%, p=q=0.75  complete distribution    [41%,53%]    50.0%      67.1%    51.2%      67.4%
                        uniform distribution     [39%,51%]    62.8%      77.7%    38.4%      73.4%
                        two-class distribution   [40%,51%]    63.1%      73.7%    45.1%      66.9%

Table 5.1: Uniform/two-class vs. complete distribution: Joint optimization of precision and recall

5.2 Performance Optimization in Stage I

The performance optimization can be extended to Stage I of the retrieval system. Since this stage corresponds to the concept detectors, the first part of this section covers the training of the concept detectors and the second part the optimization depending on these detectors. As stated before, the first stage of the retrieval system is composed of multiple detectors for various concepts. If several detectors per concept are available, the one that optimizes the overall system performance can be selected during performance optimization. Therefore, it is desirable to actually have multiple detectors per concept with varying performance. In this section, we present a method to train concept detectors that have varying performance characteristics depending on the concept and the feature set. For this task, we use AutoClass [Cheeseman et al., 1988], an unsupervised Bayesian classification system that includes the search for the optimal number of classes.

5.2.1 Training of the Concept Detectors

The training of the concept detectors is performed off-line. For this purpose, 4000 image regions hand-labeled with sky, water, grass and buildings are used. Naturally, the image regions of any concept class can be very diverse. For example, a sky-patch might comprise cloudy, rainy or sunny sky regions, as well as sky regions during sunset.

Figure 5.7: Precision_det vs. Recall_det of various detectors: (a) various sky detectors, (b) various grass detectors; curves for the feature sets col64, tex64, coltex128, col64+coltex128 and tex64+coltex128.

The image regions are represented by 4^3-bin RGB color histograms (col64), 4^3-bin histograms of third-order MR-SAR texture features (tex64) [Mao and Jain, 1992], and (2*4^3)-bin histograms (coltex128) that combine the 4^3-bin RGB color histogram and the 4^3-bin texture histogram. Depending on the feature set, AutoClass finds between 100 and 130 clusters in the data set of all 4000 image regions. In a supervised manner, it can be determined which of the concept classes are represented by which cluster. Each cluster contains multiple classes, resulting in different class probabilities for each cluster. Depending on the feature set, the highest class probability in each cluster ranges from 0.25 to 1. The availability of the class probabilities for each cluster provides us with three methods to obtain multiple classifiers. Firstly, in order to improve the precision of the concept detectors, only clusters with a class probability higher than a certain threshold are accepted. Obviously, this leads to a loss in recall. However, precision and recall of the concept detectors can thus be precisely controlled. Secondly, the classification using one feature set often performs much better for one class than for another. Thus, it is advantageous to use several feature sets. Thirdly, the classifications of two feature sets can be combined by means of the cluster precision: all cases are classified twice, and the vote of the cluster with the higher precision counts. The performance of various sky- and grass-detectors for different feature sets and feature combinations is shown in Figure 5.7. Here, "col64+coltex128" denotes the combination of the col64- and the coltex128-feature set according to the class probabilities per cluster. As expected, the feature sets and combinations perform differently for different classes. For the sky-detector, the color features are not discriminant, which is due to the fact that the sky class is very diverse in color. For the grass-detector, the texture features fail. This indicates that the employed texture features capture primarily small-scale structure and not the larger-scale structure that exists in grass image regions.
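The first and third methods can be made concrete with a small sketch. The following Python fragment illustrates, under assumed data structures (a predicted concept label plus the class probability of the winning cluster per feature set), how the cluster-precision threshold and the precision-based combination of two feature sets might look; none of these names come from the thesis implementation.

    def accept_cluster(label, cluster_precision, threshold):
        # Accept only clusters whose class probability exceeds a threshold;
        # rejected regions cost recall but raise detector precision.
        return label if cluster_precision >= threshold else None

    def combine_by_cluster_precision(label_a, prec_a, label_b, prec_b):
        # Classify each region with both feature sets and let the vote of
        # the cluster with the higher precision count.
        return label_a if prec_a >= prec_b else label_b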


The combination of two classifications as described above leads to an improvement in performance. In summary, the "tex64+coltex128"-detector performs best for sky image regions, whereas grass image regions are detected best with the "col64+coltex128"-detector.

5.2.2 Results: Stage I Performance Optimization

Using the envelopes of the curves of Figure 5.7, thirteen {Precision_det, Recall_det} pairs that correspond to different concept detectors (here: sky and grass) can be obtained. These are transformed into the corresponding p and q values. In order to identify the optimal concept detector for a given user interval, Eq. (4.10) and Eq. (4.11) are evaluated for each of these detector {Precision_det, Recall_det} pairs. Figure 5.8 and Table 5.2 summarize the result of the Stage I performance optimization. The diagrams show the influence of the detectors on the retrieval performance of a set of grass queries. The queries are: 10-30%, 20-40%, 40-60% and 50-90% of grass. Figure 5.8 shows the predicted retrieval performance for each of the four queries and for each of the thirteen grass-detectors. The points that belong to the same query but different detectors form an ellipsoidal curve. Here, the points in the lower left-hand corner correspond to the detector with the highest precision, whereas the points in the lower right-hand corner correspond to the detectors with a low detector recall. The circles mark the best retrieval performance for each query. The corresponding detectors' performance is listed in Table 5.2. Notice that the best detector is different for each query: for example, the query 10-30% grass is executed best with the {Precision_det = 94%, Recall_det = 72%} grass-detector and the query 50-90% grass with the {Precision_det = 88%, Recall_det = 83%} grass-detector. This supports the fact that the retrieval performance can be improved by providing multiple detectors for the same concept.
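One plausible way to carry out this transformation, assuming that p is the probability of detecting a truly positive region (so that p equals the detector recall) and q the probability of correctly rejecting a negative region, is standard confusion-matrix algebra over the labeled training regions. The helper below is an illustrative sketch consistent with the expected values in Eqs. (5.8) and (5.9), not the thesis code.

    def detector_to_pq(precision_det, recall_det, n_pos, n_neg):
        # Convert region-level detector precision/recall into the error-model
        # parameters p and q, using TP = p*n_pos and FP = (1-q)*n_neg.
        p = recall_det                                   # assumed: recall over positives
        tp = p * n_pos
        fp = tp * (1.0 - precision_det) / precision_det  # from precision = TP/(TP+FP)
        q = 1.0 - fp / n_neg
        return p, q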

5.3 Joint Two-Stage Performance Optimization

After having discussed the performance optimization in Stage I (Section 5.2) and the performance optimization in Stage II (Section 5.1), the goal of this section is to carry out the two-stage performance optimization. There are two methods to combine the optimization stages:

• Serial Combination: Depending on the optimization constraints, determine the best concept detector in Stage I, as done in the middle column of Table 5.2. With the performance characteristics of that detector, carry out the Stage II optimization in order to find the optimum system interval.

• Interleaved Combination: Carry out the Stage II optimization for all detectors that are available for the requested concept in Stage I. Depending on the results, select the optimum system interval S and the optimum concept detector for the retrieval (see the sketch below).
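The interleaved combination reduces to a loop over the available detectors. A minimal sketch, assuming a hypothetical helper optimize_stage2 that stands in for the Stage II optimization of Section 5.1 and detector objects that carry their error parameters p and q:

    def interleaved_optimization(detectors, user_interval, concept_distribution):
        # Run the Stage II optimization for every available detector and keep
        # the (detector, system interval) pair with the best predicted retrieval.
        best = None
        for det in detectors:
            # optimize_stage2 is a hypothetical stand-in for the Section 5.1 algorithm
            interval, precision, recall = optimize_stage2(
                det.p, det.q, user_interval, concept_distribution)
            score = precision + recall  # joint maximization of precision and recall
            if best is None or score > best[0]:
                best = (score, det, interval)
        _, best_detector, best_interval = best
        return best_detector, best_interval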


[Plot: "No optimization in Stage II"; predicted precision vs. recall for the queries 10-30%, 20-40%, 40-60% and 50-90% grass.]

Figure 5.8: Retrieval Optimization in Stage I: Predicted retrieval precision and recall with various grass-detectors.

Query grass   Best concept detector in Stage I    Overall best concept detector
              Precision_det     Recall_det        Precision_det     Recall_det
10-30%        94%               72%               98%               61%
20-40%        94%               72%               94%               72%
40-60%        88%               85%               94%               72%
50-90%        88%               83%               85%               89%

Table 5.2: Best detectors for various grass queries

Figure 5.9 corresponds to Figure 5.8 after the second-stage performance optimization has been carried out. The exemplary queries are the same as in Section 5.2.2. The optimization constraint is "joint optimization of precision and recall". The retrieval performance in Figure 5.9 has improved substantially compared to Figure 5.8. The circles mark the best overall retrieval performance for each query; they correspond to the retrieval performance of the interleaved combination of the optimization stages (see Table 5.3), and the concept detectors behind these best retrieval performances are listed in the right column of Table 5.2. For the serial combination of the optimization stages, the detectors in the middle column of Table 5.2 are used, and the Stage II optimization is carried out. The first observation is that for most queries the detector in the middle column of Table 5.2 is not the same as in the right column. Thus, the performance of the overall retrieval will also differ. The overall retrieval performance for the serial and the interleaved combination is analyzed in Table 5.3.


[Plot: "With optimization in Stage II"; predicted precision vs. recall for the queries 10-30%, 20-40%, 40-60% and 50-90% grass.]

Figure 5.9: Joint retrieval optimization in Stage I and II: Predicted retrieval precision and recall with various grass-detectors.

Query grass   Serial Combination        Interleaved Combination
              Precision    Recall       Precision    Recall
10-30%        86%          90%          88%          91%
20-40%        77%          79%          77%          79%
40-60%        70%          85%          71%          85%
50-90%        85%          87%          89%          89%

Table 5.3: Comparison of serial vs. interleaved combination of the optimization stages

As anticipated, Table 5.3 shows that in all cases the interleaved combination of the optimization stages results in better retrieval performance, with a 1-4% increase in precision and a 1-3% increase in recall. Obviously, the interleaved combination is computationally more demanding than the serial combination because, in the interleaved combination, the optimization algorithm in Stage II has to be evaluated for each detector available for a particular concept. If the application is time-critical, it might thus be advantageous to opt for the serial combination despite the lower performance gain.

5.4 Performance Optimization by Query Mapping

Up to now, the approach for performance optimization was to predict the system's performance depending on the user interval, the detectors' parameters and the concept distribution. Via the outcome of the prediction, a system interval S = [S_low%, S_up%] was obtained that, in the end, optimized the system's performance.

[Plot: precision vs. recall for the user queries 10-30% sky (p=q=0.90) and 20-40% water (p=q=0.85), showing the unoptimized user queries, the query-mapping results, and the full-optimization results.]

Figure 5.10: Optimization by query mapping: comparison to full optimization

In other words, the user interval was mapped to an internal system interval in order to compensate for the wrong decisions of the concept detectors depending on the concept distribution. However, the concept distribution is usually not fully available. Therefore, it needs to be estimated or approximated as shown in Section 5.1.3. Another approach is to compensate for the expected error of the concept detectors in a probabilistic sense. As will be shown in the remainder of this section, this still results in a large performance gain, although implicitly a uniform concept distribution is assumed. Generally, the decision of a concept detector on a particular image region is only correct with the probabilities p and q. For that reason, the decision on the complete image is also influenced by those two parameters. The influence of p on the decision per image is larger when the user is looking for images covered with the concept by more than 50%, and vice versa. In Section 4.2, the behavior of a concept detector was modeled by a binomial distribution. The expected value of a random variable that is binomially distributed with the parameters n and p is E(X_n) = pn. Consequently, the expected values for retrieving truly positive image regions, Eq. (4.1), and for retrieving falsely positive image regions, Eq. (4.2), are:

E\{X_{true,retrieved}\} = p N_P    (5.8)
E\{X_{false,retrieved}\} = (1 - q)(N - N_P)    (5.9)


and the expected amount of positive image regions (true positives plus false positives) that are retrieved out of N_P indeed positive ones is:

E\{X_{retrieved} \mid N_P\} = p N_P + (1 - q)(N - N_P)    (5.10)

Using Eq. (5.10), we obtain a valid mapping from a user interval U = [U_low, U_up] to a system interval S = [S_low, S_up]. If we assume that there are N_P = U_low concepts in an image, Eq. (5.10) returns the number of image regions that are expected to be retrieved if the detector performs with the parameters p and q. This expected value can be used as the new lower limit S_low of the system interval because it compensates for the errors of the concept detector. The new S_low takes into account that, on average and independently of the concept distribution, the detectors make wrong decisions. The reasoning for S_up is similar. The complete mapping equations are the following:

S_{low} = p U_{low} + (1 - q)(N - U_{low})    (5.11)
S_{up} = p U_{up} + (1 - q)(N - U_{up})    (5.12)
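As a minimal illustration, the mapping of Eqs. (5.11) and (5.12) is a two-line computation; the sketch below assumes N = 100 image regions per image and reproduces the example values from the text.

    def map_user_interval(u_low, u_up, p, q, n=100):
        # Query mapping (Eqs. 5.11/5.12): shift the user interval to a system
        # interval that compensates the expected detector errors.
        s_low = p * u_low + (1.0 - q) * (n - u_low)
        s_up = p * u_up + (1.0 - q) * (n - u_up)
        return s_low, s_up

    # 10-30% sky with p = q = 0.90 maps to [18%, 34%]:
    # map_user_interval(10, 30, 0.9, 0.9) -> (18.0, 34.0)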

Implicitly, Eq. (5.11) and Eq. (5.12) are based on the assumption that the concepts are uniformly distributed. Nonetheless, even with this strong assumption, the performance gain is immense (see Figure 5.10). For the exemplary query 10-30% sky and p = q = 0.90, the mapped system interval is S = [18%, 34%], and on average the precision is increased from 41% to 87% and the recall from 77% to 87%. The "optimal" system interval is S = [18%, 35%] and the one obtained with the uniform distribution is S = [18%, 34%]. This shows that the system intervals are very similar; it also confirms the implicit uniform-distribution assumption mentioned above. In Figure 5.10, the retrieval results of the optimization by query mapping and, as a reference, by using the full concept distribution are plotted. The query 20-40% water with p = q = 0.85 leads to a mapped system interval S = [29%, 43%] and to an average increase in precision from 27% to 80%. With the complete distribution the optimized query is S = [29%, 44%], and with the uniform distribution S = [28%, 43%]. The recall decreases in this case on average from 86% to 78%. This example demonstrates the limitations of the query-mapping approach. With the mapping of the user interval to a system interval, precision and recall can only be maximized jointly. This can also lead to a decrease of one of the values. With the algorithms that were presented in Section 5.1.1 and Section 5.1.3, precision and recall can also be optimized separately, which is in many situations more desirable.

5.5 Summary of Part I

In the last two chapters, we introduced several methods for the performance prediction and the performance optimization of our retrieval system. With the closed-form expressions for precision and recall, the performance of the system can be predicted before the retrieval takes place, and the internal system interval can be optimized according to a set of optimization constraints.

                          Maximization of Precision & Recall       Maximization of Recall
Employed distribution     Estimated         Prediction             Estimated         Prediction
                          System Interval                          System Interval
Complete distribution     ++                ++                     ++                ++
Two-class distribution    +                 Precision: -           +                 Precision: -
                                            Recall: +                                Recall: +
Uniform distribution      +                 Precision: -           -                 Precision: -
                                            Recall: +                                Recall: +
Query Mapping             +                 N.A.                   N.A.              N.A.

Table 5.4: Comparison of the methods for performance prediction and optimization

The four prediction and optimization methods are listed and compared in Table 5.4. The best results for both performance prediction and optimization are achieved if the complete concept distribution is available. In that case, the performance can be optimized for maximum precision as well as for maximum recall and for joint maximization of precision and recall. The predicted performance is always close to the actual one, and the determined system interval is indeed optimal. Note that the system interval is the performance measure for the quality of the optimization. Since the complete concept distribution may not be available, the two-class and the uniform distribution have been evaluated for the performance prediction and optimization. Here, again, the optimization is possible for all goals, that is, for maximum precision, maximum recall, or joint maximization of precision and recall. In nearly all cases the optimized system intervals are so close to the benchmark that the actual retrieval results are similar to the reference values. The performance prediction, however, is not as good as before because, in particular, the precision prediction degrades. The prediction is slightly more reliable for the two-class distribution than for the uniform distribution since more information is available. In the fourth method, the system interval is obtained through a mapping that depends solely on the detectors' performance values. For that reason, a performance prediction is not possible, which prevents, for example, the optimization for maximum recall. Even though absolutely no information about the concept distribution is used, the optimized system interval for the joint optimization of precision and recall is as good as with the uniform distribution. Being able to predict the retrieval performance opens up the possibility of combining our system with other retrieval systems. In particular, the vocabulary-based retrieval is suited as a pre-filtering system to reduce the retrieval search space.


For these kinds of applications, a high recall is desirable. The proposed performance optimization method, in combination with the "maximum recall" optimization constraint, ensures a high recall even for a required minimum precision.

Part II: Categorization and Retrieval of Natural Scenes

6 Concept Detectors

The second part of this thesis covers the categorization, retrieval and ranking of natural scenes represented by the semantic modeling. The current chapter discusses the concept detectors for the semantic classification of image regions. Chapter 7 introduces two approaches to scene categorization and thoroughly compares the performance of the proposed image and category representations. Chapters 8 and 9 analyze the typicality ranking of natural scenes obtained by humans and by the proposed semantic modeling. The current chapter proceeds as follows. The next section defines the scope of the concept detectors. Section 6.2 and Section 6.3 summarize the employed low-level features and the classification methods, respectively. The concept detectors are evaluated experimentally in Section 6.4. Section 6.5 discusses and summarizes the findings of this chapter.

Figure 6.1: Image segmentation


Figure 6.2: Semantic concept classes (sky, water, grass, trunks, foliage, field, rocks, flowers, sand)

6.1 Local Semantic Classification

The purpose of the concept detectors is the semantic classification of local image regions. The image regions are extracted from the images on a regular grid of 10x10 regions with a size of 48x72 or 72x48 pixels (see Figure 6.1). The images of the natural700 database, that is, 70'000 local image regions (700 images * 100 regions), have been fully annotated with the nine semantic concepts sky, water, grass, trunks, foliage, field, rocks, flowers and sand. Note that using these nine semantic concepts, more than 99.5% of the 70'000 image regions could be annotated consistently. On average, only half an image region per image could not be assigned to one of the nine classes, thus indirectly validating the choice of the local semantic concepts. The visual diversity of the resulting concept classes is illustrated in Figure 6.2: sky is clearly not always just blue, but also overcast or partly cloudy, and foliage includes leaves at many scales and in many seasonal color ranges.

Concept Class   % of regions   # image regions
sky             25.2%          15'296
water           12.0%          7'293
grass           5.8%           3'503
trunks          2.7%           1'625
foliage         22.5%          13'709
fields          6.9%           4'188
rocks           18.6%          11'310
flowers         3.4%           2'049
sand            2.9%           1'745
OVERALL         100%           60'718

Table 6.1: Sizes of each concept class

The figure illustrates that, without any context, some of the displayed image regions are very hard to classify even for a human. In order to be robust to "unclean" image regions that are due to the fixed grid segmentation, e.g. water regions with a tiny bit of sand, regions containing up to 25% of a different concept were accepted as training and testing data. Nevertheless, image regions that contain two semantic concepts in about equal amounts, such as region (4, 4) or region (5, 4) in Figure 6.1, which contain both sky and rocks, have been doubly annotated with both concepts. These regions are not used for training or testing of the concept detectors. As a result, 60'718 singly annotated image regions are available for training and testing of the concept detectors (see Table 6.1). However, for the categorization and retrieval experiments in the following chapters, all unseen image regions are classified with the trained concept detectors. The expectation is that doubly annotated image regions are assigned to either of the two classes, which is equally good. Since not all concepts are present in all images, the class sizes vary from 1'625 up to 15'296 image regions, thus posing quite a challenge for the training of the concept detectors. Sky appears in nearly every image, whereas e.g. sand is only present in certain coasts or plains scenes. Automatic segmentation and classification of the resulting image segments would be an alternative to classifying image regions of fixed size. Unfortunately, state-of-the-art segmentation algorithms such as normalized cuts [Shi and Malik, 1997] or mean shift segmentation [Comaniciu and Meer, 1999] still tend to uncontrollably over- or undersegment the scenes, making a succeeding classification very difficult. The use of fixed-size image regions thus reduces the amount of possible misclassifications because the regions are more likely to contain only one concept. The pixelwise segmentation and classification of Konishi and Yuille is a third approach to a semantic classification of image regions [Konishi and Yuille, 2000]. In their work, they use the empirical joint probability distribution of the filter responses at multiple scales using a set of color, texture, and edge filters.


The pixelwise information allows them to differentiate surprisingly well between the six classes edge, vegetation, air, road, building, and other. However, the in-class variation is by far lower than in Figure 6.2. Especially the well-classified classes air, road, and vegetation show visually hardly any variation. On the other hand, classes such as grass and foliage have been subsumed in vegetation. Since our subsequent categorization relies on such additional semantic detail, a pixelwise approach would most probably fail in our setting.

6.2 Features

The development of the concept detectors was not the main focus of this thesis. The goal was rather to evaluate the strength of an image representation based on local semantic concepts. Therefore, several standard low-level color and texture features have been tested and evaluated. The feature parameters, such as the number of histogram bins, have been determined in extensive pre-tests and are not discussed further. Where two or more histograms are concatenated, each histogram has first been normalized such that the feature types have about equal weights.

linHSIhist84: 84-bin linear HSI color histogram (hue: 36 bins, saturation: 32 bins, intensity: 16 bins).

edh72: 72-bin edge direction histogram.

glcm32: 24 features of the gray-level co-occurrence matrix (32 gray levels): contrast, energy, entropy, homogeneity, inverse difference moment and correlation for the displacements (1,0), (1,1), (0,1) and (-1,1) [Jain et al., 1995].

linRGBhist96: 96-bin linear RGB color histogram (each channel: 32 bins).

linHSIhist84_edh72: Concatenation of linHSIhist84 and edh72.

linRGBhist96_edh72: Concatenation of linRGBhist96 and edh72.

linHSIhist84_glcm32: Concatenation of linHSIhist84 and glcm32.

linHSIhist84_glcm32_edh72: Concatenation of linHSIhist84, glcm32 and edh72.
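The concatenation with per-histogram normalization can be sketched in a few lines of Python (an illustrative fragment, not the original implementation; the input histograms are assumed to be raw bin counts):

    import numpy as np

    def concat_normalized(*histograms):
        # Normalize each histogram to unit sum so that all feature types carry
        # about equal weight, then concatenate them into one feature vector.
        parts = [np.asarray(h, dtype=float) / np.sum(h) for h in histograms]
        return np.concatenate(parts)

    # e.g. the linHSIhist84_edh72 feature: concat_normalized(hsi84, edh72)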

6.3 Classification Methods

We compared two methods for the classification of the local semantic concepts: k-Nearest Neighbor (kNN) classifiers and Support Vector Machines (SVM). Although the amount of training data is large, the classifiers are challenged by the inequality in the class sizes and the visual similarity of image regions that belong to different classes.

6.3.1 k-Nearest Neighbor

The Nearest Neighbor method is a non-parametric classification method with the goal to maximize the posterior probability [Kroschel, 1996]

P(C_j \mid x) = \frac{f_{X|C_j}(x \mid C_j) \, P_j}{f_X(x)}    (6.1)

given N feature vectors x_i and M classes C_j. The probabilities are estimated by defining a volume element V and counting the data points N_j^{(V)} belonging to each class C_j in V. The probability estimates are thus

\hat{P}(C_j) = \frac{N_j}{N}    (6.2)

for the class prior, and for the density functions

\hat{f}_X(x) \big|_{x \in V} = \frac{N^{(V)}}{N V}    (6.3)
\hat{f}_{X|C_j}(x \mid C_j) \big|_{x \in V} = \frac{N_j^{(V)}}{N_j V}    (6.4)

leading to the following estimate of the posterior:

\hat{P}_{NN}(C_j \mid x) = \frac{N_j^{(V)}}{N^{(V)}}    (6.5)

k-Nearest Neighbor classification is a version of the Nearest Neighbor classification in which the size of the volume element V is selected such that it contains exactly k data points. Thus, the estimated posterior probability becomes

\hat{P}_{kNN}(C_j \mid x) = \frac{k_j}{k}    (6.6)

The data point x is assigned to the class C_j that maximizes the posterior in Eq. (6.6). In the context of this thesis, three distance measures have been compared for the kNN classification: the Euclidean distance, the Mahalanobis distance and the histogram intersection [Swain and Ballard, 1991].

Location Prior

Due to the extraction on a regular 10x10 grid, the location of the local image regions has a clear spatial structure. Thus, it might be rewarding to use this information for the classification. Sky is usually found at the top of an image, whereas grass is more likely to be at the bottom and foliage at the sides. This behavior is displayed in Figure 6.3. For each semantic concept, the plot shows the probability for the concept to appear at location z = (l_x, l_y). Note that the top of the image appears at the right in the plots.


Figure 6.3: Location priors per semantic concept

The color scale reaches from dark blue (low probability) via light blue and orange to dark red (high probability). When including the location prior for the K = 100 locations z = (l_x, l_y) in the kNN classification, Eq. (6.1) becomes dependent on z:

P(C_j \mid x, z) = \frac{f_{X|C_j}(x \mid C_j, z) \, P(C_j \mid z)}{f_{X|z}(x \mid z)}    (6.7)

where

\hat{P}(C_j \mid z) = \frac{N_j(z)}{N(z)}    (6.8)
\hat{f}_{X|z}(x \mid z) \big|_{x \in V} = \frac{N(z)^{(V)}}{N(z) V}    (6.9)
\hat{f}_{X|C_j,z}(x \mid C_j, z) \big|_{x \in V} = \frac{N_j(z)^{(V)}}{N_j(z) V}    (6.10)


This leads to changed expressions for the posterior probabilities in Equations 6.5 and 6.6:

\hat{P}_{NN,locprior}(C_j \mid x, z) = \frac{N_j(z)^{(V)}}{N(z)^{(V)}}    (6.11)
\hat{P}_{kNN,locprior}(C_j \mid x, z) = \frac{k_j(z)}{k(z)}    (6.12)

Thus, the decision is only based on the data points at a certain location z.
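To make Eq. (6.12) concrete, the following Python sketch restricts the neighbor search to training regions from the same grid location z; the data layout and helper names are assumptions for illustration, not the thesis code.

    from collections import Counter

    def hist_intersection_distance(a, b):
        # Histogram intersection turned into a distance (1 minus intersection).
        return 1.0 - sum(min(x, y) for x, y in zip(a, b))

    def knn_location_prior(x, z, train_by_location, k):
        # kNN with location prior (Eq. 6.12): only training regions extracted
        # at grid location z vote; train_by_location[z] is a list of
        # (feature_vector, concept_label) pairs.
        neighbors = sorted(train_by_location[z],
                           key=lambda item: hist_intersection_distance(x, item[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]  # class maximizing k_j(z)/k(z)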

6.3.2 Support Vector Machines

Support Vector Machines solve the following optimization problem [Vapnik, 1995], given a set of l feature-label pairs (x_i, y_i), x_i \in R^n, y_i \in \{1, -1\}:

\min_{w,b,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i    (6.13)

with the constraints

y_i (w^T \Phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0    (6.14)

The training data points x_i are mapped into a higher-dimensional space by \Phi, defined by the kernel function K(x_i, x_j) \equiv \Phi(x_i)^T \Phi(x_j). We selected the radial basis function (RBF) kernel for the experiments in this thesis:

K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0    (6.15)

The SVM finds a linear separating hyperplane with a maximal margin in the higher-dimensional space. C > 0 is a penalty parameter. Thus, two parameters, γ and C, have to be determined experimentally prior to the classification. For the experiments, the LIBSVM package [Chang and Lin, 2001] was employed. The package offers efficient multi-class support, internally using the one-against-one approach [Hsu and Lin, 2002]. That means M = 9 classes result in M(M-1)/2 = 36 single classifiers. A new data point is tested by each of the 36 classifiers, with the "winning" class obtaining a vote. The data point is allocated to the class that has the highest number of votes.
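The voting step itself is simple; a minimal sketch, with binary_svms as an assumed mapping from a class pair to a trained two-class decision function (not LIBSVM's actual API):

    from itertools import combinations

    def one_against_one_predict(x, binary_svms, classes):
        # One-against-one multi-class voting: each of the M(M-1)/2 binary SVMs
        # votes for one of its two classes; the class with the most votes wins.
        votes = {c: 0 for c in classes}
        for a, b in combinations(classes, 2):
            votes[binary_svms[(a, b)](x)] += 1
        return max(votes, key=votes.get)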

6.4 Experiments

All experiments have been performed with 10-fold cross-validation on the image level. That is, regions from the same image are either in the test or in the training set, but never in different sets. This is important since image regions of the same semantic concept tend to be more similar to other (for example, neighboring) regions in the same image than to regions in other images.
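Image-level folding amounts to assigning whole images, not individual regions, to folds. A small sketch, assuming each region carries the identifier of its source image (illustrative names only; the thesis procedure may differ in how images are distributed over the folds):

    def image_level_folds(region_image_ids, n_folds=10):
        # Group region indices into folds such that all regions of one image
        # end up in the same fold (cross-validation on image level).
        images = sorted(set(region_image_ids))
        fold_of_image = {img: i % n_folds for i, img in enumerate(images)}
        folds = [[] for _ in range(n_folds)]
        for region_idx, img in enumerate(region_image_ids):
            folds[fold_of_image[img]].append(region_idx)
        return folds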



Figure 6.4: Comparison of the kNN classification accuracies (solid: without location prior, dashed: with location prior)

6.4.1 Results of the kNN classification

The classification accuracies of the kNN classification using the features introduced in Section 6.2 are displayed in Figure 6.4. The first row of plots displays the results of the plain color or texture features and the second row the results of the combined color plus texture features. The parameter on the x-axis is the number of nearest neighbors k. In each plot, the solid line corresponds to the classification accuracy without the location prior and the dashed line to the classification accuracy with the location prior. From left to right, three different distance measures are employed: the Mahalanobis distance, the Euclidean or sum-squared distance (SSD), and the histogram intersection. The results show that the classification accuracy increases with k and that the combined color plus texture features perform better than the single features. It can also be seen that in most cases the histogram intersection outperforms the SSD, and that the SSD outperforms the Mahalanobis distance. The location prior causes a performance gain in all cases: the classification accuracy increases by 1.2% up to 4.3%. Table 6.2 displays the best classification accuracies for each feature with and without location prior.

Feature                        No location prior   Location prior
linHSIhist84                   61.0%               64.0%
linRGBhist96                   59.4%               62.3%
edh72                          54.4%               56.1%
glcm32                         53.2%               55.4%
linHSIhist84_edh72             67.1%               68.3%
linRGBhist96_edh72             64.9%               66.2%
linHSIhist84_glcm32            63.9%               65.8%
linHSIhist84_glcm32_edh72      66.5%               67.6%

Table 6.2: Maximum kNN classification accuracies with and without location prior (distance measure: histogram intersection)

The maximum classification rate without the location prior is 67.1%. It is reached using the linHSIhist84_edh72 feature with k = 29 and the histogram intersection as distance measure. The corresponding confusion matrix is displayed in Table 6.4. The highest classification rate including the location prior is 68.3%. This result is obtained by using the linHSIhist84_edh72 feature, k = 49, histogram intersection and the location prior. Table 6.5 shows the confusion matrix for this result. The comparison of Tables 6.4 and 6.5 shows that the location prior leads to an increase in classification accuracy of 1.2%. Looking more closely, the precision rises in all nine classes, but only the recall of the larger classes (besides sky) profits from the location prior. The classification rate of the smallest classes trunks, flowers, and sand decreases or at most remains constant. One reason might be that the amount of data for these classes is not sufficient to reliably estimate the location probability \hat{P}(z \mid C_j). We also experimented with larger neighborhoods, such as the statistics of the 4-neighbors or the 8-neighbors. But the use of these larger neighborhoods led to a decrease in classification accuracy, especially for the smaller classes. The reason is that this method rewards concepts in large contiguous regions and penalizes scattered concepts. Since the small classes often form clusters of at most four image regions, and only the already well-classified concepts such as sky or rocks appear in contiguous regions, this approach did not result in any accuracy increase.

6.4.2 Results of the SVM classification

The most promising features of the previous section were selected for the training of multi-class SVM classifiers. For each feature, an extended parameter search was performed prior to the classification: the cost parameter C was tested on a logarithmic scale from 2^1 to 2^8, and the RBF parameter γ was tested for values between 2^1 and 2^-5. The resulting classification accuracies for each feature and the corresponding parameter set are summarized in Table 6.3. The table shows that for all features the SVM classification performs better than the kNN classification, even when the latter employs the location prior.

Feature                        Classification accuracy   C     γ
linHSIhist84                   62.3%                     8     0.5
edh72                          56.2%                     2     2
glcm32                         57.1%                     128   0.5
linHSIhist84_edh72             69.4%                     8     0.25
linHSIhist84_glcm32_edh72      71.7%                     128   0.03125

Table 6.3: SVM classification accuracies and parameters

The highest SVM classification accuracy, 71.7%, is obtained with the linHSIhist84_glcm32_edh72 feature. Table 6.6 shows the corresponding confusion matrix. The comparison of Tables 6.5 and 6.6 shows that, besides the higher overall classification accuracy, the SVM classification leads to a significantly higher precision in all but one case (trunks). In addition, the recall is higher in all but two cases (water and foliage).

6.5 Discussion

The classification experiments result in a best kNN classifier (with location prior) with a 68.3% classification rate and a best SVM classifier with a 71.7% classification rate. The kNN classifier is based on the linHSIhist84_edh72 feature and will be referred to in the following as F1, whereas the SVM classifier is based on the linHSIhist84_glcm32_edh72 feature and will be referred to as F2. For comparison, both classifiers will be employed in the following, although the experiments have clearly shown that the SVM classifiers outperform the kNN classifiers even when the location prior is employed. For the subsequent categorization, it is rewarding to analyze the misclassifications on the concept level in more detail. Especially the fact that we deal with "semantic" data demands a closer look at the classification results. The following observations are valid for both the kNN and the SVM classifier. The most apparent behavior of the classifiers is the strong correlation between the class size and the classification result that is visible in all confusion matrices. Sky, foliage, and rocks are the largest classes and are classified with the highest class-wise accuracies. Sand, trunks, and flowers are the smallest classes and also have the smallest classification accuracies. In addition to the pure dependency on the class size, the classification confusions show that the members of smaller classes are often confused with the semantically most similar larger class. That is, grass and flowers are almost exclusively confused with foliage, but not vice versa. Similarly, field and sand are frequently confused with rocks, but also not vice versa. The fact that it is more difficult to classify small concept classes is also the main argument against using more semantic concepts. Tests with additional semantic concepts, such as snow for snowy mountains, or mountains for mountains in the far background that cannot be assigned to either rocks or foliage, did not result in higher classification rates.

Overall accuracy: 67.1%

                      Classifications in %
True class    sky    water  grass  trunks foliage field  rocks  flowers sand
sky           91.9   5.1    0.0    0.1    0.4     0.3    1.8    0.0     0.4
water         13.5   63.9   1.7    0.0    6.9     3.1    10.2   0.0     0.7
grass         0.2    6.3    29.3   1.3    50.8    6.9    4.2    0.9     0.3
trunks        0.4    0.8    3.0    27.1   44.1    5.1    18.7   0.9     0.0
foliage       0.6    1.3    2.1    0.9    85.4    1.2    7.3    1.3     0.0
field         1.0    8.2    5.5    1.6    20.4    32.9   28.1   1.7     0.7
rocks         3.0    5.1    0.6    1.3    24.9    6.2    58.1   0.2     0.6
flowers       0.8    0.6    1.8    0.2    56.2    2.8    4.4    32.8    0.4
sand          12.0   16.7   4.7    0.5    2.8     14.7   35.0   0.2     13.6
Precision     88.7   65.4   53.7   51.8   59.2    43.4   59.2   67.7    49.5

Table 6.4: Confusion matrix of the best kNN concept classification without location prior (linHSIhist84_edh72, k = 29, histogram intersection)

On the contrary, subsequent categorizations based on ten or eleven instead of nine semantic concepts achieved lower accuracy. On the other hand, a reduction of the number of semantic concepts by dropping small classes such as sand or trunks also resulted in a degraded categorization. These small, but not too small, classes provide the necessary information to discriminate between semantically similar categories. Obviously, the obtainable classification accuracies depend strongly on the consistency and accuracy of the manual annotations. Although the best care and attention were directed to that problem, annotation ambiguities are hardly avoidable. For example, the annotation of rocks and foliage is quite challenging. Imagine an image with rocky and forested hills in the far distance: is it rather rocks or foliage? For that reason, it is not surprising that rocks and foliage are confused in both directions. Another major confusion appears between trunks and foliage. This results mainly from the fact that each trunks region also contains a fair amount of leaves, whereas most foliage regions also include some branches or parts of trunks. In order to improve the classification rate, a semantic hierarchical classification approach was also tested. The idea was to subsume all those classes that are often confused and to train SVM classifiers for only three or four classes, resulting in higher accuracies. In a first step, the image regions were classified into the classes sky, water, plants and ground. In a second step, the plants regions were further split into foliage, trunks, flowers and grass, and the ground regions into sand, fields and rocks. This two-level hierarchy did not result in a substantial classification improvement and was for that reason not tested further. Another idea often proposed for boosting the classification accuracy is the use of different or "better" low-level features.


Overall accuracy: 68.3%

                      Classifications in %
True class    sky    water  grass  trunks foliage field  rocks  flowers sand
sky           91.4   5.5    0.0    0.0    0.4     0.2    2.3    0.0     0.2
water         8.9    68.2   2.2    0.0    6.1     3.9    9.9    0.0     0.8
grass         0.9    6.3    33.4   0.5    45.7    8.1    4.3    0.8     0.2
trunks        0.6    0.6    0.4    26.2   48.4    5.7    17.4   0.7     0.0
foliage       0.6    0.9    1.9    0.8    86.2    1.0    7.4    1.2     0.0
field         2.0    7.9    4.5    1.4    19.1    35.1   27.9   1.5     0.6
rocks         3.4    4.1    0.6    1.1    24.6    5.5    59.9   0.3     0.5
flowers       0.6    0.7    2.0    0.2    56.4    2.5    5.6    31.7    0.4
sand          6.4    19.9   5.8    0.3    2.1     15.5   35.9   0.3     13.8
Precision     90.7   67.2   58.3   56.7   60.2    45.0   59.5   68.0    54.8

Table 6.5: Confusion matrix of the best kNN concept classification with location prior (linHSIhist84_edh72, k = 49, histogram intersection)

Overall accuracy: 71.7%

                      Classifications in %
True class    sky    water  grass  trunks foliage field  rocks  flowers sand
sky           95.3   2.6    0.0    0.0    0.2     0.2    1.3    0.0     0.5
water         11.1   66.1   2.6    0.0    4.7     2.7    11.4   0.1     1.6
grass         0.3    6.1    43.1   0.7    37.5    4.9    5.7    1.2     0.5
trunks        0.6    0.6    0.6    27.5   38.8    3.6    27.8   0.4     0.2
foliage       0.4    1.5    2.4    1.5    81.1    1.0    10.9   1.1     0.0
field         0.6    6.9    6.2    1.3    17.0    37.7   27.6   0.3     2.3
rocks         3.1    5.1    0.3    0.9    13.6    4.2    71.5   0.4     0.9
flowers       0.7    1.2    2.7    1.5    43.9    4.6    2.7    42.3    0.5
sand          10.3   16.1   1.8    0.3    0.9     9.9    30.6   0.0     30.2
Precision     90.9   70.7   62.3   51.6   67.0    54.2   62.2   76.5    55.7

Table 6.6: Confusion matrix of the best SVM concept classification (linHSIhist84_glcm32_edh72, C = 128, γ = 0.03125)

Putting effort into features, feature selection, and feature combination methods would clearly improve the classification and also open the way to introducing more semantic concepts. On the other hand, a better classification accuracy would not change the main conclusions of the categorization and ranking tasks and was thus not the focus of our research. One could also try to model the confusions between the classes explicitly: how likely is it to have detected a grass region if the neighboring region is covered with foliage? Through the location prior, solely the location of a concept, but not its interrelation with other concepts, was modeled.


Figure 6.3 shows that, besides sky, no concept possesses a very discriminant location distribution. The location distributions of grass, field and sand are similar, as are those of trunks and foliage. On the other hand, the modeling of the concept interrelations would probably increase the classification rate in some cases. In other cases, the smaller classes might suffer in favor of the larger classes, as observed with the location prior. In addition, the amount of data needed for modeling the concept interrelations rises exponentially with the number of concepts. An unsupervised clustering of the image regions would lead to an entirely different approach. Here, clustering methods would be used to automatically detect visually similar concepts. The main disadvantage of such an approach is the absence of semantically named image regions and of any guarantee that the clusters indeed correspond to semantic concepts. We consider both points to be indispensable. Nevertheless, we performed some preliminary experiments with clustering the image regions given the class, for the classes grass, water, foliage, rocks, and mountains. The results were interesting because the resulting clusters had a semantic meaning, such as quiet water, white water, waves, etc. for water, various scales for foliage, or the degree of ruggedness for rocks. In general, an unsupervised approach is an option, but it would lead to entirely different concepts with no direct semantic interpretation due to the semantic diversity of the employed concept classes.

7 Scene Categorization and Retrieval

A scene category is defined as a number of scenes that are considered semantically similar. Scene categorization thus refers to the task of grouping images or scenes into a set of given categories. In this chapter, the goal is to automatically sort images into, or to retrieve images from, one of the six categories coasts, rivers/lakes, forests, plains, mountains or sky/clouds (see also Figures 3.4 and 3.5) based on the semantic modeling introduced in Chapter 3. Two inherently different categorization and retrieval approaches, a Prototype and a Support Vector Machine (SVM) approach, were tested. In addition, the semantic modeling was compared to direct feature extraction from the images. Figure 7.1 shows schematically the processing of the image information. Besides the two processing paths with and without semantic modeling, the categorization and retrieval accuracy of the semantic modeling with manually annotated and with automatically classified image regions was also compared. The reason for this additional comparison is that the results based on annotated image regions serve as a benchmark for the categorization and retrieval with semantic modeling: results based on annotated image regions are the best results that can be reached with the proposed semantic modeling. The representative approach based on category prototypes is described in Section 7.1.1 and the discriminative approach based on SVMs in Section 7.1.2. The extensive categorization experiments using annotated and classified image regions, with and without the semantic modeling step, are described in Section 7.2. By introducing an acceptance threshold, the categorization task can be reformulated as a retrieval task. Section 7.3 deals with the category retrieval experiments. Finally, in Section 7.4, the findings of this chapter are summarized and discussed in detail. The basis of all categorization and retrieval experiments is the semantic modeling introduced in Section 3.1.

[Figure 7.1 sketches the two processing paths: database images are either represented directly by a global feature vector (84-bin HSI color histogram + 72-bin edge-direction histogram, optionally + 24 GLCM features) or pass through the semantic modeling step, in which the feature vectors of the 10x10 grid regions are annotated or classified into the nine semantic concepts and summarized in a concept-occurrence vector; the Prototype or SVM approach then performs scene categorization and retrieval on the resulting category representation.]

Figure 7.1: Overview of scene categorization and retrieval

It provides the compact, semantic image representation through the concept-occurrence vectors (COVs). In this chapter, the efficiency and capability of this image representation are tested.
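As a reminder of what a COV is, the following sketch computes the frequency of occurrence of each of the nine concepts over the grid regions of one image area (a minimal illustration; the actual representation of Section 3.1 may differ in details such as the ordering of the concepts):

    import numpy as np

    CONCEPTS = ["sky", "water", "grass", "trunks", "foliage",
                "field", "rocks", "flowers", "sand"]

    def concept_occurrence_vector(region_labels):
        # Frequency of occurrence of each semantic concept over the
        # grid regions of one image area.
        counts = np.array([region_labels.count(c) for c in CONCEPTS], dtype=float)
        return counts / len(region_labels)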

7.1 Categorization Approaches

7.1.1 Representative Approach: Category Prototypes

Category prototypes have been discussed in detail in the psychophysics community [Murphy, 2002]. A category prototype is an example that is most typical of the respective category, even though the prototype does not necessarily have to be an existing category member. The prototype theory claims that humans represent categories by prototypes and judge the category membership of a new item by calculating its similarity to that prototype. Rosch and Mervis [Rosch and Mervis, 1975] propose that a category prototype is not a single best example for the category but rather a summary representation. This summary representation is a list of weighted attributes through which the category membership can be determined.


[Three panels plot the concept occurrence in % over the concepts sky, water, grass, trunks, foliage, fields, rocks, flowers and sand: prototypes of coasts (solid) and forests (dashed); prototypes of rivers/lakes (solid) and sky/clouds (dashed); prototypes of plains (solid) and mountains (dashed).]

Figure 7.2: Prototypes and standard deviations of the six scene categories

Thus, important attributes that might determine the category membership by themselves have a high weight. But also having many less important attributes with lower weights renders an item a category member. The image representation through concept-occurrence vectors is very close to the above-mentioned attribute list. Each image is described by the frequency of occurrence of a semantic concept such as sky, rocks, etc. It is thus straightforward to define the category prototype in our case as the average COV over all members of a category:

p^c = \frac{1}{N_c} \sum_{j=1}^{N_c} COV^c(j)    (7.1)

where c refers to one of the six categories and N_c to the number of images in that category. Figure 7.2 displays these category prototypes and the standard deviations for each category using one image area (for an explanation of the image areas, refer to Section 3.1). The figure reveals which semantic concepts are especially discriminant for which category. For example, forests are characterized by a large amount of foliage and trunks.


Figure 7.3: Categorization rates vs. image areas, based on annotated image regions (left) and without the semantic modeling step (right). The y-axis shows the categorization rate and the x-axis the number of employed image areas: 1 ↔ global image, 2 ↔ top/bottom, 3 ↔ top/middle/bottom, 5 ↔ top/upper middle/middle/lower middle/bottom and 10 ↔ ten equally sized rows.

In contrast, mountains can be differentiated when a large amount of rocks is detected. The standard deviations of the semantic concepts show the relative importance of the various concepts. For example, there might be images with only few rocks that still belong to the mountains category. The attributes of the prototype hold the information about how much of a certain concept is typically present in an image of a particular scene category. For example, a rivers/lakes image usually does not contain any sand. Therefore, the sand attribute of the rivers/lakes prototype is close to zero. In our system, the distance of an image to the prototypical representation is determined by computing the sum-squared distance (SSD) between the COV of the image and the prototype. This corresponds to an unweighted attribute list. In the next chapter, when addressing the typicality of images, we also discuss the introduction of attribute weights. Note that the lengths of the COVs and the prototypes depend on the number of used image areas and thus influence the absolute value of the distance. An image is assigned to the category to which it has the smallest distance.
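A minimal sketch of this representative approach, computing Eq. (7.1) and the SSD-based assignment (illustrative code, assuming the COVs are numpy arrays of equal length):

    import numpy as np

    def category_prototypes(covs, categories):
        # Eq. (7.1): the prototype of category c is the mean COV over its members.
        return {c: np.mean([v for v, cat in zip(covs, categories) if cat == c], axis=0)
                for c in set(categories)}

    def categorize(cov, prototypes):
        # Assign an image to the category whose prototype has the smallest
        # sum-squared distance (SSD) to the image's COV.
        return min(prototypes, key=lambda c: np.sum((cov - prototypes[c]) ** 2))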

7.1.2 Discriminative Approach: Multi-Class SVM

A discriminative, and thus very different, approach to scene categorization is the use of Support Vector Machines (SVMs). SVMs have been widely used in recent years and have also proved in related research to be capable tools for classification and categorization [Joachims, 2002, Wang and Zhang, 2001]. For the same reason as in Chapter 6, that is, the efficient multi-class implementation, we employ the LIBSVM package [Chang and Lin, 2001] for the SVM-based categorization experiments.



Figure 7.4: Categorization rates vs. image areas, based on classified image regions. Left plot: learned on annotated data. Right plot: learned on classified data. The y-axis shows the categorization rate and the x-axis the number of employed image areas: 1 ↔ global image, 2 ↔ top/bottom, 3 ↔ top/middle/bottom, 5 ↔ top/upper middle/middle/lower middle/bottom and 10 ↔ ten equally sized rows.

As mentioned before, LIBSVM implements a one-against-one multi-class scheme, which results in 15 two-class SVMs for the six scene categories. When determining the category of an unseen image, the COV of the image is tested by each of the 15 classifiers. Each "winning" category obtains a vote, and the image is assigned to the category with the largest number of votes.

7.2 Scene Categorization

The goals of the experiments were, first, to evaluate the Prototype vs. the SVM approach; second, to determine the number of image areas best suited for categorization; and third, to determine whether the semantic modeling step, that is, the use of concept-occurrence vectors, is in fact useful. The categorization performance is primarily evaluated via the overall categorization accuracy, but in later steps also via the categorization accuracy per category and the rank statistics. All experiments are 10-fold cross-validated. Parameters were selected such that the average performance, that is, over all 10 cross-validation rounds, is maximized. The images for each cross-validation round were selected randomly with the constraint that an approximately equal number of images of each category is present in each training set. Figures 7.3 and 7.4 and Tables 7.1 to 7.6 summarize the categorization results for all experiments. The following sections discuss these results in detail.

(a) Confusion Matrix (in %)
              coasts  rivers/lakes  forests  plains  mountains  sky/clouds
coasts        80.3    14.1          0.7      3.5     0.7        0.7
rivers/lakes  18.0    73.0          3.6      0.9     3.6        0.9
forests       0.0     1.9           95.1     1.9     1.0        0.0
plains        0.8     0.0           0.8      91.6    5.3        1.5
mountains     0.6     2.2           0.6      6.7     89.4       0.6
sky/clouds    0.0     0.0           0.0      5.9     0.0        94.1

(b) Rank Statistics (in %)
              1       2       3       4       5       6
coasts        80.3    97.1    99.3    99.3    100.0   100.0
rivers/lakes  73.0    95.5    96.4    99.1    100.0   100.0
forests       95.1    98.1    99.0    100.0   100.0   100.0
plains        91.6    98.5    98.5    100.0   100.0   100.0
mountains     89.4    98.3    98.9    100.0   100.0   100.0
sky/clouds    94.1    100.0   100.0   100.0   100.0   100.0
OVERALL       86.4    97.7    98.6    99.7    100.0   100.0

Table 7.1: Categorization based on annotated image regions - SVM Approach, 5 image areas

(a) Confusion Matrix (in %)
              coasts  rivers/lakes  forests  plains  mountains  sky/clouds
coasts        64.8    4.2           9.2      6.3     0.7        14.8
rivers/lakes  18.9    55.9          10.8     2.7     10.8       0.9
forests       0.0     0.0           96.1     2.9     0.0        1.0
plains        1.5     0.0           4.6      89.3    1.5        3.1
mountains     0.0     2.2           2.8      7.8     85.5       1.7
sky/clouds    0.0     0.0           0.0      0.0     0.0        100

(b) Rank Statistics (in %)
              1       2       3       4       5       6
coasts        64.8    90.8    99.3    100.0   100.0   100.0
rivers/lakes  55.5    88.3    98.2    100.0   100.0   100.0
forests       96.1    99.0    99.0    99.0    99.0    100.0
plains        89.3    97.7    98.5    100.0   100.0   100.0
mountains     85.5    93.9    96.6    97.8    100.0   100.0
sky/clouds    100.0   100.0   100.0   100.0   100.0   100.0
OVERALL       79.6    94.1    98.6    99.3    99.9    100.0

Table 7.2: Categorization based on annotated image regions - Prototype Approach, 10 image areas

7.2.1 Categorization based on annotated image regions

The experiments based on annotated image regions serve as a benchmark: assuming that the manual annotations are consistent, they reveal the best performance that can be expected from the semantic modeling. So, firstly, the categorization performance based on manually annotated image regions was tested. The result for 1, 2, 3, 5, and 10 image areas is depicted on the left side of Figure 7.3. The plot suggests quite clearly that the SVM approach performs better than the Prototype approach, and that an increase in the number of image areas leads to a slight improvement in the case of the SVM approach and a larger improvement in the case of the Prototype approach. Tables 7.1 and 7.2 display the confusion matrices of the best SVM categorization and the best Prototype categorization, respectively. The best SVM performance, with an 86.4% categorization rate, is clearly better than the best Prototype performance with a 79.6% categorization rate. The confusion matrices also show that especially the two categories coasts and rivers/lakes are frequently confused. In addition to the confusion matrices, the two tables display the rank statistics of the categorization. The rank statistics allow analyzing how close the first, second, third, etc. best candidates of the categorization are to each other. Being close to each other in the COV space means either that the image or the category representation is not representative or, as in our case, that the corresponding images are semantically hard to categorize. The large jump from 86.4% to 97.7% in Table 7.1(b), or from 79.6% to 94.1% in Table 7.2(b), suggests that part of the images are indeed quite close in the COV space. This is especially true for the coasts and the rivers/lakes categories. Visual inspection of the mis-categorized images shows that those images are hard to categorize even for human observers.


(a) Confusion Matrix (in %)
              coasts  rivers/lakes  forests  plains  mountains  sky/clouds
coasts        71.1    12.0          0.7      6.3     9.2        0.7
rivers/lakes  28.8    42.3          6.3      5.4     17.1       0.0
forests       1.0     2.9           89.3     3.9     2.9        0.0
plains        4.6     0.8           5.3      71.0    17.6       0.8
mountains     3.9     3.4           0.0      5.0     86.6       1.1
sky/clouds    8.8     0.0           0.0      0.0     0.0        91.2

(b) Rank Statistics (in %)
              1       2       3       4       5       6
coasts        71.1    87.3    96.5    99.3    100.0   100.0
rivers/lakes  42.3    82.0    93.7    98.2    99.1    100.0
forests       89.3    95.1    96.1    99.0    100.0   100.0
plains        71.0    87.8    97.7    100.0   100.0   100.0
mountains     86.6    95.5    98.9    98.9    100.0   100.0
sky/clouds    91.2    97.1    100.0   100.0   100.0   100.0
OVERALL       74.1    90.3    97.0    99.1    99.9    100.0

Table 7.3: Categorization based on classified image regions - SVM Approach, 3 image areas, SVM classification of linHSIhist84_glcm32_edh72-feature (a) Confusion Matrix coasts rivers/lakes forests plains mountains sky/clouds OVERALL

62.7 20.7 0.0 8.4 2.2 0.0

12.7 33.3 1.9 1.5 0.0 0.0

4.9 10.8 95.1 15.3 3.4 0.0

4.2 1.8 0.0 50.4 5.6 0.0

12.7 29.7 2.9 22.1 87.7 0.0

(b) Rank Statistics 2.8 3.6 0.0 2.3 1.1 100

1 62.7 33.3 95.1 50.4 87.7 100.0 68.7

2 88.7 78.4 97.1 67.2 92.7 100.0 85.9

3 99.3 97.3 97.1 84.7 96.1 100.0 95.1

4 100.0 99.1 98.1 99.2 98.9 100.0 99.1

5 100.0 100.0 99.0 100.0 100.0 100.0 99.9

6 100.0 100.0 100.0 100.0 100.0 100.0 100.0

Table 7.4: Categorization based on classified image regions - Prototype Approach, 10 image areas, SVM classification of linHSIhist84_glcm32_edh72-feature images shows that those images are even for human observers hard to categorize. We will come back to this observation in the conclusion of this chapter.

7.2.2 Categorization based on classified image regions

The next set of tests concerns the categorization of automatically classified image regions. The image regions were classified as described in Chapter 6 using the kNN classifier based on the linHSIhist84_edh72-feature (F1-features) and the SVM classifier based on the linHSIhist84_glcm32_edh72-feature (F2-features). The learning of the category prototypes and the SVMs can be done either on the annotated or on the classified data. Both possibilities are evaluated separately and displayed in Figure 7.4: on the left are the results where the parameters were learned on the annotated data, and on the right the results with the parameters learned on the classified data. In each group of bars, the first two correspond to the SVM approach, one based on F1-feature classifications and one based on F2-feature classifications, and the second two to the Prototype approach, again one based on F1-feature classifications and one based on F2-feature classifications. As mentioned in the previous chapter, the classifiers were trained and tested on the singly annotated image regions only. For the categorization experiments, all 100 image regions had to be classified. In the case of doubly annotated image regions, a classification into either of the two classes is acceptable. The experiments showed that about 65% of the 12.6% doubly annotated regions were classified into one of the two classes. This classification rate is in the same range as for the singly annotated regions and thus acceptable as input for the categorization.

(a) Confusion Matrix (rows: true category; columns: assigned category; values in %)

true \ assigned   coasts  rivers/lakes  forests  plains  mountains  sky/clouds
coasts              54.2      13.4         2.8    12.0     15.5        2.1
rivers/lakes        15.3      45.9         9.0     5.4     22.5        1.8
forests              1.0       4.9        81.6     2.9      9.7        0.0
plains              10.7       1.5         6.9    61.1     18.3        1.5
mountains            6.7       7.8         1.1     8.9     74.3        1.1
sky/clouds           8.8       2.9         0.0     0.0      0.0       88.2

(b) Rank Statistics (percentage of images whose true category is among the k best candidates)

k    coasts  rivers/lakes  forests  plains  mountains  sky/clouds  OVERALL
1     54.2      45.9        81.6     61.1     74.3       88.2       65.0
2     71.8      74.8        86.4     86.3     91.1       91.2       83.0
3     95.8      89.2        92.2     90.8     96.6       94.1       93.4
4     98.6      97.3        96.1     97.0     99.4      100.0       98.0
5    100.0     100.0       100.0     99.2    100.0      100.0       99.9
6    100.0     100.0       100.0    100.0    100.0      100.0      100.0

Table 7.5: Categorization without semantic modeling step - SVM Approach, 10 image areas, linHSIhist84_glcm32_edh72-feature

(a) Confusion Matrix (rows: true category; columns: assigned category; values in %)

true \ assigned   coasts  rivers/lakes  forests  plains  mountains  sky/clouds
coasts              51.4      12.7         2.1    15.5      7.0       11.3
rivers/lakes        17.1      41.4        10.8    12.6      8.1        9.9
forests              1.9       3.9        84.5     3.9      4.9        1.0
plains               9.9       4.6        13.7    48.9     22.1        0.8
mountains            8.4      14.0         4.5    22.3     46.4        4.5
sky/clouds          17.6       5.9         0.0     0.0      0.0       76.5

(b) Rank Statistics (percentage of images whose true category is among the k best candidates)

k    coasts  rivers/lakes  forests  plains  mountains  sky/clouds  OVERALL
1     51.4      41.1        84.5     48.9     46.4       76.5       54.1
2     70.4      73.0        89.3     78.6     73.7       88.2       76.9
3     87.3      89.2        92.2     93.1     91.6       91.2       90.7
4     94.4     100.0        99.0     96.2     99.4       94.1       97.6
5    100.0     100.0       100.0     99.2    100.0      100.0       99.9
6    100.0     100.0       100.0    100.0    100.0      100.0      100.0

Table 7.6: Categorization without semantic modeling step - Prototype Approach, 3 image areas, linHSIhist84_edh72-feature

As in the previous section, a higher number of image areas tends to result in a higher categorization rate, although 3, 5, and 10 image areas often perform very similarly. The SVM approach clearly outperforms the Prototype approach; the gain in categorization accuracy relative to the Prototype approach is up to 7%. Another interesting result is that the SVMs perform better when trained on the classified data, whereas the Prototype approach leads to better results when learned on the annotated data. This illustrates the difference between the discriminative SVM approach and the representative Prototype approach: the SVMs model the misclassifications imposed on the data and are thus able to better categorize the classified data, whereas the prototype representation of the categories is more robust to misclassifications when trained on the annotated data. The last observation is that the F2-features lead in all cases to better categorization results than the F1-features. This result is not very surprising considering that the classification rate based on the F2-features is 3.4% higher than that based on the F1-features (compare Tables 6.5 and 6.6). The highest categorization rate with the SVM approach is 74.1%; the corresponding confusion matrix and rank statistics are displayed in Table 7.3. This best categorization is achieved with 3 image areas and with image regions SVM-classified based on the linHSIhist84_glcm32_edh72-feature (F2-features). The best Prototype categorization results in 68.7% categorization accuracy using 10 image areas and also region classification with the F2-features. The confusion matrix and rank statistics are shown in Table 7.4.

For both approaches, the confusion matrices and the rank statistics show that, also based on classified image regions, the coasts and rivers/lakes categories are often confused, and that the performance jump between the first- and second-best candidates is among the highest for these two categories. In addition, images of the plains category are often confused with mountains.

7.2.3 Categorization without semantic modeling step

The last set of experiments was conducted in order to decide whether the semantic modeling step, that is, the data reduction from image features to concept-occurrence vectors, is in fact a wise step or whether it harms the achievable categorization accuracy. For that reason, the same two features that were best for the concept classification, the linHSIhist84_edh72-feature and the linHSIhist84_glcm32_edh72-feature, were extracted directly from the image. In the case of one image area, this leads to a feature vector of length 156 or 180, respectively, per image. In order to have a fair comparison, the feature vectors were also extracted on 3, 5, and 10 image areas and concatenated. The results of these experiments are depicted on the right of Figure 7.3. Besides the observations made before (the SVM approach is better than the Prototype approach, and more image areas are better than fewer), the diagram shows clearly that the categorization without the semantic modeling step performs worse than with the use of concept-occurrence vectors. Even the best result without the semantic modeling (65.0% with the SVM approach, 10 image areas) is below the worst result in the previous section (70.1%, 1 image area). Besides that, the length of the necessary feature vectors poses a problem for the training given the amount of our training data. The confusion matrices and the rank statistics of the best SVM-based and the best prototype-based categorization are printed in Tables 7.5 and 7.6, respectively. The confusion matrices show that, when not employing the semantic modeling, there is confusion between all classes, not only between semantically similar classes as before. The performance jump between the first and the second best is still large, but does not reach the same level as with the semantic modeling. These results show quite clearly that the semantic modeling step leads to a meaningful image representation. Although the dimensionality is reduced by a factor of 20, the categorization performance is better by nearly 10% (compare Tables 7.5 and 7.3).

7.3 Scene Retrieval

Scene categorization is most likely to be used as a pre-filtering step before a query-by-example retrieval process, narrowing down the search space and thus increasing the subsequent hit rate. It is therefore natural to reformulate the scene categorization problem as a scene retrieval problem. The retrieval query is here: "Find coasts-images." or "Find mountains-images.". Through this reformulation, precision and recall (see Chapter 4) of each of the six retrieval tasks become accessible and can be analyzed. In the case of the SVM approach, the minimum number of votes influences precision and recall of the retrieval, whereas in the case of the Prototype approach, precision and recall can be controlled by varying the accepted distance to the prototype. In the following, the best parameter settings of the previous section (marked red in the overview of Figure 7.8) have been used for the scene retrieval. For that reason, the number of employed image areas varies for each retrieval experiment. As before, the retrieval performance based on annotated image regions, based on classified image regions, and without the semantic modeling step are compared. In addition, we evaluate the performance of the SVM approach vs. the Prototype approach. The precision-recall graphs are displayed in Figures 7.5-7.7, and Tables 7.7 and 7.8 show the equal error rate (EER) performance of the retrieval experiments. The EER is the value at which precision and recall are equal.
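The EER values reported in Tables 7.7 and 7.8 can be read off a sampled precision-recall curve; a minimal sketch (ours, for illustration only):

import numpy as np

def equal_error_rate(precision, recall):
    """Equal error rate (EER) of one retrieval task: the operating point at
    which precision and recall are equal. This sketch simply scans a
    sampled precision-recall curve for the point closest to the diagonal."""
    precision, recall = np.asarray(precision), np.asarray(recall)
    i = np.argmin(np.abs(precision - recall))
    return 0.5 * (precision[i] + recall[i])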

Figure 7.5: Precision vs. recall based on annotated image regions (precision in % over recall in %, one curve per category: coasts, rivers/lakes, forests, plains, mountains, sky/clouds). a) SVM Approach, 5 image areas; b) Prototype Approach, 10 image areas

SVM approach                 coasts  rivers/lakes  forests  plains  mountains  sky/clouds
annotated image regions        78%       39%         93%      56%      90%        95%
classified image regions       64%       36%         86%      48%      79%        85%
no semantic modeling step      56%       32%         77%      48%      69%        73%

Table 7.7: Equal error rate performance for the SVM approach

7.3.1 Retrieval based on annotated image regions

In the first experiment, we compared the performance of the SVM approach vs. the Prototype approach based on annotated image regions. The results are depicted in Figure 7.5a) for the SVM approach and in Figure 7.5b) for the Prototype approach. The EER performance for these retrievals is summarized in Tables 7.7 and 7.8. When using annotated image regions as input, the SVM approach outperforms the Prototype approach in 5 of the 6 cases. Obviously, rivers/lakes and plains are the most difficult categories; rivers/lakes is the only category that is retrieved better by the Prototype approach. Both approaches have difficulties in modeling the plains category at small recall values. This is surprising because the categorization accuracy of this category is quite high at around 90%. All other categories are retrieved with good to very good accuracy by the SVM approach.

Figure 7.6: Precision vs. recall based on classified image regions (precision in % over recall in %, one curve per category: coasts, rivers/lakes, forests, plains, mountains, sky/clouds). a) SVM Approach, 3 image areas; b) Prototype Approach, 10 image areas

Prototype approach           coasts  rivers/lakes  forests  plains  mountains  sky/clouds
annotated image regions        60%       51%         77%      36%      78%        91%
classified image regions       58%       40%         83%      30%      70%        87%
no semantic modeling step      26%       33%         51%      11%      32%        10%

Table 7.8: Equal error rate performance for the Prototype approach

7.3.2 Retrieval based on classified image regions

In the next experiment, images with automatically classified local regions were considered (for the parameters see Figure 7.8). The retrieval results are depicted in Figure 7.6a) for the SVM approach and in Figure 7.6b) for the Prototype approach. Here, the SVM approach performs equally well as or better than the Prototype approach in 4 out of 6 cases (see Tables 7.7 and 7.8 for the EER performance). For those four categories, coasts, forests, mountains and sky/clouds, the average loss compared to the annotated image regions is 10%.

Figure 7.7: Precision vs. recall without semantic modeling step (precision in % over recall in %, one curve per category: coasts, rivers/lakes, forests, plains, mountains, sky/clouds). a) SVM Approach, 5 image areas; b) Prototype Approach, 3 image areas

7.3.3 Retrieval without semantic modeling step

Finally, in the last retrieval experiment, we also tested the retrieval performance when not employing the semantic modeling step, and instead extracting the features (linHSIhist84_glcm32_edh72) directly from the image. As in Section 7.2.3, the goal is to support the use of the concept-occurrence vectors by showing that the semantic modeling step indeed improves the retrieval performance. The retrieval results are depicted in Figure 7.7. Already a global visual comparison of Figure 7.7 with Figure 7.6 shows that the retrieval without the semantic modeling step performs worse than the retrieval with semantic modeling based on classified image regions. The Prototype approach fails completely. The reason might be the small number of training images (630 images in each cross-validation round) compared to the length of the feature vector (540 bins in the case of 3 image areas). The SVM approach performs better than the Prototype approach, but still worse than with the semantic modeling. The summary in Table 7.7 shows that all categories are retrieved better with the semantic modeling. For the four well-retrieved classes coasts, forests, mountains and sky/clouds, the semantic modeling increases the equal error rate performance on average by 11%.

Figure 7.8: Schematic overview of all tested categorization approaches (all categorizations 10-fold cross-validated; Prototype comparisons use the SSD). Features: F1 = linHSIhist84_edh72, F2 = linHSIhist84_glcm32_edh72. Region classification: 1) kNN on F1-features with location prior, k = 49: 68.3%; 2) SVM on F2-features: 71.7%. Categorization based on annotated image regions: SVM (r = 5) 86.4%, Prototypes (r = 10) 79.6%. Categorization based on classified image regions, learning on annotated data: SVM 1) r = 10: 67.7%, 2) r = 10: 72.6%; Prototypes 1) r = 10: 62.7%, 2) r = 10: 68.7%; learning on classified data: SVM 1) r = 5: 73.0%, 2) r = 3: 74.1%; Prototypes 1) r = 10: 62.7%, 2) r = 10: 67.0%. Without semantic modeling step: SVM F1 (r = 5) 64.7%, F2 (r = 10) 65.0%; Prototypes F1 (r = 3) 54.1%, F2 (r = 10) 41.9%.

7.4 Discussion

In this chapter, several experiments concerning the categorization and the retrieval of natural scenes have been carried out. Figure 7.8 summarizes the categorization results for the various experimental setups described in the previous sections. In Figure 7.9, the retrieval performance when employing the semantic modeling, based on annotated as well as on classified image regions, and the retrieval performance without the semantic modeling step are illustrated per category. In the following, we draw some conclusions from these results. One of the goals of the experiments was to evaluate the performance of the discriminative SVM approach vs. the representative Prototype approach. The experiments give a nearly unanimous answer to that question: the SVM approach outperforms the Prototype approach both in the categorization and in the retrieval experiments. In the categorization task, the performance increase over the Prototype approach is 6.8% using annotated image regions, 5.4% using classified image regions, and even 10.9% without the semantic modeling step (see Figure 7.8). The precision-recall graphs of the retrieval experiments in Section 7.3 also support the result that the SVM approach leads to better categorization and retrieval performance. The other question was: "Is the semantic modeling step doing any good with respect to the categorization and the retrieval performance?" The answer is yes: the semantic modeling approach leads on average to higher categorization rates and to better retrieval performance. The best categorization based on classified image regions (region classification rate = 71.7%) has an accuracy of 74.1%, whereas the best categorization without semantic modeling reaches only a 65.0% categorization rate (Figure 7.8). Figure 7.9 also shows clearly for all categories that the retrieval performance with the semantic modeling step is higher than without it. Thus, besides being a means to semantically describe natural scenes, and besides reducing the dimensionality by a factor of 20, the semantic modeling also achieves higher categorization and retrieval performance.

Figure 7.9: Comparison of SVM-based scene retrieval per category (precision in % over recall in % for coasts, rivers/lakes, forests, plains, mountains, sky/clouds): Semantic Modeling + annotated image regions (dashed), Semantic Modeling + classified image regions (solid), No Semantic Modeling (dash-dot)

Obviously, the categorization performance based on classified image regions is lower than the benchmark based on annotated image regions. A closer analysis of the confusion matrix in Table 7.3 and the confusion matrix of the concept classification in Table 6.6 shows that the categorization performance is strongly correlated with the performance of the concept classifier that is most discriminant for the particular categories. Three of the six categories have been categorized with high accuracy: forests, mountains and sky/clouds. The main reason is that sky, foliage and rocks have been classified with high accuracy, which leads to a good categorization of forests, mountains and sky/clouds. Critical for the categorization, especially of the category plains, is the classification of fields: since fields is frequently confused with either foliage or rocks, plains is sometimes mis-categorized as forests or mountains. Another semantic concept that is critical for the categorization is water.


If too much of the water is misclassified, rivers/lakes images are confused with forests or mountains, depending on the amount of foliage and rocks in the image. If too much water has incorrectly been detected in rivers/lakes images, they are confused with coasts.

If more semantic concepts were employed in the current system, the categorization performance would decrease. As mentioned in Chapter 6, the use of more semantic concepts would lead to even smaller classes and, due to the size of the classes, a low classification accuracy. A low classification accuracy would in turn lead to a lower categorization accuracy. We experimented with ten and eleven instead of nine local semantic concepts and did not observe a categorization improvement. However, decreasing the number to seven or eight semantic concepts did not improve categorization either. The smaller semantic concept classes are necessary for a final discrimination between visually similar categories, such as sand for the differentiation between rivers/lakes and coasts.

In another experiment, the confidence ratings of the concept detectors were used instead of the binary decisions per concept class for computing the concept-occurrence vectors. Surprisingly, this changed the categorization accuracies by only a few percent: for some scene categories and numbers of image areas, categorization was 1-2% better, and in as many cases it was 1-2% worse. For that reason, the following experiments are based on the binary decisions of the concept detectors.

First experiments on automatic clustering of the concept-occurrence vectors showed that scenes sharing a large amount of one particular concept are clustered together. In many cases this is indeed semantically meaningful, e.g. large amounts of foliage result in forests clusters, or large amounts of rocks in mountains clusters. However, in some cases, this main concept is the only semantic similarity that the scenes share, e.g. sand in coasts and in plains images. Here, it becomes indispensable to model the interrelations between the semantic concepts within one type of scene category in order to obtain fully semantically meaningful scene clusters.

In addition to the categorization rate, we also analyzed the rank statistics. With the SVM approach, an image is allocated to the category with the maximum number of votes, and with the Prototype approach, an image is allocated to the category with the smallest distance to the prototype. In Tables 7.1(b) - 7.6(b), the categorization rates are displayed when not only the best, but also the second-best, third-best, etc. candidate is accepted for the categorization. The result is surprising: when also accepting the second-best candidate, the overall categorization rate jumps on average by 16.8%. When analyzing the number of votes or the distances, respectively, of the second-best vs. the best category, it appears that these values are actually quite close. Does that mean that the images are also semantically very close to both the best and the second-best category? Figure 7.10 shows for each category exemplary images where the second-best is actually the "correct" category; the "correct" category is written in parentheses. One might argue about whether the person who annotated the images did a good job or not, but the goal is not to model the opinion of one single annotating person. In fact, the images show that there is sometimes no "correct" or "incorrect" answer when categorizing images. How much foliage turns a rivers/lakes image into a forests image, and vice versa? How far away must a mountain be for an image to move from the mountains category to the plains category? The conclusion we draw from these observations is that pure hard-decision categorization should not be the goal. Rather, some sort of typicality ranking, as advocated in psychophysics [Murphy, 2002], should be performed.

Figure 7.10: Examples of mis-categorized scenes in each category, "correct" category in parentheses (SVM approach on annotated image regions). The examples are labeled: rivers/lakes (coasts), plains (coasts), sky/clouds (coasts), coasts (rivers/lakes), forests (rivers/lakes), mountains (rivers/lakes), rivers/lakes (forests), forests (plains), mountains (plains), sky/clouds (plains), rivers/lakes (mountains), plains (mountains), plains (sky/clouds).

Figure 7.11: Normalized distance computation for scene ordering: the concept-occurrence vector of an image is projected onto the line connecting Prototype 1 and Prototype 2 (prototype distance d, projected distances d1 and d2).

7.4.1 Semantic Typicality Transitions between Categories

In order to study the ordering of scenes in more detail, we performed an additional experiment. Using the Prototype approach with annotated image regions, three pairs of prototypes were selected: rivers/lakes and forests, forests and mountains, and mountains and rivers/lakes. The sum-squared distance of the database images to the selected prototypes was computed, and those images were chosen that lie inside a constrained region between two prototypes, as visualized for the two-dimensional case in Figure 7.11. With the intention to obtain a normalized distance between two prototypes, the concept-occurrence vector of each selected image was projected onto the line connecting the prototypes. The normalized distance relative to prototype 1 is thus D_{prototype1} = d_1 / d. This normalized distance measure allows sorting the images "between" two prototypes. Figures 7.12 to 7.14 show the sorting result for the three prototype pairs. In each figure, the reference prototype is on the left, and the normalized distance D is displayed below the images. The figures illustrate that, with the semantic modeling and the concept-occurrence vectors, a semantic ordering of natural scenes can be obtained. From left to right in Figure 7.12, the concepts that are typical for rivers/lakes, mainly water, decrease, whereas the typical forest items, that is foliage, greenery, or trunks, increase. The same happens in Figure 7.13, where the scenes change from forests via forested mountains to rocky mountains. In Figure 7.14, the transition goes back from mountains to rivers/lakes. Note the difference in the transition with respect to Figure 7.12: in Figure 7.14, the intermediate scenes are waterscapes with a mountain in the background, whereas in Figure 7.12, the border of the lake or river is forest. This illustrates that we are indeed able to separate these scenes semantically.
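The projection underlying the normalized distance D can be sketched as follows (illustrative Python; names are ours):

import numpy as np

def normalized_transition_distance(cov, proto1, proto2):
    """Project a concept-occurrence vector onto the line connecting two
    category prototypes and return D = d1 / d relative to prototype 1
    (D = 0 at prototype 1, D = 1 at prototype 2), as used for the
    typicality transitions in Figures 7.12-7.14."""
    direction = proto2 - proto1
    return float(np.dot(cov - proto1, direction) / np.dot(direction, direction))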

Figure 7.12: Typicality transition from rivers/lakes to forests with normalized typicality values D = 0.06, 0.29, 0.34, 0.81, 0.95

Figure 7.13: Typicality transition from forests to mountains with normalized typicality values D = 0.05, 0.11, 0.48, 0.62, 0.87

Figure 7.14: Typicality transition from mountains to rivers/lakes with normalized typicality values D = 0.11, 0.40, 0.67, 0.77, 0.82

Semantic Typicality of Natural Scenes: Human Studies

The experiments in the previous chapter have shown that natural scene categorization is possible in general, and that it can be performed automatically with acceptable categorization accuracy. However, a closer analysis of the mis-categorized scenes suggests that not a hard-decision scene categorization should be the goal, but rather a soft-decision typicality ranking. Often, high categorization accuracies are the primary evaluation goal in scene categorization. But many natural scenes are in fact semantically ambiguous and lie literally on the borderline between two categories. The categorization accuracy only reflects the accuracy with respect to the subjective opinion of the particular person who performed the annotation; other people might decide differently. For that reason, we argue that the attention should rather be directed at modeling the semantic typicality of a scene with respect to a particular category. Section 7.4.1 showed promising results concerning the typicality transitions, between two categories, of images represented by concept-occurrence vectors. But since the ranking results will finally be presented to humans, it is essential to also analyze the human typicality perception of natural scenes. Typicality effects are also discussed in psychophysics. Murphy [Murphy, 2002] states that typicality differences are probably the strongest and most reliable effects in the categorization literature. Typical items are learned before atypical ones [Rosch et al., 1976], and category learning is faster when taught on typical items than on atypical items [Mervis and Pani, 1980]. Also in our categories, there are more and less typical items. Figure 3.5 displays on the left-hand side three columns of images that, in our opinion, are more typical for the respective category than the images on the right-hand side.


In order to study the human perception of the scenes employed in this dissertation, we conducted two psychophysical experiments¹. Some of the questions to be answered are: "How well can the images be sorted into our scene categories?", "To which extent are our scenes ambiguous, and how does that influence the human categorization consistency?", "Do measurable typicality differences between the categories exist?", "How consistent are these typicality rankings across the participants?", etc. The first experiment, described in Section 8.1, is a categorization experiment, whereas in the second experiment, described in Section 8.2, the perceived typicality of scenes was tested. For the experiments, we randomly selected about 50 images of each of the five scene categories coasts, rivers/lakes, forests, plains and mountains (PP250 database, see Section 3.3).

8.1 Psychophysical Experiment 1: Categorization

8.1.1 Participants

Twenty undergraduates from the University of Zurich volunteered in this experiment. All had normal or corrected-to-normal vision.

8.1.2 Materials and Procedure

250 natural scenes of the Corel image database (720x480 and 480x720 pixels) served as stimuli. The experiments were conducted in a dimly lit room. The viewing distance was maintained by a head rest so that the center of the screen was at eye height and the length and width of the displayed scenes covered 12° and 17°, respectively. Each scene was presented for 2 seconds on the 15" screen in random order. Simultaneously, the five labels coasts, rivers/lakes, forests, plains and mountains² were presented at the bottom of the monitor until a button was pressed. Participants had to assign each presented scene to one of the five categories as quickly and accurately as possible by pressing the corresponding response button on a serial six-button response pad. Figure 8.1 displays the instructions the participants received prior to the experiment, together with a screenshot of the experiment.

8.1.3 Results

The result of Experiment 1 is a set of twenty categorizations per image. Very importantly for the succeeding experiments, the results of Experiment 1 provide ground truth based on more than one annotating person: each scene image is assigned to the category that the majority of participants selected.

¹ The psychophysical experiments in this chapter were conducted in collaboration with Adrian Schwaninger and Franziska Hofer from the Department of Psychology at the University of Zurich, Zurich, Switzerland.
² Since the study was conducted at the University of Zurich, the category names were translated into German as follows: Küste, Gewässer, Wald, Feld/Ebene and Gebirge.

Figure 8.1: Instructions (in German) for Experiment 1: Categorization. The screen showed the stimulus together with five numbered labels (1 Küste/coasts, 2 Feld/Ebene/plains, 3 Wald/forests, 4 Gebirge/mountains, 5 Gewässer/rivers-lakes) and the instruction "Please decide as quickly and accurately as possible to which of the five categories the scene belongs."

In Figure 8.2, the categorizations are evaluated in more detail. For each category separately, as well as for the full database of 250 images, it is analyzed how well the human participants agree on the categorization of the images. The left column of Figure 8.2 shows the histogram over the number of agreeing responses: if all 20 participants agreed on the category of an image, it counts for bin 20; if only 19 participants had the same opinion, it counts for bin 19; and so on. The middle column shows the cumulative distribution of the left column, and the right column shows which percentage of each category, or of the full database, was selected with an answer distribution over 1, 2, 3, 4, or 5 categories. The graphs show that the categories coasts, rivers/lakes and mountains are rather ambiguous. Only a very small number of the images in these categories were selected unanimously and, in addition, the number of images that were assigned to three or more different categories is quite high (see right column). The participants differ much less in the categories forests and plains: in both categories, the majority of images was selected by all 20 participants. Relative to the full database, Figure 8.2 reveals that only 16% of all images were assigned to exactly one category by all participants, 43% of the images were assigned to two categories, and 41% to three or four categories. These results show clearly that the majority of the images in our database are semantically ambiguous. Figures 8.3 to 8.6 illustrate some human categorization results visually; the decision whether or not to agree is left to the viewer.

Figure 8.2: Evaluation of Experiment 1. For each category (coasts, rivers/lakes, forests, plains, mountains) and for the full database (OVERALL): left column, histogram over the number of agreeing responses (0-20); middle column, cumulative distribution; right column, percentage of images whose answers are distributed over 1, 2, 3, 4, or 5 categories.

Figure 8.3 shows images that were assigned to the same category by all 20 participants. This suggests that these images are very typical examples of the respective categories. The images of Figure 8.4 lie between two categories: about half of the participants assigned these images to one category and the other half to the other category. In Figures 8.5 and 8.6, the decisions are spread even wider, with the 20 participants selecting 3 and 4 different categories, respectively. Many of the images in Figures 8.4, 8.5, and 8.6 are in fact semantically quite similar to the images in Figure 7.10 that were not categorized correctly by our retrieval system. This result supports our claim that hard-decision scene categorization is not possible within a real-world database and that the attention should rather be directed to ranking the images according to their typicality.

Figure 8.3: Unanimously categorized images of each category (coasts, rivers/lakes, forests, plains, mountains)

Figure 8.4: Images assigned to two categories (examples: 45% forests / 55% plains; 45% plains / 55% mountains; 60% coasts / 40% rivers/lakes)

Although some images are quite ambiguous, a high degree of consistency is found among the human observers for all scene categories. That means the participants agree on the ambiguity or non-ambiguity of the presented scenes. The inter-rater consistency is estimated by calculating Cronbach's α among participants for each category. Usually, the coefficient α is used to estimate test reliability by determining the internal consistency of the test or, respectively, the average correlation of items within the test. Here, we employ the measure for estimating the average correlation of the human subjects. The formula for Cronbach's coefficient α is [Bortz, 1999]:

    \alpha = \frac{N}{N-1} \cdot \left( 1 - \frac{\sum_{i=1}^{N} \sigma_i^2}{\sigma_{total}^2} \right)        (8.1)

where N = the number of human participants, σ_i² = the variance of each participant, and σ_total² = the variance over all participants. Values above 0.90 indicate high consistency among individuals (e.g. [Kline, 2000]). As can be seen in Table 8.1(a), the values for α were between 0.972 (rivers/lakes) and 0.992 (forests), which shows a large agreement between the human observers when categorizing natural scenes.

Figure 8.5: Images assigned to three categories (examples: 25% forests / 40% plains / 35% mountains; 10% rivers/lakes / 55% forests / 35% mountains; 75% rivers/lakes / 10% coasts / 15% mountains)

Figure 8.6: Images assigned to four categories (examples: 5% coasts / 55% rivers/lakes / 10% forests / 30% mountains; 5% coasts / 75% rivers/lakes / 5% plains / 15% mountains)

8.2 Psychophysical Experiment 2: Typicality

Each semantic category contains typical and rather atypical exemplars, and this should apply to natural scenes as well. The important question for an image retrieval system is whether the order of typicality is perceived consistently among individuals. If this is not fulfilled, a time-consuming procedure to initially adapt an image retrieval system to each individual is unavoidable. In contrast, if different individuals agree on the typicality of natural scenes, a general image retrieval system makes sense, at least for an initial filtering of images. In Experiment 2, typicality values of scenes were assessed using the same sample of scenes and categories as in the categorization experiment of Section 8.1, and the consistency among participants was measured.

               (a) Experiment 1    (b) Experiment 2
                      α                α       r_s
coasts              0.982            0.981    0.693
rivers/lakes        0.972            0.981    0.780
forests             0.992            0.971    0.805
plains              0.988            0.968    0.679
mountains           0.980            0.943    0.645

Table 8.1: Inter-rater reliabilities for the categorization experiment (Experiment 1) and the typicality experiment (Experiment 2)

8.2.1 Participants

Ten undergraduates from the University of Zurich volunteered in this experiment. None of them had participated in Experiment 1. All had normal or corrected-to-normal vision.

8.2.2 Materials and Procedure

The same material as in Experiment 1 was used. Scenes were presented in random order, simultaneously with five bars labeled coasts, rivers/lakes, forests, plains and mountains². The arrangement of the labels was counterbalanced across participants. For each category, participants had to judge the typicality of the scene from 1 (very atypical) to 50 (very typical) by using the bar sliders displayed below the image. Thereafter, they could initiate the next trial by pressing the space bar. Figure 8.7 displays the instructions the participants received prior to the experiment, together with a screenshot of Experiment 2.

8.2.3 Results

This experiment provides us with a mean typicality rating between 1 (very atypical) and 50 (very typical) for each of the 250 images with respect to each of the five categories. Since the mean typicality is not a metric and can thus not be used for further evaluations in the context of this thesis, the typicality ranking of the scenes based on the mean typicality is computed. Also in this experiment, there is a high consistency between the participants of the study. The consistency of the typicality judgments was assessed using Cronbach's α (see Eq. 8.1) as well as the averaged Spearman's rank correlation between participants for each category. Spearman's rank correlation is identical to Pearson's correlation (also product-moment correlation or correlation coefficient) when the inserted variables are on an ordinal scale, as is the case with rank orders.

Figure 8.7: Instructions (in German) for Experiment 2: Typicality Ranking. The screen showed the scene together with five slider bars labeled Küste, Feld/Ebene, Wald, Gebirge, Gewässer, ranging from "sehr untypisch" (very atypical) to "sehr typisch" (very typical), and the instruction "Please decide how typical the scene is for each category."

Spearman's rank correlation can also be computed via the following equation [Bortz, 1999]:

    r_s = 1 - \frac{6 \cdot \sum_{j=1}^{M} d_j^2}{M \cdot (M^2 - 1)}        (8.2)

where M = the number of data points (images in our case) and d_j is the rank difference between the two compared rankings. If the two rankings exhibit ties, the computation of r_s becomes slightly more complicated. Ties are present if multiple items in ranking A have the same rank in ranking B, or vice versa, or both. With ties, Spearman's rank correlation becomes [Bortz, 1999]:

    r_s = \frac{2 \cdot \frac{M^3 - M}{12} - T - U - \sum_{j=1}^{M} d_j^2}{2 \cdot \sqrt{\left( \frac{M^3 - M}{12} - T \right) \cdot \left( \frac{M^3 - M}{12} - U \right)}}        (8.3)

where

    T = \frac{1}{12} \sum_{l=1}^{k(A)} (t_l^3 - t_l)        (8.4)

    U = \frac{1}{12} \sum_{l=1}^{k(B)} (u_l^3 - u_l)        (8.5)

and k(A), k(B) = the number of tied rank groups in ranking A or B, respectively, t_l = the number of ties in rank group l of ranking A, and u_l = the number of ties in rank group l of ranking B.
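A direct implementation of Eqs. 8.2-8.5 can be sketched as follows (illustrative Python; for rankings without ties, T = U = 0 and the result reduces to Eq. 8.2):

import numpy as np

def spearman_with_ties(ranks_a, ranks_b):
    """Tie-corrected Spearman rank correlation, Eqs. 8.3-8.5. `ranks_a` and
    `ranks_b` hold the ranks of the same M items in the two rankings; tied
    items share the same rank value."""
    ra, rb = np.asarray(ranks_a, float), np.asarray(ranks_b, float)
    M = len(ra)
    d2 = np.sum((ra - rb) ** 2)                          # sum of d_j^2

    def tie_term(ranks):                                 # Eq. 8.4 / 8.5
        _, t = np.unique(ranks, return_counts=True)      # sizes of rank groups
        return np.sum(t ** 3 - t) / 12.0

    T, U = tie_term(ra), tie_term(rb)
    base = (M ** 3 - M) / 12.0
    return (2 * base - T - U - d2) / (2 * np.sqrt((base - T) * (base - U)))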

Figure 8.8: Experiment 2: Histogram over typicality ratings (1 = low, 50 = high) for each category (coasts, rivers/lakes, forests, plains, mountains)

For assessing the inter-rater correlation of Experiment 2, Spearman's rank correlation of two typicality rankings has been computed for each combination of two participants and averaged. Table 8.1(b) shows the results on the inter-rater consistency of Experiment 2. Both Cronbach's α, lying between 0.943 (mountains) and 0.981 (coasts), and the averaged rank correlation between participants, lying between 0.645 (mountains) and 0.805 (forests), were very high. These results indicate that there is large agreement between the participants concerning the typicality ranking of the scenes used in this experiment. Figure 8.8 shows the histogram over the mean typicality rating for each category. The typicality ratings for coasts, rivers/lakes and also plains are clearly bimodal.

That is, images are rated as being either typical or atypical for the respective category, but not much in between. On the other hand, the typicality ratings of forests and mountains are nearly continuous; the images in our database form a kind of typicality continuum with respect to the mountains category.

By using the ground truth obtained in Experiment 1, the results of the two psychophysical experiments can be related to one another. The histograms in Figure 8.9 are obtained when only those images are included that actually belong to the respective category, as decided by majority vote in Experiment 1. As expected, mainly the bins corresponding to high typicality are filled, although there are some outliers at smaller typicality values. It must be noted here that Experiment 1 and Experiment 2 cannot be fully compared, since in Experiment 1 the participants were asked to judge the shown image relative to one, their "favorite", category, whereas in Experiment 2 the participants made a judgment with respect to each of the five categories. This explains why the typicality ratings in Figure 8.9 are partly quite distributed.

Figure 8.9: Experiment 2: Histogram over typicality ratings (1 = low, 50 = high) for each category, including only images belonging to the respective category (ground truth from Experiment 1)

8.3 Summary of Psychophysical Studies

We can draw several conclusions from the psychophysical experiments presented in this chapter. Given an arbitrary natural scene, humans are able to categorize that image into one of five categories. In addition, Experiment 1 shows that the participants are very consistent in this task. Experiment 1 also confirmed that our database contains ambiguous images. This is normal and in fact desired, since any real-world database will not contain only images that can clearly be assigned to one category or another. The visual comparison of some of the ambiguous images with the images mis-categorized in Chapter 7 suggests a semantic similarity between these two groups of images. That means that if humans are on average not sure how to categorize certain images, we cannot expect an automatic categorization system to make hard decisions about the category membership of those images. From Experiment 2 we know that humans are also able to consistently rate the typicality or atypicality of a scene relative to the five categories. The psychophysical study shows that the participants agree to a high degree in the typicality rating task. In summary, some of our database images are ambiguous concerning the given scene categories. Despite this fact, humans are able to sort all database images into one or more categories, with the important result that even assignments of one image to multiple categories are done consistently. In addition, the human participants rate the typicality of images consistently. These findings provide the basis for an automated typicality ranking system, which will be discussed in the next chapter.

Perceptually Plausible Ranking of Natural Scenes

In the previous chapters, we gradually motivated that it is not wise to aim for a hard-decision categorization of natural scenes. One reason is that the system is to be used in the context of content-based image retrieval, where the retrieval results are usually returned with a retrieval rank. In addition, the experiments with the human subjects showed that many images cannot be clearly assigned to one category. About half of the database are borderline cases, that is, images that are not very typical for any one category but lie somewhere in between two or even three categories. This leads us to the concept of typicality. Obviously, the ranking of retrieved images should be based on the visual similarity or the typicality of the images, not on some abstract similarity based on a color or texture feature. As mentioned before, humans seem to identify typical and less typical members of a category, and we have shown in the previous chapter that the human participants were able to consistently rate the semantic typicality of our database images. Thus, the goal is to automatically rank natural scenes according to their typicality relative to a scene category and, in addition, to achieve a high correlation between the resulting machine-generated ranking and the ranking of humans. In this chapter, we develop a perceptually plausible distance measure (PPD) that generates typicality rankings of natural scenes that correlate highly with the typicality rankings of humans. The core of the computational model for the ranking task is the semantic modeling used throughout this thesis: images are divided into a grid of 10x10 image regions, and these image regions are either annotated or classified into semantic concept classes s_i, i = 1...M, M = 9. Concept-occurrence vectors, that is, histograms over the frequencies of occurrence of each concept, serve as image representation. As in Chapter 7, we compare the ranking performance based on category prototypes with the ranking performance based on SVMs; the images are ranked according to their distance to the prototype or according to their distance from the hyperplane, respectively. For the experiments, the same database as for the psychophysical experiments is used (PP250). The ground truth for the images is the result of the psychophysical Experiment 1 in Section 8.1: the images are assigned to the category the majority of participants selected. The data of the psychophysical Experiment 2 serves as benchmark for the ranking experiments. Each participant rated the typicality of each of the 250 images relative to each of the five categories. Thus, we use the mean typicality to sort the images in descending order of typicality relative to each category. Since the mean typicality is not a metric, we can only use the ranking of the scenes, and not their typicality values, for evaluation [Vogel, 2000]. For that reason, the machine typicality ranking is compared to the human typicality ranking through Spearman's rank correlation r_s (Eq. 8.3). The chapter is organized as follows. In the next section, the typicality ranking performed by the Prototype approach and the sum-squared distance is analyzed. Based on these results, Section 9.2 proposes a perceptually plausible distance measure that is also employed in combination with the Prototype approach. The two distance measures are compared and discussed in Section 9.3. Section 9.4 evaluates the typicality ranking performance of the SVM approach, and Section 9.5 summarizes the results of this chapter.
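For illustration, the construction of a concept-occurrence vector can be sketched as follows (the concept names and their order are assumptions of this sketch; the split of the 10 grid rows into r horizontal areas follows the description above):

import numpy as np

# The nine local semantic concepts discussed in this thesis (order assumed):
CONCEPTS = ["sky", "water", "grass", "trunks", "foliage",
            "field", "rocks", "flowers", "sand"]

def concept_occurrence_vector(patch_labels, n_areas=1):
    """Build the concept-occurrence vector of one image: `patch_labels` is
    the 10x10 grid of concept indices (0..8), one per image region. For
    n_areas > 1, the grid rows are split into horizontal areas and one
    normalized histogram per area is concatenated, giving length N(r)."""
    grid = np.asarray(patch_labels)
    areas = np.array_split(grid, n_areas, axis=0)        # horizontal layers
    hists = [np.bincount(a.ravel(), minlength=len(CONCEPTS)) / a.size
             for a in areas]
    return np.concatenate(hists)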

9.1 Typicality Ranking using the Prototype Approach and the SSD

As in Section 7.1.1, each scene category is represented by the mean over the concept-occurrence vectors COV of all images belonging to the respective category. This leads to a prototypical representation p^c of the scene categories, where the semantic concepts s_i act as attributes and their occurrences as attribute scores. The typicality of a scene relative to a category c is computed by the sum-squared distance (SSD), which compares the scene representation COV to the prototype p^c of the respective category:

    d_{SSD}^{c}(r) = \sum_{j=1}^{N(r)} \left( COV_j(r) - p_j^c(r) \right)^2        (9.1)

In this equation, r = [1, 2, 3, 5, 10] refers to the number of image areas and N(r) = [9, 18, 27, 45, 90] to the corresponding length of the concept-occurrence vector (compare Section 3.1). Since only the typicality ranking is of interest, the images are ranked according to d_{SSD}^{c}(r). The goal of the experiments is to evaluate the capacity of the semantic image retrieval system to rank natural scenes similarly to humans. All experiments have been 5-fold cross-validated. In each round, 4/5 of each category have been used as training set for the computation of the prototype. The images of the test set were ranked using the learned prototype and the SSD, and the result was correlated with the corresponding human typicality ranks. The reported Spearman's rank correlation is the average over all cross-validation rounds. As before, the tests with the manually annotated image regions serve as a benchmark for the maximum correlation performance that can be reached with our computational model. In an additional experiment, we determine the correlation performance when using classified image regions as input. Here, the SVM concept classifier based on the linHSIhist84_glcm32_edh72-feature with 71.7% classification rate (see Chapter 6 and Table 6.6) is employed. In both cases, the prototypes are learned based on annotated image regions.

Figure 9.1 left shows the obtained correlation between the human typicality ranking and the machine typicality ranking using annotated image patches as input. Each group of bars belongs to one category, noted below the plot. In each category, r = [1, 2, 3, 5, 10] horizontally-layered image areas have been tested. The black line with the error bars displays the average inter-individual rank correlation from the psychophysical Experiment 2 and its standard deviation for each category (compare Table 8.1(b)). The results are partially promising. With the Prototype approach and the SSD, the machine ranking performs for forests and mountains at least as well as the average inter-individual correlation, and rivers/lakes lies inside the 1σ-interval. But the ranking of coasts and plains correlates poorly with that of the humans. The results of the experiment with the classified image regions as input are shown in Figure 9.1 right. In addition, Table 9.1 displays for each category separately the best correlation over all image areas for the experiments with annotated as well as with classified image regions. Although the classification rate of the concept classifier, 71.7%, is not extremely high, coasts, forests and plains seem to be quite stable with respect to the misclassification of image regions: the maximum correlation reached in these categories is less than 0.03 below that of the experiment with the annotated image regions. Also, the ranking of rivers/lakes loses only little correlation, whereas the ranking of mountains breaks down with a correlation difference of 0.3. In summary, the performance of the computational model employing the Prototype approach with the SSD is encouraging but not outstanding. With annotated image regions, we obtain for rivers/lakes, forests, and mountains a correlation that is in the same range as the inter-rater correlation; that means the system behaves similarly to the average of a group of humans. But in the more realistic case of classified image regions, most correlations lie well below the average inter-rater correlations.
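The SSD-based typicality ranking of Eq. 9.1 amounts to the following sketch (illustrative Python):

import numpy as np

def rank_by_ssd(covs, prototype):
    """Typicality ranking for one category via Eq. 9.1: compute the SSD of
    every concept-occurrence vector to the category prototype and return
    the image indices ordered from most to least typical (smallest
    distance first)."""
    d_ssd = np.sum((np.asarray(covs) - np.asarray(prototype)) ** 2, axis=1)
    return np.argsort(d_ssd)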

Figure 9.1: Spearman's rank correlation r_s between human and machine typicality ranking: Prototype Approach with SSD. Left: annotated image patches as input; right: classified image patches as input (bars for 1, 2, 3, 5, and 10 image areas per category)

                      coasts  rivers/lakes  forests  plains  mountains
annotated regions      0.57       0.72        0.86     0.32     0.77
classified regions     0.54       0.66        0.82     0.34     0.46

Table 9.1: Best correlation r_s in each category using the SSD

9.2 Typicality Ranking using a Perceptually Plausible Distance Measure

The results of the previous section suggest that the semantic concepts in each category should be weighted differently. There, the typicality ranking was performed using the SSD as distance measure. Introducing a set of weights w^c(r) into Eq. 9.1 yields

    d_{PPD}^{c}(r) = \sum_{j=1}^{N(r)} w_j^c(r) \cdot \left( COV_j(r) - p_j^c(r) \right)^2        (9.2)

and the SSD corresponds to the special case where the weight vector is composed of only ones.

The weights w^c(r) in fact model the relative importance of the local semantic concepts in each category. In the case of the SSD, all local semantic concepts in all categories are given the same weight, which is not necessarily the best choice. Flowers, for example, are very discriminative for the plains category but hardly appear in any other category (see Figure 7.2). For that reason, the relative importance of the concept flowers should be increased for the plains category and decreased for all other categories. The goal must thus be to adapt the concept weights w^c(r) depending on the category and also on the number of image areas r: grass, for instance, is not very discriminative when appearing at the bottom of an image, but all the more so when appearing at the top. The typicality scores of the human participants in psychophysical Experiment 2 provide us with powerful information for learning these concept weights.

9.2. Typicality Ranking using a Perceptually Plausible Distance Measure Input: annotated image patches, optimized weights

Input: classified image patches, optimized weights

1

1 1 area 2 areas 3 areas 5 areas 10 areas

0.9

rs

0.8

0.7 0.6 0.5 0.4 0.3

0.7 0.6 0.5 0.4 0.3

0.2

0.2

0.1

0.1

coasts rivers/lakes forests

plains

1 area 2 areas 3 areas 5 areas 10 areas

0.9

Spearmans rank correlation

Spearmans rank correlation

rs

0.8

0

107

0

mountains

coasts rivers/lakes forests

plains

mountains

Figure 9.2: Spearman’s rank correlation: Prototype Approach with PPD coasts

rivers/lakes

forests

plains

mountains

annotated regions

0.64

0.80

0.87

0.72

0.74

classified regions

0.59

0.75

0.77

0.58

0.60

For each number of image areas r and each category c, the weights $w^c(r)$ are optimized such that the correlation between the machine typicality ranking and the average human typicality ranking on the training set is maximized. For this optimization, we solve a constrained minimization problem in which the weights $w^c(r)$ are adapted to find the minimum of $1 - r_s(\mathrm{typicality}_{human}, \mathrm{typicality}_{machine})$ under the constraint that $0.0001 < w^c_j(r) < 10000$. As starting condition $w^c_0(r)$, we use a vector of only ones plus random jitter. The optimization is run repeatedly on the training set, each time saving the set of weights that leads to the maximum $r_s$. The distance $d^c_{PPD}(r)$ thus becomes perceptually and psychophysically meaningful because the weights are learned from the average human typicality scores; in the following, it is abbreviated as PPD, for perceptually plausible distance. The images of the test set in each cross-validation round are ranked using the distance with the weights optimized on the training set, and Spearman's rank correlation to the human ranking is computed. As before, the reported correlation is the average over all cross-validation rounds.

The results of the experiments with the PPD and annotated image regions as input are displayed in Figure 9.2 (left). In addition, Table 9.2 shows the best correlation for each category.
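One possible realization of this optimization, sketched with off-the-shelf tools rather than the thesis implementation, is shown below. It assumes the data layout of the previous sketch (a matrix of concept-occurrence vectors and one average human typicality score per training image), uses scipy's derivative-free Powell method, since the rank-based objective is not differentiable, and applies the box constraints and the ones-plus-jitter starting condition described above.

```python
# Sketch (hypothetical, not the thesis code): learn weights w^c(r) by
# minimizing 1 - r_s(human, machine) under box constraints.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import spearmanr

def objective(weights, covs, prototype, human_scores):
    # Distance of every training image to the category prototype ...
    dists = np.sum(weights * (covs - prototype) ** 2, axis=1)
    # ... small distance = high typicality, hence the sign flip.
    r_s, _ = spearmanr(-dists, human_scores)
    return 1.0 - r_s

def learn_weights(covs, prototype, human_scores, n_restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    dim = covs.shape[1]
    bounds = [(1e-4, 1e4)] * dim          # 0.0001 < w_j^c(r) < 10000
    best = None
    for _ in range(n_restarts):           # repeated optimization runs
        w0 = np.clip(1.0 + 0.1 * rng.standard_normal(dim), 1e-4, 1e4)
        res = minimize(objective, w0, args=(covs, prototype, human_scores),
                       method="Powell", bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res                    # keep weights with maximum r_s
    return best.x
```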

[Figure 9.3: Scatter plots of all categories: machine ranking vs. human ranking. Prototype Approach with SSD, annotated image regions. Per-panel correlations: coasts rs = 0.57, rivers/lakes rs = 0.72, forests rs = 0.86, plains rs = 0.32, mountains rs = 0.77.]

In comparison with Table 9.1, the correlation performance of all categories but mountains improves markedly; in particular, the correlation for plains increases by 0.4. Except for coasts, the correlation between the machine and the human ranking exceeds the inter-rater reliability of the psychophysical Experiment 2 in all categories (black line with error bars), and all categories lie inside the 1σ-interval. In other words, the very consistent typicality judgments of the participants in Experiment 2 can be modeled with our system. Also, the achieved human-machine correlations closely follow the varying inter-individual correlations between categories: forests-scenes are ranked very consistently by humans and also by the machine, whereas the performance for mountains is a little worse (lower average correlation and higher variance) both for humans and for the machine.

The weights were optimized on the training set using annotated image regions as input. Figure 9.2 (right) shows the results when ranking images with classified image regions using these weights. On average, the correlations obtained with classified image regions lie less than 0.1 below those obtained with annotated image regions. In all cases, Spearman's rank correlation is very close to the 1σ-interval of the inter-individual correlation, and the correlations again follow the variations of the inter-individual correlations between categories. This indicates that the computational model and the full image analysis and typicality ranking system are indeed perceptually plausible.

[Figure 9.4: Scatter plots of all categories: machine ranking vs. human ranking. Prototype Approach with PPD, annotated image regions. Per-panel correlations: coasts rs = 0.64, rivers/lakes rs = 0.80, forests rs = 0.87, plains rs = 0.72, mountains rs = 0.74.]

The averaged variance of Spearman's rank correlation over the cross-validation rounds is small: 0.0026 for the annotated image regions and 0.0076 for the classified image regions, corresponding to 0.34% and 1.2% of the respective average rank correlation. This suggests that the results are stable, that the training sets represent the data well, and that the method generalizes.

9.3 Comparison SSD vs. PPD

The experiments of the previous two sections clearly show the superiority of the perceptually plausible distance (PPD) over the sum-squared distance (SSD). With annotated image regions as input, the correlation performance of the PPD is up to 0.4 higher than that of the SSD. Only for mountains does the correlation decrease, by 0.02, which might be due to some overfitting in the training phase. It is worth noting, however, that in all cases the rank correlation lies clearly inside the 1σ-interval of the inter-individual correlation. The performance increase when using classified image regions as input is of greater importance, since it suggests a high robustness of our approach: the correlations achieved with the PPD are up to 0.25 (plains) higher than with the SSD.

[Figure 9.5: Concept weights resulting from the optimization procedure for each category (1 image area). One bar plot per category over the concepts sky, water, grass, trunks, foliage, field, rocks, flowers, and sand.]

The mountains-category shows an especially interesting behavior: based on annotated image regions, the ranking performance is better with the SSD (SSD: 0.77, PPD: 0.74), but based on classified image regions, the PPD outperforms the SSD (SSD: 0.46, PPD: 0.60).

Figure 9.3 and Figure 9.4 visualize the correlation increase from the SSD to the PPD. For each scene category, the figures display the scatter plot of the machine ranking vs. the human ranking; in each category, the number of image areas that leads to the highest correlation value was selected. The scatter plots obtained with the PPD are clearly closer to a line than those obtained with the SSD. The point clusters at low and high ranks, especially for coasts, rivers/lakes, and plains, result from the bimodal typicality score distribution in Experiment 2 (see Figure 8.8).

The analysis of the weights that result from the learning procedure is also insightful. Figure 9.5 shows the optimized weights of each category when employing one image area; note the different maximum values in each plot. The plots show which semantic concepts are especially discriminant for the particular categories. The most outstanding concept is certainly flowers for the plains-category. Only few images in the database contain larger amounts of flowers, and those images are in fact plains-scenes; the optimization procedure thus captures this important discriminant factor. The high weighting of flowers is most probably responsible for the strong correlation increase of the plains-category from the SSD to the PPD. A similar argument holds for sand in the coasts-category: the few sand-regions in the database belong in most cases to coasts-images. Interestingly, sky is also weighted quite highly for the ranking of coasts-images, especially in comparison to the rivers/lakes-category. The reason is that the coasts-images in the database in fact contain larger amounts of sky than the rivers/lakes-images and images from the other categories. Thus, sky helps to detect coasts-scenes.

Finally, the question of how many image areas to use is not easily answered: each number of image areas, that is, r = 1, 2, 3, 5, and 10, achieves the best rank correlation for at least one category. With the PPD, r = 2 is a good choice for most categories, and r = 10 for mountains.

9.4 Typicality Ranking with the SVM Approach

Typicality ranking with SVMs is challenging since SVMs are discriminative methods: they are trained to maximize the margin between two classes instead of learning a representation of the classes as in the Prototype approach. Often, the distance of a classified data point to the hyperplane is used as a measure of confidence. Points close to the hyperplane are certainly borderline cases or even support vectors, but the meaningfulness of larger distances to the separating hyperplane is debatable. Besides, the Prototype approach is also psychophysically more intuitive. Nevertheless, for the sake of completeness, we performed the typicality ranking experiments also with SVMs (LIBSVM, [Chang and Lin, 2001]).


In contrast to Chapter 7, here five single-class SVMs (Category vs. NonCategory) are trained in order to obtain distances for each semantic category separately. The experiments proceed as follows. The PP250-database is divided into five cross-validation sets with five training sets of size 200 and five test sets of size 50. In the training phase, 1008 different SVM models are trained based on annotated image regions (28 settings for the cost parameter C times 36 settings for the RBF parameter γ). Of these, the model that achieves the highest accuracy in classifying the particular category, averaged over all cross-validation rounds, is selected. Note that the best SVM parameter set can differ between the five scene categories and between the five sizes of image areas. With this model, and using the distance to the learned hyperplane, all test images with annotated image regions are ranked, and the mean correlation to the human rankings over the cross-validation rounds is reported. Similarly, the test images with classified image regions are ranked with the model learned on annotated image regions, and the rank correlation is computed.

[Figure 9.6: Spearman's rank correlation: SVM Approach. Left panel: annotated image patches as input; right panel: classified image patches. Bars show rs per category for r = 1, 2, 3, 5, and 10 image areas.]

                       coasts   rivers/lakes   forests   plains   mountains
  annotated regions     0.58        0.39         0.79      0.75      0.24
  classified regions    0.56        0.32         0.73      0.57      0.30

Table 9.3: Best correlation rs in each category with the SVM Approach

Figure 9.6 shows on the left the experimental results with annotated image regions as input and on the right those with classified image regions. In Table 9.3, the maximum rank correlation in each category is listed for both cases. In contrast to the PPD, the SVMs are not specifically optimized to rank images similarly to humans; for that reason, the obtained rank correlations should only be compared to those of the Prototype Approach using the SSD. From Figure 9.6 it becomes apparent that the SVMs do in fact not perform well in typicality-ranking natural scenes. The only category that is ranked better than in the previous section is plains, when using annotated image regions.
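The ranking step itself can be sketched with scikit-learn's SVC class, which wraps LIBSVM; this is an illustrative stand-in for the original setup, and the data names and parameter values below are hypothetical rather than the exact 28 x 36 grid of the experiments.

```python
# Sketch: rank test images by their signed distance to the hyperplane
# of a Category-vs-NonCategory SVM (RBF kernel).
import numpy as np
from sklearn.svm import SVC

def svm_distances(X_train, y_train, X_test, C=1.0, gamma=0.1):
    """Train a one-vs-rest SVM and return hyperplane distances;
    y_train is +1 for the category, -1 otherwise."""
    clf = SVC(C=C, gamma=gamma, kernel="rbf")
    clf.fit(X_train, y_train)
    return clf.decision_function(X_test)   # larger = deeper in-class

# Model selection would loop over a (C, gamma) grid, keep the most
# accurate model per category, and rank the test set by the returned
# distances before correlating them with the human ranking.
```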

9.5 Discussion

In this chapter, we presented three different approaches to the typicality ranking of natural scenes and evaluated their psychophysical plausibility by computing Spearman's rank correlation between the averaged human rankings and the machine rankings of the scenes. The experiments show that the Prototype approach with the perceptually plausible distance measure PPD achieves on average the highest correlations with the human ranking of natural scenes. In particular, the difference in rank correlation between using annotated image regions and using classified image regions is lowest. This result is very encouraging because the employed concept classifier has an accuracy of only 71.7%: the image representation through the semantic modeling, combined with the PPD, appears to be quite robust to misclassifications at the concept level.

The proposed category representation based on prototypes has the advantage of not representing binary decisions about whether a semantic concept is present in the image ("Yes, there are rocks." vs. "No, there are no rocks."). Instead, it represents soft decisions about the degree to which a particular semantic concept is present ("There are some rocks.", "There are many rocks."). The distances of the category members to the prototypical representation thus make it possible to assess the typicality of these images without excluding them from the category. There might be mountains images that hardly contain any rocks; they do belong to the mountains category, but they are

much less typical than mountains images that contain a larger amount of rocks. In fact, they might be quite close to the borderline of being forests images. With the learned weights in the PPD, the relative importance of the semantic concepts is modeled separately for each category and for each number of image areas.

Although the Support Vector Machines performed best in the classification, categorization, and retrieval experiments of the previous chapters, their performance for the typicality ranking is below that of the Prototype approach. The result is not surprising when keeping in mind that an SVM is a discriminative method: during training, a hyperplane is found that maximizes the margin between two classes. Images that are close to the hyperplane and the margin are most probably also semantically borderline cases, but the distance from the hyperplane has no meaning beyond in-class or out-of-class. This is especially true for data points that are far away from the separating hyperplane. Thus, there is no reason to expect that a ranking based on the distance from the hyperplane correlates well with the human ranking. Prototypes, on the other hand, are a representative method where the distance from the prototype signifies the similarity to the prototype; this distance can therefore be used for a typicality ranking.

In the following, we visualize the ranking performance of our system through the top 10 ranked images in different settings. For coasts and rivers/lakes, exemplary images are displayed in Figures 9.7 to 9.12; the remaining categories are displayed in Appendix A. The first figure of each triple shows the average ranking obtained from the human participants, the second the ranking obtained with the Prototype Approach and the PPD when using annotated image regions as input, and the third the ranking when using classified image regions as input. The third figure of each triple thus displays the result of a fully automatic retrieval system. In the second and third figure, each image's machine rank as well as the corresponding human rank are given.

Figure 9.7 and Figure 9.10 show that even for humans the ranking of the images is a non-trivial task: some of the images appear both in the Top 10 of the coasts-category and in the Top 10 of the rivers/lakes-category.¹ Figure 9.8 and Figure 9.11, respectively, illustrate three things. Firstly, many of the images are the same as in the Top 10 human-ranked images. Secondly, the ranking of the images corresponds closely to the ranking in the first set of images. And thirdly, the "new" images, that is, images that do not appear in Figure 9.7 or Figure 9.10, respectively, are semantically hardly distinguishable from the others. The observations for Figure 9.9 and Figure 9.12, respectively, are very similar. In these figures, images with automatically classified image regions were ranked. We detect only one "real" mistake in the ranking: rank 9 in Figure 9.9.

¹ Note here also that in the psychophysical experiments rivers/lakes has been translated to Gewässer, which has a slightly different meaning.


[Figure 9.7: Human Rankings: Top 10 ranked coasts-images]

[Figure 9.8: Prototype Approach with PPD, annotated image regions: Top 10 ranked coasts-images. Machine ranks 1-10 correspond to human ranks 6, 1, 18, 7, 13, 19, 9, 10, 3, and 4.]

[Figure 9.9: Prototype Approach with PPD, classified image regions: Top 10 ranked coasts-images. Machine ranks 1-10 correspond to human ranks 6, 11, 13, 9, 4, 2, 10, 18, 37, and 3.]

[Figure 9.10: Human Rankings: Top 10 ranked rivers/lakes-images]

[Figure 9.11: Prototype Approach with PPD, annotated image regions: Top 10 ranked rivers/lakes-images. Machine ranks 1-10 correspond to human ranks 2, 12, 10, 21, 15, 4, 14, 20, 11, and 19.]

[Figure 9.12: Prototype Approach with PPD, classified image regions: Top 10 ranked rivers/lakes-images. Machine ranks 1-10 correspond to human ranks 2, 20, 17, 15, 10, 5, 11, 14, 9, and 13.]

Conclusion and Perspective

10.1 Conclusion

Semantics-intensive image retrieval has gained increased interest in recent years. One reason is that approaches based on low-level features have reached their limit in retrieving specific semantic image content. Another reason is that average users enter the market with the wish to organize the pictures from their personal digital cameras. This group of users does not have the experience to choose among a multitude of possible settings for feature parameters they have never heard of. Instead, they rely on a more intuitive, that is, semantic, access to images.

With users in mind who preferably describe images in natural language, semantics-intensive image retrieval implies a query mode based on keywords. Being more intuitive for the user, a keyword-based retrieval system provides a simple method for query formulation and for relevance feedback. Such a system includes image analysis on the region level, both for a more complete image understanding and for the possibility to query for the spatial arrangement of region content. Nevertheless, a full image representation based on a semantic vocabulary is also necessary for the comparison of global image similarity.

Recent research shows promising results for approaches that automatically infer region annotations from global image labels (e.g. [Duygulu et al., 2002, Barnard et al., 2002]), or global image labels from category annotations [Li and Wang, 2003]. The advantage of these semi-supervised approaches is that they require comparatively few training examples for the annotation of large amounts of image data. On the other hand, the precision, recall, and accuracy of these methods are still too low for effective


use in an image retrieval system.

In this dissertation, the semantic representation of natural images has been studied in detail. Through a review of relevant work in the area of psychophysics, special effort has been made to base design decisions on the human perception of natural scenes. The proposed image representation is based on the classification of local semantic concepts: image regions are extracted on a regular grid of 10x10 regions, and the resulting local patches are classified into a set of nine semantic concept classes. Images are then represented through the frequency of occurrence of the semantic concepts. This so-called semantic modeling constitutes a compact, semantic image representation for the categorization, retrieval, and ranking of natural scenes.

The proposed image representation makes it possible to address two important challenges in content-based image retrieval: the performance evaluation of retrieval systems and the top-down modeling of scene categories. As shown in Part I of this dissertation, the classification errors of the semantic concept detectors as well as the search for specific image content can be modeled statistically. With this model, precision and recall of our vocabulary-supported image retrieval can be predicted as well as optimized. Depending on the knowledge about concept detectors and concept distribution, the optimization method leads to an improvement in precision of up to 60% and in recall of up to 25%.

In Part II, it has been demonstrated that the semantic modeling can also be employed for the representation of semantic scene categories. Several methods for the categorization and retrieval of natural scene categories have been compared experimentally. In particular, it has been shown that both the categorization and the retrieval results improve considerably when employing the semantic modeling in contrast to directly using low-level image features. In addition, the semantic modeling led to a dimensionality reduction by a factor of 20. Analysis of the mis-categorized images reveals that the usual ambiguity of scene categories calls for a typicality ranking of images rather than for hard-decision categorization. Based on the findings of two psychophysical experiments on the categorization and typicality ranking of our database images, a perceptually plausible distance measure has been developed. The semantic ranking obtained with the Prototype approach and the learned distance measure correlates highly with human rankings. This validates the semantic modeling also from a human perception point of view.

In Chapter 3, we generated a list of requirements for an effective image representation. The following summarizes in retrospect which requirements are fulfilled by the proposed semantic modeling:

semantic & descriptive: As argued throughout this thesis, semantics and description closely depend on one another. Through the use of local semantic concepts, images in our system can indeed be described semantically. This applies to the description on a region level ("The image contains a lot of foliage and grass.") as well as to the full semantic representation of images through concept-occurrence vectors.

region-based: In our system, the semantic concepts are extracted from local image


regions. Compared to a global image annotation, this allows images to be described on a more detailed semantic level.

segmentation-free: In order not to depend on a potentially unsatisfying image segmentation, image regions in our system are extracted on a regular grid of 10x10 regions.

global from local: Although detailed semantic image content is extracted from local image regions, the semantic modeling combines the local information into a compact global image representation. In Chapter 7, we show how the global image representation can be employed successfully for the categorization and retrieval of natural scenes. The strength of the semantic modeling is that the discriminative semantic content of scene categories can be learned and thus used for categorization and retrieval. Furthermore, Chapter 9 illustrates that the image representation lends itself to a semantic ranking of natural scene images. This result is especially interesting for content-based image retrieval, where users expect the retrieval results to be ranked in decreasing order of relevance, which is in fact semantic similarity.

inspired by human perception: Several design decisions in this dissertation have been based on insights from psychophysics about human visual perception. The natural scene categories have been selected for being basic-level categories as found through psychophysical experiments. The selection of the local semantic concepts has also been inspired by work in psychophysics. Furthermore, the psychophysical experiments of Chapter 8 investigate how humans rank the natural scenes employed in this thesis. The data from these experiments is used in Chapter 9 for learning a perceptually plausible distance measure. With this distance measure and the proposed semantic modeling, natural scenes can be ranked automatically such that the automatic ranking correlates highly with the human typicality ranking.

evaluated quantitatively: All parts of the thesis have been thoroughly evaluated. The image representation has been tested through the categorization and retrieval experiments in Chapter 7 as well as the ranking experiments in Chapter 9. In these chapters, we compare in particular the performance of a representative approach based on category prototypes with that of a discriminative approach based on Support Vector Machines.

Although many questions relate to the implementation of the concept detectors, the goal of the thesis was to study the applicability of the image representation based on semantic concepts in general. Is it possible to learn or even infer higher-level knowledge given the semantic concepts present in a set of images? Which kinds of scenes can be described? What are the reasons for success or failure? As we have shown, the modeling approach is well-suited for the description and retrieval of nature scenes. It is possible to search for specific image content as well as to model full scene categories. The representation even makes it possible to model and to compensate for the errors of the concept


detectors, and hence to optimize the retrieval performance. Most importantly, the automatic ranking of the images is very close to what humans do. The interesting point is that these good results have been obtained with rather standard methods, be it the semantic modeling, the category representation, or the classification approaches.

Concerning the psychophysical experiments, the high correlation of our system with the human data is especially promising. The high consistency of the human participants in categorizing and ranking images also has implications for image retrieval systems in general: since different individuals agree on the typicality of natural scenes, a general image retrieval system is feasible, and a time-consuming procedure for initially adapting the retrieval system to each individual is not necessary.

The necessity to annotate region data for the supervised scene and category representation is often regarded as a weakness. The strength of the approach, however, is that the annotation needs to be performed only on a region level, with low semantic ambiguity and large amounts of data at one's disposal. The good results in scene categorization suggest that an unsupervised approach to scene clustering will be successful.

Going into more computational detail, the experiments have revealed that Support Vector Machines (SVMs) easily outperform the k-Nearest Neighbor method in classifying semantic concepts. The classification rates could be further increased by incorporating the location prior also in the SVM approach, since it had a positive influence with the k-Nearest Neighbor method. For the categorization and retrieval of scene categories, SVMs have been compared to a Prototype approach; for these tasks as well, SVMs lead to higher categorization rates and retrieval accuracies. Although the typicality ranking of natural scenes might seem similar to scene retrieval, the Prototype approach in combination with the perceptually plausible distance (PPD) measure performs best in ranking scenes. In contrast to the categorization task, the performance measure here compares automatically generated rankings to human rankings. The results confirm that the strength of SVMs is discrimination and thus classification, whereas prototypes should be chosen for the representation of scenes. In addition, through the learning of the concept weights in the case of the PPD, the space of the concept-occurrence vectors is shaped differently for each scene category.

Spatial modeling of natural scenes beyond the employed horizontally layered image areas did not lead to any performance increase. This suggests that the employed scene categories are indeed visually very diverse. Also, the idea of using the confidence ratings of the concept detectors for higher-level category modeling did not change the performance considerably.

10.2 Perspective

The following summarizes some ideas for future work.

Clustering of Concept-Occurrence Vectors: The experiments on scene categorization show good results even when employing only one prototype per scene category. For that reason, an automatic clustering of the concept-occurrence vectors can also be expected to generate semantically meaningful clusters, that is, clusters containing mainly images from one single category. The clustering approach might also group images into subordinate image categories.

Extension to More Scene Categories: It would be interesting to extend the semantic modeling approach to more scene categories and also to more local semantic concepts. Outdoor scenes with man-made influence are well-suited for such experiments, since they pose interesting problems concerning scene ambiguity. In combination with the automatic clustering of concept-occurrence vectors, the approach would reveal which scene types can be extracted.

Additional Psychophysical Experiments: The psychophysical experiments of this thesis only scratch the surface of human perception of natural scenes. The most interesting question arising from these experiments is which semantic information humans use for the categorization and/or ranking task. Is it indeed the local semantic concepts, as in the semantic modeling step? Or is it rather some global image information, as suggested by the experiments of Oliva and Torralba [Oliva and Torralba, 2002]? Or is it a combination? Which role does context play for multi-object categorization? Does knowledge of the type of scene help to classify objects? Additional psychophysical experiments in combination with appropriate computational models might answer some of these questions.

Statistical Models of Semantic Co-Occurrence: Another very interesting project is the modeling of co-occurrence information similar to the work of [Hofmann et al., 1999], [Barnard et al., 2003], or [Lavrenko et al., 2003]. In the context of the semantic modeling proposed in this thesis, the co-occurrence of the semantic concepts in one image or in one image category could be modeled. Besides being a novel bottom-up image modeling approach, this method would also reveal which spatial configurations, if any, are distinctive for certain scene types.

Extension to Video Retrieval: The semantic modeling approach could also be extended to moving pictures, since it entails a method for context recognition and context priming in outdoor scenes. This information can be used to support object recognition in feature films as well as for video indexing or keyframe annotation.


Interleaved Scene and Object Classification: The analysis in this thesis showed that scene context and the interrelations between semantic concepts help to determine the scene category. In turn, knowledge about the scene category increases concept classification accuracies. Thus, knowledge learned from the interdependencies between scenes and objects can be used to improve both scene classification and object classification. The idea here is to combine bottom-up knowledge, i.e. low-level feature-based information, with top-down knowledge, i.e. information about the scene class, in an iterative manner.

Improved Concept Detectors: Last but not least, the concept detectors could be extended by employing different low-level features, by using different features depending on the concepts to be classified, or by introducing other context information such as the detection and classification of the horizon line. A combination of bottom-up with top-down approaches, e.g. through scene categorization, might also improve the classification accuracy of the smaller concept classes. As mentioned before, some preliminary tests have been performed on the automatic clustering of image regions. The goal here was to semantically analyze the clusters found inside one semantic class such as water or foliage. Many of the resulting clusters did indeed show semantic meaning, such as "white water", "calm water", "light waves", and "strong waves" for water.



Appendix A. Additional Results of the Semantic Typicality Ranking


[Figure A.1: Human Rankings: Top 10 ranked forests-images]

[Figure A.2: Prototype Approach with PPD, annotated image regions: Top 10 ranked forests-images. Machine ranks 1-10 correspond to human ranks 2, 6, 17, 5, 3, 11, 7, 8, 4, and 19.]

[Figure A.3: Prototype Approach with PPD, classified image regions: Top 10 ranked forests-images. Machine ranks 1-10 correspond to human ranks 17, 2, 4, 11, 6, 21, 3, 8, 1, and 5.]

[Figure A.4: Human Rankings: Top 10 ranked plains-images]

[Figure A.5: Prototype Approach with PPD, annotated image regions: Top 10 ranked plains-images. Machine ranks 1-10 correspond to human ranks 1, 12, 7, 3, 2, 8, 6, 9, 4, and 18.]

[Figure A.6: Prototype Approach with PPD, classified image regions: Top 10 ranked plains-images. Machine ranks 1-10 correspond to human ranks 2, 6, 4, 9, 8, 7, 13, 22, 12, and 5.]

[Figure A.7: Human Rankings: Top 10 ranked mountains-images]

[Figure A.8: Prototype Approach with PPD, annotated image regions: Top 10 ranked mountains-images. Machine ranks 1-10 correspond to human ranks 1, 2, 7, 8, 6, 10, 4, 3, 5, and 9.]

[Figure A.9: Prototype Approach with PPD, classified image regions: Top 10 ranked mountains-images. Machine ranks 1-10 correspond to human ranks 1, 7, 4, 2, 13, 11, 21, 6, 8, and 23.]

List of Figures

2.1 From [Vailaya et al., 1998]: Hierarchy of categories
3.1 Image representation through semantic modeling
3.2 From [Tversky and Hemenway, 1982]: Taxonomy of environmental scenes
3.3 From [Mojsilovic et al., 2004]: Semantic scene categories
3.4 Employed scene taxonomy
3.5 natural700-Database: Examples for each category
4.1 Two-stage image retrieval system
4.2 Multiple retrievals for the query: "20-40% of sky.", p = 0.90, q = 0.80
4.3 Multiple retrievals for the query: "20-35% of water.", p = q = 0.90
5.1 Two-stage retrieval system with query optimization
5.2 Prediction of precision and recall for "10-30% sky", varying p, q and S
5.3 Predicted search space for "20-40% of sky", p = 0.90, q = 0.80
5.4 Retrieval results for "20-40% of sky", p = 0.90, q = 0.80
5.5 Retrieval with optimization constraint
5.6 Complete concept distribution P(NP) for sky and its approximations
5.7 Precision_det vs. Recall_det of various detectors
5.8 Retrieval Optimization in Stage I
5.9 Joint Retrieval Optimization in Stage I and II
5.10 Optimization by query mapping: comparison to full optimization
6.1 Image segmentation
6.2 Semantic concept classes
6.3 Location priors per semantic concept
6.4 Comparison of the kNN classification accuracies
7.1 Overview scene categorization and retrieval
7.2 Prototypes and standard deviations of the six scene categories
7.3 Categorization rates vs. Image Areas - Based on annotated image regions and without semantic modeling step
7.4 Categorization rates vs. Image Areas - Based on classified image regions
7.5 Precision vs. recall based on annotated image regions
7.6 Precision vs. recall based on classified image regions
7.7 Precision vs. recall without semantic modeling step
7.8 Schematic overview of all tested categorization approaches
7.9 SVM based scene retrieval with and without semantic modeling
7.10 Examples of mis-categorized scenes in each category
7.11 Normalized distance computation for scene ordering
7.12 Typicality transition from rivers/lakes to forests
7.13 Typicality transition from forests to mountains
7.14 Typicality transition from mountains to rivers/lakes
8.1 Instructions (in German) for Experiment 1: Categorization
8.2 Evaluation of Experiment 1
8.3 Unanimously categorized images
8.4 Images assigned to two categories
8.5 Images assigned to three categories
8.6 Images assigned to four categories
8.7 Instructions (in German) for Experiment 2: Typicality Ranking
8.8 Evaluation of Experiment 2
8.9 Comparison Experiment 1 vs. Experiment 2
9.1 Spearman's rank correlation: Prototype Approach with SSD
9.2 Spearman's rank correlation: Prototype Approach with PPD
9.3 Scatter plots: Prototype with SSD. Annotated image regions
9.4 Scatter plots: Prototype with PPD. Annotated image regions
9.5 Concept weights resulting from the optimization procedure for each category
9.6 Spearman's rank correlation: SVM Approach
9.7 Human Rankings: Top 10 ranked coasts-images
9.8 Prototype Approach with PPD, annotated image regions: Top 10 ranked coasts-images
9.9 Prototype Approach with PPD, classified image regions: Top 10 ranked coasts-images
9.10 Human Rankings: Top 10 ranked rivers/lakes-images
9.11 Prototype Approach with PPD, annotated image regions: Top 10 ranked rivers/lakes-images
9.12 Prototype Approach with PPD, classified image regions: Top 10 ranked rivers/lakes-images
A.1 Human Rankings: Top 10 ranked forests-images
A.2 Prototype Approach with PPD, annotated image regions: Top 10 ranked forests-images
A.3 Prototype Approach with PPD, classified image regions: Top 10 ranked forests-images
A.4 Human Rankings: Top 10 ranked plains-images
A.5 Prototype Approach with PPD, annotated image regions: Top 10 ranked plains-images
A.6 Prototype Approach with PPD, classified image regions: Top 10 ranked plains-images
A.7 Human Rankings: Top 10 ranked mountains-images
A.8 Prototype Approach with PPD, annotated image regions: Top 10 ranked mountains-images
A.9 Prototype Approach with PPD, classified image regions: Top 10 ranked mountains-images

List of Tables

4.1 Possible outcomes of the concept detection per image region
5.1 Comparison of approximate concept distributions
5.2 Best detectors for various grass queries
5.3 Comparison of serial vs. interleaved combination of the optimization stages
5.4 Comparison of the methods for performance prediction and optimization
6.1 Sizes of each concept class
6.2 Maximum kNN classification accuracies
6.3 SVM classification accuracies
6.4 Confusion matrix of the kNN concept classification (No location prior)
6.5 Confusion matrix of the kNN concept classification (Location prior)
6.6 Confusion matrix of the SVM concept classification
7.1 Categorization based on annotated image regions - SVM Approach
7.2 Categorization based on annotated image regions - Prototype Approach
7.3 Categorization based on classified image regions - SVM Approach
7.4 Categorization based on classified image regions - Prototype Approach
7.5 Categorization without semantic modeling step - SVM Approach
7.6 Categorization without semantic modeling step - Prototype Approach
7.7 Equal error rate performance for SVM approach
7.8 Equal error rate performance for Prototype approach
8.1 Inter-rater reliabilities for Experiment 1 and Experiment 2
9.1 Best correlation rs in each category using the SSD
9.2 Best correlation rs in each category using the PPD
9.3 Best correlation rs in each category with the SVM Approach

Bibliography

[Adams et al., 2003] W. Adams, G. Iyengar, C.-Y. Lin, M. Naphade, C. Neti, H. Nock, and J. Smith. Semantic indexing of multimedia content using visual, audio and text cues. EURASIP Journal on Applied Signal Processing , vol. 2, pp. 170–185, 2003. [Aksoy et al., 2000] S. Aksoy, M. Ye, M. Schauf, M. Song, Y. Wang, R. Haralick, J. Parker, J. Pivovarov, D. Royko, C. Sun., and G. Farnebäck. Algorithm performance contest, 2000. In conjunction with ICPR 2000. [Barnard and Forsyth, 2001] K. Barnard and D. Forsyth. Learning the semantics of words and pictures. In International Conference on Computer Vision ICCV’01 . Vancouver, Canada, July 2001. [Barnard et al., 2002] K. Barnard, P. Duygulu, N. de Freitas, and D. Forsyth. Object recognition as machine translation - part 2: Exploiting image data-base clustering models. In European Conference on Computer Vision ECCV’02 . Copenhagen, Denmark, May 2002. [Barnard et al., 2003] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. Blei, and M. Jordan. Matching words and pictures. Journal of Machine Learning Research, vol. 3, pp. 1107–1135, 2003. [Biedermann, 1995] I. Biedermann. Visual object recognition. In S. Kosslyn and D. N. Osherson (eds.), An invitation to cognitive science: visual cognition. MIT Press, 1995. [Bortz, 1999] J. Bortz. Statistik für Sozialwissenschaftler . Springer, 5th edn., 1999. [Boutell et al., 2004] M. Boutell, J. Luo, X. Shen, and C. Brown. Learning multi-label scene classification. Pattern Recognition, vol. 37, no. 9, pp. 1757–1771, September 2004. [Bowyer and Phillips, 1998] K. Bowyer and P. Phillips (Eds.). Empirical Evaluation Techniques in Computer Vision. IEEE Computer Society Press, 1998. [Buswell, 1935] G. Buswell. How people look at pictures. University of Chicago Press, 1935. 135

136

Bibliography

[Chang and Lin, 2001] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available. URL http://www.csie.ntu.edu.tw/ cjlin/libsvm [Cheeseman et al., 1988] P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. AutoClass: A bayesian classification system. In International Conference on Machine Learning ICML’88 , pp. 54–64. Ann Arbor, MI, USA, June 1988. [Christensen et al., 1996] H. Christensen, W. Förstner, and C. Madsen (Eds.). Workshop on Performance Characteristics of Vision Algorithms. Cambridge, United Kingdom, April 1996. [CIVR, 2004] CIVR. International Conference on Image and Video Retrieval CIVR. July 2004. LNCS 3115, Springer, Dublin, Ireland. [Clark and Courtney, 1999] A. Clark and P. Courtney (Eds.). Workshop on Performance Characterisation and Benchmarking of Vision Systems. Las Palmas, Spain, January 1999. [Comaniciu and Meer, 1999] D. Comaniciu and P. Meer. Mean shift analysis and applications. In IEEE International Conference on Computer Vision ICCV’99 , pp. 1197–1203. Kerkyra, Greece, September 1999. [Comaniciu and Meer, 2002] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, May 2002. [Courtney and Thacker, 2001] P. Courtney and N. Thacker. Performance characterisation in computer vision: The role of statistics in testing and design. In J. BlancTalon and D. Popescu (eds.), Imaging and Vision Systems: Theory, Assessment and Applications. NOVA Science Books, 2001. [Duygulu et al., 2002] P. Duygulu, K. Barnard, J. de Freitas, and D. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In European Conference on Computer Vision ECCV’02 . Copenhagen, Denmark, May 2002. [Eakins and Graham, 1999] J. Eakins and M. Graham. Content-based image retrieval, a report to the JISC Technology Applications Programme. Tech. rep., Institute for Image Data Research, University of Northumbria at Newcastle, 1999. [Fauqueyr and Boujemaa, 2003] J. Fauqueyr and N. Boujemaa. New image retrieval paradigm: logical composition of region categories. In International Conference on Image Processing ICIP’03 . Barcelona, Spain, October 2003. [Feng et al., 2004] S. Feng, R. Manmatha, and V. Lavrenko. Multiple bernoulli relevance models for image and video annotation. In Conference on Image and Video Retrieval CIVR’04 . Dublin, Ireland, July 2004. [Feng et al., 2003] X. Feng, J. Fang, and G. Qiu. Color photo categorization using compressed histograms and support vector machines. In International Conference on Image Processing ICIP’03 . Barcelona, Spain, September 2003.

Bibliography

137

[Ferryman and Crowley, 2004] J. Ferryman and J. Crowley (Eds.). Sixth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance PETSECCV . Prague, Czech Republic, May 2004. [Flickner et al., 1995] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: the QBIC system. IEEE Computer , vol. 28, no. 9, pp. 23–32, September 1995. [Förstner, 1996] W. Förstner. 10 pro’s and con’s against performance characterization of vision algorithms. In Workshop on Performance Characteristics of Vision Algorithms. Cambridge, United Kingdom, April 1996. [Freund and Schapire, 1997] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and System Sciences, vol. 55, no. 1, pp. 119–139, August 1997. [Friedman, 1979] A. Friedman. Framing pictures: The role of knowledge in automatized encoding and memory for gist. Journal of Experimental Psychology: General , vol. 108, pp. 316–355, 1979. [Haralick et al., 1998] R. Haralick, R. Klette, S. Stiehl, and M. Viergever (Eds.). Evaluation and Validation of Computer Vision Algorithms, no. 98111 in Dagstuhl Seminar. Dagstuhl, Germany, March 1998. URL http://www.dagstuhl.de/data/seminars/98/ [Haralick, 1985] R. Haralick. Computer vision theory: The lack thereof. In Third Workshop on Computer Vision: Representation and Control , pp. 113–121. Bellaire, MI, USA, 1985. [Henderson and Ferreira, 2004] J. Henderson and F. Ferreira. Scene perception for psycholinguists. In J. Henderson and F. Ferreira (eds.), The interface of language, vision, and action: Eye movements and the visual world , pp. 1–58. Psychology Press, 2004. [Henderson, 2003] J. Henderson. Human gaze control during real-world scene perception. Trends in Cognitive Sciences, vol. 7, no. 11, pp. 498–504, November 2003. [Hofmann and Puzicha, 1998] T. Hofmann and J. Puzicha. Statistical models for cooccurrence data. Tech. Rep. A.I. Memo No. 1625, MIT, AI Lab, February 1998. [Hofmann et al., 1999] T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. In Advances in Neural Information Processing Systems 11 (NIPS’98), pp. 466–472. MIT Press, 1999. [Hoiem et al., 2004] D. Hoiem, R. Sukthankar, H. Schneiderman, and L. Huston. Object-based image retrieval using the statistical structure of images. In Conference on Computer Vision and Pattern Recognition CVPR’04 . Washington, D.C., June 2004. [Hsu and Lin, 2002] C.-W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

138

Bibliography

[Jain et al., 1995] R. Jain, R. Kasturi, and B. Schunck. Machine Vision. McGraw-Hill, Inc., 1995. [Joachims, 2002] T. Joachims. Learning to Classify Text using Support Vector Machines - Methods, Theory, and Algorithms. Kluwer Academic Publishers, 2002. [Kline, 2000] P. Kline. Handbook of Psychological Testing . Routledge, 2nd edn., 2000. [Konishi and Yuille, 2000] S. Konishi and A. Yuille. Statistical cues for domain specific image segmentation with performance analysis. In Conference on Computer Vision and Pattern Recognition CVPR’00 . Hilton Head, SC, USA, June 2000. [Kroschel, 1996] K. Kroschel. Statistische Nachrichtentheorie - Signal- und Mustererkennung, Parameter- und Signalschätzung . Springer, 1996. [Kumar and Hebert, 2003] S. Kumar and M. Hebert. Man-made structure detection in natural images using a causal multiscale random field. In Conference on Computer Vision and Pattern Recognition CVPR’03 , pp. 119–126. Madison, Wisconsin, 2003. [Lakoff, 1987] G. Lakoff. Women, Fire, and Dangerous Things - What Categories Reveal about the Mind . University of Chicago Press, 1987. [Lavrenko et al., 2003] V. Lavrenko, R. Manmatha, and J. Jeon. A model for learning the semantics of pictures. In 17th Annual Conference on Neural Information Processing Systems NIPS’03 . Vancouver, Canada, December 2003. [Leibe, 2004] B. Leibe. Interleaved Object Categorization and Segmentationl . Ph.D. thesis, ETH Zurich, Zurich, Switzerland, October 2004. [Lim and Jin, 2004] J.-H. Lim and J. Jin. Semantics discovery for image indexing. In European Conference on Computer Vision ECCV’04 . Prague, Czech Republic, May 2004. [Lim, 2001] J.-H. Lim. Building visual vocabulary for image indexation and query formulation. Pattern Analysis & Applications, vol. 4, pp. 125–139, 2001. [Lipson et al., 1997] P. Lipson, E. Grimson, and P. Sinha. Configuration based scene classification and image indexing. In Conference on Computer Vision and Pattern Recognition CVPR’97 , pp. 1007–1011. Puerto Rico, June 1997. [Li et al., 2002] F. Li, R. Van Rullen, C. Koch, and P. Perona. Rapid natural scene categorization in the near absence of attention. Proceedings of hte National Academy of Sciences of the USA, vol. 99, no. 14, pp. 9596–9601, July 2002. [Li and Wang, 2003] J. Li and J. Wang. Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1075–1088, September 2003. [Mao and Jain, 1992] J. Mao and A. Jain. Texture classification and segmentation using multi-resolution simultaneous autoregressive models. Pattern Recognition, vol. 25, no. 2, pp. 173–188, February 1992.

Bibliography

139

[Maron and Ratan, 1998] O. Maron and A. L. Ratan. Multiple-instance learning for natural scene classification. In International Conference on Machine Learning ICML’98 , pp. 341–349. Morgan Kaufmann, San Francisco, CA, 1998. [Ma and Manjunath, 1999] W.-Y. Ma and B. Manjunath. Netra: A toolbox for navigating large image databases. Multimedia Systems, vol. 7, no. 3, pp. 184–198, March 1999. [McCotter et al., 2004] M. McCotter, F. Gosselin, P. Sowden, and P. Schyns. The use of visual information in natural scenes. Visual Cognition, 2004. In Press. [Mervis and Pani, 1980] C. Mervis and J. Pani. Acquisition of basic object categories. Cognitive Psychology , vol. 12, pp. 496–522, 1980. [Minka and Picard, 1997] T. Minka and R. Picard. Interactive learning using a society of models. IEEE Transactions on Pattern Recognition and Machine Intelligence, vol. 30, no. 4, April 1997. [Moghaddam et al., 2001] B. Moghaddam, H. Viermann, and D. Margaritis. Regionsof-interest and spatial layout for content-based image retrieval. Multimedia Tools and Applications, vol. 14, no. 2, pp. 201–210, June 2001. [Mojsilovic and Rogowitz, 2001] A. Mojsilovic and B. Rogowitz. Capturing image semantics with low-level descriptors. In International Conference on Image Processing ICIP’01 . Thessaloniki, Greece, October 2001. [Mojsilovic et al., 2004] A. Mojsilovic, J. Gomes, and B. Rogowitz. Semantic-friendly indexing and querying of images based on the extraction of the objective semantic cues. International Journal of Computer Vision, vol. 56, no. 1/2, pp. 79–107, January 2004. [Müller et al., 2001] H. Müller, W. Müller, D. Squire, S. Marchand-Maillet, and T. Pun. Performance evaluation in content-based image retrieval: Overview and proposals. Pattern Recognition Letters, vol. 22, pp. 593–601, 2001. [Murphy, 2002] G. L. Murphy. The Big Book of Concepts. MIT Press, 2002. [Naphade and Huang, 2001] M. Naphade and T. Huang. A probabilistic framework for semantic video indexing, filtering, and retrieval. IEEE Transactions on Multimedia, vol. 3, no. 1, pp. 141–151, January 2001. [Oliva and Schyns, 2000] A. Oliva and P. Schyns. Diagnostic color blobs mediate scene recognition. Cognitive Psychology , vol. 41, pp. 176–210, 2000. [Oliva and Torralba, 2001] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, March 2001. [Oliva and Torralba, 2002] A. Oliva and A. Torralba. Scene-centered description from spatial envelope properties. In Second Workshop on Biologically Motivated Computer Vision BMCV’02 . Tübingen, Germany, November 2002.

140

Bibliography

[Oliva et al., 1999] A. Oliva, A. Torralba, A. Guerin-Dugue, and J. Herault. Global semantic classification of scenes using power spectrum templates. In Challenge of Image Retrieval CIR. Newcastle, UK, 1999. [Phillips et al., 2000] P. Phillips, H. Moon, P. Rauss, and S. Rizvi. The FERET evaluation methodology for face recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, October 2000. [Picard and Minka, 1995] R. Picard and T. Minka. Vision texture for annotation. ACM Journal of Multimedia Systems, 1995. [Picard, 1995] R. Picard. Toward a visual thesaurus. In Workshop in Computing, MIRO’95 . Glasgow, UK, September 1995. Invited Paper. [Potter, 1976] M. Potter. Short-term conceptual memory for pictures. Journal of Experimental Psychology: Human Learning and Memory , vol. 2, pp. 509–522, 1976. [Price, 1985] K. Price. I’ve seen your demo: so what? In Third Workshop on Computer Vision: Representation and Control , pp. 122–124. 1985. [Rogowitz et al., 1997] B. Rogowitz, T. Frese, J. Smith, C. Bouman, and E. Kalin. Perceptual image similarity experiments. In SPIE Conference on Human Vision and Electronic Imaging , pp. 576–590. San Jose, California, January 1997. [Rosch and Mervis, 1975] E. Rosch and C. Mervis. Family resemblance: Studies in the internal structure of categories. Cognitive Psychology , vol. 7, pp. 573–605, 1975. [Rosch et al., 1976] E. Rosch, C. Simpson, and R. Miller. Structural bases of typicality effects. Journal of Experimental Psychology: Human Perception and Performance, vol. 2, pp. 491–502, 1976. [Rosch, 1978] E. Rosch. Principles of categorization. In E. Rosch and B. Lloyd (eds.), Cognition and categorization. Erlbaum, 1978. [Rui et al., 1999] Y. Rui, T. Huang, and S. Chang. Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, vol. 10, pp. 39–62, October 1999. [Schiele and Vogel, 2000] B. Schiele and J. Vogel. Vocabulary-supported image retrieval. In First DELOS Workshop on Information Seeking, Searching and Querying in Digital Libraries. Zurich, Switzerland, December 2000. [Schmid, 2004] C. Schmid. Weakly supervised learning of visual models and its application to content-based image retrieval. International Journal of Computer Vision, vol. 56, no. 1/2, pp. 7–16, January/February 2004. [Schyns and Oliva, 1994] P. Schyns and A. Oliva. From blobs to boundary edges: evidence for time- and spatial-scale dependent scene recognition. Psychological Science, vol. 5, pp. 195–200, 1994. [Serrano et al., 2004] N. Serrano, A. Savakis, and J. Luo. Improved scene classification using efficient low-level features and semantic cues. Pattern Recognition, vol. 37, no. 9, pp. 1773–1784, September 2004.


[Shi and Malik, 1997] J. Shi and J. Malik. Normalized cuts and image segmentation. In Conference on Computer Vision and Pattern Recognition CVPR'97. Puerto Rico, June 1997.
[Sivic and Zisserman, 2003] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision ICCV'03. Nice, France, October 2003.
[Sivic and Zisserman, 2004] J. Sivic and A. Zisserman. Video data mining using configurations of viewpoint invariant regions. In Conference on Computer Vision and Pattern Recognition CVPR'04. Washington, D.C., June 2004.
[Sivic et al., 2004] J. Sivic, F. Schaffalitzky, and A. Zisserman. Object level grouping for video shots. In European Conference on Computer Vision ECCV'04. Prague, Czech Republic, May 2004.
[Smeulders et al., 2000] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, December 2000.
[Smith, 1998] J. R. Smith. Image retrieval evaluation. In Workshop on Content-based Access of Image and Video Libraries CAIVL'98. Santa Barbara, California, June 1998.
[Swain and Ballard, 1991] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991.
[Szummer and Picard, 1998] M. Szummer and R. Picard. Indoor-outdoor image classification. In Workshop on Content-based Access of Image and Video Databases. Bombay, India, January 1998.
[The Benchathlon Network] The Benchathlon Network. Home of CBIR Benchmarking. URL http://www.benchathlon.net
[Tieu and Viola, 2004] K. Tieu and P. Viola. Boosting image retrieval. International Journal of Computer Vision, vol. 56, no. 1/2, pp. 17–36, January/February 2004.
[Town and Sinclair, 2000] C. Town and D. Sinclair. Content based image retrieval using semantic visual categories. Tech. Rep. 2000.14, AT&T Laboratories Cambridge, 2000.
[TREC, Video Track, 2001] TREC, Video Track. The Tenth Text REtrieval Conference, Video Track. 2001.
[TREC, 2003] TREC. The Twelfth Text REtrieval Conference, NIST Special Publication 500-255. 2003.
[Tversky and Hemenway, 1983] B. Tversky and K. Hemenway. Categories of environmental scenes. Cognitive Psychology, vol. 15, pp. 121–149, 1983.
[Vailaya et al., 1996] A. Vailaya, A. Jain, and H. Zhang. Video clustering. Tech. Rep. MSU-CPS-96-64, Michigan State University, 1996.


[Vailaya et al., 1998] A. Vailaya, A. Jain, and H. Zhang. On image classification: City vs. landscape. Pattern Recognition, vol. 31, no. 12, pp. 1921–1935, December 1998.
[Vailaya et al., 2001] A. Vailaya, M. Figueiredo, A. Jain, and H. Zhang. Image classification for content-based indexing. IEEE Transactions on Image Processing, vol. 10, no. 1, pp. 117–130, January 2001.
[Vapnik, 1995] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[Veltkamp and Tanase, 2001] R. Veltkamp and M. Tanase. Content-based image retrieval systems: A survey. Tech. rep., Department of Computer Science, Utrecht University, March 2001. URL http://www.aa-lab.cs.uu.nl/cbirsurvey/cbir-survey/
[Vogel, 2000] F. Vogel. Beschreibende und schließende Statistik. Formeln, Definitionen, Erläuterungen, Stichwörter und Tabellen. R. Oldenbourg Verlag, 12th edn., 2000.
[Vogel and Schiele, 2001] J. Vogel and B. Schiele. Performance prediction for vocabulary-supported image retrieval. In IEEE International Conference on Image Processing ICIP'01. Thessaloniki, Greece, October 2001.
[Vogel and Schiele, 2002a] J. Vogel and B. Schiele. On performance categorization and optimization for image retrieval. In European Conference on Computer Vision ECCV'02, vol. IV, pp. 49–63. Copenhagen, Denmark, May 2002.
[Vogel and Schiele, 2002b] J. Vogel and B. Schiele. Query-dependent performance optimization for vocabulary-supported image retrieval. In Pattern Recognition Symposium DAGM'02. Zurich, Switzerland, September 2002.
[Vogel and Schiele, 2004a] J. Vogel and B. Schiele. Natural scene retrieval based on a semantic modeling step. In International Conference on Image and Video Retrieval. Dublin, Ireland, July 2004.
[Vogel and Schiele, 2004b] J. Vogel and B. Schiele. A semantic typicality measure for natural scene categorization. In Pattern Recognition Symposium DAGM'04. Tübingen, Germany, September 2004.
[Walker Renninger and Malik, 2004] L. Walker Renninger and J. Malik. When is scene identification just texture recognition? Vision Research, vol. 44, pp. 2301–2311, 2004.
[Wang and Zhang, 2001] Y. Wang and H. Zhang. Content-based image orientation detection with support vector machines. In Workshop on Content-Based Access of Image and Video Libraries CBAIVL'01. Kauai, Hawaii, USA, December 2001.
[Zhang, 2001] H.-J. Zhang. Relevance feedback in content-based image search. In Conference on New Information Technology NIT'01. Beijing, China, May 2001. Invited keynote.


[Zhang and Zhang, 2004] R. Zhang and Z. Zhang. Hidden semantic concept discovery in region-based image retrieval. In Conference on Computer Vision and Pattern Recognition CVPR'04. Washington, D.C., June 2004.

Curriculum Vitae

Julia Vogel

Date of birth: November 23, 1973
Place of birth: Cologne, Germany
Citizenship: German

Education:

2000–2004  Doctoral student at the Swiss Federal Institute of Technology (ETH) Zurich, Perceptual Computing and Computer Vision Group.

1996–1998  Studies of Electrical and Computer Engineering, Oregon State University, Corvallis, Oregon, USA. Graduation with the degree M.S. in Electrical and Computer Engineering.

1993–2000  Studies of Electrical Engineering, University of Karlsruhe, Germany. Graduation with the degree Dipl.-Ing. Elektrotechnik.

1980–1993  Primary school and high school in Bamberg, Germany.

Professional Experience:

2000–2004  Research and Teaching Assistant, Perceptual Computing and Computer Vision Group, ETH Zurich.

1998–1999  Research Assistant, Institute for Communications, University of Karlsruhe.

1998  Internship at Rohde & Schwarz, Munich.

1997–1998  Research Assistant, Wireless Communications Group, Oregon State University.
