Task Description for the PASCAL Challenge on Evaluating Ontology Learning and Population from Text

Contact person: Marko Grobelnik, JSI, [email protected]

Contributions by:
Philipp Cimiano, AIFB, [email protected]
Eric Gaussier, XRCE, [email protected]
Paul Buitelaar, DFKI, [email protected]
Blaz Novak, JSI, [email protected]
Janez Brank, JSI, [email protected]
Michael Sintek, DFKI, [email protected]

Introduction

Ontologies are formal, explicit specifications of shared conceptualizations, representing concepts and their relations that are relevant to a given domain of discourse. Currently, ontologies are mostly developed and used through a manual process, which is very ineffective and poses major barriers to their large-scale use in areas such as Knowledge Discovery and the Semantic Web. As human language is a primary mode of knowledge transfer, linguistic analysis of relevant documents seems a viable option for this purpose. More precisely, automation of ontology construction (ontology learning) and use (ontology population through knowledge markup) can be implemented by a combined use of linguistic analysis and machine learning approaches to text mining. Automatic methods for text-based ontology learning and population have developed over recent years, e.g. the results of the ECAI-2000 [1], IJCAI-2001 [2] and ECAI-2002 [3] workshops on Ontology Learning, the KCAP-2001 [4], ECAI-2002 [5], KCAP-2003 [6], ISWC-2004 [7] and ISWC-2005 [8] workshops on Knowledge Markup, the BioCreative [9] effort in the biological domain, and the ECAI-2004 [10] workshop on Ontology Learning and Population. However, two major challenges remain:

• quantitative evaluation of the accuracy of extracted ontology classes, attributes and instances;
• development of a well-grounded learning framework for the task.

Both are central issues: it is currently very hard to compare methods and approaches due to the lack of a shared understanding of the task and its subtasks. The proposed challenge will therefore be concerned with developing such a shared understanding through the definition of a set of clear subtasks, the identification of the resources they require, the organization of a corresponding evaluation exercise, and the development of a complete learning framework.

Ontologies

There are a number of different definitions in the literature of what constitutes an ontology, some of which are discussed in [Guarino97]. The most widely accepted definition originates with [Gruber94]: “An ontology is an explicit, formal specification of a shared conceptualization of a domain of interest.”

[1] http://ol2000.aifb.uni-karlsruhe.de/
[2] http://ol2001.aifb.uni-karlsruhe.de/home.html
[3] http://www-sop.inria.fr/acacia/WORKSHOPS/ECAI2002-OLT/
[4] http://semannot2001.aifb.uni-karlsruhe.de/
[5] http://saakm2002.aifb.uni-karlsruhe.de/programme.html
[6] http://km.aifb.uni-karlsruhe.de/ws/semannot2003
[7] http://km.aifb.uni-karlsruhe.de/ws/semannot2004
[8] http://km.aifb.uni-karlsruhe.de/ws/semannot2005
[9] http://www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html
[10] http://olp.dfki.de/ecai04/cfp.htm

‘Formal’ refers to the fact that the ontology should be machine-readable. ‘Shared’ reflects the notion that an ontology captures consensual knowledge, that is, knowledge that is not private to some individual but accepted by a group. The reference to ‘a domain of interest’ indicates that one is not concerned with modeling the whole world, but rather with modeling just what is relevant to the task at hand. Here, we assume a definition of an ontological model that is valid for current knowledge representation languages such as OWL [11] or RDF(S) [12]. The core “ingredients” of such an ontology are a set of concepts and the relations between them. Learning these from a given data set (e.g. a document collection) is referred to as ontology learning. Ontologies formalize the intensional aspects of a domain. The extensional part is provided by a knowledge base, which contains assertions about instances of the concepts and relations. The process of defining and instantiating a knowledge base is referred to as ontology population.
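To make the split between the intensional ontology and its extensional knowledge base concrete, the following minimal Python sketch models both parts; the class and field names are purely illustrative and are not tied to the OWL or RDF(S) vocabularies.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    """Intensional part: concepts and (here, only taxonomic) relations between them."""
    concepts: set = field(default_factory=set)
    subclass_of: dict = field(default_factory=dict)   # child concept -> parent concept

@dataclass
class KnowledgeBase:
    """Extensional part: assertions about instances of the concepts."""
    ontology: Ontology
    instance_of: dict = field(default_factory=dict)   # instance -> concept

# Ontology learning would build the Ontology object from text;
# ontology population would fill the KnowledgeBase with instance assertions.
onto = Ontology({"location", "country", "city"},
                {"country": "location", "city": "location"})
kb = KnowledgeBase(onto, {"France": "country", "Paris": "city"})
```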

Description of the Tasks

In order to advance towards ontology learning while staying close to real-world needs, which require robust tools able to deal with large amounts of data, we propose a first set of tasks that address the first step in ontology development: the acquisition and population of a taxonomic backbone (i.e. a set of concepts and the hierarchical relations between them). Here, we consider two different types of resources: a general topic taxonomy and an ontology from the domain of tourism. In subsequent tasks, to be defined in the context of the next OLP challenges, we will also be concerned with the acquisition and population of other (non-taxonomic) relations between concepts. These tasks aim at evaluating various aspects of tools for assisting users in the development of real-world ontologies.

Part I - Induction and population of a topic taxonomy

This first set of tasks is based on DMOZ [13], which has become a de facto standard in topic taxonomies. The DMOZ taxonomy contains fifteen high-level categories, corresponding to a broad division of the domains covered (e.g. Arts, Business, Science, …), with each such category being divided into subcategories. The average depth of each high-level category is around 7, and the total number of categories amounts to ca. 700,000. Participants will be provided with a sub-part of the complete taxonomy (namely the “Science” subtree, containing approx. 10,000 categories and 100,000 documents), on which they will have to perform the following tasks.

Task 1.1: Induction of a topic taxonomy (reconstruction of an existing topic taxonomy)

This task consists of inferring a taxonomy of topics from a collection of documents. Participants are asked to provide a list of topic classes, structured in the form of a tree, as well as assignments of each document to topic classes. Two files should thus be provided by participants:

• a file indicating the structure underlying the inferred topic classes; this file should contain, on each line, a class identifier followed by the list of its descendants;
• a file describing which topic class(es) a given document belongs to; each line should contain a class identifier, followed by the list of documents the class contains.

For example, assuming documents D1 and D3 deal with the Iterated Prisoner's Dilemma while documents D2 and D4 deal with Lindenmayer Systems, participants will receive as input:

Description of D1, D2, D3, D4

where the description of a document comprises: (a) its URL, (b) its title and a short description (approx. one sentence), (c) the content of the document itself, (d) snippets of web pages that link to this document (as returned by Google), and (e) snippets from other related web pages (as returned by Google). Participants are then asked to provide:

[11] http://www.w3.org/TR/owl-features/
[12] http://www.w3.org/TR/rdf-schema/
[13] http://dmoz.org

C0: C1,C2
C1: D1,D3
C2: D2,D4

Task 1.2: Populating an existing taxonomy

This task consists of relating documents from a given collection to nodes of a topic taxonomy. The following information about the taxonomy will be provided to the participants:

• names and brief descriptions of all the topics;
• the structure of the taxonomy (parent-child relationships between topics);
• a set of documents; these will be the documents actually occurring in the DMOZ taxonomy.

For each document, the same information as in Task 1.1 will be provided. For a (small) subset of these documents, information about which topics they belong to will also be provided. Participants will thus have at their disposal both annotated and unannotated documents, the unannotated documents potentially being used to leverage supervised techniques based on the annotated documents (a framework known as semi-supervised learning). A different set of documents will be used as the test set for evaluation. Participants will have to provide, for each node of the taxonomy, the list of documents they believe are related to it. For this task, and using the example above, participants will receive as input:

C0: Artificial Life
C1: Iterated Prisoner's Dilemma
C2: Lindenmayer Systems
C0: C1,C2
Description of D1, D2, D3, D4

and are asked to provide:

C1: D1,D3
C2: D2,D4

Task 1.3: Class naming

This task aims at evaluating whether or not current systems are able to provide synthetic descriptions of nodes in a given topic taxonomy. It directly relates to the capacity of systems to interact smoothly with users. Participants are provided with a sub-part of the topic taxonomy, together with the related documents, and are asked to provide a short description (of no more than 20 words) of each node. Again using the above example, participants will receive as input:

C0: C1,C2
C1: D1,D3
C2: D2,D4
Description of D1, D2, D3, D4

and are asked to provide:

C1: "iterated prisoner's dilemma"
C2: "Lindenmayer systems"
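All of the Part I deliverables above share a simple line-oriented format: an identifier followed by a comma-separated list. The sketch below shows one way a participant might write such files; the file names and exact spacing are our assumptions based on the examples, not an official specification.

```python
def write_listing(path, mapping):
    """Write one 'identifier: item,item,...' line per key, e.g. the taxonomy
    structure file (class -> child classes) or the assignment file
    (class -> documents) of Tasks 1.1 and 1.2."""
    with open(path, "w", encoding="utf-8") as f:
        for key, items in sorted(mapping.items()):
            f.write(f"{key}: {','.join(items)}\n")

# Expected output of Task 1.1 for the running example:
write_listing("structure.txt", {"C0": ["C1", "C2"]})
write_listing("assignments.txt", {"C1": ["D1", "D3"], "C2": ["D2", "D4"]})
```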

Part II - Induction and population of a tourism ontology

Task 2.1: Ontology population

Participants are provided with a list of named entities to be assigned to concepts stemming from a tourism taxonomy. In particular, a subset of 107 concepts of the ontology (which consists of 681 concepts overall) will be used for this purpose. The corpus from which these named entities originate will also be provided in a linguistically preprocessed form; it consists of 1801 descriptions of destinations from LonelyPlanet. Participants are asked to assign each named entity to one of the 107 concepts. The results should be provided in a file where each line contains a named-entity identifier followed by the concept identifier it is related to. Participants will receive the following data for this task:

1) one file containing the structure of the underlying concept hierarchy;
2) one file containing the 1187 named entities to assign to their corresponding concepts;
3) one file containing the 107 target concepts;
4) the preprocessed corpus.

The file with the target concepts will look as follows:

country
river
hotel
region
mountain_range
…

And the list of instances to tag could look as follows:

France
Germany
Rhein
Seine
Alps
…

As a result of the task, we expect a file as follows:

France          country
Germany         country
Rhein           river
Seine           river
Alps            mountain_range
…
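To illustrate the kind of system Task 2.1 targets, here is a naive distributional baseline of our own (not a prescribed method): each entity is assigned to the target concept whose corpus contexts are most similar to the entity's contexts. The `sentences` variable (a list of token lists) stands in for the preprocessed corpus, whose actual delivery format is not reproduced here, and multi-word entities are not handled.

```python
from collections import Counter
import math

def context_vector(term, sentences, window=5):
    """Bag of words seen within `window` tokens of each mention of `term`."""
    vec = Counter()
    target = term.lower()
    for tokens in sentences:
        low = [t.lower() for t in tokens]
        for i, tok in enumerate(low):
            if tok == target:
                vec.update(low[max(0, i - window):i] + low[i + 1:i + 1 + window])
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign_entities(entities, concepts, sentences):
    """Return a dict entity -> concept, the pairing expected in the result file."""
    concept_vecs = {c: context_vector(c, sentences) for c in concepts}
    return {e: max(concepts,
                   key=lambda c: cosine(context_vector(e, sentences), concept_vecs[c]))
            for e in entities}
```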

Task 2.2: Concept formation

Participants are provided with the tourism corpus and a set of named entities. They are asked to generate concepts, i.e. groups of entities, and to label them with an appropriate name. Participants will thus get:

1) one file with the entities to be grouped;
2) the preprocessed corpus.

The result of the task will be two files:

1) The first file will contain a concept identifier as well as all the named entities contained in the extension of the concept.
2) The second file will contain the label for each concept.

Imagine the list of entities to be clustered is as follows:

France, Germany, Barcelona, Rhein, Seine, Loire, Spain, Ebro, Paris, Madrid, Berlin

Then the first file could be as follows:

Concept_ID1    France, Germany, Spain
Concept_ID2    Rhein, Loire, Seine, Ebro
Concept_ID3    Madrid, Barcelona, Paris, Berlin

And the second file would be something like:

Concept_ID1    country
Concept_ID2    river
Concept_ID3    city
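As an unofficial sketch of a Task 2.2 system, the code below represents each entity by the words observed around its corpus mentions, clusters the entities with scikit-learn's k-means, and labels each cluster with its most frequent context word as a crude concept name. The `entity_contexts` input (entity mapped to a non-empty string of context words) and the choice of k-means are assumptions, not part of the task definition.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def form_concepts(entity_contexts, n_clusters):
    """entity_contexts: dict entity -> string of context words (assumed non-empty).
    Returns (clusters, labels): the contents of the first and second result file."""
    entities = list(entity_contexts)
    X = TfidfVectorizer().fit_transform([entity_contexts[e] for e in entities])
    assignments = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    clusters = {}
    for entity, cluster in zip(entities, assignments):
        clusters.setdefault(f"Concept_ID{cluster + 1}", []).append(entity)
    labels = {cid: Counter(" ".join(entity_contexts[e] for e in ents).split()).most_common(1)[0][0]
              for cid, ents in clusters.items()}
    return clusters, labels
```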

While the results in the first file will be evaluated automatically, as described in the next section, our aim is to evaluate the labels manually by a set of human judges.

Task 2.3: Taxonomy extension

Participants are provided with a pruned version of the tourism taxonomy, the associated tourism corpus, and a set of entities to be grouped together. Conceptually, this task is an extension of Task 2.2. In addition to two files with the same structure and meaning as described above for Task 2.2, we expect a third file explicitly assigning a super-class from the pruned tourism taxonomy to each newly formed class (concept). Taking the above example output, we would thus expect a third file explicitly mentioning the superclasses, i.e.

Concept_ID1    region
Concept_ID2    natural_object
Concept_ID3    location
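For the third file, one possible (again unofficial) strategy is to attach each newly formed concept to the superclass of the pruned taxonomy whose known seed instances lie closest to the cluster, for example by nearest centroid over whatever vector representation the participant already uses; `vectorize` below is such a placeholder.

```python
import numpy as np

def attach_superclasses(clusters, seed_instances, vectorize):
    """clusters: concept id -> list of entities (as in the first Task 2.2 file);
    seed_instances: superclass -> list of entities known to belong to it;
    vectorize: entity -> 1-D numpy array. Returns concept id -> superclass."""
    def centroid(entities):
        return np.mean([vectorize(e) for e in entities], axis=0)

    def cos(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

    super_centroids = {sc: centroid(ents) for sc, ents in seed_instances.items()}
    return {cid: max(super_centroids,
                     key=lambda sc: cos(centroid(ents), super_centroids[sc]))
            for cid, ents in clusters.items()}
```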

Evaluation

We will rely on different measures for the different tasks.

A. Evaluation of the concept formation task

In the case where the task (e.g. Task 2.2) does not involve structuring the underlying concepts/classes, we will directly evaluate the answers of the systems in terms of precision, recall and F-measure with respect to the gold standard, which will be released after the completion of the task. For a given concept c, let ext(c) be the extension of c, i.e. the set of all its instances. For two concepts c and c', we can measure their similarity by computing the overlap:

overlap(c, c') = |ext(c) ∩ ext(c')| / |ext(c) ∪ ext(c')|

Now we can define precision and recall by analogy with their usual information-retrieval definitions. For high recall, anything that has been defined as a concept c' in the gold standard ontology should also have been defined as some concept c in the participant's ontology:

recall = average_{c' ∈ gold standard} max_{c ∈ participant's ontology} overlap(c, c').

For high precision, it is the other way around: anything that has been defined as a concept c in the participant's ontology should correspond to some actual concept c' in the gold standard ontology:

precision = average_{c ∈ participant's ontology} max_{c' ∈ gold standard} overlap(c, c').

As usual in information retrieval, precision and recall can be combined into their harmonic mean, the F-measure:

F1 = 2 ⋅ precision ⋅ recall / (precision + recall).

B. Evaluation of ontology population tasks

In the case where the task also involves adequately placing instances/documents into given, structured concepts/classes (e.g. Tasks 1.2 and 2.1), we will use the standard contingency-table definitions of recall and precision. For a concept c, let ext_G(c) be the extension of c in the gold standard ontology, and ext_P(c) the extension of c in the ontology populated by the participant's algorithm. Then we define the contingency table for c:

TP(c) = |ext_G(c) ∩ ext_P(c)|,
TN(c) = |(ext_G(c) ∪ ext_P(c))ᶜ|   (the instances in neither extension),
FP(c) = |ext_P(c) − ext_G(c)|,
FN(c) = |ext_G(c) − ext_P(c)|.

recall(c) = TP(c) / (TP(c) + FN(c)),
precision(c) = TP(c) / (TP(c) + FP(c)),
F1(c) = 2 ⋅ precision(c) ⋅ recall(c) / (precision(c) + recall(c)).

Aggregate scores can be computed either by microaveraging or by macroaveraging. The macroaveraged recall, precision, or F1 is simply the average of recall(c), precision(c), or F1(c) over all concepts c. The microaveraged scores, on the other hand, are computed by first summing the contingency tables of all categories, and then computing recall, precision and F1 from the resulting aggregate contingency table:

µ-recall = [Σ_c TP(c)] / [Σ_c TP(c) + Σ_c FN(c)],
µ-precision = [Σ_c TP(c)] / [Σ_c TP(c) + Σ_c FP(c)],
µ-F1 = 2 ⋅ µ-precision ⋅ µ-recall / (µ-precision + µ-recall).

In addition to the above measures, we will also use the learning accuracy and its symmetric variant, as well as, potentially, the edge measure, since they all allow for the fact that instances/documents can be tagged with concepts/classes at different levels of detail or granularity. For a given instance i, let p_i be the “predicted” concept (the one to which this instance has been assigned by the participant's model), and let c_i be the correct concept according to the gold standard. Furthermore, let f(a, b) be some measure of the similarity of concepts a and b. Then we can evaluate the participant's ontology population results by computing the average of f(p_i, c_i) over all instances i. For the function f, we will use one of the three functions LA, LA', and simEdge, defined below.

Learning Accuracy [Hahn et al. 1998]. We use the notation δ(a, b) to denote the distance (in terms of the number of edges traversed) between concepts a and b in the gold standard ontology. Furthermore, let top denote the root node of the ontology.

[Figure 1: three example configurations of the predicted concept p, the correct concept c and their least common subsumer a in the gold standard tree]

Panel 1: LA(p, c) = 2/3,  LA'(p, c) = 2/3,  simEdge = 2 ⋅ 3 – 1 = 5
Panel 2: LA(p, c) = 2/4,  LA'(p, c) = 2/4,  simEdge = 2 ⋅ 3 – 2 = 4
Panel 3: LA(p, c) = 2/6,  LA'(p, c) = 2/5,  simEdge = 2 ⋅ 3 – 3 = 3

Figure 1. An illustration of various measures of similarity between the predicted concept p and the correct concept c.

Let p be the “predicted” concept (the one to which an instance has been assigned by the model), and let c be the correct concept according to the gold standard. Furthermore, let a be the least common subsumer of p and c, i.e. their deepest common ancestor in the tree. Then Hahn's learning accuracy is defined as follows:

LA(p, c) = [δ(top, a) + 1] / [δ(top, c) + 1]                  if p is an ancestor of c (and a = p),
LA(p, c) = [δ(top, a) + 1] / [δ(top, a) + 2 δ(a, p) + 1]      otherwise.

Thus, a notable characteristic of this measure is that it does not take the distance δ(c, a) into account, except when the predicted concept p is itself an ancestor of the correct concept c.

Symmetric Learning Accuracy. This measure is symmetric in the sense that it treats p and c in the same way; it does not distinguish between the case where p is an ancestor of c and the case where it is not:

LA'(p, c) = [δ(top, a) + 1] / [δ(top, a) + δ(a, c) + δ(a, p) + 1].

In other words, one takes the set of ancestors of p and the set of ancestors of c; LA' is then simply the ratio of the size of their intersection to the size of their union. LA and LA' agree when p is an ancestor of c, as well as when it is not but p and c are on the same level of the tree (so that δ(a, p) = δ(a, c)).

Edge measure [Resnik 1995]. This is simply the distance (number of edges) from the predicted concept p to the correct concept c. If we want a measure where higher values mean better results (by analogy with the learning accuracy), we can subtract this distance from some suitable value, such as twice the maximum depth of the tree:

simEdge(p, c) = 2 MaxDepth − δ(p, c).

C. Evaluation of ontology induction tasks

For tasks that also involve the creation of the structure of concepts/classes (Tasks 1.1 and 2.3), we will additionally use the following measure. For an ontology T, let c_T(i) be the concept to which the instance i has been assigned in this ontology. Note that the symmetric learning accuracy LA' as defined above depends on the structure of the ontology; when computed in the ontology T, we will denote it by LA'_T(a, b).

We will use the following formula to measure the dissimilarity between two ontologies T and U built over the same set of instances I:

dissimilarity(T, U) = Σ_{i, j ∈ I, i ≠ j} |LA'_T(c_T(i), c_T(j)) − LA'_U(c_U(i), c_U(j))|.

Each term of this sum measures how much the two ontologies disagree on the relative placement of the instances i and j. Note that this measure can be seen as a generalization of the well-known Rand index, which is often used to compare (non-hierarchical) clusterings or partitions of a set. In the context of our evaluation, one of the ontologies T and U would be the gold standard and the other would be the ontology constructed by the participant. The “Science” subtree of DMOZ used in Task 1.1 contains approx. 100,000 documents, which means that there are approx. 10^10 pairs of documents, so the sum in the dissimilarity formula above has approx. 10^10 terms. Because it would be too time-consuming to evaluate so many terms, we will compute the dissimilarity on a random sample of 10^7 pairs of documents.

D. Evaluation of the class naming tasks

For the class naming tasks (Task 1.3 and the class-naming part of Task 2.2), evaluation will be performed by human judges, who will examine a random subset of the proposed class names and assign to each name a score from 1 to 10. The average of these scores will be used as the overall score of the proposed set of class names. Higher scores will be assigned to names that are not only pertinent to the topic of the class but also resemble coherent natural-language phrases, such as those actually used as topic names in DMOZ.
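To make the formulas above concrete, here is a small reference-style sketch (our own, not the challenge's official scorer) of two of the measures: the overlap-based precision/recall/F1 of Section A and the symmetric learning accuracy LA' of Section B, using the equivalent ancestor-set formulation given above.

```python
def overlap(ext_a, ext_b):
    """overlap(c, c') = |ext(c) ∩ ext(c')| / |ext(c) ∪ ext(c')| for two extensions (sets)."""
    union = ext_a | ext_b
    return len(ext_a & ext_b) / len(union) if union else 0.0

def overlap_prf(gold, predicted):
    """gold, predicted: dicts concept -> set of instances. Returns (precision, recall, F1)."""
    recall = sum(max(overlap(p, g) for p in predicted.values()) for g in gold.values()) / len(gold)
    precision = sum(max(overlap(p, g) for g in gold.values()) for p in predicted.values()) / len(predicted)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def ancestors(node, parent):
    """Set containing `node` and all its ancestors, for a tree given as child -> parent."""
    result = {node}
    while node in parent:
        node = parent[node]
        result.add(node)
    return result

def symmetric_learning_accuracy(p, c, parent):
    """LA'(p, c): ratio of shared ancestors to all ancestors of p and c (the nodes included)."""
    ancestors_p, ancestors_c = ancestors(p, parent), ancestors(c, parent)
    return len(ancestors_p & ancestors_c) / len(ancestors_p | ancestors_c)
```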

Schedule and General Remarks

For each task, participants can make use of any resource they want, with the exception of the parts of DMOZ and of the tourism ontology not delivered for the task. Please note that participants are asked to fully detail all the resources they use for a given task.

References

[Guarino97] N. Guarino. Understanding, building and using ontologies. International Journal of Human-Computer Studies, 46(2/3):293-310, 1997.
[Gruber94] T. Gruber. Towards principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies, 43(5/6):907-928, 1994.
[Hahn et al. 1998] U. Hahn and K. Schnattinger. Towards text knowledge engineering. In Proceedings of the 15th National Conference on Artificial Intelligence and the 10th Conference on Innovative Applications of Artificial Intelligence (AAAI/IAAI), pages 524-531, 1998.
[Resnik 1995] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pages 448-453, 1995.
