TDT4745 Kunnskapssystemer December 2005

Classification of text documents

Erik Rogstad Øystein Ulseth Supervisor: Tore Amble Co-supervisor: Till Christopher Lech

Acknowledgements We would like to thank our supervisor, Tore Amble, for feedback and contributions throughout the process. Also, we would like to thank CognIT for hosting us this semester. Finally, we would like to thank our co-supervisor, Till Christopher Lech, who has encouraged and pointed us in the right direction a number of times.


Contents

1 Introduction

I Literature review

2 Classification of documents
  2.1 Classification systems
  2.2 Human learning

3 Approaches
  3.1 Domain knowledge
    3.1.1 Content driven
    3.1.2 Goal driven
  3.2 Statistical approaches
    3.2.1 Vector Space Model
    3.2.2 Polysemy, synonymy and Latent Semantic Indexing
    3.2.3 Co-occurrences
    3.2.4 Clustering
  3.3 Rule based approaches
    3.3.1 FOPC and semantics
    3.3.2 Syntax-driven semantic analysis
    3.3.3 Constraint grammar
    3.3.4 Reference resolution
  3.4 Combined approach
    3.4.1 Text parsing
    3.4.2 Classification of documents

4 Relevant technologies
  4.1 The Oslo-Bergen tagger
    4.1.1 Methodology
  4.2 Corporum - The Kernel
    4.2.1 Kernel output

5 Summary
  5.1 Hypothesis

II Our approach

6 A method proposal
  6.1 Brief description of the method
  6.2 Domain representations
    6.2.1 Domain representation constituents
    6.2.2 Finding distinguishing features
  6.3 Measuring similarity

7 Building domain representations
  7.1 VSM based domain representations
    7.1.1 Success criteria
    7.1.2 Solution

8 Classification
  8.1 Classification using VSM domain representations
    8.1.1 Measuring similarity
    8.1.2 Justification of the method
    8.1.3 Threshold value

9 Results and evaluation
  9.1 General comments
  9.2 Biased training sets
  9.3 Indistinct domain representations

10 Improvements
  10.1 Coreferences
    10.1.1 Co-reference experiment

III Syntactical relations

11 Introduction

12 Practical experiment
  12.1 Input data
  12.2 Noun similarity
    12.2.1 Mutual Information and Hindle's co-occurrence score
    12.2.2 Noun-noun similarity
    12.2.3 Noun-verb dependencies
    12.2.4 Conclusion
  12.3 Noise in the output of the Oslo-Bergen tagger
    12.3.1 Passive voice
    12.3.2 Complementary sentences
    12.3.3 Collocations
    12.3.4 Is-relations

13 Conclusion

Chapter 1

Introduction

With the increasing flow of information today, it becomes more and more important to find good ways to organise and structure information. In the field of information technology, the well known folder analogy has been used for years to organise documents and files on hard drives. This analogy helps us draw on real world knowledge about how we organise documents in folders. The amount of information available on the Internet and on intranets seems overwhelming, and in order to find what you are looking for, there is clearly a need for organisation. Online newspapers such as www.vg.no and www.aftenposten.no have responded to this by introducing a menu on the left side of the front page that lets the user navigate the site. A typical item on most news sites' menus is Sports, which is subdivided into Football, Handball etc. Generally, the menu structure is organised as a hierarchy.

The main objective of this report is to present a method for text document classification, in other words how to give a document an adequate label (such as football). The report investigates existing methods and paradigms for natural language understanding and discusses how this knowledge can be applied to develop an effective method for document classification.

This report is divided into three main parts: a literature review (I), a proposal of a method that classifies text documents (II) and an experiment that assesses the utility value of syntactical relations in text documents (III).

The statistical approach and the rule based approach to information extraction are treated separately. Among the statistical techniques, we pay particular attention to the vector space model (VSM) and the related latent semantic indexing (LSI). In addition, clustering techniques and co-occurrence analysis are treated. Results from the ongoing Wortschatz Project at the University of Leipzig are reported, as it seems highly relevant for our project. In the rule based sections, the strengths and weaknesses of syntax-driven semantic analysis and FOPC are discussed, in addition to co-reference resolution as a tool to improve concept extraction. Because our implementation depends heavily on the Oslo-Bergen tagger, we have also chosen to elaborate on the constraint grammar formalism. Part I is summed up in two hypotheses which make up the foundation of part II and part III, respectively.

Part II proposes a method for text document classification based on the vector space model (VSM). The method is implemented in Java, and results from a test run are recorded and discussed. This part also includes a specific suggestion for improvement that involves co-reference resolution.

In part III, we report and discuss results from an experiment that aims to find out the information value of syntactical relations such as subject-verb, subject-verb-object and verb-object. We use the mutual information score to determine dependencies between pairs of nouns and noun-verb pairs. This experiment clears the way for future work.

Part I

Literature review


Chapter 2

Classification of documents

Document classification is the process of clustering documents with similar semantic content together. This process may depend on predefined categories or prior domain knowledge. Extracting important concepts from documents is a feasible task, but deciding upon a category is a more complex matter. In many cases, typical words like football or golf are not even present in the documents. However, a classification system should be able to understand that a document is dealing with golf even if the term golf is not mentioned at all (1), see 3.1. Further, it should also have the ability to distinguish very similar domains, like football and handball (2). Part II of this report describes a method that satisfies these two properties, and the implementation of the method has produced promising results.

A possible area of application for text classification is to create a hierarchical category structure by which text documents can be classified. Such a structure is similar to the topic menus used in online newspapers, where documents are grouped according to content. As the amount of online information is rapidly growing, such a technique would be very useful for handling and organising textual data.

In this chapter a brief introduction is given on the subject of classification. Section 2.1 explains different approaches of ordinary classification systems, whereas section 2.2 describes how humans interpret texts on the basis of background knowledge.

2.1

Classification systems

Classification systems are used to organise different kinds of data, e.g. books, articles, web pages, DVDs, CDs and so on. Over the years, a number of such systems have been proposed, and libraries in particular have demanded proper classification systems. In the English-speaking world, the two most common systems for library classification are the Library of Congress classification (LC) and the Dewey Decimal Classification (DDC). The first system was developed in 1897 by Herbert Putnam and divides subjects into broad categories like social science, language, literature and medicine. The latter system, DDC, was developed by Melvil Dewey in 1876. DDC consists of ten main classes, and each class is assigned a distinct numerical value from 000 to 999. Further, categories are placed under the classes and assigned decimal numbers. This allows the system to be an infinite hierarchy.

In general, classification systems can be divided into three types. The enumerative approach produces an alphabetic list of the knowledge categories. Using hierarchies, subjects can be structured and related from most general to most specific. In a faceted classification system, objects are assigned multiple classifications. This makes the approach very well suited to classification of objects in the online world: in contrast to a physical book, which has one and only one spot on the bookshelf, an article published on the web can be identified and located by multiple characteristics.

2.2

Human learning

A good place to start, in order to develop a computational classification system, is to look at how humans learn and how we apply knowledge and experience to recognise patterns in unknown situations [24]. Human learning processes build knowledge and integrate new knowledge into an existing knowledge base. The cognitive orientation to learning was a reaction to the behaviorist approach. Behaviorists interpreted learning as an extension of a set of stimulus-response units. Pavlov's famous experiment [24] with dogs and their digestive processes is the textbook example of such a unit. In the cognitive approach, learning is regarded as a mental activity involving internal coding of knowledge. Researchers like Jean Piaget identified different stages of mental growth, and figure 2.1 depicts such a learning process.

Figure 2.1: Human learning process [24].

It is possible to draw an analogy between the human learning process and classification of text documents. The hypothesis underlying the method described in part II is that text document classification depends to a great extent on prior knowledge, corresponding to the circle in figure 2.1. Such information is the basis for the analysis of unknown documents, similar to the process where humans make use of internal knowledge representations in order to comprehend new information. In other words, we believe that terms like kamp (eng.: match), mål (eng.: goal) and spiller (eng.: player) do not have any natural relation to the domain football unless such a relation exists in some kind of domain representation.


Chapter 3

Approaches

Is the meaning of a sentence given only by its compositional constituents, or is it necessary to take its structure into consideration as well? Approaching the document classification task requires an answer to this question. The matter of analysing text is broadly divided into two main directions, namely statistical analysis and rule based analysis. The main difference between the two is that the former is based on term frequencies and statistical calculations, whereas the latter to a greater extent focuses on word contexts and sentence structure. As an example, consider the following two sentences:

1. John pushed Mary.
2. Mary pushed John.

It should be clear that these two sentences have different meanings despite being constituted by the same words. A rule based method considering syntactical structures could discover the semantic difference, but a statistical word frequency method would not. This implies that using word frequencies for text analysis makes sense for a surface analysis such as concept extraction, but does not necessarily capture the meaning of the text. However, statistical analysis grasps wider than counting word frequencies. Probabilistic co-occurrence analysis examines relations between words, and latent semantic indexing (LSI) is a statistical method used to capture synonymy and polysemy relations. Still, rule based approaches tend to provide a more thorough sentence analysis, capturing syntactical and semantical features.

Regardless of the method at hand, one should also consider whether or not the system should rely on predefined domain knowledge. This is further elaborated in section 3.1. There are several interesting approaches in both the statistical and rule based directions. Section 3.2 and section 3.3 describe the approaches most relevant to our work. Section 3.4 considers a combination of the two approaches for text parsing.

3.1

Domain knowledge

Information extraction (IE) is the process of obtaining structured information from unstructured sources [8]. A very important property of an IE system is the level of information available before the analysis is carried out. Such information can be used to improve the analysis and is called domain knowledge. Consider the text fragment in figure 3.1. Key concepts in the two sentences are birdie, bogey, par, putt, round and shot. Even with a minimum of interest in sports, most humans would easily draw the conclusion that this article is about golf. However, with the term golf not even present in the text, how can an automated system come to the same conclusion? It is not difficult to convince ourselves that such reasoning is difficult without some sort of prior knowledge. Even though this is a simple example, the same holds for a large number of text articles. Indeed, most articles published on online news sites assume that the readers have a minimum of common domain knowledge.

Figure 3.1: Short golf notice.

Even though domain knowledge would in many cases improve the performance of an IE system, it is not always feasible to acquire and build a knowledge representation. The most important downside is the resource requirements, contradicting the very nature of an automatic system. Also, tying the system too tightly to a specific domain can represent a major drawback. With these considerations in mind, IE systems today can be divided into two categories: goal driven and content driven.

3.1.1

Content driven

Content driven systems make use of a bottom-up approach in order to build ontologies and taxonomies. The advantage of such an approach is its domain independence. Ideally, content driven analysis systems should be able to build perfect ontologies even if no domain specific knowledge is available. Returning to the golf example, is it possible for a bottom-up system to infer that all the key terms in the text fragment have the concept golf in common? One could argue that it is quite simply impossible without any prior knowledge. This deficiency renders the content driven approach inappropriate for applications that require a high level of recall and precision.

3.1.2

Goal driven

Goal driven systems apply a top-down approach to the analysis, which implies that more or less complete ontologies/taxonomies aid the analysis. The following example, courtesy of [8], illustrates the use of taxonomies in goal driven analysis. Say the system is extracting information from a telephone directory. A phone book is considered to be a quite structured data source, and one can expect to find information about persons. Each person has a name, a phone number and in some cases also an e-mail address. In terms of ontologies, we have now identified the PERSON class; the extraction system knows what to look for and thus applies a goal driven approach.

3.2

Statistical approaches

Statistical methods have one important property: the ability to sort out noise. This is particularly true for large data materials, and this property underlies a wide range of statistical analyses, such as hypothesis testing. Related to the topic of this report, the statistical approach proves to be very well suited to analysing huge text corpora. Section 3.2.3 discusses co-occurrence analysis in the Wortschatz project. The approach taken in that project is to examine co-occurring terms in a huge data material; by means of a huge corpus, noise is balanced out. These days computational power comes more or less for free, which has paved the way for a range of interesting techniques for information retrieval. The following sections present some methods that seem to be useful in relation to the document classification problem.

3.2.1

Vector Space Model

A fundamental quality of a system capable of classifying documents is the ability to identify content bearing terms in the text. The vector space model is an algebraic method for information retrieval that makes use of vectors to represent natural language documents [29]. The underlying idea of VSM is to cut off non-significant (function) terms and then assign each of the remaining words a weight. There are a number of different ways to remove function words from the text. With the use of stop lists, which hold common words, approximately 40-50% of the total words in a document are removed [29]. To avoid the language dependence of the stop list strategy, probabilistic methods have been applied with success. These methods are based on the assumption that content bearing words and function words are distributed differently in the documents. However, the method used to assign term weights is the critical factor for the effectiveness of VSM [20].

The VSM analysis takes as input a set of documents, and every term that occurs in one or more of the documents is included in the document vectors. Each vector consists of values representing every term that is present within the document collection. These values indicate presence/absence or a weight of a given term within a given document, and the vectors may be given by

    \vec{d}_j = (w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{n,j})    (3.1)

where the subscript j indicates that this vector represents document j in the collection. Alternatively, the document collection may be defined as a term by document matrix:

    W = [w_{i,j}]    (3.2)

In equation 3.1 the entry w_{i,j} is the weight of term i in document j. Note that term i does not necessarily appear in document j; thus w_{i,j} should be interpreted as how closely document j and term i are related. Intuitively, if a term i is not present in document j, the weight w_{i,j} should be assigned a value of zero. It is possible to use the dot product on normalized vectors to measure the similarity between documents in the collection.

The indication of presence or absence of a term is in most cases not sufficient to get good results. Indeed, experiments have shown that weighted frequencies outperform raw frequencies [30]. There are two main factors for term weighting: the term frequency within a document and the distribution of terms across the text document collection [20]. It is not difficult to convince ourselves that the raw term frequency plays a crucial role in the weighting: if a term is extremely rare or not present in the document at all, the weight of this term should be very low. On the other hand, some terms occur frequently across all documents in the collection. There are good reasons to believe that these terms are not content bearing, or at least that they are not suitable to discriminate one document from another. Terms that occur in only a few documents are therefore useful discrimination factors. The inverse document frequency captures this knowledge and is defined as follows [19]:

    idf_i = \frac{N}{n_i}    (3.3)

In equation 3.3, N is the number of documents in the collection and n_i is the number of documents that contain the term i. Clearly, the more documents a term occurs in, the lower its inverse document frequency; the lowest possible value is 1 (if the term appears in all documents). Because collections may consist of a large number of documents, the inverse document frequency is usually calculated by taking the logarithm of equation 3.3:

    idf_i = \log\left(\frac{N}{n_i}\right)    (3.4)

Finally, the weight w_{i,j} is defined as the product of the term frequency tf_{i,j} and the inverse document frequency of equation 3.4:

    w_{i,j} = tf_{i,j} \cdot idf_i    (3.5)

The vector space model is a simple but effective way to compare and discriminate documents. Another related and useful application is to determine the relevance of a document to a query, simply by representing the query string as a vector. On the downside, the model is first of all very computation intensive: every time a new term is added to the term space, the vectors have to be recalculated. In terms of query matching, most query strings are different, and thus frequent recalculations of the vectors are required. Further, due to the fundamental limitation of keyword driven models, VSM is subject to false matches. Potentially this could result in an analysis where two semantically different sentences are assigned the same meaning (e.g. Cricket is a sport and Cricket is not a sport).
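The weighting scheme in equations 3.1-3.5 translates directly into code. The following is a minimal Python sketch of tf-idf weighting and the normalised dot product comparison described above; it is an illustration only, not the Java implementation described in part II, and all names are our own.

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """Build tf-idf weighted document vectors (equations 3.1-3.5).

        docs: list of token lists, one list per document.
        Returns one {term: weight} dictionary per document.
        """
        N = len(docs)
        # n_i: the number of documents containing term i
        df = Counter(term for doc in docs for term in set(doc))
        vectors = []
        for doc in docs:
            tf = Counter(doc)
            vectors.append({term: freq * math.log(N / df[term])  # w_ij = tf_ij * idf_i
                            for term, freq in tf.items()})
        return vectors

    def cosine(u, v):
        """Similarity as the dot product of length-normalised vectors."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = lambda x: math.sqrt(sum(w * w for w in x.values())) or 1.0
        return dot / (norm(u) * norm(v))

Note that a term occurring in every document gets weight zero under equation 3.4, matching the intuition that such terms do not discriminate between documents.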

3.2.2

Polysemy, synonymy and Latent Semantic Indexing

A fundamental problem with frequency based analysis is the fact that there is usually more than one way to describe a concept. Indeed, more than 80% of the time, two people will not choose the same word for a well-known common concept [10]. This phenomenon, in which multiple words refer to the same object, is called synonymy (e.g. bus and coach). Closely related to synonymy is polysemy, a term used to refer to the fact that a word can have multiple meanings (e.g. act: to perform, or the official name of a law). As figure 3.2 illustrates, it is usually more complicated in practice.

Figure 3.2: Two fragments, the same story.

In these two fragments, internal bleeding and haemorrhage are obviously synonyms, regardless of the context. Surgeon and doctor are in many cases used interchangeably; in this context they refer to the same concept. The phrases fully recovered and get over bear the same meaning, but this relation is context dependent. It is worth noting that by representing the documents as weighted term by document vectors, it is possible to calculate the similarity between documents. If the analysis were capable of capturing semantics, the comparison between the two fragments in figure 3.2 should yield high similarity. Because VSM only takes the lexical representation of the terms into consideration, it fundamentally fails to capture the semantic similarity between the fragments. There is not enough information in these two fragments to state that they describe the same incident, but both report that a man has been injured and that he will recover from internal bleeding. From a query perspective, a query string consisting of haemorrhage and surgeon should match both fragments. In the context of this project, synonymy and polysemy are extremely important phenomena that need to be dealt with in order to enable proper document classification.

Latent semantic indexing (LSI) is based on the assumption that there is an underlying semantic structure in texts and that this structure is partially hidden by the randomness of word selection [10]. This structure is thought to be built-in or latent, hence the name of the method. From an IR perspective, the prevalence of synonymy reduces recall performance, while polysemy leads to false matches and thereby a degradation of precision. Let us revisit the golf example outlined in figure 3.1. Figure 3.3 indicates correlation between some terms in a fictitious document collection.

Figure 3.3: Semantic relations.

The left part of the ellipse representing round intersects with golf and putt. On the right side, round intersects ellipse and circle. It is worth noting that the two clusters (golf/putt and circle/ellipse) do not intersect. Based on these simple observations, it is reasonable to say that round has at least two distinct meanings (polysemy). Further, there is some kind of relation between round, putt and golf. Likewise, there is a similar relation between round, ellipse and circle. It makes sense to assume that documents containing the terms round and circle and documents containing ellipse are semantically related. However, they are not related to documents containing terms like round and putt, even though they share the lexical term round. A correlation structure analysis may help to overcome the problems caused by synonymy and polysemy in information retrieval [10].

The latent semantic analysis starts with a matrix made up of term by document vectors as described in 3.2.1. The matrix W is given by:

    W = [w_{i,j}]    (3.6)

By the use of a mathematical operation called Singular Value Decomposition (SVD), the matrix undergoes a dimension reduction. The mathematical details underlying the SVD analysis are presented in [10] and [2]. With the aid of SVD, the matrix W is decomposed into three special matrices. These matrices are a breakdown of the original relationships between terms and documents in W, and the relations are represented as k (typically 50-150) linearly independent components or factors [11]. It is worth noting that an approximation of the original matrix W can be constructed by means of linear combination. The idea is that by ignoring the smaller factors and reducing the dimension of the document representations, the semantic relations between documents will be exposed. In equation 3.6, the indexes i and j range over the unique terms and the documents in the collection, respectively. After the dimension reduction, vectors of factor values represent the documents. The dimension of these vectors is k, a much smaller value than the number of unique terms. In the reduced model, the closeness of documents is determined by the overall usage pattern and not the actual lexical terms used [11]. As a result, it is possible that two documents with somewhat different term usage, but a consistent semantic pattern, are mapped onto the same vector. Clearly, such a property is extremely desirable in systems for document classification.
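In code, the dimension reduction amounts to a truncated SVD of the term by document matrix. The sketch below uses numpy for illustration; the choice of k and the weighting of W are left to the caller, and the references above discuss them in more detail.

    import numpy as np

    def lsi_document_vectors(W, k):
        """Reduce a term-by-document matrix W to k latent factors via SVD.

        W: (terms x documents) weight matrix, e.g. tf-idf weighted.
        Returns a (documents x k) matrix of factor-value vectors.
        """
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        # Keep only the k largest singular values; the discarded small
        # factors are treated as noise in word usage.
        return (np.diag(s[:k]) @ Vt[:k, :]).T

Documents are then compared by cosine similarity in the k-dimensional factor space rather than on raw term overlap.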

3.2.3

Co-occurrences

Co-occurrence refers to the phenomenon that some words tend to occur together. In the context of natural language processing, co-occurrence measures can be used to establish word similarities and relationships between words. Examples of such co-occurrence measures include relationships between head words in syntactic constructions, like subject-verb, verb-object and adjective-noun structures, as well as word sequences and collocations [7].

Intuitively, there are two ways to go about co-occurrences in NLP systems. One approach is to look at syntactical relations, such as modifying adjectives, subject-verb and verb-object, in order to infer word similarities. Words that often occur in the same context (modified by the same words) are then rated similar. This is elaborated further in the section Syntactical relations. Another possibility is to take a broader approach and look for words that appear together in the same discourse, but that do not necessarily modify each other directly. This can be used to establish a relationship between the words, which in turn can be used for ontology building. This approach is described in the coming section Discourse co-occurrences. No matter which approach is taken, extensive data materials are needed in order to infer consistent word co-occurrences.

Discourse co-occurrences

Looking for words that co-occur within the same discourses is an approach that can be carried out in a purely statistical way. The Wortschatz Project at the University of Leipzig takes this stand and uses a brute force method in which a huge text corpus is analysed in order to capture repeated word co-occurrences. We will therefore describe this method in light of this project. The term brute force in this sense is meant to describe the fact that the method does not discriminate co-occurring words on the basis of part of speech information. It simply determines which words seem to co-occur with each other, either within a sentence or as a word pair (neighbours). This is opposed to using co-occurrences to investigate syntactical relations; in such cases one would pick out certain parts of speech to look for, such as subject-verb co-occurrences. This is elaborated further in the section Syntactical relations.

Co-occurrences in the Wortschatz project

The objective of the Wortschatz project is to provide text corpora for a variety of languages through their web services at http://corpora.informatik.uni-leipzig.de/. In order to enrich the plain text corpora with structure, they focus on language-independent methods without using manually developed resources. As they base their analysis on very large amounts of plain text data, effective data processing is extremely important [3]. Their data material is extracted from online newspapers and non-copyrighted literature. By crawling the web they aim to build comprehensive, rather than error-free, data collections. The errors, however, are balanced out by the increasing size of the corpus. Statistical methods as well as intellectual optimization routines eliminate errors [4].

The Wortschatz Project mainly relies on co-occurrences to detect semantic relations between the words of a language. The measure of co-occurrence is based on the statistical G-test for Poisson distributions. Given two words A and B occurring in a and b sentences, respectively, and in k sentences together, the significance sig(A, B) of their co-occurrence in a sentence is calculated as follows [4]:

    sig(A, B) = x - k \log x + \log k!    (3.7)

where x is defined as

    x = \frac{ab}{n}    (3.8)

with n being the number of sentences.
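Equations 3.7 and 3.8 translate directly into a few lines of code. In this sketch, log k! is computed through the log-gamma function to avoid overflowing factorials; the parameter names follow the definitions above.

    import math

    def cooccurrence_significance(a, b, k, n):
        """Significance of two words co-occurring in sentences (eqs. 3.7-3.8).

        a, b: number of sentences containing word A and word B, respectively
        k:    number of sentences containing both words
        n:    total number of sentences in the corpus
        """
        x = a * b / n  # expected number of co-occurrences (3.8)
        return x - k * math.log(x) + math.lgamma(k + 1)  # log k! = lgamma(k+1)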

These measures are used to define collocation sets for words in the corpus. To differentiate, the collocation sets are divided into sentence based collocations as well as immediate left and right neighbours of each word. An example of such collocation sets for the Norwegian word Universitetet is shown in figure 3.4.

Figure 3.4: Collocation sets for Universitetet [3].

If a part-of-speech (POS) tagger is available, it is used to improve the collocation sets. Words that have several parts of speech are given different collocation sets, differentiated by their POS tags. In an application searching for synonyms, the POS tag is used to reduce the candidate set so that it includes only candidates with similar POS tags [4].

The semantic relations inferred on the corpora through the language-independent methods are mainly used for ontology building. These ontologies are primarily co-occurrence based and consist of word nodes and relational edges. Based on the co-occurrence collocation sets, it is possible to establish the fact that universitet, fakultet, institutt and UiO are related terms. A graphical interface like the one in figure 3.5 is used to visualise these relations. One area of application for this system is assisting brainstorming in meetings: the system recognises speech, and when someone says a word, it can display a visual representation of the word and related terms that will stimulate the brainstorming process.

Figure 3.5: Visualisation of the term Universitetet and related terms [3].

Syntactical relations

Whereas the approach described in the previous section is purely statistically based, looking at co-occurrences in terms of syntactical relations requires at least a part of speech tagger. By using co-occurrences in this manner it is possible to establish word similarities to a certain extent. A possible application of such co-occurrences is to investigate the subject-adjective relations in a text corpus. One could compile a list of all nouns and the corresponding adjectives used to describe them, as shown in table 3.1.

    Noun       Co-occurring adjectives
    Apple      green, round, tasty, yellow, eatable, juicy
    Football   round, white
    Orange     round, juicy, eatable, yellow
    Banana     tasty, yellow, eatable

    Table 3.1: Subject - Adjective relations.

Further, it is possible to make comparisons and draw conclusions about word similarities. From the example in table 3.1 one might conclude that orange, banana and apple are related objects, as they have some common describing adjectives (eatable and yellow). Apple and orange seem to be even more similar, as they both have the properties round and juicy as well. Despite being round, the football does not match the other objects and should be considered different from these. A possible visualisation of the relations between the words is shown in figure 3.6; a small sketch of how such adjective sets can be turned into a similarity score is given below. In figure 3.6 an object has all the properties of its ancestor nodes, and the properties assigned only to the object are those that differentiate it from the rest. Similar relations could be established using other co-occurrence based syntactic relations, such as subject-verb and verb-object structures.


Figure 3.6: Visualisation of the words in table 3.1
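One simple way to quantify the similarities read off table 3.1 is set overlap between the adjective sets of two nouns. The Jaccard measure used in this sketch is our own illustrative choice, not a measure prescribed by the sources above.

    def noun_similarity(adjectives_a, adjectives_b):
        """Jaccard overlap of the adjective sets modifying two nouns."""
        a, b = set(adjectives_a), set(adjectives_b)
        return len(a & b) / len(a | b)

    adjectives = {
        "apple":    {"green", "round", "tasty", "yellow", "eatable", "juicy"},
        "football": {"round", "white"},
        "orange":   {"round", "juicy", "eatable", "yellow"},
        "banana":   {"tasty", "yellow", "eatable"},
    }
    # apple/orange share round, juicy, eatable and yellow -> high similarity;
    # apple/football share only round -> low similarity.
    print(noun_similarity(adjectives["apple"], adjectives["orange"]))    # 0.67
    print(noun_similarity(adjectives["apple"], adjectives["football"]))  # 0.14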

3.2.4

Clustering

Related to the concept of co-occurrence is clustering. Generally speaking, clustering is the process of discovering patterns or structures in data by grouping data elements according to their similarity. From a natural language processing perspective, clustering can be applied to create clusters of conceptually related words or documents. Regardless of the approach, a reasonable cluster is defined as one that maximises the within-cluster similarity and minimises the between-cluster similarity [20]. An ideal type of cluster for NLP is one which guarantees mutual substitutability, in terms of both syntactic and semantic soundness, among words in the same class [5]. Consider the following sentences:

1. He went to the house by car.
2. He went to the apartment by bus.

If a predefined set of word clusters is built up using a training corpus, a statistical parser could recognise the similarity of these two sentences: house and apartment would appear in the same cluster, as would car and bus.

There are mainly two classes of clustering algorithms, namely hierarchical clustering and partitional clustering. Hierarchical clustering seems to be the most common method in natural language processing and can be either agglomerative (bottom-up) or divisive (top-down). Agglomerative hierarchical clustering starts out with all single words as individual clusters and merges them successively into larger clusters. Divisive hierarchical clustering approaches the clustering process from the opposite angle, starting with the whole set as one cluster and dividing it into successively smaller clusters. Either way, hierarchical clustering builds a tree with clusters of different granularity at each level of the tree. For further mathematical details on the clustering process the reader is referred to [31]. In terms of classification of documents, clustering techniques can be applied to a text corpus to cluster classes of related words, which in turn can be used to group documents with similar content.
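The agglomerative (bottom-up) variant can be sketched in a few lines. The average-linkage merging criterion and the stopping threshold used here are our own illustrative choices; [31] covers the alternatives.

    def agglomerative_cluster(items, similarity, threshold):
        """Bottom-up hierarchical clustering sketch.

        Repeatedly merges the two most similar clusters until no pair of
        clusters is more similar than `threshold`. Cluster similarity is
        the average pairwise similarity of their members (average linkage).
        """
        clusters = [[item] for item in items]
        while len(clusters) > 1:
            best_sim, best_pair = threshold, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    sim = sum(similarity(a, b)
                              for a in clusters[i] for b in clusters[j])
                    sim /= len(clusters[i]) * len(clusters[j])
                    if sim > best_sim:
                        best_sim, best_pair = sim, (i, j)
            if best_pair is None:
                break  # no remaining pair is similar enough to merge
            i, j = best_pair
            clusters[i] += clusters.pop(j)
        return clusters

With a word similarity function such as the adjective-set overlap above, this would group house with apartment and car with bus, given suitable feature sets.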

3.3

Rule based approaches

By means of statistical methods like VSM and LSI, it is possible to measure the similarity between documents and extract key concepts. However, it is hard to capture the actual semantics behind sentences using only a statistical analysis. Frequencies and weighting are no doubt useful measures, but the statistical methods outlined in 3.2 ignore the fact that the meaning of a sentence is given not only by its terms, but also by the sequence in which the terms appear. For instance, birdie putts missed series of a Montgomerie does not mean the same as Montgomerie missed a series of birdie putts. Frege's principle [15] states that the meaning of a sentence is determined by the meanings of its meaningful components, plus their mode of composition. According to Frege it is thus necessary to take both the meaning of the terms and the order of the terms into account. A common way to represent meaning is to use First Order Predicate Calculus (FOPC) expressions. FOPC is flexible, well understood and provides a sound computational basis for meaning representation [20]. This section will not explore FOPC in depth, but rather examine it briefly through some examples. For a more thorough and formal discussion of the topic, the reader is referred to [17] and [14].

3.3.1

FOPC and semantics

In FOPC, constants refer to specific objects, and hence ColinMontgomerie is a constant. It is possible to create hierarchies of objects and constants using is-a and a-kind-of relations. The following expression captures the fact that Colin Montgomerie is a golfer:

    ISA(ColinMontgomerie, Golfer)

Further, Mr. Montgomerie is also a person, and one could add another is-a relation to denote that. However, the following a-kind-of relation states that all golfers are persons; hence Colin Montgomerie is a person.

    AKO(Golfer, Person)

By the use of functions in FOPC it is possible to refer to properties of constants. In the case of golf players, a relevant property is the number of times they have won major championships. It is also interesting to represent information about tournaments and their winners. The following FOPC predicates capture such knowledge:

    MajorChampionships(ColinMontgomerie, 0)
    Winner(TheOpen2005, TigerWoods)

In FOPC there are two quantifiers, the existential and the universal. The existential quantifier means "there exists" and is denoted ∃. Similarly, the universal quantifier means "for all" and is denoted ∀. Using quantifiers, two natural language sentences might be translated into FOPC as follows:

1. Colin Montgomerie is a golfer.
2. Every golfer has at least one club.

    1. ∃x ISA(x, Golfer) ∧ Name(x, ColinMontgomerie)
    2. ∀x ISA(x, Golfer) → ∃y ISA(y, Club) ∧ Owns(x, y)

By representing the two sentences (1) and (2) as FOPC expressions, it is possible to reason over their content. Since Colin Montgomerie is a golf player, we can now infer that Colin Montgomerie must have at least one club. It should be clear that FOPC representations of natural text have major advantages over the statistical approaches discussed in 3.2. In the following section we discuss how natural language documents might be represented as predicate expressions.

3.3.2

Syntax-driven semantic analysis

Syntax-driven semantic analysis is the process of mapping syntax to semantics. Please note that this section is not exhaustive in any way; the reader is assumed to have some knowledge of grammar formalisms such as context free grammars and the lambda calculus. The following discussion is based on chapter 15 of [20] and chapter 6 of [1]. The objective of this section is to illustrate the advantages of syntax-driven semantic analysis and the fundamental difference between such an analysis and the statistical approaches outlined in 3.2. Typically, the input text is first parsed by a lexical parser, and the output of the parse is fed into a semantic analyzer for the final meaning representation. The rules in table 3.2 cover a fragment of the English language, in particular the two sentences:

1. Colin Montgomerie plays golf.
2. Every golfer has a caddie.

Figure 3.7: Parse trees of sentences 1 and 2.

Figure 3.7 shows parse trees for the two sentences. The derivation of the semantics is based on the principle of compositionality; that is, the semantics of a sentence can be composed from the semantics of its subparts. Thus, we apply rules that map syntactical constituents to semantic representations. In order to realise this approach, the lambda notation [17] is introduced. Please note that coming up with these rules and corresponding semantic representations can be very hard, and the two examples here more or less just scratch the surface of the matter. Nevertheless, they illustrate the advantage of the rule based approach.

         Rule                   Semantic representation
    1    S → NP VP              {VP.sem(NP.sem)}
    2    NP → Noun              {Noun.sem}
    3    NP → Proper_noun       {Proper_noun.sem}
    4    NP → Det Noun          {Det.sem(Noun.sem)}
    5    VP → Verb NP           {Verb.sem(NP.sem)}

         Vocabulary
    6    Noun → golfer          {golfer}
    7    Noun → golf            {golf}
    8    Noun → clubs           {clubs}
    9    Noun → caddie          {caddie}
    10   Verb → plays           {λx λy play(y, x)}
    11   Verb → has             {λx λy has(y, x)}
    12   Det → a                {λQ λZ ∃x (Q(x) ∧ Z(x))}
    13   Det → every            {λQ λZ ∀x (Q(x) → Z(x))}

    Table 3.2: Grammar for a fragment of English.

Based on the parse tree of sentence 1, the meaning of the sentence can be derived by applying the rules in table 3.2:

    1   VP.sem(NP.sem)                            Rule 1
    2   Verb.sem(NP.sem)(NP.sem)                  Rule 5
    3   λx λy play(y, x)(golf)(Proper_noun.sem)   Rules 2, 3 and 10
    4   λy play(y, golf)(ColinMontgomerie)        λ-reduction
    5   play(ColinMontgomerie, golf)              λ-reduction

Sentence 2 is slightly more complicated, but the same procedure holds. Please note that this sentence is ambiguous; the following derivation uses quantifying in [9] in order to arrive at one of the valid interpretations:

    1   VP.sem(NP.sem)                                                   Rule 1
    2   Verb.sem(NP.sem)(NP.sem)                                         Rule 5
    3   [λx λy has(y, x)(NP.sem)](NP.sem)                                Rule 11
    4   [λx λy has(y, x)(λQ λZ ∃x (Q(x) ∧ Z(x))(caddie))](NP.sem)        Rules 4, 9 and 12
    5   [λx λy has(y, x)(λZ ∃x (caddie(x) ∧ Z(x)))](NP.sem)              λ-reduction
    6   [λy ∃x (caddie(x) ∧ has(y, x))](NP.sem)                          Quantifying in
    7   [λy ∃x (caddie(x) ∧ has(y, x))](λQ λZ ∀z (Q(z) → Z(z))(golfer))  Rules 4, 9 and 13
    8   [λy ∃x (caddie(x) ∧ has(y, x))](λZ ∀z (golfer(z) → Z(z)))        λ-reduction
    9   ∀z (golfer(z) → ∃x (caddie(x) ∧ has(z, x)))                      Quantifying in

It should not be difficult to see the advantages of a perfect syntax-driven semantic analysis. In comparison to statistical information extraction methods such as VSM and LSI, the rule based approach actually grasps the semantics of a sentence. Also, the fact that it is possible to represent the semantics in FOPC, and hence draw inferences, makes it a good choice for a wide range of applications. However, this approach is also prone to errors. Section 3.4 elaborates further on these issues and discusses whether it is possible to combine the statistical and rule based approaches.
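The β-reductions in the derivation of sentence 1 can be mimicked directly with closures. The following Python sketch only illustrates how rules 1, 5 and 10 of table 3.2 compose; the string-based meaning representation is our own simplification.

    # Rule 10: plays -> λx λy play(y, x), built as nested closures.
    plays = lambda x: lambda y: f"play({y}, {x})"

    # Rules 2 and 3: the semantics of simple nouns and proper nouns
    # are the lexical entries themselves.
    golf = "golf"
    colin = "ColinMontgomerie"

    # Rule 5: VP.sem = Verb.sem(NP.sem); Rule 1: S.sem = VP.sem(NP.sem).
    vp = plays(golf)   # λy play(y, golf)
    s = vp(colin)
    print(s)           # -> play(ColinMontgomerie, golf)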

3.3.3

Constraint grammar

Section 3.3.2 and section 3.3.1 described the transformation from natural text into a meaning representation and a way of representing meaning, respectively. This is a strictly rule-based approach, and such approaches are most suitable for domain specific text fragments, as they seem to leak when confronted with unrestricted texts [22]. Constraint grammar, on the other hand, is a language-independent formalism that aims at overcoming these limitations and parsing unrestricted text. It should be stressed, however, that constraint grammar only deals with structural and grammatical ambiguities and does not attempt to resolve semantic or pragmatic ambiguities. This means that the constraint grammar approach is used for part of speech and syntactical tagging and does not capture semantic meaning. As we will see later in section 4.1, this is the formalism that the Oslo-Bergen tagger is based on.

In order to parse unrestricted text, constraint grammar takes the stand that both constraints (grammar rules) and more probabilistic (heuristic) statements are needed. Despite the fact that constraint grammar is classified as a combination of grammar-based rules and probabilistic modules, it should be pointed out that the kernel of constraint grammar is meant to be linguistic rather than probabilistic in nature [22]. For the purpose of constructing a successful parser, it is necessary to extract as much information as possible from grammatical restrictions. In addition, the process is supplied with probabilistic measures for sentences that are only partially analysed.

Constraint grammar is basically a formalism for writing disambiguation rules. These disambiguation rules, called constraints, are intended to discard as many improper alternatives as possible in ambiguous cases where words are assigned more than one tag. Constraint grammar is thereby said to be reductionistic in nature, as it starts out by assigning all possible interpretations to all words and disambiguates from there on. A successful parse eliminates all ambiguities.

Constraint grammar parsing

Formally speaking, constraint grammar parsing is separated into morphological analysis and disambiguation, morphosyntactic mapping, determination of clause boundaries and disambiguation of syntactic functions. Put in other words, the parsing process consists of lexical and syntactical tagging and disambiguation. Ideally, at the end of the process all words are disambiguated and thus have only one interpretation. These processing steps are preceded by preprocessing of the text. The different steps progress as follows [21]:

1. Preprocessor: The preprocessor normalises the text content by handling headings, footnotes, paragraph structures, punctuation, etc. Additionally, the preprocessor recognises idioms and fixed multiword expressions, which are language dependent.

2. Assignment of tags: The morphological analysis uses a lexicon to assign all possible lexical interpretations to all words in the text. An example of such an assignment for the word move is shown in figure 3.8.

Figure 3.8: Output from the morphological analysis [21].

In figure 3.8 the first lexeme in each line is the base form of the word, the upper case letters denote morphological features, and @ denotes a syntactic function retrieved from the lexicon. The interpretations that are not assigned an initial syntactic function from the lexicon are given syntactical functions according to certain rules. An example of such a rule is that a pronoun in the genitive case is either a prenominal genitive modifier, a subject predicate complement, or an object predicate complement. These syntactical labels are then assigned to the word, and the disambiguation process tries to sort out which one is correct [21].

3. Disambiguation: The disambiguation process is based on constraints. These constraints are divided into two main categories: those concerning morphological features and those resolving syntactic ambiguities. An example of the former is shown in figure 3.9.

Figure 3.9: A morphological constraint [21].

This constraint implies that if a word (@w) has the reading "PREP", this interpretation is discarded (=0) iff the preceding word (in position -1) has a reading with the feature "DET" [21]. Examples of syntactical disambiguation constraints are shown in figure 3.10. The two constraints shown discard @+FMAINV (finite main verb) as a syntactic alternative if there is a unique finite main verb to the left (*-1) or to the right (*1) in the same clause.

Figure 3.10: A syntactical constraint [21].

The intention of these constraints is to discard as many improper alternatives as possible rather than to "prove" the correctness of a sentence. By imposing both types of constraints onto the ambiguous text, many interpretations are disambiguated into one unique form. Constraint grammar additionally takes advantage of probabilistic methods, such as heuristics, to further improve the parsing in ambiguous cases where the constraints are not sufficient. This is, however, not elaborated much on in the literature.
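The reductionistic style of constraint application can be illustrated with the morphological constraint of figure 3.9. The sketch below is a loose Python rendering, not the actual constraint grammar rule syntax; the representation of readings as sets of tag strings is our own assumption.

    def apply_det_prep_constraint(sentence):
        """Discard a word's "PREP" reading iff the preceding word has a
        reading with the feature "DET" (the constraint in figure 3.9).

        sentence: one set of candidate readings (tag strings) per word.
        """
        for pos in range(1, len(sentence)):
            if any("DET" in reading for reading in sentence[pos - 1]):
                kept = {r for r in sentence[pos] if "PREP" not in r}
                if kept:  # reductionistic: never remove the last reading
                    sentence[pos] = kept
        return sentence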

3.3.4

Reference resolution

In natural languages one usually makes use of pronouns to refer to nouns. Once a common agreement on the noun has been established (implicitly or explicitly), we can refer to it using a pronoun. The reference is then in most cases easily resolved by the human brain. However, there are ambiguous cases even for human beings. Imagine, then, the complexity involved in resolving these references in a natural language processing system. In this section we will look at a rule based algorithm for anaphora resolution in a natural language processing system.

A natural language expression used to perform a reference is called a referring expression, and the entity that is referred to is called the referent or the antecedent [20]. Referring expressions that are used to refer to the same entity are said to corefer. The subject of reference resolution is concerned with resolving these coreferences. As an example of coreference, consider the following sentence:

    Paul_i ate dinner, while he_i watched TV.

In this sentence, Paul is the antecedent, while he is the referring expression. There are mainly two types of coreferences, namely anaphoric and cataphoric. References to an entity that has previously been introduced into the discourse are called anaphoric, whereas pronouns that are mentioned before their referents are called cataphoric [20].

Mitkov's algorithm

Mitkov (1998) presented an algorithm for resolving coreferences in terms of anaphora resolution, as shown in the listing below. This algorithm aims at avoiding complex syntactic, semantic and discourse analysis and is based entirely on the output of a part of speech tagger combined with simple noun phrase rules.

1. Examine the current sentence and the two preceding sentences (if available). Look for noun phrases only to the left of the anaphora.

2. Select from the noun phrases identified only those which agree in gender and number with the pronominal anaphora and group them as a set of potential candidates.

3. Apply the antecedent indicators to each potential candidate and assign scores; the candidate with the highest aggregate score is proposed as the antecedent. If two candidates share an equal score, precedence is given to immediate reference.

The antecedent indicators mentioned in step three of the algorithm play a decisive role in determining the most probable antecedent among the possible candidates. Each indicator is a rule that assigns a score in the range -1 to 2 to each of the candidates. The candidate with the highest total score is considered the antecedent of the referring expression. The indicators used in the algorithm are as follows [25] (a sketch of the scoring step is given after the list):

- Definite noun phrases in previous sentences are more likely antecedents than indefinite ones. They are given the scores 0 and -1, respectively.

- The first noun phrase in the previous sentences is deemed a good antecedent candidate, as it tends to represent the "given information". It is given the score 1, while others receive the score 0.

- If a verb is a member of a predefined set (discuss, present, etc.), the following noun phrase is considered a preferred candidate and scores 1, whereas others score 0.

- Repeated noun phrases are likely candidates and score 2 if repeated twice or more in the same paragraph, 1 if repeated once and 0 if only mentioned once.


- Non-prepositional noun phrases are given higher preference than prepositional noun phrases; they receive the scores 0 and -1, respectively.

- Preference is given to those noun phrases that share the same pattern as the pronoun. The patterns considered in this context are noun phrase (pronoun) verb and verb noun phrase (pronoun). Candidates with such similar patterns receive the score 2, whereas others score 0.

- In sentences of the form "...V1 NP ... conj <you> V2 <it> ...", the noun phrase immediately following V1 is likely to be the antecedent of the pronoun it and receives the score 2, whereas others score 0.

- In complex sentences the likelihood of a noun phrase being an antecedent decreases with its referential distance. Candidates in the previous clause score 2, in the previous sentence 1, two sentences back 0 and three sentences back -1. For simple sentences, noun phrases in the previous sentence are regarded as the best candidates and get the score 1.
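Step three of the algorithm reduces to summing the indicator scores per candidate and picking the maximum. The sketch below shows only that aggregation; the indicator functions themselves (definiteness, givenness, repetition, and so on) would be computed from POS-tagged text and are omitted, and all names and scores in the example are hypothetical.

    def propose_antecedent(candidates):
        """Pick the candidate with the highest aggregate indicator score.

        candidates: list of (noun_phrase, [indicator_scores]) pairs,
        each score in the range -1..2 as described above.
        """
        best_np, best_total = None, float("-inf")
        for noun_phrase, scores in candidates:
            total = sum(scores)
            if total > best_total:
                best_np, best_total = noun_phrase, total
        return best_np

    # Two candidates scored by four hypothetical indicators:
    print(propose_antecedent([("the tournament", [0, 1, 0, 2]),
                              ("a spectator",    [-1, 0, 0, 0])]))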

In terms of classification of documents, reference resolution can prove valuable for improving the concept extraction process: resolving references by replacing the referring expressions with their antecedents increases the frequency of the concept. Additionally, it can help connect subject-verb-object relations of pronouns to their antecedents and thereby ease the process of clustering such relations.

3.4

Combined approach

Roughly speaking, one could say that behaviourism and cognitive psychology gave birth to statistical linguistic methods and rule based linguistic methods, respectively. This happened during the 20th century, and there has been an ongoing discussion between these two linguistic camps ever since [12]. Reaching a uniform conclusion regarding which is the better method, however, is not feasible, as it depends on the area of application. Both methods have pros and cons concerning natural language processing and should be equally taken into consideration when deciding upon a method of analysis.

3.4.1

Text parsing

The primarily focus in the literature has been on statistical methods versus rule based methods for text parsing. It is claimed that although rule based parsers are widely used in real working NLP systems, they have some major disadvantages. Rule based parsers require extensive amounts of dictionary data along with highly skilled linguistics to create, enhance and maintain comprehensive sets of rules [26]. This statement is especially true if the parser is required to have broad coverage. Broad coverage in this sense means that the parser is able to analyse text across several different domains. It is very difficult to make such a general rule based parser with broad coverage as the set of rules simply does not apply to unrestricted text. The coverage limitations of rule based parsers, which might result in complete parsing failure seem to be their main disadvantage [27]. On the other hand, the output of successfully parsed sentences are usually of high precision. Statistically based methods include a significant decrease in the amount of rule coding required to create a parser that performs adequately. Additionally they have the ability to “tune” a parser to a particular type of text simply by extracting statistical information from 20


the same type of text [26]. Statistical methods also tend to have better recall measures than rule based methods in some cases, but often suffer in precision [27]. Another disadvantage is the requirement for large amounts of training data, often in the form of large NL text corpora that have been annotated with hand-coded tags specifying parts of speech, syntactic function, etc. [26]. To sum up the pros and cons of the two approaches: rule based methods are superior in terms of precision, but lack broad coverage and are subject to a potentially very complex rule creation process. Statistical methods require less effort regarding rule coding and seem to provide good recall, but come at the cost of lower precision and the need for large amounts of training data. In order to overcome the downsides of statistical methods and rule based methods for text parsing, a combination of the two seems reasonable. One possible way to go about a combined method is to first use a rule based approach to parse text within a given domain. Assuming that the sentences are parsed with high precision, the output can be fed into statistical methods as training sets. The intention of the statistical method is then to generalize the methods and make them applicable outside the given domain to ensure broader coverage. One possible weakness of a combined method like this is that the statistical methods are trained only on material already interpreted by the rule based parser. A lot of training might then be needed in order to generalise the approach.

3.4.2 Classification of documents

On the basis of the methods presented in the previous sections for statistical and rule based approaches, a way of approaching the process of document classification could be as depicted in figure 3.11.

Figure 3.11: A way of combining methods for document classification.

The process of classifying documents requires at least a part of speech tagger. The text parser could be a rule based parser based on Constraint Grammar, as described in section 3.3.3. A constraint grammar parser can provide part of speech information as well as the syntactical information needed in steps two and three of figure 3.11. The Oslo-Bergen tagger, to be presented in section 4.1, is a real working text parser for Norwegian which is based on constraint grammar. Another alternative is a statistical parser like the one to be presented in section 4.2. That particular parser only provides part of speech information, but has several features that make it interesting. Whether the text parsing is based on statistical or rule


based methods or a combination of the two, the most important aspect is that it provides good recall and precision measures. After the text has been parsed, statistical methods can be used to perform a surface analysis. Such an analysis is carried out to discard noise and extract only the information relevant to the process of document classification. For this purpose information from a part of speech tagger is needed. A VSM analysis is ideal for extracting important concepts as well as providing distinguishing factors for documents and domains. This process could be supplemented by a latent semantic analysis and coreference resolution in order to disambiguate synonymy and polysemy relations as well as pronoun references. When the surface analysis is completed, co-occurrence and clustering methods combined with rule based methods for meaning representation (FOPC) and syntactical relations can be used to further enhance the internal domain representations. This would incorporate more structure into the domain representations, which consist only of a list of concepts after the surface analysis.


Chapter 4

Relevant technologies

The process of classifying documents will at some stage require a part of speech tagger. Part of speech information is necessary in order to filter out the information relevant to our approach. In terms of concept extraction, a part of speech tagger would enable us to extract concepts of a certain part of speech. Additionally, a part of speech tagger provides the foundation for adding syntactical features such as subject-verb-object structures. However, creating a part of speech tagger from scratch requires a considerable effort. In this chapter we will therefore consider two available solutions, namely the Oslo-Bergen tagger and the kernel of the Corporum technology developed at CognIT.

4.1 The Oslo-Bergen tagger

The Oslo-Bergen tagger is a rule-based tagger (as opposed to a statistical tagger), which is able to classify words in a text written in either of the two written languages for Norwegian. The tags assigned to each word contain information such as part of speech with subgroup information as well as morpho-syntactic aspects. The tagger is based on constraint grammar, which was developed at the University of Helsinki, and the software for the constraint grammar rules is supplied by the Finnish company Lingsoft. A constraint based tagger has linguistic rules for every choice of disambiguation, which means that it does not build phrases, but bases all choices on relations between single words [18]. For further details on Constraint Grammar, refer to section 3.3.3. The original tagger was developed in 1996-1998 and consists of a preprocessor, a multitagger and a module for morphological and syntactical disambiguation. The tagger is still under development, and all three parts mentioned above have been upgraded in the tagger now known as the Oslo-Bergen tagger. Additionally, a more user friendly graphical web interface has been added to the original tagger.

4.1.1 Methodology

The system input is a text file. There are, however, no requirements regarding the text formatting. Tables and headlines can be included, which makes the tagging process harder. As mentioned above, the tagger takes the input data through the following processing steps [18]:

1. Preprocessing: Handles the following issues:

• Separate headings from text.
• Sentence recognition.


• Separate dates from other numbers.
• Recognise conjunctional collocations, e.g. kirke- og utdanningsministeren.
• Recognise predefined expressions that are regarded as single words, e.g. i ny og ne.

2. Multitagging: Processes the text on a word-by-word basis. Each word is looked up in a full dictionary database containing lexical terms with all inflected forms. The context is not considered in this step, so each word is assigned all possible dictionary matches. Example output from the multitagging process is shown in figure 4.1 a).

Figure 4.1: a) Output from the multitagger. b) Final output from the Oslo-Bergen tagger.

Additionally, the following issues are handled:

• Proper noun recognition.
• Recognising compound words, which are very common in the Norwegian language. Words that are not recognised in the full dictionary database are analysed as compound words. If such an analysis succeeds, the last part of the compound word is looked up in the full dictionary database, as this determines the part of speech.


3. Disambiguating tagging: In this step context is considered. The morphological tagger disambiguates by assigning only one tag in previously multitagged cases. This is only done in situations where the disambiguation can be based on a solid foundation. Additionally, the syntactical tagger assigns syntactical functions, e.g. adverbial, subject or object, to the tags. Example output from this step is shown in figure 4.1 b).

Our approach to classification of documents is dependent on a tagger in order to incorporate more than just statistical methods. The Oslo-Bergen tagger has demonstrated a precision of 95.4% and a recall of 99%, yielding an f-measure of 97.2%, and serves well as a part of speech tagger. There is, however, considerable improvement needed in the subject and object tagging [18].

4.2 Corporum - The Kernel

The kernel of the Corporum technology developed at CognIT offers similar functionality to the Oslo-Bergen tagger as described in section 4.1. It is a part of speech tagger, but the output differs somewhat from the output of the Oslo-Bergen tagger.

4.2.1 Kernel output

The kernel transforms plain textual input into a part of speech tagged XML output file. An excerpt of such a file is shown in figure 4.2.

Figure 4.2: Output from the kernel of the Corporum technology.

The first thing to note about the kernel output is the structure of the file. The XML format provides the capability of structuring the output into different granularity levels. The output file is structured as a list of paragraphs, each consisting of sentences of elements. Each element can either be a token (word) or a collocation (a set of tokens that naturally belong together). The fact that the output is formatted in this manner makes it easy to navigate for implementation purposes. It is easy to parse the output and retrieve the discourses of interest. The part of speech information is provided at the token level. The most relevant attributes assigned to each token are:


• TOKEN UID: A unique token reference.
• Clean: The lowercase version of the token.
• Raw: The token as it appears in the text.
• Type: Whether the token is a name (ttName), an abbreviation (ttAbbreviation) or a regular word (ttWord).
• BrillTag: Denotes the part of speech of the token. The tag also provides inflection information. Consider the last token in figure 4.2 (regelbrudd). The BrillTag in this case (btNNeutIndefSg) denotes that regelbrudd is a noun (btN), neuter in gender (Neut), and the inflection is indefinite (Indef) and singular (Sg).
• Lemma: The stem of the token.

A sketch of how these attributes might be read programmatically is given below.
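As a small illustration, the listing below sketches how these token attributes might be read with a standard DOM parser. The element name token and the file name are assumptions on our part; the exact tag names in the real Corporum output may differ from what the description above suggests.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: read token attributes from a kernel output file and keep the nouns.
public class KernelReader {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("kernel-output.xml"); // hypothetical file name

        NodeList tokens = doc.getElementsByTagName("token");
        for (int i = 0; i < tokens.getLength(); i++) {
            Element token = (Element) tokens.item(i);
            String lemma = token.getAttribute("Lemma");    // stem of the token
            String brill = token.getAttribute("BrillTag"); // part of speech
            if (brill.startsWith("btN")) { // nouns, cf. btNNeutIndefSg above
                System.out.println(lemma + " (" + brill + ")");
            }
        }
    }
}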

Another considerable advantage of the kernel output is the fact that it resolves collocations. As shown in figure 4.3, the name Bjørn Haugseth is grouped together as a collocation. This provides a better foundation for adding syntactical information such as subject and object tags. Lack of collocation resolution is currently one of the downsides of the Oslo-Bergen tagger.

Figure 4.3: Collocation in the kernel output.

In some cases the kernel output provides more than one possible part of speech interpretation of a token. The third token (mange) in figure 4.2 represents such an ambiguous case: mange is assigned the two possible interpretations pronoun and adjective. To ensure an unambiguous interpretation in those cases, it is possible to make use of the BrillTag and look for the matching Category tag in the listing of possible candidates. The BrillTag of mange is btJJ, which denotes adjectives. By applying this information it is possible to pick out the correct alternative and thereby get the correct stem of the word. The stemmed forms, however, do not always seem to be correct, and the Oslo-Bergen tagger is better suited in this area. It is possible, though, to put simple rules on top of the kernel output to get the correct stemming in most cases. The matter of stemming words is important in applications that make use of term frequencies.


Chapter 5

Summary

The literature review has given a brief introduction to ordinary classification systems and presented statistical and rule based methods that can be used to create a document classification system. The methods introduced are all well established in the area of natural language processing (NLP) and are thus relevant for our approach regarding document classification. However, within the scope of this thesis we are not able to pursue all of them further. This does not discard them as improper methods for future work, but rather provides a foundation for further enhancements of the work started in this project. To avoid grasping too wide, we have focused our work in this project around the working hypotheses provided in the following section.

5.1 Hypothesis

Based on the literature review we would like to raise the following two working hypotheses for this thesis:

1. By the use of VSM analysis, and the inverse document frequency property in particular, we believe it is possible to represent domains of text documents. The method should distinguish semantically similar domains as well as semantically different domains. The method would require training sets for each of the domains. Further, we believe these domain representations might be used as the basis of a classification system.

2. We believe that syntactical relations such as

• subject-verb
• verb-object
• subject-verb-object
• adjective-subject

are useful features of text documents. These features might prove useful in order to provide structural and layered information about the domains and single text documents. If this turns out to be the case, it will improve and possibly complement the shallow classification process of hypothesis 1. These relations are possibly a good starting point for automatic building of ontologies.

These two hypotheses are processed in part II and part III, respectively.


Part II

Our approach


Chapter 6

A method proposal

This part of the report describes a method for classification of text documents. The technique is based on the VSM analysis described in 3.2.1. In particular, the method makes heavy use of the inverse document frequency and builds domain representations of training sets. A hallmark of the method is its ability to differentiate very similar domains as well as very different domains.

6.1 Brief description of the method

Originally, the intention of this project was to develop a method that would create suitable categories and put each document of a collection in the appropriate category. One hallmark of this method was that the categories were not known beforehand. However, based on knowledge obtained from the literature review and practical experiments, we have decided that the categories should be defined in advance. Hence, the method requires some degree of user intervention. The following points outline the method and explain how it might be used in practical applications.

• The user trains the system by giving it a set of documents within the same domain. The system analyses a small number of domain specific documents (10-15 documents) and creates a representation of this particular domain.
• Documents with unknown semantic content are compared to the domain representations, and each document is assigned a category.

Say for example that a large document archive consists of articles dealing with football, handball, golf, music or movies. The manual part of the process is to feed the system with a number of football articles, a number of handball articles and so on. Domain representations for each of these domains are then automatically created, and the system is ready to analyse and classify unknown documents. In addition it is possible to manually create hierarchies like the one in figure 6.1. If a document sorts under the football domain, it is also categorised as a sports article. This makes it possible to compile a menu structure as seen on news pages on the Internet (e.g. www.cnn.com, www.dagbladet.no, www.aftenposten.no).

6.2 Domain representations

The crux of the method is how domains are represented, in other words how to extract distinguishing features from documents within the same domain.


Figure 6.1: Domain hierarchy.

Two fundamental problems need to be addressed: What are the building blocks of the representation? and How is the representation computed? Figure 6.2 depicts the conceptual idea underlying the domain knowledge extraction process. As shown, a number of football documents are fed into the analysis. All of the documents contain some features that are characteristic of the football domain, encapsulated in the circles. However, the documents also contain a vast amount of data which is not football related, or not suitable to distinguish football documents from documents not dealing with football. Therefore, we need some way of getting rid of the noise and keeping the characteristic terms and relations.

Figure 6.2: Extracting domain knowledge. The rightmost circle illustrates the ideal process in which only football related features are extracted from the documents.

6.2.1 Domain representation constituents

Even though the intuitive understanding of the extraction process in figure 6.2 is easy to grasp, it is difficult to decide what kind of data to include in the representations. Domain representations are based on two features: term frequency and the distribution of syntactical relations. Syntactical relations are not included in the domain representations built in this part of the report, but we will return to these features in part III.

Term frequency

Term frequency and term distribution across a set of documents obviously bear some information about the domain. If, for example, the terms lag (eng: team) and dødball (eng: set piece) have high frequencies and appear in the majority of the documents in the set, it is likely that these two words should have an impact on the domain representation. We have chosen to use an analysis similar to VSM (3.2.1) for concept extraction.


Syntactical relations

According to Frege [15], both the meaning of the terms and the order of the terms are necessary to understand a sentence. Inspired by this well-known principle, we have investigated how term relations can be exploited to develop a more effective method for document categorisation. The method proposed in this project analyses the following relations:

• Subject - Verb - Object (SVO)
• Verb - Noun (VN)
• Noun - Verb (NV)
• Adjective - Subject (AS)

Take the handball and football domains as an example. Åge Hareide and Marit Breivik are coaches for the national football and handball teams, respectively. Not surprisingly, it turns out that these proper names occur in a number of identical relations. This fact might indicate that Åge Hareide and Marit Breivik have similar semantic roles. In addition, some relations are very frequent in some domains and hardly used in other domains. Such information helps to reveal semantic relations and to build better and more robust domain representations. This report will discuss the details of the analysis and report results from an experiment performed on a test corpus (see part III).

6.2.2 Finding distinguishing features

In order to find an adequate domain representation, there are in general at least two different approaches. One way is to take only the domain specific documents into account and find similarities between those documents. Another way is to treat the domain specific documents as one unit and make a comparison between these documents and a general collection. We have chosen to use the latter approach, simply because domain representations should capture characteristics that differentiate the domains from each other. A simple example illustrates the rationale behind the decision. Again, take the two domains, football and handball, as an example. In this experiment only the term frequency component of the domain representation is included. Table 6.1 shows the result of a concept extraction method performed on two small document collections. One collection contains football articles, the other handball articles. In this case each collection is analysed individually, and no comparison between the domain and a general collection has been made. The table depicts the 15 most important concepts in the two collections. A concept is marked if it is an important concept in both domains. As the table shows, six out of 15 concepts are among the highest ranked in both domains. It is worth noting that even if two domains are very similar, the method should differentiate the two domains. This implies that each domain representation ought to consist of discriminating factors. Table 6.2 shows the result from an analysis where each of the two domains is compared with a general corpus. Now, the football and handball domains do not share any of the 15 highest ranked concepts. It is a very subjective matter to decide which concept ranking is better, but we argue that the latter has a number of advantages. The two rankings in table 6.1 have a lot of common concepts, which could make it difficult to distinguish the two domains. A related issue is the fact that both include very general concepts which are part of most ball game vocabularies (spiller, kamp,


Fotball        Handball
spiller        år
millioner      vm
spiller        • kamp
• kamp         • mål
• landslag     november
gang           • bane
• mål          pause
minutt         dag
sko            • kamp
sesong         turnering
lørdag         slutt
• bane         vg
krone          omgang
klubb          • landslag
• lag          • lag

Table 6.1: Concept extraction within domain.

Fotball          Handball
lag              turnering
ball             håndball
minutt           landslagssjef
klubb            kontring
kontrakt         mesterskap
satsing          spillet
førsteomgang     kant
frispark         strekspiller
stadion          ledelse
sko              avslutning
børs             landslagslege
spiss            korsbånd
stolpe           trening
ligacup          skulder
scoring          hånda

Table 6.2: Comparing the domains with a general corpus.


mål, bane, lag). A table including the 50 most important concepts should indeed contain all these general ball game concepts; however, they are not suitable to discriminate between similar domains. We have therefore created a representation that is useful for two slightly different purposes:

• differentiating one particular domain from other, very different domains (using the low-ranked parts of the representation)
• differentiating one particular domain from a similar domain (using the high-ranked parts of the representation)

Section 7.1 describes the details of how to arrive at such a domain representation.

6.3 Measuring similarity

In order to classify a document, it is necessary to compare the document to the domain representations. Each comparison yields a similarity value. As soon as the domain representations are built, all the hard work is done. Actually, the basic idea behind the similarity value is very simple. Each component of the document to be classified is assigned a frequency or some frequency related value. Typically this component is also represented with a weight value in the domain representation. The most intuitive operation is therefore to combine the two values and calculate an aggregated value for all the components in the document that is to be classified. A high similarity value might indicate that the document belongs to the corresponding domain. For a more detailed discussion of the calculation of the value and related problems, please refer to 8.1.


Chapter 7

Building domain representations

7.1 VSM based domain representations

In order to create useful domain representations, the documents included in the training set have to be semantically consistent. If a training set contains too much noise, the quality of the resulting domain representation will suffer. In this context we define noise as documents that contradict the majority of the documents in the domain, e.g. a handball document in a collection of football documents.

7.1.1 Success criteria

As described briefly in 6.2.2, it is not enough to extract concepts that characterise the domain. It is crucial to find discriminating factors that distinguish each domain from the rest of the corpus, including domains that are very similar. Figure 7.1 outlines the starting point of the analysis, in which domain documents are separated from the rest of the collection.

Figure 7.1: Domain documents vs. the general corpus.

In our analysis, the following three properties characterise terms that are important concepts in a domain:

1. high frequency in the domain collection


2. low frequency and spread in the general corpus
3. high spread in the domain collection; that is, the concept appears in as many documents in the domain as possible

High frequency (1) might indicate that a term is an important concept. On the other hand, if a term also has a high frequency in the general corpus, it is not a very useful discrimination factor. That is why the concept should occur rarely outside the domain (2). Further, important concepts should also occur in as many documents in the domain collection as possible, otherwise they could be noise.

7.1.2 Solution

In order to extract concepts according to the three points, two VSM analyses are required: one analysis of the domain and one analysis of the general corpus. A useful feature of the VSM analysis is the inverse document frequency (IDF), discussed in more detail in 3.2.1. The IDF is inversely proportional to the spread of a term in a collection, a property useful to satisfy points 2 and 3. The desirable outcome of the combined VSM analysis is that concepts with low IDF-values in the domain and high IDF-values in the general corpus are assigned high final IDF-values. Please bear in mind that a high IDF-value indicates that the term is not widespread in the collection, and that important concepts are usually widespread in the domain collection. Also, important concepts that are well suited as discrimination factors rarely occur frequently or widespread in the general corpus.

Let $VSM_{GEN}$ and $VSM_{DOM}$ denote the two VSM analyses, one for the general corpus and one for the domain, respectively (figure 7.1). $IDF_{GEN}^{j}$ denotes the inverse document frequency of word $j$ in the VSM analysis of the general corpus. Similarly, $IDF_{DOM}^{j}$ denotes the inverse document frequency of word $j$ in the domain collection. $IDF_{FIN}^{j}$ denotes the final IDF-value of word $j$, which is the value assigned to a specific word in the domain representation. Also, $|DOM|$ and $|GEN|$ denote the number of distinct terms in the domain collection and the general corpus, respectively. In the domain representations, each distinct term in the domain is assigned an IDF value. The IDF value of word $j$ in domain $i$ is denoted by $idf_i(j)$. In order to recalculate this value for each distinct term in the domain collection, we have used the following calculation rules:

1. If the term is present in the general corpus:

$$idf_i(j) = \frac{|DOM|}{IDF_{DOM}^{j}} \cdot IDF_{GEN}^{j} \qquad (7.1)$$

2. If the term is not present in the general corpus:

$$idf_i(j) = \frac{|DOM|}{IDF_{DOM}^{j}} \cdot |GEN| \qquad (7.2)$$

The interpretation of these equations is that the inverse document frequency of each term in the domain collection is inverted by dividing the number of distinct terms in the domain collection ($|DOM|$) by the original IDF-value. This inverted value is then multiplied by the IDF-value of the term in the general corpus. If a term does not occur in the general corpus at all, the inverted IDF-value in the domain is multiplied by the number of distinct concepts in


the general corpus ($|GEN|$), which represents the maximum possible IDF-value in the general corpus. This way, concepts that occur frequently across the domain and at the same time are hardly present in the general corpus are assigned high IDF-values. Concepts that are assigned high IDF-values after this process are typically well suited as discrimination factors. Equations 7.1 and 7.2 capture all the knowledge stated in the three properties in 7.1.1. A minimal sketch of the recalculation is given below.
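The listing below sketches the recalculation in equations 7.1 and 7.2. It assumes that the IDF values from the two separate VSM analyses have already been computed and are available as maps from term to IDF value; the method and variable names are our own.

import java.util.HashMap;
import java.util.Map;

// Sketch of equations 7.1 and 7.2: recalculate the IDF value of every
// distinct term in the domain collection against the general corpus.
public class DomainRepresentationBuilder {

    static Map<String, Double> recalculate(Map<String, Double> idfDom,
                                           Map<String, Double> idfGen) {
        int domSize = idfDom.size(); // |DOM|: distinct terms in the domain
        int genSize = idfGen.size(); // |GEN|: distinct terms in the corpus
        Map<String, Double> idfFinal = new HashMap<>();

        for (Map.Entry<String, Double> entry : idfDom.entrySet()) {
            double inverted = domSize / entry.getValue(); // invert domain IDF
            Double gen = idfGen.get(entry.getKey());
            // Equation 7.1 if the term occurs in the general corpus,
            // equation 7.2 (maximum possible IDF) if it does not.
            idfFinal.put(entry.getKey(), inverted * (gen != null ? gen : genSize));
        }
        return idfFinal;
    }
}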


Chapter 8

Classification

8.1 Classification using VSM domain representations

Documents with unknown content are classified by comparing the documents with the domain representations built using the VSM inspired technique as described in 7.1. In general, the document in question is compared to each domain representation and each comparison yields a similarity value. Basically, the domain corresponding to the comparison with the highest similarity value is probably the correct document class. The classification process is quite straightforward, but there are a few remarks that should be noted. The following sections discuss the classification process in detail.

8.1.1 Measuring similarity

As described in 7.1.2, in a domain representation $D_i$ each distinct term $j$ is assigned an IDF value, or weight, denoted by $idf_i(j)$. In real world applications we would expect that many domains have substantial overlaps. An example previously used in this report is that of the football and handball domains. Both are ball games and have a certain ball game vocabulary in common. The method should be able to separate both very different domains and very similar domains, which might seem to be difficult. Each comparison between a document and a domain representation returns a similarity value. The similarity value is a sum of products between the frequency of each distinct term in the document and the weight of the same term in the domain representation. Let $f_j$ denote the frequency of term $j$ in the document and $Sim(A, D_i)$ the similarity value between document $A$ and domain representation $D_i$. The similarity value is given by equation 8.1:

$$Sim(A, D_i) = \sum_{j} f_j \cdot idf_i(j) \qquad (8.1)$$

If the term $j$ is not present in the domain representation, the value of $idf_i(j)$ is zero. A minimal sketch of this computation is given below.
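The listing below is a minimal sketch of equation 8.1, assuming the document has already been reduced to a map of term frequencies and the domain representation to a map of weights.

import java.util.Map;

// Sketch of equation 8.1: sum of term frequency times domain weight.
public class Similarity {

    static double sim(Map<String, Integer> docFrequencies,
                      Map<String, Double> domainWeights) {
        double score = 0.0;
        for (Map.Entry<String, Integer> term : docFrequencies.entrySet()) {
            // idf_i(j) is zero for terms missing from the domain representation.
            Double weight = domainWeights.get(term.getKey());
            if (weight != null) {
                score += term.getValue() * weight;
            }
        }
        return score;
    }
}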

8.1.2 Justification of the method

With the definition of $Sim(A, D_i)$ in mind, we can now analyse how the method distinguishes domains of text documents. Tables 8.1 and 8.2 show fragments of two domain representations, handball and football respectively. The terms appearing above the dots are terms with high weights (i.e., high IDF values). Below the dots are terms with high frequency in the domain, but not necessarily high IDF values.


Term          Weight (IDF)   Frequency in domain
turnering     3962           14
håndball      3335           6
mesterskap    3031           6
landskamp     2727           4
kontring      2415           6
skulder       2415           4
brudd         2415           6
...           ...            ...
kamp          457            38
mål           502            32
spiller       350            18
omgang        540            15
lag           336            12

Table 8.1: Fragment of the handball domain representation.

Term           Weight (IDF)   Frequency in domain
innlegg        3335           6
overgang       2678           4
posisjon       2678           4
førsteomgang   2678           4
stadion        2364           6
satsing        2364           5
frispark       2364           3
...            ...            ...
kamp           385            39
mål            438            26
spiller        454            22
klubb          428            18
lag            270            13

Table 8.2: Fragment of the football domain representation.


Not surprisingly, the terms with high frequencies are to a great extent the same in both domains; thus they are assigned low weights. One way to interpret the similarity value is to split it into two components. It is obviously not that clear cut, but it is a fairly accurate description and a good way to understand the strength of the method. Consider figure 8.1. The two bars F and H represent two similarity values: the result of a comparison between a football document and the football domain representation, and the result of a comparison between a handball document and the handball domain representation. According to tables 8.1 and 8.2, terms such as kamp (eng.: match), mål (eng.: goal) and spiller (eng.: player) occur with high frequencies in both football and handball documents. Thus, these terms have low weights in the domain representations. The bottom ellipse in figure 8.1 contains such words. Due to the fact that they occur frequently in both handball and football documents, they contribute considerably to both H and F. However, without the domain specific terms, represented with the two ellipses to the left and right respectively, it would be impossible to distinguish handball documents from football documents.

Figure 8.1: Two similarity values (football domain and handball domain).

A simple experiment might convince us that figure 8.1 is a reasonable interpretation of how the method works. Table 8.3 shows the contribution of various terms to the similarity score $Sim(A, Football)$ for a document A. Document A is classified as a document belonging to the football domain, and the analysis arrives at a similarity value of 15180. Document A is a real document referring to the Norwegian football player Fredrik Winsnes and his transfer from RBK to the Danish club Aalborg. In table 8.3 a number of terms from document A are grouped in three rows. The terms in the first row have low frequencies in document A, but are regarded as distinguishing terms in the football domain representation, that is, they have high $idf_i(j)$ values. Thus, they contribute more than 50% of the total score. Row one corresponds to the upper part of bar F in figure 8.1. The words in row two have a relatively high frequency, but due to the fact that they are very common ball game terms, their IDF values are low. Together with


Group   Terms                                     Contribution   Group frequency
1       frispark, fotballspiller, stadion, rbk    56 %           4
2       sesong, kamp, lag, klubb, nummer, vei     28 %           14
3       ball, styrke, ledelse, ...                16 %           12

Table 8.3: Contribution to similarity value.

row three, which contains words with low frequency and weight values in document A, row two corresponds to the lower band of bar F.

8.1.3 Threshold value

Consider the analysis of a document which clearly does not belong to any of the domains that the system is trained to recognize. It is clearly not acceptable that the system simply chooses the domain that returns the highest similarity value. Such a case might be illustrated with figure 8.2.

Figure 8.2: Analysis of a miscellaneous document and a handball document.

Doc2 is similar to both the football and handball domains, but is correctly classified as a handball document (the document refers to a game between Norway and Australia in the World Cup 2005). The classification of Doc1 is a different matter. The test machine that ran this classification was trained to recognize documents from five different domains, namely football, handball, movies, air traffic and music. Doc1 is about an arranged marriage, and it clearly does not fit into any of the five domains. However, according to the analysis, Doc1 is most similar to the music domain, with a similarity value of 6901. In comparison, Sim(Doc2, Handball) equals 25586. In order to prevent a document like Doc1 from being classified at all, it is necessary to introduce a threshold value. If the highest similarity value for a document is lower than the threshold value, the system should quite simply refuse to put a label on the document. After experimenting with document classification using the method described above, we have come to the conclusion that the threshold should be set to about 8000. A minimal sketch of such a rejection rule is given below.
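The sketch below shows the resulting decision rule; it reuses the sim method from the sketch in section 8.1.1 and returns no label when every domain scores below the threshold.

import java.util.Map;

// Sketch of classification with threshold rejection.
public class Classifier {
    static final double THRESHOLD = 8000.0; // empirical value from this section

    static String classify(Map<String, Integer> docFrequencies,
                           Map<String, Map<String, Double>> domains) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> domain : domains.entrySet()) {
            double score = Similarity.sim(docFrequencies, domain.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = domain.getKey();
            }
        }
        // Refuse to label the document if no similarity value reaches the threshold.
        return bestScore >= THRESHOLD ? best : null;
    }
}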


Chapter 9

Results and evaluation

The discussion in this chapter is based on a test-run on a set of 20 documents. Each document is compared to the following domain representations: Air traffic, Football, Handball, Movie and Music. Table 9.1 shows the results of the test run. The Classified as column indicates which class the system has assigned the document to, the Absolute score is the similarity value for this domain, and the Relative score is the Absolute score over the sum of the similarity values from the comparisons to the other domains. In other words, $Sim(filme1.txt, Movie)$ equals 16477 and

$$\frac{Sim(filme1.txt, Movie)}{\sum_{d} Sim(filme1.txt, d)} = 46\%,$$

where $d \in \{Air\ traffic, Football, Handball, Music\}$.

9.1 General comments

The general conclusion is that the results are very promising. In order to evaluate the results more formally, we redefine the recall and precision measures to fit the test run. In this context recall is defined as the number of classified documents over the total number of documents in the set. Using a threshold value (8.1.3) of 8000, all but one document in table 9.1 are classified. fotballe1.txt's highest similarity value is only 2874, so it should not be classified. Thus, in this case we have a recall of 95%. Precision is defined as the ratio of the number of correct classifications to the total number of returned classifications. In this case, all 19 classifications are actually correct, so the precision is 100%. We would like to add that the documents used to train the system and the test set in table 9.1 were collected at approximately the same time. Also, please note that the results seem to be better for some of the domains, with air traffic as a particular example.

9.2 Biased training sets

Even though each of the domains used in the test suite has been given labels like football and handball, it does not mean that the domain representations actually cover the real world football and handball domains. In essence, the domain representations are nothing more than an extraction of knowledge from a manually picked set of documents. Consequently, if the documents included in the training sets are not representative of the class (i.e., football, handball, etc.), the ability to recognize new documents will suffer. Further, domain representations of high quality are dynamic by nature. Late in the test process we made an interesting observation: some documents that obviously were members of the football domain (like fotballe1.txt) got ratings that indicated a very strong similarity to the handball domain. It turned out that these documents predominantly dealt with the


Document name      Classified as   Absolute Score   Relative Score
filme1.txt         Movie           16477            46 %
filme2.txt         Movie           11711            46 %
filmo1.txt         Movie           10116            42 %
filmo2.txt         Movie           13597            45 %
flytrafikke1.txt   Air traffic     48150            78 %
flytrafikke2.txt   Air traffic     44059            77 %
flytrafikko1.txt   Air traffic     105214           83 %
flytrafikko2.txt   Air traffic     47309            74 %
fotballe1.txt      NA              2874             28 %
fotballe2.txt      Football        9680             64 %
fotballo1.txt      Football        25188            52 %
fotballo2.txt      Football        8732             37 %
handballe1.txt     Handball        25084            51 %
handballe2.txt     Handball        9874             43 %
handballo1.txt     Handball        8060             35 %
handballo2.txt     Handball        25586            50 %
musikke1.txt       Music           15605            60 %
musikke2.txt       Music           30401            47 %
musikko1.txt       Music           32272            55 %
musikko2.txt       Music           21850            58 %

Table 9.1: Results from test-run.


two tournaments Royal League and Champions League, which were much discussed at the time of writing (9th December 2005). However, at the time the training sets were compiled, Champions League was rarely mentioned in the media and Royal League had not yet started. Actually, most football documents included in the football training set are dealing with the finishing stage of the Norwegian Tippeliga. On the other hand, the majority of the documents in the handball training set are referring to the Norwegian national team’s preparations prior to the Woman’s World Championship, also a tournament. Due to the fact that the majority of football documents in the training set did not deal with tournaments, a certain ”tournament vocabulary” was weighted as an important part of the handball domain representation. As a result, football documents referring to for example Royal League has a substantial similarity to the handball domain. This example shows it might be necessary to keep the domain representations up-to-date or make them as complete as possible in the first place.

9.3 Indistinct domain representations

As table 9.1 depicts, the air traffic documents have very high similarity values. They also have very high relative scores, so there should be no doubt that these documents are classified correctly. Section 8.1.2 describes how the method builds unique domain representations. However, one of the strengths of the method, the ability to differentiate very similar domains, might turn out to be a weakness as well. Take the air traffic domain as an example. In the corresponding domain representation, terms like fly (eng.: aircraft), pilot (eng.: pilot) and rullebane (eng.: runway) are assigned very high weight values. One could argue that these terms are as important to the air traffic domain as the terms below the dots in table 8.2 are to the football domain. However, terms like kamp (eng.: match), mål (eng.: goal) and spiller (eng.: player) are assigned low weights, due to the fact that they also appear frequently in the handball domain. The phrase indistinct domain representations is used to describe the fact that some terms very representative of one domain tend to get too low weights if other training sets make frequent use of them as well. This is exactly what has happened to the handball and football domain representations. On the other hand, the air traffic domain is very distinct.


Chapter 10

Improvements

10.1 Coreferences

There are several ways to improve the method of concept extraction. We have a hypothesis that resolving coreferences, replacing the referring expressions with the corresponding antecedents, might improve the concept extraction process. In this section we will discuss this subject in light of an evaluation of Mitkov's algorithm and some results presented by Bernt Bremdal (managing director at CognIT) regarding reference resolution. Also, a small experiment is conducted.

10.1.1 Co-reference experiment

On the basis of the findings of [25], we conducted an experiment regarding coreferences and their influence on the concept extraction process. The intention was to replace referring expressions (pronouns) with their antecedents (nouns) in order to increase the term frequency. Our assumption was that such an increased term frequency would improve the concept extraction process and consequently the domain representations. However, rather than implementing the full algorithm presented in [25], we approached the issue less comprehensively. We implemented a simple co-reference resolver that checks for coreferences only within a sentence. The algorithm uses the output from the Oslo-Bergen tagger and examines one sentence at a time. The procedure is as follows:

1. If a pronoun is recognised, check for nouns to the left of the pronoun that match in gender and number.
2. If there exists exactly one match, replace the referring expression (pronoun) with the antecedent (noun).

Like the method proposed by [25], the algorithm only considers candidates that agree in gender and number. It could be argued that the algorithm is risk averse, due to the fact that no action is taken if more than one candidate matches. The rationale behind this is that by replacing an anaphora in cases with ambiguity (more than one matching antecedent), the probability of making an erroneous replacement is always at least 50%. However, the results were not satisfactory: they did not indicate any particular improvements, and the algorithm was subject to numerous incorrect replacements. A minimal sketch of the resolver is given below.
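The listing below sketches the simple within-sentence resolver. The Token type and its fields are our own illustrative stand-in for the information actually read from the Oslo-Bergen tagger output.

import java.util.ArrayList;
import java.util.List;

// Sketch of the simple within-sentence co-reference resolver.
public class SimpleResolver {

    static class Token {
        String word;
        String pos;    // e.g. "noun" or "pron"
        String gender; // masculine, feminine or neuter
        String number; // singular or plural
        Token(String word, String pos, String gender, String number) {
            this.word = word; this.pos = pos;
            this.gender = gender; this.number = number;
        }
    }

    // Replace a pronoun with its antecedent only when exactly one noun to the
    // left of it in the same sentence agrees in both gender and number.
    static void resolve(List<Token> sentence) {
        for (int i = 0; i < sentence.size(); i++) {
            Token pronoun = sentence.get(i);
            if (!"pron".equals(pronoun.pos)) continue;
            List<Token> matches = new ArrayList<>();
            for (int j = 0; j < i; j++) {
                Token candidate = sentence.get(j);
                if ("noun".equals(candidate.pos)
                        && candidate.gender.equals(pronoun.gender)
                        && candidate.number.equals(pronoun.number)) {
                    matches.add(candidate);
                }
            }
            if (matches.size() == 1) { // no action when ambiguous
                pronoun.word = matches.get(0).word;
            }
        }
    }
}

The simple approach works fine in sentences like the following: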


Passasjerene_i fortalte at de_i hadde sett dramaet de_i selv deltok i på nyhetskanalen CNN.

In this case the noun Passasjerene matches the criteria of the pronoun de, and the replacements are carried out, yielding the following output:

Passasjerene_i fortalte at passasjerene_i hadde sett dramaet passasjerene_i selv deltok i på nyhetskanalen CNN.

However, the simple method proved insufficient in many cases. Incorrect substitutions were mainly caused by one of three reasons, namely matches in number and gender with a wrong antecedent (1), incorrect tagging by the Oslo-Bergen tagger (2) and semantic errors in the text (3). Examples of the different cases are shown in the list below:

1. Matching number and gender with the wrong antecedent: This problem occurs when a referring expression refers to an antecedent in a previous sentence, but there exists a unique noun that matches in number and gender within the sentence of the referring expression. This is illustrated by the following sentence:

I serieåpningen_i mot Molde scoret han_j årets første Brann-mål.

In this case serieåpningen and han are both singular and masculine terms and therefore match in both number and gender. As a consequence, the pronoun han is replaced by the noun serieåpningen. It is, however, obvious to a human that these two terms do not corefer, but the simple method does not detect this.

2. Incorrect tagging: The method also suffers from occasional erroneous tagging by the Oslo-Bergen tagger. One example is given below:

En av passasjerene sier til AP_i at det_j var surrealistisk å følge med på sendingen.

The Oslo-Bergen tagger tags the proper noun AP in this sentence as a neuter, singular noun, which matches the pronoun det. As a result, an incorrect replacement occurs.

3. Semantic errors: Occasionally, incorrect replacements due to semantic errors in the analysed texts were discovered, as in the following example:

Rett etter at Chelsea ble tildelt straffe_i fikk også Blackburn sjansen fra krittmerket_j, dette_i av det mer billige slaget.

The author has used the pronoun dette instead of denne as the referring expression for straffe. Unfortunately (in this case), dette and krittmerket match in both gender and number, and a wrong replacement takes place.

These are all examples of situations that lead to incorrect replacements of referring expressions. Whereas the first example is quite easy to solve, the two latter are more difficult. The first one can be handled by implementing a more complex algorithm that aligns better with


the suggestions of [25]. Despite being a time consuming and challenging task, it is perfectly feasible within the scope of the upcoming thesis. Solving erroneous tagging, however, requires an improved part-of-speech tagger, while incorrect grammar in texts is beyond our control. The evaluation of Mitkov's algorithm for anaphora resolution shows a success rate of 84% to 95%, depending on the text material and the number of anaphoric references relative to the document [25]. As the method is robust in the sense that it always suggests an antecedent for the referring expression, it is more correct to use a success rate than recall and precision measures. The method is knowledge poor and suffers in syntactically complex sentences, since it does not rely on any syntactical knowledge. Based on this evaluation it is fair to assume that implementing the full algorithm for anaphora resolution would yield similar success rates for our system, which is fairly good considering the fact that it is a knowledge poor approach. However, the big question remains: how will such resolution of coreferences impact the representation of the document? Bernt Bremdal has conducted several manual experiments regarding the problem of coreferences in texts. He has manually resolved coreferences by simply replacing the referring expressions with the corresponding antecedents. Not surprisingly, such replacements result in higher frequencies for terms that are often referred to, as illustrated in figure 10.1 and figure 10.2.

Figure 10.1: Frequency of extracted concepts before co-reference resolution.

Figure 10.2: Frequency of extracted concepts after co-reference resolution.


Based on the evaluation of Mitkov's algorithm for anaphora resolution and Bremdal's experiments regarding the impact of resolving such references in texts, we see that a correct resolution of references would increase the frequency of important concepts. But does this necessarily improve the concept extraction process? During the evaluation of our experiment we found that the results of the concept extraction process are difficult to measure, as identifying the most important concepts in a text is a process with multiple interpretations. As our simple approach to anaphora resolution was subject to several incorrect replacements as well, it was hard to measure the impact on the concept extraction process. It is, for example, hard to judge which of the two concept extractions in table 10.1 best represents a text describing a football match between Chelsea and Blackburn. These results stem from a concept extraction analysis before and after simple anaphora resolution.

Original concepts (weight)   Concepts after simple anaphora resolution (weight)
Chelsea (13.2)               Chelsea (13.2)
gjestene (11.2)              Blackburn (12.2)
Blackburn (10.6)             gjestene (11.2)
hvilen (7.5)                 krittmerket (11.2)
Lampard (7.5)                kvarter (11.2)

Table 10.1: Comparison of concept extraction before and after simple anaphora resolution.

From an objective point of view it is difficult to put preference on one concept extraction over another. Please note, however, that the weights of the concepts are higher after resolving anaphoric references. In this case some of the improved weighting might originate from incorrect replacements. Due to the fact that the results of a concept extraction could have multiple correct interpretations, we suspect that coreference resolution does not necessarily improve the concept analysis. Verifying this hypothesis, however, requires a large experiment with manual coreference resolving. This is considered to be beyond the scope of this report.


Part III

Syntactical relations


Chapter 11

Introduction

The method presented and discussed in part II enables classification of documents and delivers promising results (9.1). However, one could argue that the method is a surface analysis and that it ignores one important aspect of natural language, namely the composition and order of terms. The main objective of the domain representations is to differentiate the various domains, and thus distinguishing factors are included in the representations. Actually, no syntactical information is taken into consideration. As stated in hypothesis 2 in section 5.1, we believe that a substantial amount of information is lost by ignoring the compositional element of natural language. This might seem obvious, but due to the fact that the classification method in part II provided such good results, it is challenging to pinpoint how this information should be applied. As a matter of fact, we do not intend to use the syntactical relations to improve the classification process. The main reason for this decision is the nature of the data material. Using the syntactical relations obtained from the output of the Oslo-Bergen tagger, it is extremely difficult to differentiate documents. As discussed in section 12, it turns out that the same relations appear more or less consistently within the various domains. A relation involving si (eng.: say(s)) is a good example of such relations. Instead of using the relations to improve the method, this part investigates how to use them to extend the analysis. By this approach we leave the classification method as it is and look at the relations in the context of the document class. In other words, consider a document A which is classified as a football document using the method described in part II. As soon as the document class is found, the syntactical relations come into use. Figure 11.1 depicts a possible internal representation of the football domain.

Figure 11.1: Internal representation of the football domain.

Rather than using terms such as say or score to distinguish one domain from another, these terms could be used to find similar concepts in the text corpus. One way to approach the problem would be to determine which concepts have the ability/property of saying,


who has the ability/property of scoring, and what is scored. In the following sections, a couple of questions will be addressed. First of all, using the syntactical relations extracted from domain specific documents, is it possible to create an internal hierarchical representation of the domain as depicted in figure 11.1? Assuming that such a representation exists, is it possible to extend the classification of a football document into subclasses?


Chapter 12

Practical experiment

The following sections describe an experiment conducted to assess the usefulness of syntactical relations in order to identify similar concepts and create subclasses within a domain. Please note that this problem is very similar to ontology engineering and automatic acquisition of terminological knowledge from domain texts, which today remains unsolved [13].

12.1 Input data

In order to bring the classification method (7.1) to a syntactical level, it is necessary to tag the textual data in a part of speech manner with syntactical features. As we will see, the Oslo-Bergen tagger (section 4.1) is a good starting point for syntactical relation extraction. In particular, it is possible to implement a framework on top of the tagger for relation analysis. The Spartan script is an extension of the Oslo-Bergen tagger and returns dependency relations [32]. The output of Spartan is triples consisting of a syntactical relation identifier and two words. Also, the position of the words within the sentence and the sentence number are indicated. Table 12.1 shows sample output of Spartan:

Relation   Word1    Word2            Pos of Word1   Pos of Word2   Sentence
V_O_N      vise     klasse           2              7              311
V_S_N      vise     spiss            2              3              311
V_O_N      score    mål              7              10             314
V_S_N      score    spiss            7              4              314
V_O_N      vinne    kamp             4              5              318
V_O_N      inneha   17.plassen       2              3              319
V_S_N      inneha   West Bromwich    2              1              319
V_S_N      åpne     overgangsvindu   19             18             209
A_mod_N    sist     serierunde       8              9              659
A_mod_N    sist     runde            12             13             661
A_mod_N    god      kamp             6              7              662

Table 12.1: Output of the Spartan-script.

The output of the Spartan script is fed into a Java application written specifically for this project. The application splices relations and represents them in such a way that the data can be analysed effectively. The output in table 12.1 is transformed by the Java application into the representation shown in table 12.2.


Relation   Subject          Verb     Object
S_V_O      spiss            vise     klasse
S_V_O      spiss            score    mål
V_O                         vinne    kamp
S_V_O      West Bromwich    inneha   17.plassen
V_S        overgangsvindu   åpne

Relation   Noun         Adjective
A_mod_N    serierunde   sist
A_mod_N    runde        sist
A_mod_N    kamp         god

Table 12.2: Relation representation in the Java application.

The notation used for relations in this report is [subject, verb, object], where either the subject or the object might be missing. Hence, [spiss, score, mål], [spiss, skyte, _] and [_, spille, landskamp] are valid syntactical relations. It is also worth noting that section 12.3 discusses the quality of the output of the Oslo-Bergen tagger, which is obviously a very important factor in the overall performance of the analysis. A minimal sketch of the splicing step is given below.
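The listing below sketches the splicing step, under the assumption that the Spartan rows have already been parsed into a simple record; the type and field names are ours. V_S_N and V_O_N rows that share a verb and a sentence number are merged into one triple.

import java.util.HashMap;
import java.util.Map;

// Sketch of splicing Spartan rows (table 12.1) into triples (table 12.2).
public class RelationSplicer {

    record Row(String relation, String verb, String noun, int sentence) {}
    record Triple(String subject, String verb, String object) {}

    static Map<String, Triple> splice(Iterable<Row> rows) {
        Map<String, Triple> triples = new HashMap<>();
        for (Row row : rows) {
            if (!row.relation().equals("V_S_N") && !row.relation().equals("V_O_N")) {
                continue; // A_mod_N rows are kept in a separate table
            }
            // Key a verb occurrence by sentence number and verb form.
            String key = row.sentence() + ":" + row.verb();
            Triple t = triples.getOrDefault(key, new Triple(null, row.verb(), null));
            if (row.relation().equals("V_S_N")) {
                t = new Triple(row.noun(), t.verb(), t.object());
            } else {
                t = new Triple(t.subject(), t.verb(), row.noun());
            }
            triples.put(key, t);
        }
        return triples;
    }
}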

12.2 Noun similarity

Section 3.2.3 of part I describes and discusses co-occurrence. By measuring the distance between pairs of words within sentences it is possible to identify dependency relations and thus word similarity. In this experiment co-occurrence is approached from a slightly different angle, and aims to study noun similarity on the basis of the distribution of subjects, verbs and objects. This experiment is based on Hindle's co-occurrence score [16], which is an estimate of Mutual Information (MI). In effect, the MI score is applied as a filter on the relations extracted by the Oslo-Bergen tagger.

12.2.1 Mutual Information and Hindle's co-occurrence score

In probability theory, Mutual Information scores are used to measure the mutual dependence of two random variables $X$ and $Y$. If $X$ and $Y$ are independent, the MI-score should be zero. On the other hand, two related variables should yield a high score. In this experiment, $X$ and $Y$ are both [subjects and verbs] and [objects and verbs]. We want to measure the MI-scores of tuples such as:

SV   [hareide, si]    [hareide, say]
SV   [spiss, score]   [striker, score]
VO   [mål, score]     [goal, score]

Eventually, the MI-scores of verbs and arguments are used to measure the similarity between nouns in the corpus. In the discrete case (as for the distribution of subjects, verbs and objects), MI can be defined as:

$$MI(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log_2 \frac{p(x, y)}{f(x)\, f(y)} \qquad (12.1)$$

where $p$ is the joint probability distribution of $X$ and $Y$ and $f$ is the marginal distribution of $X$ and $Y$. The mutual information of two events $MI(x; y)$ is thus defined as:

$$MI(x; y) = p(x, y) \log_2 \frac{p(x, y)}{f(x)\, f(y)} \qquad (12.2)$$

In equation 12.2, $x$ and $y$ might denote the two events spiss (striker) and score (score), respectively. Hindle [16] defines two co-occurrence scores, $C_{subj}(n\ v)$ and $C_{obj}(n\ v)$. These scores are estimations of the mutual information of pairs of nouns and verbs. In $C_{subj}$ the nouns are subject arguments of the verbs, in $C_{obj}$ the nouns are object arguments. Equation 12.3 defines $C_{subj}$:

$$C_{subj}(n\ v) = \log_2 \frac{f(n\ v)/N}{(f(n)/N)\,(f(v)/N)} \qquad (12.3)$$

where f(n v) is the frequency of noun n occurring as subject of verb v, f(n) is the frequency of the noun n occurring as argument of any verb, f(v) is the frequency of the verb v, and N is the total number of relations [16]. C_obj is defined analogously. Each noun and verb pair in the corpus thus has two co-occurrence values, C_subj(n v) and C_obj(n v). These values are used to compute the subject similarity and object similarity of two nouns with respect to a verb:

\[ SIM_{subj}(v_i\,n_j\,n_k) = \begin{cases} \min[C_{subj}(v_i\,n_j),\,C_{subj}(v_i\,n_k)] & \text{for } C_{subj}(v_i\,n_j) > 0 \text{ and } C_{subj}(v_i\,n_k) > 0 \\ |\max[C_{subj}(v_i\,n_j),\,C_{subj}(v_i\,n_k)]| & \text{for } C_{subj}(v_i\,n_j) < 0 \text{ and } C_{subj}(v_i\,n_k) < 0 \\ 0 & \text{otherwise} \end{cases} \tag{12.4} \]

SIM_obj(v_i n_j n_k) is defined analogously. Intuitively, SIM_obj(use striker goalkeeper) says something about the similarity of the nouns striker and goalkeeper as object arguments of use. Finally, Hindle defines the noun similarity of two nouns as:

\[ SIM(n_1\,n_2) = \sum_{i=0}^{N} \left( SIM_{subj}(v_i\,n_1\,n_2) + SIM_{obj}(v_i\,n_1\,n_2) \right) \tag{12.5} \]

Equation 12.5 is based on the distributional hypothesis, which states that nouns are similar to the extent that they share verb contexts [16]. The metrics of equations 12.3 and 12.5 have been implemented in the Java application that sits on top of the Oslo-Bergen tagger. The corpus used in the experiment consists of the documents in the training set for the football domain used to build the domain representation in section 7.1, plus 50 general articles about the Norwegian national team. The total number of documents is 86 and the total number of subject-verb, subject-verb-object and verb-object relations is 1892. We have analysed relations that include the most frequent concepts in the football domain, which are mål (goal), kamp (match), spiller (player), lag (team), spiss (striker), seier (victory), sjanse (chance), ball (ball), poeng (point), innlegg (cross) and stadion (stadium).
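As a concrete reading of equations 12.3 to 12.5, the sketch below computes Hindle's scores from simple frequency tables. The class and method names (HindleModel, count, c, sim, similarity) are our own illustration rather than the project's actual code, and unseen noun-verb pairs are scored 0 as a simplification (the pointwise MI of an unseen pair is undefined).

    import java.util.*;

    public class HindleModel {

        private final Map<String, Integer> pairFreq = new HashMap<>(); // f(n v), keyed "n|v"
        private final Map<String, Integer> nounFreq = new HashMap<>(); // f(n)
        private final Map<String, Integer> verbFreq = new HashMap<>(); // f(v)
        private int total;                                             // N, total number of relations

        // One model instance counts either subject-verb or object-verb pairs.
        public void count(String noun, String verb) {
            pairFreq.merge(noun + "|" + verb, 1, Integer::sum);
            nounFreq.merge(noun, 1, Integer::sum);
            verbFreq.merge(verb, 1, Integer::sum);
            total++;
        }

        // Equation 12.3: C(n v) = log2( (f(n v)/N) / ((f(n)/N)(f(v)/N)) ).
        public double c(String noun, String verb) {
            Integer fnv = pairFreq.get(noun + "|" + verb);
            if (fnv == null) return 0.0; // simplification for unseen pairs
            double p = (double) fnv / total;
            double fn = (double) nounFreq.get(noun) / total;
            double fv = (double) verbFreq.get(verb) / total;
            return Math.log(p / (fn * fv)) / Math.log(2);
        }

        // Equation 12.4: per-verb similarity of two nouns.
        public double sim(String verb, String n1, String n2) {
            double c1 = c(n1, verb), c2 = c(n2, verb);
            if (c1 > 0 && c2 > 0) return Math.min(c1, c2);
            if (c1 < 0 && c2 < 0) return Math.abs(Math.max(c1, c2));
            return 0.0;
        }

        public Set<String> verbs() { return verbFreq.keySet(); }

        // Equation 12.5: total noun similarity, summed over all verbs in both
        // subject and object position.
        public static double similarity(HindleModel subjModel, HindleModel objModel,
                                        String n1, String n2) {
            Set<String> verbs = new HashSet<>(subjModel.verbs());
            verbs.addAll(objModel.verbs());
            double sum = 0.0;
            for (String v : verbs) {
                sum += subjModel.sim(v, n1, n2) + objModel.sim(v, n1, n2);
            }
            return sum;
        }
    }

Feeding the spliced relations into two such models (one counting subject arguments, one counting object arguments) and calling HindleModel.similarity(subjModel, objModel, "spiss", "spiller") would produce scores of the kind reported in table 12.3.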

12.2.2 Noun-noun similarity

The similarities in this section are computed using equation 12.5. Table 12.3 shows concepts in the football domain and related nouns. A reasonable interpretation of a related noun is that it can be substituted for the corresponding concept. Since equation 12.4 measures the similarity between two nouns with respect to a verb, the substitution is not necessarily semantically correct, but it is valid with respect to the verb. One of the relations including spiller is [spiller, ta, ansvar]; both lag and tropp might replace spiller in this and similar relations. Note that proper names are not included in the set of related nouns.

Concept   Top 5 related nouns (excluding proper names)
mål       rest (43.72), trøstemål (43.72), utligningsmål (43.72), brann-mål (43.72), spiller (35.90)
kamp      ellever (28.67), hodeduell (25.83), kvalifisering (20.27), fotballkamp (20.25), landslagstropp (19.11)
spiller   tropp (49.79), mann (36.26), plass (36.08), mål (35.90), lag (33.32)
lag       spilletid (94.60), dødball (94.60), kaptein (86.00), øyeblikk (52.56), angrep (52.56)
spiss     a-landslagstrener (630.67), klassekeeper (63.07), oslo-gutt (63.07), forbilde (23.65), tid (16.89)
seier     trøkk (63.07), fokus (63.07), fasong (63.07), landskamp (27.03), ledelse (20.01)
sjanse    ros (118.25), utlikning (15.26), scoring (15.26), resultat (15.26), ledelse (15.26)
ball      [EMPTY SET]
poeng     oppmerksomhet (236.50), plass (67.57), rom (16.89), stjerne (16.89), spillerbørs (16.89)
innlegg   meter (315.33), buepasning (36.38), ledelse (26.08), sensasjonslag (18.19), pasning (18.19)
stadion   moderklubb (315.33), bane (315.33), klubb (157.67), stilling (157.67), supporter (94.60)

Table 12.3: Concepts in the football domain and similar nouns.

The results in table 12.3 are probably subject to noise, both from the corpus and from the tagging. Nevertheless, it is possible to identify some useful relations, such as the clusters {mål, trøstemål, utligningsmål, brann-mål}, {sjanse, utlikning, scoring, ledelse} and {innlegg, buepasning, pasning}. It is worth noting that these clusters were identified manually; we do not propose a method for automatic processing of the relations in the table. In contrast, the results recorded in table 12.4 seem to be a good starting point for identifying instances of concepts within the domain. These data are also subject to some noise, although to a much smaller extent.

Concept   Similar proper nouns
spiller   carew, rosenborg, nord-irland, grodås, hassan, gunnar, hareide, lundekvam, rösler, flo
lag       lyon, tsjekkia, skottland, rosenborg, chelsea, hviterussland, arsenal
spiss     hareide, owen, pedersen, shearer, andy, sørensen, riseth, vassell, gerrard, ronaldinho
kamp      norge, tsjekkia, rösler, cole, flo, pires, skottland

Table 12.4: Concepts in the football domain and related proper names.

It is fair to say that {carew, grodås, hassan, gunnar, hareide, lundekvam, rösler, flo} are all players, that {lyon, tsjekkia, skottland, rosenborg, chelsea, hviterussland, arsenal} are all teams, and that {hareide, owen, pedersen, shearer, andy, sørensen, riseth, vassell, ronaldinho} are all strikers.


12.2.3 Noun-verb dependencies

Further, dependencies between the concepts and verbs are computed using equation 12.3. Tables 12.5 and 12.6 show the nouns as subject arguments and object arguments of the verbs, respectively.

Concept (subjects)   Related verbs
kamp      spørres (111.29), telle (111.29), rokke (111.29), sondere (111.29), starte (13.91), love (11.13), gi (7.18), gå (6.36), se (4.64), spile (3.65), spille (2.53), komme (1.85)
mål       dra (189.20), komme (15.67)
spiller   nevne (63.07), ofre (63.07), forhindre (63.07), håndtere (63.07), bevise (63.07), regulere (63.07), synke (31.53), slippe (31.53), trå (31.53), forsterke (21.02)
sesong    [EMPTY SET]
lag       ramme (172.00), forsvare (86.00), overraske (86.00), tilby (57.33), tape (24.57), ligge (13.23), komme (13.23), legge (8.19), spille (3.91)
spiss     avslutte (94.60), innrømme (63.07), vise (23.65), bruke (15.77), score (13.51)
seier     bringe (630.67), gjøre (16.17), komme (10.51)
poeng     [EMPTY SET]
innlegg   føre (473.00), slå (18.19), komme (15.77)
stadion   stå (94.60), nå (78.83)

Table 12.5: Concepts in subject position and related verbs.

Table 12.5 states that a player prevents (en spiller forhindre), that a team defends (et lag forsvare) and that a striker scores (en spiss scorer). Similarly, from table 12.6 we can extract the knowledge that it is possible to win a game (å vinne en kamp), to score a goal (å score et mål) and to substitute a player (å bytte en spiller).

12.2.4 Conclusion

The motivation for this experiment was to study what information the syntactical relations carry, and whether this information can aid the process of finding similar concepts and creating subclasses within a domain. We have applied Hindle's approximation to mutual information [16] in order to identify noun dependencies with respect to verbs. The relations between proper names and concepts (table 12.4) yield promising results, and these dependencies might be useful for creating subclasses within the domain. Further, the relations between verbs and concepts shown in tables 12.5 and 12.6 add useful information to these subclasses. Figure 12.1 depicts how this information can be used to create a hierarchical representation of a part of the football domain. Again, we stress that even though the figure is based on information automatically acquired from the corpus, substantial manual intervention is required to arrive at the representation. In this representation, the football domain has three subclasses: lag (team), spiller (player) and spiss (striker). The leaf nodes are instances, and the word clusters to the left and right of the subclasses are verbs that take the subclass as object argument and subject argument, respectively. Figure 12.1 is included mainly for illustrative purposes, but we believe it is realistic to build such a hierarchy automatically by increasing the size of the text corpus.


Concept (objects)   Related verbs
kamp      sone (57.33), kalle (28.67), angripe (28.67), kontrollere (28.67), analysere (28.67), prege (19.11), oppsummere (19.11), punktere (19.11), vinne (17.64), vurdere (14.33), spile (10.34), ødelegge (9.56), tape (8.19), starte (7.17), miste (5.73), spille (2.61), se (2.39), bruke (2.39)
mål       score (43.72), nærme (27.82), love (22.26), lage (11.13), holde (9.27), komme (1.85)
spiller   klandre (99.59), kritisere (49.79), prestere (49.79), bytte (29.87), samle (24.89), nærme (24.89), presentere (24.89), finnes (24.89), la (16.60), begynne (11.06), velge (11.06), lage (9.96)
sesong    [EMPTY SET]
lag       bety (94.60), skape (52.56), møte (39.42), tro (31.53)
spiss     kjøpe (630.67)
seier     miste (63.07), se (13.14), ta (11.26)
poeng     plukke (473.00), stjele (236.50), trenges (236.50), berge (118.25)
innlegg   løpe (315.33), slå (36.38)
stadion   forlate (315.33)

Table 12.6: Concepts in object position and related verbs.

Figure 12.1: Hierarchical representation.



Hindle [16] points out that the derivation of semantic relatedness depends on large text corpora. Further, Lech and de Smedt [23] have conducted a similar experiment with a corpus of approximately the same size as the one used here; however, the documents in their corpus all dealt with two very similar murder investigations, which in turn reduces the noise in the set of syntactical relations.

The set of syntactical relations is also subject to parse errors and misidentification. As long as the errors are consistent, this is usually not a problem for large text corpora; with smaller corpora, however, such noise is not balanced out.

A final remark concerns the number of nouns occurring as object arguments of verbs. After parsing the texts with the Oslo-Bergen tagger and running the Spartan script, the resulting data set consisted of 1892 syntactical relations: 56% are subject-verb relations, 20% subject-verb-object relations and 22% verb-object relations. This distribution might indicate that it is more difficult to identify the object in a grammatical analysis than the subject; if so, it will affect performance. Section 12.3 elaborates on noise in the data material caused by the Oslo-Bergen tagger.

To pick up the thread from the introduction of this chapter: we do not believe it is possible to extend the classification technique proposed in part II by using the same corpus as was used for building the domain representations. However, by increasing the size of the corpus it seems within reach to build subclasses and identify instances.

12.3 Noise in the output of the Oslo-Bergen tagger

As presented in section 4.1, the Oslo-Bergen tagger provides good recall and precision with regard to part-of-speech tagging. The syntactical information, however, does not seem to be of the same quality; this is especially true for the syntactical subject and object tags. In order to extract relational information we have made extensive use of output from the Oslo-Bergen tagger and the Spartan script, and during this process some problems have come to light. In this section syntactical relations are denoted by [subject, verb, object] triples. In the Oslo-Bergen tagger, the syntactical tags for subjects and objects are denoted by @subj and @obj, respectively. The tagger performs satisfactorily in simple cases like

Thomas sparket ballen.
Thomas kicked the ball.

which results in the syntactical relation [Thomas, sparke, ball]. For sentences with a similar structure the extracted relations provide good precision. However, several groups of sentences cause problems for the relational analysis.
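As an illustration of the extraction step just described, the sketch below reads a triple off the @subj and @obj tags of a tagged sentence. The Token representation is a simplification of our own design, not the tagger's actual output format. It also hints at why the passive-voice problem discussed below arises: the extraction blindly trusts the @subj tag.

    import java.util.Arrays;
    import java.util.List;

    public class TripleExtraction {

        // Simplified token: lemma, part of speech, and syntactical tag (may be null).
        record Token(String lemma, String pos, String syntacticTag) {}

        // Reads a [subject, verb, object] triple off the @subj/@obj tags.
        static String[] extractTriple(List<Token> sentence) {
            String subj = null, verb = null, obj = null;
            for (Token t : sentence) {
                if ("@subj".equals(t.syntacticTag())) subj = t.lemma();
                else if ("verb".equals(t.pos()) && verb == null) verb = t.lemma();
                else if ("@obj".equals(t.syntacticTag())) obj = t.lemma();
            }
            return new String[]{subj, verb, obj};
        }

        public static void main(String[] args) {
            // "Thomas sparket ballen." -> [Thomas, sparke, ball]
            List<Token> sentence = List.of(
                    new Token("Thomas", "noun", "@subj"),
                    new Token("sparke", "verb", null),
                    new Token("ball", "noun", "@obj"));
            System.out.println(Arrays.toString(extractTriple(sentence)));
        }
    }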

12.3.1 Passive voice

The most obvious problems occur in sentences written in the passive voice. In such cases the relations extracted are consistently wrong: the first noun is given the syntactical tag @subj, and when combined with the main verb, this corrupts the relational information. Consider the following sentence:

Maten ble spist av hunden.
The food was eaten by the dog.

This sentence results in the subject-verb relation [mat, spise, _], which is obviously wrong, as the eating is performed by the dog.


12.3.2 Complementary sentences

The second issue arises in complex sentences that include complementary clauses, like the following:

Nils fortalte at Sting er en artist.
Nils said that Sting is an artist.

In such sentences the tagger consistently tags the first word following the main verb as the object; in this case at is tagged as the object. The Spartan script, however, helps avoid such errors, as it only considers nouns to be valid subjects and objects. On the other hand, the fact that the Spartan script only takes nouns into consideration eliminates other relations that could have been of interest, namely cases where a noun is replaced by a pronoun. The simple sentence

Han traff ballen.
He hit the ball.

is correctly given the syntactical structure [han, treffe, ball] by the Oslo-Bergen tagger, but this is not reflected by the Spartan script, as the subject is a pronoun.

12.3.3 Collocations

The Oslo-Bergen tagger does not handle collocations, so collocations are not reflected in the extracted relations. Only the last part of the collocation is tagged as either subject or object, which leaves out important information. The sentence

Ole Johnny kastet ballen.
Ole Johnny threw the ball.

results in the syntactical structure [Johnny, kaste, ball]. Information about the "thrower" of the ball is therefore lost.

12.3.4 Is-relations

Information is lost in sentences containing a subject in relation with the verb være (eng: be). This is because the phrase following the verb is considered a subject predicative and not an object. In the sentence

Paul er lastebilsjåfør.
Paul is a truck driver.

lastebilsjåfør (eng: truck driver) is regarded as a subject predicative and receives the tag @s-pred. Therefore no subject-verb-object relation is extracted from this sentence. Regardless of the syntactical correctness of the @s-pred tag, the fact that Paul is a truck driver is still relevant in order to establish is_a relations.


Chapter 13

Conclusion

This report is divided into three main parts: a literature review (I), a proposal of a method that classifies text documents (II) and an experiment that assesses the utility of syntactical relations in text documents (III). The rationale behind the first part is to outline and discuss theories relevant to the objectives of parts II and III.

Part I discusses classification theory in general and begins by stating that classification is to a large extent impossible without at least a minimum of prior knowledge about the various domains. Analogous to this is the fundamental difference between a bottom-up and a top-down approach to information extraction. This pointed us in the direction of using training sets to build domain representations; we have thus chosen a goal-driven approach in which domain knowledge is at hand to aid the classification process. Further, part I identifies two main directions within the information extraction field, namely the statistical approach and the rule-based approach. Although the split is rarely that clear-cut in practice, we treat the two strategies separately. This part concludes that the vector space model (VSM) is particularly suitable for domain representations, and that latent semantic indexing (LSI) can supply valuable synonymy and polysemy information. Constraint grammar is considered a useful approach to rule-based text parsing; the Oslo-Bergen tagger is based on this grammar formalism. Syntax-driven semantic analysis and FOPC are regarded as useful for applying semantics to text and representing meaning; they are, however, complex to implement. Part I ends with a discussion of a possible combination of the statistical and the rule-based approach. The two hypotheses at the end of part I are the starting points for part II and part III, respectively.

Part II proposes a method for text document classification that is based on the VSM. The results so far have been very promising, and the method is able to differentiate documents from very similar domains as well as very different domains. Each domain that is to be recognized must be represented by a training set of documents; it turns out that training sets of 30 to 35 documents are sufficient. The method is purely based on statistical measures and, since training sets are necessary, it is semi-automatic. We have also studied the effect of co-reference resolution in terms of improved concept extraction. At this stage it is difficult to draw a conclusion one way or the other, but we have observed that terms that are often referred to get higher frequencies after resolution.

The objective of part III is to analyse syntactical relations such as subject-verb, subject-verb-object and verb-object in order to find out whether these relations can be used to create hierarchies of subclasses within a domain. An experiment was conducted on a corpus of 86 football documents. The conclusion of this experiment is that either a large or a very consistent corpus is required in order to extract hierarchical relations.


In other words, it is not possible to apply the relations as an extension to the classification method proposed in part II without increasing the size of the training set considerably. However, the test run indicated that it is possible to identify subclasses and instances of these subclasses. We believe the results are interesting and that further work is necessary in order to reveal the informational power of predicate relations.


Bibliography

[1] T. Amble. The Understanding Computer: Natural Language Understanding in Practice. 2004.
[2] M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573–595, 1995.
[3] C. Biemann. Language-independent methods for enriching corpora. 2005.
[4] C. Biemann, S. Bordag, G. Heyer, U. Quasthoff, and C. Wolff. Language-independent methods for compiling monolingual lexical data. Leipzig University, Computer Science Institute, NLP Dept., 2004.
[5] E. Brill and M. Marcus. Automatically acquiring phrase structure using distributional analysis. DARPA Workshop on Speech and Natural Language, Harriman, N.Y., 1992.
[6] P. F. Brown, V. J. Della Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[7] I. Dagan, L. Lee, and F. C. N. Pereira. Similarity-based models of word cooccurrence probability. Kluwer Academic Publishers, Boston, pages 1–31, 1999.
[8] J. Davis, D. Fensel, and F. van Harmelen. Towards the Semantic Web. John Wiley and Sons, Ltd, 2003.
[9] H. de Swart. Introduction to Natural Language Semantics. CSLI Publications, 1998.
[10] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[11] S. T. Dumais, G. W. Furnas, T. K. Landauer, and S. Deerwester. Using latent semantic analysis to improve information retrieval. In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, pages 281–285, 1988.
[12] R. Engels and B. Bremdal. On-To-Knowledge: Information extraction: State-of-the-art report. 2000.
[13] L. Gillam and M. Tariq. Ontology via terminology? In Proceedings of the Workshop on Terminology, Ontology and Knowledge Representation (TermInO 2004). Technical report, Department of Computing, University of Surrey, 2004.
[14] A. G. Hamilton. Logic for Mathematicians. Cambridge University Press, 1988.
[15] J. Haugeland. Understanding natural language. Journal of Philosophy, 76:619–632, 1979.
[16] D. Hindle. Noun classification from predicate-argument structures. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 268–275, 1990.
[17] M. Huth and M. Ryan. Logic in Computer Science: Modelling and Reasoning about Systems. Cambridge University Press, 2004.
[18] J. B. Johannessen. En grammatisk tagger for norsk. 1998.
[19] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[20] D. Jurafsky and J. H. Martin. Speech and Language Processing. Prentice-Hall, Inc., 2000.
[21] F. Karlsson. Constraint grammar as a framework for parsing running text. Papers presented to the 13th International Conference on Computational Linguistics, 3:168–173, 1990.
[22] F. Karlsson, A. Voutilainen, J. Heikkilä, and A. Anttila. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Berlin and New York: Mouton de Gruyter, 1995.
[23] T. C. Lech and K. de Smedt. Ontology extraction for coreference chaining. 2005.
[24] G. R. Lefrançois. Theories of Human Learning: What the Old Man Said. Wadsworth Publishing Company Inc., 2000.
[25] R. Mitkov. Robust pronoun resolution with limited knowledge. University of Wolverhampton, 1998.
[26] S. D. Richardson. Bootstrapping statistical processing into a rule-based natural language parser. Microsoft Research, pages 96–103, 1994.
[27] K. Sagae and A. Lavie. Combining rule-based and data-driven techniques for grammatical relation extraction in spoken language. Carnegie Mellon University, 2003.
[28] G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., 1971.
[29] G. Salton. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[30] G. Salton and C. Buckley. Term-Weighting Approaches in Automatic Text Retrieval. Department of Computer Science, Cornell University, 1987.
[31] A. Ushioda. Hierarchical clustering of words and application to NLP tasks. Fujitsu Laboratories Ltd., pages 28–41, 1997.
[32] E. Velldal. Modeling word senses with fuzzy clustering. Cand.philol. thesis, University of Oslo, 2003.

