A Semantic Retrieval Method Based on Ontology

Advanced Materials Research ISSN: 1662-8985, Vols. 989-994, pp 2179-2183 doi:10.4028/www.scientific.net/AMR.989-994.2179 © 2014 Trans Tech Publications, Switzerland

Submitted: 2014-05-24 Accepted: 2014-05-29 Online: 2014-07-16

Qi Shen, Meng Zhang
School of Software Engineering, Beijing University of Technology, Beijing, China
[email protected], [email protected]

Keywords: ontology; concept similarity; weight; semantic expansion.

Abstract. Semantic retrieval stands at the crossroads between Natural Language Processing and Machine Intelligence. This paper analyzes semantic search methods, studies concept similarity algorithms, and discusses how weight influences concept similarity. On this basis, it proposes a new semantic search method based on ontology and applies it to tourism information retrieval, making the tourism information retrieval service more intelligent.

1. Introduction

With the rapid development of the Internet and mobile communication technology, the Web has become a global source of information. Finding the needed information quickly and accurately in such vast resources has become a difficult problem for users. Traditional information retrieval matches the keywords that users enter. In the majority of cases, however, simple keyword matching cannot capture the user's real intent, which leads to the low accuracy of current information retrieval.

Some researchers abroad have built practical systems based on ontology, such as the Ontoseek directory system, which performs content-based retrieval and integrates a product tree structure with online yellow pages. It combines formal representation capability with an ontology content matching mechanism and uses the ontology together with a database dictionary, so that the user can enter natural language, which is then mapped into the domain ontology vocabulary for semantic retrieval[1]. In Ontoseek, however, the relationships between concepts and vocabulary terms are not subject to any constraints, so these relationships may play no role, and the retrieved results may not be what the user requires.
Current tourism yellow-page service systems, such as "Ctrip" and "road ox net", only perform simple word matching on the keywords entered by the user. They do not understand the semantics of the input and therefore cannot retrieve the information the user really needs. Information retrieval must therefore move from the existing keyword-matching level to the knowledge and semantic level, with information understood and expressed on that basis, so that a retrieval model can be designed that understands the semantics of the user's request[2].

2. The Semantic Search Method Based on Ontology

The basic idea of semantic retrieval with query expansion is to calculate the semantic similarity between the input keyword and other concepts, and then add every concept whose similarity is greater than a threshold to a new set that serves as the new keywords for retrieval[3].

A. The Concept Similarity

Concept similarity refers to the semantic closeness of two concepts. With the support of an ontology, there are three models for calculating concept similarity:

1) Distance Model

A constructed ontology forms a tree structure. With the aid of this structure, the distance model matches each keyword to the corresponding node in the ontology tree[3]. The similarity between two concepts is measured by the distance between the two nodes in the tree. For simplicity in practice,


it adopts a simple calculation in which every edge of the ontology tree has the same length, quantized as 1 for example. The shortest distance between two nodes in the ontology tree therefore equals the semantic distance between the two concepts[4]. The concept similarity is calculated as follows:

$$\mathrm{Similar}(x_1, x_2) = \frac{2 \times (\mathrm{MaxLen} - 1) - \mathrm{Min}}{2 \times (\mathrm{MaxLen} - 1)} \tag{1}$$
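As a rough illustration of formula (1), the following Python sketch computes the shortest path between two nodes by breadth-first search and turns it into a similarity. The toy ontology, its node names, and the helper names `shortest_distance` and `distance_similarity` are assumptions introduced here, not part of the paper. The depth MaxLen is assumed to count levels with the root at level 1, so that 2 × (MaxLen − 1) is the longest possible leaf-to-leaf path.

```python
from collections import deque


def shortest_distance(edges, a, b):
    """Number of unit-length edges on the shortest path between a and b."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("nodes are not connected")


def distance_similarity(edges, a, b, max_len):
    """Formula (1); max_len must be at least 2 to avoid division by zero."""
    span = 2 * (max_len - 1)
    return (span - shortest_distance(edges, a, b)) / span


# Hypothetical toy ontology given as (child, parent) pairs, three levels deep.
EDGES = [("Hotel", "Tourism"), ("Attraction", "Tourism"),
         ("Park", "Attraction"), ("Museum", "Attraction")]

print(distance_similarity(EDGES, "Park", "Museum", max_len=3))  # siblings
print(distance_similarity(EDGES, "Park", "Hotel", max_len=3))   # farther apart
```

In this toy tree the two sibling concepts score higher than the pair that is only connected through the root, which is the behaviour the distance model is meant to capture.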

In formula (1), MaxLen is the depth of the ontology tree, and Min is the shortest distance between the concept nodes x1 and x2. The formula is very simple, and the resulting concept similarity is not precise.

2) Information Quantity Model

The quantity of information possessed by a node of the concept tree refers to the capacity of the concept's content. The greater the overlap in content between two concept nodes, the higher their semantic similarity; the smaller the overlap, the lower the similarity[5]. In the ontology tree structure, each concept node is regarded as inheriting from its parent node, so each node can be seen as containing approximately the same quantity of information as its parent. Therefore, the similarity of any two concepts in the ontology tree can be obtained by calculating the information quantity of their nearest common parent node[6]. The formulas for the information quantity of a concept node are:

$$p(x) = \frac{f}{m} \tag{2}$$

$$DC(x) = -\log[p(x)] \tag{3}$$

Here, f is the number of times x appears in the training resources, and m is the total number of concepts in the training resources. p(x) is the probability that the node concept x appears in the constructed ontology repository, and DC(x) is the negative logarithm of that probability. Based on the quantity of information contained in the concept nodes, the concept similarity in the ontology tree structure is:

$$\mathrm{Similar}(x_1, x_2) = \frac{2 \times DC[\mathrm{Ancestor}(x_1, x_2)]}{DC(x_1) + DC(x_2)} \tag{4}$$
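Formulas (2) to (4) can be sketched in a few lines of Python. The concept counts below are hypothetical, and the function names `information_content` and `ic_similarity` are introduced here for illustration only. The sketch assumes the nearest common parent is supplied by the caller and that a parent's count includes the occurrences of its descendants.

```python
import math


def information_content(count, total):
    """Formulas (2) and (3): DC(x) = -log(f / m)."""
    return -math.log(count / total)


def ic_similarity(counts, total, x1, x2, ancestor):
    """Formula (4): 2 * DC(ancestor) / (DC(x1) + DC(x2))."""
    def dc(concept):
        return information_content(counts[concept], total)
    return 2 * dc(ancestor) / (dc(x1) + dc(x2))


# Hypothetical occurrence counts in a training corpus of 100 concept mentions.
COUNTS = {"Tourism": 100, "Attraction": 50, "Park": 25, "Museum": 25}

print(ic_similarity(COUNTS, 100, "Park", "Museum", "Attraction"))
```

A common parent that covers the whole corpus has p(x) = 1 and therefore carries no information, so two concepts that only meet at the root get a similarity of 0 under this model.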

Here, Ancestor(x1, x2) is the parent concept node nearest to both x1 and x2 in the ontology tree structure, and DC[Ancestor(x1, x2)] is the information quantity it contains.

3) Property Model

When comparing the similarity or closeness of two things, it is typical to analyze their properties or characteristics[7]. In the property model, the similarity of two concept nodes is calculated as follows:

$$\mathrm{Similar}(x_1, x_2) = f(x_1 \cap x_2) - f(x_1 - x_2) - f(x_2 - x_1) \tag{5}$$
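A minimal sketch of formula (5) follows. The paper does not define f, so the sketch assumes that f simply counts properties; the property sets and the name `property_similarity` are likewise hypothetical.

```python
def property_similarity(props1, props2):
    """Formula (5), with f assumed to be the size of a property set."""
    common = len(props1 & props2)   # f(x1 ∩ x2)
    only1 = len(props1 - props2)    # f(x1 - x2)
    only2 = len(props2 - props1)    # f(x2 - x1)
    return common - only1 - only2


# Hypothetical property sets for three tourism concepts.
PARK = {"name", "address", "opening_hours", "ticket_price"}
MUSEUM = {"name", "address", "opening_hours", "exhibits"}
HOTEL = {"name", "address", "room_rate", "star_rating"}

print(property_similarity(PARK, MUSEUM))
print(property_similarity(PARK, HOTEL))
```

Under this reading the score is not normalized and can be negative when the distinguishing properties outnumber the shared ones.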

Here, x1 ∩ x2 denotes the intersection of the properties of node concepts x1 and x2, x1 - x2 covers the properties that belong to node concept x1 but are not included in x2, and x2 - x1 covers the properties that belong to node concept x2 but are not included in x1.

B. Weight


Weight refers to the value given to the distance between connected concept nodes. The size of the weight affects the concept similarity. This paper analyzes three major factors that affect the size of the weight:

1) Property Type

Edges that connect concept nodes through different types of properties in the ontology tree structure should be given different weights. If the property connecting two concept nodes is a synonym relation, such as two node concepts connected by "SynonymsIs" in the tourism ontology, the weight should be higher than the weights of the other property types. Likewise, according to common sense, the weight given to edges with the "inheritance" property is usually higher than that given to edges with the "whole-part relationship" property. This yields the following relationship between property types and weights:

$$\mathrm{weight}(x, y) = \begin{cases} 1, & \mathrm{type}(x, y) = \mathrm{SynonymsIs} \\ 0.5, & \mathrm{type}(x, y) = \mathrm{IsExtend} \\ 0.3, & \mathrm{type}(x, y) = \mathrm{IsPartOf} \end{cases} \tag{6}$$
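Formula (6) amounts to a lookup table keyed by property type. A small sketch, where the function name `type_weight` and the error handling are assumptions made here:

```python
# Values taken from formula (6); other ontologies may choose different ones.
TYPE_WEIGHTS = {"SynonymsIs": 1.0, "IsExtend": 0.5, "IsPartOf": 0.3}


def type_weight(edge_type):
    """Return the weight of an edge from its property type."""
    try:
        return TYPE_WEIGHTS[edge_type]
    except KeyError:
        raise ValueError(f"unknown property type: {edge_type}") from None


print(type_weight("IsExtend"))
```

Keeping the values in one table makes it easy to adjust them per domain without touching the similarity code.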

Here, weight(x, y) is the weight given to the edge of a certain property type formed by the node concept x and its parent node y. The values in formula (6) are defined by the ontology builders, who have a deep understanding of the concepts and terminology of the field. The formula reflects the relationship between the property type and the size of the weight.

2) The Depth of Edge

The nearer a node is to the leaves, the more detailed the information it contains. Conversely, a node near the root represents less specific information. The weight given to an edge is therefore closely related to its level in the tree structure, and this level is the depth of the edge. The relationship between the depth of the edge and the weight is:

$$\mathrm{weight}(x, y) = \left( \frac{1}{2^{\mathrm{depth}(y)}} + \frac{1}{2^{\mathrm{depth}(y)-1}} + \dots + \frac{1}{2} + 1 \right) = \sum_{n=0}^{\mathrm{depth}(y)} \frac{1}{2^{n}} \tag{7}$$
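Since formula (7) is a finite geometric series, it can be written in closed form, which makes its growth and its upper bound easy to see. In the short derivation below, d is a shorthand for depth(y) introduced here for readability:

```latex
\[
\sum_{n=0}^{d} \frac{1}{2^{n}}
  = \frac{1 - (1/2)^{d+1}}{1 - 1/2}
  = 2 - \frac{1}{2^{d}}
\]
% d = 0 gives 1, d = 1 gives 1.5, d = 2 gives 1.75, d = 3 gives 1.875
```

The weight therefore rises with every additional level but never reaches 2, and most of the increase happens in the first few levels.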

Here, depth(y) is the depth of concept node y in the tree. Formula (7) embodies the relationship between the depth of an edge in the ontology tree and the size of the weight: the weight given to an edge increases with the depth of the node in the tree.

3) The Intersection of Properties

For two concept nodes joined by an edge with a certain property, the more properties they share, the more characteristics they have in common, and the more similar the two concepts are. The weight given to the edge connecting the two concept nodes should then also be larger. The relationship between the size of the intersection of the two concept nodes' property sets and the weight is:

$$\mathrm{weight}(x, y) = \frac{\mathrm{count}(\mathrm{Assemble}(x) \cap \mathrm{Assemble}(y))}{\mathrm{count}(\mathrm{Assemble}(x) \cup \mathrm{Assemble}(y))} \tag{8}$$
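Formula (8) is the ratio of shared properties to all properties of the two nodes. A minimal sketch with hypothetical property sets; the name `intersection_weight` and the handling of two empty sets are assumptions made here, since the paper does not cover that case:

```python
def intersection_weight(props_x, props_y):
    """Formula (8): shared properties divided by all properties."""
    union = props_x | props_y
    if not union:
        # Assumption: two concepts without any properties share nothing.
        return 0.0
    return len(props_x & props_y) / len(union)


# Hypothetical property sets.
PARK = {"name", "address", "opening_hours", "ticket_price"}
MUSEUM = {"name", "address", "opening_hours", "exhibits"}

print(intersection_weight(PARK, MUSEUM))
```

The result always lies between 0 (no shared property) and 1 (identical property sets).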

Here, Assemble(x) and Assemble(y) are the property sets of concept node x and concept node y; Assemble(x) ∩ Assemble(y) is the intersection of the properties of x and y; Assemble(x) ∪ Assemble(y) is the union of the properties of x and y; and the count() function returns the number of properties in the given property set.

C. Semantic Expansion

From the analysis of concept similarity in the ontology tree structure and of the factors that affect the weight, we propose an improved semantic expansion retrieval algorithm. The distance formulas with weights inserted are as follows:

$$\mathrm{Distance}(a, b) = N[a, \mathrm{Ancestor}(a, b)] + N[b, \mathrm{Ancestor}(a, b)] \tag{9}$$

$$N[a, \mathrm{Ancestor}(a, b)] = \sum_{x \in \mathrm{path}(a, \mathrm{Ancestor}(a, b))} \mathrm{Distance}(x, \mathrm{Parent}(x)) \tag{10}$$

$$\mathrm{Distance}(x, \mathrm{Parent}(x)) = \frac{\alpha}{\mathrm{weight}(x, \mathrm{Parent}(x))} - \alpha \tag{11}$$

Ancestor(a, b) is the nearest common ancestor node of concept node a and concept node b in the tree structure; path(a, b) is the set of all concept nodes on the shortest path between a and b in the domain ontology tree; N[b, Ancestor(a, b)] is defined in the same way as N[a, Ancestor(a, b)], with b in place of a. α is an adjustable factor. Formula (11) gives the distance between a child node and its parent node. Knowing the distance between any two nodes in the domain ontology tree, the similarity between any two concepts is calculated as:

$$\mathrm{Similar}(a, b) = \frac{\theta}{\mathrm{Distance}(a, b) + \theta} \tag{12}$$

θ is an adjustable parameter. The formula relates the semantic similarity of two concepts to the distance between their nodes. When Distance(a, b) = 0, the denominator is at its smallest and the similarity reaches its maximum value of 1, so concept node a is similar to concept node b. When Distance(a, b) = ∞, the denominator is very large and the similarity falls to its minimum value of 0, so concept nodes a and b have hardly any relation. Formula (12) thus shows that the concept similarity lies in the interval from 0 to 1. In this article the threshold is set to 0.7: when the concept similarity is greater than 0.7, the similarity of the two concepts is considered relatively high, and the concept is added to the query to form a more comprehensive, updated search term.

3. Application in Tourism

The retrieval method proposed in this paper is applied to Beijing tourism resource retrieval. All experimental data are retrieved from the ontology library. The library is first classified manually, and a web crawler is then used to collect tourism data from the Internet.
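The expansion procedure of Section 2.C, on which this application relies, can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes formula (11) reads α / weight − α, takes the weight of each edge as given (the paper does not state how the three weight factors are combined), sums edge distances over the edges below the common ancestor, and uses a hypothetical toy ontology. All function and variable names are introduced here.

```python
def edge_distance(weight, alpha=1.0):
    """Assumed reading of formula (11): alpha / weight - alpha."""
    return alpha / weight - alpha


def nearest_common_ancestor(a, b, parent):
    """Walk up from a, then up from b until a shared node is found."""
    ancestors, node = set(), a
    while node is not None:
        ancestors.add(node)
        node = parent.get(node)
    node = b
    while node not in ancestors:
        node = parent[node]
    return node


def distance_to_ancestor(node, ancestor, parent, weight, alpha=1.0):
    """Formula (10): sum of edge distances from node up to ancestor."""
    total = 0.0
    while node != ancestor:
        total += edge_distance(weight[(node, parent[node])], alpha)
        node = parent[node]
    return total


def similarity(a, b, parent, weight, alpha=1.0, theta=1.0):
    """Formulas (9) and (12)."""
    anc = nearest_common_ancestor(a, b, parent)
    dist = (distance_to_ancestor(a, anc, parent, weight, alpha)
            + distance_to_ancestor(b, anc, parent, weight, alpha))
    return theta / (dist + theta)


def expand_query(keyword, concepts, parent, weight,
                 threshold=0.7, alpha=1.0, theta=1.0):
    """Keep the keyword and add every concept above the threshold."""
    expanded = [keyword]
    for concept in concepts:
        if concept == keyword:
            continue
        if similarity(keyword, concept, parent, weight, alpha, theta) > threshold:
            expanded.append(concept)
    return expanded


# Hypothetical toy ontology: child -> parent, with weights from formula (6).
PARENT = {"Garden": "Park", "Park": "Attraction",
          "Museum": "Attraction", "Attraction": "Tourism"}
WEIGHT = {("Garden", "Park"): 1.0,          # SynonymsIs
          ("Park", "Attraction"): 0.5,      # IsExtend
          ("Museum", "Attraction"): 0.5,    # IsExtend
          ("Attraction", "Tourism"): 0.3}   # IsPartOf
CONCEPTS = ["Tourism", "Attraction", "Park", "Garden", "Museum"]

print(expand_query("Park", CONCEPTS, PARENT, WEIGHT))
```

With α = θ = 1 and the 0.7 threshold, only the synonym is close enough to join the query in this toy example; a larger θ flattens the similarity curve and lets more distant concepts through.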
To simplify the calculation, we set the number of resources in each category to 1000 and the number of relevant resources for each query to 100. For example, there are 1,000 hotel information resources in total, of which 100 are near the Beijing University of Technology. Three retrieval tasks are evaluated: retrieving bus information, retrieving hotels near the attractions, and retrieving sites near the attractions. The results show that 144 resources were retrieved for the No. 52 bus, of which 82 are relevant; 130 hotels near Beijing University of Technology were retrieved, of which 85 are relevant; and 155 attractions near Tiananmen were retrieved, of which 93 are relevant. The table below shows the recall ratio and precision ratio of the retrieval method employed by Ctrip and of the semantic retrieval method:

Table 1 Semantic search method compared with its Ctrip counterpart

Retrieval System                      Ctrip                            Semantic Retrieval
Sort                                  Recall ratio   Precision ratio   Recall ratio   Precision ratio
Bus inquiries                         45%            30%               82%            57%
Hotels near the attractions           55%            50%               85%            65%
Nearby attractions station inquiry    50%            40%               93%            60%
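The semantic retrieval figures in Table 1 follow directly from the counts reported in the text, with 100 relevant resources per query. A short check, where the dictionary keys and function names are labels chosen here:

```python
def recall(relevant_retrieved, relevant_total):
    """Share of all relevant resources that were retrieved."""
    return relevant_retrieved / relevant_total


def precision(relevant_retrieved, retrieved_total):
    """Share of retrieved resources that are relevant."""
    return relevant_retrieved / retrieved_total


# (relevant retrieved, total retrieved) for each query, as reported above.
RESULTS = {"bus": (82, 144), "hotel": (85, 130), "attraction": (93, 155)}


def as_percentages(results, relevant_total=100):
    """Return {query: (recall %, precision %)} rounded to whole percent."""
    return {name: (round(100 * recall(hit, relevant_total)),
                   round(100 * precision(hit, retrieved)))
            for name, (hit, retrieved) in results.items()}


print(as_percentages(RESULTS))
```

The rounded values agree with the Semantic Retrieval columns of Table 1.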

On both recall ratio and precision ratio, the semantic retrieval method proposed in this paper scores higher than the traditional tourism website represented by Ctrip. From the data displayed in the above table


we can draw the conclusion that, with the support of ontology technology, both the recall ratio and the precision ratio are improved.

4. Conclusion

The purpose of applying semantic retrieval is to let the computer understand the content entered by the user and perform retrieval at the semantic level. Building on concept similarity and its three calculation methods, this paper introduces the concept of weight and its calculation formulas. Based on an analysis that combines concept similarity and weight, it proposes a semantic search method based on ontology and applies this method to tourism information retrieval. It also evaluates the retrieval results from the perspective of recall ratio and precision ratio, which validates the effectiveness of the algorithm.

Acknowledgment

This work is supported by the Scientific Research Project of Beijing Municipal Commission of Education (Grant No. KM201210005030). The support is gratefully acknowledged.

References

[1] Ying Zhang. Web3.0: personalized learning platform [J]. China Education Technology Equipment, 2012 (27): 38-39.
[2] Zhihong Deng, Shiwei Tang, et al. Overview of Ontology [J]. Journal of Peking University: Natural Science Edition, 2002, 38 (5): 730-738.
[3] Jianhou Gan, Youming Xia, Tianren Xu, et al. Ontology knowledge representation extends the language OWL [J]. Journal of Yunnan Normal University (Natural Science Edition), 2005 (04).
[4] Binbin Yu. Ontology construction method and construction tools [J]. Border Economic and Cultural, 2012 (12): 167-168.
[5] Yong Zhang, Junbai Lv. Research on Ontology Modeling Based on Protege [J]. Fujian Computer, 2011 (01).
[6] Tao, Teng-Yang, Zhao. An Ontology-Based Information Retrieval Model for Vegetables E-Commerce [J]. 2012, 11 (5): 800-807.
[7] Chengyi Che, Zongmin Ma, Xiaolong Jiao. Study on the recognition method of data table in the Web page [J]. Computer Engineering, 2012, 38 (23): 154-157.

Materials Science, Computer and Information Technology 10.4028/www.scientific.net/AMR.989-994

A Semantic Retrieval Method Based on Ontology 10.4028/www.scientific.net/AMR.989-994.2179
