Resolving Ambiguous Queries via Fuzzy String Matching and Dynamic Buffering Techniques

Author: Damian Stephens
Olufade F.W. Onifade and Adenike O. Osofisan
Department of Computer Science, University of Ibadan, Oyo State, Nigeria

Abstract. The general means of representing a user's information need is the query. Obtaining the desired information therefore depends on the ability to formulate a set of words that matches the database content. The problem of non-retrieval arises when the query fails to predictably and reliably match a set of documents, whether because of limited knowledge, wrong input, or seemingly simple errors such as word or character transposition, insertion, deletion or outright substitution. The accruable risk is better imagined in a scenario where the information is employed for strategic decisions. Despite the myriad of string matching functions that deal with some of these query problems, the problem has not abated because of the uncertainty which engulfs the process. This research proposes a fuzzy-based buffering technique to complement a fuzzy string matching model, in a bid to accommodate query matching problems that result from ambiguous query representation.

Keywords: Fuzzy string matching, fuzzy buffering, query evaluation, information retrieval.

S. Dua, S. Sahni, and D.P. Goyal (Eds.): ICISTM 2011, CCIS 141, pp. 198–205, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

The bulk of the information employed for decision making resides in databases, or what can be referred to as information delivery systems. Access to this information is facilitated by the information retrieval system (IRS) put in place. The sole responsibility of an IRS is to provide fast, effective and reliable content-based access to vast amounts of information organized as documents (information items) [17,13]. To access the organized documents, the user is required to formulate a query with a set of constraints, against which the relevance of the information items is determined. The main components of an IRS are: a collection of documents, a query language allowing the expression of selection criteria synthesizing the user’s needs, and the matching mechanism which estimates the relevance of the documents to the query [13]. The estimate of the relevance of each document with respect to a specific user need is based on a formal model which provides a formal representation of both the documents and the user queries. Using the trio of document collection, query language and matching mechanism in an IRS, the input represents the user’s query while the corresponding output reflects the estimated relevance of the information collection to the user’s information need (query). It can therefore be established that IRSs are composed primarily to provide information solutions for decision-making problems, that is, to identify the information items corresponding to the user’s information preferences in terms of relevance.

Most decision making is based on three main components: obtaining relevant information (from memory or the external world); constructing the decision or problem space and fitting the acquired information appropriately into the decision problem structure; and assessing the values and likelihoods of different outcomes. This further buttresses the importance of information in everyday life. Despite the myriad of information retrieval tools and techniques, and the emergence of search engines as an offshoot of IR, the situation has not changed much. This is because most search engines are developed from traditional retrieval models defined several years ago. A typical example is the Boolean query, which forces the user to express his/her information needs precisely as a set of un-weighted keywords. The result is an inability to express and accommodate vagueness and uncertainty when specifying selection criteria, which ought to tolerate imprecision [14].

This research builds on the fuzzy string matching technique described in [11, 12]. We extend the buffering mechanism described in [1] to fortify that fuzzy string matching technique as a robust tool for retrieval operations. Our consideration is based on the issue of relevance, which keeps recurring in the search for needed information among information items.
Relevance is contextual, dependent on individual interpretation and on other factors that cannot be adequately described; thus retrieval tools like the Boolean model cannot handle such ambiguous operations. With this notion, our focus is to create a flexible means of accessing information systems and databases, with emphasis on reducing the risk accruing from retrieval. In the rest of this work, we review approximate string matching in Section 2 and briefly present our fuzzy string matching model in Section 3. We discuss the fuzzy buffering model with some examples in Section 4 and conclude the work in Section 5.

2 Approximate String Matching

Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line. With on-line algorithms the pattern can be preprocessed before searching but the text cannot. In other words, on-line techniques search without an index. Early algorithms for on-line approximate matching were suggested by [9]. The algorithms are based on dynamic programming but solve different problems. Sellers' algorithm searches approximately for a substring in a text, while the algorithm of Wagner and Fischer calculates the Levenshtein distance and is therefore appropriate for dictionary fuzzy search only [6]. On-line searching techniques have been repeatedly improved. Perhaps the most famous improvement is the Bitap algorithm (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings [2]. The Bitap algorithm is the heart of the UNIX searching utility agrep. An excellent review of on-line searching algorithms was done by [9]. Although very fast on-line techniques exist, their performance on large data is grossly inadequate. Text pre-processing or indexing makes searching dramatically faster. Today, a variety of indexing algorithms have been presented, among them suffix trees, metric trees and n-gram methods.

In computing, approximate string matching provides a means of finding approximate matches to a pattern in a string. The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance (also called the Levenshtein distance) between the string and the pattern. The usual primitive operations are: (i) insertion (e.g. changing cot to coat), (ii) deletion (e.g. changing coat to cot), and (iii) substitution (e.g. changing coat to cost). Some approximate matchers also treat transposition, in which the positions of two letters in the string are swapped, as a primitive operation (e.g. changing cost to cots). Different approximate matchers impose different constraints. Some matchers use a single global un-weighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern [8]. For example, if the pattern is coil, foil differs by one substitution, coils by one insertion, oil by one deletion, and foal by two substitutions. If all operations count as a single unit of cost and the limit is set to one, foil, coils and oil will count as matches while foal will not. Other matchers specify the number of operations of each type separately, while still others set a total cost but allow different weights to be assigned to different operations. Some matchers allow separate assignments of limits and weights to individual groups in the pattern.
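The unit-cost example above can be reproduced with a short sketch. This is a textbook dynamic-programming Levenshtein distance, not code from the paper; the function names `edit_distance` and `matches_within` are illustrative, and transposition is deliberately not treated as a primitive operation here.

```python
def edit_distance(source: str, target: str) -> int:
    """Unit-cost Levenshtein distance (insertion, deletion, substitution)."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def matches_within(pattern: str, candidates: list[str], limit: int) -> list[str]:
    """Keep the candidates whose distance to the pattern is at most `limit`."""
    return [c for c in candidates if edit_distance(c, pattern) <= limit]


print(matches_within("coil", ["foil", "coils", "oil", "foal"], limit=1))
# ['foil', 'coils', 'oil']
```

Because transposition is not primitive in this sketch, `edit_distance("cost", "cots")` is 2 (two substitutions); a matcher that counts a swap of adjacent letters as one operation would report 1 instead.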
Most approximate matchers used for text processing are regular expression matchers. The distance between a candidate and the pattern is therefore computed as the minimum distance between the candidate and a fixed string matching the regular expression [6]. If the pattern is co.l, using the POSIX notation in which a dot matches any single character, both coal and coil are exact matches, while soil differs by one substitution. While the above presents a flexible means of string matching, the mode of implementation still does not adequately support the vagueness and uncertainty discussed earlier as a debilitating factor for ambiguous queries. In the next section, we present our proposition on this.
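The co.l example can be checked with a small variation on the edit-distance recurrence. The sketch below is an assumption-laden simplification: it supports only literal characters and the single-character wildcard `.`, not full regular expressions, and the name `wildcard_distance` is hypothetical.

```python
def wildcard_distance(pattern: str, candidate: str) -> int:
    """Minimum unit-cost edit distance between `candidate` and any string
    matched by `pattern`, where '.' in the pattern matches any one character."""
    previous = list(range(len(candidate) + 1))
    for i, p_char in enumerate(pattern, start=1):
        current = [i]
        for j, c_char in enumerate(candidate, start=1):
            # A dot in the pattern, or an identical character, costs nothing.
            mismatch = 0 if p_char == "." or p_char == c_char else 1
            current.append(min(previous[j - 1] + mismatch,  # match / substitute
                               current[j - 1] + 1,          # extra char in candidate
                               previous[j] + 1))            # missing pattern char
        previous = current
    return previous[-1]


for word in ("coal", "coil", "soil"):
    print(word, wildcard_distance("co.l", word))
```

Under these assumptions coal and coil come out at distance 0 and soil at distance 1, as in the text.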

3 Fuzzy String Matching

Fuzzy string matching is our attempt to guard against the risk accruing from some classes of dirty data, which include misspelt strings, inconsistent entries, incomplete context, different ordering and ambiguous data. Consider the strings ‘onifade’ and ‘onitade’. The two strings are practically the same, but for the character ‘t’ in the latter. The problem arises when a typical matching algorithm encounters this entry: once no direct relationship can be established, it is ignored. However, when viewed fuzzily, the two strings have a lot in common. Firstly, we can establish that the substrings ‘oni’ and ‘ade’ are in the same positions when the two strings are analyzed concurrently. Secondly, they both have the same number of characters, and thus the main problem is either misspelling or transposition. This scenario formed the basis for the fuzzy string matching algorithm shown in Fig. 1. In order to compare the user’s string and the database contents concurrently, two dynamic buffers are created at the commencement of the operation. One, ‘buffer1’, holds the unmatched characters of the user’s input substring and the other, ‘buffer2’, holds the unmatched characters of the database substring.

Fig. 1. Fuzzy String Matching Model [11]

The algorithm then scans the character content of the two strings concurrently. When the characters are the same, the variable counting matched characters is incremented. If the characters differ, the two characters are stored in buffer1 and buffer2 respectively. After all the characters have been compared, or the end of one of the strings has been reached (in the case where the two strings differ in size), the fuzzy match value is calculated based on the level of containment or belongingness (via the fuzzy membership function ‘func1()’) of the number of matched characters relative to the size of the database substring (see Fig. 3). This operation does not do away with the unmatched characters; instead they are used to generate other entries to be displayed alongside the retrieved entries. While this could generate a high volume of redundant entries, the user has the opportunity to decrease the level of fuzziness at the application level and thus reduce the number of entries. We considered this necessary for two reasons: extreme cases of misspelling, as in cases of dyslexia, and cases where the supplied query forms a subset of the database content but not the whole, e.g. ‘Oberman’ and ‘Hoberman’. A fuller discussion of this model and its operation can be found in [11].
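The concurrent scan described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the membership function ‘func1()’ is not specified in this paper, so the sketch assumes it is simply the ratio of matched characters to the length of the database string, and it leaves any trailing characters of the longer string out of both buffers. The name `positional_fuzzy_match` is hypothetical.

```python
def positional_fuzzy_match(user_input: str, db_string: str):
    """Scan both strings position by position.

    Returns (fuzzy_value, buffer1, buffer2), where buffer1 and buffer2 hold
    the unmatched characters of the user input and of the database string.
    """
    matched = 0
    buffer1 = []  # unmatched characters of the user's input
    buffer2 = []  # unmatched characters of the database string
    # zip() stops at the end of the shorter string, mirroring "the end of
    # one of the strings" in the description.
    for user_char, db_char in zip(user_input, db_string):
        if user_char == db_char:
            matched += 1
        else:
            buffer1.append(user_char)
            buffer2.append(db_char)
    # ASSUMPTION: membership = matched characters / database string length.
    fuzzy_value = matched / len(db_string) if db_string else 0.0
    return fuzzy_value, buffer1, buffer2


value, buffer1, buffer2 = positional_fuzzy_match("onitade", "onifade")
print(round(value, 3), buffer1, buffer2)
# 0.857 ['t'] ['f']
```

For ‘onitade’ against ‘onifade’, six of the seven positions agree, and the single disagreeing pair (‘t’, ‘f’) ends up in the two buffers for later analysis.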


The discussion above refers to two buffers that were not explicitly described in [11]; they constitute the main contribution of this work and are the focus of the next section.

4 Fuzzy Buffering Model

Buffering is a technique used to facilitate synchronization between and amongst dissimilar devices. There exist different types of buffer management techniques, a list of which was presented in [1]. We, however, look at static buffering, linear buffering and modulo buffering to create an environment for discussing our work. These buffering methods assume that samples are queued in the arc buffers in arrival order and that access is via the movement of buffer indices.

Fig. 2. FuzzyMatch Buffering Model

Static buffering has a limited area of application: it is only useful when the indices do not change at runtime. Linear and modulo buffering also have their drawbacks: linear buffering requires a large buffer to function, while modulo buffering requires a runtime computation to function appropriately. The overhead of these methods, amongst other things, does not suit the retrieval operation we propose. Thus, we propose a fuzzy-based buffering model which employs the fuzzy partitioning method to determine the belongingness of any subset of a string to a particular string. Users with disabilities such as dyslexia, for example, could spell a word with the right character content but, in most cases, with the characters muddled up. For example, a dyslexic user could spell ‘clement’ as ‘elcmten’. In order to trap cases like these, the algorithm analyses the character content of the two strings even if their characters do not match concurrently. To do this, the unmatched characters placed in buffer1 and buffer2, described above, are analysed to check for similarity. The case of dyslexia could be considered extreme, but research has shown that most failed hits in retrieval operations are due to misspellings. Google and Yahoo search have, however, introduced some level of fuzziness in addressing such problems.

In Fig. 2, we present the buffering operation, which is another important model included in FuzzyMatch. The character splitting operation is not complex, but it is needed to feed the buffering model for character analysis and fuzzy decision making. After the analysis, a buffer match value is produced. This buffer match value is then compared to the fuzzy match value. If it is larger, the new fuzzy value becomes the average of the fuzzy match and the buffer match. If it is smaller, the buffer match value is discarded and the fuzzy value remains unchanged. When this process is applied to the strings ‘clement’ and ‘elcmten’, a fuzzy match value of about 50% is recorded, which we consider fair given that the strings cannot be matched concurrently and would otherwise have been discarded by other search engines.
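The comparison rule just described (average when the buffer match is larger, otherwise keep the fuzzy match) can be sketched as below. How the buffer match value itself is computed is not spelled out in this paper, so the sketch makes a loud assumption: it counts the characters the two buffers share, ignoring order, and divides by the length of the database string. The names `buffer_match` and `combine` are hypothetical, and the buffer contents are written out by hand from a position-by-position comparison of the two strings.

```python
from collections import Counter


def buffer_match(buffer1, buffer2, db_length: int) -> float:
    """ASSUMED definition: shared characters (order ignored) / db length."""
    shared = sum((Counter(buffer1) & Counter(buffer2)).values())
    return shared / db_length if db_length else 0.0


def combine(fuzzy_value: float, buffer_value: float) -> float:
    """Rule from the text: average if the buffer match is larger,
    otherwise discard it and keep the fuzzy match unchanged."""
    if buffer_value > fuzzy_value:
        return (fuzzy_value + buffer_value) / 2
    return fuzzy_value


# 'elcmten' (user) against 'clement' (database): only positions 1 ('l')
# and 3 ('m') agree, so the positional fuzzy match is 2/7.
fuzzy_value = 2 / 7
buffer1 = list("ecten")  # unmatched characters of 'elcmten'
buffer2 = list("ceent")  # unmatched characters of 'clement'

buffer_value = buffer_match(buffer1, buffer2, db_length=7)
print(round(combine(fuzzy_value, buffer_value), 3))
```

With this assumed normalization the buffers share five characters, the buffer match is 5/7, and the combined value is 0.5, which is in line with the roughly 50% reported for this pair; that agreement does not confirm that the authors use the same formula.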

Fig. 3. (a) Buffering operation for the string; (b) Google results for the same string

While it is almost impossible to eradicate data errors at input, manipulation and deletion, it is important to note that the effect of these errors can be mild or grave depending on the situation. The accruable risk is thus better imagined in a real-life scenario. With our tool, the focus is to reduce the risk in information retrieval processes. We consider some of the classes regarded as dirty data, which include spelling errors, omission, insertion and sometimes deletion. We submitted the same input ‘ONIEADF’ to the Google search engine and the result is interesting. Before discussing the Google result, let us look at the string comparison vis-à-vis the buffering pattern employed by our combined approach to resolving such ambiguity. The model compares the two buffers containing the unmatched characters produced in stage 2 to check for similarity in their character content. The product of this stage is the buffer match value. Fig. 4 attempts to capture explicitly the operations involved in string splitting, matching and the buffering pattern. Once the presence of a character can be established in the string, the buffer content continues to be manipulated dynamically until the last entry in the string is considered. This results in the fuzzy match, which is the multiplicative effect of the buffer match and the level of belongingness. The fuzzy function employed helps to determine the level of fuzziness in the arrangement of the user’s input and uses it to assist in possible rearrangement.

Fig. 4. Strings Comparison and Buffering Operations

We have demonstrated how the retrieval for “onieadf” is handled by our combined approach of fuzzy string matching and fuzzy buffering. Our result compares favourably with other search engines in dealing with some inherent information retrieval problems arising from ambiguous and incoherent user entries.

5 Conclusion

Adequate definition of the user's information need is the only way an information retrieval exercise can come close to full success. However, it has been established that on many occasions users do not know how to define and formulate their queries appropriately in order to retrieve the right information item stored in the database. We reiterate that this factor is of grave consequence, especially when strategic decisions are involved. In a bid to ameliorate this, we proposed in this research a fuzzy buffering model to handle ambiguous queries arising from the definition of the information need. The results showed that issues like transposition, deletion, insertion and/or omission, which hitherto could result in a no-hit scenario, were properly handled by our model. We also provide a robust means by which users can set the level of fuzziness of their queries and use it to minimize redundant entries. In the future, we hope to build a complete framework representing a complete retrieval tool.


References

1. Aderounmu, G.A., Ogwu, F.J., Onifade, O.F.W.: A dynamic traffic shaper technique for a scalable QoS in ATM networks. In: ICCCT, Austin, Texas, USA, August 14-17, pp. 332–337 (2004)
2. Baeza-Yates, R., Navarro, G.: Fast approximate string matching in a dictionary. In: Proc. SPIRE 1998, pp. 14–22. IEEE CS Press, Los Alamitos (1998)
3. Bordogna, G., Pasi, G.: Controlling retrieval through a user-adaptive representation of documents. International Journal of Approximate Reasoning 12, 317–339 (1995)
4. Bordogna, G., Pasi, G.: Modelling vagueness in information retrieval. In: Agosti, M., Crestani, F., Pasi, G. (eds.) Lectures in Information Retrieval. Springer, Heidelberg (2001)
5. Crestani, F., Pasi, G.: Soft information retrieval: applications of fuzzy set theory and neural networks. In: Kasabov, N., Kozma, R. (eds.) Neuro-Fuzzy Techniques for Intelligent Information Systems, pp. 287–313. Physica-Verlag, Springer-Verlag Group (1999)
6. Dyke, N.V.: Levenshtein: Levenshtein distance metric in scheme (2006), http://www.neilvandyke.org/levenshtein-scheme/
7. Hahn, J., Chou, P.H.: Buffer optimization and dispatching scheme for embedded systems with behavioral transparency. In: Seventh ACM and IEEE International Conference on Embedded Software, Salzburg, Austria, pp. 94–103 (2007)
8. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001), doi:10.1145/375360.375365
9. Navarro, G., Baeza-Yates, R., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Engineering Bulletin 24(4), 19–27 (2001)
10. Oh, H., Dutt, N., Ha, S.: Shift buffering technique for automatic code synthesis from synchronous dataflow graphs. In: Proceedings of the Third IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pp. 51–56 (2005)
11. Onifade, O.F.W., Thiery, O., Osofisan, A.O., Duffing, G.: Dynamic fuzzy string-matching model for information retrieval based on incongruous user queries. Paper presented at the 2010 WCE, London, pp. 283–288 (June 2010)
12. Onifade, O.F.W., Thiery, O., Osofisan, A.O., Duffing, G.: A fuzzy model for improving relevance ranking in information retrieval process. In: International Conference on Artificial Intelligence and Pattern Recognition (AIPR 2010), Florida, USA (July 2010)
13. Pasi, G.: Fuzzy sets in information retrieval: state of the art and research trends. In: Bustince, H., et al. (eds.) Fuzzy Sets and Their Extensions: Representation, Aggregation and Models, pp. 517–535. Springer, Heidelberg (2008)
14. Robins, D.: Interactive information retrieval: context and basic notions. Information Science 3(2), 57–61 (2000)
15. Rocchio, J.J.: Relevance feedback in information retrieval. Prentice Hall, Englewood Cliffs (1971)
16. Rupley, M.L.: Introduction to query processing and optimization, 2, http://www.cs.iusb.edu/technical_reports/TR-20080105-1.pdf
17. Salton, G.: Automatic text processing - the transformation. In: Analysis and Retrieval of Information by Computer. Addison Wesley Publishing Company, Reading (1989)
18. Smeaton, A.F.: Progress in the application of natural language processing to information retrieval tasks. The Computer Journal 35(3), 268–278 (1992)
19. van Rijsbergen, C.J.: Information retrieval. Butterworths, London (1979)
20. Voorhees, E.M., Harman, D.K.: Overview of the Eighth Text Retrieval Conference (TREC-8). In: Information Technology: The Eighth Text Retrieval Conference (TREC-8). NIST SP 500-246, 1-23, GPO: Washington, D.C (2000)