A SURVEY- LINK ALGORITHM FOR WEB MINING

ISSN:2249-5789 Gurpreet Kaur et al , International Journal of Computer Science & Communication Networks,Vol 3(2), 105-110

Gurpreet Kaur
M.TECH, Research Scholar, Dept. of CSE, S.G.G.S.W.U., Fatehgarh Sahib (Punjab), India
[email protected]

Shruti Aggarwal
M.TECH, Assistant Professor, Dept. of CSE, S.G.G.S.W.U., Fatehgarh Sahib (Punjab), India
[email protected]

Abstract- Web mining is an active and rapidly growing research area. It integrates information gathered by traditional data mining methodologies and techniques with information gathered over the World Wide Web (WWW). Based on the kind of information that is mined, web mining is divided into three categories: Web content mining, Web structure mining and Web usage mining. Search engines are a prominent application of web mining: most of them rank their search results in response to users' queries to make navigation easier. In this paper we survey page ranking algorithms and describe Weighted Page Content Rank (WPCR), which is based on both web content mining and web structure mining and which determines the relevancy of pages to a given query better than the Page Rank and Weighted Page Rank algorithms.

Keywords- Web mining, web content, Page Rank, Weighted Page Rank, Weighted Page Content Rank, web structure.

I INTRODUCTION
The World Wide Web is the collection of information resources on the Internet that are accessed using the Hypertext Transfer Protocol (HTTP). It is a repository of many interlinked hypertext documents and may contain text, images, video and other multimedia data. To analyze such data, various web applications and tools use techniques known as web mining. Web mining describes the use of data mining techniques to automatically discover Web documents and services, to extract information from Web resources and to uncover general patterns on the Web. Over the years, web mining research has been extended to cover the use of data mining and similar techniques to discover resources, patterns and knowledge from Web-related data (such as Web usage data or Web server logs). It is used to understand customer behavior, to evaluate the effectiveness of a particular Web site and to help quantify the success of a marketing campaign. It is a rapidly growing research area.

II Web Mining
The term web mining was first coined by Etzioni in 1996. Etzioni starts from the hypothesis that information on the web is sufficiently structured and outlines the subtasks of web mining [5]. Web mining refers to the overall process of discovering potentially useful and previously unknown information from web documents and services; it can be viewed as an extension of standard data mining to web data.

Fig. 4. Taxonomy of web mining [8].

2.1. Web Mining Process
The complete process of extracting knowledge from Web data is as follows:


Web mining can be decomposed into the following subtasks:

Resource finding:- the task of retrieving the intended Web documents. This means retrieving data, either online or offline, from the text sources available on the web, such as electronic newsletters, electronic newswires and the text contents of HTML documents obtained by removing HTML tags. It also includes the manual selection of Web resources.

Information selection and preprocessing:- automatically selecting and preprocessing specific information from the retrieved Web resources. This is a transformation of the original data retrieved in the information retrieval (IR) process. The transformation can be a kind of preprocessing aimed at obtaining the desired representation, such as finding phrases in the training corpus or transforming the representation to relational or first-order logic form.

Generalization:- automatically discovering general patterns at individual Web sites as well as across multiple sites. Machine learning or data mining techniques are typically used in this step. Humans play an important role in the information or knowledge discovery process on the Web, since the Web is an interactive medium.

Analysis:- validation and/or interpretation of the mined patterns.
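The resource finding step above mentions obtaining the text contents of HTML documents by removing HTML tags. As a minimal sketch of that single step, the following Python fragment uses the standard library's `html.parser` module. The names `TextExtractor` and `html_to_text` and the sample markup are hypothetical and chosen only for illustration; skipping `script` and `style` content is an added assumption, not something the survey prescribes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<html><body><h1>Web mining</h1><p>Link <b>analysis</b> survey.</p></body></html>"
print(html_to_text(sample))
```

The resulting plain text is what the later preprocessing and generalization steps would operate on.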

A. Web Content Mining
Web content mining [9] deals with discovering useful information or knowledge from web page contents; it analyzes the content of web resources. Content data is the collection of facts that are contained in a web page. It consists of unstructured data such as free text, images, audio and video, semi-structured data such as HTML documents, and more structured data such as data in tables or database-generated HTML pages. The primary web resources that are mined in web content mining are individual pages. Web content mining can be used to group, categorize, analyze and retrieve documents.

B. Web Structure Mining
Web structure mining [10] is the process of discovering structure information from the web. The structure of a typical web graph consists of web pages as nodes and hyperlinks as edges connecting related pages.
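To make the web graph description concrete, the sketch below stores a small directed graph with pages as nodes and hyperlinks as edges, and derives the inlink and outlink sets that the link analysis algorithms rely on. The page names, the edge list and the helper `build_web_graph` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical toy web graph: each pair (source, target) is one hyperlink.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]


def build_web_graph(edges):
    """Return (outlinks, inlinks) adjacency maps for a list of directed edges."""
    outlinks = defaultdict(set)
    inlinks = defaultdict(set)
    for source, target in edges:
        outlinks[source].add(target)
        inlinks[target].add(source)
        # Touch both maps so that every page appears in each of them.
        outlinks[target]
        inlinks[source]
    return dict(outlinks), dict(inlinks)


outlinks, inlinks = build_web_graph(links)
for page in sorted(outlinks):
    print(page, "out:", sorted(outlinks[page]), "in:", sorted(inlinks[page]))
```

In this toy graph, page C has three inlinks (from A, B and D), while page D has none.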

Fig. 5. Web graph structure [12].

a.) Hyperlinks: A hyperlink is a structural unit that connects a location in a web page to a different location, either within the same web page or on a different web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink.
b.) Document structure: In addition, the content within a web page can be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object


model structures out of documents.

C. Web Usage Mining
Web usage mining [11] is the process of finding out what users are looking for on the internet. It focuses on techniques that can predict the behavior of users while they interact with the WWW. It collects data from web log records to discover user access patterns of web pages. Usage data captures the identity or origin of web users along with their browsing behavior at a website. There are two main tendencies in web usage mining, driven by the application of the discoveries: General Access Pattern Tracking and Customized Usage Tracking. General access pattern tracking analyzes the web logs to understand access patterns and trends. Customized usage tracking analyzes individual trends; its purpose is to customize websites to their users.

III Link Analysis Algorithms
Web mining techniques obtain additional information from the hyperlinks through which different documents are connected. We can view the web as a directed labeled graph whose nodes are the documents or pages and whose edges are the hyperlinks between them. This directed graph structure is known as the web graph. A number of algorithms based on link analysis have been proposed. Three important algorithms, Page Rank, Weighted Page Rank and Weighted Page Content Rank, are discussed below.

A. Page Rank
This algorithm was developed by Brin and Page at Stanford University and extends the idea of citation analysis. In citation analysis the incoming links are treated as citations, but this technique alone does not give good results because it provides only a rough approximation of the importance of a page. Page Rank provides a better approach: it computes the importance of a web page from the pages that link to it. These links are called backlinks. If a backlink comes from an important page, it is given a higher weight than backlinks coming from unimportant pages. The link from one page to another is considered a vote. Not only the number of votes that a page receives is important, but also the importance of the pages that cast the votes. Page and Brin proposed the following formula to calculate the page rank of a page A:

$$PR(A) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)$$
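As a rough illustration of how the formula can be evaluated, the following Python sketch applies it repeatedly to a small graph until the values settle. The function name `page_rank`, the three-page graph `web`, the starting value of 1.0 and the fixed iteration count are assumptions made for this example; the sketch also assumes that every page that is linked from has at least one outlink of its own.

```python
def page_rank(outlinks, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the backlinks T of A."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)

    # Backlinks of each page: the pages T that link to it.
    backlinks = {page: [] for page in pages}
    for source, targets in outlinks.items():
        for target in targets:
            backlinks[target].append(source)

    ranks = {page: 1.0 for page in pages}  # arbitrary starting guess
    for _ in range(iterations):
        # Every new value is computed from the values of the previous pass.
        ranks = {
            page: (1 - d) + d * sum(ranks[t] / len(outlinks[t]) for t in backlinks[page])
            for page in pages
        }
    return ranks


# Hypothetical graph: A links to B and C, B links to C, C links to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(web)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

For this graph the values settle at roughly 1.16 for A, 0.64 for B and 1.19 for C. Page C, which receives votes from both A and B, ends up with the highest rank. With the formula exactly as written, the three values add up to 3, the number of pages.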

In the Page Rank formula, $PR(T_i)$ is the page rank of a page $T_i$ that links to page A, $C(T_i)$ is the number of outlinks on page $T_i$, and $d$ is the damping factor. The damping factor is used to stop other pages from having too much influence: the total vote is "damped down" by multiplying it by $d$, which is usually set to 0.85. The page ranks are described as forming a probability distribution over the web pages, so that the sum of the page ranks of all web pages is one; strictly, this requires the normalized variant of the formula in which the $(1-d)$ term is divided by the total number of pages. The page rank of a page can be calculated without knowing the final page rank values of the other pages. It is an iterative algorithm that follows the principle of the normalized link matrix of the web. The page rank of a page depends on the number of pages pointing to it.

B. Weighted Page Rank
More popular web pages tend to have more links pointing to them and more links to other pages. The Weighted Page Rank algorithm, an extension of Page Rank, assigns larger rank values to more important (popular) pages instead of dividing the rank value of a page evenly among its outlink pages: each outlink page gets a value proportional to its popularity. The popularity from the number of inlinks and outlinks is recorded as $W^{in}(v,u)$ and $W^{out}(v,u)$, respectively. $W^{in}(v,u)$ is the weight of link $(v,u)$, calculated based on the number of inlinks of page u and the number of inlinks of all reference pages of page v.


$$W^{in}(v,u) = \frac{I_u}{\sum_{p \in R(v)} I_p}$$

where $I_u$ and $I_p$ represent the number of inlinks of page u and page p, respectively, and $R(v)$ denotes the reference page list of page v. $W^{out}(v,u)$ is the weight of link $(v,u)$, calculated based on the number of outlinks of page u and the number of outlinks of all reference pages of page v:

$$W^{out}(v,u) = \frac{O_u}{\sum_{p \in R(v)} O_p}$$

where $O_u$ and $O_p$ represent the number of outlinks of page u and page p, respectively, and $R(v)$ denotes the reference page list of page v.

C. Weighted Page Content Rank
The Weighted Page Content Rank (WPCR) algorithm is a proposed page ranking algorithm that gives a sorted order to the web pages returned by a search engine in response to a user query. WPCR is a numerical value on the basis of which the web pages are ordered. The algorithm employs web structure mining as well as web content mining techniques. Web structure mining is used to calculate the importance of a page, and web content mining is used to find how relevant a page is. Importance here means the popularity of the page, i.e. how many pages point to or are referred to by this particular page. It can be calculated based on the number of inlinks and outlinks of the page. Relevancy means how well the page matches the fired query: the more closely a page matches the query, the more relevant it is.

Algorithm of Weighted Page Content Rank
Input: Page P, inlink and outlink weights of all backlinks of P, query Q, d (damping factor).
Output: Rank score.
Step 1: Relevance calculation:
a) Find all meaningful word strings of Q (say N strings).
b) Find whether the N strings occur in P or not. Z = sum of frequencies of all N strings.
c) S = set of the maximum possible strings occurring in P.
d) X = sum of frequencies of the strings in S.
e) Content Weight (CW) = X/Z.
f) C = number of query terms in P.
g) D = number of all query terms of Q, ignoring stop words.
h) Probability Weight (PW) = C/D.
Step 2: Rank calculation:
a) Find all backlinks of P (say set B).
b) $PR(P) = (1-d) + d\left[\sum_{V \in B} PR(V)\, W^{in}(P,V)\, W^{out}(P,V)\right](CW + PW)$, where the sum runs over the backlinks V in B and $W^{in}(P,V)$ and $W^{out}(P,V)$ are the inlink and outlink weights of the link between V and P.
c) Output PR(P), i.e. the rank score.

Comparison of Algorithms
The table shows the differences between the above three algorithms.
Table: Comparison of Page Rank, Weighted Page Rank and Weighted Page Content Rank
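The link weights of Section III.B and the two WPCR steps of Section III.C can be sketched in Python as shown below. This is only one possible reading of the algorithm, not the authors' implementation. Several details are assumptions: the reference page list R(v) is taken to be the pages that v links to; the "meaningful word strings" of the query are taken to be all contiguous runs of its non-stop-word terms; the set S is taken to be the longest of those strings that occur in the page; and the rank values PR(V) of the backlinks are assumed to be given. The names `link_weights`, `relevance`, `wpcr_score`, the stop-word list and the toy data are all hypothetical.

```python
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "to"}  # illustrative list


def link_weights(outlinks, v, u):
    """Return (W_in, W_out) of link (v, u); R(v) is assumed to be the pages v links to."""
    inlink_count = {}
    for targets in outlinks.values():
        for t in targets:
            inlink_count[t] = inlink_count.get(t, 0) + 1
    refs = outlinks[v]
    total_in = sum(inlink_count.get(p, 0) for p in refs)
    total_out = sum(len(outlinks.get(p, ())) for p in refs)
    w_in = inlink_count.get(u, 0) / total_in if total_in else 0.0
    w_out = len(outlinks.get(u, ())) / total_out if total_out else 0.0
    return w_in, w_out


def relevance(page_text, query):
    """Step 1: return (CW, PW) under the assumptions stated in the lead-in."""
    words = page_text.lower().split()
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    # a) candidate strings: every contiguous run of query terms (assumption)
    strings = {tuple(terms[i:j]) for i in range(len(terms))
               for j in range(i + 1, len(terms) + 1)}

    def count(s):
        return sum(1 for k in range(len(words) - len(s) + 1)
                   if tuple(words[k:k + len(s)]) == s)

    freq = {s: count(s) for s in strings}
    z = sum(freq.values())                                   # b)
    present = [s for s in freq if freq[s] > 0]
    longest = max((len(s) for s in present), default=0)
    x = sum(freq[s] for s in present if len(s) == longest)   # c), d)
    cw = x / z if z else 0.0                                 # e)
    c = sum(1 for t in set(terms) if t in words)             # f)
    d_terms = len(set(terms))                                # g)
    pw = c / d_terms if d_terms else 0.0                     # h)
    return cw, pw


def wpcr_score(page, page_text, query, outlinks, ranks, d=0.85):
    """Step 2: (1 - d) + d * [sum over backlinks V of PR(V) * W_in * W_out] * (CW + PW)."""
    cw, pw = relevance(page_text, query)
    backlinks = [v for v, targets in outlinks.items() if page in targets]
    link_score = 0.0
    for v in backlinks:
        w_in, w_out = link_weights(outlinks, v, page)
        link_score += ranks[v] * w_in * w_out
    return (1 - d) + d * link_score * (cw + pw)


# Hypothetical toy data: graph, current rank values and the text of page C.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {"A": 1.0, "B": 1.0, "C": 1.0}
text_c = "web mining survey of link analysis for web mining"
score = wpcr_score("C", text_c, "web mining algorithms", outlinks, ranks)
print(round(score, 4))
```

In this toy case the link A to C has an inlink weight of 2/3 and an outlink weight of 1/2, the page text gives CW = 1/3 and PW = 2/3, and the resulting score for page C is about 1.28. A different definition of the word strings or of the set S would change CW and therefore the score.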


IV Conclusion
Web mining is the data mining technique that automatically discovers or extracts information from web documents. The Page Rank and Weighted Page Rank algorithms are used in web structure mining to rank the relevant pages. In this paper we focused on a comparative study of page ranking algorithms. With Page Rank and Weighted Page Rank, users may not easily get the relevant documents they require, whereas with the newer Weighted Page Content Rank algorithm users can get relevant and important pages more easily, since it employs both web structure mining and web content mining. As part of our future work, we plan to implement the Weighted Page Content Rank algorithm, integrate it with a clustering algorithm, and work on finding the required relevant and important pages more easily.

REFERENCES
[1] Tipawan Silwattananusarn and Assoc. Prof. Dr. Kulthida Tuamsuk, “Data Mining and its Applications for Knowledge Management: A Literature Review from 2007 to 2012”, International Journal of Data Mining and Knowledge Management Process, Volume 2, No. 5, Sept. 2012.
[2] Neelamadhab Padhy, Dr. Pragnyaban Mishra and Rasmita Panigrahi, “The Survey of Data Mining Applications and Feature Scope”, Volume 2, No. 3, June 2012.
[3] Venkatadri M., “A Review on Data Mining from Past to the Future”, Vol. 15, No. 7, Feb. 2011.
[4] Osmar R. Zaiane, “Introduction to Data Mining”, 1999.
[5] Chintandeep Kaur, Rinkle Rani Aggarwal, “Web Mining Tasks and Types: A Survey”, Vol. 2, Issue 2, Feb. 2012.
[6] N. Senthil Kumar, P. M. Durai Raj Vincent, “Web Mining: An Integrated Approach”, Vol. 2, Issue 3, March 2012.
[7] G T Rajul and P S Satyanarayana, “Knowledge Discovery from Web Usage Data: Complete Preprocessing Methodology”, IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 1, January 2008.
[8] Shakti Kundu, “An Intelligent Approach of Web Data Mining”, International Journal of Computer Science and Engg (IJCSE), Vol. 4, No. 5, May 2012.
[9] Boley D, Gini ML, Gross R, Han EH, Hastings K, Karypis G, Kumar V, Mobasher B, Moore J. Document categorization and query generation on the world wide web using WebACE. J Artif Intell Rev 1999; 13(5-6):365-91.
[10] Pirolli P, Pitkow J, Rao R. Silk from a sow’s ear: extracting usable structures from the web. In Proceedings of Conference on Human Factors in Computing Systems, Vancouver, British Columbia, Canada, 1996; 1996:118-25.
[11] Masseglia F, Poncelet P, Cicchetti R. An efficient algorithm for web usage mining. J Networking Inf Syst 1999; 2(56):571-603.
[12] Taher H. Haveliwala, “Topic-Sensitive Page Rank: A Context-Sensitive Ranking Algorithm for Web Search”, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, July/August 2003.
[13] Tamana Bhatia, “Link Analysis Algorithms for Web Mining”, IJCST Vol. 2, Issue 2, June 2011.
[14] Rekha Jain, Dr. G. N. Purohit, “Page Ranking Algorithms for Web Mining”, International Journal of Computer Applications, Vol. 13, No. 5, Jan. 2011.
[15] Neelam Tyagi, Simple Sharma, “Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page”, International Journal of Soft Computing and Engineering, Vol. 2, Issue 3, July 2012.
[16] N. Duhan, A. K. Sharma and K. K. Bhatia, “Page Ranking Algorithms: A Survey”, Proceedings of the IEEE International Conference on Advance Computing, 2009, 978-1-4244-1888-6.
[17] Pooja Sharma, Deepak Tyagi, Pawan Bhadana, “Weighted Page Content Rank for Ordering Web Search Result”, International Journal of Engineering Science and Technology, Vol. 2(12), 2010, 7301-7310.
[18] S. Chakrabarti et al., “Mining the Web’s Link Structure”, Computer, 32(8):60-67, 1999.
[19] Raymond Kosala, Hendrik Blockeel, “Web Mining Research: A Survey”, ACM SIGKDD Explorations Newsletter, June 2000, Volume 2.
[20] Cooley, R., Mobasher, B., Srivastava, J., “Web Mining: Information and Pattern Discovery on the World Wide Web”, in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’97), Newport Beach, CA, 1997.
[21] Companion slides for the text by Dr. M. H. Dunham, “Data Mining: Introductory and Advanced Topics”, Prentice Hall, 2002.
[22] Wenpu Xing and Ali Ghorbani, “Weighted Page Rank Algorithm”, Proceedings of the Second Annual Conference on Communication Networks and Services Research (CNSR’04), 2004 IEEE.
