Analytical Implementation of Web Structure Mining Using Data Analysis in Educational Domain

International Journal of Applied Engineering Research ISSN 0973-4562 Volume 11, Number 4 (2016) pp 2552-2556 © Research India Publications. http://www...
Author: Garry Pope
International Journal of Applied Engineering Research ISSN 0973-4562 Volume 11, Number 4 (2016) pp 2552-2556 © Research India Publications. http://www.ripublication.com

Dr. S. P. Victor
Professor, CS, St. Xaviers College, Tirunelveli, Tamil Nadu, India.

Mr. M. Xavier Rex
Research Scholar, M. S. University, Tirunelveli, Tamil Nadu, India.

Abstract
Web data mining analysis of web page structure is a key factor in the educational domain, because it offers a systematic way to apply new techniques to real-time data at different levels. Our experimental setup first retrieves the web structure, treating web pages as nodes and hyperlinks as edges, in order to identify whether a web page is a popular page or a page similar to others. This paper presents a detailed study of a web structure retrieval scheme and of its behavior on periodically updated web pages in the educational domain. We implement our web structure retrieval techniques on real data from educational domains, such as a college web page, as required for an open data analysis system. We also give algorithmic procedures for applying the proposed technique to several sample domains. In the near future we will implement web usage techniques for efficient data analysis.

Keywords: Web Mining, Hyperlink, Web Structure Mining, Pattern, Classification

Introduction
When comparing web mining with traditional data mining, there are three main differences to consider [1]:

Scale: In traditional data mining, processing 1 million records from a database would be a large job. In web mining, even 10 million pages would not be a big number.

Access: When mining corporate information, the data is private and often requires access rights to read. For web mining, the data is public and rarely requires access rights. But web mining has additional constraints, due to the implicit agreement with webmasters regarding automated (non-user) access to this data. This implicit agreement is that a webmaster allows crawlers access to useful data on the website, and in return the crawler (a) promises not to overload the site, and (b) has the potential to drive more traffic to the website once the search index is published. With web mining, there often is no such index, which means the crawler has to be extra careful and polite during the crawling process, to avoid causing any problems for the webmaster.

Structure: A traditional data mining task gets information from a database, which provides some level of explicit structure. A typical web mining task processes unstructured or semi-structured data from web pages. Even when the underlying information for web pages comes from a database, this is often obscured by HTML markup.

A strategic analysis department can mine its client archives with data mining software to determine which offers to send to which clients for maximum conversion rates. For example, suppose a company is thinking about launching cotton shirts as a new product [2]. From its client database, it can determine how many clients have placed orders for cotton shirts over the last year and how much revenue those orders have brought in. With this analysis in hand, the company can decide which offers to send to the clients who ordered cotton shirts and which to those who did not. This keeps the organization's marketing headed in the right direction, rather than going through a trial-and-error phase and spending money needlessly [3]. The same analysis also sheds light on what percentage of customers may move from the company to a competitor. Data mining also lets companies keep a record of fraudulent payments, which can then be researched and studied [4]. This information can help develop more advanced protective methods to prevent such events from happening. Buying trends revealed by web data mining can also help to forecast inventories [5]. This direct analysis lets the organization stock appropriately for each month, based on the predictions derived from the buying trends [6].

Data mining technology is evolving rapidly, and new and better techniques for gathering the required information become available all the time. Web data mining opens avenues for gathering data, but it also raises many concerns about data security. A great deal of personal information is available on the internet, and web data mining has helped to keep the need to secure that information at the forefront [7].


The symbols in the PageRank formula (Equation 1) are:
● PR: PageRank
● p_i: page i
● d: damping factor
● N: number of pages
● L(p_j): number of out-links of page p_j
● M(p_i): set of pages with in-links to page p_i

Proposed Methodology
The proposed methodology treats the web as a graph in which web pages are nodes and hyperlinks are edges connecting related pages. Web structure mining is the process of discovering structure information from the web. It can be divided into two kinds, based on the kind of structure information used: hyperlinks and document structure.

The web structure mining is implemented with the following basic procedure:
1. Extract the PageRank, manually or automatically.
2. Extract the hyperlinks in a web page.
3. Classify the Internet domains.
4. Compute the influence of the major domains.
5. Identify the URL characteristics.

Hyperlinks: A hyperlink is a structural unit that connects a location in a web page to a different location, either within the same web page or on a different web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink. There is a significant body of work on hyperlink analysis; [8] provides an up-to-date survey.
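The distinction between intra-document and inter-document hyperlinks can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name HyperlinkKind and the method classify are hypothetical, and a link is treated as intra-document when it resolves to the same URL as the containing page once the #fragment is ignored.

```java
import java.net.URI;

// Hypothetical helper (not from the paper): classifies a link relative to its page.
class HyperlinkKind {
    // Returns "intra-document" when the link resolves to the same page
    // (ignoring the #fragment), otherwise "inter-document".
    static String classify(String pageUrl, String href) {
        URI page = URI.create(pageUrl);
        URI target = page.resolve(href); // resolve relative links against the page URL
        return stripFragment(page).equals(stripFragment(target))
                ? "intra-document"
                : "inter-document";
    }

    // Removes everything from the first '#' onward.
    private static String stripFragment(URI uri) {
        String s = uri.toString();
        int hash = s.indexOf('#');
        return hash >= 0 ? s.substring(0, hash) : s;
    }

    public static void main(String[] args) {
        String page = "http://www.msuniv.ac.in/Default.aspx";
        System.out.println(classify(page, "#carousel-generic"));
        System.out.println(classify(page, "contactus.aspx"));
    }
}
```

Comparing the resolved URLs as plain strings is a simplification; two URLs that differ only in letter case or in a trailing slash would be reported as different pages.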

The web content extraction can be implemented with the following Java code.
1. The pseudocode of the algorithm for calculating the rank of web pages is presented below.

Document Structure: In addition, the content within a web page can be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM) structures from documents [9].
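As a rough illustration of the tree structure described above, the following Java sketch parses a small page into a DOM and prints its element tags as an indented outline. It assumes the markup is well-formed XML (XHTML-like); real-world HTML is often not, and would need a tolerant HTML parser instead. The class name DomOutline and the sample markup are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not from the paper): prints the tag tree of well-formed markup.
class DomOutline {
    // Returns one line per element, indented by two spaces per tree level.
    static String outline(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder sb = new StringBuilder();
            walk(doc.getDocumentElement(), 0, sb);
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    // Depth-first traversal that records element nodes and skips text nodes.
    private static void walk(Node node, int depth, StringBuilder sb) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(node.getNodeName()).append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, sb);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Departments</h1>"
                + "<ul><li>Physics</li><li>Tamil</li></ul></body></html>";
        System.out.print(outline(page));
    }
}
```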

In the pseudocode, e is the vector with all elements equal to 1, ε is the accuracy threshold, and ‖·‖₁ is the norm of a vector, calculated by summing the absolute values of its elements.

Figure 1: Proposed methodology for web data mining in online sales domain

Implementation
Google PageRank: Websites link to websites they find interesting, so in effect they "vote" for them. The more websites vote for a website, the more interesting it is taken to be; the votes are also regarded as recommendations for the websites. Every website has a starting score, and the scores are then calculated incrementally [10]. If a page has few links, a specific one will be chosen with high probability; if it has many links, a specific one will be chosen with low probability.
● Many in-links: authority
● Many out-links: hub
The PageRank can be calculated as follows:

$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)} \tag{1}$$

2. Extracting the links in a web page

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public class ExtractLinks {
        public static void main(String[] args) {
            String sUrl_yahoo = "http://www.mamma.com/result.php?type=web&q=hai+bird&j_q=&l=";
            try {
                // Open a connection to the page.
                URL siteURL = new URL(sUrl_yahoo);
                URLConnection siteConn = siteURL.openConnection();
                StringBuffer wPage = new StringBuffer(30 * 1024);
                // Read the page line by line into the buffer; the reader is closed automatically.
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(siteConn.getInputStream()))) {
                    String nextLine;
                    while ((nextLine = in.readLine()) != null) {
                        wPage.append(nextLine);
                    }
                }
                // Print the downloaded page content.
                String webPage = wPage.toString();
                System.out.println(webPage);
            } catch (Exception e) {
                System.out.println("Error" + e);
            }
        }
    }
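Once the page content has been downloaded as a string, the hyperlinks still have to be pulled out of the markup. One possible way to do this, shown here only as a sketch, is a regular expression over anchor tags. The class name HrefExtractor is hypothetical, and the pattern is a simplification that handles only quoted href attributes; the paper does not state which extraction technique was actually used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the paper): collects href values from anchor tags.
class HrefExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, case-insensitively.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the href values in the order in which they appear in the markup.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 2 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"contactus.aspx\">Contact</a> "
                + "<A HREF='http://www.ugc.ac.in/'>UGC</A></p>";
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

A regular expression is adequate for a quick survey of a page, but it will miss unquoted attributes and links generated by scripts; a full HTML parser would be the more robust choice.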


3. Computation of the PageRank algorithm
The PageRank algorithm was proposed by Larry Page and Sergey Brin, and its patent is assigned to Stanford University. It is the ranking algorithm used by Google's search engine to compute the rank of a web page. We illustrate the computation with a small universe of four web pages: A, B, C and D. Links from a page to itself, and multiple outbound links from one page to the same target page, are ignored. PageRank is initialized to the same value for all pages. The original form of PageRank assumed that the sum of PageRank over all pages equals the total number of pages on the web at that time; here we instead assume a probability distribution, so that each value lies between 0 and 1. The PageRank that a page transfers to the targets of its outbound links in the next iteration is divided equally among all of its outbound links. Suppose that page B links to pages C and A, page C links to page A, and page D links to all three other pages. We assume an initial value of 0.25 for each web page.
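The iterative computation on this four-page universe can be sketched in Java as below. The sketch applies the damped PageRank formula from the Implementation section repeatedly until the summed absolute change falls below a threshold. The damping factor 0.85 and the threshold 1e-9 are assumed values, not taken from the paper, and the class name PageRankIteration is hypothetical. Page A has no outbound links and no correction is made for that, so the resulting ranks sum to less than 1.

```java
import java.util.Arrays;

// Hypothetical sketch (not the authors' code): damped PageRank by repeated substitution.
class PageRankIteration {
    // outLinks[j] lists the indices of the pages that page j links to.
    static double[] pageRank(int[][] outLinks, double d, double epsilon) {
        int n = outLinks.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n); // same initial value for every page
        double change;
        do {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n); // the (1 - d) / N term
            for (int j = 0; j < n; j++) {
                for (int target : outLinks[j]) {
                    // each out-link of page j carries an equal share of PR(j)
                    next[target] += d * pr[j] / outLinks[j].length;
                }
            }
            change = 0;
            for (int i = 0; i < n; i++) {
                change += Math.abs(next[i] - pr[i]); // L1 norm of the difference
            }
            pr = next;
        } while (change > epsilon);
        return pr;
    }

    public static void main(String[] args) {
        // Pages A=0, B=1, C=2, D=3: B links to A and C, C to A, D to A, B and C.
        int[][] outLinks = { {}, {0, 2}, {0}, {0, 1, 2} };
        double[] pr = pageRank(outLinks, 0.85, 1e-9); // assumed d and threshold
        for (int i = 0; i < pr.length; i++) {
            System.out.println("PR(" + (char) ('A' + i) + ") = " + pr[i]);
        }
    }
}
```

Because this small graph has no cycles, the values stop changing after a handful of iterations; on a real web graph the loop would typically run longer before reaching the threshold.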

Extracting the domain-name suffix from a URL using Java:

    public class ExtractDomainSuffix {
        public static void main(String[] args) {
            // Example host name (an assumed value); in practice this is taken from each extracted link.
            String str = "www.aicte-india.org";
            String last3;
            // If needed, use a longer required length, e.g. 6 for "gov.in".
            if (str == null || str.length() < 3) {
                last3 = str;
            } else {
                last3 = str.substring(str.length() - 3);
            }
            System.out.println(last3);
        }
    }
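A fixed character count works for three-letter suffixes but needs adjusting for compound suffixes such as gov.in or ac.in. An alternative, sketched here under the assumption that every link is a syntactically valid absolute URL, is to take the host from java.net.URI and keep its last dot-separated labels. The class name HostSuffix and the method suffix are hypothetical.

```java
import java.net.URI;
import java.util.Arrays;

// Hypothetical helper (not from the paper): suffix extraction by host labels.
class HostSuffix {
    // Returns the last `labels` dot-separated parts of the URL's host, e.g. "gov.in".
    static String suffix(String url, int labels) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return ""; // no host component, e.g. a relative link
        }
        String[] parts = host.split("\\.");
        int from = Math.max(0, parts.length - labels);
        return String.join(".", Arrays.copyOfRange(parts, from, parts.length));
    }

    public static void main(String[] args) {
        System.out.println(suffix("http://www.aicte-india.org/", 1));
        System.out.println(suffix("http://www.tn.gov.in/", 2));
        System.out.println(suffix("http://www.msuniv.ac.in/Default.aspx", 2));
    }
}
```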

Results and Discussion

To compute the PageRank of A: if the only links in the system were from pages B, C and D to A, each outgoing link would transfer 0.25 to web page A in the next iteration.

The web structure content analysis for a data warehouse, after retrieval of the web content from the corresponding trusted web resource, is as follows for the URL of MS University, Tirunelveli. The PageRank reported by the Google ranking structure is 0.5, obtained through http://checkpagerank.net/index.php

PR(A) = PR(B) + PR(C) + PR(D)

When the number of outbound links of each page is taken into account, the equation becomes

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3

Thus, upon the next iteration, page B transfers half of its existing value, or 0.125, to page A, because page B has two outbound links (to page A and to page C), and the other half, or 0.125, to page C. Page C transfers all of its existing value, 0.25, to the only page it links to, page A. Since page D has three outbound links, it transfers one third of its existing value, or 0.083 (0.25/3 = 0.083), to page A. At the completion of this iteration, page A has a PageRank of 0.458:

PR(A) = 0.25/2 + 0.25/1 + 0.25/3 = 0.458
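The arithmetic of this single iteration can be checked with a short Java sketch. It performs one undamped transfer step on the four-page example, with every page starting at 0.25. The class name PageRankStep is hypothetical, and the step deliberately leaves out the damping factor, matching the simplified calculation in the text rather than the full formula.

```java
// Hypothetical sketch (not the authors' code): one undamped PageRank transfer step.
class PageRankStep {
    // Each page splits its current rank equally among the pages it links to.
    static double[] step(int[][] outLinks, double[] pr) {
        double[] next = new double[pr.length];
        for (int j = 0; j < outLinks.length; j++) {
            for (int target : outLinks[j]) {
                next[target] += pr[j] / outLinks[j].length;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Pages A=0, B=1, C=2, D=3: B links to A and C, C to A, D to A, B and C.
        int[][] outLinks = { {}, {0, 2}, {0}, {0, 1, 2} };
        double[] start = { 0.25, 0.25, 0.25, 0.25 };
        double[] next = step(outLinks, start);
        System.out.println("PR(A) after one iteration = " + next[0]);
    }
}
```

The value computed for page A is 0.25/2 + 0.25 + 0.25/3, which rounds to the 0.458 given above.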

Figure 2: MS University domain PageRank

Executing the Java code for extracting links on the MS University URL yields the hyperlinks listed below.

MS University website hyperlink extraction:
http://www.msuniv.ac.in/Default.aspx
http://www.msuniv.ac.in/default.aspx
http://www.msuniv.ac.in/contactus.aspx
https://admin.google.com/msuniv.ac.in/
http://www.i-radiolive.com/#/aod/manovani/
http://www.nvsp.in/
http://www.msuniv.ac.in/ourstory.aspx
http://www.msuniv.ac.in/mission-vision.aspx
http://www.msuniv.ac.in/universitymap.aspx
http://www.msuniv.ac.in/university-act.aspx
http://www.msuniv.ac.in/statutes.aspx
http://www.msuniv.ac.in/howtoreach.aspx
http://www.msuniv.ac.in/#
http://www.msuniv.ac.in/chancellor.aspx
http://www.msuniv.ac.in/pro-chancellor.aspx
http://www.msuniv.ac.in/vice-chancellor.aspx
http://www.msuniv.ac.in/planing-board.aspx
http://www.msuniv.ac.in/syndicate.aspx
http://www.msuniv.ac.in/senate.aspx
http://www.msuniv.ac.in/scaa.aspx
http://www.msuniv.ac.in/registrar.aspx
http://www.msuniv.ac.in/deans.aspx
http://www.msuniv.ac.in/fo.aspx
http://www.msuniv.ac.in/pio.aspx
http://www.msuniv.ac.in/deputyregistrar.aspx
http://www.msuniv.ac.in/assistantregistrar
http://www.msuniv.ac.in/department.aspx
http://www.msuniv.ac.in/fee-structure.aspx
http://www.msuniv.ac.in/constituent-colleges.aspx
http://www.msuniv.ac.in/affilated-colleges.aspx
http://www.msuniv.ac.in/community-colleges.aspx
http://www.msuniv.ac.in/PG-Extension.aspx
http://www.msuniv.ac.in/Research%20Projects.pdf
http://www.msuniv.ac.in/coe-contact.aspx
http://www.msuniv.ac.in/revisedfee.aspx
http://www.msuniv.ac.in/results.aspx
http://www.msuniv.ac.in/coedownload.aspx
http://www.msuniv.ac.in/Research.aspx
http://www.msuniv.ac.in/Research-contact.aspx
http://www.msuniv.ac.in/Evaluation-Status.aspx
http://www.msuniv.ac.in/Research-Guide.aspx
http://www.msuniv.ac.in/Research-Circulars.aspx
http://www.msuniv.ac.in/Research-Downloads.aspx
http://www.msuniv.ac.in/Research-Centres.aspx
http://www.msuniv.ac.in/ddce-Director.aspx
http://www.msuniv.ac.in/ddce-department.aspx
http://www.msuniv.ac.in/ddce/FeeStructure.pdf
http://www.msuniv.ac.in/application-form.aspx
http://www.msuniv.ac.in/prospectus.aspx
http://www.msuniv.ac.in/ddce-studycentres.aspx
http://www.msuniv.ac.in/MSULibrary/index.html
http://www.msuniv.ac.in/Library-Faculty.aspx
http://www.msuniv.ac.in/iqacnew.aspx
http://www.msuniv.ac.in/nssnew.aspx
http://www.msuniv.ac.in/eqqopp.aspx
http://www.msuniv.ac.in/ywd.aspx
http://www.msuniv.ac.in/Grievances
http://www.msuniv.ac.in/ddceonsite.aspx
http://www.msuniv.ac.in/#carousel-generic
http://14.139.186.252:8080/jspui/
http://www.msuniv.ac.in/downloads.aspx
http://www.msuniv.ac.in/timetable.aspx
http://www.msuniv.ac.in/upgallery.aspx
http://www.msuniv.ac.in/registrar@msuniv.ac.in
http://www.academicroom.com/
http://orcid.org/
https://scholar.google.co.in/schhp?hl=en
http://www.researchgate.net/
http://www.researcherid.com/
http://ieeexplore.ieee.org/Xplore/home.jsp
http://www.ijtra.com/
http://www.researchpublish.com/
http://www.ugc.ac.in/
http://www.inflibnet.ac.in/econ/eresource.php
http://www.aicte-india.org/
http://www.tn.gov.in/

Internet domains classification:

Domain Name    Context
.com           Originally stood for "commercial" to indicate a site used for commercial purposes, but it has since become the most well-known top-level Internet domain, and is now used for any kind of site.
.int           Used by "International" sites, usually NATO sites.
.edu           Used for educational institutions like universities.
.gov           Used for US Government sites.
.mil           Used for US Military sites.
.net           Originally intended for sites related to the Internet itself, but now used for a wide variety of sites.
.org           Originally intended for non-commercial "organizations", but now used for a wide variety of sites. Was managed by the Internet Society for a while.

Classifying the domains by taking the final substring of each link with the Java code yields the following table for the MS University website hyperlinks.

Table 1: Domain link analysis of MS University hyperlinks

Domain         Count
.com           6
.int           0
.edu/.ac.in    63
.gov           3
.mil           0
.net           1
.org           3
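A tally of this kind can be sketched in Java as below. The suffix rules (for example, counting .ac.in and .edu.in hosts under the .edu/.ac.in row, and .gov.in under .gov) are assumptions chosen to match the row labels of the table; the paper does not spell out the exact rules, so the counts produced by this sketch need not match the table. The class name DomainTally and its methods are hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not the authors' code): counts links per domain category.
class DomainTally {
    // Assumed mapping from host suffix to table row; the first matching rule wins.
    private static final String[][] RULES = {
        {".ac.in", ".edu/.ac.in"}, {".edu.in", ".edu/.ac.in"}, {".edu", ".edu/.ac.in"},
        {".gov.in", ".gov"}, {".gov", ".gov"},
        {".com", ".com"}, {".int", ".int"}, {".mil", ".mil"},
        {".net", ".net"}, {".org", ".org"}
    };

    // Returns the table row for a URL, or null when no rule applies.
    static String labelFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        host = host.toLowerCase();
        for (String[] rule : RULES) {
            if (host.endsWith(rule[0])) {
                return rule[1];
            }
        }
        return null; // e.g. IP addresses or suffixes outside the table
    }

    // Counts the links per row, keeping the row order of the table.
    static Map<String, Integer> tally(List<String> urls) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String row : new String[] {".com", ".int", ".edu/.ac.in", ".gov", ".mil", ".net", ".org"}) {
            counts.put(row, 0);
        }
        for (String url : urls) {
            String label = labelFor(url);
            if (label != null) {
                counts.merge(label, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "http://www.msuniv.ac.in/results.aspx",
                "http://www.ugc.ac.in/",
                "http://orcid.org/",
                "http://www.tn.gov.in/",
                "http://www.researchgate.net/");
        System.out.println(tally(sample));
    }
}
```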

The final result of this domain identification is that the crawled web structure is oriented towards an educational university, based on its ac.in features.

The web structure content analysis for a data warehouse, after retrieval of the web content from the corresponding trusted web resource, is as follows for the URL of St. Xaviers College, Tirunelveli. The PageRank reported by the Google ranking structure is 0.4, obtained through http://checkpagerank.net/index.php

Figure 3: St. Xaviers College domain PageRank

Executing the Java code for extracting links on the St. Xaviers College URL yields the hyperlinks listed below.

St. Xaviers College Tirunelveli website hyperlink extraction:
http://stxavierstn.edu.in/#
http://stxavierstn.edu.in/index.php
http://stxavierstn.edu.in/about_xavier.php
http://stxavierstn.edu.in/about_jesuit.php
http://stxavierstn.edu.in/about_college.php
http://stxavierstn.edu.in/departments.php
http://stxavierstn.edu.in/s_one_courses.php
http://stxavierstn.edu.in/s_two_courses.php
http://stxavierstn.edu.in/abt_iqac.php
http://stxavierstn.edu.in/examinations.php
http://stxavierstn.edu.in/404page.php
http://stxavierstn.edu.in/@
http://www.xaho.com/
http://stxavierstn.edu.in/alumni.php
http://stxavierstn.edu.in/Mathematics.zip
http://stxavierstn.edu.in/001E.pdf
http://stxavierstn.edu.in/ZOO2015.pdf
http://stxavierstn.edu.in/XCOM.pdf
http://stxavierstn.edu.in/PLACEMENTCELL.pdf
http://stxavierstn.edu.in/aqar2014.pdf
http://stxavierstn.edu.in/Shift1.pdf
http://stxavierstn.edu.in/Shift2.pdf
http://117.239.105.123:8080/r2015o/estart.asp
http://stxavierstn.edu.in/SXCCIA/
http://stxavierstn.edu.in/Webmail/
http://stxavierstn.edu.in/RBlog/
http://www.aicte-india.org/

Classifying the domains by taking the final substring of each link with the Java code yields the following Table 2 for the St. Xaviers College website hyperlinks.

Table 2: Domain link analysis of St. Xaviers College Tirunelveli hyperlinks

Domain         Count
.com           1
.int           0
.edu/.ac.in    26
.gov           1
.mil           0
.net           0
.org           1

The comparison of the hyperlink analysis is shown in the following chart.

Figure 4: Domain hyperlink comparison analysis

Conclusion
Web structure mining is a powerful technique for extracting information from the link structure of the web, and it plays an important role in our approach. Various algorithms are used in web structure mining to rank the relevant pages, and they treat all links equally when distributing the rank score. In this paper, we approached the research area of web mining, focusing on web structure mining to analyze the structure and content of a specified URL with respect to the goal of its domain. In our sample experiment we found that the university web portal places more emphasis on educational links than the individual college site does. Since this is a huge area, and there is a lot of work to do, we hope this paper can be a useful starting point for identifying opportunities for further research. Our proposed methodology simplifies the process through periodic web data storage and retrieval combinations; by further focusing on their mutual proportion along with the variational effects, we achieved a data analysis process with 99% efficiency. In the near future this research will be extended towards web usage analysis.

References
[1] Baraglia, R. and Silvestri, F. (2007) "Dynamic personalization of web sites without user intervention", Communications of the ACM, Vol. 50, No. 2, pp. 63–67.
[2] Cooley, R., Mobasher, B. and Srivastava, J. (1997) "Web Mining: Information and Pattern Discovery on the World Wide Web", In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence.
[3] Cooley, R., Mobasher, B. and Srivastava, J. (1999) "Data Preparation for Mining World Wide Web Browsing Patterns", Journal of Knowledge and Information Systems, Vol. 1, No. 1, pp. 5–32.
[4] Costa, R. P. and Seco, N. (2008) "Hyponymy Extraction and Web Search Behavior Analysis Based On Query Reformulation", 11th Ibero-American Conference on Artificial Intelligence, October 2008.
[5] Kohavi, R., Mason, L. and Zheng, Z. (2004) "Lessons and Challenges from Mining Retail E-commerce Data", Machine Learning, Vol. 57, pp. 83–113.
[6] Clark, L., Ting, I-H., Kimble, C., Wright, P. and Kudenko, D. (2006) "Combining ethnographic and clickstream data to identify user Web browsing strategies", Journal of Information Research, Vol. 11, No. 2, January 2006.
[7] Eirinaki, M. and Vazirgiannis, M. (2003) "Web Mining for Web Personalization", ACM Transactions on Internet Technology, Vol. 3, No. 1, February 2003.
[8] Mobasher, B., Cooley, R. and Srivastava, J. (2000) "Automatic Personalization based on web usage Mining", Communications of the ACM, Vol. 43, No. 8, pp. 142–151.
[9] Mobasher, B., Dai, H., Luo, T. and Nakagawa, M. (2001) "Effective Personalization Based on Association Rule Discovery from Web Usage Data", In Proceedings of WIDM 2001, Atlanta, GA, USA, pp. 9–15.
[10] Nasraoui, O. and Petenes, C. (2003) "Combining Web Usage Mining and Fuzzy Inference for Website Personalization", In Proceedings of WebKDD 2003, KDD Workshop on Web Mining as a Premise to Effective and Intelligent Web Applications, Washington DC, August 2003, p. 37.