Author: Chloe Marshall
International Journal of Research in Computer and Communication Technology, Vol 4, Issue 8, August 2015
ISSN (Online) 2278-5841, ISSN (Print) 2320-5156

Data Mining Approach To Analyze Virtual Museums Web Log Data
Mr. B.V.RamaKrishna (Student), Dr. K.V.V.S.Narayana Murthy (Professor)
Chaitanya Institute of Science & Technology, Andhra Pradesh, India
[email protected], manikyammurthy@gmail

ABSTRACT
Virtual museums are part of digital libraries with large collections of multidimensional data. Knowledge engineering tools facilitate the extraction of meaningful information to support data mining tasks such as classification, decision making, association, clustering and ranking. In this paper we analyzed raw web log data gathered from an online virtual museum server. We applied knowledge engineering techniques to this log data to discover interesting patterns in user sessions. We also mined the web log data for collection ranking, association rules and decision tree construction. These details improved the organization of virtual museums.

Keywords – Knowledge Engineering, clustering, web logs, classification, virtual museums.

1. INTRODUCTION
A web log file is a log file automatically created and maintained by a web server [1]. It registers every hit to a web page, including views of HTML documents, images, multimedia content or any other active web object. The raw web log file format is essentially one line of text for each hit on the web site. Each line records who visited the web site, where they accessed it from, the source-to-destination path, and the request and response details. A server log file is a collection of lines in text format. Each line includes some or all of the following information:
• IP address of the computer making the request (i.e. the visitor)
• Identity of the computer making the request
• The login ID of the visitor
• The date and time of the hit
• The request method
• The location and name of the requested file
• HTTP status code
• The size of the requested file

www.ijrcct.org

• The web page which referred the hit

Web Data Mining is a technique used to crawl through various web resources to collect required information, which enables an individual or a company to promote business and to understand marketing dynamics and new promotions appearing on the Internet [8]. There is a growing trend among companies, organizations and individuals alike to gather information through web data mining and use it in their best interest. Web Data Mining is primarily classified into three categories: Web Usage Mining (WUM), Web Content Mining (WCM) and Web Structure Mining (WSM) [8]. In this paper we focus mainly on WUM, which is the process of extracting useful information from server logs. WUM supports investigations of user search strategies, web behavior, interesting usage patterns and semantic information extraction.

Section 2 describes the methodology for preprocessing raw log file data and filtering the essential fields from it. Section 3 describes the knowledge engineering approach applied to the collected data in our experimental environment. Section 4 analyzes the results, and Section 5 presents the conclusions and the future scope of the work.

2. METHODOLOGY
2.1 LOG FILE STRUCTURE
Log files are stored on web servers. These files register users' activities during their sessions. The raw files carry valuable information which can be used to analyze user behavior, web surfing patterns, session trends and profile learning. The basic log file is a collection of text lines; each line represents a collection of fields in the format shown in Figure 1. Each field holds significant information about visitor activity during the web session [5]. The fields and their roles are given as:


Figure 1. Log File Format (fields: UID, VP, PT, TS, PLV, SR, UA, URL, RT)

UID: User Identification, i.e. the IP address of the user browsing the session. Sometimes user profiles are maintained to identify revisits by users.
VP: Visitor's Path, which represents the URL/link supplied by the search engine.
PT: Traversal Path, the whole path taken by the visitor during a specific web site search.
TS: Time Stamp, the time duration between session start and session end.
PLV: The page last visited by the visitor before leaving the web site.
SR: Success Rate, the number of downloads or transactions completed successfully by the visitor during the session.
UA: User Agent, the browser details used by the visitor when requesting the web server.
URL: The resource accessed by the visitor: any file/image/doc/object on the web.
RT: Request Type, the method used for information transfer between web client and web server. Ex: GET, POST.

2.2 PREPROCESSING
The raw log files must be preprocessed before they are applied to data mining tools [5]. The preprocessing is carried out in phases; each phase refines the data by transforming it into a form convenient for the data mining process. Figure 2 depicts the phased preprocessing of a raw log file, which is named ETL (Extraction, Transformation, Loading) in data mining terminology.

Figure 2. Log File Preprocessing
CLEANING: Removing noise, redundant data and anomalies from the raw log file
FILTERING: Filtering sessions, requests and profiles of log files
TRANSFORMING: Converting data into CSV/ARFF/XLS/DBF file format
Data Marts

3. PREPARATION OF DATA SETS
3.1 DATA COLLECTION
The data sets prepared for this paper are extracted from web log file archives located in the directory ‘Research & Education’ of the British On-line Museum. The raw web log file data are preprocessed to form tabular data sheets in Excel file format [4]. Later these are converted to the Tanagra data mining file format, which can be loaded directly into the Tanagra data mining tool as training data sets for our experiments.

3.2 DATA SETS
For analyzing user behavior patterns we constructed a data set with the field structure [user-id, session period, success rate, HTTP_REQ, region] [5]. A snapshot of 200 visitor sessions was used to prepare this tabular data sheet. The raw web log file was processed as described below to extract the individual fields for this data set.

For identifying association rules among item sets (collections) in the museum using association rule mining algorithms, we constructed a data set representing an M×N matrix, where M = individual user sessions and N = the set of painting authors. In a user session we set the Boolean value to true for each artist who is the author of a visited painting collection. The resulting matrix forms a tabular sheet of sessions and their visits to specific collections of paintings. A snapshot of 150 visitor sessions was used to prepare this data set from the raw web log file [6].

3.3 RAW WEB LOG FILE PROCESSING
To prepare the above data sets for our experiments with the data mining tool on museum data, the raw file must be processed carefully [5]. Since each line in the raw log file represents the details of one user session, we have to extract the required fields from it. Java or .NET can be used to process the text file easily. During this process the extracted values are placed into database tables with appropriate field names. We constructed two tables for our experiments. These database tables are converted to Excel spreadsheets.
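The field extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper mentions Java or .NET, while Python is used here for brevity. It assumes the raw file follows the Apache "combined" log format, which covers the fields listed in Section 1; the paper does not state the exact layout of its log lines. The names `LOG_PATTERN`, `FIELDS`, `parse_line` and `to_tab_delimited` are hypothetical.

```python
import csv
import io
import re

# Assumption: Apache "combined" log format. The paper does not give the exact
# layout of its raw log file, so this pattern is illustrative only.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical column selection for the output table.
FIELDS = ["ip", "user", "timestamp", "method", "url",
          "status", "size", "referrer", "agent"]


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent; store it as 0.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def to_tab_delimited(lines):
    """Parse log lines and return the matched records as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for line in lines:
        record = parse_line(line)
        if record is not None:  # skip lines that do not match the pattern
            writer.writerow(record)
    return buffer.getvalue()
```

Lines that do not match the assumed pattern are skipped rather than raising an error, which roughly corresponds to the cleaning phase of preprocessing. The tab-delimited output has one header row followed by one row per parsed hit.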
Using the multi-format saving option available in MS-Excel, we again converted these tables into tab-delimited text file format. The tab-delimited text file can be converted easily into the .tdm (Tanagra data mining) file format and stored in the Tanagra data set directory. Now we are ready to perform data mining experiments with the processed museum data.

4. EXPERIMENTAL ANALYSIS
4.1 Association Rule Mining
During the experiment we took a snapshot of 150 individual visitor sessions related to the paintings gallery of the museum. We constructed a binary data file with painting authors as columns; a visitor's visit to a specific author sets a Boolean entry in the respective author field. After transforming the .XLS file into a .TDM file, we loaded the training data set into the Tanagra mining tool. Selecting the association tab presents a set of association rule mining functions. For generating association rules we chose the Apriori algorithm tool. The Apriori parameters are set to support = 0.05, lift = 1.1 and confidence = 0.75. Figure 3 shows the rules generated by the Tanagra mining tool for associations among painting selections. Association rules are generated based on the item selection sequence, similar to ‘Market Basket Analysis’. Frequent item sets with minimum support and the threshold confidence value are identified. The maximum frequent item sets chosen with 100% confidence are {J.L.David, C.Hassam, J. Antoine Watteau}, {P.Klee, E.Delacroix, S.Dali}, {Jan Van Eyck, J.Constable}, {Leonardo da Vinci, Picasso}, {S.Dali, P.Klee}. These frequent item sets reveal the strength of the associations among items frequently selected by visitors; thus association rules provide interesting measures of visitor interest over painting collections. The generated association rules also provide support and confidence levels for painting selection, which supports collection improvement in museums.

Figure 3. Association Rule Mining on Visitor Painting Selection

4.2 Classification Decision Trees
Our next aim is to analyze museum visitor traffic [2] and construct decision trees for various information gains. From the raw web log files we have already collected preprocessed data values, which are transformed into TDM files ready for use by the Tanagra mining tool [4]. First we generate a decision tree for visitor classification based on the visitor's visits on specific weekdays and the session duration. Figure 4 gives the classification of visitors by session duration range and by the weekday on which they generally visit. Casual visitors are those whose session duration is always below 90 minutes; Philosopher visitors have a session duration always above 90 minutes. Both visit the museum on the week days [Sunday, Saturday, Friday] only. Explorer visitors are those whose session duration is >90 and