Page-Level Data Extraction from Template Web Pages

International Journal of Scientific and Research Publications, Volume 3, Issue 2, February 2013, ISSN 2250-3153
Author: Stuart Douglas

Ms. Deepali S. Patil*, Prof. S. K. Shinde**

* M. Tech Student, Department Of Technology, Shivaji University, Kolhapur, [email protected]
** M.E (CSE), H.O.D., Department of IT, BVCOE, Kolhapur, [email protected]

Abstract- A huge amount of information on the World Wide Web is presented in structured HTML form, because the pages are generated dynamically from databases and share the same template. This paper proposes a page-level web data extraction system that automatically extracts the schema and templates from such template-based web pages. The proposed system uses visual clues to compare web pages and detect whether they follow a fixed or variant template. From fixed-template pages, we construct a pattern tree, which is used to detect the schema and extract data. The schema is detected by applying tree merging, tree alignment, and mining techniques. Experiments show good results on the web pages used in many web data extraction works.

Index Terms- Information Retrieval, Multiple trees merging, Web data extraction, Wrapper Induction

I. INTRODUCTION

The explosive growth and popularity of the World Wide Web have made it a repository of information for people around the world. Information retrieval from web pages is therefore useful in many applications, such as comparative shopping. Web pages in the Deep Web are generated dynamically from databases and are often presented in a template that gives a consistent view of the search results. Since these web pages are not meant to be processed by programs, developing a wrapper is useful for value-added services and other information integration systems.

There are extensive studies of information extraction (IE) from the World Wide Web. Chang et al. [2] surveyed these works and classified them by degree of automation into four classes: manually constructed, supervised [11], semi-supervised, and unsupervised systems. Manually constructed systems require programmers to write the extraction rules, which is costly and difficult to scale up. Supervised systems require less skill from users, who label sample pages from which the system induces the extraction rules. Semi-supervised systems do not require users to label any sample pages but require post-processing by the users to choose the pattern and indicate the data to be extracted. Unsupervised systems generate the wrappers automatically without any user intervention and have received a lot of attention.

In the last few years, web data extraction has been a hot topic, and a number of approaches for extracting information from web pages have been reported in the literature. We review the previous work based on visual structure and DOM tree structure. IEPAD [2] is one of the first automatic record extraction systems. It identifies substrings that appear multiple times in a document encoded as a token string and discovers the repetitive patterns from a PAT tree (a binary suffix tree). ViDE extracts data records and data items from deep web pages primarily by using visual information. ViDE obtains the visual representation of a page and transforms it into a visual block tree. Its main visual features are Position Features, Layout Features, Appearance Features, and Content Features; essentially, it takes the visual information from the web page layout (location, size, and font). ROADRUNNER is a page-level data extraction system that starts with the first input page as its initial template. For each subsequent page, it checks whether the page can be generated by the current template. EXALG [1] extracts data by analyzing equivalence classes; large and frequently occurring equivalence classes are extracted for template generation. Some approaches use visual information to extract web data. ViPER [4] uses visual information for multiple sequence alignment.

The proposed system is an unsupervised data extraction system that uses a tree template to model the generation of dynamic web pages. It uses the VIsion-based Page Segmentation algorithm (VIPS) [10] to build a visual block tree for each page. By comparing the visual blocks in two trees, it detects whether the pages belong to a fixed or a variant template. To deduce the schema and templates for each individual Deep Web site, it applies tree matching for peer node recognition, tree alignment for recognizing missing and optional nodes, and mining techniques for repetitive pattern detection. Schema detection therefore consists of four steps: peer node recognition, matrix alignment, pattern mining, and optional node detection. In the peer node recognition step, it compares nodes with the same tag at the same level and denotes them with the same symbol if they are peer subtrees. In the matrix alignment step, it aligns the nodes in the peer matrix to obtain a list of aligned nodes. In the pattern mining step, it detects repetitive patterns of different lengths in the aligned list, starting from length 1. In the optional node detection step, it recognizes a node as optional if it disappears in some columns of the matrix, and groups such nodes.
After the pattern tree is found, data extraction is done by matching the pattern tree against the HTML tree at each level.

II. PROPOSED SYSTEM

Detecting the schema of a web site is a key step for value-added services such as comparative shopping and information integration systems. Our proposed work aims to detect the schema of a website and extract its data.

A. Problem Definitions

All data instances of a web site shall conform to a common schema, which can be defined as follows.

Definition 1: (Structured data) A data schema can be of the following types.
- A basic type β represents a string of tokens, where a token is some basic unit of text.
- If T1, T2, ..., Tk are types, then their ordered list <T1, T2, ..., Tk> also forms a type T. We say that the type T is constructed from the types T1, T2, ..., Tk using a type constructor of order k. An instance of the k-order T is of the form <x1, x2, ..., xk>, where x1, x2, ..., xk are instances of types T1, T2, ..., Tk, respectively. The type T is called
  1. A tuple, denoted by <k>T, if the cardinality (the number of instances) is 1 for every instantiation.
  2. An option, denoted by (k)?T, if the cardinality is either 0 or 1 for every instantiation.
  3. A set, denoted by {k}T, if the cardinality is greater than 1 for some instantiation.
  4. A disjunction, denoted by (T1 | T2 | ... | Tk)T, if all Ti (i = 1, ..., k) are options and the cardinality sum of the k options (T1 to Tk) equals 1 for every instantiation.
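The cardinality rules of Definition 1 can be illustrated with a small Python sketch. It is only an illustration: the function names `classify_constructor` and `is_disjunction` are hypothetical, and the sketch assumes that the cardinality of a type has already been counted in each instantiation (for example, in each page). Where several rules hold at once, it reports the most specific one (a cardinality of 1 everywhere is reported as a tuple rather than an option).

```python
from typing import Sequence


def classify_constructor(cardinalities: Sequence[int]) -> str:
    """Classify a constructed type from its cardinality in each instantiation."""
    if not cardinalities:
        raise ValueError("need at least one instantiation")
    if any(c > 1 for c in cardinalities):
        return "set"      # more than one instance in some instantiation
    if all(c == 1 for c in cardinalities):
        return "tuple"    # exactly one instance in every instantiation
    return "option"       # only 0 or 1 instances in every instantiation


def is_disjunction(option_cardinalities: Sequence[Sequence[int]]) -> bool:
    """option_cardinalities[i][p] is the cardinality of option Ti in instantiation p.

    The options form a disjunction when exactly one of them is present
    (cardinalities sum to 1) in every instantiation.
    """
    if not option_cardinalities:
        return False
    for per_instantiation in zip(*option_cardinalities):
        if any(c not in (0, 1) for c in per_instantiation):
            return False  # some Ti is not an option
        if sum(per_instantiation) != 1:
            return False
    return True


print(classify_constructor([1, 1, 1]))            # tuple
print(classify_constructor([1, 0, 1]))            # option
print(classify_constructor([2, 1, 0]))            # set
print(is_disjunction([[1, 0, 0], [0, 1, 1]]))     # True
```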

Figure 1: (a) A web page; (b), (c) its two different schemas

Definition 2: (Wrapper Induction) Given a set of n DOM trees created from some unknown template T and values, deduce the template, schema, and values from the set of DOM trees alone.

B. Methodology

The proposed approach detects the schema of web pages by constructing a pattern tree. The process flow of the system is shown in Figure 2.

Figure 2: Process flow of System

The proposed system consists of the following steps:
Step 1: Take two web pages as input.
Step 2: For each page, apply the VIsion-based Page Segmentation (VIPS) algorithm to segment the web page and build a visual block tree.
Step 3: Compare the blocks in the visual trees of the two web pages to detect fixed/variant template pages.
Step 4: For fixed template pages, apply the multiple tree merging algorithm, which consists of the following steps.
- Peer Node Recognition
- Matrix Alignment
- Pattern Mining
- Optional Node Merging
Step 5: Create the pattern tree and detect the schema.
Step 6: Extract data by matching the pattern tree against the HTML tree.
Step 7: Extract data from variant template pages.

C. Algorithms Used

1) VIPS Algorithm

For each input page, we apply VIPS [10], the VIsion-based Page Segmentation algorithm, which segments the page into a block structure. In VIPS, the vision-based content structure of a page is obtained by combining the DOM structure and visual cues, and a visual block tree is constructed for the page. The visual block trees of the two pages are then compared, and on the basis of position, size, color, and font we determine whether the pages belong to a fixed or a variant template.
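The fixed/variant template check, which compares visual blocks by position, size, color, and font, can be sketched in Python as follows. The paper does not give the exact comparison rule, so everything here is an assumption made for illustration: the `Block` fields, the pixel tolerance `tol`, the matching ratio `min_ratio`, and the simplification of comparing flat lists of blocks instead of full visual block trees.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """Hypothetical visual block: position, size, font and color."""
    x: int
    y: int
    width: int
    height: int
    font: str
    color: str


def blocks_similar(a: Block, b: Block, tol: int = 10) -> bool:
    """Blocks match when geometry agrees within `tol` pixels and style is equal."""
    return (
        abs(a.x - b.x) <= tol
        and abs(a.y - b.y) <= tol
        and abs(a.width - b.width) <= tol
        and abs(a.height - b.height) <= tol
        and a.font == b.font
        and a.color == b.color
    )


def same_template(page1: List[Block], page2: List[Block], min_ratio: float = 0.8) -> bool:
    """Assumed rule: pages share a fixed template if enough blocks of page1 match page2."""
    if not page1 or not page2:
        return False
    matched = sum(1 for a in page1 if any(blocks_similar(a, b) for b in page2))
    return matched / max(len(page1), len(page2)) >= min_ratio


page_a = [Block(0, 0, 800, 100, "Arial", "#000"), Block(0, 100, 800, 500, "Arial", "#333")]
page_b = [Block(0, 2, 800, 100, "Arial", "#000"), Block(0, 104, 800, 495, "Arial", "#333")]
page_c = [
    Block(0, 0, 800, 100, "Arial", "#000"),
    Block(300, 100, 200, 200, "Times", "#f00"),
    Block(0, 400, 800, 50, "Arial", "#000"),
]
print(same_template(page_a, page_b))  # True: both blocks line up
print(same_template(page_a, page_c))  # False: only the header block matches
```

In a fuller implementation the comparison would walk the two visual block trees level by level; the flat list keeps the sketch short.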

Figure 3: (a) A web page (b) Page's visual layout

2) Tree Merging Algorithm

This module merges all input DOM trees at the same time into a structure called the fixed/variant pattern tree, which can be used to detect the template and schema of the website. It proceeds in four steps: peer node recognition, matrix alignment, pattern mining, and optional node detection.
- In the peer node recognition step, two nodes with the same tag name are compared to check whether they are peer subtrees.
- In the matrix alignment step, the system aligns the nodes in the peer matrix to get a list of aligned nodes, childList.
- In the pattern mining step, the system takes the aligned childList as input and detects repetitive patterns.
- In the last step, the system recognizes optional nodes.

Algorithm MultipleTreeMerge (T, P)
// T is a set of DOM trees of the same type;
// P is the tag for the roots of T.
    Initialize M; i = 0;
    For each tree t in T
        j = 0;
        For each child c in t
            M[j++][i] = c;
        EndFor
        i++;
    EndFor
    recognizePeerNode(M);
    childList = matrixAlignment(M);
    childList = repeatMining(childList, 1);
    mergeOptional(childList);
    For each node c in childList
        If (c is a tree) then
            C = multipleTreeMerge(peerNode(c, M), tag(c));
        Else
            C = c;
        EndIf
        Insert C as a child of P;
    EndFor
    Return pattern tree P;
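The matrix-filling loop and the peer recognition call of MultipleTreeMerge can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the `Node` class and the function names are hypothetical, peers are decided with a plain normalized simple tree matching score rather than the full TreeMatchScore procedure, and integer symbols stand in for the letters used in the paper's figures. The 0.4 threshold is the one stated in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical DOM node: a tag name and ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    return 1 + sum(size(c) for c in t.children)


def tree_matching(a: Node, b: Node) -> int:
    """Simple tree matching: number of node pairs in a maximum order-preserving mapping."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1] + tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def match_score(a: Node, b: Node) -> float:
    """Normalized score in [0, 1]; 1.0 means identical tag structure."""
    return 2 * tree_matching(a, b) / (size(a) + size(b))


def build_peer_matrix(trees: List[Node]) -> List[List[Optional[Node]]]:
    """Column i holds the children of tree i; short columns are padded with None."""
    rows = max(len(t.children) for t in trees)
    matrix: List[List[Optional[Node]]] = [[None] * len(trees) for _ in range(rows)]
    for i, t in enumerate(trees):
        for j, c in enumerate(t.children):
            matrix[j][i] = c
    return matrix


def recognize_peers(matrix, threshold: float = 0.4):
    """Give peer subtrees the same integer symbol (one representative per symbol)."""
    representatives: List[Node] = []
    symbols = [[None] * len(row) for row in matrix]
    for r, row in enumerate(matrix):
        for c, node in enumerate(row):
            if node is None:
                continue
            for k, rep in enumerate(representatives):
                if match_score(node, rep) > threshold:
                    symbols[r][c] = k
                    break
            else:
                representatives.append(node)
                symbols[r][c] = len(representatives) - 1
    return symbols


t1 = Node("body", [Node("h1"), Node("p"), Node("p")])
t2 = Node("body", [Node("h1"), Node("p")])
print(recognize_peers(build_peer_matrix([t1, t2])))  # [[0, 0], [1, 1], [1, None]]
```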

Peer Node Recognition

Peer node recognition is the problem of deciding whether two nodes at the same level are alike. It is the first step towards template recognition: during peer node recognition, the algorithm marks two children at the same level in two different trees as the same if they match. A 2-tree matching algorithm is used to find the maximum matching between two trees.

Algorithm TreeMatchScore (A, B)
// A and B are two trees.
    If (A.text != B.text)
        Return 0;
    If ((A or B is a leaf) or size(A) == size(B)) then
        Return 2 * TreeMatching(A, B) / (size(A) + size(B));
    score = 0.0; m = number of children of A;
    For each child a in A do
        nodeScore = 0.0; matchNo = 0;
        For each child b in B do
            temp = 2 * TreeMatching(a, b) / (size(a) + size(b));
            If (temp > θ)
                nodeScore += temp; matchNo++;
            EndIf
        EndFor
        If (matchNo > 0) nodeScore = nodeScore / matchNo;
        score += nodeScore;
    EndFor
    Return (score/m + 2) / (size(A) + size(B));

TreeMatching(A, B) returns the number of matching nodes between A and B. The algorithm returns a matching score, which is the normalized ratio of the number of pairs in the mapping to the maximum size of the two trees. If the score is higher than a threshold (0.4), the two trees are considered peer trees.

Matrix Alignment

In matrix alignment, all child nodes fill up a matrix such that all peer child nodes are given the same symbol. The algorithm traverses the matrix row by row and tries to align every row until an aligned peer matrix is constructed. This step tries to detect optional data.

Procedure matrixAlignment (M)
    spans = computeSpans(M);
    row = 0;
    shiftLength = 0;
    While (M is not aligned) do
        While (!alignedRow(row, M)) do
            shiftColumn = getShiftColumn(row, shiftLength, M);
            makeShift(row, shiftColumn, shiftLength, M);
        EndWhile
        row++;
    EndWhile
    childList = alignmentResult(M);

When M is not aligned, the node to be shifted depends on the span value of the node. The span value of a node is defined as the maximum number of different nodes between any two consecutive occurrences of the node in each column, plus one. The function getShiftColumn selects a column to be shifted from the current row r (shiftColumn) and identifies the required shift distance (shiftLength) by applying the following rules in order:
1. (R1) Select, from left to right, a column c such that the expected appearance of node n is not reached, i.e., there exists a node with the same symbol at an upper row r_up where M[r_up][c'] = n for some c' and r - r_up < span(n). Then shiftColumn equals c and shiftLength equals 1.
2. (R2) If R1 fails, select a column c with the nearest row r_down from r such that M[r_down][c'] = M[r][c] for some c' ≠ c. In this case, shiftLength = r_down - r.
3. (R3) If both R1 and R2 fail, check whether row r contains only data (text/image) nodes. In that case, no shifting is done.
4. If R1, R2, and R3 all fail, select the symbol that occurs the maximum number of times in this row. Keeping all columns with that symbol in r unchanged, shift all other columns down by one.
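The span value used by rule R1 can be computed directly from its definition. The Python sketch below is a minimal illustration; the function name `span` and the matrix representation (a list of rows holding symbols, with `None` for empty cells) are assumptions. The definition does not say what the span is for a symbol that never repeats within a column, so the sketch returns 0 in that case as an explicit assumption.

```python
from typing import List, Optional

Matrix = List[List[Optional[str]]]


def span(matrix: Matrix, symbol: str) -> int:
    """Max number of distinct symbols between consecutive occurrences of
    `symbol` in any column, plus one. Returns 0 (an assumption) when the
    symbol never occurs twice in the same column."""
    best = -1
    n_cols = len(matrix[0]) if matrix else 0
    for c in range(n_cols):
        column = [row[c] for row in matrix]
        last = None  # row index of the previous occurrence in this column
        for r, value in enumerate(column):
            if value != symbol:
                continue
            if last is not None:
                # distinct non-empty symbols strictly between the two occurrences
                between = {v for v in column[last + 1:r] if v is not None}
                best = max(best, len(between))
            last = r
    return best + 1 if best >= 0 else 0


# One column: a, b, c, d, b. Two different nodes (c, d) sit between the b's.
column_matrix = [["a"], ["b"], ["c"], ["d"], ["b"]]
print(span(column_matrix, "b"))  # 3
print(span(column_matrix, "a"))  # 0
```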

Figure 4: An Example of Matrix Alignment

Fig. 4 shows an example of how the algorithm proceeds. The first three rows of M1 are aligned, so the algorithm makes no changes to them. The fourth row in M1 is not aligned, so the algorithm tries to align it by shifting some columns according to the rules above. According to rule R1, column 3 is selected, since there is a node b at row 2 such that 4 - 2 < span(b) = 3. Hence, matrix M2 is obtained. Since the 4th row in M2 is aligned, we next align row 5 in M2, which is not aligned. According to rule R2, column 2 is selected, since node e has its nearest occurrence at the 8th row in column 1. Therefore, shiftColumn = 2 and shiftLength = 8 - 5 = 3. Similarly, we can follow the selection rules at each row and get matrices M4, M5, and the final aligned peer matrix M6. Here, dashes denote null nodes. The alignment result childList is shown at the rightmost of the figure.

Pattern Mining

The pattern mining step is designed to handle set-typed data, where multiple values occur. Many repetitive patterns may be discovered, and a pattern can be embedded in another pattern. In this step, the algorithm tries to discover repetitive patterns and merge them to deduce the aligned list. If a repetitive pattern is detected, a virtual node is added to the pattern tree.

Procedure PatternMining (List, extent)
1. K=compLvalue(List,extent);
2. For(i=0;i=0) 5. newRep=0; 6. for(j=st+I;j+i-1
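The idea described in the prose, detecting consecutive repetitions in the aligned list and replacing them with a virtual node, can be sketched in Python as follows. This is not a reconstruction of the PatternMining procedure: it is a simple greedy left-to-right pass, and the function names and the `("set", pattern)` tuple used to represent a virtual node are assumptions made for illustration. Pattern lengths are tried in increasing order starting from 1, as the paper describes.

```python
from typing import Any, List


def collapse_repeats(symbols: List[Any], length: int) -> List[Any]:
    """Replace each run of consecutive repeats of a `length`-long pattern
    with one virtual node ("set", pattern)."""
    out: List[Any] = []
    i = 0
    while i < len(symbols):
        pattern = symbols[i:i + length]
        j = i + length
        repeated = False
        # extend the run while the next `length` symbols equal the pattern
        while len(pattern) == length and symbols[j:j + length] == pattern:
            j += length
            repeated = True
        if repeated:
            out.append(("set", tuple(pattern)))
            i = j
        else:
            out.append(symbols[i])
            i += 1
    return out


def mine_patterns(symbols: List[Any]) -> List[Any]:
    """Try pattern lengths 1, 2, ... on the progressively shortened list."""
    result = list(symbols)
    length = 1
    while length <= len(result) // 2:
        result = collapse_repeats(result, length)
        length += 1
    return result


print(mine_patterns(["a", "b", "c", "b", "c", "d"]))
# ['a', ('set', ('b', 'c')), 'd']
```

Because a virtual node is itself an element of the shortened list, a later pass with a longer pattern length can wrap it in another virtual node, which mirrors the remark that one pattern can be embedded in another.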
