Integrating Product Data from Websites offering Microdata Markup

Petar Petrovski, Volha Bryl, Christian Bizer
Data and Web Science Research Group, University of Mannheim, Germany

School of Business Informatics and Mathematics

Outline
1. HTML-embedded Data on the Web
2. The Data Integration Pipeline
   1. Microdata extraction
   2. Classification
   3. Feature extraction
   4. Identity resolution
   5. Data Fusion
3. Conclusions

Integrating Product Data from Microdata Markup. Petar Petrovski, Volha Bryl, Chris Bizer

HTML-embedded Data
More and more websites semantically mark up the content of their HTML pages.
• Microformats
• RDFa
• Microdata

Schema.org

• Search engines ask site owners to embed data to enrich search results.
• 200+ classes: Product, Review, LocalBusiness, Person, Place, Event, …
• Encoding: Microdata or RDFa

Usage of Schema.org Data @ Google

• Data snippets within search results
• Data snippets within info boxes

Websites Containing Structured Data (November 2013)
• 585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26%).
• 1.7 million websites (pay-level domains, PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13%).
• Source: http://webdatacommons.org/structureddata/
• Google, October 2013: 15% of all websites provide structured data.

Top Classes, Microdata (2013)

• schema = Schema.org
• datavoc = Google's Rich Snippet Vocabulary

Example: Microdata, Local Business


Example: Microdata, Product


The Data Integration Pipeline
• Objective: integrate all data found on the web describing a specific entity (e.g. product or organization)
• Motivation: enables creation of powerful applications, e.g. comparison shopping portals
• Use case: product data
• Implemented pipeline: Microdata extraction → Classification → Feature extraction → Identity resolution → Data fusion


Web Data Commons Extraction Framework
• Web Data Commons project: extracts structured data from the Common Crawl
  – http://webdatacommons.org/
  – http://commoncrawl.org/
• Code available at:
  – https://subversion.assembla.com/svn/commondata/
  – Based on the Anything To Triples (any23) library for extracting structured data: http://any23.apache.org
• Common Crawl 2012
  – 3 billion HTML pages, 40.6 million websites
  – 7.3 billion statements describing 1.15 billion things
  – 9.4 million product offers from 9,240 e-shops
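To make the extraction step concrete, here is a minimal sketch of pulling property-value pairs out of Microdata markup. It does not use any23 or the Web Data Commons code. It relies only on Python's standard `html.parser`, handles a single flat item, and ignores nested `itemscope` elements and `content` attributes. The class name `MicrodataSketch` and the HTML fragment are hypothetical and were written for this illustration.

```python
from html.parser import HTMLParser

class MicrodataSketch(HTMLParser):
    """Collects (itemprop, text) pairs from a flat Microdata snippet."""

    def __init__(self):
        super().__init__()
        self.item_type = None   # value of the first itemtype attribute seen
        self.pairs = []         # extracted (property, value) pairs
        self._current = None    # itemprop whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs and self.item_type is None:
            self.item_type = attrs["itemtype"]
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.pairs.append((self._current, text))
            self._current = None

# Hypothetical product page fragment, written for this example.
html = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Apple iPod nano (8 GB, 6th generation, Graphite)</span>
  <span itemprop="description">Memory size: 8GB. Colour: Graphite</span>
</div>
"""

parser = MicrodataSketch()
parser.feed(html)
print(parser.item_type)
for prop, value in parser.pairs:
    print(prop, "=", value)
```

A production extractor would also resolve nested items and emit RDF statements, which is what the framework above does at crawl scale.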

Looking Deeper into E-Commerce Data Microdata Product (2013)


Looking Deeper into E-Commerce Data Microdata Product (2012)

Example: Title and Description

Offer 1
Title: AppleMacBook Air MC968/A 11.6-Inch Laptop
Description: Faster Flash Storage with 64 GB Solid State Drive and USB 3.0. 720p FaceTime HD Camera. The new 1.6 GHz Intel Core i5 Processor with Intel HD Graphics 3000 enabling beautiful rendering and 4GB DDR3 RAM. 11.6” LED display with the best resolution…

Offer 2
Title: Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB, Mac OS X Lion 10.7
Description: The MacBook Air MC 968/A powered by Intel Core i5(1.6GHz, 3MB L3). 64 GB SSD and 4096 MB of DDR3 RAM. 29.464cm (11.6”) TFT 1366x768, Intel HD Graphics, IEEE 802.11a/b/g, Bluetooth 4.0, FaceTme camera, OS X LIon

Observations
• Different descriptions follow different levels of detail
• Various abbreviations are used to describe the same features
• Numeric values are often imprecise due to rounding


Product Classification
• Starting from 9.4 million products:
  – Products with English descriptions longer than 20 words => 1,986,359 products from 9,240 e-shops
• Training set
  – 18,000 labeled products, 9 classes
• Training the model
  – Naïve Bayes classifier
• Feature generation
  – 4-step process: tokenizing and removing stop words, pruning, n-grams, TF-IDF
  – ~3,600 features
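The feature generation and training steps above can be sketched in plain Python. This is an illustrative toy, not the authors' implementation: the stop-word list, the `min_df` pruning rule, the bigram limit, the smoothing, and the four training offers are all assumptions. It feeds TF-IDF weights into a multinomial Naive Bayes model, which is one common way to combine the two.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "and", "of", "with", "for", "in"}  # toy list

def features(text, n_max=2):
    """Tokenize, drop stop words, and emit word n-grams up to n_max."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def train(docs, labels, min_df=1):
    """Fit IDF weights and a multinomial Naive Bayes model on TF-IDF weights."""
    doc_feats = [Counter(features(d)) for d in docs]
    df = Counter(f for feats in doc_feats for f in feats)
    # Pruning: keep only features that occur in at least min_df documents.
    vocab = {f for f, c in df.items() if c >= min_df}
    idf = {f: math.log((1 + len(docs)) / (1 + df[f])) + 1 for f in vocab}
    weight = defaultdict(Counter)   # class -> feature -> summed TF-IDF
    prior = Counter(labels)
    for feats, label in zip(doc_feats, labels):
        for f, tf in feats.items():
            if f in vocab:
                weight[label][f] += tf * idf[f]
    return {"vocab": vocab, "idf": idf, "weight": weight, "prior": prior, "n": len(docs)}

def predict(model, text):
    """Return the class with the highest smoothed log-posterior."""
    feats = Counter(f for f in features(text) if f in model["vocab"])
    best, best_score = None, -math.inf
    for label, count in model["prior"].items():
        total = sum(model["weight"][label].values()) + len(model["vocab"])
        score = math.log(count / model["n"])
        for f, tf in feats.items():
            likelihood = (model["weight"][label][f] + 1) / total  # Laplace smoothing
            score += tf * model["idf"][f] * math.log(likelihood)
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical training offers; the real training set had 18,000 labeled products.
docs = [
    "Paperback novel by a bestselling author",
    "Hardcover cookbook with 200 recipes",
    "Laptop with Intel Core i5 processor and 4GB RAM",
    "Smartphone with 16GB storage and HD display",
]
labels = ["Books", "Books", "Electronics & Computers", "Electronics & Computers"]
model = train(docs, labels)
print(predict(model, "Notebook with Intel processor and 8GB RAM"))
```

With 18,000 training products, raising `min_df` is one way to prune the vocabulary toward a feature count in the low thousands.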

Classification Performance
The offers originate from 9,240 e-shops.

Category                     Precision %   Recall %   # Offers
Books                        86.58         87.95        233,249
Movies, Music & Games        89.81         70.63        186,832
Electronics & Computers      92.98         88.00        219,118
Home, Garden & Tools         73.81         60.78        186,495
Grocery, Health & Beauty     70.20         72.86        120,573
Toys, Kids, Baby & Pets      75.00         64.85        114,236
Clothing, Shoes & Jewelry    88.56         89.93        206,315
Sports & Outdoors            72.83         67.90        143,156
Automotive & Industrial      73.06         65.50        168,567
Average / Total              80.31         74.26      1,578,541


Product Feature Extraction
• Low precision (69%) for identity resolution without product feature extraction
  – Used later as a baseline for identity resolution
• We developed the Free Text Preprocessor
  – Makes the data more structured by extracting new property-value pairs from free-text properties
  – https://www.assembla.com/spaces/silk/wiki/Silk_Free_Text_Preprocessor

Free Text Preprocessor by Example
Input (product name): "Apple iPod nano (8 GB, 6th generation, Graphite)"
Input (product description): "Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g. Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch"
Extracted values: "Apple", "iPod nano", "8GB", "1.5-inch"
(Figure: Free Text Preprocessor specification)

Extractors – Bag-of-words
• Learning
  – Creating a list of words for every feature in the training set
  – Brand: Samsung Benq Apple Canon …
  – Storage: 64 GB megabytes 512GB …
  – Display: 42-inch 3.5-inches Inches 15.24cm …
• Extraction
  – Matching tokens against the learned lists
• Pros
  – Good for extracting nominal and numerical (with units of measurement) attributes
• Cons
  – Bad for extracting multi-token values
  – Inconclusive for values that refer to more than one feature
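A bag-of-words extractor of this kind can be sketched as follows. The function names `learn_word_lists` and `extract`, the tokenization rule, and the two training records are hypothetical. The Free Text Preprocessor itself may work differently.

```python
import re
from collections import defaultdict

def learn_word_lists(training_records):
    """Build one set of lower-cased tokens per feature from structured records."""
    lists = defaultdict(set)
    for record in training_records:
        for feature, value in record.items():
            lists[feature].update(value.lower().split())
    return lists

def extract(text, word_lists):
    """Match each token of the free text against the learned lists."""
    found = defaultdict(list)
    for raw in re.findall(r"[\w.\-]+", text.lower()):
        token = raw.strip(".-")          # drop sentence punctuation
        for feature, words in word_lists.items():
            if token in words:
                found[feature].append(token)
    return dict(found)

# Hypothetical structured training records.
training = [
    {"Brand": "Apple", "Storage": "8GB", "Display": "1.5-inch"},
    {"Brand": "Samsung", "Storage": "16GB", "Display": "4.3-inch"},
]
word_lists = learn_word_lists(training)
print(extract("Apple iPod nano 8GB with 1.5-inch display", word_lists))
```

The multi-token weakness shows up directly: in this sketch the text "8 GB" is split into the tokens "8" and "gb", neither of which is in the learned Storage list, so the value is missed.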

Extractors – Feature-Value Pairs
• Learns feature-value pairs from the structured data
• Extraction
  – Tagging: taking n-grams of up to 4 tokens and matching them against the values from the training set
  – Parsing: taking the combination of feature-value pairs that best describes an object from the training dataset
• Pros
  – Extracting multi-token values
• Cons
  – Inconclusive for values that refer to more than one feature
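The tagging step can be illustrated with a short sketch. It matches n-grams of up to 4 tokens against known values and, as a simplifying assumption, resolves overlaps greedily by preferring longer matches. This greedy rule stands in for the parsing step, which the slide describes only at a high level. The `value_index` entries are hypothetical.

```python
def ngrams(tokens, max_n=4):
    """Yield (start, end, text) for every n-gram with 1 <= n <= max_n."""
    for n in range(max_n, 0, -1):              # longest n-grams first
        for i in range(len(tokens) - n + 1):
            yield i, i + n, " ".join(tokens[i:i + n])

def tag(text, value_index, max_n=4):
    """Tag non-overlapping n-grams that match a known training value."""
    tokens = text.lower().split()
    used = [False] * len(tokens)
    pairs = []
    for start, end, gram in ngrams(tokens, max_n):
        if gram in value_index and not any(used[start:end]):
            pairs.append((value_index[gram], gram))
            used[start:end] = [True] * (end - start)
    return pairs

# Hypothetical feature-value pairs learned from structured training data.
value_index = {
    "ipod nano": "Model",
    "galaxy tab 7.7": "Model",
    "apple": "Brand",
    "8 gb": "Storage",
}
print(tag("Apple iPod nano 8 GB Graphite", value_index))
```

Because `value_index` maps each value to exactly one feature, this sketch sidesteps the ambiguity named under Cons. A value that occurs under several features in the training data would need an extra disambiguation rule.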

Extractors – Manual Configuration
• Manually configure features and extraction methods
  1. Regular expressions
     – E.g. Processor: \d*\.?\d+GHz
  2. Dictionary search
     – E.g. dictionary of brands (Samsung, Panasonic, Lenovo, Apple)
• Pros
  – Extraction process can be fine-tuned according to the data
  – Good solution when no training (structured) data are available
• Cons
  – Needs domain knowledge
  – Non-trivial to efficiently pick extraction methods manually
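Both manual methods fit in a few lines. The regular expression and the brand dictionary below are the ones given on the slide. The function `extract_manual` and its tokenization are assumptions made for this sketch.

```python
import re

# Regular expression from the slide for the Processor feature.
PROCESSOR_RE = re.compile(r"\d*\.?\d+GHz")
# Brand dictionary from the slide, lower-cased for matching.
BRANDS = {"samsung", "panasonic", "lenovo", "apple"}

def extract_manual(text):
    """Apply the regular expression and the dictionary lookup to free text."""
    features = {}
    match = PROCESSOR_RE.search(text)
    if match:
        features["Processor"] = match.group(0)
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in BRANDS:
            features["Brand"] = token   # keep the first brand found
            break
    return features

print(extract_manual("Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB"))
```

The expression requires the unit to follow the number directly, so a description that writes "1.6 GHz" with a space is not matched. This is the kind of data-specific tuning the Pros and Cons refer to.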

Extraction Experiments
• Dataset for extraction: 5,000 electronics products from WDC
• Training dataset (structured data): Amazon dataset covering 20 electronics products

Extraction Accuracy
• Extraction using the Combination configuration (bag-of-words for Brand, Storage and Display; feature-value pairs for Model and Dimension; custom regular expression for the Processor)

Product         Brand   Model   Storage   Display   Processor   Dimension
iPod Nano       .92     .98     .86       .49       .12         .78
Galaxy SII      .72     .87     .89       .81       .40         .91
GalaxyTab 7.7   .80     .92     .89       .85       .72         .93
Ixus 120IS      1       .96     N/A       .89       N/A         .56
Vaio VPC        .99     .65     .81       .77       .73         .32
Viera 42        .95     .72     N/A       .82       N/A         .64
Sandisk         1       1       .85       N/A       N/A         .31


Identity Resolution
• We used Silk, a tool for discovering relationships between data items within different linked data sources
  – Provides an expressive language for defining linkage rules
  – Uses genetic programming to learn linkage rules
  – Has shown high performance on various datasets
  – https://www.assembla.com/spaces/silk/wiki/Home

Identity Resolution Experiments
• Gold standard: 5,000 manually annotated links
  – 2,500 positive / 2,500 negative
  – 20 electronics products, Amazon dataset (reference set)
• Experiments on 5 configurations
  – Baseline (no feature extraction step)
  – Bag-of-words
  – Feature-value pairs
  – Manual configuration
  – Combinations

Silk Output: Learned Linkage Rule
(Figure: learned linkage rule, drawn as an operator tree with the following nodes)
• Aggregations: average, max
• Comparisons: Levenshtein (threshold = 1.134), Jaccard (threshold = 0.23), Jaccard (threshold = 0.02)
• Transformations: lowerCase (×2), tokenize (×2)
• Properties compared: wdc:Model / amazon:Model, wdc:Display / amazon:Display, wdc:Storage / amazon:Storage
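A linkage rule built from these operators could look roughly like the sketch below. The nesting of the operators, the assignment of transformations to properties, and the `score` function that turns a distance and a threshold into a similarity are all assumptions. The slide text does not preserve the tree structure, and this is not Silk's actual scoring code. Only the operator names and threshold values are taken from the slide.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard_distance(a_values, b_values):
    """1 - |A intersect B| / |A union B| over two collections of values."""
    a, b = set(a_values), set(b_values)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def score(distance, threshold):
    """Assumed scoring: 1.0 at distance 0, 0.0 at the threshold, floor at -1.0."""
    return max(-1.0, 1.0 - distance / threshold)

def link_score(wdc, amazon):
    """Hypothetical nesting: average(model, max(display, storage))."""
    # lowerCase transform on both Model values, then Levenshtein.
    model = score(levenshtein(wdc["Model"].lower(), amazon["Model"].lower()), 1.134)
    # tokenize transform on both Display values, then Jaccard.
    display = score(jaccard_distance(wdc["Display"].split(),
                                     amazon["Display"].split()), 0.23)
    # Storage values compared untransformed, each as a one-element set.
    storage = score(jaccard_distance([wdc["Storage"]], [amazon["Storage"]]), 0.02)
    return (model + max(display, storage)) / 2

# Hypothetical records from the two sources.
wdc = {"Model": "iPod Nano", "Display": "1.5-inch", "Storage": "8GB"}
amazon = {"Model": "ipod nano", "Display": "1.5-inch", "Storage": "8 GB"}
print(link_score(wdc, amazon))
```

In this example the Storage values disagree ("8GB" against "8 GB"), but the max aggregation lets the matching Display value carry that branch of the rule.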

Identity Resolution Results

Configuration                   Precision %   Recall %   F-Measure %
Baseline                        69            90         78.1
Bag-of-words                    75            82         77.9
Feature-value pairs             80            77         78.4
Custom (manual configuration)   82            80         80.9
Combination                     85            80         82.4
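The F-measure column combines precision and recall as their harmonic mean. The helper below shows the calculation on the precision and recall values from the table. It assumes the balanced F1 variant, which the slide does not state explicitly.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall (in percent) as reported in the results table.
results = {
    "Baseline": (69, 90),
    "Bag-of-words": (75, 82),
    "Feature-value pairs": (80, 77),
    "Custom": (82, 80),
    "Combination": (85, 80),
}
for name, (p, r) in results.items():
    print(f"{name}: {f_measure(p, r):.1f}")
```

Recomputed values can differ from the table in the last digit, since the reported precision and recall are already rounded to whole percentages.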


Data Fusion
• Input: clusters of products after identity resolution
• Properties worth fusing/combining
  – AggregateRating and Review
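As a rough illustration of this step, the sketch below groups offers by the product cluster assigned during identity resolution and collects their reviews and ratings. The record layout, the identifiers and the function name `fuse` are hypothetical. The slides do not describe how conflicting values are resolved, so none of that is modelled here.

```python
from collections import defaultdict

def fuse(offers, cluster_of):
    """Group offers by resolved product and collect their reviews and ratings."""
    fused = defaultdict(lambda: {"offers": 0, "reviews": [], "ratings": []})
    for offer in offers:
        product = cluster_of[offer["id"]]
        entry = fused[product]
        entry["offers"] += 1
        entry["reviews"].extend(offer.get("reviews", []))
        if "rating" in offer:
            entry["ratings"].append(offer["rating"])
    return dict(fused)

# Hypothetical offers and cluster assignments from identity resolution.
offers = [
    {"id": "shop-a/1", "reviews": ["Great sound"], "rating": 4.5},
    {"id": "shop-b/7", "reviews": ["Battery could be better", "Nice colour"]},
    {"id": "shop-c/3", "rating": 4.0},
]
cluster_of = {
    "shop-a/1": "iPod Nano 8GB",
    "shop-b/7": "iPod Nano 8GB",
    "shop-c/3": "iPhone 4 16GB",
}

for product, entry in fuse(offers, cluster_of).items():
    print(product, entry["offers"], len(entry["reviews"]), len(entry["ratings"]))
```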

Fusion Results

Product                     Offers   Reviews   Ratings
iPod Nano 8GB               829      84        0
iPhone 4 16GB               624      35        52
Sony Ericsson Xperia Mini   450      31        12
iPad 16GB                   423      40        48
Motorola XOOM 32GB          270      12        0
Samsung Galaxy SII          142      8         0

Conclusions
• By using Microdata, thousands of websites help us to understand their content
• We have implemented a 5-step data integration pipeline
  – From Microdata markup to an integrated dataset
• A newly introduced feature extraction step is crucial for the precision of data integration
  – Identity resolution precision increases from 69% to 85%
• Future work
  – Automatically learning regular expressions
  – Automatically discovering combinations of extractors

Questions?
