Web graphs from web archives : seen from different angles

Valérie Beaudouin (Télécom ParisTech), Zeynep Pehlivan (Télécom ParisTech), Peter Stirling (Bibliothèque nationale de France)

IIPC Web Archiving Conference, 14th April 2016

MAPPING THE WWI WEB


Using web archives to study the First World War

• Research project: “The future of digitised heritage online: the example of the Great War”, which aimed to examine the circulation of images online
  – Partnership between BnF, Télécom ParisTech and BDIC
  – Project ran from 2013 to 2016, financed by Labex “Pasts in the Present”
• Create a map of websites related to the war and study the place of an important discussion forum
• Use of data mining approaches on the BnF web archives
  – How is the corpus defined and analysed? What data, metadata and tools are needed? What organisational structure is needed?
  – Sites selected for the BnF crawl on the centenary of WWI


BUILDING A COLLECTION


“The Great War on the Web”

• BnF collection on the centenary of WWI
• Collection policy and selection in cooperation between BnF librarians and partners
  – Sites about WWI and/or its commemoration
  – Relevance, topicality, singularity, originality, navigability
• Classification by type of site
  – Official: international, European, national, territorial
  – Public: scientific, heritage, education
  – Associations (including military)
  – Personal (including military)
  – Media


Organisation of the crawl

• Use of a shared selection tool (BCWeb)
  – Seed URL
  – Crawl settings
  – Theme
  – Keywords
• Two/three crawls per year from 2013 to 2019
  – November 2013, March 2014, August 2014, November 2014, April 2015, October 2015, February 2016


CARTOGRAPHY: METHODOLOGY


From collections to metadata

• Use of metadata to create web graphs
  – Choice of WAT format
  – Developed by Internet Archive
  – https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification
• WAT files contain information extracted from WARC records, stored in JSON
  – All links from HTML pages, with the kind of link and associated text
  – Title, format, meta tags…
  – No content, images, etc.
• Also need other information
  – Information from BCWeb (seed URL, classification)
  – Dates of crawls and seeds included in each crawl
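To make the link metadata concrete, here is a minimal Python sketch of pulling links out of one WAT record. The JSON field names (`Envelope`, `WARC-Header-Metadata`, `Payload-Metadata`, `HTML-Metadata`, `Links`, `path`, `url`, `text`) reflect our reading of the WAT layout and should be checked against the specification linked above. The function name `extract_links` and the sample record are illustrative assumptions, not the project's code.

```python
import json


def extract_links(wat_payload: str) -> list[tuple[str, str, str, str]]:
    """Return (source_url, target_url, link_kind, anchor_text) for one WAT record.

    Assumes the JSON payload of a single WAT record; missing fields yield
    empty strings rather than errors.
    """
    envelope = json.loads(wat_payload).get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links = []
    for link in html_meta.get("Links", []):
        if "url" in link:  # skip entries that carry no target
            links.append(
                (source, link["url"], link.get("path", ""), link.get("text", ""))
            )
    return links


# Made-up record for illustration only.
sample = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Target-URI": "http://www.lemonde.fr/centenaire-14-18/1.html"
        },
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
            "Head": {"Title": "Example page"},
            "Links": [
                {"path": "A@/href", "url": "http://gallica.bnf.fr/", "text": "Gallica"},
                {"path": "IMG@/src", "url": "/img/logo.png"},
            ],
        }}},
    }
})
links = extract_links(sample)
```

The `path` value carries the kind of link (anchor, image, and so on), which is what allows links to be classified later in the processing.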


Cartography: method

• Corpus (input):
  – BnF Web Archives / “Great War on the Web”, sites selected by librarians and partners
  – Classification of sites by type of producer
  – 4 crawls between November 2013 and November 2014
  – August 2014: 482 seed URLs, 7 million URLs collected
• Processing:
  – Extraction of metadata (links between documents)
  – Definition and classification of links and nodes
• Cartography (output)


Framework

[Diagram: WAT + BCWeb → Data extraction → Database → Graph generation → Visualisation]

• Extraction: Python
• Database: Hadoop / PigLatin
• Graph generated in GEXF format
• Visualisation using Gephi
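As an illustration of the graph generation step, the sketch below writes a small directed graph as GEXF using only the Python standard library. It is not the project's code: the function name `build_gexf`, the edge weights and the output file name are assumptions, and only the basic GEXF 1.2 elements (`nodes`, `edges`) are produced.

```python
import xml.etree.ElementTree as ET


def build_gexf(edges: list[tuple[str, str, int]]) -> ET.ElementTree:
    """Build a minimal directed GEXF 1.2 document from (source, target, weight) edges."""
    gexf = ET.Element(
        "gexf", {"xmlns": "http://www.gexf.net/1.2draft", "version": "1.2"}
    )
    graph = ET.SubElement(
        gexf, "graph", {"mode": "static", "defaultedgetype": "directed"}
    )
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    # Every endpoint of an edge becomes a node; the node id doubles as its label.
    node_ids = sorted({name for s, t, _ in edges for name in (s, t)})
    for name in node_ids:
        ET.SubElement(nodes_el, "node", {"id": name, "label": name})

    for i, (source, target, weight) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(i), "source": source, "target": target, "weight": str(weight)},
        )
    return ET.ElementTree(gexf)


# Site names come from the corpus; the weights are invented for the example.
tree = build_gexf([
    ("centenaire.org", "gallica.bnf.fr", 3),
    ("chtimiste.com", "gallica.bnf.fr", 1),
])
xml_text = ET.tostring(tree.getroot(), encoding="unicode")
# tree.write("graph.gexf", encoding="utf-8", xml_declaration=True)  # file for Gephi
```

A file written this way can then be opened in Gephi for layout and visual analysis.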


Methodology: definition of nodes

• The graph has to represent the heterogeneous mass of data produced by the crawl
  – Difficulty of defining a “site” within the archives
  – Differences in crawl depth
  – “Off-topic” content introduced by the crawl process and outgoing links
• The strategy used for defining nodes in the graphs has an impact on their interpretation
• Need to define strategies in three areas:
  – Aggregation of nodes
  – Filtering of content using keywords
  – Use only collected URLs (the corpus) or include outgoing links


Aggregation of nodes

• Two approaches were used to aggregate URLs
• Group URLs according to host
  – http://www.lemonde.fr/centenaire-14-18/1.html => node lemonde.fr
• Group URLs according to seed URL in BCWeb, or failing that by host
  – http://www.lemonde.fr/centenaire-14-18/1.html => node lemonde.fr/centenaire-14-18/
  – http://www.lemonde.fr/planete/ => node lemonde.fr
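The two aggregation rules can be sketched in Python as follows. This is a hedged illustration, not the project's implementation: the helper names (`normalise`, `node_by_host`, `node_by_seed`) are hypothetical, seeds are assumed to be stored without scheme or `www.` and with a trailing slash, and the longest matching seed is assumed to win.

```python
from urllib.parse import urlsplit


def normalise(url: str) -> str:
    """Drop the scheme and a leading 'www.' so that URLs compare like node names."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path


def node_by_host(url: str) -> str:
    """First approach: every URL is attached to its host."""
    return normalise(url).split("/", 1)[0]


def node_by_seed(url: str, seeds: list[str]) -> str:
    """Second approach: attach the URL to the longest matching seed, else to its host."""
    target = normalise(url)
    matches = [seed for seed in seeds if target.startswith(seed)]
    return max(matches, key=len) if matches else node_by_host(url)


SEEDS = ["lemonde.fr/centenaire-14-18/"]  # seed list as it might come from BCWeb

by_host = node_by_host("http://www.lemonde.fr/centenaire-14-18/1.html")
by_seed = node_by_seed("http://www.lemonde.fr/centenaire-14-18/1.html", SEEDS)
fallback = node_by_seed("http://www.lemonde.fr/planete/", SEEDS)
```

With the seed list above, the three calls reproduce the examples on the slide: `lemonde.fr`, `lemonde.fr/centenaire-14-18/` and `lemonde.fr` respectively.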


Aggregation of nodes

[Figure: graphs obtained with aggregation by host and by seed URL]


Filtering by keyword

• Filters can be applied to URLs using a list of keywords, defined using the seed list
• Not filtered
  – 252,399 nodes
  – Very difficult to manipulate with Gephi on an average computer
  – Difficult to interpret
• Filtered
  – All content related to subject
  – Graph easier to manipulate
  – Risk of missing a lot of relevant content
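A keyword filter of this kind can be sketched in a few lines of Python. The keyword list below is hypothetical (the actual list derived from the seed list is not given in the slides), and matching is assumed to be a case-insensitive substring test on the URL.

```python
def matches_keywords(url: str, keywords: list[str]) -> bool:
    """True if any keyword occurs in the URL (case-insensitive substring match)."""
    lowered = url.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


KEYWORDS = ["14-18", "1418", "centenaire", "guerre"]  # hypothetical keyword list

urls = [
    "http://www.lemonde.fr/centenaire-14-18/1.html",
    "http://www.lemonde.fr/planete/",
    "http://www.historial.org/",
]
filtered = [url for url in urls if matches_keywords(url, KEYWORDS)]
```

In this example the off-topic `planete` page is dropped as intended, but so is `historial.org`, a site from the corpus whose URL contains none of the keywords. This is the risk of missing relevant content noted above.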


Filtering by keyword

[Figure: graphs obtained without and with keyword filtering]


Limit to corpus

• Choice between limiting the analysis to the corpus defined in BCWeb or including all URLs
  – Study links only between sites selected as seed URLs in BCWeb
  – Include links and nodes outside the corpus defined in BCWeb, from outgoing links in the files collected
• All URLs
  – 252,399 nodes
  – Very difficult to manipulate with Gephi on an average computer
  – Difficult to interpret
• Corpus BCWeb
  – 419 nodes
  – Much easier to interpret
  – Limiting to the selection removes the need to apply filters
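Restricting the graph to the corpus amounts to keeping only the links whose two endpoints are both nodes selected in BCWeb. A minimal sketch, with a hypothetical function name and a toy edge list:

```python
def restrict_to_corpus(
    edges: list[tuple[str, str]], corpus: set[str]
) -> list[tuple[str, str]]:
    """Keep only the edges whose source and target are both corpus nodes."""
    return [(s, t) for s, t in edges if s in corpus and t in corpus]


corpus = {"centenaire.org", "gallica.bnf.fr", "chtimiste.com"}
edges = [
    ("centenaire.org", "gallica.bnf.fr"),
    ("chtimiste.com", "facebook.com"),   # target outside the corpus: dropped
    ("twitter.com", "centenaire.org"),   # source outside the corpus: dropped
]
corpus_edges = restrict_to_corpus(edges, corpus)
```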


Corpus BCWeb v. all URLs

[Figure: graphs obtained with all URLs and with the BCWeb corpus only]


Cartography: Comparison of approaches

Aggregation   Filters        Corpus   # Nodes   # Links
Host          Filtered       BCWeb        456     3,356
Host          Filtered       All       15,148    27,968
Host          Not filtered   BCWeb        483     6,603
Host          Not filtered   All      252,207   521,414
Seed URL      Filtered       BCWeb        419     2,274
Seed URL      Filtered       All       15,310    28,910
Seed URL      Not filtered   BCWeb        462     3,469
Seed URL      Not filtered   All      252,399   525,460


Choice of methodology

• Each node is a seed URL in BCWeb, rather than a host
  – Avoids over-representation of sites that only partly concern WWI, such as social networks and media
  – A link (incoming or outgoing) between two nodes means there is at least one connection between them
• Filtering by keyword is unnecessary when the analysis is based only on the corpus in BCWeb
• Choices:
  – Stay within the corpus defined in BCWeb
  – Use the seed URL in BCWeb as the aggregating node
  – No filtering of URLs using keywords
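The rule that a link means at least one connection between two nodes can be sketched as follows: URL-level links are collapsed into unique directed node pairs, and incoming and outgoing links are then counted per node. The names are hypothetical, the toy data are invented, and dropping links internal to a node is our assumption rather than something stated in the slides.

```python
from collections import Counter
from typing import Callable, Iterable


def node_edges(
    url_links: Iterable[tuple[str, str]], to_node: Callable[[str], str]
) -> set[tuple[str, str]]:
    """Collapse URL-level links into unique directed (source_node, target_node) pairs."""
    edges = set()
    for src_url, dst_url in url_links:
        src, dst = to_node(src_url), to_node(dst_url)
        if src != dst:  # assumption: links inside one node are ignored
            edges.add((src, dst))
    return edges


def degrees(edges: set[tuple[str, str]]) -> tuple[Counter, Counter]:
    """Return (incoming, outgoing) link counts per node."""
    incoming = Counter(target for _, target in edges)
    outgoing = Counter(source for source, _ in edges)
    return incoming, outgoing


# Toy mapping from URL to node; in practice this would be the seed-based aggregation.
mapping = {"a.org/1": "a.org", "a.org/2": "a.org", "b.org/x": "b.org", "c.org/": "c.org"}
url_links = [
    ("a.org/1", "b.org/x"),
    ("a.org/2", "b.org/x"),  # same node pair as above: counted once
    ("a.org/1", "a.org/2"),  # internal link: ignored
    ("c.org/", "b.org/x"),
]
edges = node_edges(url_links, mapping.__getitem__)
incoming, outgoing = degrees(edges)
```

Nodes can then be selected by a threshold on these counts, in the same spirit as the “at least 30” and “at least 20” cut-offs used in the result charts.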


EXAMPLES OF RESULTS


Nov 2014 crawl: Incoming and outgoing links (at least 30)

[Bar chart: number of incoming links (“Liens entrants”) and outgoing links (“Liens sortants”) per site, on a scale from 0 to 350. Sites shown: cgma.wordpress.com, troupesdemarine.org, vlecalvez.free.fr, chtimiste.com, sourcesdelagrandeguerre.fr, 19emeri.canalblog.com, 74eri.canalblog.com, reims1418.wordpress.com, gallica.bnf.fr, historial.org, indre1418.canalblog.com, cheminsdememoire.gouv.fr, cndp.fr/crdp-reims, crid1418.org, combattant.14-18.pagesperso-orange.fr, memoiredeshommes.sga.defense.gouv.fr, verdun-meuse.fr, guerre1418.fr, centenaire.org, pages14-18.mesdiscussions.net]

Nov 2014 crawl: Incoming links (at least 20)


QUESTIONS?
