Web graphs from web archives: seen from different angles
Valérie Beaudouin (Télécom ParisTech)
Zeynep Pehlivan (Télécom ParisTech)
Peter Stirling (Bibliothèque nationale de France)
IIPC Web Archiving Conference, 14th April 2016
MAPPING THE WWI WEB
IIPC Web Archiving Conference 2016
Using web archives to study the First World War
• Research project: “The future of digitised heritage online: the example of the Great War”, which aimed to examine the circulation of images online
– Partnership between the BnF, Télécom ParisTech and BDIC
– Project 2013-2016, financed by Labex “Pasts in the Present”
• Create a map of websites related to the war and study the place of an important discussion forum
• Use of data mining approaches on the BnF web archives
– How is the corpus defined and analysed? What data, metadata and tools are needed? What organisational structure is needed?
– Sites selected for the BnF crawl on the centenary of WWI
BUILDING A COLLECTION
“The Great War on the Web”
• BnF collection on the centenary of WWI
• Collection policy and selection in cooperation between BnF librarians and partners
– Sites about WWI and/or its commemoration
– Relevance, topicality, singularity, originality, navigability
• Classification by type of site
– Official: international, European, national, territorial
– Public: scientific, heritage, education
– Associations (including military)
– Personal (including military)
– Media
Organisation of the crawl
• Use of a shared selection tool (BCWeb)
– Seed URL
– Crawl settings
– Theme
– Keywords
• Two/three crawls per year from 2013 to 2019
– November 2013, March 2014, August 2014, November 2014, April 2015, October 2015, February 2016
CARTOGRAPHY: METHODOLOGY
From collections to metadata
• Use of metadata to create web graphs: choice of WAT format
– Developed by Internet Archive
– https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification
• WAT files contain information extracted from WARC records, stored in JSON
– All links from HTML pages, with the kind of link and associated text
– Title, format, meta tags…
– No content, images, etc.
• Also need other information
– Information from BCWeb (seed URL, classification)
– Dates of crawls and seeds included in each crawl
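The link extraction described on this slide could look roughly like the sketch below. It is not the project's code. The JSON field names (`Envelope`, `WARC-Header-Metadata`, `WARC-Target-URI`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`, `path`, `url`, `text`) follow the WAT layout as commonly produced by Internet Archive tools and should be checked against the specification linked above; the sample record and the function name `extract_links` are made up for illustration.

```python
import json

def extract_links(wat_record: dict) -> list[tuple[str, str, str, str]]:
    """Return (source_url, target_url, link_kind, anchor_text) for one WAT record."""
    envelope = wat_record.get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links = []
    for link in html_meta.get("Links", []):
        target = link.get("url")
        if target:  # skip entries that carry no URL
            links.append((source, target, link.get("path", ""), link.get("text", "")))
    return links

# Made-up record, reduced to the fields used above (assumed layout).
SAMPLE = json.loads("""
{"Envelope": {
  "WARC-Header-Metadata": {"WARC-Target-URI": "http://www.lemonde.fr/centenaire-14-18/1.html"},
  "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
    {"path": "A@/href", "url": "http://centenaire.org/", "text": "Centenaire"},
    {"path": "IMG@/src", "url": "/logo.png"}
  ]}}}
}}
""")

sample_links = extract_links(SAMPLE)
```

Each tuple keeps the kind of link (`path`) and the anchor text alongside the source and target, which is the information the slide lists as available in WAT files.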
Cartography: method
• Corpus (input):
– BnF Web Archives / “Great War on the Web”, sites selected by librarians and partners
– Classification of sites by type of producer
– 4 crawls between November 2013 and November 2014
– August 2014: 482 seed URLs, 7 million URLs collected
• Processing:
– Extraction of metadata (links between documents)
– Definition and classification of links and nodes
• Cartography (output)
Framework
WAT + BCWeb → Data extraction → Database → Graph generation → Visualisation
• Extraction: Python
• Database: Hadoop / PigLatin
• Graph generated in GEXF format
• Visualisation using Gephi
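As an illustration of the graph generation step, the sketch below writes a small directed graph as GEXF 1.2 using only the Python standard library. It is an assumption about how such a file could be produced, not the project's pipeline; the function name `build_gexf` and the link weights are invented for the example.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"

def build_gexf(edges: dict[tuple[str, str], int]) -> str:
    """Serialise {(source_node, target_node): weight} as a GEXF 1.2 document."""
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"mode": "static", "defaultedgetype": "directed"})
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    node_ids: dict[str, str] = {}
    for source, target in edges:
        for label in (source, target):
            if label not in node_ids:  # one <node> per distinct label
                node_ids[label] = str(len(node_ids))
                ET.SubElement(nodes_el, "node", {"id": node_ids[label], "label": label})

    for edge_id, ((source, target), weight) in enumerate(edges.items()):
        ET.SubElement(edges_el, "edge", {
            "id": str(edge_id),
            "source": node_ids[source],
            "target": node_ids[target],
            "weight": str(weight),
        })
    return ET.tostring(gexf, encoding="unicode")

# Illustrative weights only; they are not figures from the crawl.
gexf_xml = build_gexf({
    ("lemonde.fr/centenaire-14-18/", "centenaire.org"): 3,
    ("centenaire.org", "gallica.bnf.fr"): 1,
})
```

Saving `gexf_xml` to a `.gexf` file should give something Gephi can open directly.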
Methodology: definition of nodes
• The graph has to represent the heterogeneous mass of data produced by the crawl
– Difficulty of defining a “site” within the archives
– Differences in crawl depth
– “Out of topic” content introduced by the crawl process and outgoing links
• The strategy used for defining nodes in the graphs has an impact on their interpretation
• Need to define strategies in three areas:
– Aggregation of nodes
– Filtering of content using keywords
– Use of only the collected URLs (the corpus), or inclusion of outgoing links
Aggregation of nodes
• Two approaches were used to aggregate URLs
• Group URLs according to host
– http://www.lemonde.fr/centenaire-14-18/1.html
– Node => lemonde.fr
• Group URLs according to seed URL in BCWeb, or failing that by host
– http://www.lemonde.fr/centenaire-14-18/1.html
– Node => lemonde.fr/centenaire-14-18/
– http://www.lemonde.fr/planete/
– Node => lemonde.fr
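The two aggregation rules can be sketched as follows. This is a minimal reading of the slide, assuming that "host" means the host name without a leading `www.` (as in the lemonde.fr example) and that a URL belongs to the longest seed URL it starts with; the helper names are hypothetical.

```python
from urllib.parse import urlsplit

def normalise(url: str) -> str:
    """Drop the scheme and a leading 'www.' so URLs compare as host + path."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path

def node_by_host(url: str) -> str:
    """First approach: every URL is aggregated under its host."""
    return normalise(url).split("/", 1)[0]

def node_by_seed(url: str, seeds: list[str]) -> str:
    """Second approach: aggregate under the matching seed URL, else fall back to the host."""
    target = normalise(url)
    matches = [seed for seed in map(normalise, seeds) if target.startswith(seed)]
    return max(matches, key=len) if matches else node_by_host(url)

SEEDS = ["http://www.lemonde.fr/centenaire-14-18/"]
page = "http://www.lemonde.fr/centenaire-14-18/1.html"

by_host = node_by_host(page)                                          # "lemonde.fr"
by_seed = node_by_seed(page, SEEDS)                                   # "lemonde.fr/centenaire-14-18/"
fallback = node_by_seed("http://www.lemonde.fr/planete/", SEEDS)      # "lemonde.fr"
```

Plain prefix matching is a simplification: a seed stored without its trailing slash could match unrelated paths that merely share the same beginning.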
Aggregation of nodes
[Figure: two graphs, one aggregated by host and one aggregated by seed URL]
Filtering by keyword
• Filters can be applied to URLs using a list of keywords, defined using the seed list
• Non filtered
– 252,399 nodes
– Very difficult to manipulate with Gephi on an average computer
– Difficult to interpret
• Filtered
– All content related to subject
– Graph easier to manipulate
– Risk of missing a lot of relevant content
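A keyword filter on URLs might be as simple as the sketch below. The keyword list is hypothetical (the slides do not give the actual list derived from the seeds), and case-insensitive substring matching is an assumption.

```python
def matches_keywords(url: str, keywords: list[str]) -> bool:
    """True if any keyword occurs in the URL (case-insensitive substring match)."""
    lowered = url.lower()
    return any(keyword.lower() in lowered for keyword in keywords)

# Hypothetical keywords; the real list was defined from the seed list.
KEYWORDS = ["14-18", "1418", "centenaire", "guerre"]

urls = [
    "http://www.lemonde.fr/centenaire-14-18/1.html",
    "http://www.lemonde.fr/planete/",
    "http://crid1418.org/",
]
kept = [url for url in urls if matches_keywords(url, KEYWORDS)]
```

The second URL is dropped because it contains none of the keywords, which also shows the risk noted above: a relevant page whose URL is not explicit about the subject would be lost in the same way.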
Filtering by keyword
[Figure: two graphs, one not filtered and one filtered]
Limit to corpus
• Choice between limiting the analysis to the corpus defined in BCWeb or including all URLs
– Study links only between the seed URLs selected in BCWeb
– Include links and nodes outside the corpus defined in BCWeb, from outgoing links in the files collected
• All URLs
– 252,399 nodes
– Very difficult to manipulate with Gephi on an average computer
– Difficult to interpret
• Corpus BCWeb
– 419 nodes
– Much easier to interpret
– Limiting to selection removes need to apply filters
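Restricting the graph to the BCWeb corpus amounts to keeping only the links whose two ends are both corpus nodes. A minimal sketch, with a toy corpus and a hypothetical out-of-corpus node (`facebook.com`):

```python
def limit_to_corpus(edges: list[tuple[str, str]], corpus: set[str]) -> list[tuple[str, str]]:
    """Keep only links whose source and target are both nodes of the corpus."""
    return [(source, target) for source, target in edges if source in corpus and target in corpus]

corpus_nodes = {"centenaire.org", "crid1418.org"}  # toy subset of seed nodes
all_edges = [
    ("centenaire.org", "crid1418.org"),
    ("centenaire.org", "facebook.com"),  # outgoing link that leaves the corpus
]
corpus_edges = limit_to_corpus(all_edges, corpus_nodes)
```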
Corpus BCWeb v. all URLs
[Figure: two graphs, one with all URLs and one with only the BCWeb corpus]
Cartography: Comparison of approaches

Aggregation   Filters        Corpus   # Nodes   # Links
Host          Filtered       BCWeb        456     3,356
Host          Filtered       All       15,148    27,968
Host          Not Filtered   BCWeb        483     6,603
Host          Not Filtered   All      252,207   521,414
Seed URL      Filtered       BCWeb        419     2,274
Seed URL      Filtered       All       15,310    28,910
Seed URL      Not Filtered   BCWeb        462     3,469
Seed URL      Not Filtered   All      252,399   525,460
Choice of methodology
• Each node is a Seed URL in BCWeb, rather than a host
– Avoids over-representation of sites that only partly concern WWI, such as social networks and media
– A link (incoming or outgoing): at least one connection between the two nodes
• Filtering by keyword is unnecessary when the analysis is based only on the corpus in BCWeb
• Choices:
– Stay within the corpus defined in BCWeb
– Use the Seed URL in BCWeb as aggregating node
– No filtering of URLs using keywords
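The link definition above (one link as soon as there is at least one connection between two nodes) can be sketched by collapsing URL-level links into a set of node pairs, then counting incoming and outgoing links per node. Dropping links between two URLs of the same node is an assumption, and the `.example` hosts are placeholders.

```python
from collections import Counter

def node_links(url_links: list[tuple[str, str]], node_of: dict[str, str]) -> set[tuple[str, str]]:
    """Collapse URL-level links into distinct node-to-node links.

    Any number of URL links between two nodes counts as one link.
    Links inside a single node are dropped (assumption).
    """
    return {
        (node_of[source], node_of[target])
        for source, target in url_links
        if source in node_of and target in node_of and node_of[source] != node_of[target]
    }

def link_counts(links: set[tuple[str, str]]) -> tuple[Counter, Counter]:
    """Return (incoming, outgoing) link counts per node."""
    outgoing = Counter(source for source, _ in links)
    incoming = Counter(target for _, target in links)
    return incoming, outgoing

# Placeholder mapping from collected URL to aggregating node.
NODE_OF = {
    "http://a.example/1": "a.example",
    "http://a.example/2": "a.example",
    "http://b.example/": "b.example",
    "http://c.example/": "c.example",
}
URL_LINKS = [
    ("http://a.example/1", "http://b.example/"),
    ("http://a.example/2", "http://b.example/"),   # same node pair, counted once
    ("http://a.example/1", "http://a.example/2"),  # internal link, dropped
    ("http://c.example/", "http://b.example/"),
]

links = node_links(URL_LINKS, NODE_OF)
incoming, outgoing = link_counts(links)
```

Thresholds such as "at least 30" links can then be applied to these counters before the graph is drawn.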
EXAMPLES OF RESULTS
Nov 2014 crawl: Incoming and outgoing links (at least 30)
[Bar chart: number of incoming links and outgoing links per site, axis from 0 to 350. Sites shown: cgma.wordpress.com, troupesdemarine.org, vlecalvez.free.fr, chtimiste.com, sourcesdelagrandeguerre.fr, 19emeri.canalblog.com, 74eri.canalblog.com, reims1418.wordpress.com, gallica.bnf.fr, historial.org, indre1418.canalblog.com, cheminsdememoire.gouv.fr, cndp.fr/crdp-reims, crid1418.org, combattant.14-18.pagesperso-orange.fr, memoiredeshommes.sga.defense.gouv.fr, verdun-meuse.fr, guerre1418.fr, centenaire.org, pages14-18.mesdiscussions.net]
Nov 2014 crawl: Incoming links (at least 20)
QUESTIONS?