CERN (Open) Data Services

Tibor Šimko · @tiborsimko

GL17 · 30 November 2015 · Amsterdam, The Netherlands


Invenio


What is Invenio?

digital library and document repository software
– mature platform: first public release in 2002
– rich data: articles, books, notes, photos, videos, data, code

originated in high-energy physics
– institutional repository: CERN Document Server
– integrated library system: CERN Library
– disciplinary repository: INSPIRE

nowadays co-developed by an international collaboration

participating in and collaborating with several EU projects


Invenio applications

institutional repositories & ILS

subject-based repositories

long-tail of science

world-wide installations


“Small data”


CERN Document Server


cds.cern.ch


“Related data files”


INSPIRE


inspirehep.net


“Data behind plots”


HEPDATA


“Interactive data behind plots”


Zenodo


zenodo.org


“Code you can cite”

https://guides.github.com/activities/citable-code


“Data ↔ Code ↔ Paper”

data (DATAVERSE) ↔ code (ZENODO) ↔ paper (arXiv)

arXiv:1401.0080 · hep-ex/0011057

“Big data”


CERN LHC Experiments


Large Scale Solutions

Primary site: 100k cores (10k nodes), 100k disks (50 PB), 21k NICs
Grid: 13 Tier-1 sites, 155 Tier-2 sites, 10 Gbps optical fibre

LHC Data Pyramid

∼GB/analysis

∼TB/analysis

↑ analysis

∼PB/year



∼GB/sec
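
To relate the bottom two levels of the pyramid, here is a back-of-the-envelope conversion from a sustained rate to a yearly volume. The inputs are assumptions, not figures from the slide: exactly 1 GB/s, decimal units, and a hypothetical `duty_cycle` parameter for the fraction of the year spent taking data.

```python
# Rough conversion from a sustained data rate to a yearly data volume.
# Assumptions (not from the slide): decimal units, 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 s


def yearly_volume_pb(rate_gb_per_s: float, duty_cycle: float = 1.0) -> float:
    """Return the yearly volume in PB (1 PB = 1e6 GB) for a given rate in GB/s.

    duty_cycle is the hypothetical fraction of the year spent recording.
    """
    gigabytes = rate_gb_per_s * SECONDS_PER_YEAR * duty_cycle
    return gigabytes / 1e6


if __name__ == "__main__":
    print(yearly_volume_pb(1.0))       # continuous recording
    print(yearly_volume_pb(1.0, 0.3))  # hypothetical 30% duty cycle
```

Under these assumptions, a rate of order GB/sec gives tens of PB per year, which matches the orders of magnitude shown on the slide.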


CERN Analysis Preservation


Preserve an analysis?


System architecture

Information sources: Analysis TWiki, CADI, SVN, CDS, GitHub, SharePoint, INSPIRE, ...

analysis-preservation.cern.ch

file storage abstraction layer

Storage back ends: AFS, Box, Ceph, CASTOR, Drive, EOS, S3
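
The "file storage abstraction layer" on the architecture slide can be illustrated with a minimal sketch. The slide does not show the actual interface, so the class and method names (`Storage`, `put`, `get`, `MemoryStorage`, `LocalDirStorage`, `archive`) are hypothetical. Real back ends such as EOS, Ceph or S3 would need their own client code.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Storage(ABC):
    """Hypothetical common interface that every storage back end implements."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store data under key."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the data stored under key."""


class MemoryStorage(Storage):
    """Keeps everything in a dict; handy for tests."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class LocalDirStorage(Storage):
    """Stores each key as one file directly under a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root
        root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Simplification: flatten slashes so each key maps to a single file.
        return self._root / key.replace("/", "_")

    def put(self, key: str, data: bytes) -> None:
        self._path(key).write_bytes(data)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()


def archive(storage: Storage, key: str, data: bytes) -> bytes:
    """Caller code depends only on the interface, not on the back end."""
    storage.put(key, data)
    return storage.get(key)


if __name__ == "__main__":
    print(archive(MemoryStorage(), "analysis/ntuple.root", b"demo"))
    with tempfile.TemporaryDirectory() as tmp:
        print(archive(LocalDirStorage(Path(tmp)), "analysis/ntuple.root", b"demo"))
```

The point of the design is that `archive` never changes when a new back end is added.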

Pilot project


Example: LHCb analysis


Example: CMS statistics


Knowledge capture


Knowledge representation

prototype: extended MARC21
– “technical” metadata: beyond bytes
  e.g. 256 computer file characteristics
  $a characteristics  $e events  $t text  $b bytes  $f files ...
– “knowledge” metadata: semantics
  e.g. 505 formatted contents note, CSV column information
  $t title  $g miscellaneous

internal format: JSON

[Diagram labels: MARC21, JSON, JSON Schema, EAD]
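
A minimal sketch of how an extended MARC21 field could map onto the JSON internal format. The subfield codes for field 256 come from the slide. The JSON key names, the helper `marc_field_to_json` and the sample values are hypothetical and only illustrate the idea.

```python
import json

# Subfield codes of field 256 as listed on the slide.
# The readable key names on the right are an assumption.
SUBFIELD_NAMES_256 = {
    "a": "characteristics",
    "e": "events",
    "t": "text",
    "b": "bytes",
    "f": "files",
}


def marc_field_to_json(subfields, names):
    """Turn a list of (code, value) subfield pairs into a dict with readable keys."""
    record = {}
    for code, value in subfields:
        key = names.get(code, code)  # unknown codes are kept unchanged
        record[key] = value
    return record


if __name__ == "__main__":
    # Made-up values, for illustration only.
    field_256 = [("e", "1000"), ("b", "2048"), ("f", "3")]
    converted = marc_field_to_json(field_256, SUBFIELD_NAMES_256)
    print(json.dumps({"computer_file_characteristics": converted}, indent=2))
```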

json-schema.org

{
  "title": "Example Schema",
  "type": "object",
  "properties": {
    "firstName": { "type": "string" },
    "lastName": { "type": "string" },
    "age": {
      "description": "Age in years",
      "type": "integer",
      "minimum": 0
    }
  },
  "required": ["firstName", "lastName"]
}
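
To show what such a schema buys you, here is a small hand-written checker for the few keywords the example uses (`type`, `properties`, `required`, `minimum`). It is a sketch of the idea, not an implementation of the JSON Schema specification. The function name `validate` and the error message format are assumptions. For real use, a complete validator library is the better choice.

```python
EXAMPLE_SCHEMA = {
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
        "age": {"description": "Age in years", "type": "integer", "minimum": 0},
    },
    "required": ["firstName", "lastName"],
}

# Only the three types used by the example schema are supported here.
TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
}


def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate(instance, schema, path="$"):
    """Return a list of error strings; an empty list means the instance is valid."""
    expected = schema.get("type")
    if expected is not None and not TYPE_CHECKS[expected](instance):
        return [f"{path}: expected {expected}"]

    errors = []
    if "minimum" in schema and is_number(instance) and instance < schema["minimum"]:
        errors.append(f"{path}: below minimum {schema['minimum']}")

    if isinstance(instance, dict):
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"{path}.{name}: required property missing")
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(validate(instance[name], subschema, f"{path}.{name}"))
    return errors


if __name__ == "__main__":
    print(validate({"firstName": "Ada", "lastName": "Lovelace", "age": 36}, EXAMPLE_SCHEMA))
    print(validate({"firstName": "Ada", "age": -1}, EXAMPLE_SCHEMA))
```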

Knowledge modelling



DASPOS


JSON Schema based description

"primary_dataset": [
  {
    "@type": "dcat:Dataset",
    "title": "/Mu/Run2010B-Apr21ReReco-v1/AOD",
    "description": "Mu primary dataset in AOD format from
    "licence": "CC0 waiver",
    "persistent_identifiers": [
      {
        "identifier": "10.7483/OPENDATA.CMS.B8MR.C4A2",
        "scheme": "DOI"
      }
    ],
    "issued": "2011-04-26 11:32:43",
    "modified": "2011-05-02 21:22:30",
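
A short sketch of how a client might read such a record. It collects the DOI identifiers and turns them into resolver links through doi.org. The record below reuses only the fields visible on the slide. The `description` is left out because it is cut off there. The helper name `doi_links` is hypothetical.

```python
from datetime import datetime

# Fields copied from the slide; "description" is omitted because it is truncated there.
record = {
    "primary_dataset": [
        {
            "@type": "dcat:Dataset",
            "title": "/Mu/Run2010B-Apr21ReReco-v1/AOD",
            "licence": "CC0 waiver",
            "persistent_identifiers": [
                {"identifier": "10.7483/OPENDATA.CMS.B8MR.C4A2", "scheme": "DOI"}
            ],
            "issued": "2011-04-26 11:32:43",
            "modified": "2011-05-02 21:22:30",
        }
    ]
}


def doi_links(rec):
    """Return resolver URLs for every identifier whose scheme is DOI."""
    links = []
    for dataset in rec.get("primary_dataset", []):
        for pid in dataset.get("persistent_identifiers", []):
            if pid.get("scheme") == "DOI":
                links.append("https://doi.org/" + pid["identifier"])
    return links


if __name__ == "__main__":
    print(doi_links(record))
    # The timestamps use a plain "YYYY-MM-DD HH:MM:SS" layout.
    issued = datetime.strptime(record["primary_dataset"][0]["issued"], "%Y-%m-%d %H:%M:%S")
    print(issued.year)
```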

Open Archival Information System (OAIS)

SIP = Submission Information Package · AIP = Archival Information Package · DIP = Dissemination Information Package
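
The three package types can be pictured as stages of a simple pipeline. This is a toy sketch under stated assumptions, not how any CERN service implements OAIS. The class fields, the function names `ingest` and `disseminate`, and the use of SHA-256 checksums as fixity information are illustrative choices.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SIP:
    """Submission Information Package: what the producer hands over."""
    files: dict      # file name -> content bytes
    metadata: dict


@dataclass(frozen=True)
class AIP:
    """Archival Information Package: what the archive keeps."""
    files: dict
    metadata: dict
    checksums: dict  # file name -> SHA-256 hex digest (fixity information)


@dataclass(frozen=True)
class DIP:
    """Dissemination Information Package: what a consumer receives."""
    files: dict
    metadata: dict


def ingest(sip: SIP) -> AIP:
    """Turn a submission into an archival package by recording checksums."""
    checksums = {name: hashlib.sha256(data).hexdigest() for name, data in sip.files.items()}
    return AIP(dict(sip.files), dict(sip.metadata), checksums)


def disseminate(aip: AIP, wanted: list) -> DIP:
    """Verify fixity for the requested files, then build a dissemination package."""
    for name in wanted:
        if hashlib.sha256(aip.files[name]).hexdigest() != aip.checksums[name]:
            raise ValueError(f"checksum mismatch for {name}")
    return DIP({name: aip.files[name] for name in wanted}, dict(aip.metadata))


if __name__ == "__main__":
    sip = SIP({"table.csv": b"x,y\n1,2\n"}, {"title": "demo"})
    dip = disseminate(ingest(sip), ["table.csv"])
    print(sorted(dip.files))
```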


CERN Open Data


Open data policies

Data policies
restricted → embargo period → open
“[...] Data with high abstraction, such as AOD, will be conditionally made publicly available after an embargo period of 5 years after publication for 10% of the data and 10 years for 100% of the data [...]” (ALICE Data Policy)

Challenges
audience
– general public
– high-school students
– citizen scientists
– data miners
computing
– exploring in the browser
– specialised VMs
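
The embargo rule quoted from the ALICE Data Policy can be written as a step function. This is a sketch of one reading of the quote: it ignores the word "conditionally" and assumes nothing is public before the 5-year mark. The function name `public_fraction` is hypothetical.

```python
def public_fraction(years_since_publication: float) -> float:
    """Fraction of high-abstraction data (such as AOD) that is public.

    One reading of the quoted policy: 10% after 5 years, 100% after 10 years.
    Assumption: nothing is public before the embargo ends.
    """
    if years_since_publication >= 10:
        return 1.0
    if years_since_publication >= 5:
        return 0.10
    return 0.0


if __name__ == "__main__":
    for years in (0, 4.9, 5, 7, 10, 12):
        print(years, public_fraction(years))
```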

opendata.cern.ch


Visualise detector events


Basic histogramming


CMS primary datasets



IPython notebooks


CernVM virtual machines


OS ↔ Data ↔ Software

[Figure labels: VM; data levels primary, intermediate, reduced; each with software, validation, selection, usage]

OS/Software influence in medicine

8.8±6.6% (volume) and 2.8±1.3% (cortical thickness)

Open data? Who cares?

82,000 distinct users visited the site
21,000 distinct users viewed data records
16,000 distinct users used event display
3,000 distinct users used histogramming

Conclusions



CERN (open) data services

Invenio
http://invenio-software.org
http://github.com/inveniosoftware
@inveniosoftware
[email protected]

CERN Analysis Preservation
http://analysis-preservation.cern.ch
http://github.com/cernanalysispreservation
[email protected]

CERN Open Data
http://opendata.cern.ch
http://github.com/cernopendata
[email protected]

CERN IT: J. Delgado Fernandez, J. Kunčar, T. Smith, T. Šimko
CERN Library: S. Dallmeier-Tiessen, P. Fokianos, P. Herterich
ALICE: M. Gheata, C. Grigoras
ATLAS: K. Cranmer, L. Heinrich, D. Rousseau, F. Socher
CMS: A. Calderon, A. Huffman, K. Lassila-Perini, T. McCauley, A. Rao, A. Rodriguez Marrero
LHCb: S. Amerio, B. Couturier, A. Trisovic
CERN CernVM: J. Blomer
CERN EOS: L. Mascetti
DASPOS: M. Hildreth
DPHEP: F. Berghaus, J. Shiers