CERN (Open) Data Services
Tibor Šimko @tiborsimko
GL17 · 30 November 2015 · Amsterdam, The Netherlands
Invenio
What is Invenio?
digital library and document repository software
– mature platform: first public release in 2002
– rich data: articles, books, notes, photos, videos, data, code
originated in high-energy physics
– institutional repository: CERN Document Server
– integrated library system: CERN Library
– disciplinary repository: INSPIRE
nowadays co-developed by an international collaboration
participating in and collaborating with several EU projects
Invenio applications
institutional repositories & ILS
subject-based repositories
long-tail of science
world-wide installations
“Small data”
CERN Document Server
cds.cern.ch
“Related data files”
INSPIRE
inspirehep.net
“Data behind plots”
HEPDATA
“Interactive data behind plots”
Zenodo
zenodo.org
“Code you can cite”
https://guides.github.com/activities/citable-code
“Data ↔ Code ↔ Paper”
data (DATAVERSE) ↔ code (ZENODO) ↔ paper (arXiv)
arXiv:1401.0080 · hep-ex/0011057
“Big data”
CERN LHC Experiments
Large Scale Solutions
Primary site: 100k cores (10k nodes), 100k disks (50 PB), 21k NIC
Grid: 13 Tier-1 sites, 155 Tier-2 sites, 10 Gbps optical fibre
LHC Data Pyramid
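[Figure: data pyramid. Levels from top to bottom: ∼GB/analysis, ∼TB/analysis, ∼PB/year, ∼GB/sec. Arrows point upward; the upper arrow is labelled "analysis".]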
CERN Analysis Preservation
Preserve an analysis?
System architecture
[Diagram labels. Connected services: Analysis TWiki, CADI, SVN, CDS, GitHub, SharePoint, INSPIRE ... Centre: analysis-preservation.cern.ch. Below: file storage abstraction layer over AFS, Box, Ceph, CASTOR, Drive, EOS, S3.]
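The idea behind a file storage abstraction layer can be sketched as follows. This is a minimal illustration and not the CERN Analysis Preservation implementation: the `FileStorage` interface, the `InMemoryStorage` backend and the `preserve` helper are hypothetical names, and a real backend (AFS, EOS, S3, ...) would take the place of the in-memory dictionary.

```python
from abc import ABC, abstractmethod


class FileStorage(ABC):
    """Hypothetical common interface that every storage backend implements."""

    @abstractmethod
    def save(self, key, data):
        """Store the bytes `data` under the name `key`."""

    @abstractmethod
    def load(self, key):
        """Return the bytes stored under the name `key`."""


class InMemoryStorage(FileStorage):
    """Stand-in backend for illustration; keeps files in a dictionary."""

    def __init__(self):
        self._files = {}

    def save(self, key, data):
        self._files[key] = data

    def load(self, key):
        return self._files[key]


def preserve(storage, name, content):
    """Write a file through the abstract interface and read it back."""
    storage.save(name, content)
    return storage.load(name)
```

Code that calls `preserve` only sees `FileStorage`, so swapping one backend for another would not change the calling code.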
Pilot project
Example: LHCb analysis
Example: CMS statistics
Knowledge capture
Knowledge representation
prototype: extended MARC21
– “technical” metadata: beyond bytes
  e.g. 256 computer file characteristics
  $a characteristics $e events $t text $b bytes $f files ...
– “knowledge” metadata: semantics
  e.g. 505 formatted contents note
  CSV column information $t title $g miscellaneous
internal format: JSON
[Diagram labels: MARC21, JSON, JSON Schema, EAD]
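As a rough illustration of moving from extended MARC21 to a JSON internal format, the sketch below maps the subfields of one 256 field to a dictionary. The subfield names are the ones listed above; the function name, the output key `computer_file_characteristics` and the sample values are assumptions made for this example, not the actual Invenio data model.

```python
import json

# Subfield codes of the extended 256 field, as listed on the slide.
SUBFIELD_NAMES_256 = {
    "a": "characteristics",
    "e": "events",
    "t": "text",
    "b": "bytes",
    "f": "files",
}


def marc_256_to_json(subfields):
    """Map (code, value) pairs of one 256 field to a JSON-ready dict."""
    fields = {}
    for code, value in subfields:
        name = SUBFIELD_NAMES_256.get(code)
        if name is None:
            continue  # skip subfield codes this sketch does not know
        fields[name] = value
    # The enclosing key name is hypothetical.
    return {"computer_file_characteristics": fields}


# Hypothetical sample values, for illustration only.
example = [("e", "1000"), ("b", "2048"), ("f", "3")]
as_json = json.dumps(marc_256_to_json(example), indent=2)
```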
json-schema.org
{
  "title": "Example Schema",
  "type": "object",
  "properties": {
    "firstName": { "type": "string" },
    "lastName": { "type": "string" },
    "age": {
      "description": "Age in years",
      "type": "integer",
      "minimum": 0
    }
  },
  "required": ["firstName", "lastName"]
}
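To show what such a schema constrains, here is a small hand-rolled check written for this example. It covers only the three keywords used in the schema above (`type`, `minimum`, `required`) and is not a JSON Schema implementation; in practice a full validator library would be used instead. The names `check` and `PYTHON_TYPES` are assumptions of the sketch.

```python
import json

SCHEMA = json.loads("""
{
  "title": "Example Schema",
  "type": "object",
  "properties": {
    "firstName": {"type": "string"},
    "lastName": {"type": "string"},
    "age": {"description": "Age in years", "type": "integer", "minimum": 0}
  },
  "required": ["firstName", "lastName"]
}
""")

# Only the JSON types that appear in the example schema.
PYTHON_TYPES = {"string": str, "integer": int, "object": dict}


def check(instance, schema):
    """Return a list of error messages; an empty list means no error found."""
    if not isinstance(instance, dict):
        return ["instance is not an object"]
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required property: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in instance:
            continue
        value = instance[name]
        expected = PYTHON_TYPES[rules["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name}: expected {rules['type']}")
        elif "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: below minimum {rules['minimum']}")
    return errors
```

A record with both names and a non-negative integer age yields an empty list; leaving out `lastName` or giving a negative age each produces one message.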
Knowledge modelling
Knowledge modelling
DASPOS
JSON Schema based description
"primary_dataset": [
  {
    "@type": "dcat:Dataset",
    "title": "/Mu/Run2010B-Apr21ReReco-v1/AOD",
    "description": "Mu primary dataset in AOD format from
    "licence": "CC0 waiver",
    "persistent_identifiers": [
      {
        "identifier": "10.7483/OPENDATA.CMS.B8MR.C4A2",
        "scheme": "DOI"
      }
    ],
    "issued": "2011-04-26 11:32:43",
    "modified": "2011-05-02 21:22:30",
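A record of this shape is straightforward to consume programmatically. The sketch below rebuilds the excerpt as a Python dictionary and pulls out the DOI and the timestamps. The `description` field is left out because it is cut off in the excerpt, and the helper names `identifiers` and `parse_timestamp` are assumptions made for this example.

```python
from datetime import datetime

# Fields copied from the excerpt; "description" is omitted because
# its value is truncated in the source.
record = {
    "primary_dataset": [
        {
            "@type": "dcat:Dataset",
            "title": "/Mu/Run2010B-Apr21ReReco-v1/AOD",
            "licence": "CC0 waiver",
            "persistent_identifiers": [
                {"identifier": "10.7483/OPENDATA.CMS.B8MR.C4A2", "scheme": "DOI"}
            ],
            "issued": "2011-04-26 11:32:43",
            "modified": "2011-05-02 21:22:30",
        }
    ]
}


def identifiers(record, scheme):
    """Collect identifiers of the given scheme from all primary datasets."""
    found = []
    for dataset in record.get("primary_dataset", []):
        for pid in dataset.get("persistent_identifiers", []):
            if pid.get("scheme") == scheme:
                found.append(pid["identifier"])
    return found


def parse_timestamp(text):
    """Parse the 'YYYY-MM-DD HH:MM:SS' timestamps used in the excerpt."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
```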
Open Archival Info System
SIP = Submission Information Package · AIP = Archival Information Package · DIP = Dissemination Information Package
CERN Open Data
Open data policies
Data policies
restricted → embargo period → open
“[...] Data with high abstraction, such as AOD, will be conditionally made publicly available after an embargo period of 5 years after publication for 10% of the data and 10 years for 100% of the data [...]” (ALICE Data Policy)
Challenges
audience
– general public
– high-school students
– citizen scientists
– data miners
computing
– exploring in the browser
– specialised VMs
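The staged embargo in the ALICE quote above can be written as a small step function. This is only one reading of the quoted sentence: treating the 5-year and 10-year marks as inclusive is an assumption, the "conditionally" qualifier is ignored, and the function name is hypothetical.

```python
def public_fraction(years_since_publication):
    """Fraction of high-abstraction data that is public under the quoted rule.

    Assumes the thresholds are inclusive: 10% from year 5, 100% from year 10.
    """
    if years_since_publication >= 10:
        return 1.0
    if years_since_publication >= 5:
        return 0.1
    return 0.0
```

For example, `public_fraction(7)` gives `0.1`, since the 5-year mark has passed but the 10-year mark has not.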
opendata.cern.ch
Visualise detector events
Basic histogramming
CMS primary datasets
CMS primary datasets
IPython notebooks
CernVM virtual machines
OS ↔ Data ↔ Software
[Diagram labels: VM; data tiers primary, intermediate, reduced; each tier is shown with software, validation, selection and usage.]
OS/Software influence in medicine
8.8±6.6% (volume) and 2.8±1.3% (cortical thickness)
Open data? Who cares?
82,000 distinct users visited the site
21,000 distinct users viewed data records
16,000 distinct users used event display
3,000 distinct users used histogramming
Conclusions
CERN (open) data services
CERN (open) data services
Invenio
http://invenio-software.org
http://github.com/inveniosoftware
@inveniosoftware
[email protected]
CERN Analysis Preservation
http://analysis-preservation.cern.ch
http://github.com/cernanalysispreservation
[email protected]
CERN Open Data
http://opendata.cern.ch
http://github.com/cernopendata
[email protected]
CERN IT J. Delgado Fernandez, J. Kunčar, T. Smith, T. Šimko
CERN Library S. Dallmeier-Tiessen, P. Fokianos, P. Herterich
ALICE M. Gheata, C. Grigoras
ATLAS K. Cranmer, L. Heinrich, D. Rousseau, F. Socher
CMS A. Calderon, A. Huffman, K. Lassila-Perini, T. McCauley, A. Rao, A. Rodriguez Marrero
LHCb S. Amerio, B. Couturier, A. Trisovic
CERN CernVM J. Blomer
CERN EOS L. Mascetti
DASPOS M. Hildreth
DPHEP F. Berghaus, J. Shiers