GERMAN-FRENCH SUMMER UNIVERSITY FOR YOUNG RESEARCHERS 2011
CLOUD COMPUTING: CHALLENGES AND OPPORTUNITIES
HPC in the Cloud – a Match Made in Heaven?
17.7. – 22.7.2011
Dr. Helmut Heller, Dr. Matteo Lanati
Leibniz Supercomputing Centre, Munich, Germany
Thursday, 21 July 2011
Thank you for the invitation and the great week here!
Overview
• Leibniz Supercomputing Centre
• What is a Supercomputer?
• Grid Computing, Globus, IGE
• Why not HPC in the cloud?
• Cloud investigations at LRZ
• Summary
The Leibniz Supercomputing Centre is…
• Computer Centre (~170 employees) for all Munich universities, with
  - more than 80,000 students
  - more than 26,000 employees
  - approximately 8,500 scientists
• Regional Computer Centre for all Bavarian universities
  - Capacity computing
  - Special equipment
  - Backup and Archiving Centre (more than 5,000 TByte)
  - Distributed file systems
  - Competence centre (Grid, Networks, HPC, IT Management)
• National Supercomputing Centre
  - Member of the Gauss Centre for Supercomputing
  - Integrated in European HPC and Grid projects
Research at LRZ
• IT Management
  - Service Management, ITIL (process refinements, mapping to tools, benchmarking), Security Management in Federations, Virtualization
• Piloting new network technologies
  - 100 Gbit Ethernet WANs
• Long-term archiving
• Grids
  - Management of Virtual Organizations, Grid monitoring, Grid middleware (Globus), security and intrusion detection on the Grid
  - Projects: IGE, D-Grid, DEISA, PRACE, EGI, MAPPER, eIRGSP3, DGSI, …
• HPC
  - Energy efficiency
  - Application scaling
Where Are We Located? Garching, Munich, Alps
View from the roof of LRZ
Located between Munich airport (MUC) and city centre
Slide from M. Brehm, LRZ
Leibniz Supercomputing Centre
The LRZ Buildings Before and After Extension
Slide from M. Brehm, LRZ
LRZ Building Extension Facts
• Ground-breaking ceremony: October 2009
• Completion of building extension: Q2 2011
• Dedicated 20 kV power line (maximum: 13 MW)
• Extension of current (3 MW) power and cooling capacities to 10 MW (2011)
• Computer room floor space extension: compute cube 1,764 m² -> compute cuboid 3,160 m²
• 1.6 MW diesel generator and 300 kW static UPS for highly critical LRZ services, not for HPC
• 9 x 1.2 MW flywheel UPS (20 secs autonomy, bridges short interruptions and spikes)
Slide from M. Brehm, LRZ
… and Amazon?
• Former supermarket warehouse near Dublin bought in 2011 for a new data centre
• Floor space: LRZ 3,160 m² vs. Amazon 22,300 m²
Amazon Cluster Compute Instance
• Name: cc1.4xlarge
• 23 GB of memory
• 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture)
• 1,690 GB of instance storage
• 64-bit platform
• I/O performance: Very High (10 Gigabit Ethernet)
• Maximum of 8 instances => 64 cores
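A quick arithmetic sketch of what the stated maximum of 8 such instances adds up to (figures taken from the slide; the variable names are illustrative):

```python
# Aggregate capability of the largest cluster-compute setup available
# on EC2 in 2011, using the cc1.4xlarge figures from this slide.
CORES_PER_INSTANCE = 8      # 2 x quad-core Xeon X5570
MEM_GB_PER_INSTANCE = 23
MAX_INSTANCES = 8           # the stated maximum

total_cores = MAX_INSTANCES * CORES_PER_INSTANCE
total_mem_gb = MAX_INSTANCES * MEM_GB_PER_INSTANCE
print(total_cores, "cores,", total_mem_gb, "GB memory in total")
# 64 cores, as the slide says, and 184 GB of aggregate memory
```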
Evolution of Peak Performance and Memory
[Chart: peak performance and memory over time, LRZ Linux cluster vs. EC2 Compute Units; growth factor ~15,000]
What is a PetaFlop?

  K  Kilo   10^3    1,000                      Thousand     (German: Tausend)
  M  Mega   10^6    1,000,000                  Million      (German: Million)
  G  Giga   10^9    1,000,000,000              Billion      (German: Milliarde)
  T  Tera   10^12   1,000,000,000,000          Trillion     (German: Billion)
  P  Peta   10^15   1,000,000,000,000,000      Quadrillion  (German: Billiarde)
  E  Exa    10^18   1,000,000,000,000,000,000  Quintillion  (German: Trillion)

Flop/s: Floating Point Operations per Second
1 TB = 1 million books of 500 pages each => a stack 20 km high!
1 TFlop/s (if a human needs 1 second per Flop) => like hammering 25 rows of 1 mm nails around the equator in 1 second
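As a sanity check of the nail analogy (an equator length of ~40,075 km is assumed here), a short sketch:

```python
# Check the nail analogy: 10^12 operations (1 TFlop/s for one second,
# at one human "Flop" per second) versus 1 mm nails around the equator.
EQUATOR_MM = 40_075 * 10**6   # assumed equator length: ~40,075 km in mm

nails_per_row = EQUATOR_MM    # one 1 mm nail per millimetre
rows = 10**12 / nails_per_row
print(round(rows))            # ~25 rows, matching the slide
```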
Slide from M. Brehm, LRZ
LRZ as Supercomputing Centre
• National supercomputing system: SGI Altix 4700 (Linux IA64)
  - 9,728 compute cores (Intel Itanium2 Montecito)
  - 62.3 TFlop/s peak performance
  - 56.5 TFlop/s Linpack benchmark
  - 39 TByte total memory
  - 660 TByte attached disk space
  - weighs 103 metric tons
  - consumes ~1 MVA
  - on a 24 m x 12 m footprint
  - to be replaced by IBM iDataPlex (Sandy Bridge based): SuperMUC
• Large shared memory: SGI Altix (IA64), SGI UV (x86)
  - 256-core SGI Ultraviolet and 512-core ICE for scientists in the state of Bavaria
  - 128 dual-core CPU SGI Altix 4700 for scientists in the state of Bavaria
• Linux x86 (Intel & AMD) heterogeneous cluster
  - more than 4,000 cores (> 6,000 cores incl. attended housing), mainly for the Munich universities and for scientists in the state of Bavaria
• Linux hosting and housing
  - more than 500 virtual and 100 physical servers for general IT services
Slide from M. Brehm, LRZ
Where does HPC start?
• Capability computing => HPC
• Capacity computing => HTC
• Top500 list of the fastest machines
Top 10 HPC Machines
HPC Performance
China Catching Up!
Where does HPC start?
• Capability computing => HPC
• Capacity computing => HTC
• Top500 list of the fastest machines
  - LRZ: SuperMIG: 166, HLRB2: 198
• Lifetime of an HPC machine: only 6 years!
• Amazon at the very low end: 451
• A public cloud is not really HPC!
Cloud, Grid, HPC
• Cloud is the culmination of the journey towards utility computing that started in HPC with
  - remote computing (telnet!)
  - Grid computing (Globus, Unicore, gLite)
  - virtualization, true on-demand
• Cloud should complement Grid
INITIATIVE FOR GLOBUS IN EUROPE
The European Grid Ecosystem
• Software providers: gLite, ARC, dCache, UNICORE
• Computing centers
Tier-0: PRACE-1IP
• PRACE: top-level HPC, Petaflop computing
• Six European supercomputers of the highest performance
• Globus tools used:
  - gsissh for interactive access
  - GridFTP for high-speed data transfer
Tier-1: PRACE-2IP
• Globus tools used in PRACE
  - gsissh as the primary interactive access method to PRACE, through gsissh door nodes
  - GridFTP as the primary high-performance data transfer tool, through door nodes to GPFS
  - MyProxy @ LRZ
  - GRAM (GT5) available on request
Tier-2: EGI
• IGE is a middleware provider and will bring Globus components into UMD
  - Service Level Agreement (SLA) signed with EGI
• Service provider for European Grids
• Third-level support for Globus in Europe
• Globus can act as a bridge between tiers
• EGI is developing a Cloud strategy
(Other middleware shown: UNICORE, gLite, ARC, dCache)
IGE Meetings
• European Globus Community Forum (EGCF) – sign up for it here: http://www.ige-project.eu/hub/egcf
• GlobusEurope conference – 19th of September 2011 – Lyon, co-located with EGI Technical Forum – registration form: http://www.ige-project.eu/
Cloud and Grid
• Cloud computing
  - commercial (Amazon, Google, Microsoft, SUN, ...)
  - virtualization (your job is your VM)
  - environment provisioning
  - emphasis on resource management
• Cloud and Grid
  - a cloud can be part of a Grid (MoU with StratusLab!)
  - cloud does not emphasize "sharing": proprietary technologies (vendor lock-in), commercial
• Grid
  - federation of many distributed resources
  - emphasis on job management
Matthias Brehm, LRZ High Performance Group
Compute Resources at LRZ
• HLRB2 (2012: SuperMUC)
  - SGI Altix 4700
  - 9,728 cores
  - 62.3 TFlop/s peak
• Linux Cluster
  - 3,054 cores
  - 22 TFlop/s peak
• Remote visualization:
  - RVS1, GVS1, GVS2
  - http://www.lrz.de/services/compute/visualisation/visualisation_2/
… and Amazon?
… and Amazon?
Why Do We Need Faster and Bigger Computers?
• Simulations in science, engineering and technology
  - Astrophysics: formation of galaxies, nuclear burning, star atmospheres/physics
  - Biochemistry: molecular dynamics of complete cells
  - Chemistry: structure and properties of molecules, catalysts
  - Geophysics: understanding earthquakes, inner structure of the earth, seismic waves
  - Engineering: coupled fluid dynamics and structural simulations; aerodynamics of airplanes, cars, and trains; crash simulations
  - Fluid dynamics: understanding and modeling turbulence; combustion
  - Medicine: imaging, blood flow
  - Material sciences: melting processes and crystal growth
  - Meteorology/climatology: weather forecast, human impact on climate
  - Physics: superconductivity, high-energy physics
  - Computer science/informatics: software engineering methods for parallel applications
  - Mathematics: adaptive methods for differential equations
• Other disciplines also need bigger computers
  - Databases
  - Data mining
  - Web services
Slide from M. Brehm, LRZ
HLRB II Usage by Research Areas
Slide from M. Brehm, LRZ
Case: Pouring Water Rüde et al., University of Erlangen
Slide from M. Brehm, LRZ
Case: Falling Drop Rüde et al., University of Erlangen
Slide from M. Brehm, LRZ
(Tsunami) Waves Krafczyk, TU Braunschweig
Slide from M. Brehm, LRZ
(Tsunami) Waves and Buildings Krafczyk, TU Braunschweig
Slide from M. Brehm, LRZ
The Aquarius Project: Cold Dark Matter under a Numerical Microscope
Max Planck Institute for Astrophysics, Garching; S. White, V. Springel
• First-ever one-billion-particle simulation of a Milky Way-sized dark matter halo, (i) to explore the formation of our Galaxy, (ii) to search for signals from dark matter annihilation, and (iii) to design experiments for direct detection of dark matter.
• "This is an extremely ambitious project, feasible only with the unique capabilities of HLRB-II. The large number of fast processors, the large amount of memory, and the extremely fast communication can all be harnessed effectively by our code in a world-leading effort to address science of major professional and public impact." (Volker Springel, MPI for Astrophysics)
Slide from M. Brehm, LRZ
Cholesterol in a Biomembrane
• Long-range Coulomb forces make long-range communication necessary
• Many particles (millions)
• Small time step of femtoseconds, dictated by fast motions in the system
• Biologically relevant processes take microseconds to hours
Why Special HPC Computers?
Why HPC?
Slide from M. Brehm, LRZ
Why "Super" Computers?
• Molecular dynamics:
  - atoms: n = 10^6
  - timesteps: m = 10^6 (femto- to microseconds)
  - O(n² · m) · 20 variables · 100 operations/time step = 20 · 10^20 operations
  - => 200,000 hours on 4,096 CPUs: 22 years! Not possible!
• The long-range coupling (Coulomb or gravity) necessitates a strong coupling on the computational side
• This coupling means a low-latency, high-bandwidth interconnect in the computer
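The operation count above can be re-derived in a few lines; the 4,096-CPU runtime then implies a sustained per-CPU rate, which is an inference from the slide's numbers, not a quoted figure:

```python
# Re-derive the slide's operation count for molecular dynamics.
n = 10**6   # atoms
m = 10**6   # time steps
ops = n**2 * m * 20 * 100     # 20 variables x 100 operations per step
assert ops == 20 * 10**20

# 200,000 hours of wall time on 4,096 CPUs implies this sustained
# per-CPU rate (an inference, not a number from the slide):
rate = ops / (4096 * 200_000 * 3600)
print(f"{rate:.1e} ops/s per CPU")   # well below 1 GFlop/s per CPU
```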
HPC: Parallel and High Bandwidth
• 10,000s of CPU cores are needed to crack the computational problem of 20 · 10^20 operations => parallel
• The computational load has to be divided between the CPU cores
  - either by distributing particles onto CPUs
  - or by distributing force computations (each CPU holding a complete data set)
• Physical long-range forces necessitate data exchange between all nodes in every time step => high bandwidth and low latency
SuperMUC System Overview
• Peak: 3 PF expected
• Thin node islands (>8,000 cores each)
• 1 fat node island (>8,000 cores)
Slide from M. Brehm, LRZ
New LRZ Computer SuperMUC
[Floor plan: machine room, 21 m wide (incl. 6 m maintenance area)]
Financing Scheme for Investment and Operating Costs (gross, incl. VAT)
50% Free State of Bavaria, 50% Federal German Government

                                          2010-2014 Phase 1   2014-2016 Phase 2
  High-end system investment costs
  (hardware and software)                 53 M€               ~19 M€
  Operating costs                         32 M€               ~29 M€
  Fixed sum                               85 M€               ~48 M€
  + Extension buildings
  (construction and infrastructure)       49 M€
  Total sum                               134 M€

Funding for Phase 2 is announced but not legally secured.
Slide from M. Brehm, LRZ
Growth of Power Consumption
[Chart: LRZ power consumption in MWh, 1996-2012, rising from near zero to about 30,000 MWh per year]
Facing the Petascale Challenge: Power & Cooling
[Chart: LRZ power & cooling budget, annual power costs 1998-2012, rising towards 8,000,000 €]
• Energy cost 2010: 0.158 € / kWh
• Today: PUE ~1.4; goal: PUE ~1.15
Leibniz Supercomputing Centre: Facing the Peta-scale Challenge
• LRZ, as a part of GCS, shall become a European centre for supercomputing
• Current peta-scale systems have a power requirement of several MW; power is becoming the biggest issue for supercomputing centres
• ~100 m²/MW of floor space for power distribution equipment
• System power needed to operate supercomputing systems still increases
• Desire to drive PUE as close to 1 as possible
• Use new cooling technologies (direct liquid cooling, highly optimized indirect cooling technologies)
Pillars of Energy Efficient Computing
• Energy efficient hardware
  - Latest semiconductor technology
  - Use of energy-saving processors
  - Choice of the most appropriate hardware for the scientific problem at hand
• Energy efficient infrastructure
  - Reduction of power drain in the power supply chain
  - Improved & energy-saving cooling technologies (e.g., direct water cooling)
  - Re-use of waste heat
• Energy aware software environment
  - Monitoring of computers and infrastructure
  - Energy-aware scheduling
  - Dynamic frequency and voltage scaling
  - Monitoring and optimization of scientific applications
  - Resource sharing
SuperMUC Hardware Data

SuperMUC system parameters (Q2 2012):
  Number of islands (thin + fat):                                18 + 1
  Number of nodes:                                               >9,000
  Number of cores:                                               >140,000
  Processor types (thin + fat):                                  SB-EP + WM-EX
  Peak performance [PF] (projected entire system):               3.02
  Linpack [PF]:                                                  2.21
  Total size of memory (TByte):                                  >300
  Expected electrical power consumption of total system (kW):    2,782
  Maximum tolerable supply temperature of cooling loop 2 (°C):   45
  Outlet temperature of compute node coolant (range) (°C):       33 to 50
  Topology (pruned tree, pruning factor):                        island 1:4
  IB technology:                                                 FDR10
  Parallel file system (SCRATCH, WORK):                          GPFS
  NAS user storage (HOME):                                       NetApp
  Size of parallel storage (PByte):                              10.7
  Size of NAS user storage (TByte):                              2+2
  Aggregate theoretical bandwidth to/from GPFS (GByte/s):        200
  Aggregate theoretical bandwidth to/from NAS storage (GByte/s): 10

Node islands:
  Number:            18
  Nodes per island:  >500
  Cores per node:    16
  Processors:        dual-socket Intel SB-EP
  Memory:            2 GB/core, 32 GB/node

Migration system (will be fat node island, July 2011):
  Number of islands:                1
  Number of nodes:                  205
  Number of cores (40 cores/node):  8,200
  Type of processor chips:          WM-EX
  Peak performance (TF):            79
  Total size of memory (TByte):     32.03
  Shared memory per node [GB]:      160
Slide from M. Brehm, LRZ
"Deep Hierarchy" Architecture
Indicating relevant parameters: MPI/HW latencies and "moderately saturated" bandwidths per core; values give an impression of the general magnitude. Note: SB will have 8 cores/socket.
[Diagram: ccNUMA node hierarchy, from cores via sockets, chipset and memory up to IB leaf and core switches. Latencies range from <5 ns (intra-socket) and 50-90 ns (memory) up to ~2,000-3,000 ns (inter-node IB) and ~10,000 ns (I/O to disk); bandwidths from ~80 GB/s (intra-socket) and 6.4 GB/s (memory) down to ~0.5-3 GB/s and ~0.12 GB/s (inter-node IB) and ~0.0012 GB/s (I/O to disk)]
© 2011 Leibniz Supercomputing Centre, R. Bader
… and Amazon?
• Amazon Cluster Compute Instances
  - I/O performance: Very High (10 Gigabit Ethernet)
  - no clear numbers available
  - typical HW latency: 10,000 ns to 20,000 ns (MPI via TCP is slow!)
  - bandwidth: 1 GB/s (10 Gb/s); BUT: what is the switch topology? Fat tree? Or blocking?
  - high latency kills parallel, tightly coupled simulations
• SuperMUC
  - bandwidth: 80 GB/s, 6.4 GB/s, 0.5 GB/s, 0.12 GB/s
  - HW latency: 5 ns, …, 3,000 ns (specialized MPI implementation)
• => no true HPC at Amazon (or elsewhere)!
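A toy model (transfer time as latency plus size over bandwidth, with a hypothetical 1 KB message and 15 µs taken as the middle of the quoted EC2 range) illustrates why latency rather than bandwidth is the killer for tightly coupled codes:

```python
# Toy model: transfer_time = latency + message_size / bandwidth.
# 1 KB is a hypothetical small-message size; 15 us is mid-range of the
# 10,000-20,000 ns quoted for EC2 on this slide.
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    return latency_s + size_bytes / bandwidth_bytes_per_s

size = 1024
t_ib  = transfer_time(size, 3e-6, 0.5e9)    # SuperMUC inter-node IB
t_ec2 = transfer_time(size, 15e-6, 1.0e9)   # EC2: MPI over TCP on 10 GbE

print(f"EC2 / IB: {t_ec2 / t_ib:.1f}x")
# ~3x slower for small messages, even though EC2 has twice the bandwidth
```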
DEISA Benchmarks: Latency
For the inter-node measurements, the two processes are each placed on a different node, forcing the benchmark to use the interconnect; the multi-inter-node test runs multiple sets of PingPong benchmarks, with each pair of processes placed on different nodes using an 8x2 mapping.
Figure 12 – Timings for the PingPong benchmark on EC2 and XT4.
When running the PingPong benchmark with both processes placed on the same node, the performance exhibited by both EC2 and XT4 is comparable. For messages larger than 12 Bytes, EC2 shows better performance. This is in fact not surprising and is due to the differences in the architecture and size of the nodes between the two systems: an XT4 node consists of …
DEISA Benchmarks: Applications
… on most platforms we have tested them on: SU3_AHiggs, GENE2 and GADGET. The maximum size of high-performance cluster jobs you can run on the EC2 infrastructure is 8 nodes (i.e. 64 cores). Given the poor scaling behaviour of the IMB benchmarks, we decided to use small datasets to be able to run jobs on a single node.
All three applications behaved in a very similar way: the single-node performance of EC2 is very good and comparable to the performance achievable on either JUROPA or HECToR XT4. However, when increasing the job size to two nodes, the performance immediately stalls. Table 13 summarises the performance of GENE on the high-performance cluster compute images.
                                                4 cores   8 cores   16 cores   32 cores
  GENE: time per time step (Cyclone dataset)    1.054s    0.629s    0.625s     0.791s
  SU3_AHIGGS: total time
  (32³ lattice, 10,000 iterations)              -         661.90s   612.60s    -
  GADGET: 2nd + 3rd iteration (small dataset)   97.54s    49.12s    47.65s     -

Table 13 – Performance of GENE, SU3_AHIGGS and GADGET on Amazon EC2's high-performance cluster compute images
Based on the results from the IMB benchmarks, this performance exhibited by the three scientific codes is no surprise. While EC2 can hold its own in terms of straight computing power inside a single node, the lack of a low-latency, high-bandwidth network will result in very poor scalability for tightly coupled HPC applications, which rely on fast communication more than on fast computing.
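From Table 13, GADGET's parallel efficiency can be computed directly; this sketch shows the near-perfect scaling inside one 8-core node (4 to 8 cores) and the collapse when crossing nodes (8 to 16 cores):

```python
# Parallel efficiency of GADGET on EC2, from Table 13 (seconds per run).
times = {4: 97.54, 8: 49.12, 16: 47.65}

def efficiency(c1, c2):
    # actual speedup divided by the ideal speedup c2/c1
    return (times[c1] / times[c2]) / (c2 / c1)

print(f"4 -> 8 cores (one node):   {efficiency(4, 8):.0%}")
print(f"8 -> 16 cores (two nodes): {efficiency(8, 16):.0%}")
# ~99% inside a node, ~52% as soon as the 10 GbE interconnect is involved
```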
Problem: Access Times!

  Getting data from:              Getting some food from:
  CPU register   1 ns             fridge                10 s
  L2 cache       10 ns            microwave             100 s (~2 min)
  Memory         80 ns            pizza service         800 s (~15 min)
  Network (IB)   200 ns           city mall             2,000 s (~0.5 h)
  GPU (PCIe)     50,000 ns        mum sends cake        500,000 s (~1 week)
  Harddisk       500,000 ns       grown in own garden   5 Ms (~2 months)

Slide from F. Jamitzky, LRZ
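The food analogy simply rescales every latency by the same factor (1 ns becomes 10 s, i.e. ×10^10); a short sketch to verify:

```python
# The slide's analogy rescales every access time by the same factor:
# 1 ns of machine time corresponds to 10 s of "food time" (x 10^10).
SCALE = 10**10
access_ns = {
    "CPU register": 1, "L2 cache": 10, "memory": 80,
    "network (IB)": 200, "GPU (PCIe)": 50_000, "harddisk": 500_000,
}
for name, ns in access_ns.items():
    print(f"{name}: {ns * 1e-9 * SCALE:,.0f} s")
# harddisk -> 5,000,000 s, i.e. the ~2 months of growing your own food
```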
Cloud Ingredients
• NIST Cloud definition:
  - On-demand self-service
  - Broad network access
  - Resource pooling (multi-tenant model)
  - Rapid elasticity
  - Measured service (pay-as-you-go)
  - All of the above in combination
• Virtualization (to realize IaaS)
  - performance loss around 10% for I/O (disk, network)
  - 10% is not acceptable for top-end HPC
Why is Cloud Uptake Slow? (1)
• Virtualization overhead (10%)
  - need to understand the hardware (CPU architecture)
  - need to understand the network (RDMA, bandwidth)
  - need to know the hardware layout (topology)
  - no oversubscription in HPC
• Security of clouds is doubted
  - compliance (German data protection agency does not allow personal data in the cloud)
  - data protection (Patriot Act!)
  - losing control over data in the cloud
  - access to low ports (root!)
  - users' VM images can be old and open to attack (hackers!)
Why is Cloud Uptake Slow? (2)
• Commercial cloud deficiencies
  - limited capabilities and power of the typical underlying hardware
  - hardware is not homogeneous
  - lack of high-speed, low-latency interconnects (MPI!)
  - sharing leads to (unpredictable) load imbalances
  - failures when creating large instances
• Commercial cloud is too expensive
  - public cloud still far more expensive than an HPC centre (Rutrell Yasin)
  - getting big data in/out/stored is expensive
  - private cloud (NERSC, NASA Ames) adoption is higher
• Software licensing: no licensing model for on-demand access (Matlab, Ansys, ...)
Why is Cloud Uptake Slow? (3)
• Operational concerns
  - part of the workflow has to be lifted into the cloud
  - SLAs needed; how to get agreements?
• Managerial concerns (industry!)
  - in case of problems, operators of Amazon are not directly accessible
• Verifiability/reproducibility of results
• No cloud bursting possible for real HPC applications
  - they already oversubscribe the HPC computers; long queue waiting times!
Why is Cloud Uptake Slow? (4)
• Depends on the type of application
  - embarrassingly parallel workloads work well in the cloud
  - strongly coupled workloads (weather forecast, molecular dynamics, astrophysics, …) are not well suited
• Long runs are subject to node instability
  - in LBNL tests, 1 out of 10 runs on EC2 failed
HPC Cloud Efforts
• EGI (low end of the performance pyramid) is developing a cloud strategy!
• CSC, SARA, and HLRS started a public cloud. SARA provides free access for industry, but only some demo projects are going.
• USA: "Cloud first" strategy (political level)
• NERSC Magellan cloud experiments
• Countries like China, where there is not yet an established HPC infrastructure, are more engaged in Cloud+HPC as they try to leapfrog the development
  - public cloud for free in China, yet very little interest from Chinese companies
• Most cloud uptake is in SMEs, government (highest level of evaluation, lowest acceptance), and universities, not in big companies. SMEs up to now have not even taken to HPC at all.
Cloud Investigation in DEISA
• CSC, Finland, built their own script-based software to manage a cloud on their Linux cluster
  - no IB due to missing drivers for virtualization
• HLRS, Germany, used OpenNebula
  - no IB at the moment, only 1 Gb Ethernet
• SARA, The Netherlands, used OpenNebula and did network speed tests:
  - 20 Gb IB, but only with IPoIB (Virtio driver)
  - 10 Gb Ethernet
DEISA Cloud Results
• Network speed tests (iperf) were disappointing
  - IPoIB reached 2.4 Gb/s
  - 10 GbE with Virtio driver: 2.4 Gb/s
  - 10 GbE pass-through: 8.5 Gb/s, but this has severe security implications and can't be used
• Final verdict: the cloud is currently (April 2011) not usable for HPC
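Expressed as a fraction of the nominal link speed (figures from the slide), the results look like this sketch:

```python
# Measured iperf throughput as a fraction of the nominal link speed
# (figures from the DEISA tests on this slide).
nominal_gbps  = {"IPoIB (20 Gb IB)": 20.0, "10 GbE + Virtio": 10.0, "10 GbE pass-through": 10.0}
measured_gbps = {"IPoIB (20 Gb IB)": 2.4,  "10 GbE + Virtio": 2.4,  "10 GbE pass-through": 8.5}

for link in nominal_gbps:
    print(f"{link}: {measured_gbps[link] / nominal_gbps[link]:.0%} of nominal")
# only PCI pass-through comes close (85%), but it cannot be used for security reasons
```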
Cloud Investigations at LRZ
• No public true HPC cloud available
• Cloud types:
  - public (Amazon, Google, Azure, …)
  - private (the cloud YOU build!)
  - hybrid (bring public/other resources in if needed: bursting)
• What do the users want?
  - HPC (but at the low end)
  - IaaS to run "their" flavor of OS
  - a controlled environment: Supernova Factory codes are sensitive to OS/compiler versions
• At LRZ, requests from
  - LCG community => Scientific Linux
  - biophysicists => Debian
  - mathematicians => Hadoop
Cloud Investigations at LRZ
• Build our own cloud for our users
• Use the familiar Grid interface to provide IaaS => Nimbus!
• Nimbus overview and components
• Advanced scenarios (Grid and cloud)
• Short comparison with Eucalyptus and OpenNebula
NIMBUS Overview
• Open source tools
• Infrastructure as a Service
• Scientists' needs
• Grid (Globus) interface
Components
• Site Manager Service: the Workspace Service
  - Protocols: Web services or HTTP
  - Platforms: Apache Axis or Apache CXF
  - Frontends: WSRF or EC2 (describe/run/reboot/terminate instances; describe images; add/delete keypair)
• Image Repository: Cumulus
  - implementation of Amazon's S3 REST API
  - no image propagation
  - backend storage system support is growing
Nimbus Software Stack
[Diagram: Nimbus software stack architecture]