Seagate ExaScale HPC storage … possibility or pipedream?
Torben Kling Petersen, PhD — Principal Engineer, Seagate Systems Group, HPC. © 2015 Seagate, Inc. All Rights Reserved.
Agenda
• Seagate
• Some serious bragging …
• L-Series: disk enhancements, performance enhancements
• G-Series
• A-Series

THIS IS NOT A SALES PITCH!
About Seagate
• Stores a significant share of the world's digital information
• End-to-end cloud solutions: mobile > desktop > data center; 25,000 recoveries/month; 98%+ satisfied customers
• $3.8B in Q1 FY15 revenues
• Nearly 1 million hard drives built a day
• Supplier to 8 of the world's 10 largest cloud service providers
• Technology leader: 1st in SMR & HAMR technologies; 14,000+ patents; $900M R&D
• Complete hard drive portfolio: HDD, SAS, SATA, SSD, hybrid
• World-class integrated supply chain across 100+ locations
Seagate Device Portfolio — The Industry's Broadest Catalog of Storage Devices

ENTERPRISE & CLOUD
• PCIe SSD, SAS SFF SSD, SATA SFF SSD
• Enterprise SSDs, Enterprise 12 Gbit SAS SSDs
• 15K SFF HDD, 10K SFF HDD, 7.2K SFF HDD
• Performance 3.5" HDD
• Enterprise NAS, Traditional Nearline, Value Nearline
• Kinetic HDD, Archive HDD

CLIENT (DESKTOP & MOBILE)
• Desktop HDD / SSHD, Laptop HDD / SSHD
• Consumer systems
• SATA SSDs, Client SSD
• Ultra Mobile HDD / SSHD, Ultra Thin HDD

NAS
• NAS HDD

SURVEILLANCE
• Surveillance HDD

VIDEO & MEDIA
• Video 3.5" HDD, Video 2.5" HDD

COMPUTE ACCELERATORS
• PCIe-based flash accelerators
• NVMe, M2/M4 PCIe storage
• Market-leading custom ASICs, wear-leveling algorithms, etc.
ClusterStor Product Line Overview — The Complete Portfolio for HPC & Big Data

Solutions:
• Lustre solutions: up to 120+ GB/s per rack; Lustre 2.5
• Lustre Secure: up to 60 GB/s per rack; Lustre 2.5 on SELinux
• G200/300 with ISS (Spectrum Scale): up to 100 GB/s per rack; IBM SS 4.2
• A200 object store: tiered archive; up to 5 PB per rack
• ISV solutions: Scality, OpenStack Swift, Cleversafe

Enclosures:
• SP-3224: 24 × 2.5" drives or SSDs; dual controllers
• SP-3424: 24 × 8 TB drives; dual controllers
• CP-2/3584: up to 84 × 8 TB drives; dual controllers

Drives & flash:
• SAS: NL-SAS 8 TB, 7.2K RPM; HPC drive 4 TB, 10K RPM
• SATA: SMR drive, 8 TB
• SSD: SAS SSD 1.3 TB; NVMe 1.3 TB
• Flash accelerators: PCIe x16, NVMe, 10 GB/s
Deployed Seagate Lustre file systems include:
• 20 PB Lustre file system
• 130+ GB/s Lustre file system
• 140+ GB/s Lustre file system
• 55 PB Lustre file system
• 1.6 TB/sec Lustre file system
• 500+ GB/s Lustre file system
• 1 TB/sec Lustre file system
Real storage leadership …

Rank | Name | Computer | Site | Total Cores | Rmax (GFLOPS) | Rpeak (GFLOPS) | Power (kW) | File system | Size | Perf
-----|------|----------|------|-------------|----------------|-----------------|------------|-------------|------|-----
1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3,120,000 | 33,862,700 | 54,902,400 | 17,808 | Lustre/H2FS | 12.4 PB | ~750 GB/s
2 | Titan | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560,640 | 17,590,000 | 27,112,550 | 8,209 | Lustre | 10.5 PB | 240 GB/s
3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60GHz, custom interconnect | DOE/NNSA/LLNL | 1,572,864 | 17,173,224 | 20,132,659 | 7,890 | Lustre | 55 PB | 850 GB/s
4 | K computer | Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | RIKEN AICS | 705,024 | 10,510,000 | 11,280,384 | 12,659 | Lustre | 40 PB | 965 GB/s
5 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, custom interconnect | DOE/SC/Argonne National Lab. | 786,432 | 8,586,612 | 10,066,330 | 3,945 | GPFS | 28.8 PB | 240 GB/s
6 | Trinity | Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect | DOE/NNSA/LANL/SNL | 301,056 | 8,100,900 | 11,078,861 | — | Lustre | 76 PB | 1,600 GB/s
7 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115,984 | 6,271,000 | 7,788,853 | 2,325 | Lustre | 2.5 PB | 138 GB/s
8 | Shaheen II | Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect | KAUST, Saudi Arabia | 196,608 | 5,537,000 | 7,235,000 | 2,834 | Lustre | 17 PB | 500 GB/s
9 | Hazel Hen | Cray XC40, Xeon E5-2680v3 12C 2.5GHz, Aries interconnect | HLRS – Stuttgart | 185,088 | 5,640,170 | 7,403,520 | — | Lustre | 7 PB | ~100 GB/s
10 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi | TACC / Univ. of Texas | 462,462 | 5,168,110 | 8,520,112 | 4,510 | Lustre | 14 PB | 150 GB/s

n.b. NCSA Blue Waters: 24 PB, 1,100 GB/s (Lustre 2.1.3)
The Concept: fully integrated, fully balanced, no bottlenecks

ClusterStor Scalable Storage Unit:
• Intel Ivy Bridge or Haswell CPUs
• F/EDR, 100 GbE & 2× 40 GbE; all-SAS infrastructure
• SBB v3 form factor, PCIe Gen-3
• Embedded RAID & Lustre support

Embedded software stack (on the embedded server modules, GEM-USM):
• ClusterStor/Sonexion Manager
• Lustre File System 2.5/2.7 or Spectrum Scale 4.2
• Data Protection Layer (PD-RAID/GridRAID)
• Linux OS with Unified System Management
Lustre solutions
Seagate and Intel join forces on Lustre® — agreement signed February 19
• Seagate is transitioning from OpenSFS Lustre to Intel IEEL Lustre as the baseline
  – Beginning with the Lustre 2.7 release planned for 2H 2016
  – The Seagate distribution will contain Seagate-specific Lustre features, including more than 260 patches focused mainly on running Lustre at extreme scale
  – Lustre on ClusterStor/Sonexion is a superset distribution
• The Seagate Lustre development and support team will continue to support our customers
  – Largest support capability in the industry (Intel's Lustre team + Seagate's Lustre team)
  – Seagate support will work with customers and escalate any IEEL issues to Intel
• Seagate will continue to improve Lustre
  – Continue to test and improve the quality of Lustre 2.7+, particularly at scale
  – Seagate will develop some unique Lustre features
Current Lustre-based solutions line-up (ClusterStor Manager, GUI + CLI)
• CS-1500: modular 4U24s; 2U management unit; fits any 19" rack; Lustre 2.1+. Up to 10 GB/s per rack, up to 481 TB per rack.
• CS-9000: rack-based solution; 4U management unit; Lustre 2.5+. Up to 110+ GB/s per rack, up to 3.4 PB per rack.
• L300: rack-based solution; 2U management unit; 2U DNE MDS; Lustre 2.5+/2.7.x. Up to 10 GB/s per rack, up to 3.4 PB per rack.
CS-2/3584 — Scalable Storage Unit (SSU) – Lustre OSS
• Ultra HD CS-2/3584 SSU OSS: 5U84 enclosure, completely H/A
  – Two (2) trays of 42 HDDs each, with 6/12 Gbit SAS
  – Dual-ported 3.5" NL-SAS & SSD HDD support
  – 350+ MB/s SAS available bandwidth per HDD
  – Only a 5° C temperature delta with the drawer open
• Pair of H/A embedded application servers
  – CS9000 = 9-10 GB/s IOR over IB
  – L300 = 12-18 GB/sec IOR over IB
• IB F/EDR or 40/100 GbE network link
• Data protection/integrity (Grid-RAID, 8+2)
  – Grid-RAID: 2 OSSs per SSU, 1 OST per OSS
  – 2× SSD OSS journal disks for increased performance
• 64 usable data disks per SSU
  – 2 TB × 64 = 128 TB usable per SSU
  – 4 TB × 64 = 256 TB usable per SSU
  – 6 TB × 64 = 384 TB usable per SSU
  – 8 TB × 64 = 512 TB usable per SSU
ClusterStor L300 – Performance Density Configuration
• System Management Unit (SMU): 2U24, 2× servers (Laguna Seca)
• Metadata Management Unit (MMU): 2U24, 2× servers (Laguna Seca)
• Scalable Storage Units (SSUs): 5U84 disk enclosures, 2× OSS each (Laguna Seca); >12-18 GB/sec per 5U84
  – WIBS / journals / small-block acceleration
  – 7.2K & 10K RPM HDDs
• ClusterStor ToR & management switch, rack, cables, PDU; factory integration & test
New ClusterStor L300 embedded application server
• Mezzanine/daughter slot and connector, with EDR HBA card (EDR/Omni) installed
• PCI HBA: 12 Gbit SAS

New L300 Object Storage Server
• PCI slot for network HBA: Intel Omni-Path or Mellanox EDR
ClusterStor GRIDRAID

Feature | Benefit
--------|--------
De-clustered RAID 6: up to 400% faster to repair (rebuild of an 8 TB drive: MD-RAID ~33.3 hours, GridRAID ~6 hours) | Recover from a disk failure and return to full data protection faster
Repeal Amdahl's Law: the speed of a parallel system is gated by the performance of its slowest component | Minimizes the application impact on widely striped file performance
Minimized file system fragmentation | Improved allocation and layout maximizes sequential data placement
4-to-1 reduction in OSTs/NSDs | Simplifies scalability challenges
ClusterStor integrated management | CLI and GUI configuration, monitoring and management reduce OpEx

(Diagram: with traditional RAID, each OSS/NSD server rebuilds into a single parity-rebuild disk pool; with GridRAID, the rebuild is spread across all of the server's disk pools.)
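The quoted rebuild speed-up follows from simple arithmetic. The back-of-envelope sketch below is illustrative only: the ~67 MB/s effective rate and the parallelism factor of 6 are assumptions chosen to reproduce the slide's figures, not Seagate-published parameters.

```python
# Back-of-envelope check (assumption-laden, not Seagate's model) of the rebuild
# numbers quoted above. A classic MD-RAID rebuild funnels the whole 8 TB
# reconstruction through one spare drive, so it is capped by a single drive's
# sustained write rate; GridRAID spreads the rewrite across the distributed
# spare space of the whole array.
TB = 1e12

def rebuild_hours(capacity_tb: float, mb_per_s: float, parallel_drives: int = 1) -> float:
    return capacity_tb * TB / (mb_per_s * 1e6 * parallel_drives) / 3600

# ~67 MB/s effective rate reproduces the ~33 h MD-RAID figure for an 8 TB drive;
# spreading the same work over ~6 drives' worth of spare space lands near 6 h.
print(f"MD-RAID:  {rebuild_hours(8, 67):.1f} h")        # -> ~33.2 h
print(f"GridRAID: {rebuild_hours(8, 67, 6):.1f} h")     # -> ~5.5 h
```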
ClusterStor GridRAID declustered parity – geometry
• PD-RAID geometry for an array is defined as P drives (N+K+A), for example: 41 (8+2+2)
  – P is the total number of disks in the array
  – N is the number of data blocks per stripe
  – K is the number of parity blocks per stripe
  – A is the number of distributed spare disk drives
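To make the notation concrete, here is a minimal sketch of the capacity arithmetic implied by the P (N+K+A) definition. The helper function is hypothetical (not a Seagate API), and the usable figure is an estimate that ignores formatting and journal overhead.

```python
# Minimal sketch of PD-RAID (GridRAID) geometry arithmetic, using the
# P (N+K+A) notation from the slide. The formulas are straightforward
# consequences of the definition above.

def pdraid_geometry(p: int, n: int, k: int, a: int, drive_tb: float):
    """Sanity-check a P (N+K+A) layout and estimate usable capacity."""
    assert p >= n + k + a, "need at least one full stripe plus the spares"
    data_fraction = n / (n + k)      # share of each stripe holding data
    active_drives = p - a            # spare capacity is reserved, not usable
    usable_tb = active_drives * drive_tb * data_fraction
    return data_fraction, usable_tb

# The example geometry from the slide: 41 (8+2+2) with 8 TB drives.
frac, usable = pdraid_geometry(41, 8, 2, 2, 8.0)
print(f"data fraction: {frac:.0%}, usable: {usable:.0f} TB per array")
# -> data fraction: 80%, usable: 250 TB per array (two such arrays per 84-slot SSU)
```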
L300 file system performance – rack aggregates/totals, expansion racks (8 TB HDDs)

SSUs | # drives (HDDs/SSDs) | 8 TB HDD TBs (usable/raw) | IOR perf GB/s* | Power kW
-----|----------------------|---------------------------|----------------|---------
SSU #0 | 82 / 2 | 512 / 656 | up to 12 | 4.0
SSU #1 | 164 / 4 | 1024 / 1312 | up to 24 | 5.7
SSU #2 | 246 / 6 | 1536 / 1968 | up to 36 | 7.4
SSU #3 | 328 / 8 | 2048 / 2624 | up to 48 | 9.2
SSU #4 | 410 / 10 | 2560 / 3280 | up to 60 | 10.9
SSU #5 | 492 / 12 | 3072 / 3936 | up to 72 | 12.6
SSU #6 | 574 / 14 | 3580 / 4592 | up to 84 | 14.9
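Each added SSU contributes a fixed increment, so the rows scale linearly. The quick sketch below reproduces the aggregates from the table's own per-SSU increments; the small usable-capacity mismatch at 7 SSUs (3584 vs. the slide's 3580 TB) is presumably rounding on the slide.

```python
# Illustrative check of the rack aggregates above. Each SSU adds 82 HDDs,
# 2 SSDs, 512 TB usable / 656 TB raw (8 TB drives), and up to 12 GB/s IOR;
# these per-SSU increments are read straight off the table.
PER_SSU = {"hdds": 82, "ssds": 2, "usable_tb": 512, "raw_tb": 656, "ior_gbs": 12}

def rack_totals(ssu_count: int) -> dict:
    """Aggregate figures for a rack holding `ssu_count` SSUs."""
    return {key: value * ssu_count for key, value in PER_SSU.items()}

print(rack_totals(7))
# -> {'hdds': 574, 'ssds': 14, 'usable_tb': 3584, 'raw_tb': 4592, 'ior_gbs': 84}
```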
ClusterStor L300 HPC Disk Drive
Enterprise Performance 3.5" HDD — high-level product description
ClusterStor L300 HPC 4 TB SAS HDD — an HPC industry first; best mixed-application-workload value
• Performance leader: world-beating performance over other 3.5" HDDs, speeding data ingest, extraction and access
• Capacity strong: 4 TB of storage for big-data applications
• Reliable workhorse: 2M-hour MTBF and 750 TB/year workload ratings, for reliability under the toughest workloads your users throw at it
• Power efficient: Seagate's PowerBalance feature provides significant power benefits for minimal performance tradeoffs

(Chart: CS HPC HDD vs. NL 7.2K RPM HDD on random writes (4K IOPS, WCD), random reads (4K Q16 IOPS) and sequential data rate (MB/s), y-axis 0-600; the HPC drive leads in all three.)
L300 file system performance – rack aggregates/totals, HPC drive (4 TB HDDs)

SSUs | # drives (HDDs/SSDs) | 4 TB HDD TBs (usable/raw) | IOR perf GB/s* | Power kW
-----|----------------------|---------------------------|----------------|---------
SSU #0 | 82 / 2 | 256 / 320 | up to 18 | 4.0
SSU #1 | 164 / 4 | 512 / 640 | up to 36 | 5.7
SSU #2 | 246 / 6 | 768 / 960 | up to 54 | 7.4
SSU #3 | 328 / 8 | 1024 / 1280 | up to 72 | 9.2
SSU #4 | 410 / 10 | 1280 / 1600 | up to 90 | 10.9
SSU #5 | 492 / 12 | 1536 / 1920 | up to 108 | 12.6
SSU #6 | 574 / 14 | 1792 / 2240 | up to 126 | 14.9
The “Data Capacitor” — utilizing Seagate flash technology
Seagate 1200.2 SAS SSD — enterprise-focused feature set
• Enterprise-grade performance & features
  – 24 Gb/s active-active (high I/O performance)
  – Wide capacity range (200 GB to 4 TB class) with multiple endurance options in one platform
  – Multi-host, dual-port support: “no single point of failure”
• Enterprise-grade data protection
  – T10-DIF end-to-end ECC, internal and external: no danger of ‘silent data corruption’
  – Power-loss data protection (PLDP) provides a mechanism to save data/operations in process
  – Encryption to NSA standard; SED and FIPS compliance prevents unauthorized access to stored data
• Enterprise-grade endurance
  – 5-year drive life, even under write-intensive workloads

Best-fit applications:
• Server virtualization — e.g. VMware vSphere, Microsoft Hyper-V, Linux KVM, Xen
• Databases — e.g. OLTP, Oracle, SAP, SQL Server, Exchange, NoSQL, MySQL, MongoDB
• HPC applications — e.g. Lustre, Spectrum Scale, BeeGFS, etc.
• Software-defined storage — e.g. Microsoft Storage Spaces, Nexenta, VMware vSAN
Nytro® PCIe Flash Accelerator Cards — lowest latency and highest efficiency
• Latency-optimized (XP6500)
  – Controller with DRAM for minimized latency
  – Consistently high performance and low latency
  – Best fit: transactional databases
• Density-optimized (XP6302)
  – Maximum capacity & performance within a form factor
  – Performance scales with queue depth / thread count
  – Best fit: I/O-intensive and virtualized workloads, dense environments
• Thermally-optimized (XP6209 & XP6210)
  – Single-planar, NAND-down design for optimal cooling
  – Best fit: read-intensive and power- or thermally-sensitive enterprise applications
Nytro® XF1440 / XM1440 PCIe SSDs — balanced power and performance
• Innovative data-center storage solutions
  – PCIe Gen3 delivers higher sustained transfer speeds
  – NVMe protocol for consistent response times
  – Multiple form factors: SFF 2.5" 7 mm and M.2
  – Addresses read-intensive and mixed workloads
• Reducing total cost of ownership
  – $/Watt cost advantage
  – Power/performance-optimized solutions
• Best fit: direct-attached storage

ClusterStor Spectrum Scale – 5U84 building block
• 8 GB/sec per 5U84 (clustered); ~20K file creates per second; ~2 billion files
• 2× embedded NSD (MD) servers per enclosure, each with:
  – a metadata SSD pool: ~10K file creates/sec, ~1 billion files, 2× 800 GB SSDs
  – a user data pool: ~4 GB/sec, 40× HDDs
• ClusterStor ToR & management switch, rack, cables, PDU; factory integration & test; up to seven 5U84s in the base rack
ClusterStor Spectrum Scale – Performance Density rack configuration

Key components:
• ClusterStor Manager Node (2U enclosure): 2 HA management servers, 10 drives
• 2 management switches
• 5U84 enclosures configured as NSDs + disk: 2 HA embedded NSD servers, 76 to 80 7.2K RPM HDDs, 4 to 8 SSDs
• 42U reinforced rack: custom cable harness; up to 7 enclosures in each rack (base + expansion)

Performance: up to 56 GB/sec per rack

(Rack diagram: base and expansion racks, each with redundant ETN management switches and dual high-speed networks; the CSM node sits in the base rack, with seven NSD enclosures per rack.)
ClusterStor Spectrum Scale – Capacity Optimized rack configuration

Key components:
• ClusterStor Manager Node (2U enclosure): 2 HA management servers, 10 drives; 2 management switches
• 5U84 enclosures configured as NSDs + disk: 2 HA embedded NSD servers, 76 to 80 7.2K RPM HDDs, 4 to 8 SSDs
• 5U84 enclosures configured as JBODs: 84 7.2K RPM HDDs, SAS-connected to the NSD servers in a 1:1 ratio
• 42U reinforced rack

Performance: up to 32 GB/sec per rack

(Rack diagram: base and expansion racks alternate NSD enclosures and JBODs, with redundant ETN management switches and dual high-speed networks.)
Introducing ClusterStor G300
• Highest performance: de-clustered RAID solution for Spectrum Scale; up to 112 GB/s per-rack throughput; EDR / Omni-Path / 100 GbE plus dual-bonded 40 GbE; all-flash-array SSU addressing demanding application workloads
• More from our ClusterStor HPC drive: next-generation SSU enclosure for the highest levels of disk-based throughput
• Pre-tested and pre-configured integrated solution: created from converged storage building blocks, assuring fast, accurate installation and easy, modular expansion
Object Storage based archiving solutions
ClusterStor A200 Active Archive – product overview
• Combined with ClusterStor HSM or TSM to provide automatic, policy-driven data migration & retrieval
• Object API & a portfolio of network-based interfaces (POSIX, pNFS, CIFS, S3, HDF5, non-POSIX, …)
• Unlimited scalability (file system size up to 2^214 bytes)
• High-density storage: up to 3.6 PB* usable per rack
• Utilizes network erasure coding to provide high levels of data availability and data durability
• No single point of failure; resilient across single maintenance events
• Dual 10 Gb Ethernet node connectivity; IB as an option
• Packaged as an upgrade to ClusterStor: Lustre 2.5.x or Spectrum Scale 4.1, plus HSM, in front of the CS A200

* moving to 5+ PB per rack in late 2016
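As a rough idea of what the erasure-coded capacity math looks like: the sketch below is an assumption-laden illustration. The 8+2 code width and the seven-enclosure rack count are guesses that happen to reproduce the ~3.6 PB usable figure; the slide does not state the A200's actual geometry.

```python
# Hedged sketch of usable-capacity arithmetic for a network-erasure-coded
# object store such as the A200. The 8+2 layout and 7 enclosures per rack
# are illustrative assumptions, not published A200 parameters.
def usable_pb(enclosures: int, drives_per_enclosure: int, drive_tb: float,
              data_shards: int, parity_shards: int) -> float:
    raw_tb = enclosures * drives_per_enclosure * drive_tb
    return raw_tb * data_shards / (data_shards + parity_shards) / 1000

# 7 x 5U84 enclosures, 82 x 8 TB SMR drives each, 8+2 erasure coding:
print(f"{usable_pb(7, 82, 8.0, 8, 2):.2f} PB usable")   # -> 3.67 PB
# An N+K code also tolerates any K simultaneous shard failures, which is
# where the "resilient across single maintenance events" claim comes from.
```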
ClusterStor A200: resiliency built in
• Redundant TOR switches
  – Combine data and management network traffic
  – VLANs used to segregate network traffic
  – 10 GbE, with 40 GbE TOR uplinks
• Management unit: 1× 2U24 enclosure, 2× embedded controllers
• Storage units: Titan v2 5U84 enclosures (6 is the minimum configuration)
  – 82× SMR SATA HDDs (8 TB) each
  – Single embedded storage controller
  – Dual 10 GbE network connections
  – Resilient to 2 SSU failures (12 SSUs minimum)
• 42U rack with wiring loom & power cables
  – Dual PDUs
  – 2U spare space reserved for future configuration options
  – Blanking plates as required
Economic benefits of SMR drives, backed by a Seagate object store
• Shingled magnetic recording increases the capacity of a platter by 30-40%
  – Write tracks are overlapped by up to 50% of the write width
  – The read head is much smaller and can reliably read the narrower tracks
• SMR drives are optimal for object stores, as most data is static/WORM
  – Updates destroy a portion of the next track (diagram: read head vs. write head)
  – Updates therefore require special intelligence and may be expensive in terms of performance
  – Wide tracks in each band are often reserved for updates
• The CS A200 manages SMR drives directly to optimize workflow & caching
  – The A200 avoids the “read-update-write” problem by using copy-on-write
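To illustrate what that copy-on-write dodge looks like in practice, here is a minimal, purely hypothetical sketch (not Seagate's implementation): updated objects are appended to the current sequential zone and the old extent is merely marked stale for later reclamation, so no shingled track is ever rewritten in place.

```python
# Hedged sketch: how copy-on-write sidesteps read-update-write on SMR media.
# Overwriting a shingled track would damage the track shingled on top of it,
# so updates append to the open zone and stale extents are reclaimed later.

class SmrZoneStore:
    def __init__(self, zone_size: int = 256 * 2**20):
        self.zone_size = zone_size
        self.zones = [[]]        # each zone is strictly append-only
        self.index = {}          # object id -> (zone number, slot in zone)
        self.stale = []          # superseded extents awaiting reclamation

    def put(self, obj_id: str, data: bytes) -> None:
        if obj_id in self.index:
            self.stale.append(self.index[obj_id])   # COW: never rewrite in place
        zone = self.zones[-1]
        if sum(len(d) for _, d in zone) + len(data) > self.zone_size:
            self.zones.append([])                   # open a fresh sequential zone
            zone = self.zones[-1]
        zone.append((obj_id, data))
        self.index[obj_id] = (len(self.zones) - 1, len(zone) - 1)

    def get(self, obj_id: str) -> bytes:
        zone_no, slot = self.index[obj_id]
        return self.zones[zone_no][slot][1]

store = SmrZoneStore()
store.put("obj1", b"v1")
store.put("obj1", b"v2")             # update appends; the v1 extent becomes stale
assert store.get("obj1") == b"v2"
```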