Accelerating Ceph for Database Workloads with an All-PCIe SSD Cluster
Reddy Chagam – Principal Engineer & Chief SDS Architect
Tushar Gohad – Senior Staff Engineer
Intel Corporation, April 19, 2016
Acknowledgements: Orlando Moreno, Dan Ferber (Intel)
Legal Disclaimer Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://intel.com.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: Ceph v0.94.3 Hammer, v10.1.2 Jewel release, CentOS 7.2, 3.10-327 kernel, CBT used for testing and data acquisition.
OSD system config: 2x Intel Xeon E5-2699 v4 @ 2.20 GHz, 44 cores w/ HT, 46080KB cache, 128GB DDR4. Each system with 4x P3700 800GB NVMe SSDs, partitioned into 4 OSDs each, 16 OSDs total per node.
FIO client systems: 2x Intel Xeon E5-2699 v3 @ 2.30 GHz, 36 cores w/ HT, 46080KB cache, 128GB DDR4.
Ceph public and cluster networks: 2x 10GbE each. FIO 2.2.8 with the LibRBD engine; Sysbench 0.5 for MySQL testing. Tests run by the Intel DCG Storage Group in an Intel lab. Ceph configuration and CBT YAML file provided in the backup slides. For more information go to http://www.intel.com/performance.
Intel, Intel Inside and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. © 2016 Intel Corporation.
Agenda
• Transition to NVMe flash
• NVMe architecture with Ceph
• Database & Ceph – leading flash use case
• The “All NVMe” high-density Ceph cluster
• MySQL workload performance results
• Summary and next steps
Storage Evolution (Yesterday → Today → Near Term)
• Memory & Storage: 3D XPoint™ technology-based Apache Pass (AEP) for DDR4 – revolutionary storage class memory (~100X lower latency, ~1,000X the size of data)
• Storage: NAND-based Intel PCIe SSDs for NVMe today; 3D NAND-based Intel PCIe SSDs ramping in 2016; 3D XPoint™ technology-based Optane™ SSD for NVMe – the world’s fastest NVMe SSD
Next-generation NVM enables the world’s fastest NVMe SSD and revolutionary storage class memory. NVMe SSDs accelerate performance for latency-sensitive workloads on Ceph.
Data Center Form Factors for NVMe
• U.2 2.5in (SFF-8639), 7mm and 15mm: makes up the majority of SSDs sold today because of ease of deployment, hot-plug, serviceability, and small form factor
• M.2, 80mm and 110mm lengths: smallest PCIe footprint; used for boot or for maximum storage density
• Add-in-card (AIC): maximum system compatibility with existing servers and the most reliable compliance program; higher power envelope, and options for height and length
Intel Platforms – Tick-Tock Development Model
• Thurley platform (Tylersburg PCH), Intel® microarchitecture codename Nehalem: Nehalem (45nm, tock) → Westmere (32nm, tick)
• Romley platform (Patsburg PCH), Intel® microarchitecture codename Sandy Bridge: Sandy Bridge (32nm, tock) → Ivy Bridge (22nm, tick)
• Grantley platform, today (Wellsburg PCH), Intel® microarchitecture codename Haswell: Haswell (22nm, tock) → Broadwell (14nm, tick)
A “tock” is a new microarchitecture; a “tick” is a new process technology.
Xeon E5 v4 is socket-compatible with the v3 series and improves Ceph performance.
Ceph Workloads
[Chart: workloads plotted by storage performance (IOPS, throughput) against storage capacity (PB), spanning block and object storage; the NVM focus is on the high-performance block end.]
Workloads shown: Databases, Cloud DVR, Remote Disks, Test & Dev, BigData, CDN, VDI, Boot Volumes, HPC, Mobile Content Depot, Enterprise Dropbox, App Storage, Backup & Archive.
Ceph – NVM Usages
[Diagram: NVM usage points across a Ceph deployment, with clients connected to RADOS nodes over 10-25 GbE via the RADOS protocol.]
• Clients: virtual machine guest applications (today’s focus) via the Qemu/Virtio hypervisor and librbd/RADOS; bare metal via kernel RBD; containers via kernel RBD – with NVM used for client caching with write-through
• RADOS node, production – FileStore: NVM for the OSD journal, FileStore caching, and the file system holding OSD data
• RADOS node, tech preview – BlueStore (today’s focus): RocksDB over BlueRocksEnv/BlueFS, with NVM for OSD data, metadata, and caching
NVM roles: journaling, read cache, client-side write-through cache, and OSD data.
Ceph and Percona Server MySQL Integration
[Diagram: MySQL running in guest VMs (Qemu/Virtio hypervisor with librbd/RADOS), in Linux containers, and on bare-metal MySQL hosts (kernel RBD/RADOS), all connected over an IP fabric to a Ceph storage cluster of SSD-backed OSD and MON nodes.]
Deployment considerations:
• Bootable Ceph volumes (OS & MySQL data)
• MySQL RBD volumes (all in one, or separate)
Configurations:
• Good: NVMe SSD for journal/cache, HDDs as OSD data drives
• Better: NVMe SSD as journal, high-capacity SATA or 3D NAND NVMe SSD as data drive
• Best: all NVMe SSDs
An “All-NVMe” High-Density Ceph Cluster Configuration
• 5-node all-NVMe Ceph cluster: dual-socket Xeon E5-2699 v4 @ 2.2GHz, 44 cores w/ HT, 128GB DDR4, Supermicro 1028U-TN10RT+; 4x NVMe per node, 4 Ceph OSDs per NVMe (16 OSDs per node)
• CentOS 7.2, 3.10-327 kernel, Ceph v10.1.2 with BlueStore and async messenger
• 10x client systems + 1x Ceph MON: dual-socket Xeon E5-2699 v3 @ 2.3GHz, 36 cores w/ HT, 128GB DDR4
• Networking: 2x 10GbE public, 2x 10GbE cluster
• Test-set 1: FIO clients (FIO 2.8)
• Test-set 2: Docker containers with krbd – MySQL DB servers and Sysbench clients (Sysbench 0.5)
• DB containers: 16 vCPUs, 32GB mem, 200GB RBD volume, 100GB MySQL dataset, 25GB InnoDB buffer cache (25%)
• Client containers: 16 vCPUs, 32GB RAM
Multi-Partitioning Flash Devices
• High-performance NVMe devices are capable of high parallelism at low latency – DC P3700 800GB raw performance: 460K read IOPS and 90K write IOPS at QD=128
• Partitioning introduces the concept of multiple OSDs on the same physical device, with conceptually similar CRUSH map data-placement rules as managing disks in an enclosure
• Reduces lock contention within a single OSD process; by using multiple OSD partitions, Ceph performance scales linearly
• Lower latency at all queue depths, with the biggest impact on random reads
• High resiliency of “Data Center”-class NVMe devices: power-loss protection, full data-path protection, device-level telemetry, and at least 10 drive writes per day
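The per-device layout above (4 OSDs per P3700) can be sketched as a small generator that emits the partitioning commands; the device path is a placeholder, and the equal split assumes the partition count divides 100. Review the emitted commands before piping them to a shell.

```shell
# Emit parted commands that would split one NVMe device into N equal
# GPT partitions, one per OSD (sketch only -- device path is an example).
make_osd_partitions() {
  local dev=$1 count=$2
  local step=$((100 / count))      # percent of the device per OSD
  echo "parted -s $dev mklabel gpt"
  local i start end
  for i in $(seq 1 "$count"); do
    start=$(( (i - 1) * step ))
    end=$(( i * step ))
    echo "parted -s $dev mkpart osd-$i ${start}% ${end}%"
  done
}

# Four partitions per P3700, as in the test cluster (16 OSDs per node):
make_osd_partitions /dev/nvme0n1 4
```

Each emitted partition would then be prepared as an independent OSD, which is what reduces lock contention inside a single OSD process.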
Partitioning Multiple OSDs per NVMe
[Charts: (1) Latency vs IOPS, 4K random read, 1/2/4 OSDs per device – 5 nodes, 20/40/80 OSDs, Intel DC P3700, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc; average latency up to ~12 ms across 0–1.2M IOPS. (2) Single-node CPU utilization, 4K random reads @ QD32, 4/8/16 OSDs (single/double/quad OSDs per device), same hardware.]
Multiple OSDs per NVMe result in higher performance, lower latency, and better CPU utilization.
4K Random Read/Write Performance and Latency (Baseline FIO Test)
[Chart: IO-depth scaling, latency vs IOPS for 100% read, 100% write, and 70/30 4K random mix – 5 nodes, 80 OSDs, dual-socket Xeon E5-2699 v4, 128GB RAM, 2x10GbE, Ceph 10.1.2 w/ BlueStore and async messenger, 6 RBD FIO clients.]
• ~1.4M 100% 4K random read IOPS @ ~1 ms average latency
• ~1.6M 100% 4K random read IOPS @ ~2.2 ms average latency
• ~560K 70/30% (OLTP) random IOPS @ ~3 ms average latency
• ~220K 100% 4K random write IOPS @ ~5 ms average latency
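The baseline numbers above come from fio's librbd engine driven by CBT. A minimal job generator in that spirit is sketched below; the pool and image names are assumptions, and fio must be built with rbd support for the emitted job to actually run.

```shell
# Generate an fio job description for the librbd engine matching the
# 4K random-read baseline (sketch; pool/image names are placeholders).
write_fio_job() {
  local pool=$1 image=$2 qd=$3
  cat <<EOF
[global]
ioengine=rbd
clientname=admin
pool=$pool
rbdname=$image
rw=randread
bs=4k
iodepth=$qd
direct=1
time_based=1
runtime=300
ramp_time=300

[rbd-randread]
EOF
}

# Print a QD=32 job; redirect to a file and run with: fio <file>
write_fio_job rbd fio-vol1 32
```

The 300s runtime and 300s ramp mirror the CBT YAML in the backup slides; sweeping `iodepth` over 4–128 reproduces the IO-depth scaling curve.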
Sysbench MySQL OLTP Performance (100% SELECT)
[Chart: Sysbench thread scaling, latency vs QPS, 100% read (point SELECTs) – 5 nodes, 80 OSDs, dual-socket Xeon E5-2699 v4, 128GB RAM, 2x10GbE, Ceph 10.1.2 w/ BlueStore and async messenger, 20 Docker-rbd Sysbench clients (16 vCPUs, 32GB).]
• ~55,000 QPS per client with 2 Sysbench threads
• 1 million QPS aggregate (20 clients) @ ~11 ms average latency
• ~1.3 million QPS aggregate (20 clients) with 8 Sysbench threads
InnoDB buffer pool = 25%, SQL dataset = 100GB.
Sysbench MySQL OLTP Performance (100% UPDATE, 70/30% SELECT/UPDATE)
[Chart: Sysbench thread scaling, latency vs QPS, 100% write (index UPDATEs) and 70/30% OLTP – 5 nodes, 80 OSDs, dual-socket Xeon E5-2699 v4, 128GB RAM, 2x10GbE, Ceph 10.1.2 w/ BlueStore and async messenger, 20 Docker-rbd Sysbench clients (16 vCPUs, 32GB).]
• ~5,500 QPS with 1 Sysbench client (2-4 threads)
• ~25,000 70/30% OLTP QPS with 1 Sysbench client (4-8 threads)
• ~400K 70/30% OLTP QPS @ ~50 ms average latency (aggregate)
• ~100K write QPS @ ~200 ms average latency (aggregate, 20 clients)
InnoDB buffer pool = 25%, SQL dataset = 100GB.
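The thread-scaling sweeps behind these curves can be expressed as a loop that prints one sysbench invocation per thread count (pipe the output to `sh` to run). Host, port, and credentials are taken from the command listings in the backup slides; treat them as assumptions for your environment.

```shell
# Print the 100%-read sysbench sweep, one invocation per thread count
# (sketch; host/port/credentials mirror the backup-slide commands).
oltp_read_sweep() {
  common="--mysql-user=sbtest --mysql-password=sbtest --mysql-db=sbtest"
  common="$common --oltp-tables-count=32 --oltp-table-size=14000000"
  common="$common --test=/root/benchmarks/sysbench/sysbench/tests/db/oltp.lua"
  common="$common --oltp-read-only=on --oltp-point-selects=10 --rand-type=uniform"
  common="$common --max-time=300 --max-requests=0 --percentile=99"
  for threads in 1 2 4 8 16 32; do
    echo "sysbench --mysql-host=172.17.0.1 --mysql-port=13306 $common --num-threads=$threads run"
  done
}
oltp_read_sweep
```

Running the same sweep once per client container and summing the reported QPS gives the aggregate figures plotted above.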
Summary & Conclusions
• NVMe flash storage suits low-latency workloads
• Ceph makes a compelling case for database workloads
• With Ceph, 1.4 million random IOPS at ~1 ms latency is achievable in 5U today – and Ceph performance is only getting better
• Using Xeon E5 v4 standard high-volume servers and Intel NVMe SSDs, you can now deploy a high-performance Ceph cluster for database workloads
• Next steps: evaluation on a large-scale cluster; Ceph community collaboration on improving write latency
Thank you- Any Questions? Refer to backup slides for additional configuration and details
Backup
Intel Ceph Contributions (2014–2016: Giant*, Hammer, Infernalis, Jewel)
• CRUSH placement algorithm improvements (straw2 bucket type)
• New key/value store backend (RocksDB)
• Cache-tiering with SSDs (read and write support)
• Erasure coding support with ISA-L
• RADOS I/O hinting (35% better EC write performance)
• Virtual Storage Manager (VSM) open sourced
• CeTune open sourced
• Industry-first Ceph cluster to break 1 million 4K random IOPS
• BlueStore backend optimizations for NVM; BlueStore SPDK optimizations
• PMStore (NVM-optimized backend based on libpmem)
• Client-side block cache (librbd)
• RGW and BlueStore compression, encryption (w/ ISA-L, QAT backend)
Configuration Detail – ceph.conf

[global]
enable experimental unrecoverable data corrupting features = bluestore rocksdb
osd objectstore = bluestore
ms_type = async
rbd readahead disable after bytes = 0
rbd readahead max bytes = 4194304
bluestore default buffered read = true
auth client required = none
auth cluster required = none
auth service required = none
filestore xattr use omap = true
public network = 192.168.142.0/24, 192.168.143.0/24
cluster network = 192.168.144.0/24, 192.168.145.0/24
log file = /var/log/ceph/$name.log
log to syslog = false
mon compact on trim = false
osd pg bits = 8
osd pgp bits = 8
mon pg warn max object skew = 100000
mon pg warn min per osd = 0
mon pg warn max per osd = 32768
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
perf = true
mutex_perf_counter = true
throttler_perf_counter = false
rbd cache = false
Configuration Detail – ceph.conf (continued)

[mon]
mon data = /home/bmpa/tmp_cbt/ceph/mon.$id
mon_max_pool_pg_num = 166496
mon_osd_max_split_count = 10000
mon_pg_warn_max_per_osd = 10000

[mon.a]
host = ft02
mon addr = 192.168.142.202:6789

[osd]
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k,delaylog
osd_mkfs_options_xfs = -f -i size=2048
osd_mkfs_type = xfs
osd_op_threads = 32
filestore_queue_max_ops = 5000
filestore_queue_committing_max_ops = 5000
journal_max_write_entries = 1000
journal_queue_max_ops = 3000
objecter_inflight_ops = 102400
filestore_wbthrottle_enable = false
filestore_queue_max_bytes = 1048576000
filestore_queue_committing_max_bytes = 1048576000
journal_max_write_bytes = 1048576000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000
filestore_max_sync_interval = 10
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
filestore_op_threads = 6
Configuration Detail – CBT YAML File

cluster:
  user: "bmpa"
  head: "ft01"
  clients: ["ft01", "ft02", "ft03", "ft04", "ft05", "ft06"]
  osds: ["hswNode01", "hswNode02", "hswNode03", "hswNode04", "hswNode05"]
  mons:
    ft02:
      a: "192.168.142.202:6789"
  osds_per_node: 16
  fs: xfs
  mkfs_opts: '-f -i size=2048 -n size=64k'
  mount_opts: '-o inode64,noatime,logbsize=256k'
  conf_file: '/home/bmpa/cbt/ceph.conf'
  use_existing: False
  newstore_block: True
  rebuild_every_test: False
  clusterid: "ceph"
  iterations: 1
  tmp_dir: "/home/bmpa/tmp_cbt"
  pool_profiles:
    2rep:
      pg_size: 8192
      pgp_size: 8192
      replication: 2
benchmarks:
  librbdfio:
    time: 300
    ramp: 300
    vol_size: 10
    mode: ['randrw']
    rwmixread: [0,70,100]
    op_size: [4096]
    procs_per_volume: [1]
    volumes_per_client: [10]
    use_existing_volumes: False
    iodepth: [4,8,16,32,64,128]
    osd_ra: [4096]
    norandommap: True
    cmd_path: '/usr/local/bin/fio'
    pool_profile: '2rep'
    log_avg_msec: 250
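One plausible way to arrive at the `pg_size: 8192` in the pool profile above: a common Ceph rule of thumb targets on the order of 100–200 placement groups per OSD, divided by the replication factor and rounded up to a power of two. With this cluster's 80 OSDs, 2x replication, and a 200-PG/OSD target, that lands on 8192 (the exact target used by the authors is an assumption here).

```shell
# PG count rule of thumb: osds * target_pgs_per_osd / replicas,
# rounded up to the next power of two.
pg_count() {
  local osds=$1 replicas=$2 target_per_osd=$3
  local raw=$(( osds * target_per_osd / replicas ))
  local pg=1
  while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
  echo "$pg"
}

pg_count 80 2 200   # → 8192
```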
Storage Node Diagram
Two CPU sockets:
• Socket 0: 2 NVMes, Intel X540-AT2 (10Gbps), 64GB (8x 8GB 2133 DIMMs)
• Socket 1: 2 NVMes, 64GB (8x 8GB 2133 DIMMs)
Explore additional optimizations using cgroups and IRQ affinity.
High-Performance Ceph Node – Hardware Building Blocks
• Generally available server designs built for high density and high performance
• High-density 1U standard high-volume server: dual-socket 3rd-generation Xeon E5 (2699 v3); 10 front-removable 2.5” form-factor drive slots with SFF-8639 connectors; multiple 10Gb network ports, additional slots for 40Gb networking
• Intel DC P3700 NVMe drives are available in the 2.5” drive form factor, allowing easier servicing in a datacenter environment
MySQL configuration file (my.cnf)

[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0

[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
datadir = /data
basedir = /usr
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address = 0.0.0.0
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 16
query_cache_limit = 1M
query_cache_size = 16M
log_error = /var/log/mysql/error.log
expire_logs_days = 10
max_binlog_size = 100M
performance_schema = off
innodb_buffer_pool_size = 25G
innodb_flush_method = O_DIRECT
innodb_log_file_size = 4G
innodb_file_per_table
innodb_checksums = 0
innodb_flush_log_at_trx_commit = 0
innodb_write_io_threads = 8
innodb_page_cleaners = 16
innodb_read_io_threads = 8
max_connections = 50000

[mysqldump]
quick
quote-names
max_allowed_packet = 16M

[mysql]

!includedir /etc/mysql/conf.d/
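The 25G `innodb_buffer_pool_size` above is deliberately sized at 25% of the 100GB MySQL dataset, so that most reads miss the buffer pool and exercise the Ceph storage path rather than DRAM. The sizing is simple arithmetic:

```shell
# Buffer pool sizing used in the test setup: a fixed percentage of the
# dataset, kept small on purpose to force IO down to Ceph.
buf_pool_gb() {
  local dataset_gb=$1 pct=$2
  echo $(( dataset_gb * pct / 100 ))
}

buf_pool_gb 100 25   # → 25  (GB, matching innodb_buffer_pool_size = 25G)
```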
Sysbench commands

prepare:
sysbench --test=/root/benchmarks/sysbench/sysbench/tests/db/parallel_prepare.lua \
  --mysql-user=sbtest --mysql-password=sbtest --oltp-tables-count=32 --num-threads=128 \
  --oltp-table-size=14000000 --mysql-table-engine=innodb --mysql-port=$1 --mysql-host=172.17.0.1 run

READ:
sysbench --mysql-host=${host} --mysql-port=${mysql_port} --mysql-user=sbtest --mysql-password=sbtest \
  --mysql-db=sbtest --mysql-engine=innodb --oltp-tables-count=32 --oltp-table-size=14000000 \
  --test=/root/benchmarks/sysbench/sysbench/tests/db/oltp.lua --oltp-read-only=on \
  --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 \
  --oltp-index-updates=0 --oltp-point-selects=10 --rand-type=uniform --num-threads=${threads} \
  --report-interval=60 --warmup-time=400 --max-time=300 --max-requests=0 --percentile=99 run

WRITE:
sysbench --mysql-host=${host} --mysql-port=${mysql_port} --mysql-user=sbtest --mysql-password=sbtest \
  --mysql-db=sbtest --mysql-engine=innodb --oltp-tables-count=32 --oltp-table-size=14000000 \
  --test=/root/benchmarks/sysbench/sysbench/tests/db/oltp.lua --oltp-read-only=off \
  --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 \
  --oltp-index-updates=100 --oltp-point-selects=0 --rand-type=uniform --num-threads=${threads} \
  --report-interval=60 --warmup-time=400 --max-time=300 --max-requests=0 --percentile=99 run
Docker Commands

Database containers:
docker run -ti --privileged --volume /sys:/sys --volume /dev:/dev -d -p 2201:22 -p 13306:3306 \
  --cpuset-cpus="1-16,36-43" -m 48G --oom-kill-disable --name database1 ubuntu:14.04.3_20160414db /bin/bash

Client containers:
docker run -ti -p 3301:22 -d --name client1 ubuntu:14.04.3_20160414-sysbench /bin/bash
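The port mapping above generalizes per container index (database1 → SSH 2201, MySQL 13306; database2 → 2202, 13307, and so on). A small generator, assuming that numbering convention and omitting the per-container `--cpuset-cpus` pinning, which varies by socket layout:

```shell
# Emit the docker run command for database container N
# (sketch; port convention assumed from the database1 example above).
db_container_cmd() {
  local n=$1
  local ssh_port=$(( 2200 + n ))
  local mysql_port=$(( 13305 + n ))
  echo "docker run -ti --privileged --volume /sys:/sys --volume /dev:/dev -d -p $ssh_port:22 -p $mysql_port:3306 -m 48G --oom-kill-disable --name database$n ubuntu:14.04.3_20160414db /bin/bash"
}

db_container_cmd 1
db_container_cmd 2
```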
*Other names and brands may be claimed as the property of others.
RBD Commands

ceph osd pool create database 8192 8192
rbd create --size 204800 vol1 --pool database --image-feature layering
rbd snap create database/vol1@master
rbd snap ls database/vol1
rbd snap protect database/vol1@master
rbd clone database/vol1@master database/vol2
rbd feature disable database/vol2 exclusive-lock object-map fast-diff deep-flatten
rbd flatten database/vol2
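The protect/clone/flatten flow above stamps out one volume per database container from the golden `vol1@master` snapshot. Repeating it for many containers is just a loop; the generator below prints the commands rather than running them (volume naming is assumed to continue the vol2, vol3, … pattern).

```shell
# Print the rbd commands to clone N container volumes from the master
# snapshot (sketch; names follow the vol2/vol3/... pattern above).
clone_cmds() {
  local count=$1
  local i=2
  while [ "$i" -le $(( count + 1 )) ]; do
    echo "rbd clone database/vol1@master database/vol$i"
    echo "rbd feature disable database/vol$i exclusive-lock object-map fast-diff deep-flatten"
    echo "rbd flatten database/vol$i"
    i=$(( i + 1 ))
  done
}

clone_cmds 3
```

Flattening detaches each clone from the parent snapshot, so later benchmark IO on one volume does not redirect reads to shared parent objects. The features are disabled because krbd in the 3.10 kernel used here does not support them.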
An “All-NVMe” High-Density Ceph Cluster Configuration (Backup)
[Diagram: five SuperMicro 1028U storage nodes, each with Intel Xeon E5 v4 22-core CPUs, 4x Intel P3700 NVMe PCIe flash drives, and 16 Ceph OSDs (4 per NVMe), on a 2x10Gbps Ceph public network (192.168.142.0/24) and a 2x10Gbps Ceph cluster network (192.168.144.0/24). FIO/Sysbench RBD clients run on SuperMicro FatTwin and Intel PCSD dual-socket Xeon E5 v3 systems; one FatTwin node hosts CBT/Zabbix/monitoring and the Ceph MON.]
• 5-node all-NVMe Ceph cluster based on Intel Xeon E5-2699 v4, 44 cores w/ HT, 128GB DDR4
• Storage: each system with 4x P3700 800GB NVMe, partitioned into 4 OSDs each, 16 OSDs total per node
• Networking: 2x10GbE public, 2x10GbE cluster, partitioned; replication factor 2
• Ceph 10.1.2 Jewel release, CentOS 7.2, 3.10.0-327.13.1.el7 kernel
• 10x FIO/Sysbench clients: Intel Xeon E5-2699 v3 @ 2.30 GHz, 36 cores w/ HT, 128GB DDR4
• Docker with kernel RBD volumes – 2 database and 2 client containers per node
• Database containers: 16 vCPUs, 32GB RAM, 250GB RBD volume; client containers: 16 vCPUs, 32GB RAM
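From the specs above, the cluster's capacity envelope follows directly: 5 nodes × 4 × 800GB NVMe of raw flash, halved by the 2x replication factor.

```shell
# Raw vs usable capacity of the 5-node cluster described above.
nodes=5
drives_per_node=4
drive_gb=800
replicas=2

raw_gb=$(( nodes * drives_per_node * drive_gb ))
usable_gb=$(( raw_gb / replicas ))
echo "raw=${raw_gb}GB usable=${usable_gb}GB"   # raw=16000GB usable=8000GB
```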
Easily serviceable NVMe Drives