Scaled RDMA Performance and Storage Design with Windows Server 2012 R2
Dan Lovinger, Principal Software Engineer, Windows File Server, Microsoft
2013 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.
Outline
- SMB3 Application Workloads
- Real Hardware
- Methodology
- 2012 Results and Discussion*
- Comparison to 2012 R2 RTM
- Scaling to Racks and Full Deployments
*There’s a paper you can download!
Goals
- Demonstrate SMB3 is a valid best choice for application workloads
- Evaluate the potential of new server hardware with SMB3
- Evaluate the performance of RDMA-capable fabric(s)
- Demonstrate that it is reasonable to consider remotely deployed storage for highly scaled server environments
- Chart a future performance course, and the metrics to use
Key SMB3 Application Workloads
- 8K Random — Hyper-V (virtualization), SQL: VHDs and database tables; pure read, plus read/write mix
- 512K Sequential — backup, disk migration, decision support/data mining: pure read; can be >512K, but performance and requirements are largely the same; also 64K
EchoStreams FlacheSAN2
Server: appliance combining SAS HBAs, enterprise SSDs and high-speed networking, running Windows Server 2012 and Storage Spaces
- Networking: 3x Mellanox ConnectX-3 FDR InfiniBand HCAs
- Storage: detailed on the next slide
- CPU: 2x Intel Xeon E5-2650 (8c16t, 2.00 GHz); the latest version ships with E5-2665 2.40 GHz CPUs plus a mezzanine adapter option
- DRAM: 32GB

Client: generic white box
- Networking: 3x Mellanox ConnectX-3 FDR InfiniBand HCAs
- CPU: 2x Intel Xeon E5-2680 (8c16t, 2.70 GHz)
- DRAM: 128GB
EchoStreams FlacheSAN2

[Diagram: the client connects over 3x Mellanox FDR links to the FlacheSAN2. Inside the server, 5x LSI 2308-based PCIe Gen 3.0 SAS HBAs (6 possible), each with 8x Intel 520 SSDs — five groups of eight, 40 SSDs total (48 possible) — are aggregated into 5x mirrored 4-column 2-copy Storage Spaces, exposed as SMB3 shares.]
Methodology
- Client workload generator: Microsoft SQLIO, affinitized to run on specific CPU cores; two instances, one per socket
- Server virtual drives: each share exposes two 100GB files
- Client instances split the load per socket, to emulate a typical NUMA-aware modern application (e.g. Windows Hyper-V, with guests affinitized to specific socket(s) and core(s), accessing per-VM VHDs); see the sketch below
- Units: KB MB GB = decimal (10^3, 10^6, 10^9); KiB MiB GiB = IEC 60027-2 (2^10, 2^20, 2^30)
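As a concrete illustration of the per-socket split — a hedged sketch, not the original harness; the share paths, core numbering, and SQLIO flag values are assumptions based on typical SQLIO usage:

```python
import subprocess
import psutil  # third-party; used here to pin each instance to one socket's cores

# Hypothetical layout: cores 0-7 = socket 0, cores 8-15 = socket 1,
# each instance driving one of the 100GB files exposed by a share.
JOBS = [
    (r"\\flachesan2\share1\file1.dat", list(range(0, 8))),
    (r"\\flachesan2\share1\file2.dat", list(range(8, 16))),
]

# Typical SQLIO flags: -kR read, -frandom random access, -b8 8KiB I/Os,
# -t8 eight worker threads, -o20 20 outstanding I/Os per thread,
# -s60 run for 60 seconds, -BN unbuffered I/O.
BASE = ["sqlio.exe", "-kR", "-frandom", "-b8", "-t8", "-o20", "-s60", "-BN"]

procs = []
for path, cores in JOBS:
    p = subprocess.Popen(BASE + [path])
    psutil.Process(p.pid).cpu_affinity(cores)  # affinitize to one socket
    procs.append(p)

for p in procs:
    p.wait()
```

Pinning each instance to one socket keeps its buffers and completions NUMA-local, mirroring how an affinitized Hyper-V guest would behave.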
Metric: Overhead
Cycles/Byte (c/B): standard measure of CPU bandwidth efficiency

$$\frac{c}{B} = \frac{\%\text{Privileged CPU Utilization} \times \text{Core Clock Frequency} \times \#\text{Cores}}{\text{Bandwidth in Bytes/s}}$$
- Privileged CPU utilization taken from Windows performance counters; discounts any unrelated activity, including the load generator itself
- Core clock is not constant, so the system under test must be configured to minimize processor frequency variation:
  - Hyperthreading disabled
  - TurboBoost and SpeedStep disabled
  - Virtualization disabled
  - BIOS deep C-states disabled
  - Windows power plan set to High Performance
- Re-enabling these can improve performance, i.e. the results here are conservative
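As a worked example of the formula, plugging in the 8KiB, 16-thread client-side numbers from Result 2 later in this deck (58.2% CPU on 16 client cores at 2.70 GHz, moving 425,900 IOPS of 8,192 bytes each):

$$\frac{c}{B} = \frac{0.582 \times 2.70{\times}10^9 \times 16}{425{,}900 \times 8{,}192} \approx \frac{25.1{\times}10^9\ \text{cycles/s}}{3.49{\times}10^9\ \text{B/s}} \approx 7.2\ \text{cycles/byte}$$

which matches the c/B column in that table.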
Metric: Latency
Two client-visible components of latency:
- Wire: visible in Windows perfmon as "stalls"
- Server: filesystem (NTFS) processing time plus storage processing time

Measured as the 90th percentile (see the sketch below):
- Captured with Windows Performance Analyzer
- Individual I/O latencies: 1M samples or 1 minute, with warm-up
- An unexpected latency increase can indicate a bottleneck being reached, e.g. CPU saturation or other overhead
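For reference, the 90th percentile statistic used throughout is just a cut of the sorted samples; a minimal sketch, assuming the common nearest-rank definition (the sample values are toy data, illustrative only):

```python
import math

def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of latency samples."""
    s = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(s)))
    return s[rank - 1]

# Toy data: ten I/O latencies in microseconds.
ios_us = [204, 176, 197, 159, 183, 410, 221, 195, 188, 176]
print(percentile(ios_us, 90))  # -> 221; the one outlier (410) sits beyond the 90th
```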
Wire = Client - Server
[Diagram: the client measures end-to-end latency; the FlacheSAN2 server (LSI HBAs feeding the x5 HBA+SSD groups of Intel 520 SSDs) measures server latency; wire latency is the difference — bit transmission time, including request queuing on and off the adapter.]
Latency Methodology
- Windows Performance Toolkit:
  xperf -on fileio …
  xperf -d trace.etl
  xperf -i trace.etl -o trace.txt -a dumper
- Correlate the relevant fileio events
- Trace both sides of the wire ~simultaneously, post warm-up
- Difference the client- and server-side histograms
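One plausible reading of the last step, as a self-contained sketch: subtract the server-side latency distribution from the client-side one at matched percentiles, leaving an estimate of the wire component. The distributions below are synthetic stand-ins for the fileio samples pulled from the two traces (the 159 µs / 49 µs centers echo the 8KiB read numbers from Result 1):

```python
import random

def quantiles(samples, qs):
    """Values of the sorted sample list at each requested quantile q in [0, 1)."""
    s = sorted(samples)
    return [s[min(len(s) - 1, int(q * len(s)))] for q in qs]

# Synthetic stand-ins: server-side completion times, plus a simulated
# wire/queuing delay to produce the client-side view.
random.seed(1)
server_us = [random.gauss(159, 20) for _ in range(100_000)]
client_us = [s + random.gauss(49, 10) for s in server_us]

# Difference the two latency histograms at matched percentiles.
qs = [i / 100 for i in range(1, 100)]
wire_us = [c - s for c, s in zip(quantiles(client_us, qs), quantiles(server_us, qs))]
print("estimated 90th percentile wire latency: %.0f us" % wire_us[89])
```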
Result 1: Single I/O Latency
- Single random I/O to a single share
- Used to establish the base latency expected of the systems
- Consistent, good performance, exposing wire and SSD latencies

90th percentile latency (µs):

Read     1 KiB   8 KiB   64 KiB   512 KiB
Client   204     197     419      1297
Server   176     159     366      1112
Wire     34      49      63       212

Write    1 KiB   8 KiB   64 KiB   512 KiB
Client   153     113     366      1355
Server   29      38      52       185
Wire     119     65      303      1143
[Charts: "Cumulative Latencies" (client and server sample counts vs. latency in µs, log scale) and "Relative Client - Server Latency" (wire latency in µs vs. percentile) for the 1K, 8K, 64K, and 512K sizes.]
Result 2: Small I/O Scaling - Read
- Client CPU comfortable
- Server CPU saturates at high thread count
- Note the relatively low server CPU clock (2.00 GHz)

1KiB:
Threads        IOPs     90th (µs)  c/B    %CPU   %CPU Srv
1 (20 I/O)     76,650   265        43.3   7.9    10.8
2 (40 I/O)     144,050  320        43.3   14.8   21.7
4 (80 I/O)     244,250  390        41.4   24.0   48.4
8 (160 I/O)    360,950  560        41.7   35.7   84.5
16 (320 I/O)   438,400  1,040      44.9   46.6   99.9

8KiB:
Threads        IOPs     90th (µs)  c/B    %CPU   %CPU Srv
1 (20 I/O)     64,500   310        7.9    9.7    9.8
2 (40 I/O)     123,600  365        7.0    16.4   20.3
4 (80 I/O)     211,500  445        6.5    26.3   46.6
8 (160 I/O)    327,050  530        6.4    40.0   82.5
16 (320 I/O)   425,900  955        7.2    58.2   100.0

[Charts: "Scaled 8KiB Read at 20 IO/Thread" (IO/s and 90th percentile latency vs. threads, showing scaling then saturation) and "Scaled 8KiB Read Wire Latency at 20 IO/Thread" (wire latency in µs vs. percentile for 1T/2T/4T/8T).]
Result 3: Small I/O Scaling – 60/40
- Similar to read
- As expected, since the load is not bandwidth-limited
- Scaling may increase on bi-directional links, if available

1KiB:
Threads        IOPs     90th R (µs)  90th W (µs)  c/B    %CPU
1 (20 I/O)     69,800   300          350          44.3   7.3
2 (40 I/O)     125,900  355          410          45.3   13.5
4 (80 I/O)     206,450  435          495          43.4   21.3
8 (160 I/O)    319,150  545          635          39.6   30.0
16 (320 I/O)   424,850  960          1,140        47.1   47.5

8KiB:
Threads        IOPs     90th R (µs)  90th W (µs)  c/B    %CPU
1 (20 I/O)     70,700   310          270          7.1    9.5
2 (40 I/O)     124,950  370          340          7.0    16.6
4 (80 I/O)     210,850  450          410          6.9    27.4
8 (160 I/O)    328,150  575          510          6.8    42.5
16 (320 I/O)   375,900  1,235        1,330        7.4    52.7

[Chart: "Mixed 8KiB Wire Latency at 8T 20 IO/Thread" — read and write wire latency (µs) vs. percentile.]
Result 4: Large I/O (Read)
- Full bandwidth (16+ GB/s!) achievable, at very low CPU
- 512KiB reaches the limit of the network at just under 16 threads
- Multichannel round-robin leads to some latency variation near the limit
- The CPU limit is much better behaved, by comparison

64KiB:
Threads        GBytes/s  90th (µs)  c/B    %CPU
1 (20 I/O)     2.45      550        1.22   6.9
2 (40 I/O)     4.95      630        1.06   12.2
4 (80 I/O)     8.58      780        1.05   20.8
8 (160 I/O)    11.84     1,300      1.06   29.0
16 (320 I/O)   13.99     2,520      1.09   35.3

512KiB:
Threads        GBytes/s  90th (µs)  c/B    %CPU
1 (20 I/O)     6.64      1,630      0.31   4.7
2 (40 I/O)     11.34     2,570      0.29   7.6
4 (80 I/O)     14.41     4,970      0.29   9.8
8 (160 I/O)    15.68     9,930      0.30   10.9
16 (320 I/O)   16.40     19,900     0.31   11.6

[Charts: "Scaling of 64KiB IO" and "Scaling of 512KiB IO" (GBytes/s with 50th and 90th percentile latency vs. threads) and "Large Read Wire Latency at 8T 20 IO/Thread" (64KiB wire latency in µs, 512KiB in ms, vs. percentile).]
Conclusions (Windows Server 2012)
- Maximum bandwidth: 16.4 GB/s (~5.5 GB/s per adapter) at 0.31 c/B overhead, for 512KiB I/Os
- High IOPS to real storage: 376,000 to the FlacheSAN2 at 6.4 c/B overhead, for 8KiB I/Os
- Near-constant latency profile
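For context on the bandwidth ceiling (assuming the usual FDR numbers, which are not stated in this deck): each FDR link signals at 56 Gb/s, roughly 6.8 GB/s of payload after 64b/66b encoding, so

$$\frac{16.4\ \text{GB/s}}{3\ \text{HCAs}} \approx 5.5\ \text{GB/s per adapter} \approx 80\%\ \text{of FDR line rate}$$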
Approaching RTM – Small I/O 8KiB Random Read
                 WS2012 IOPS*   WS "WIP" IOPS*   Δ IOPS   Δ c/B
1x 54Gbps NIC    ~330,000       ~460,000         +36%     -17%
2x 54Gbps NIC    ~660,000       ~860,000         +30%     -15%

* fictitious storage (/dev/zero)

- As of the Windows 2012 R2 'MP' Preview
- Intermediate results from local-only internal optimizations: enhanced NUMA awareness; improved request batching, locking, cacheline false sharing, etc.
- Future improvements expected from further optimizations and from use of iWARP/InfiniBand remote invalidation
- Refer to the earlier Greg Kramer / Tom Talpey presentation for final numbers!
2012 to 2012 R2
- Same client; server CPU clock increases by 20%
- SSDs age about 9 months
- Mezzanine LSI adapter option installed; sixth SSD group now available

         E5-2650   E5-2665
Normal   2.0 GHz   2.4 GHz   (+20%)
Turbo    2.8 GHz   3.1 GHz   (+11%)
2012 to 2012 R2 at 5 Groups
- Small read @ 20 QD/T
- Up 30% at the limit, above the nominal 20% from clock alone
- Peaks: 583K IOPS at 1KiB, 549K at 8KiB

[Chart: IO/s vs. threads (1-16) for 1KiB and 8KiB, 2012 vs. 2012 R2.]
2012 to 2012 R2 – 5 Group Latency
- End-to-end latency improves very significantly at saturation

[Chart: 8KiB latency (µs) vs. threads (1-16), 2012 vs. 2012 R2.]
2012 R2 5 → 6 Groups
- Small read, now 24 QD/T
- +20%, as expected, until CPU saturation and max TDP

[Charts: IO/s vs. threads (1-16) for 1KiB and 8KiB with 5 vs. 6 groups, and percentage improvement from 5 → 6 groups.]
2012 R2 Balanced vs. High Performance
- Impact of power management varies over load
- Same final destination near saturation

[Charts: 8KiB 100% read and 60:40 R/W — IO/s and 90th percentile read latency (µs) vs. threads (1-16), Balanced vs. High Performance power plans.]
Scaling …
Classic Cluster-in-a-box Storage Connectivity

[Diagram: Server A and Server B, each attached to both JBOD A and JBOD B.]

- Great 2-point resiliency and easy shared storage
- Limited in scale and resiliency
- 24-120 shared storage devices possible
Scale-out File Server Storage Connectivity

- Great scale and resiliency
- No single point of failure: dual path to storage devices from each server
- 48-280 shared storage devices possible
- Scale-out file server allows for resource/load balancing

[Diagram: Servers A-D attached to JBODs A-D; connectivity shown for a single server.]
And Now For Something Different!
Performance:
- 100% reads, 4KiB blocks: >1 million IOPS
- 100% reads, 8KiB blocks: >500K IOPS
- 100% writes, 4KiB blocks: >600K IOPS
- 100% writes, 8KiB blocks: >300K IOPS

Configuration as tested (V6616 - SLC):
- 4x dual-port Mellanox ConnectX-3
- 2x internal gateways
- 8c Sandy Bridge at 1.8 GHz
- 48GB DRAM
- Windows 2012 R2, failover cluster
- 8x 1TB shares exported, 2 per client

Planned configuration for GA:
- MLC: 64TB, 32TB, 12TB
- SLC: 16TB
- *Samples and POC gear available immediately

Interconnect:
- 40 GbE, RoCE RDMA
- SMB 3.0 + SMB Direct

4 external clients:
- 2x dual-port Mellanox ConnectX-3
- 4c Xeon at 2.53 GHz
- 24GB DRAM
- Windows 2012 R2, SQLIO
References
- Windows Server 2012 EchoStreams FlacheSAN2 (paper): http://www.microsoft.com/en-us/download/details.aspx?id=38432
- EchoStreams FlacheSAN2: http://www.echostreams.com/flachesan2.html
- SMB 3.0 Specification (MS-SMB2): http://msdn.microsoft.com/en-us/library/cc246482.aspx
- SMB Direct Specification (MS-SMBD): http://msdn.microsoft.com/en-us/library/hh536346.aspx
- Windows Performance Analyzer: http://go.microsoft.com/fwlink/?LinkId=214551

Contact: danlo -at- microsoft.com