Scaled RDMA Performance and Storage Design with Windows Server 2012 R2

Dan Lovinger, Principal Software Engineer, Windows File Server, Microsoft
2013 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.

Outline
• SMB3 Application Workloads – Real Hardware
• Methodology
• 2012 Results and Discussion*
• Comparison to 2012 R2 RTM
• Scaling to Racks and Full Deployments

*There’s a paper you can download!


Goals
• Demonstrate SMB3 is a valid best choice for application workloads
• Evaluate the potential of new server hardware with SMB3
• Evaluate the performance of RDMA-capable fabric(s)
• Demonstrate that it is reasonable to consider remotely deployed storage for highly scaled server environments
• Chart a future performance course, and the metrics to use


Key SMB3 Application Workloads
• Hyper-V (virtualization), SQL
• 8K Random
  • VHDs and database tables
  • Pure read, plus read/write mix
• 512K Sequential
  • Backup, disk migration, decision support/data mining
  • Pure read
  • Can be >512K, but performance and requirements are largely the same
  • Also 64K


EchoStreams FlacheSAN2
• Appliance combining SAS HBAs, enterprise SSDs, high-speed networking, Windows Server 2012, and Storage Spaces
• Server
  • Networking: 3x Mellanox ConnectX-3 FDR InfiniBand HCAs
  • CPU: 2x Intel Xeon E5-2650 (8c16t, 2.00 GHz)
    • Latest version ships with E5-2665 2.40 GHz CPUs + mezzanine slot
  • DRAM: 32 GB
  • Storage: detailed on the next slide
• Client: generic white box
  • Networking: 3x Mellanox ConnectX-3 FDR InfiniBand HCAs
  • CPU: 2x Intel Xeon E5-2680 (8c16t, 2.70 GHz)
  • DRAM: 128 GB

EchoStreams FlacheSAN2 (topology)

[Diagram: client connected to the FlacheSAN2 server over three Mellanox FDR InfiniBand links; inside the server, one LSI HBA per SSD group (x5 HBA+SSD groups of Intel 520 SSDs)]

• 5x LSI 2308-based PCIe Gen 3.0 SAS HBAs (6 possible)
• 8x Intel 520 SSDs per controller
• Total: five groups of eight, for 40 SSDs (48 possible)
• 5x mirrored 4-column 2-copy Space, exposed as SMB3 shares
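A quick back-of-envelope check (not from the deck) of where the bandwidth ceiling should sit; the 54 Gb/s FDR figure follows the deck's later slides, and the per-SSD read rate is an assumption:

    # Back-of-envelope ceiling check. Link rate follows the deck's "54Gbps"
    # figure for FDR; the per-SSD large-read rate is an assumption.
    links, link_gbps = 3, 54            # 3x ConnectX-3 FDR HCAs
    ssds, ssd_mb_s = 40, 500            # five groups of eight Intel 520s, ~500 MB/s assumed

    fabric_gb_s = links * link_gbps / 8          # ~20 GB/s of raw fabric
    ssd_pool_gb_s = ssds * ssd_mb_s / 1000       # ~20 GB/s of raw SSD read bandwidth

    print(fabric_gb_s, ssd_pool_gb_s)
    # Both ceilings sit near 20 GB/s, so the ~16.4 GB/s measured later in the
    # deck is a credible fraction of either limit after protocol overhead.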

Methodology
• Client workload generator: Microsoft SQLIO
  • Affinitized to run on specific CPU cores
  • Two instances, one per socket
• Server virtual drives
  • Each share exposes two 100 GB files
  • Client instances split the load per socket
• Goal: emulate a typical NUMA-aware modern application
  • E.g. Windows Hyper-V, guests running with affinity to specific socket(s) and core(s), accessing per-VM VHDs
• Units:
  • KB MB GB = decimal: 10^3 10^6 10^9
  • KiB MiB GiB = IEC 60027-2: 2^10 2^20 2^30
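Since block sizes are IEC and bandwidth is decimal, a small conversion sketch (the helper name is mine, not the deck's):

    KIB = 2 ** 10
    GB = 10 ** 9

    def gbytes_per_sec(iops, block_kib):
        # IOPS at an IEC block size (KiB) converted to decimal GB/s.
        return iops * block_kib * KIB / GB

    # Example: ~425,900 8KiB read IOPs (a 16-thread row later in the deck) is ~3.5 GB/s.
    print(round(gbytes_per_sec(425_900, 8), 2))   # 3.49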

Metric: Overhead
• Cycles/Byte (c/B): standard measure of CPU bandwidth efficiency

    c/B = (%Privileged CPU Utilization × Core Clock Frequency × #Cores) / (Bandwidth in Bytes/s)

• Privileged CPU utilization taken from Windows performance counters
  • Discounts any unrelated activity, and work from the load generator itself
• Core clock is not constant - the system under test must be configured to minimize processor frequency variation:
  • Hyperthreading disabled
  • Turbo Boost and SpeedStep disabled
  • Virtualization disabled
  • BIOS deep C-states disabled
  • Windows power plan set to Maximum Performance
• Re-enabling these can improve performance, i.e. the results are conservative
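A minimal sketch of the c/B calculation above; the example inputs are taken from the client side of an 8KiB result reported later in the deck:

    def cycles_per_byte(pct_privileged, core_hz, cores, bytes_per_sec):
        # c/B = (%Privileged CPU x Core Clock x #Cores) / Bandwidth, per the slide.
        cycles_per_sec = (pct_privileged / 100.0) * core_hz * cores
        return cycles_per_sec / bytes_per_sec

    # Client side of the 8KiB, 8-thread read point reported later:
    # ~40% CPU, 16 cores at 2.70 GHz, ~327,050 IOPs x 8192 bytes.
    print(round(cycles_per_byte(40.0, 2.70e9, 16, 327_050 * 8192), 1))   # ~6.4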

Metric: Latency
• Two client-visible components of latency:
  • Wire
  • Server
    • Filesystem (NTFS) processing time
    • Storage processing time
• Visible in Windows perfmon "stalls"
• Measured as the 90th percentile
  • Captured with Windows Performance Analyzer
  • Individual I/O latencies
  • 1M samples or 1 minute, with warm-up
• An unexpected latency increase can indicate a bottleneck being reached
  • E.g. CPU saturation or other overhead
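A minimal sketch of the percentile reduction, assuming the per-I/O latencies have already been extracted from the trace (see the following slides for how the traces are captured):

    def p90(latencies_us):
        # 90th percentile over the captured per-I/O samples (1M samples or 1 minute).
        ordered = sorted(latencies_us)
        return ordered[int(0.9 * (len(ordered) - 1))]

    print(p90([120, 150, 155, 160, 170, 180, 200, 250, 400, 900]))   # 400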

Wire = Client - Server

[Diagram: the client-to-FlacheSAN2 topology, annotated with Client Latency (measured at the client), Server Latency (measured at the server, across the HBA+SSD groups), and the Wire Latency between them]

• Wire latency
  • Bit transmission time
  • Includes request queuing on/off the adapter

Latency Methodology
• Windows Performance Toolkit
  • xperf -on fileio … xperf -d trace.etl
  • xperf -i trace.etl -o trace.txt -a dumper
• Correlate relevant fileio events
• Trace both sides of the wire ~simultaneously, post warm-up
• Difference the client- and server-side histograms
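A sketch of the differencing step, assuming the fileio events have already been correlated into per-I/O latency samples on each side (the dumper text format is omitted here):

    def percentile(samples_us, p):
        ordered = sorted(samples_us)
        return ordered[int(p * (len(ordered) - 1))]

    def wire_latency(client_us, server_us, points=(0.5, 0.9, 0.99)):
        # "Difference the client and server side histograms": wire latency is
        # estimated percentile-by-percentile, not by matching individual I/Os.
        return {p: percentile(client_us, p) - percentile(server_us, p) for p in points}

    # e.g. wire_latency(client_8k_samples, server_8k_samples) -> {0.5: ..., 0.9: ..., 0.99: ...}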

Result 1: Single I/O Latency
• Single random I/O to a single share
• Used to establish the base latency expected of the systems
• Consistent, good performance, exposing wire and SSD latencies

                  90th Percentile Read (us)          90th Percentile Write (us)
  Size (KiB)      Client    Server    Wire           Client    Server    Wire
  1               204       176       29             153       119       34
  8               197       159       38             113       65        49
  64              419       366       52             366       303       63
  512             1297      1112      185            1355      1143      212

[Charts: cumulative latency distributions (# samples vs latency, us) for client and server, and relative client - server latency percentile curves for 1K, 8K, 64K, and 512K]
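A sanity check (mine, not the deck's) that the wire column behaves as Wire = Client - Server at the 90th percentile:

    # 8KiB read row above: 197 us client, 159 us server, 38 us wire reported.
    client_p90, server_p90, wire_p90 = 197, 159, 38
    assert client_p90 - server_p90 == wire_p90
    # The 1KiB read row differs by ~1 us (204 - 176 = 28 vs 29 reported), which is
    # expected when percentile curves are differenced rather than individual I/Os.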

Result 2: Small I/O Scaling - Read
• Client CPU comfortable
• Server CPU saturates at high thread count
  • Note the relatively low server CPU clock (2.00 GHz)

                 1KiB                                             8KiB
  Threads        IOPs     90th (us)  c/B    %CPU   %CPU Srv      IOPs     90th (us)  c/B   %CPU   %CPU Srv
  1  (20 I/O)    76650    265        43.3   7.9    10.8          64500    310        7.9   9.7    9.8
  2  (40 I/O)    144050   320        43.3   14.8   21.7          123600   365        7.0   16.4   20.3
  4  (80 I/O)    244250   390        41.4   24.0   48.4          211500   445        6.5   26.3   46.6
  8  (160 I/O)   360950   560        41.7   35.7   84.5          327050   530        6.4   40.0   82.5
  16 (320 I/O)   438400   1040       44.9   46.6   99.9          425900   955        7.2   58.2   100.0

[Charts: "Scaled 8KiB Read at 20 IO/Thread" - IO/s and 90th-percentile latency vs threads, scaling then saturation; "Scaled 8KiB Read Wire Latency at 20 IO/Thread" - wire latency percentile curves for 1T, 2T, 4T, 8T]
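A cross-check (not from the deck) using Little's law, which ties the fixed queue depth to the mean latency and shows the 16-thread row is past the knee:

    def mean_latency_us(threads, iops, qd_per_thread=20):
        # Little's law: outstanding I/O = IOPs x mean latency.
        return threads * qd_per_thread / iops * 1e6

    print(round(mean_latency_us(8, 327_050)))    # ~489 us mean vs a 530 us 90th percentile
    print(round(mean_latency_us(16, 425_900)))   # ~751 us mean vs 955 us: the tail stretches
                                                 # as the server CPU saturates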

Result 3: Small I/O Scaling - 60/40
• Similar to read
• As expected, since the load is not bandwidth-limited
  • Scaling may increase on bi-directional links, if available

                 1KiB                                                  8KiB
  Threads        IOPs     90th R(us)  90th W(us)  c/B    %CPU         IOPs     90th R(us)  90th W(us)  c/B   %CPU
  1  (20 I/O)    69800    300         350         44.3   7.3          70700    310         270         7.1   9.5
  2  (40 I/O)    125900   355         410         45.3   13.5         124950   370         340         7.0   16.6
  4  (80 I/O)    206450   435         495         43.4   21.3         210850   450         410         6.9   27.4
  8  (160 I/O)   319150   545         635         39.6   30.0         328150   575         510         6.8   42.5
  16 (320 I/O)   424850   960         1140        47.1   47.5         375900   1235        1330        7.4   52.7

[Chart: "Mixed 8KiB Wire Latency at 8T 20 IO/Thread" - read and write wire latency percentile curves]

Result 4: Large I/O (Read)
• Full bandwidth (16+ GB/s!) achievable, at very low CPU
• 512KiB reaches the limit of the network at just under 16 threads
  • Multichannel round-robin leads to some latency variation near the limit
  • The CPU limit is much better behaved, by comparison

                 64KiB                                512KiB
  Threads        GBytes/s   90th (us)   c/B    %CPU   GBytes/s   90th (us)   c/B    %CPU
  1  (20 I/O)    2.45       550         1.22   6.9    6.64       1630        0.31   4.7
  2  (40 I/O)    4.95       630         1.06   12.2   11.34      2570        0.29   7.6
  4  (80 I/O)    8.58       780         1.05   20.8   14.41      4970        0.29   9.8
  8  (160 I/O)   11.84      1300        1.06   29.0   15.68      9930        0.30   10.9
  16 (320 I/O)   13.99      2520        1.09   35.3   16.40      19900       0.31   11.6

[Charts: "Large Read Wire Latency at 8T 20 IO/Thread" - 64KiB (us) and 512KiB (ms) wire latency percentile curves; "Scaling of 64KiB IO" and "Scaling of 512KiB IO" - GBytes/s with 50th and 90th percentile latency vs threads]

Conclusions (Windows Server 2012)
• Maximum bandwidth
  • 16.4 GB/s (~5.5 GB/s/adapter)
  • 0.31 c/B overhead
  • For 512KiB I/Os
• High IOPS to real storage
  • 376,000 to FlacheSAN2
  • 6.4 c/B overhead
  • For 8KiB I/Os
• Near-constant latency profile

Approaching RTM - Small I/O

  8KiB Random Read   WS2012 IOPS*   WS "WIP" IOPS*   Δ IOPS   Δ c/B
  1x54Gbps NIC       ~330,000       ~460,000         +36%     -17%
  2x54Gbps NIC       ~660,000       ~860,000         +30%     -15%

  * fictitious storage (/dev/zero)

• As of the Windows Server 2012 R2 'MP' Preview
• Intermediate results from local-only internal optimizations
  • Enhanced NUMA awareness
  • Improved request batching, locking, cacheline false sharing, etc.
• Future improvements expected from
  • Further optimizations
  • Use of iWARP/InfiniBand remote invalidation
• Refer to the earlier Greg Kramer / Tom Talpey presentation for final numbers!

2012 to 2012 R2
• Same client; server CPU clock increases by 20%
• SSDs age about 9 months
• Mezzanine LSI adapter option installed; a sixth SSD group is now available

            E5-2650   E5-2665
  Normal    2.0 GHz   2.4 GHz   +20%
  Turbo     2.8 GHz   3.1 GHz   +11%

2012 to 2012 R2 at 5 Groups
• Small read @ 20 QD/thread
• Up 30% at the limit, above the nominal 20% from the clock increase alone

[Chart: IO/s vs threads (1-16) for 1KiB and 8KiB, 2012 vs 2012 R2; 2012 R2 reaches 583K IO/s (1KiB) and 549K IO/s (8KiB), +30% at the limit]

2012 to 2012 R2 - 5 Group Latency
• End-to-end latency improves very significantly at saturation

[Chart: 8KiB 90th-percentile latency (us) vs threads (1-16), 2012 vs 2012 R2]

2012 R2 5 → 6 Groups
• Small read, now 24 QD/thread
• +20%, as expected, until CPU saturation and max TDP

[Charts: IO/s vs threads (1-16) for 1KiB and 8KiB with 5 vs 6 groups; improvement going to 6 groups by thread count, roughly 20% until saturation]

2012 R2 Balanced v. High Perf
• The impact of power management varies over load
• Same final destination near saturation

[Charts: 8KiB IO/s and 90th-percentile read latency (us) vs threads under the Balanced vs High Performance power plans, for 100% read and for 60:40 R/W]

Scaling …


Classic Cluster-in-a-box Storage Connectivity

[Diagram: Server A and Server B sharing JBOD A and JBOD B]

• Great 2-point resiliency and easy shared storage
• Limited in scale and resiliency
• 24-120 shared storage devices possible

Scale-out File Server Storage Connectivity

[Diagram: Servers A-D connected to JBODs A-D; connectivity shown for a single server]

• Great scale and resiliency
• No single point of failure
  • Dual path to storage devices from each server
• 48-280 shared storage devices possible
• Scale-out file server allows for resource/load balancing

And Now For Something Different!

Performance:
• 100% reads - 4KiB block: >1 million IOPs
• 100% reads - 8KiB block: >500K IOPs
• 100% writes - 4KiB: >600K IOPs
• 100% writes - 8KiB: >300K IOPs

Configuration as tested: V6616 - SLC
• 4 x dual-port Mellanox ConnectX-3
• 2 x internal gateways
• 8c Sandy Bridge at 1.8 GHz
• 48 GB DRAM
• Windows 2012 R2, Failover Cluster
• 8 x 1TB shares exported, 2 per client

Planned configuration for GA:
• MLC: 64TB, 32TB, 12TB
• SLC: 16TB
* Samples and POC gear available immediately

Interconnect
• 40 GbE - RoCE RDMA
• SMB 3.0 + SMB Direct

4 external clients
• 2 x dual-port Mellanox ConnectX-3
• 4c Xeon at 2.53 GHz
• 24 GB DRAM
• Windows 2012 R2, SQLIO

References
• Windows Server 2012 EchoStreams FlacheSAN2 (paper)
  • http://www.microsoft.com/en-us/download/details.aspx?id=38432
• EchoStreams FlacheSAN2
  • http://www.echostreams.com/flachesan2.html
• SMB 3.0 Specification (MS-SMB2)
  • http://msdn.microsoft.com/en-us/library/cc246482.aspx
• SMB Direct Specification (MS-SMBD)
  • http://msdn.microsoft.com/en-us/library/hh536346.aspx
• Windows Performance Analyzer
  • http://go.microsoft.com/fwlink/?LinkId=214551
• Contact
  • danlo -at- microsoft.com
