Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms

Hari Subramoni, Matthew Koop and Dhabaleswar K. Panda
Computer Science & Engineering Department, The Ohio State University
HotI '09

• Introduction
• Problem Statement
• Approach
• Performance Evaluation and Results
• Conclusions and Future Work

• Commodity clusters are becoming more popular for High Performance Computing (HPC) systems
• Modern clusters are being designed with multi-core processors
• Introduces multi-level communication
  – Intra-node (intra-socket, inter-socket)
  – Inter-node

[Figure: Compute cluster hierarchy: nodes connected by a network fabric; each node contains sockets, each socket contains chips, and each chip contains multiple cores]

Intra-node factors:
• Cache Hierarchies
• Memory Architecture
• Inter-Processor Connections
• Memory Controllers
• MPI Library Design

Inter-node factors:
• Interconnect Speed
• Network Performance
• Network Topology
• Network Congestion
• MPI Library Design

• Communication characteristics
• Message distribution
• Mapping of processes onto cores/nodes
  – Block vs. cyclic

• Traditionally, intra-node communication performance has been better than inter-node communication performance
  – Most applications use 'block' distributions
• Are such practices still valid on modern clusters? (A small sketch of block vs. cyclic mapping follows.)
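To make the block vs. cyclic distinction concrete, the following minimal sketch computes rank-to-node placement under both schemes. The rank and node counts are hypothetical illustration values, not a configuration from this study; MPI launchers implement this mapping internally.

```c
#include <stdio.h>

/* Illustrative rank-to-node placement for block vs. cyclic mapping.
 * Hypothetical sizes: 16 ranks over 2 nodes with 8 cores each. */
int main(void)
{
    const int ranks = 16, nodes = 2, cores_per_node = 8;

    for (int r = 0; r < ranks; r++) {
        int block_node  = r / cores_per_node;  /* fill one node, then the next */
        int cyclic_node = r % nodes;           /* round-robin across nodes     */
        printf("rank %2d -> block: node %d, cyclic: node %d\n",
               r, block_node, cyclic_node);
    }
    /* With block mapping, neighboring ranks mostly communicate intra-node;
     * with cyclic mapping, they mostly communicate inter-node. */
    return 0;
}
```

With block mapping, adjacent ranks share a node, so nearest-neighbor communication stays intra-node; with cyclic mapping the same communication pattern crosses the network, which is why the relative intra- vs. inter-node performance matters.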


• First true quad-core processor with L3 cache sharing
• 45 nm manufacturing process
• Uses QuickPath Interconnect technology
• HyperThreading allows execution of multiple threads per core in a seamless manner
• Turbo Boost technology allows automatic overclocking of the processors
• Integrated memory controller supporting multiple memory channels gives very high memory bandwidth
• Has an impact on intra-node communication performance

[Figure: Nehalem socket: four cores (Core 0 through Core 3), each with a 32 KB L1D cache and a 256 KB L2 cache, sharing an 8 MB L3 cache, an integrated DDR3 memory controller and the QuickPath Interconnect]

• An industry standard for low-latency, high-bandwidth System Area Networks
• Multiple features
  – Two communication types
    • Channel semantics
    • Memory semantics (RDMA mechanism)
  – Multiple virtual lanes
  – Quality of Service (QoS) support
• Double Data Rate (DDR) with 20 Gbps bandwidth has been available for some time
• Quad Data Rate (QDR) with 40 Gbps bandwidth has recently become available
• Has an impact on inter-node communication performance
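As a point of reference, the 20 Gbps and 40 Gbps figures are 4X link signaling rates; with the 8b/10b encoding used by DDR and QDR links, the theoretical per-direction data rates work out to roughly 2 GB/s and 4 GB/s. A minimal sketch of that standard arithmetic (the lane count and encoding overhead are the usual published values, not numbers taken from these slides):

```c
#include <stdio.h>

/* Rough theoretical per-direction data rates for InfiniBand 4X links.
 * Assumes the standard 8b/10b encoding used by SDR/DDR/QDR links. */
int main(void)
{
    const int    lanes       = 4;               /* 4X link width      */
    const double encoding    = 8.0 / 10.0;      /* 8b/10b overhead    */
    const double lane_gbps[] = { 5.0, 10.0 };   /* DDR, QDR per lane  */
    const char  *name[]      = { "DDR", "QDR" };

    for (int i = 0; i < 2; i++) {
        double signal_gbps = lanes * lane_gbps[i];   /* on the wire */
        double data_gbps   = signal_gbps * encoding; /* usable      */
        printf("%s 4X: %2.0f Gbps signaling -> %2.0f Gbps data (~%.0f GB/s)\n",
               name[i], signal_gbps, data_gbps, data_gbps / 8.0);
    }
    return 0;
}
```

The measured peaks reported later (around 3 GB/s uni-directional for NH-QDR) sit below the 4 GB/s QDR line rate largely because the PCIe 2.0 host interface and protocol overheads also limit achievable bandwidth.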

• Introduction
• Problem Statement
• Approach
• Performance Evaluation and Results
• Conclusions and Future Work

• What is the intra-node and inter-node communication performance of Nehalem-based clusters with InfiniBand DDR and QDR?
• How does this communication performance compare with previous-generation Intel processors (Clovertown and Harpertown) with similar InfiniBand DDR and QDR?
• With rapid advances in processor and networking technologies, is the relative performance between intra-node and inter-node communication changing?
• How can such changes be characterized?
• Can such a characterization be used to analyze application performance across different systems?

• Introduction
• Problem Statement
• Approach
• Performance Evaluation and Results
• Conclusions and Future Work

• Absolute performance of intra-node and inter-node communication
  – Different combinations of Intel processor platforms and InfiniBand (DDR and QDR)
• Characterization of the relative performance between intra-node and inter-node communication
  – Use this characterization to analyze application-level performance



• Applications have different communication characteristics
  – Latency sensitive
  – Bandwidth (uni-directional) sensitive
  – Bandwidth (bi-directional) sensitive
• Introduce a set of metrics: the Communication Balance Ratio (CBR)
  – CBR-Latency = Latency_Intra / Latency_Inter
  – CBR-Bandwidth = Bandwidth_Intra / Bandwidth_Inter
  – CBR-Bi-BW = Bi-BW_Intra / Bi-BW_Inter
  – CBR-Multi-BW = Multi-BW_Intra / Multi-BW_Inter
• CBR-x = 1 => the cluster is balanced with respect to metric x
  – Applications sensitive to metric x can be mapped anywhere in the cluster without any significant impact on overall performance (see the sketch below)
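As a simple illustration of how these ratios are formed, the sketch below divides intra-node results by inter-node results for each metric. The bandwidth inputs reuse peak values quoted later in these slides; the inter-node latency is a hypothetical placeholder, and the real CBR plots are computed per message size.

```c
#include <stdio.h>

/* Communication Balance Ratio: intra-node result divided by the
 * inter-node result for the same metric. */
static double cbr(double intra, double inter) { return intra / inter; }

int main(void)
{
    double lat_intra  = 0.35;   /* us, intra-socket value quoted later   */
    double lat_inter  = 1.5;    /* us, hypothetical placeholder          */
    double bw_intra   = 7474, bw_inter   = 3029;   /* MBps peaks quoted later */
    double bibw_intra = 6826, bibw_inter = 5236;   /* MBps peaks quoted later */

    printf("CBR-Latency   = %.2f\n", cbr(lat_intra,  lat_inter));
    printf("CBR-Bandwidth = %.2f\n", cbr(bw_intra,   bw_inter));
    printf("CBR-Bi-BW     = %.2f\n", cbr(bibw_intra, bibw_inter));

    /* Values near 1 mean the metric is balanced across intra- and
     * inter-node paths, so process placement matters little for
     * applications dominated by that metric. */
    return 0;
}
```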


• Introduction
• Problem Statement
• Approach
• Performance Evaluation and Results
• Conclusions and Future Work

• Three different compute platforms
  – Intel Clovertown
    • Intel Xeon E5345 dual quad-core processors operating at 2.33 GHz
    • 6 GB RAM, 4 MB cache
    • PCIe 1.1 interface
  – Intel Harpertown
    • Dual quad-core processors operating at 2.83 GHz
    • 8 GB RAM, 6 MB cache
    • PCIe 2.0 interface
  – Intel Nehalem
    • Intel Xeon E5530 dual quad-core processors operating at 2.40 GHz
    • 12 GB RAM, 8 MB cache
    • PCIe 2.0 interface



• Two different InfiniBand Host Channel Adapters
  – Dual-port ConnectX DDR adapter
  – Dual-port ConnectX QDR adapter
• Two different InfiniBand switches
  – Flextronics 144-port DDR switch
  – Mellanox 24-port QDR switch
• Five different platform-interconnect combinations
  – NH-QDR – Intel Nehalem machines using ConnectX QDR HCAs
  – NH-DDR – Intel Nehalem machines using ConnectX DDR HCAs
  – HT-QDR – Intel Harpertown machines using ConnectX QDR HCAs
  – HT-DDR – Intel Harpertown machines using ConnectX DDR HCAs
  – CT-DDR – Intel Clovertown machines using ConnectX DDR HCAs
• OpenFabrics Enterprise Distribution (OFED) 1.4.1 drivers
• Red Hat Enterprise Linux 4U4
• MPI stack used – MVAPICH2-1.2p1

• High-performance MPI library for IB and 10GE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  – Used by more than 960 organizations in 51 countries
  – More than 32,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 8th-ranked 62,976-core cluster (Ranger) at TACC
  – Available in the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
  – Also supports the uDAPL device to work with any network supporting uDAPL
  – http://mvapich.cse.ohio-state.edu/

• OSU Microbenchmarks (OMB) – Version 3.1.1 – http://mvapich.cse.ohio-state.edu/benchmarks/

• Intel MPI Benchmarks (IMB) – Version 3.2 – http://software.intel.com/en-us/articles/intel-mpi-benchmarks/

• HPC Challenge Benchmark (HPCC) – Version 1.3.1 – http://icl.cs.utk.edu/hpcc/

• NAS Parallel Benchmarks (NPB) – Version 3.3 – http://www.nas.nasa.gov/
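The point-to-point latency and bandwidth results in the following slides come from OSU-benchmark-style ping-pong tests. The sketch below is a minimal, self-contained latency loop in that spirit, simplified relative to the actual osu_latency code, which adds warm-up iterations and a sweep over message sizes:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal ping-pong latency sketch in the spirit of osu_latency.
 * Simplified: one message size, no warm-up phase. */
int main(int argc, char **argv)
{
    enum { ITERS = 10000, SIZE = 8 };   /* 8-byte messages */
    char buf[SIZE] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency is half the average round-trip time */
        printf("%d-byte latency: %.2f us\n", SIZE, (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```

Running the two ranks on cores of the same node measures intra-node latency; placing them on two different nodes measures the inter-node path through the HCA and switch.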

• Absolute performance
  – Inter-node latency and bandwidth
  – Intra-node latency and bandwidth
  – Collective: All-to-all
  – HPCC
  – NAS
• Communication Balance Ratio
  – CBR-Latency
  – CBR-Bandwidth (uni-directional)
  – CBR-Bandwidth (bi-directional)
  – CBR-Bandwidth (multi-pair)
• Impact of CBR on application performance

[Figure: Inter-node MPI latency vs. message size (small- and large-message panels) for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR; small-message latencies range from 1.05 us to 1.76 us]

• Harpertown systems deliver the best small-message latency
• Up to 10% improvement in large-message latency for NH-QDR over HT-QDR

[Figure: Inter-node uni-directional bandwidth (peaks of 3029, 2575, 1943, 1943 and 1556 MBps) and bi-directional bandwidth (peaks of 5236, 5042, 3870, 3743 and 3011 MBps) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]

• Nehalem systems offer a peak uni-directional bandwidth of 3029 MBps and a peak bi-directional bandwidth of 5236 MBps
• NH-QDR gives up to 18% improvement in uni-directional bandwidth over HT-QDR

[Figure: Intra-node MPI latency vs. message size (small- and large-message panels) for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR; small-message latencies range from 0.35 us to 0.57 us]

• Intra-socket small-message latency of 0.35 us
• Nehalem systems give up to 40% improvement in intra-node latency across various message sizes

[Figure: Intra-node uni-directional bandwidth (peaks of 7474, 3282 and 2208 MBps) and bi-directional bandwidth (peaks of 6826, 2779 and 1738 MBps) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]

• Intra-socket bandwidth (7474 MBps) and bi-directional bandwidth (6826 MBps) show the high memory bandwidth of Nehalem systems
• The drop in performance at large message sizes is due to cache collisions

[Figure: Intra-node bandwidth vs. message size with the same send/recv buffers (peaks of 26954, 21898 and 16382 MBps) and with different send/recv buffers (peaks of 9843, 4609 and 1168 MBps) for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]

• Different send/recv buffers are used to negate the caching effect
• Nehalem systems show superior memory bandwidth with different send/recv buffers

[Figure: All-to-all collective latency vs. message size (small- and large-message panels) for NH-QDR, NH-DDR and CT-DDR]

• A 43% to 55% improvement from using a QDR HCA over a DDR HCA
• Harpertown numbers are not shown because a sufficient number of nodes was not available

[Figure: HPCC Minimum Ping Pong Bandwidth for HT-QDR, HT-DDR, NH-DDR and NH-QDR, with CT-DDR as the baseline]

• Baseline numbers are taken on CT-DDR
• NH-DDR shows a 13% improvement in performance over the Harpertown and Clovertown systems
• NH-QDR shows a 38% improvement in performance over NH-DDR systems

[Figure: HPCC Naturally Ordered and Randomly Ordered Ring Bandwidth for HT-QDR, HT-DDR, NH-DDR and NH-QDR, with CT-DDR as the baseline]

• Up to 190% improvement in Naturally Ordered Ring Bandwidth for NH-QDR
• Up to 130% improvement in Randomly Ordered Ring Bandwidth for NH-QDR

[Figure: NPB Class B and Class C performance at 32 processes (CG, FT, IS, LU, MG), normalized run time for NH-DDR and NH-QDR]

• Numbers normalized to NH-DDR
• NH-QDR shows clear benefits over NH-DDR for multiple applications

[Figure: Communication Balance Ratio for latency (Latency_Intra / Latency_Inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with the balanced line at 1]

• Useful for latency-bound applications
• Harpertown is more balanced for applications using small to medium-sized messages
• HT-DDR is most balanced for applications using large messages, followed by NH-QDR

[Figure: Communication Balance Ratio for uni-directional bandwidth (Bandwidth_Intra / Bandwidth_Inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with the balanced line at 1]

• Useful for bandwidth-bound applications
• Nehalem systems are more balanced for applications using small to medium-sized messages
• Harpertown systems are more balanced for applications using large messages

[Figure: Communication Balance Ratio for bi-directional bandwidth (Bi-BW_Intra / Bi-BW_Inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with the balanced line at 1]

• Useful for applications with frequent bi-directional communication patterns
• Nehalem systems are balanced for applications using small to medium-sized messages in a bi-directional communication pattern
• NH-QDR is balanced for all message sizes

[Figure: Communication Balance Ratio for multi-pair bandwidth (Multi-BW_Intra / Multi-BW_Inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with the balanced line at 1]

• Useful for communication-intensive applications
• NH-QDR is balanced for applications using mainly small to medium-sized messages
• Harpertown is balanced for applications using mainly large messages

• NH-QDR is more balanced than NH-DDR, especially for medium to large messages
• Process mapping should therefore have less impact with NH-QDR than with NH-DDR for applications using medium to large messages

[Figure: NPB performance (CG, EP, FT, LU, MG) with block and cyclic process mapping for NH-QDR and NH-DDR, normalized run time]

• We compare NPB performance for block and cyclic process mapping
• Numbers are normalized to NH-DDR-Cyclic
• NH-QDR has very similar performance for both block and cyclic mapping for multiple applications
• CG and FT use many large messages and hence show a difference between mappings
• MG is not communication intensive
• LU uses small messages, where the CBR of NH-QDR and NH-DDR is similar


• Introduction
• Problem Statement
• Approach
• Performance Evaluation and Results
• Conclusions and Future Work

• Studied the absolute communication performance of various Intel computing platforms with InfiniBand DDR and QDR
• Proposed a set of metrics related to the Communication Balance Ratio (CBR)
• Evaluated these metrics for various computing platforms with InfiniBand DDR and QDR
• Nehalem systems with InfiniBand QDR give the best absolute latency and bandwidth performance in most cases
• Nehalem-based systems alter the CBR metrics
• Nehalem systems with InfiniBand QDR interconnects also offer the best communication balance in most cases
• Future work: perform larger-scale evaluations and study the impact of these systems on the performance of end applications

{subramon, koop, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory http://mvapich.cse.ohio-state.edu/

