Evaluating Performance of BLAST on Intel Xeon and Itanium2 Processors

Author: Garey Wilkinson

Ramesh Radhakrishnan, Rizwan Ali, Garima Kochar, Onur Celebioglu, Jenwei Hsieh
Scalable Systems Group

Kalyana Chadalavada, Ramesh Rajagopalan

HPCC Enterprise Solutions
Dell Inc

Agenda
• Objectives
• Platform Comparison
  • Xeon and Itanium2 Processors
  • Cache and Memory architecture
• BLAST
  • Application Characteristics
• Experimental Setup
• Performance Analysis
  • Memory System Performance
  • Application performance
  • Workload Characterization
• Future Work

www.dell.com/hpcc

HPCC Building Blocks

• Application: Parallel Applications (STAR-CD, Fluent, BLAST, ...)
• Middleware: MPI/Pro, MPICH, MVICH, PVM
• OS: Linux, Windows
• Protocol: TCP, VIA, GM, Elan
• Interconnect: Fast Ethernet, Gigabit Ethernet, Myrinet, Quadrics, Infiniband
• Platform: Dell PowerEdge Servers (IA32 & IA64)

Objectives
• Evaluate performance of BLAST on different Intel Processor Architectures
  • Nocona – 90nm Xeon
  • Prestonia – 130nm Xeon
  • Madison – Itanium2
• Platform Comparison
  • Impact of Processor FSB and Memory differences
• BLAST
  • Application Performance
  • Application Characteristics

Platform Comparison
• Dell PowerEdge Servers
  • PE1750 (IA32)
    • Dual 3.2GHz Processors, 533MHz FSB
    • L2 Cache: 512KB, L3: 1MB
    • DDR-266 MHz
  • PE1850 (EM64T)
    • Dual 3.2GHz Processors, 800MHz FSB
    • Dual 3.6GHz Processors, 800MHz FSB
    • L2 Cache: 1024KB
    • DDR2-400 MHz
  • PE3250 (IA64)
    • Dual 1.5GHz Itanium2 Processors, 400MHz FSB
    • L2 Cache: 256KB, L3: 6MB
    • DDR-200 MHz

Processor Comparison

• Xeon DP (130nm) - Prestonia
  • 1.8GHz – 3.2GHz, 400MHz – 533MHz (FSB)
  • 20 stage pipeline
• Xeon DP (90nm) - Nocona
  • 2.8GHz – 3.6GHz, 800MHz (FSB)
  • 31 stage pipeline
  • x86 64-bit Extensions
• Itanium2 - Madison
  • 1.0GHz – 1.5GHz, 400MHz FSB
  • 64-bit EPIC architecture

Cache and Memory subsystem Comparison

• Memory Subsystem Differences:
  • DDR vs. DDR2
• Cache Architectures

                   PE1750 Xeon 130nm      PE1850 Xeon 90nm       PE3250 Itanium2
L1 (Inst Cache)    12K µops Trace Cache   12K µops Trace Cache   16KB
L1 (Data Cache)    8KB                    16KB                   16KB
L2 Cache           512KB                  1024KB                 256KB
L3 Cache           1MB                    N/A                    6MB

Memory Subsystem Performance
• Theoretical Peak Bandwidth:
  • PE3250 – 6.4 GB/s
  • PE1850 – 6.4 GB/s
  • PE1750 – 4.2 GB/s
• Sustainable memory bandwidth:

[Figure: sustainable memory bandwidth for the Copy, Scale, Add, and Triad operations on PE1750 (DDR266), PE1850 (DDR2-400), and PE3250 (DDR200). Y-axis: Throughput (MB/s), 0 to 4000. Plotted values range from 2162 to 3737 MB/s.]
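One way to arrive at the theoretical peak figures listed on this slide is to multiply the front-side bus transfer rate by the bus width. The sketch below does that arithmetic. The bus widths (8 bytes for the Xeon FSB, 16 bytes for the Itanium2 system bus) are assumptions that the slides do not state, and the function and dictionary names are illustrative only.

```python
# Peak bandwidth = FSB transfer rate (MHz) x bus width (bytes per transfer).
# Assumption (not stated on the slides): the Xeon front-side bus is 64 bits
# (8 bytes) wide and the Itanium2 system bus is 128 bits (16 bytes) wide.
PLATFORMS = {
    "PE1750": {"fsb_mhz": 533, "bus_bytes": 8},
    "PE1850": {"fsb_mhz": 800, "bus_bytes": 8},
    "PE3250": {"fsb_mhz": 400, "bus_bytes": 16},
}


def peak_bandwidth_gbs(fsb_mhz, bus_bytes):
    """Return the theoretical peak bandwidth in GB/s (1 GB = 1000 MB)."""
    return fsb_mhz * bus_bytes / 1000.0


for name, params in PLATFORMS.items():
    print(f"{name}: {peak_bandwidth_gbs(**params):.2f} GB/s")
```

Under these assumptions the PE1850 and PE3250 both come out at 6.4 GB/s, and the PE1750 at roughly 4.26 GB/s, which the slide lists as 4.2 GB/s.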

Cache Performance
• Cache Access and Memory Read Latency (using LMbench)

Cache/Memory Level   PE1750 3.2 GHz        PE1850 3.2 GHz         PE3250 1.5 GHz
                     (130nm Xeon) DDR266   (90nm Xeon) DDR2-400   (Itanium2) DDR-200

Time (nanoseconds)
L1                   0.63ns                1.25ns                 1.34ns
L2                   5.7ns                 9.03ns                 4.02ns
L3                   8.5ns                 N/A                    13.7ns
Memory               128ns                 116ns                  201ns

Cycles (processor clocks)
L1                   2                     4                      2
L2                   18                    29                     6
L3                   27                    N/A                    21
Memory               410                   371                    302
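The cycle counts in the latency table follow from the nanosecond measurements and each processor's clock rate (cycles = latency in ns × clock in GHz). A minimal sketch of that conversion, using the values from the table; the function and variable names are illustrative, not part of LMbench:

```python
def latency_cycles(latency_ns, clock_ghz):
    """Convert a latency in nanoseconds to processor clock cycles."""
    return latency_ns * clock_ghz


# (platform, clock in GHz, {level: latency in ns}) as listed in the table
MEASUREMENTS = [
    ("PE1750", 3.2, {"L1": 0.63, "L2": 5.7, "L3": 8.5, "Memory": 128}),
    ("PE1850", 3.2, {"L1": 1.25, "L2": 9.03, "Memory": 116}),
    ("PE3250", 1.5, {"L1": 1.34, "L2": 4.02, "L3": 13.7, "Memory": 201}),
]

for name, ghz, levels in MEASUREMENTS:
    for level, ns in levels.items():
        print(f"{name} {level}: {latency_cycles(ns, ghz):.1f} cycles")
```

The comparison cuts both ways: the Itanium2 has the longest memory latency in nanoseconds (201ns) but the shortest in processor clocks (302), because its clock runs at less than half the Xeon frequency.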

BLAST
• Basic Local Alignment Search Tool
  • A family of sequence database-search algorithms
  • Searches a database for similarities to a short query sequence
  • Example sequences: ABCDAFRGLAAQA and ASRGAALCNAGF
    • Non-optimal alignment (1 match)
    • Optimal alignment (4 matches)
• Application Characteristics
  • Sensitive to processor memory bandwidth
  • Embarrassingly parallel operation
  • Integer-operation intensive
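The alignment example on this slide can be reproduced with a brute-force sketch that slides one sequence past the other and counts identical characters at each offset. This is only an illustration of what "optimal alignment" means for the two example sequences. It is not the BLAST algorithm, which uses seeded heuristics and scoring matrices, and the function names are hypothetical.

```python
def count_matches(a, b, shift):
    """Count identical characters when b starts `shift` positions to the
    right of the start of a (a negative shift moves b to the left)."""
    matches = 0
    for j, ch in enumerate(b):
        i = j + shift
        if 0 <= i < len(a) and a[i] == ch:
            matches += 1
    return matches


def best_ungapped_alignment(a, b):
    """Try every relative shift and return (best_shift, matches)."""
    shifts = range(-(len(b) - 1), len(a))
    return max(((s, count_matches(a, b, s)) for s in shifts),
               key=lambda pair: pair[1])


a = "ABCDAFRGLAAQA"
b = "ASRGAALCNAGF"
shift, matches = best_ungapped_alignment(a, b)
print(shift, matches)  # prints "4 4"
```

For these two sequences the best ungapped placement shifts the second sequence four positions to the right, giving the 4 matches quoted on the slide; several other shifts give only 1 match.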

BLAST Performance

[Figure: BLAST performance relative to the PE1750 (3.2 GHz), which is plotted as 1.00, for PE1850 (3.2GHz), PE1850 (3.6GHz), and PE3250 (1.5 GHz). X-axis: Query size / # of threads (94k, 206k, and 510k, each with 1 thread and 2 threads). Y-axis: Relative Performance, 0.00 to 2.50. PE1850 bars range from 1.29 to 1.59 and PE3250 bars from 2.22 to 2.37.]

• PE1850: 29% to 59% performance improvement over the PE1750
• PE3250: 122% to 137% performance improvement over the PE1750
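The percentage improvements quoted for this chart follow directly from the relative-performance values, assuming the PE1750 is the 1.00 baseline. A small sketch of that conversion, using the lowest and highest bar values for each platform; the function name is illustrative:

```python
def percent_improvement(relative_performance):
    """Convert performance relative to a 1.00 baseline into a percentage gain."""
    return (relative_performance - 1.0) * 100.0


# Lowest and highest relative-performance values per platform from the chart
EXTREMES = [
    ("PE1850 low", 1.29),
    ("PE1850 high", 1.59),
    ("PE3250 low", 2.22),
    ("PE3250 high", 2.37),
]

for label, rel in EXTREMES:
    print(f"{label}: {percent_improvement(rel):.0f}%")
```

This reproduces the 29% to 59% range for the PE1850 and the 122% to 137% range for the PE3250.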

Scalability (1P to 2P)

[Figure: performance improvement from 1P to 2P for PE1750 (3.2 GHz), PE1850 (3.2GHz), PE1850 (3.6GHz), and PE3250 (1.5 GHz). X-axis: Query Sizes (94k, 206k, 510k). Y-axis: Performance Improvement, 0% to 100%. Plotted values range from 51.18% to 89.44%.]

• Good Thread-Level Parallelism

EM64T Evaluation
Comparison of the different modes of operation against the protein database

[Figure: run time for three configurations: PE1850/32bit, PE1850/64bitOS/32bit-binary, and PE1850/64bitOS/64bit binary. X-axis: Query Word Size (153117, 206848, 237455). Y-axis: Time (Sec), 9000 to 25000.]

• EM64T mode provides benefits with additional registers and memory addressing capability over legacy 32-bit modes.

CPU Performance Metrics (Xeon)

                                           PE1750 (3.2GHz)   PE1850 (3.2GHz)   PE1850 (3.6GHz)
% Unhalted CPU Cycles                      98.97%            98.53%            98.84%
Path Length                                133M              144M              145M
CPI                                        2.84              2.62              2.93
Instruction Speculation Efficiency Ratio   64%               60%               60%
L1 Data Cache Miss Ratio                   8%                7%                7%
L2 Cache Load & Store Miss Ratio           24%               14%               15%
L2 Cache Hits Shared Ratio                 48%               68%               65%
L2 Cache Hits Exclusive Ratio              11%               11%               11%
L2 Cache Hits Modified Ratio               16%               9%                9%
L3 Cache Load & Store Miss Ratio           54%               N/A               N/A
FSB Data Bus Throughput (Mbytes/sec)       1,132             1,398             1,439

CPU Performance Metrics (Xeon vs. Itanium)

                           PE1750          PE1850          PE1850          PE3250
                           (3.2GHz Xeon)   (3.2GHz Xeon)   (3.6GHz Xeon)   (1.5GHz Itanium2)
CPI                        2.84            2.62            2.93            0.68
L1 data Cache miss ratio   8%              7%              7%              15.8%
L2 Cache Miss Ratio        24%             14%             15%             22.3%
L3 Cache Miss Ratio        54%             100%            100%            7.1%
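A rough way to read the CPI row is to turn it into an instruction completion rate (clock frequency divided by CPI). The sketch below does this with the clock rates and CPI values from the table. It is an illustration only: the Xeon and Itanium2 execute different instruction sets, and no path length is given for the Itanium2, so these rates are not a prediction of BLAST run time. The names are illustrative.

```python
def instructions_per_second(clock_ghz, cpi):
    """Instruction completion rate implied by a clock rate and a CPI value."""
    return clock_ghz * 1e9 / cpi


# (clock in GHz, CPI) for each platform, as listed in the table
SYSTEMS = {
    "PE1750 (3.2GHz Xeon)": (3.2, 2.84),
    "PE1850 (3.2GHz Xeon)": (3.2, 2.62),
    "PE1850 (3.6GHz Xeon)": (3.6, 2.93),
    "PE3250 (1.5GHz Itanium2)": (1.5, 0.68),
}

baseline = instructions_per_second(*SYSTEMS["PE1750 (3.2GHz Xeon)"])
for name, (ghz, cpi) in SYSTEMS.items():
    rate = instructions_per_second(ghz, cpi)
    print(f"{name}: {rate / 1e9:.2f} G instr/s, {rate / baseline:.2f}x PE1750")
```

Despite its lower clock, the Itanium2's CPI of 0.68 gives it close to twice the instruction rate of the 3.2GHz PE1750 under this simple model.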

Summary and Future Directions
• Evaluated Performance of BLAST on different Platforms
  • BLAST runs well on IA64 architecture
  • Scaled well with faster DDR2 memory
  • No large benefits from increased cache size on Nocona
  • No additional benefits from 64-bit capabilities
  • Interesting workload
• Future Work
  • Run on a cluster to evaluate
    • Interconnect performance
    • Different flavors of MPI libraries
    • Impact of Hyper-Threading
