Memory Performance on Dual Processor Nodes: Comparison of Intel Xeon and AMD Opteron Memory Subsystem Architectures

by Avi Purkayastha ([email protected])

Chona Guiang, Avi Purkayastha, Kent Milfeld, Jay Boisseau

TEXAS ADVANCED COMPUTING CENTER

OUTLINE

• Motivation
• Architecture of Intel Xeon & AMD Opteron Systems
• Single & Dual Processor Xeon & Opteron Performance Comparison
  – Measured Memory Characteristics: Latency, Bandwidth
  – Parallel vs. Serial Execution of Codes on a Node
    • Kernels (STREAM, DAXPY, DGEMM)
    • Scientific Applications
    • NAS Parallel Benchmarks
• Summary and Conclusions

Motivation

Dual-processor scoreboard for HPC applications: single- vs. dual-processor nodes compared on

• Peak performance (TFLOP) and cost per processor
• Memory subsystem: shared bus, cache coherence (processor, "Northbridge" & OS), false sharing, memory size
• Message passing: shared interconnect adapters, on-node MPI performance
• I/O performance: local and parallel

Intel Architecture: Commodity IA-32 Server

[Block diagram: two IA-32 processors share a 400 MHz, 3.2 GB/s front-side bus to the North Bridge; the North Bridge drives dual 200 MHz memory channels (1.6 GB/s each) and connects via the South Bridge to a 66 MHz PCI adapter ("NIC") at 0.5 GB/s, which links to the switch.]

AMD Opteron: HyperTransport Link Widths and Speeds

[Block diagram of the Opteron chip: the core, system request queue, and crossbar (XBAR) join the on-die DDR memory controller (two 333 MHz channels, 2.66 GB/s each) to three HyperTransport links. Each HyperTransport link is a pair of unidirectional point-to-point links, 2, 4, 8, 16, or 32 bits wide, running at up to 800 MHz (DDR); 1.6 Gb/s per pin pair gives 3.2 GB/s per direction at 800 MHz x 2.]

AMD Architecture

[Block diagram of a dual-processor Opteron node: each Opteron has its own dual-channel 266/333 MHz (PC2100/2700) memory delivering 2.1/2.7 GB/s per channel; the two processors are joined by a 6.4 GB/s coherent HyperTransport link, and further 6.4 GB/s HyperTransport links connect to the AMD-8151 AGP tunnel and the AMD-8131 PCI-X tunnel.]

Memory Latency

  I1 = IA(1)
  DO I = 2,N
    I2 = IA(I1)
    I1 = I2
  END DO
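The loop above walks a chain of dependent indirect loads, so each iteration's full load-to-use latency is exposed. A minimal, self-contained sketch of how such a measurement could be driven is below; the array name IA matches the slide, but the sizes, stride, and timing harness are illustrative assumptions rather than the talk's actual benchmark.

  ! Pointer-chasing latency sketch (assumed sizes, not the talk's code).
  program chase_latency
    implicit none
    integer, parameter :: n = 4*1024*1024      ! 16 MB of indices: out of cache
    integer, parameter :: nwalk = 10*1000*1000 ! dependent loads to time
    integer, allocatable :: ia(:)
    integer :: i, i1
    integer(8) :: t0, t1, rate

    allocate(ia(n))
    do i = 1, n
      ! Jump 65 elements per step; 65 is coprime with n, so the walk is one cycle.
      ia(i) = mod(i + 64, n) + 1
    end do

    call system_clock(t0, rate)
    i1 = ia(1)
    do i = 2, nwalk
      i1 = ia(i1)                              ! each load waits on the previous one
    end do
    call system_clock(t1)

    ! Report i1 so the compiler cannot discard the chase.
    print *, 'ns per load:', 1.0d9*real(t1 - t0, 8)/real(rate, 8)/real(nwalk - 1, 8), i1
  end program chase_latency

Dividing the per-load time by the processor clock period gives the latency in clock periods plotted on the next two slides.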

[Plot: Memory latency, 2.4 GHz Xeon -- latency (clock periods) vs. array size (bytes); ~2 CP in L1, rising through L2 to ~470 CP in main memory.]

[Plot: Memory latency, 1.4 GHz AMD Opteron -- latency (clock periods) vs. array size (bytes); ~2-3 CP in L1, rising through L2 to ~170 CP in main memory.]

Memory Bandwidth

  DO I = 1,N
    S = S + A(I)
    T = T + B(I)
  END DO
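The loop above streams two read-only arrays and accumulates into registers, so it measures sustainable read bandwidth. A sketch of how the serial and dual-processor cases could be compared on one node follows; the use of OpenMP, the array sizes, and the repetition count are assumptions for illustration, not the talk's actual harness.

  ! Read-bandwidth sketch: one independent stream pair per processor (assumed harness).
  program read_bandwidth
    use omp_lib
    implicit none
    integer, parameter :: n = 4*1024*1024, reps = 50
    real(8), allocatable :: a(:,:), b(:,:)
    real(8) :: s, t, t0, mbps
    integer :: i, rep, me

    allocate(a(n,2), b(n,2))
    !$omp parallel private(me, i) num_threads(2)
      me = omp_get_thread_num() + 1
      do i = 1, n
        a(i,me) = 1.0d0                 ! first touch places each column locally
        b(i,me) = 2.0d0
      end do
    !$omp end parallel

    t0 = omp_get_wtime()
    !$omp parallel private(me, i, rep, s, t) num_threads(2)
      me = omp_get_thread_num() + 1     ! each thread streams its own pair of columns
      s = 0.0d0
      t = 0.0d0
      do rep = 1, reps
        do i = 1, n
          s = s + a(i,me)
          t = t + b(i,me)
        end do
      end do
      if (s + t < 0.0d0) print *, s, t  ! keep the sums live
    !$omp end parallel

    ! Bytes read: 2 arrays x 8 bytes x n x reps per thread x 2 threads.
    mbps = 2.0d0*8.0d0*real(n,8)*real(reps,8)*2.0d0/(omp_get_wtime() - t0)/1.0d6
    print *, 'aggregate dual-CPU read bandwidth (MB/s):', mbps
  end program read_bandwidth

Running the same timed loop with a single thread gives the serial curve; sweeping the working-set size from a few hundred bytes up to a few hundred kilobytes traces out the cache and memory plateaus shown on the next two slides.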

[Plot: AMD single/dual-processor memory bandwidth, 1.4 GHz Opteron -- bandwidth (MB/s) vs. working-set size (512 to 262144 bytes) for serial and dual-CPU (CPU0, CPU1) runs; out of cache, roughly 2.3 GB/s serial and 2.0 GB/s per CPU in the dual run.]

[Plot: Xeon single/dual-processor memory bandwidth, 2.4 GHz Xeon -- bandwidth (MB/s) vs. working-set size (512 to 262144 bytes) for serial and dual-CPU (CPU0, CPU1) runs; out of cache, roughly 2.3 GB/s serial but only 1.0 GB/s per CPU in the dual run.]

Remote & Local Memory Read/Write

[Plot: AMD Opteron -- clock cycles per update of A(i,j) = time*A(i,j) vs. matrix leading dimension (n), per thread and averaged, with "local" placement and with the two j columns swapped to force "remote" access; local accesses complete in noticeably fewer cycles than remote accesses.]

STREAM Results

Serial execution (MB/s):

  Kernel   Intel Xeon   AMD Opteron
  Copy        1125          2162
  Scale       1117          2093
  Add         1261          2341
  Triad       1263          2411

Parallel execution, two threads (MB/s):

  Kernel   Intel Xeon   AMD Opteron
  Copy        1105          3934
  Scale       1103          4087
  Add         1263          4561
  Triad       1282          4529

• The Opteron system shows higher sustained memory bandwidth.
• The difference between FSB and HyperTransport shows up in the scaling numbers.
• Both systems attain about 35-42% of peak for sustained aggregate memory bandwidth.
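For reference, the four STREAM kernels time the following vector operations; these are the standard published STREAM definitions, and only the array length below is an illustrative choice, sized to exceed the caches.

  ! The four STREAM kernels (standard definitions; assumed array length).
  program stream_kernels
    implicit none
    integer, parameter :: n = 20*1000*1000
    real(8), parameter :: s = 3.0d0
    real(8), allocatable :: a(:), b(:), c(:)
    integer :: i
    allocate(a(n), b(n), c(n))
    a = 1.0d0; b = 2.0d0; c = 0.0d0
    do i = 1, n
      c(i) = a(i)              ! Copy:  16 bytes moved per iteration
    end do
    do i = 1, n
      b(i) = s*c(i)            ! Scale: 16 bytes per iteration
    end do
    do i = 1, n
      c(i) = a(i) + b(i)       ! Add:   24 bytes per iteration
    end do
    do i = 1, n
      a(i) = b(i) + s*c(i)     ! Triad: 24 bytes per iteration
    end do
    print *, a(1), b(1), c(1)  ! keep the results live
  end program stream_kernels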

DAXPY Results

[Plot: Intel Xeon -- clock periods per DAXPY iteration vs. vector length (16 to 2048) for serial and dual-run (CPU0, CPU1, average) executions.]

[Plot: AMD Opteron -- clock periods per DAXPY iteration vs. vector length (16 to 2048) for serial and dual-run (CPU0, CPU1, average) executions.]

DAXPY Results -- Conclusions

• A serial DAXPY iteration takes about the same number of clock periods on both systems, which translates to higher throughput on the higher-clocked Xeon.
• For dual-processor execution the time per iteration is unchanged on the AMD system: each processor streams data from its own memory independently.
• On the Xeon system the front-side-bus bandwidth is split between the processors, so the time per iteration grows toward double, though the penalty is not severe for this kernel.
• Less computationally intensive kernels such as DOTP suffer more severe performance penalties.
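For clarity, the kernel being timed is the standard DAXPY update y = y + a*x. A minimal sketch follows; the vector length and data are assumed, with the length swept from 16 to 2048 as in the plots and, in the dual run, one independent copy executing on each processor.

  ! DAXPY kernel sketch (assumed length and data).
  program daxpy_test
    implicit none
    integer, parameter :: n = 1024
    real(8) :: a, x(n), y(n)
    integer :: i
    a = 2.0d0
    x = 1.0d0
    y = 0.5d0
    do i = 1, n
      y(i) = y(i) + a*x(i)     ! 2 flops and 3 memory references per iteration
    end do
    print *, y(1)              ! keep the result live
  end program daxpy_test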

Library Matrix-Matrix Multiply (DGEMM), MKL 5.1 Library

[Plots: DGEMM performance (MFLOPS) vs. problem size (8-byte words / matrix order) for serial and two-MPI-task parallel runs on the 2.4 GHz Intel Xeon and the 1.4 GHz AMD Opteron. Opteron results may be much higher with Opteron-optimized libraries (e.g. NAG).]
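The measurement calls the standard BLAS-3 DGEMM routine (C = alpha*A*B + beta*C) from the vendor library. A minimal sketch of such a call is below; the matrix order, the random data, and the system_clock timing are illustrative assumptions, and the code links against MKL or any other BLAS.

  ! Library DGEMM call sketch (assumed size and timing harness).
  program dgemm_test
    implicit none
    integer, parameter :: n = 1000
    real(8), allocatable :: a(:,:), b(:,:), c(:,:)
    real(8) :: secs, mflops
    integer(8) :: cnt0, cnt1, rate

    allocate(a(n,n), b(n,n), c(n,n))
    call random_number(a)
    call random_number(b)
    c = 0.0d0

    call system_clock(cnt0, rate)
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
    call system_clock(cnt1)

    secs = real(cnt1 - cnt0, 8)/real(rate, 8)
    mflops = 2.0d0*real(n,8)**3/secs/1.0d6   ! 2*n^3 flops for matrix-matrix multiply
    print *, 'DGEMM MFLOPS:', mflops
  end program dgemm_test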

Indirect Dot Product

[Plots: Intel Xeon and AMD Opteron -- clock periods per indirect dot-product iteration vs. vector length, for serial and dual-run (CPU0, CPU1, average) executions.]

Indirect Dot Product Conclusions

• An important kernel for sparse matrix-vector operations.
• Indirect addressing and the extra memory references hinder aggressive optimization.
• Predictably, the IDP takes more clock periods than DOTP for these reasons.
• The Xeon performs very well when data is cache resident, but performance degrades significantly when run on two processors.
• The Opteron scales very well in the dual-processor case and performs well, particularly for data in main memory.
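The kernel differs from an ordinary dot product only in the gather through an index array, which is what stresses the memory system. A minimal sketch follows; the names, the index pattern, and the length are illustrative, not the talk's exact code.

  ! Indirect dot product sketch (assumed index pattern and length).
  program idp_test
    implicit none
    integer, parameter :: n = 4096
    real(8) :: x(n), y(n), s
    integer :: ix(n), i
    x = 1.0d0
    y = 2.0d0
    do i = 1, n
      ix(i) = n - i + 1          ! any permutation; a real test might use random indices
    end do
    s = 0.0d0
    do i = 1, n
      s = s + x(ix(i))*y(i)      ! the indirect load defeats unit-stride streaming
    end do
    print *, s
  end program idp_test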

MPI On-Node Ping Pong

On-node message passing should be faster than node-to-node.

  System                        Bandwidth (MB/s)
  DELL 2650                        295 @ 2 MB
  Opteron Suse-64 ch_p4            172 @ 2 MB
  Opteron Suse-64 ch_shmem         404 @ 2 MB
  IBM P690 Turbo                  1398 @ 2 MB
  IBM P655 HPC                    1684 @ 2 MB
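The ping-pong test bounces a message between two MPI tasks on the same node and reports bandwidth at a given message size. A sketch at the 2 MB point of the table is below; the repetition count and buffer contents are illustrative assumptions.

  ! On-node MPI ping-pong sketch (run with 2 tasks on one node; assumed reps).
  program pingpong
    use mpi
    implicit none
    integer, parameter :: nwords = 262144, nreps = 100   ! 2 MB of real(8) data
    integer :: rank, ierr, i, status(MPI_STATUS_SIZE)
    real(8) :: t0, bw
    real(8), allocatable :: buf(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    allocate(buf(nwords))
    buf = 1.0d0

    call MPI_Barrier(MPI_COMM_WORLD, ierr)
    t0 = MPI_Wtime()
    do i = 1, nreps
      if (rank == 0) then
        call MPI_Send(buf, nwords, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_Recv(buf, nwords, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, status, ierr)
      else if (rank == 1) then
        call MPI_Recv(buf, nwords, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_Send(buf, nwords, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
      end if
    end do
    if (rank == 0) then
      ! 2*nreps one-way messages moved in the measured interval
      bw = real(2*nreps, 8)*real(nwords, 8)*8.0d0/(MPI_Wtime() - t0)/1.0d6
      print *, 'on-node ping-pong bandwidth (MB/s):', bw
    end if
    call MPI_Finalize(ierr)
  end program pingpong

The ch_p4 and ch_shmem rows in the table differ only in the MPICH device used for on-node transfers (sockets vs. shared memory), which is why the same Opteron hardware shows two very different numbers.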

Scientific Applications

• SM: Stommel model of ocean "circulation"; solves a 2-D partial differential equation.
  – Uses a finite-difference approximation for derivatives on a discretized domain (timed for a constant number of Jacobi iterations).
  – Parallel version uses domain decomposition.
  – Memory-intensive application.

• MD: molecular dynamics of a solid argon lattice.
  – Uses the Verlet algorithm for propagation (displacements & velocities).
  – Calculation done for 1 picosecond.
  – Compute-intensive application.
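The Stommel solver's timed inner loop is a Jacobi relaxation sweep over the discretized domain. A rough sketch of that kind of sweep is below; the 5-point stencil, forcing term, and grid size are illustrative assumptions and not the talk's exact discretization.

  ! Jacobi sweep sketch (assumed stencil and sizes; the talk's domain is 1000x1000).
  program jacobi_demo
    implicit none
    integer, parameter :: nx = 100, ny = 100, niters = 50
    real(8) :: psi(0:nx+1,0:ny+1), pnew(0:nx+1,0:ny+1), f(nx,ny), h
    integer :: i, j, it
    h = 1.0d0/real(nx+1, 8)
    psi = 0.0d0
    f = 1.0d0
    do it = 1, niters                  ! timed for a fixed number of iterations
      pnew = psi
      do j = 1, ny
        do i = 1, nx
          pnew(i,j) = 0.25d0*( psi(i-1,j) + psi(i+1,j) &
                             + psi(i,j-1) + psi(i,j+1) - h*h*f(i,j) )
        end do
      end do
      psi = pnew
    end do
    print *, psi(nx/2, ny/2)
  end program jacobi_demo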

Serial Application Performance

  Platform        Serial FD   Serial MD
  AMD Opteron        205        132249
  Intel Xeon 2P      236          2040

• Large problem sizes for both applications (4000 argon atoms for MD; 1000x1000 domain for FD).
• Time in seconds is the measure of the results.
• The Xeon performs significantly better for the compute-intensive MD application.
• The Opteron performs better for the memory-intensive application.

Memory Intensive Application Performance

• The Opteron scales very well.
• The FSB technology in the Xeon limits data transfer to both processors for memory-intensive applications.

NAS Parallel Benchmarks

AMD Opteron performance:

  Benchmark   Serial (sec)   Parallel (sec)
  cg.B             522            221
  is.B              13              7
  lu.B            1729            681
  mg.B              55             26

Intel Xeon performance:

  Benchmark   Serial (sec)   Parallel (sec)
  cg.B             876            666
  is.B              14              9
  lu.B            1437           1182
  mg.B              58             49

• The Opteron scales better in all of the above parallel runs.
• LU, the most compute-intensive of these benchmarks, performs best serially on the Xeons.

Summary and Conclusions

• The Opteron has lower latency to main memory.
• Although the measured bandwidth is about the same, the Xeon has a higher efficiency rate.
• The parallel bandwidth rates are validated by measured application performance as well as "scaling" ability.
• Computational kernels (DAXPY, DGEMM, etc.) show that the Xeon performs better for cache-resident data, while the Opteron scales better in the two-processor case and for data in main memory.
• The Opteron system shows higher sustained bandwidth for both serial and multi-threaded execution, for all STREAM kernels.
• The FD, MD, and NPB applications validate most of the above findings.

Thanks!

• Newisys (Austin, TX) -- AMD Opteron system
• Dell (Austin, TX) & Cray -- Intel Xeon system
