Memory Performance on Dual-Processor Nodes: Comparison of Intel Xeon and AMD Opteron Memory Subsystem Architectures

Presented by Avi Purkayastha ([email protected])
Chona Guiang, Avi Purkayastha, Kent Milfeld, Jay Boisseau
TEXAS ADVANCED COMPUTING CENTER
OUTLINE
• Motivation
• Architecture of Intel Xeon & AMD Opteron Systems
• Single & Dual Processor Xeon & Opteron Performance Comparison
  – Measured Memory Characteristics: Latency, Bandwidth
  – Parallel vs. Serial Execution of Codes on a Node
    • Kernels (STREAM, DAXPY, DGEMM)
    • Scientific Applications
    • NAS Parallel Benchmarks
• Summary and Conclusions
Motivation
Dual-processor scoreboard for HPC applications, single vs. dual nodes:
• Single-processor advantages: no shared bus system; no cache coherence traffic (processor, "Northbridge" & OS); no false sharing; no shared interconnect adapters.
• Dual-processor advantages: peak performance (TFLOPS); cost per processor; memory size; on-node MPI performance; local & parallel I/O performance.
Intel Architecture: Commodity IA-32 Server
[Diagram: two IA-32 processors share a 3.2 GB/s (400 MHz) front-side bus to the Northbridge; dual-channel 200 MHz memory provides 2 x 1.6 GB/s; a Northbridge-Southbridge bus leads through a switch to a PCI adapter ("NIC") at 0.5 GB/s (66 MHz).]
AMD Opteron Architecture: HyperTransport Link Widths and Speeds
[Diagram: the Opteron chip integrates the core, a dual-channel DDR memory controller (333 MHz, 2.66 GB/s per channel), a system request queue, an XBAR crossbar, and three HyperTransport links. HyperTransport uses two unidirectional point-to-point links at 1.6 Gb/s per pin pair, 2, 4, 8, 16, or 32 bits wide at up to 800 MHz (DDR): 3.2 GB/s per direction at 800 MHz x 2.]
AMD Architecture
[Diagram: two AMD Opterons connected by a 6.4 GB/s coherent HyperTransport link. Each processor has its own dual-channel 266/333 MHz (PC2100/2700) memory at 2.1/2.7 GB/s. Additional 6.4 GB/s HyperTransport links connect to the AMD-8151 HT AGP tunnel and the AMD-8131 HT PCI-X tunnel.]
Memory Latency
      I1 = IA(1)
      DO I = 2, N
         I2 = IA(I1)
         I1 = I2
      END DO
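A minimal self-contained sketch of how this chase can be driven; the cyclic-permutation setup and SYSTEM_CLOCK timing are assumptions (the deck shows only the inner loop), and a production latency test would use a random permutation to defeat the hardware prefetcher:

program chase
  implicit none
  integer, parameter :: n = 1000000
  integer :: ia(n), i, i1, i2, t0, t1, rate
  ! link IA into a single cycle so every load depends on the previous one
  ! (a random shuffle, not shown, would also defeat hardware prefetch)
  do i = 1, n-1
     ia(i) = i + 1
  end do
  ia(n) = 1
  call system_clock(t0, rate)
  i1 = ia(1)
  do i = 2, n
     i2 = ia(i1)
     i1 = i2
  end do
  call system_clock(t1)
  ! print I1 so the dependent loads cannot be optimized away
  print *, 'ns/access:', 1.0d9*(t1 - t0)/rate/(n - 1), ' (i1 =', i1, ')'
end program chase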
Memory Latency: Xeon 2.4 GHz
[Plot: latency (clock periods) vs. array size (bytes), 128 B to 32 MB. Roughly 2 CP in L1, an L2 plateau, rising to ~470 CP from main memory.]

Memory Latency: AMD Opteron 1.4 GHz
[Plot: latency (clock periods) vs. array size (bytes), 128 B to 32 MB. Roughly 2-3 CP in L1, an L2 plateau, rising to ~170 CP from main memory.]
Memory Bandwidth
      DO I = 1, N
         S = S + A(I)
         T = T + B(I)
      END DO
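A sketch of a timing harness around this read loop (the array size, timer, and MB/s arithmetic are assumptions; the deck shows only the loop):

program readbw
  implicit none
  integer, parameter :: n = 4*1024*1024
  real(8), allocatable :: a(:), b(:)
  real(8) :: s, t, secs
  integer :: i, t0, t1, rate
  allocate(a(n), b(n))
  a = 1.0d0
  b = 2.0d0
  s = 0.0d0
  t = 0.0d0
  call system_clock(t0, rate)
  do i = 1, n
     s = s + a(i)
     t = t + b(i)
  end do
  call system_clock(t1)
  secs = dble(t1 - t0)/rate
  ! each iteration loads two 8-byte words, so 16*n bytes move in total
  print *, 'read bandwidth (MB/s):', 16.0d0*n/secs/1.0d6, ' (s+t =', s + t, ')'
end program readbw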
AMD SP/DP Memory Bandwidth: 1.4 GHz Opteron
[Plot: bandwidth (MB/s) vs. size (512 B to 256 KB) for a serial run and dual runs on CPU0 and CPU1. Out of cache: ~2.3 GB/s serial and ~2.0 GB/s per CPU in dual runs.]
Xeon SP/DP Memory Bandwidth: 2.4 GHz Xeon
[Plot: bandwidth (MB/s) vs. size (512 B to 256 KB) for a serial run and dual runs on CPU0 and CPU1. Out of cache: ~2.3 GB/s serial, but only ~1.0 GB/s per CPU in dual runs.]
Remote & Local Memory Read/Write: AMD Opteron
Kernel: A(i,j) = time * A(i,j), with the two j-column blocks swapped between threads to force remote access.
[Plot: clock cycles vs. matrix leading dimension n (400 to 13,000) for "local" threads 0 and 1 with their average, and "remote" threads 0 and 1 with their average. Remote accesses cost noticeably more cycles than local accesses.]
STREAM Results

Serial execution (MB/s):
Kernel   Intel Xeon   AMD Opteron
Copy        1125         2162
Scale       1117         2093
Add         1261         2341
Triad       1263         2411

Parallel execution, two threads (MB/s):
Kernel   Intel Xeon   AMD Opteron
Copy        1105         3934
Scale       1103         4087
Add         1263         4561
Triad       1282         4529
• The Opteron system shows higher sustained memory bandwidth.
• The difference between the FSB and HyperTransport shows up in the scaling numbers: the Xeon gains essentially nothing from a second thread, while the Opteron nearly doubles.
• Both systems attain about 35-42% of peak for sustained aggregate memory bandwidth.
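For reference, the Triad kernel measured above is the standard STREAM loop (the MB/s arithmetic in the comment is a simplification of the benchmark's own accounting):

      DO I = 1, N
         A(I) = B(I) + SCALAR*C(I)
      END DO
! Triad moves 24 bytes per iteration (two 8-byte loads, one 8-byte store),
! so MB/s = 24*N / (elapsed seconds) / 1.0e6.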
DAXPY Results
[Plot, Intel Xeon: clock periods per iteration vs. vector length n (16 to 2048) for serial, dual-run CPU0, dual-run CPU1, and the dual-run average.]
[Plot, AMD Opteron: the same measurement.]
DAXPY Results -- Conclusions
• A serial DAXPY iteration takes the same number of CPs on both systems, which translates to higher throughput for the faster-clocked Xeon.
• For dual-processor execution, the time per iteration is unchanged on the AMD system: each processor streams data from memory independently.
• On the Xeon system, the FSB bandwidth is split between the two processors, so the time per iteration grows, though not severely.
• Less computationally intensive kernels, such as DOTP, will pay a more severe performance penalty.
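For reference, the DAXPY kernel timed above is the standard BLAS level-1 operation:

      DO I = 1, N
         Y(I) = A*X(I) + Y(I)
      END DO
! 2 flops and 3 memory references (2 loads, 1 store) per iteration, so
! performance depends on both FP throughput and sustained bandwidth.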
Library Matrix-Matrix Multiply (DGEMM), MKL 5.1 Library
[Plot, AMD Opteron 1.4 GHz: performance (MFLOPS) vs. problem size (8-byte words / matrix order), serial vs. parallel with two MPI tasks. May be much higher with Opteron-optimized libraries, e.g. NAG.]
[Plot, Intel Xeon 2.4 GHz: the same measurement.]
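A minimal sketch of calling the library DGEMM measured here, assuming a BLAS such as MKL is linked; the matrix order and fill values are placeholders:

program gemmtest
  implicit none
  integer, parameter :: n = 1000
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  real(8) :: secs
  integer :: t0, t1, rate
  external dgemm
  allocate(a(n,n), b(n,n), c(n,n))
  a = 1.0d0
  b = 2.0d0
  c = 0.0d0
  call system_clock(t0, rate)
  ! C = 1.0*A*B + 0.0*C, all matrices N x N, column-major
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  call system_clock(t1)
  secs = dble(t1 - t0)/rate
  ! DGEMM performs 2*N**3 floating-point operations
  print *, 'MFLOPS:', 2.0d0*dble(n)**3/secs/1.0d6, ' c(1,1) =', c(1,1)
end program gemmtest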
Indirect Dot Product
[Plot, Intel Xeon: clock periods per iteration vs. vector length n, from 16 up to out-of-cache lengths, for serial, dual-run CPU0, dual-run CPU1, and the dual-run average.]
[Plot, AMD Opteron: the same measurement.]
Indirect Dot Product Conclusions
• An important kernel for sparse matrix-vector operations.
• Indirect addressing and the extra memory references hinder aggressive optimization.
• Predictably, the IDP takes more CPs than DOTP for these reasons.
• The Xeon performs very well when data is cache resident, but performance degrades significantly when run on two processors.
• The Opteron scales very well for dual-processor runs and performs well, particularly for data in main memory.
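For reference, the indirect dot-product kernel: a gather through an index array IX adds one memory reference per term and defeats unit-stride prefetch (IX is a generic index array; the deck does not show the loop itself):

      S = 0.0D0
      DO I = 1, N
         S = S + A(IX(I))*B(I)
      END DO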
MPI On-Node Ping-Pong
On-node message passing should be faster than node-to-node.

Platform                          Bandwidth (MB/s)
DELL 2650 (Xeon)                  295 @ 2 MB
Opteron, SuSE-64, ch_p4           172 @ 2 MB
Opteron, SuSE-64, ch_shmem        404 @ 2 MB
IBM p690 Turbo                    1398 @ 2 MB
IBM p655 HPC                      1684 @ 2 MB
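A minimal sketch of the measurement, a standard MPI ping-pong at the 2 MB message size reported above (the repetition count and harness details are assumptions):

program pingpong
  use mpi
  implicit none
  integer, parameter :: nbytes = 2*1024*1024, reps = 100
  character, allocatable :: buf(:)
  integer :: rank, ierr, i, stat(MPI_STATUS_SIZE)
  double precision :: t0, t1
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(buf(nbytes))
  buf = ' '
  t0 = MPI_Wtime()
  do i = 1, reps
     if (rank == 0) then
        call MPI_Send(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_Recv(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, stat, ierr)
     else if (rank == 1) then
        call MPI_Recv(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, stat, ierr)
        call MPI_Send(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_Wtime()
  ! each rep moves NBYTES in each direction
  if (rank == 0) print *, 'bandwidth (MB/s):', 2.0d0*reps*nbytes/(t1 - t0)/1.0d6
  call MPI_Finalize(ierr)
end program pingpong

Run with both tasks placed on one node to reproduce the on-node figures above.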
Scientific Applications
• SM: Stommel model of ocean circulation; solves a 2-D partial differential equation.
  – Uses a finite-difference (FD) approximation for derivatives on a discretized domain; timed for a constant number of Jacobi iterations (see the sketch below).
  – Memory-intensive application.
• MD: molecular dynamics of a solid argon lattice.
  – Uses the Verlet algorithm for propagation (displacements & velocities).
  – Calculation done for 1 picosecond.
  – Compute-intensive application.
• Parallel versions use domain decomposition.
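A sketch of the kind of Jacobi sweep timed in the FD code, assuming a five-point stencil with hypothetical arrays A, ANEW, and F and mesh spacing H; the actual Stommel discretization carries additional coefficient terms:

      DO J = 2, N-1
         DO I = 2, N-1
            ! new value from the four neighbors and the source term
            ANEW(I,J) = 0.25D0*(A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1) - H*H*F(I,J))
         END DO
      END DO
      A(2:N-1, 2:N-1) = ANEW(2:N-1, 2:N-1)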
Serial Application Performance

Platform         Serial FD   Serial MD
AMD Opteron        205         132249
Intel Xeon 2P      236           2040

• Large problem sizes for both applications: 4,000 argon atoms for MD; a 1000x1000 domain for FD.
• Time (in seconds) is the measure of the results.
• The Xeon performs significantly better for the compute-intensive MD application.
• The Opteron performs better for the memory-intensive FD application.
Memory-Intensive Application Performance
• The Opteron scales very well.
• The FSB technology in the Xeon limits data transfer to both processors for memory-intensive applications.
NAS Parallel Benchmarks

AMD Opteron performance:
Benchmark   Serial (sec)   Parallel (sec)
cg.B            522             221
is.B             13               7
lu.B           1729             681
mg.B             55              26

Intel Xeon performance:
Benchmark   Serial (sec)   Parallel (sec)
cg.B            876             666
is.B             14               9
lu.B           1437            1182
mg.B             58              49

• The Opteron scales better in all of the parallel runs above.
• LU, the most compute-intensive of these benchmarks, performs best serially on the Xeon.
Summary and Conclusions
• The Opteron has lower latency to main memory.
• Although the measured serial bandwidth is about the same, the Xeon has a higher efficiency (fraction of peak).
• The parallel bandwidth rates, and the two systems' scaling ability, are validated by the measured application performance.
• Computational kernels (DAXPY, DGEMM, etc.) show that the Xeon performs better for cache-resident data, while the Opteron scales better in 2P runs and for data in main memory.
• The Opteron system shows higher sustained bandwidth for all STREAM kernels, in both serial and multi-threaded execution.
• The FD, MD, and NPB applications validate most of the above findings.
Thanks!
• Newisys (Austin, TX): AMD Opteron system
• Dell (Austin, TX) & Cray: Intel Xeon system