Institute of Computing Technology Chinese Academy of Sciences
High-Efficient Architecture of Godson-T Many-Core Processor
Dongrui Fan, Hao Zhang, Da Wang, Xiaochun Ye, Fenglong Song, Junchao Zhang, Lingjun Fan
Advanced Micro-System Group
National Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
AMS
Overview
Microprocessor Architecture Challenges
Godson-T Features for Parallel Program
Godson-T Design and Implementation
Godson-T | HOT CHIPS 2011
Microprocessor Architecture Revolution
Memory wall + Power wall + ILP wall = Brick wall! What parallel architecture will be successful?
[Figure: performance versus power and complexity. Labels: Superscalar Processor Core, Simpler Processor Core, Multi-Core, Many-Core, ILP Wall, Power Wall.]
Parallel architectures come to the rescue: the superscalar processor is being replaced by on-chip multi-core / many-core processors.
Exploiting Polymorphic Parallelism
[Figure: design space between instruction-level parallelism and thread-level parallelism. Labels: Godson-3; With Powerful Streaming Unit; Heterogeneous Many-Core; With moderate DLP and strong TLP exploitation; Godson-T.]
Many-Core Challenges: The 5 P’s
Power efficiency challenge
› Performance per watt is the new metric: Dark Silicon will appear in 2020 when facing 11nm process technology
Performance challenge
› How to scale from 1 to 1000 cores: the number of cores is the new Megahertz
Programming challenge
› How to provide a converged many core solution in a standard programming environment
Parallelism challenge
› How to exploit parallelism through software and hardware co-design
Platform challenge
› How to maximize the usability, such as runtime system (OS), compiler & library, many-core debug…
What we could help……
Productivity Side:
Most sequential programmers are not ready to switch into parallel programming
› conservative programming model and make incremental improvement?
Unique programming model cannot solve all problems efficiently;
› support multiple programming features efficiently?
Locks are messy
› new way to eliminate deadlock?
……
Performance Side: Take this carefully, because it may affect productivity!
On-chip synchronization is fast;
› handle all synchronizations on-chip?
Flops are cheap, memory communications are expensive;
› trade flops for communication latency?
Fine-grained parallelism should be taken into consideration;
› enable data-driven thread execution on chip?
……
Motivation
Target at HPC
Godson-T: many cores to accelerate one program
[Figure: fine-grained concept. Labels: Method: thread division; Cost: data communication; Cost: thread synchronization.]
Overview
Microprocessor Architecture Challenges
› Memory wall + Power wall + ILP wall
› The 5P’s: Many-core Processor Challenges
Godson-T Features for Parallel Program
› Godson-T architecture overview
› Architectural supports for multithreading
› Software runtime system
Godson-T Design and Implementation
› Journey of design and implementation
› Software and hardware co-simulation
› Prototype
Architecture Overview of Godson-T
[Figure: Godson-T processor block diagram. Labels: Program; Runtime System; Godson-T Processor; CORE with L1$, Private Memory, and router (R); Processing Core with Private I-$, Fetch & Decode, Register Files, FP/MAC ALU, INT ALU, Vector ALU, Sync Unit, Local Memory (SPM, D$), DTA, Router; Synchronization Manager; L2 Cache Bank; I/O Controller; four Memory Controllers.]
Processing Core
ISA: MIPS (user), SIMD-ext., sync-ext.
8-stage pipeline
Dual-issue per thread
Expected SIMD
Fast level-1 memory
› 16KB private memory
› Automatically mapped into stack address space
› Full/empty bit tagged on each 64-bit slot enables efficient producer-consumer style synchronization
Communication with external modules through message packets
[Figure: processing core datapath. Labels: FETCH; Instruction Buffer; DECODE; Vector Register File (32 Byte, 2r/1w); Vector/Floating-Point Unit; Vector/Fixed-Point Unit; L0_Icache; L0_Dcache; Load/Store/Synchronization Unit; SPM/L1 Cache; Router. Link widths shown: 4 Byte, 8 Byte, 16 Byte, 32 Byte, 16 Byte in/out msg. Operations shown: Load, Store, Data Move, Arithmetic ……]
Interconnection
A separate router for each processing core
Static XY wormhole routing
Round-robin arbitration
Two independent physical networks
Duplex 128-bit link for each network
Guaranteed in-order point-to-point communication
Deadlock-free & livelock-free
Scalable and power-efficient for mesh topology
Low-latency core-to-core communication
Separate networks allow traffic segregation and tolerate burst DMA transfers
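Static XY routing can be sketched in a few lines of C. The code below is only an illustration of dimension-order routing in general, not the Godson-T router: the port names, the coordinate orientation (y grows toward the south) and the 8x8 mesh size are all assumptions made for the example. A packet first corrects its X coordinate and only then its Y coordinate, so it never turns from the Y dimension back into X, which is the usual argument for why this scheme is deadlock-free on a mesh.

```c
#include <assert.h>

#define MESH_DIM 8  /* assumption: 8x8 mesh; the slide does not give the size */

/* Output port chosen by a router; names are illustrative. */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

typedef struct { int x, y; } node_t;

static int in_mesh(node_t n) {
    return n.x >= 0 && n.x < MESH_DIM && n.y >= 0 && n.y < MESH_DIM;
}

/* Static XY routing: move along X until the column matches, then along Y. */
port_t xy_route(node_t cur, node_t dst) {
    assert(in_mesh(cur) && in_mesh(dst));
    if (dst.x > cur.x) return PORT_EAST;
    if (dst.x < cur.x) return PORT_WEST;
    if (dst.y > cur.y) return PORT_SOUTH;  /* assumption: y grows southward */
    if (dst.y < cur.y) return PORT_NORTH;
    return PORT_LOCAL;                     /* arrived: deliver to the core */
}

/* Follow the route hop by hop and count router-to-router hops. */
int xy_path_hops(node_t src, node_t dst) {
    int hops = 0;
    node_t cur = src;
    for (;;) {
        port_t p = xy_route(cur, dst);
        if (p == PORT_LOCAL) return hops;
        if (p == PORT_EAST) cur.x++;
        else if (p == PORT_WEST) cur.x--;
        else if (p == PORT_SOUTH) cur.y++;
        else cur.y--;
        hops++;
    }
}
```

Because the path between two nodes is fixed, packets between the same pair of nodes follow the same links, which is consistent with the in-order point-to-point guarantee listed above.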
Memory Hierarchy
Reg: each processing core with 32 fixed-point and 32 floating-point registers
Local Memory: 32KB local memory, including L1 D$ and SPM
L2 Cache: 16 address-interleaved L2 cache banks, 128KB each
Off-Chip Memory: 4 DDR3-1600 memory controllers
Data transfer bandwidth, from the top of the hierarchy down: 512GB/s, 128GB/s, 51.2GB/s
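The 51.2GB/s off-chip figure is consistent with four DDR3-1600 channels. The helper below is a back-of-the-envelope sketch, not something taken from the slides: it assumes each memory controller drives a standard 64-bit (8-byte) data bus, and the function name is made up for the example. DDR3-1600 performs 1600 mega-transfers per second.

```c
#include <assert.h>

/* Peak bandwidth in GB/s for a number of identical memory channels.
 * mega_transfers_per_s: e.g. 1600 for DDR3-1600.
 * bytes_per_transfer:   data-bus width in bytes (assumed 8 for a 64-bit bus). */
double peak_bandwidth_gbs(int controllers, double mega_transfers_per_s,
                          int bytes_per_transfer) {
    assert(controllers > 0 && bytes_per_transfer > 0);
    return controllers * mega_transfers_per_s * bytes_per_transfer / 1000.0;
}
```

With these assumptions, `peak_bandwidth_gbs(4, 1600.0, 8)` evaluates 4 x 1600 x 8 / 1000, which matches the 51.2GB/s shown for the lowest level of the hierarchy.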
Power Management
[Figure: grid of cores, each labeled RT]
• Monitor status of each core at program level
• Shut down or turn on cores separately
• Reassign tasks
Thread Communication & Synchronization
Lock-Based Cache Coherent Protocol
Lazy cache coherent protocol
› Pure mutual-exclusion synchronization instruction without memory accesses;
› Eliminate busy-waiting for locks;
› More scalable than bus-snoopy and directory-based cache protocol;
› Enable hardware deadlock detecting.

    int mutual_func_a() {
        ……
        acquire_lock(LOCK_VAL);
        X = 3;
        release_lock(LOCK_VAL);
        ……
    }

    int mutual_func_b() {
        ……
        acquire_lock(LOCK_VAL);
        Y = X;
        Z = X;
        release_lock(LOCK_VAL);
        ……
    }

[Figure: two Mini Cores, each with a Private Cache, connected through the Interconnect to the Synchronization Manager and the Shared L2 Cache. Events shown: ACQ., REL., ACK. Lock states shown: CORE 0: UNLOCK, LOCKED; CORE 1: WAITING, UNLOCK, LOCKED. Values shown for X: X=0 and X=3.]
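The lock states in the figure (UNLOCK, LOCKED, WAITING, then ACK) suggest that a core which loses the race for a lock is parked until the lock is handed to it, rather than spinning on a memory location. The following C model is a minimal software sketch of that idea under stated assumptions: the FIFO hand-over order, the `sm_` names and the 64-core limit are hypothetical and are not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CORES 64  /* assumption for the example */

/* One lock entry as a synchronization manager might track it. */
typedef struct {
    int owner;               /* core holding the lock, -1 when unlocked */
    int waiters[NUM_CORES];  /* ring buffer of waiting cores (FIFO assumed) */
    int head, count;
} sm_lock_t;

void sm_init(sm_lock_t *l) { l->owner = -1; l->head = 0; l->count = 0; }

/* Returns true if the lock is granted at once. Otherwise the core is queued
 * and would stall (no busy-waiting) until sm_release hands it the lock. */
bool sm_acquire(sm_lock_t *l, int core) {
    assert(core >= 0 && core < NUM_CORES);
    if (l->owner < 0) { l->owner = core; return true; }
    assert(l->count < NUM_CORES);
    l->waiters[(l->head + l->count) % NUM_CORES] = core;
    l->count++;
    return false;
}

/* Releases the lock and returns the core that now owns it (the one that
 * would receive the ACK), or -1 if nobody was waiting. */
int sm_release(sm_lock_t *l, int core) {
    assert(l->owner == core);
    if (l->count == 0) { l->owner = -1; return -1; }
    l->owner = l->waiters[l->head];
    l->head = (l->head + 1) % NUM_CORES;
    l->count--;
    return l->owner;
}
```

In terms of the slide's example, core 0 running `mutual_func_a` would get `true` from `sm_acquire`, core 1 running `mutual_func_b` would get `false` and wait, and `sm_release` by core 0 would return 1.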
Thread Communication & Synchronization
Data Transfer Agent (DTA)
› Programmable asynchronous data transfer agent
› Supports vertical and horizontal DTA operations (such as prefetch)
› Data transfers between multidimensional addresses (such as matrix inversion)
› Network load awareness with automatic flow control (improves bandwidth efficiency)
› Supports fine-grain synchronous operations
[Figure (a): vertical and horizontal DTA operations. Labels: SPM, DTA, Router, on-chip network, L2 Cache, off-chip memory, DTA horizontal operation, DTA vertical operation.]
[Figure (b): 2D strided DTA operations. Labels: involved data block address, chunk stride, block, block stride.]
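A 2D strided transfer can be illustrated with a plain software gather. The sketch below is not the DTA programming interface: the descriptor layout, field names and the choice to pack the data contiguously at the destination are assumptions made for the example. It only shows how a chunk stride and a block stride together pick scattered chunks out of a larger array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for a 2D strided transfer. */
typedef struct {
    size_t chunk_size;    /* bytes in one contiguous chunk */
    size_t chunk_stride;  /* bytes between starts of consecutive chunks in a block */
    size_t chunks;        /* chunks per block */
    size_t block_stride;  /* bytes between starts of consecutive blocks */
    size_t blocks;        /* number of blocks */
} dta_2d_desc_t;

/* Gathers every chunk described by d from src into a contiguous dst buffer.
 * Returns the number of bytes written. */
size_t dta_gather_2d(void *dst, const void *src, const dta_2d_desc_t *d) {
    assert(d->chunk_stride >= d->chunk_size);
    unsigned char *out = dst;
    const unsigned char *in = src;
    for (size_t b = 0; b < d->blocks; b++) {
        for (size_t c = 0; c < d->chunks; c++) {
            memcpy(out, in + b * d->block_stride + c * d->chunk_stride,
                   d->chunk_size);
            out += d->chunk_size;
        }
    }
    return (size_t)(out - (unsigned char *)dst);
}
```

For instance, with 2-byte chunks, a chunk stride of 4, two chunks per block, a block stride of 16 and two blocks, the gather reads source offsets 0, 4, 16 and 20 and writes 8 bytes in total.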
Synchronized DTA Operations
[Figure: Source Node, Traffic Awareness Node, Target Node]
[Figure: state diagram of the full/empty bit]
(s): successfully perform on the state of full/empty bit
(f): failed to perform on the state of full/empty bit
State 1: load.sync (s); *load.future (s) / store.future (s) / store.sync (f)
State 0: store.sync (s); load.future (f) / store.future (f) / load.sync (f)
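The success and failure conditions above can be modeled in software as a small state machine. In the C sketch below, which operations succeed in which state follows the slide; everything else is an assumption made for illustration: that a successful store.sync sets the bit, that a successful load.sync clears it, that load.future and store.future leave it unchanged, and the type and function names themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A 64-bit slot tagged with a full/empty bit. */
typedef struct { uint64_t value; bool full; } fe_slot_t;

/* store.sync: succeeds only on an empty slot; assumed to mark it full. */
bool store_sync(fe_slot_t *s, uint64_t v) {
    assert(s);
    if (s->full) return false;
    s->value = v;
    s->full = true;
    return true;
}

/* load.sync: succeeds only on a full slot; assumed to mark it empty. */
bool load_sync(fe_slot_t *s, uint64_t *out) {
    assert(s && out);
    if (!s->full) return false;
    *out = s->value;
    s->full = false;
    return true;
}

/* load.future: succeeds only on a full slot; assumed to leave the bit set. */
bool load_future(const fe_slot_t *s, uint64_t *out) {
    assert(s && out);
    if (!s->full) return false;
    *out = s->value;
    return true;
}

/* store.future: succeeds only on a full slot (per the slide); assumed to
 * overwrite the value and leave the bit set. */
bool store_future(fe_slot_t *s, uint64_t v) {
    assert(s);
    if (!s->full) return false;
    s->value = v;
    return true;
}
```

Under these assumptions a producer calling `store_sync` and a consumer calling `load_sync` hand a value over exactly once per slot, which is the producer-consumer pattern the full/empty bit is meant to support.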
Evaluating On-chip DTA
Performance comparisons of SGEMM and 1-D FFT

Kernels   Metric                 Godson-T    Cell        Cyclops-64   GTX8800
SGEMM     Efficiency             95.9% (1)   99.9% (2)   43.4%        60.0%
SGEMM     Performance (GFLOPS)   122.81      204.7       13.9         206.0
1-D FFT   Efficiency             33.2%       20.4%       25.8%        29.9%
1-D FFT   Performance (GFLOPS)   63.72       41.8        20.7         155.0

(1) The SGEMM kernel contains only multiply-and-add operations, so the ideal peak performance is measured by the multiply-and-add function unit, which is 128GFLOPS.
(2) Efficiency of SGEMM on Cell is slightly better than that on Godson-T, because the 256KB SPM of each SPE on Cell makes better use of data locality.

[Figure: bar chart of Performance (GFLOPS) for SGEMM and 1-D FFT. Series: Cache; SPM, without DTA; SPM, with DTA; Theoretical. Bar values as extracted: 122.8, 127.5, 72.8, 72.9, 64.8, 63.7, 24.7, 19.7.]
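The efficiency figures are achieved performance divided by peak performance. As a quick check of the Godson-T SGEMM entry, the small helper below (a hypothetical name, written only for this illustration) divides the reported 122.81 GFLOPS by the 128 GFLOPS multiply-and-add peak given in the footnote.

```c
#include <assert.h>

/* Efficiency in percent: achieved performance relative to peak. */
double efficiency_percent(double achieved_gflops, double peak_gflops) {
    assert(peak_gflops > 0.0);
    return 100.0 * achieved_gflops / peak_gflops;
}
```

`efficiency_percent(122.81, 128.0)` is about 95.9, in line with the efficiency reported for SGEMM on Godson-T.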
Evaluating Synchronization without Memory
[Figure: barrier evaluation. Time (us, logarithmic scale) versus # of threads (0 to 64). Series: FAA-based, SM-based, Pthread. Panels: (a) barrier overhead without workload; (b) barrier overhead with load imbalancing; (c) average time for each load.]
[Figure: lock evaluation. Time (us, logarithmic scale) versus # of threads (0 to 64). Series: SM-Based, Pthread. Panels: (a) lock overhead without lock contention; (b) lock transferring overhead; (c) average time of each load.]
Evaluating Full-empty Bit Synchronization
[Figure: Speedup of 2-D Wavefront. Speedup normalized to serial performance versus # of threads (0 to 64). Series: coarse-grain sync., fine-grain sync., synchronized DTA. A second axis is labeled Normalized Speedup (0 to 120).]
Livermore loop 6 for ( i=1 ; i