High-Efficient Architecture of Godson-T Many-Core Processor

Author: June Morrison
Dongrui Fan, Hao Zhang, Da Wang, Xiaochun Ye, Fenglong Song, Junchao Zhang, Lingjun Fan
Advanced Micro-System Group
National Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences

Overview

• Microprocessor Architecture Challenges
• Godson-T Features for Parallel Program
• Godson-T Design and Implementation

Godson-T | HOT CHIPS 2011

Microprocessor Architecture Revolution 

Memory wall + Power wall + ILP wall = Brick wall! What parallel architecture will be successful?

[Figure: performance versus power and complexity. Labels: Power Wall, ILP Wall, Superscalar Processor Core, Simpler Processor Core, Multi-Core, Many-Core]

Parallel architectures come to the rescue: the superscalar processor is being replaced by on-chip multi-core and many-core processors.

Exploiting Polymorphic Parallelism

[Figure: design space between instruction-level parallelism and thread-level parallelism. Labels: Godson-3; With Powerful Streaming Unit; Heterogeneous Many-Core; With moderate DLP and strong TLP exploitation; Godson-T]

Many-Core Challenges: The 5 P’s

• Power efficiency challenge
  › Performance per watt is the new metric; dark silicon will appear in 2020 when facing 11nm process technology
• Performance challenge
  › How to scale from 1 to 1000 cores; the number of cores is the new Megahertz
• Programming challenge
  › How to provide a converged many-core solution in a standard programming environment
• Parallelism challenge
  › How to exploit parallelism through software and hardware co-design
• Platform challenge
  › How to maximize usability: runtime system (OS), compiler & library, many-core debug…

What we could help with

Productivity side:

• Most sequential programmers are not ready to switch to parallel programming
  › Keep a conservative programming model and make incremental improvements?
• A single programming model cannot solve all problems efficiently
  › Support multiple programming features efficiently?
• Locks are messy
  › A new way to eliminate deadlock?
• ……

Performance side (take this carefully, because it may affect productivity!):

• On-chip synchronization is fast
  › Handle all synchronizations on-chip?
• Flops are cheap, memory communications are expensive
  › Trade flops for communication latency?
• Fine-grained parallelism should be taken into consideration
  › Enable data-driven thread execution on chip?
• ……

Motivation

• Target: HPC
• Godson-T: many cores to accelerate one program

[Figure: method versus cost. Method: fine-grained concept, thread division. Cost: data communication, thread synchronization]

Overview

• Microprocessor Architecture Challenges
  › Memory wall + Power wall + ILP wall
  › The 5P’s: Many-core Processor Challenges
• Godson-T Features for Parallel Program
  › Godson-T architecture overview
  › Architectural supports for multithreading
  › Software runtime system
• Godson-T Design and Implementation
  › Journey of design and implementation
  › Software and hardware co-simulation
  › Prototype

Architecture Overview of Godson-T

[Figure: Godson-T block diagram. Software layers: Program, Runtime System. Chip-level labels: Godson-T Processor, CORE with L1$ and Private Memory, router (R), L2 Cache Bank, Synchronization Manager, I/O Controller, four Memory Controllers. Processing core labels: Private I-$, Fetch & Decode, Register Files, FP/INT ALU, MAC ALU, Sync Unit, Vector ALU, Local Memory (SPM, D$), DTA, Router]

Processing Core

• ISA: MIPS (user), SIMD-ext., sync-ext.
• 8-stage pipeline
• Dual-issue per thread
• Expected SIMD
• Fast level-1 memory
• 16KB private memory
• Automatically mapped into stack address space
• Full/empty bit tagged on each 64-bit slot enables efficient producer-consumer style synchronization
• Communication with external modules through message packets

[Figure: processing core datapath. Labels: FETCH, Instruction Buffer, DECODE, Vector Register File (32 bytes, 2r/1w), Vector/Floating-Point Unit, Vector/Fixed-Point Unit, Load/Store/Synchronization Unit, L0_Icache, L0_Dcache, SPM/L1 Cache, Router (16-byte in/out messages); link widths of 4, 8, 16 and 32 bytes; operations: Load, Store, Data Move, Arithmetic ……]

Interconnection

• Separated routers with each processing core
  › Static XY wormhole routing
  › Round-robin arbitration
  › Two independent physical networks
  › Duplex 128-bit link for each network
• Guaranteed in-order point-to-point communication
  › Deadlock-free & livelock-free
  › Scalable and power-efficient for MESH topology
  › Low-latency core-to-core communication
  › Separated networks allow traffic segregation and tolerate burst DMA transfers
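Static XY routing can be illustrated with a minimal sketch. It assumes an 8x8 mesh with (0,0) in the top-left corner; the mesh size, the port names and the function names are hypothetical and not taken from the slides. A packet first corrects its X offset and only then its Y offset, so it never turns from the Y dimension back into X, which is the usual argument for deadlock freedom of dimension-order routing on a mesh.

```c
#include <assert.h>

/* Assumed mesh dimensions, for illustration only. */
enum { MESH_W = 8, MESH_H = 8 };

typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Output port chosen at router (cx, cy) for a packet heading to (dx, dy).
 * X is resolved before Y, so every packet follows the same fixed path. */
port_t xy_next_port(int cx, int cy, int dx, int dy)
{
    assert(cx >= 0 && cx < MESH_W && cy >= 0 && cy < MESH_H);
    assert(dx >= 0 && dx < MESH_W && dy >= 0 && dy < MESH_H);
    if (dx > cx) return PORT_EAST;
    if (dx < cx) return PORT_WEST;
    if (dy > cy) return PORT_SOUTH;   /* y grows downwards in this sketch */
    if (dy < cy) return PORT_NORTH;
    return PORT_LOCAL;                /* arrived: deliver to the core */
}

/* Walk the route hop by hop and count the links traversed. */
int xy_hop_count(int sx, int sy, int dx, int dy)
{
    int hops = 0;
    for (;;) {
        port_t p = xy_next_port(sx, sy, dx, dy);
        if (p == PORT_LOCAL) break;
        if (p == PORT_EAST) sx++;
        else if (p == PORT_WEST) sx--;
        else if (p == PORT_SOUTH) sy++;
        else sy--;
        hops++;
    }
    return hops;
}
```

The hop count always equals the Manhattan distance between source and destination, and two packets between the same pair of nodes take the same path, which is what makes in-order point-to-point delivery straightforward.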

Memory Hierarchy: Data Transfer Bandwidth

Reg: each processing core has 32 fixed-point and 32 floating-point registers
  | 512GB/s
Local Memory: 32KB local memory, including L1 D$ and SPM
  | 128GB/s
L2 Cache: 16 address-interleaved L2 cache banks, 128KB each
  | 51.2GB/s
Off-Chip Memory: 4 DDR3-1600 memory controllers
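Two numbers on this slide can be checked with a small sketch. It assumes that each memory controller drives one standard 64-bit DDR3 channel and that the L2 banks are interleaved at a 64-byte granularity; neither assumption is stated on the slide, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { L2_BANKS = 16 };     /* from the slide */
enum { LINE_BYTES = 64 };   /* assumption: interleaving granularity */

/* Bank selected by the low-order bits of the line address (assumed mapping). */
unsigned l2_bank_of(uint64_t paddr)
{
    return (unsigned)((paddr / LINE_BYTES) % L2_BANKS);
}

/* Peak bandwidth in MB/s of `channels` DDR3 channels, each with a 64-bit
 * (8-byte) data bus running at `mega_transfers` MT/s. */
unsigned long ddr3_peak_mb_per_s(unsigned channels, unsigned mega_transfers)
{
    assert(channels > 0);
    return (unsigned long)channels * mega_transfers * 8UL;
}
```

With four DDR3-1600 channels this gives 4 x 1600 x 8 = 51200 MB/s, which matches the 51.2GB/s off-chip figure on the slide.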

Power Management

• Monitor status of each core at program level
• Shut down or turn on cores separately
• Reassign tasks

[Figure: grid of 16 tiles, each labeled RT]


Thread Communication & Synchronization

Lock-Based Cache Coherence Protocol

• Lazy cache coherence protocol
• Pure mutual-exclusion synchronization instructions without memory accesses
• Eliminates busy-waiting for locks
• More scalable than bus-snooping and directory-based cache coherence protocols
• Enables hardware deadlock detection

    int mutual_func_a() {
        ……
        acquire_lock(LOCK_VAL);
        X = 3;
        release_lock(LOCK_VAL);
        ……
    }

    int mutual_func_b() {
        ……
        acquire_lock(LOCK_VAL);
        Y = X;
        Z = X;
        release_lock(LOCK_VAL);
        ……
    }

[Figure: two mini cores with private caches, connected through the interconnect to the Synchronization Manager and the shared L2 cache. Core 0 timeline: UNLOCK, LOCKED. Core 1 timeline: WAITING, UNLOCK, LOCKED. Messages: ACQ., REL., ACK. The value of X changes from 0 to 3, first in a private cache and then in the shared L2 cache]
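The behaviour in the figure can be approximated by a single-threaded simulation. This is a sketch under an assumption the slide does not spell out: that releasing a lock writes dirty data back to the shared L2 cache and acquiring a lock invalidates the acquirer's private copy, which is one common way to build a lazy, lock-based coherence scheme. All type and function names (`system_t`, `sim_acquire`, `sim_release`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { NCORES = 2 };

typedef struct { bool valid, dirty; int value; } line_t;

typedef struct {
    int    l2_x;            /* shared L2 copy of X */
    line_t cache[NCORES];   /* each core's private copy of X */
    int    lock_owner;      /* -1 when the lock is free */
} system_t;

void sys_init(system_t *s, int x0)
{
    s->l2_x = x0;
    s->lock_owner = -1;
    for (int c = 0; c < NCORES; c++)
        s->cache[c] = (line_t){ false, false, 0 };
}

/* Read X: hit in the private cache if valid, otherwise fetch from L2. */
int core_read(system_t *s, int core)
{
    line_t *l = &s->cache[core];
    if (!l->valid) *l = (line_t){ true, false, s->l2_x };
    return l->value;
}

/* Write X into the private cache only; no message is sent to other cores. */
void core_write(system_t *s, int core, int v)
{
    s->cache[core] = (line_t){ true, true, v };
}

/* Ask the synchronization manager for the lock. Returns false if the core
 * has to wait. On success the private copy is dropped (after saving any
 * dirty data) so the critical section sees the latest L2 value. */
bool sim_acquire(system_t *s, int core)
{
    assert(core >= 0 && core < NCORES);
    if (s->lock_owner != -1) return false;
    s->lock_owner = core;
    line_t *l = &s->cache[core];
    if (l->valid && l->dirty) s->l2_x = l->value;
    l->valid = false;
    l->dirty = false;
    return true;
}

/* Release the lock: write dirty data back to L2, then free the lock. */
void sim_release(system_t *s, int core)
{
    assert(s->lock_owner == core);
    line_t *l = &s->cache[core];
    if (l->valid && l->dirty) { s->l2_x = l->value; l->dirty = false; }
    s->lock_owner = -1;
}
```

In this model a core that holds a stale copy keeps reading it until its next acquire, which is the "lazy" part: coherence actions happen only at lock boundaries instead of on every write.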

Thread Communication & Synchronization

Data Transfer Agent (DTA)

• Programmable asynchronous data transfer agent
• Supports vertical and horizontal DTA operations (such as prefetch)
• Data transfers between multi-dimensional addresses (such as matrix inversion)
• Network load awareness with automatic flow control (improves bandwidth efficiency)
• Supports fine-grain synchronous operations

[Figure (a): vertical and horizontal DTA operations. Horizontal: SPM to SPM through the DTAs and routers over the on-chip network. Vertical: between SPM, L2 cache and off-chip memory]

[Figure (b): 2D strided DTA operations. Labels: block, block stride, chunk stride, involved data block address]
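A 2D strided transfer of the kind shown in figure (b) can be sketched in software as a gather into a contiguous buffer. The descriptor layout below is an assumption made for illustration: a block is taken to be a contiguous run of bytes, blocks inside a chunk are separated by the block stride, and chunks are separated by the chunk stride. The real DTA programming interface is not described on the slide, and `dta_desc_t` and `dta_gather` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for a 2D strided transfer. */
typedef struct {
    size_t block_bytes;       /* contiguous bytes per block */
    size_t blocks_per_chunk;  /* number of blocks in one chunk */
    size_t block_stride;      /* bytes between block starts inside a chunk */
    size_t chunks;            /* number of chunks */
    size_t chunk_stride;      /* bytes between chunk starts */
} dta_desc_t;

/* Gather the strided source blocks into a contiguous destination buffer.
 * Returns the number of bytes written to dst. */
size_t dta_gather(void *dst, const void *src, const dta_desc_t *d)
{
    assert(d->block_stride >= d->block_bytes);  /* blocks must not overlap */
    unsigned char *out = dst;
    const unsigned char *in = src;
    for (size_t c = 0; c < d->chunks; c++) {
        for (size_t b = 0; b < d->blocks_per_chunk; b++) {
            const unsigned char *blk =
                in + c * d->chunk_stride + b * d->block_stride;
            memcpy(out, blk, d->block_bytes);
            out += d->block_bytes;
        }
    }
    return (size_t)(out - (unsigned char *)dst);
}
```

For a 4x4 byte matrix stored row-major, a descriptor with 2-byte blocks, a block stride of 4 and two chunks of two blocks each picks the first two columns of every row, which is the access pattern a tiled matrix kernel would hand to the agent while the core keeps computing.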

Synchronized DTA Operations

[Figure: synchronized DTA transfer between a Source Node, a Traffic Awareness Node and a Target Node]

[Figure: state diagram of the full/empty bit with states 0 and 1. Transition labels: store.sync (s); load.sync (s); *load.future (s) / store.future (s) / store.sync (f); load.future (f) / store.future (f) / load.sync (f)]

(s): successfully perform on the state of full/empty bit
(f): failed to perform on the state of full/empty bit
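The `store.sync` and `load.sync` transitions can be modelled with a one-slot sketch. It assumes the usual full/empty-bit semantics: a synchronized store succeeds only on an empty slot and marks it full, and a synchronized load succeeds only on a full slot and marks it empty. The `load.future` and `store.future` variants are left out because the slide does not define them. In hardware a failed operation would typically wait or retry; here it simply returns `false`. The names `fe_slot_t`, `store_sync` and `load_sync` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One 64-bit memory slot tagged with a full/empty bit. */
typedef struct {
    uint64_t data;
    bool     full;   /* false = state 0 (empty), true = state 1 (full) */
} fe_slot_t;

/* Producer side: succeeds only if the slot is empty, then marks it full. */
bool store_sync(fe_slot_t *s, uint64_t v)
{
    assert(s != 0);
    if (s->full) return false;   /* store.sync (f) */
    s->data = v;
    s->full = true;              /* store.sync (s): 0 -> 1 */
    return true;
}

/* Consumer side: succeeds only if the slot is full, then marks it empty. */
bool load_sync(fe_slot_t *s, uint64_t *out)
{
    assert(s != 0 && out != 0);
    if (!s->full) return false;  /* load.sync (f) */
    *out = s->data;
    s->full = false;             /* load.sync (s): 1 -> 0 */
    return true;
}
```

A producer and a consumer that alternate these two calls on the same slot hand over one value at a time without any separate lock or flag variable.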

Evaluating On-chip DTA

Benchmark results by processor (performance in GFLOPS):

Kernels   Metric        Godson-T    Cell        Cyclops-64   GTX8800
SGEMM     Efficiency    95.9% (1)   99.9% (2)   43.4%        60.0%
          Performance   122.81      204.7       13.9         206.0
1-D FFT   Efficiency    33.2%       20.4%       25.8%        29.9%
          Performance   63.72       41.8        20.7         155.0

(1) The SGEMM kernel contains only multiply-and-add operations, so the ideal peak performance is measured by the multiply-and-add functional unit, which is 128GFLOPS.
(2) The efficiency of SGEMM on Cell is slightly better than on Godson-T because the 256KB SPM of each SPE on Cell makes better use of data locality.

[Figure: bar chart, "Performance comparisons of SGEMM and 1-D FFT". Y-axis: Performance (GFLOPS), 0 to 130. Series: Cache; SPM, without DTA; SPM, with DTA; Theoretical. Bar labels: 122.8, 127.5, 72.8, 72.9, 64.8, 63.7, 24.7, 19.7]
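The efficiency column is achieved performance divided by peak performance. As a quick check with the numbers given for Godson-T SGEMM (122.8 GFLOPS achieved, 128 GFLOPS multiply-and-add peak), a minimal sketch follows; the function name is hypothetical.

```c
#include <assert.h>

/* Efficiency in percent: achieved GFLOPS relative to peak GFLOPS. */
double efficiency_pct(double achieved_gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return 100.0 * achieved_gflops / peak_gflops;
}
```

Calling `efficiency_pct(122.8, 128.0)` gives about 95.94, which rounds to the 95.9% reported for SGEMM on Godson-T.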

Evaluating Synchronization without Memory

[Figure: barrier overhead versus number of threads (0 to 64); y-axis: Time (us), logarithmic; series: FAA-based, SM-based, Pthread. Panels: (a) barrier overhead without workload; (b) barrier overhead with load imbalance; (c) average time for each load]

[Figure: lock overhead versus number of threads (0 to 64); y-axis: Time (us), logarithmic; series: SM-based, Pthread. Panels: (a) lock overhead without lock contention; (b) lock transferring overhead; (c) average time of each load]

Evaluating Full-empty Bit Synchronization

[Figure: "Speedup of 2-D Wavefront". Y-axis: speedup normalized to serial performance; x-axis: # of threads (0 to 64); series: coarse-grain sync., fine-grain sync., synchronized DTA]

[Figure: normalized speedup, y-axis 0 to 120]

Livermore loop 6: for ( i=1 ; i
