CS533: Processor in memory (PIM)
Josep Torrellas
University of Illinois at Urbana-Champaign

March 31, 2015

Josep Torrellas (UIUC)

CS533: Lecture 18

March 31, 2015

1 / 20

Today’s Paper

A Case for Intelligent RAM
D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick
University of California, Berkeley

Main Idea

• The memory system is the greatest inhibitor of performance (low bandwidth, high latency)
• Therefore: integrate a high-performance processor and DRAM main memory on a single chip
  • Low latency: 20-30 ns instead of 300 ns
  • High bandwidth: 128 bytes instead of 16 bytes
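The figures on this slide can be turned into improvement factors with simple arithmetic. The sketch below is only illustrative: the constant and function names are made up here, and the numbers are the ones quoted on the slide.

```python
# Figures quoted on the slide. All names are illustrative, not from the paper.
OFF_CHIP_LATENCY_NS = 300
ON_CHIP_LATENCY_NS = (20, 30)   # quoted range, in ns
OFF_CHIP_WIDTH_BYTES = 16
ON_CHIP_WIDTH_BYTES = 128


def latency_improvement():
    """Return (smallest, largest) latency reduction factor over the quoted range."""
    fastest, slowest = ON_CHIP_LATENCY_NS
    return OFF_CHIP_LATENCY_NS / slowest, OFF_CHIP_LATENCY_NS / fastest


def width_improvement():
    """Return how many times wider the on-chip memory interface is."""
    return ON_CHIP_WIDTH_BYTES / OFF_CHIP_WIDTH_BYTES


if __name__ == "__main__":
    print(latency_improvement())  # (10.0, 15.0)
    print(width_improvement())    # 8.0
```

Under these numbers, latency drops by a factor of 10 to 15 and each access moves 8 times as many bytes.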

Tradeoffs

Advantages
• High bandwidth to memory
• Low latency to memory
• Energy consumption in the memory system decreases several times
  • Reduction of off-chip accesses (high capacitance)
• Fewer pins necessary on the chip (currently, most pins are used for the wide memory interface)
• Smaller packages thanks to fewer pins
• Regular chip layout, more dense

Tradeoffs (Continued)

Advantage
• Can be combined with any processor organization

Disadvantages
• Conventional processor designs gain little from IRAM because they were designed under the assumption of a slow memory system → need to open up the memory
• If the programming model is too revolutionary, old apps will not run

Layout

[Figure: chip (memory) floorplan. Vector unit (4M transistors); CPU + caches (~3M transistors); 48 MB memory (800M transistors)]

• DRAM technology (memory) is much more dense than SRAM technology (caches): 16 to 32 times more ⇒ more storage on chip
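A rough way to read the density claim and the transistor counts is sketched below. The 2 MB SRAM figure in the usage lines is a hypothetical input chosen for illustration, the function names are invented here, and the assumption that capacity scales linearly with the density ratio is a simplification.

```python
def dram_mb_in_sram_area(sram_mb, density_ratio):
    """DRAM capacity (MB) that fits in the area of `sram_mb` MB of SRAM.

    Assumes capacity scales linearly with the density ratio (a simplification).
    """
    return sram_mb * density_ratio


def memory_transistor_share(vector_m=4, cpu_m=3, memory_m=800):
    """Fraction of the chip's transistors spent on memory.

    Defaults are the counts (in millions) shown on the layout slide.
    """
    return memory_m / (vector_m + cpu_m + memory_m)


if __name__ == "__main__":
    # Hypothetical 2 MB of SRAM replaced by DRAM at the quoted 16x and 32x ratios.
    for ratio in (16, 32):
        print(ratio, dram_mb_in_sram_area(2, ratio))
    print(round(memory_transistor_share(), 3))
```

With the slide's counts, memory accounts for roughly 99% of the transistors on the chip.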

Vector Processors

↑ capacity, ↑ bandwidth ⇒ vector processing would work well

[Figure: vector unit executing MULTV V1,V2,V3 on vector registers V1, V2, V3]

− Need explicit compilation into vector code
+ It is claimed that many multimedia apps will be vectorizable, e.g. MMX can be considered a modest vector unit
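To make the vector instruction concrete, here is a minimal sketch of what a single MULTV could compute, next to the scalar loop it replaces. It assumes a destination-first reading (V1 = V2 * V3, element by element), which the slide does not state, and models registers as plain Python lists.

```python
def multv(v2, v3):
    """Element-wise multiply of two vector registers; models one MULTV.

    Assumption: MULTV V1,V2,V3 means V1[i] = V2[i] * V3[i] for every i.
    """
    if len(v2) != len(v3):
        raise ValueError("vector registers must have the same length")
    return [a * b for a, b in zip(v2, v3)]


def scalar_multiply(v2, v3):
    """The same computation written as the scalar loop a non-vector CPU runs."""
    result = []
    for i in range(len(v2)):
        result.append(v2[i] * v3[i])  # one multiply instruction per element
    return result


if __name__ == "__main__":
    v2, v3 = [1, 2, 3, 4], [10, 20, 30, 40]
    v1 = multv(v2, v3)
    print(v1)  # [10, 40, 90, 160]
```

The point of the vector form is that one instruction names the whole operation, so the hardware can stream operands from wide, high-bandwidth memory instead of issuing one instruction per element.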

Advantages of IRAM

• Higher bandwidth
• Lower latency
• Energy efficiency
• Memory size and width → free organization
• Board space (small)

Disadvantages of IRAM

• Area and speed of logic in a DRAM process technology
  • Area: 30%-70% larger; speed: 30%-70% slower
• Area and power impact of increasing bandwidth to the DRAM core
• Retention time of DRAM when operating at high temperature
  • Retention time halves for every 10 °C increase
  • Refresh rates must increase at high temperature
• Scaling a system beyond a single IRAM
• Matching IRAM to the commodity focus of the DRAM industry
• Testing (single processor, everyone?)
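The temperature rule above (retention halves for every 10 °C) implies that the required refresh rate doubles for every 10 °C. The sketch below works this out; the baseline of 64 ms at 45 °C in the usage lines is a hypothetical value picked for illustration, not a figure from the slides.

```python
def retention_time_ms(temp_c, base_retention_ms, base_temp_c):
    """Retention time at `temp_c`, halving for every 10 degrees C above the baseline."""
    return base_retention_ms * 0.5 ** ((temp_c - base_temp_c) / 10.0)


def refresh_rate_multiplier(temp_c, base_temp_c):
    """How many times more often cells must be refreshed than at the baseline."""
    return 2.0 ** ((temp_c - base_temp_c) / 10.0)


if __name__ == "__main__":
    # Hypothetical baseline: 64 ms retention at 45 degrees C.
    print(retention_time_ms(65, base_retention_ms=64, base_temp_c=45))  # 16.0
    print(refresh_rate_multiplier(65, base_temp_c=45))                  # 4.0
```

This matters for IRAM because the processor logic on the same die raises the operating temperature of the DRAM array.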

Challenges to IRAM

1. Fabrication process is tough
  • DRAM fabrication technology is different from logic (microprocessor) technology
  • You get slower microprocessors
  • You need to complicate the design to keep noise from switching logic away from the memory array
  • Refresh rates increase as temperature increases
2. Bounded amount of DRAM: soon 96 MBytes
  • OK for portable computers
  • Not OK for workstations
  • What about multiple IRAMs?

Evaluation

Evaluation of Existing Architectures in IRAM Systems
Christoforos Kozyrakis, Ngeci Bowman, Neal Cardwell, Cynthia Romer and Helen Wang
Workshop on “Mixing Logic and DRAM”, ISCA 1997

Motivation

Intelligent RAM (IRAM) promises:
• High memory bandwidth (100x)
• Low memory latency (0.1x)
• High energy efficiency (4x)
• Higher system integration

Which microprocessor architecture can turn these advantages into significant application performance benefits?

[Figure: µProcessor, bus, and DRAM]

Evolutionary IRAM Approach

Use an existing processor architecture:
• Simple RISC micro, superscalar, or out-of-order execution organization

Advantages:
• Good knowledge of how to design and implement them
• Performance trade-offs are well understood
• “Out of the box” solutions both for system software and applications: software compatibility
• Higher performance by tuning programs and compilers to new memory hierarchy characteristics

This work: evaluate potential performance benefits of this approach

IRAM Architectural Considerations

• IRAM systems using existing DRAM technology:
  • 256 Mbit DRAM, 0.25 µm CMOS process
  • 1/4 of die area for the microprocessor
  • Up to 24 MBytes of on-chip DRAM
• Memory access latency can be as low as 21 ns
• Logic speed potentially 10% to 50% slower compared to conventional processors for initial implementations
• No L2 cache necessary since on-chip DRAM can have comparable latency
• Memory bus as wide as a cache line
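The 24 MByte figure follows from the other two numbers on the slide if DRAM capacity is assumed to scale linearly with die area, which is a simplifying assumption made here. A small check, with an invented function name:

```python
from fractions import Fraction


def on_chip_dram_mbytes(dram_mbit=256, logic_fraction=Fraction(1, 4)):
    """DRAM left (in MBytes) after giving `logic_fraction` of the die to logic.

    Assumes capacity scales linearly with die area (a simplification).
    """
    total_mbytes = Fraction(dram_mbit, 8)          # 256 Mbit = 32 MBytes
    return total_mbytes * (1 - logic_fraction)     # 3/4 of the die stays DRAM


if __name__ == "__main__":
    print(on_chip_dram_mbytes())  # 24
```

A full 256 Mbit die holds 32 MBytes; keeping three quarters of it as DRAM leaves 24 MBytes.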

Method II: Detailed System Simulations

• Used SimOS to simulate simple MIPS R4000-based IRAM and conventional architectures
• Equal die size comparison:
  • Area for on-chip DRAM in IRAM systems is the same as the area for the L2 cache in the conventional system
• Wide memory bus for IRAM systems
• Main simulation parameters:
  • On-chip DRAM access latency
  • Logic speed (CPU frequency)
• Benchmarks: SPEC95Int (compress, li, ijpeg, perl, gcc), SPEC95Fp (tomcatv, su2cor, wave5), Linpack1000

Simulated Models

Parameter               IRAM                  Conventional
Pipeline                Simple in-order       Simple in-order
CPU Frequency           333 or 500 MHz        500 MHz
Technology              0.25µm DRAM           0.25µm logic
L1 Configuration        64KB I + 64KB D       64KB I + 64KB D
L1 Associativity        2-way                 2-way
L1 Block Size           128B                  64B I + 32B D
L1 Type                 On-chip SRAM          On-chip SRAM
L1 Access Time          1 CPU cycle           1 CPU cycle
L2 Configuration        N/A                   2MB unified
L2 Associativity        N/A                   2-way
L2 Block Size           N/A                   128B
L2 Type                 N/A                   On-chip SRAM
L2 Access Time          N/A                   12 CPU cycles
Memory Configuration    24MB DRAM on-chip     24MB 166MHz SDRAM off-chip
Memory Bus Width        128B                  16B
Total Latency           21 or 33ns            116ns
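One way to compare the two models by hand is an average memory access time (AMAT) estimate. The sketch below uses the cycle counts and latencies from the table, but the miss rates are hypothetical inputs and the additive AMAT model is a textbook simplification assumed here; it is not the SimOS methodology used in the study.

```python
def cycle_ns(freq_mhz):
    """Length of one CPU cycle in nanoseconds."""
    return 1000.0 / freq_mhz


def amat_iram_ns(freq_mhz, mem_latency_ns, l1_miss_rate):
    """IRAM model: a 1-cycle L1, and misses go straight to on-chip DRAM."""
    return cycle_ns(freq_mhz) + l1_miss_rate * mem_latency_ns


def amat_conventional_ns(l1_miss_rate, l2_local_miss_rate,
                         freq_mhz=500, l2_cycles=12, mem_latency_ns=116):
    """Conventional model: 1-cycle L1, 12-cycle L2, then off-chip SDRAM."""
    c = cycle_ns(freq_mhz)
    l2_penalty_ns = l2_cycles * c + l2_local_miss_rate * mem_latency_ns
    return c + l1_miss_rate * l2_penalty_ns


if __name__ == "__main__":
    # Hypothetical miss rates, chosen only to exercise the formulas.
    l1_miss, l2_miss = 0.05, 0.20
    print(amat_conventional_ns(l1_miss, l2_miss))
    print(amat_iram_ns(333, 33, l1_miss))
    print(amat_iram_ns(500, 33, l1_miss))
```

The outcome depends heavily on the miss rates: a workload that mostly hits in the 2 MB L2 gains little from IRAM, while one that misses often sees most of the benefit of the lower memory latency.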

Method II: Results

[Figure: SPEC95 & Linpack1000 results. Normalized execution times (y-axis) for SPEC95 and Linpack across the simulated models (x-axis): Conventional (116 ns), IRAM 333 MHz (33 ns), IRAM 333 MHz (21 ns), IRAM 500 MHz (33 ns)]

• Execution times normalized to the basic IRAM model (333 MHz, 33 ns memory latency)
• IRAM models up to 40% faster than conventional

Conclusions

• IRAM systems with existing processors provide only moderate performance benefits
• High bandwidth/low latency is used to speed up memory accesses but not computation
• Reason: existing architectures were developed under the assumption of a low-bandwidth memory system
• Still attractive for the portable/embedded domain
  • Up to 4 times more energy efficient
  • Higher system integration
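The "moderate benefits" conclusion is what Amdahl's law predicts when only the memory part of execution time gets faster. The sketch below applies that law; it is offered as an explanatory aid rather than the analysis used in the paper, and the 30% memory-time fraction in the usage lines is a hypothetical value.

```python
def overall_speedup(memory_fraction, memory_speedup):
    """Amdahl's law: only `memory_fraction` of execution time is sped up."""
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be between 0 and 1")
    if memory_speedup <= 0:
        raise ValueError("memory_speedup must be positive")
    return 1.0 / ((1.0 - memory_fraction) + memory_fraction / memory_speedup)


if __name__ == "__main__":
    # Hypothetical: 30% of time is spent in memory, and memory latency
    # improves from 116 ns to 33 ns (the table's conventional vs. IRAM values).
    print(overall_speedup(0.30, 116 / 33))
    # Even with infinitely fast memory the speedup cannot exceed 1 / 0.70.
    print(1 / 0.70)
```

Unless the processor can convert the extra bandwidth into faster computation, the compute portion of the run time caps the overall gain.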

Towards a Revolutionary Approach

• To provide significant performance benefits, IRAM systems need microprocessor architectures that turn memory bandwidth into application performance
• Candidates:
  • Vector microprocessor
  • Multithreading architectures
  • Multiprocessor on a chip
  • Some hybrid combination?
  • Some new idea?

What People are Looking at?

• Nearest neighbor database searching
• IStore (Intelligent Storage)
• Multimedia apps
• SIMD computation
• Distributed vector
• ATM switch controllers
• Scalable DSMs
• Single chip MP + DRAM
• Synchronization & special support
• Petaflop: large scale ...
