PC Technology: Memory Hierarchy


Author: Calvin Bergmann


Overview

Motivation: the "performance gap" between CPU and memory
DRAM basics
Memory hierarchy, cache
SDRAM, Rambus
IRAM

Memory: Literature

PC-Technologie | SS 2001 | 18.214

[IEEE Micro 3/97]                  IRAM
[IEEE Micro 11/97]                 Advanced Memory Technology: overview, RAMBUS, SLDRAM
[Hennessy & Patterson]             Chapter 5, memory hierarchy
[c't 07/96 p.158]                  "SIMMsalabim"
[c't 10/97 p.298]                  "Schnelle Speicherkäfer"
[c't 96-2000]                      various test reports
www.rambus.com                     all RAMBUS docs
www.jedec.org                      standards
developer.intel.com                memory homepage, chipsets
[Cvetanovic/Bhandarkar ISCA 96]    performance analysis of the Alpha 21164

Memory: SIMM / DIMM

[Figure: memory modules] [c't 10/97 p.298]
SIMM / DIMM: 72 / 168 pins, 32 / 64 bit

DRAM: Performance Gap [IEEE Micro 11/97]

[Figure: performance (log scale, 1 to 10000) over the years 1980 to 2000. Curves: µPs (35% / year, 55% / year) and DRAM (7% / year); annotated gap: 300X]

DRAM capacity: 60% / year, latency: 7% / year
Processor performance: 55% / year
Performance is limited by power dissipation (approx. 50 W)
The gap keeps widening => a memory hierarchy with caches is required

DRAM: Performance gap: example

Time for an L2 cache miss (# idle instructions):

Alpha 21064 (7000):   340 ns / 5.0 ns    68 clocks x 2 = 136
Alpha 21164 (8400):   266 ns / 3.3 ns    80 clocks x 4 = 320
Alpha 21264 (est.):   180 ns / 1.7 ns   108 clocks x 6 = 648

Caches are essential to hide the DRAM latency
The problem gets worse with every processor generation

Example: analysis for the Alpha 21164 [ISCA'96]
[Figure: bars for no cache, L1, L1+L2, L1+L2+L3, and a CPU with ideal memory]
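The two calculations behind these slides can be sketched in a few lines of C. This is a minimal illustration, not part of the lecture: the function names are made up here, and the model simply assumes constant yearly growth factors and a processor that would otherwise issue at its full width every clock.

```c
#include <assert.h>

/* Factor by which the CPU/DRAM gap has widened after `years`,
 * given yearly improvement factors (1.55 means 55% / year). */
static double gap_after_years(double cpu_factor, double dram_factor, int years)
{
    double gap = 1.0;
    assert(years >= 0 && dram_factor > 0.0);
    for (int y = 0; y < years; y++)
        gap *= cpu_factor / dram_factor;
    return gap;
}

/* Issue slots lost while one cache miss is serviced:
 * (miss latency / cycle time) clocks, times instructions per clock. */
static double idle_issue_slots(double miss_ns, double cycle_ns, int issue_width)
{
    assert(cycle_ns > 0.0 && issue_width > 0);
    return miss_ns / cycle_ns * issue_width;
}
```

With 55% and 7% per year the gap grows by roughly 45% every year. For the Alpha 21064, idle_issue_slots(340.0, 5.0, 2) gives the slide's 136. The other two rows come out slightly different from the slide (about 322 and 635) because the slide rounds the cycle times it prints.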

DRAM: Performance Gap: what to do?

[Figure: performance over the years 1980 to 2000. Curves: µPs (35% / year, 55% / year) and DRAM (7% / year); annotated gap: 300X]

Faster memory is needed ...
but DRAM is inherently slow and SRAM is very expensive
=> make better use of DRAM: SDRAM, SDRAM-DDR, RAMBUS, SLDRAM
=> memory hierarchy: larger, faster caches; better cache organization; prefetch optimizations
=> new concepts? IRAM

"Cache: a safe place for hiding or storing things" (Webster's dictionary)

DRAM: Alpha 21164 [Cvetanovic/Bhandarkar ISCA96]

[Figure: time breakdown for database and compute workloads, from no cache to L1+L2+L3, with the share spent waiting]

DRAM: Trench capacitor [Eshragian]

Trench design: plates arranged vertically along the walls of a trench
Stack design: several horizontal layers

DRAM: DRAM vs. SRAM

[Figure: SRAM cell (data-in, /data-in, ena, wordline, 2 inverters = 4T, bitline, /bitline, data-out) and DRAM cell (data-in, ena, wordline, T, C, bitline, ground, sense-amp, VCC, GND)]

SRAM                        DRAM
6 transistors / bit         1 transistor / bit
static (no refresh)         C = 10 fF: ~200,000 electrons
fast                        slow (charge sharing)
10 .. 50X DRAM area         minimal area
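The electron count follows from Q = C * V divided by the elementary charge. The sketch below assumes a cell voltage of about 3.3 V, which the slide does not state; the function name is hypothetical.

```c
#include <assert.h>

#define ELEMENTARY_CHARGE_C 1.602176634e-19  /* coulombs per electron */

/* Number of electrons stored on a capacitor charged to voltage_v.
 * The voltage is an assumption; the slide only gives C = 10 fF. */
static double stored_electrons(double capacitance_f, double voltage_v)
{
    assert(capacitance_f >= 0.0);
    return capacitance_f * voltage_v / ELEMENTARY_CHARGE_C;
}
```

stored_electrons(10e-15, 3.3) lands close to the 200,000 quoted on the slide, which shows how little charge the sense amplifier has to detect.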

DRAM: Stack / Trench capacitor

[Figure: "stacked capacitors" [Siemens 1Gb DRAM prototype 96] and "trench capacitors" [IBM CMOS-6X embedded DRAM], with the cell schematic: data-in, ena, wordline, T, C, bitline, ground, sense-amp]
C = 10 fF: ~200,000 electrons

DRAM: Layout

[Figure: cell array with bitlines n, /n, n+1, /n+1, wordlines n .. n+3 and sense amps n, n+1; 2λ pitch, cell area 8 λ²; ~256 cells per sense amp, ~1024 per bank]
[Figure: block diagram with row address, row address decoder and mux (/RAS), column address, column address decoder / latch / mux (/CAS), I/O drivers, data-in, data-out]

DRAM: Function

Read:
  /RAS = 0:   select the wordline, activate the bitlines,
              read out and evaluate the selected cells
  /CAS = 0:   select the bitline, output the data,
              write back the data that was read (!)
  /RAS = 1:   precharge the bitlines
Write:
  /CAS = 0:   write back the data that was read plus the new data
Refresh:      required every 16 .. 32 ms
SDRAM:        additional registers, various burst modes

DRAM: Organization / bandwidth

[Figure: 64 Mbit DRAM block diagram: row address, row address decoder / mux (/RAS), 32 banks, 8K bitlines / bank, 256 bits / sense-amp, column address, column address decoder / latch / mux (/CAS)]

64 Mbit chip:
  32 banks
  ~ 8K bitlines / bank (1K x 8 bit)
  ~ 256 wordlines / sense-amp
  100 nsec cycle time

On-chip bandwidth:
  => 32 * 1 KB / 100 nsec
  => 327 GB / sec
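The on-chip bandwidth figure is plain arithmetic: all banks deliver one row slice per cycle. A minimal sketch, with a hypothetical function name, taking 1 KB as 1024 bytes and 1 GB as 10^9 bytes (the combination that reproduces the slide's 327 GB / sec):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second when every bank delivers bytes_per_bank per cycle. */
static double onchip_bandwidth(uint32_t banks, uint32_t bytes_per_bank, double cycle_ns)
{
    assert(cycle_ns > 0.0);
    return (double)banks * (double)bytes_per_bank / (cycle_ns * 1e-9);
}
```

onchip_bandwidth(32, 1024, 100.0) is about 3.28e11 bytes per second. Only a tiny fraction of that reaches the pins, which is the argument the IRAM slides later build on.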

DRAM: Control (asynchronous)

[Figure: timing diagram with /RAS, /CAS, row addr., col addr, wordline, bitline pair and I/O; phases: row access, col. access, write back, precharge; variant shown: EDO]

DRAM: Floorplan (IBM 4Mbit) [IBM JR&D 1995]

[Figure: die photos; left: 4 Mbit, top: 16 Mbit]
Configuration according to the market situation
Redundancy for better yield: left: 4.0 / 4.5 Mbit capacity / gross
Size comparison between I/O, column / row decoders, and array

DRAM: Trend and dilemma (semiconductor market)

Price collapse: 16Mb: 50$ @ 1/96 -> 10$ @ 12/96 -> 4$ @ 12/97
The number of DRAMs per computer is falling:
- capacity grows by 50% - 60% / year
- software needs 33% / year
- the minimum number is given by bus width vs. DRAM width (4 bit)
Is there a market for large DRAMs at all? (256Mb, 1Gb, ...)

[Table: number of chips per system for memory sizes from 4 MB to 256 MB, across the DRAM generations 1Mb ('86), 4Mb ('89), 16Mb ('92), 64Mb ('96), 256Mb ('99), 1Gb ('02); chip capacity grows 60% / year, memory demand 33% / year. The cell layout is not recoverable from the extracted text.]

The PC is losing share; consumer apps, networks
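The chip counts in the table, and the lower bound set by the bus width, follow from two divisions. The sketch below uses made-up function names and assumes memory sizes in MByte and chip densities in Mbit.

```c
#include <assert.h>

/* Chips needed to build mem_mbyte of memory from chips of chip_mbit each. */
static unsigned chips_for_capacity(unsigned mem_mbyte, unsigned chip_mbit)
{
    assert(chip_mbit > 0u);
    unsigned mbits = mem_mbyte * 8u;               /* capacity in Mbit */
    return (mbits + chip_mbit - 1u) / chip_mbit;   /* round up */
}

/* Chips needed just to fill the data bus, whatever the capacity. */
static unsigned chips_for_bus(unsigned bus_bits, unsigned chip_data_bits)
{
    assert(chip_data_bits > 0u);
    return (bus_bits + chip_data_bits - 1u) / chip_data_bits;
}
```

4 MB from 1 Mbit chips takes 32 chips and from 4 Mbit chips takes 8, consistent with the first numbers that survive from the table. A 64-bit bus built from 4-bit-wide DRAMs needs at least 16 chips, so once a single chip holds more than 1/16 of the required memory, the bus width rather than the capacity dictates the chip count. That is the dilemma the slide describes.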

DRAM: the semiconductor market

DRAMs as standard components: require a standardized interface
Market (1995): DRAMs 37 bn $, µPs 20 bn $
High volumes, many suppliers, little profit
'Square' memory matrix with N*N bits, externally 1/4/16 bits
Architectural improvements are minimal: PM, EDO, SDRAM, DDR, ...
Generations: 64 Kb, 256 Kb, 1Mb, 4Mb, 16Mb, 256Mb, ... (1 Gb)
Integration of DRAM and logic is increasingly topical (IRAM & Co)
"Small" applications have to use the existing offering
Special variants given sufficient volume (e.g. N64, PSX2)
The PC market sets the direction

DRAM: Form factors SIMM / DIMM / RIMM

[Figure: EDO-SIMM 60 ns, 72 pins; SDRAM-100 DIMM, 168 pins; RAMBUS-PC800 RIMM, 168 pins]

SDRAM: synchronous control for better performance

[Figure: DIMM with SPD EEPROM and data lines]

Internal structure as in asynchronous DRAMs
Clocked I/O registers
The combination of values on nCS/nRAS/nCAS/nWE/... is interpreted as a command
Several burst read/write modes
Mode register, e.g. selection of burst length 1/2/4/8
Usual clock rates: 66 MHz / 100 MHz / 133 MHz
"Serial presence detect": EEPROM with all timing data, full autoconfiguration
Typical times: 20 .. 50 nsec.

PC-66 and PC-100 specifications from Intel
PC-133 specification first from VIA, then adopted by Intel
Various variants (SGRAM / double data rate "DDR" / ...)
Market significance
Patent disputes (among others with Rambus, Inc.)
[developer.intel.com/memory]

SDRAM: Commands

Rising clock edge: nCS nRAS nCAS nWE => SDRAM command
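The command table itself did not survive extraction. As an illustration, the sketch below decodes the four pins using the usual JEDEC SDRAM command encoding. That encoding is an assumption about what the slide showed, and the enum and function names are invented for this example.

```c
#include <assert.h>

typedef enum {
    SDRAM_DESELECT, SDRAM_NOP, SDRAM_ACTIVE, SDRAM_READ, SDRAM_WRITE,
    SDRAM_BURST_TERMINATE, SDRAM_PRECHARGE, SDRAM_AUTO_REFRESH, SDRAM_LOAD_MODE
} sdram_cmd;

/* Pin levels as sampled at the rising clock edge: 0 = low, 1 = high.
 * All four signals are active low. */
static sdram_cmd sdram_decode(int ncs, int nras, int ncas, int nwe)
{
    /* every level must be exactly 0 or 1 */
    assert(((ncs | nras | ncas | nwe) & ~1) == 0);
    if (ncs)
        return SDRAM_DESELECT;                 /* chip not selected */
    switch ((nras << 2) | (ncas << 1) | nwe) {
    case 7:  return SDRAM_NOP;                 /* H H H */
    case 3:  return SDRAM_ACTIVE;              /* L H H: open a row */
    case 5:  return SDRAM_READ;                /* H L H */
    case 4:  return SDRAM_WRITE;               /* H L L */
    case 6:  return SDRAM_BURST_TERMINATE;     /* H H L */
    case 2:  return SDRAM_PRECHARGE;           /* L H L */
    case 1:  return SDRAM_AUTO_REFRESH;        /* L L H */
    default: return SDRAM_LOAD_MODE;           /* L L L: write mode register */
    }
}
```

The old /RAS and /CAS strobes survive as command bits: row activation still asserts nRAS and column access still asserts nCAS, but both are now sampled on a clock edge instead of being timed asynchronously.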

SDRAM: Initialization
SDRAM: Read / Write Bursts
SDRAM: "Ping Pong Read"
[Figures: timing diagrams, not recoverable from the extracted text]

SDRAM: DDR Read [Xilinx appnote]
[Figure: timing diagram, 20 ns @ 100 MHz]

SDRAM: DDR Write [Xilinx appnote]
[Figure: timing diagram, 10 ns @ 100 MHz]

SDRAM: DDR Controller [Xilinx appnote]
[Figure: block diagram]

SDRAM: DDR data path [Xilinx appnote]
[Figure: data path, 128-bit @ 100 MHz and 64-bit @ 200 MHz, positive / negative clock edge]

RAMBUS: Motivation

Rising requirements (e.g. for 3D apps)
More and more memory bandwidth is required
Falling number of individual DRAM chips
The bus clock (133 MHz) can hardly be raised further
Buses wider than 64 bit are very expensive
Boards have to tolerate minimal / maximal population
DDR is problematic because the delays are already at their limit
=> conventional memory technology is "at the limit" => RAMBUS

RAMBUS: Concept

[Figure: controller and a chain of RDRAM chips on one bus: data[18], cmd[8], rclk[2], tclk[2], vcc/gnd, config, 400 MHz, terminated at V_term]

400 MHz DDR, bandwidth 1.6 GB/s (1 chip)
Timing-optimized bus      (266 .. 400 MHz DDR)
Few lines                 (18 data + 8 cmd + 4 clock + vcc + gnd)
Flexible population       (N64/PSX2: only 2 chips each)
8-bit addresses, 16+2 bit data
Mirrored clock lines: transmit / return for read / write (!)
Chip-internal: 128/144 bit @ 10 nsec
Flexible: timing adapted to the number / position of the chips
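The bandwidth figures on the SDRAM, DDR and RAMBUS slides all come from the same product of bus width, clock rate and transfers per clock. A minimal sketch with a hypothetical function name; parity bits (the "+2" of the Rambus data lines) are not counted as payload here.

```c
#include <assert.h>
#include <stdint.h>

/* Peak transfer rate in bytes per second.
 * transfers_per_clock is 1 for single data rate and 2 for DDR. */
static uint64_t peak_bandwidth(uint32_t data_bits, uint32_t clock_mhz,
                               uint32_t transfers_per_clock)
{
    assert(data_bits % 8u == 0u);
    return (uint64_t)(data_bits / 8u) * clock_mhz * 1000000u * transfers_per_clock;
}
```

peak_bandwidth(16, 400, 2) gives the 1.6 GB/s quoted for one Rambus channel. A 64-bit DDR bus at 100 MHz reaches the same 1.6 GB/s, and a 64-bit SDRAM-100 bus half of that. The narrow Rambus channel therefore matches a DDR DIMM with a quarter of the data lines, at the price of a four times higher clock.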

RAMBUS: Principle [ct 03/2000]

[Figure: controller and RDRAM chain with data[18], cmd[8], rclk[2], tclk[2] at 400 MHz, terminated at V_term, shown once for write and once for read]
No propagation-delay differences between clock and data
Access to chips further back is slower

RAMBUS: signal delay matching

[Figure: RIMM with RDRAM chips and RIMM contacts]
Trace lengths are matched for uniform propagation delays
800 MHz / 1.25 nsec / ~ 18 cm
0.5 nsec / ~ 7 cm

RAMBUS: Asus P3C (i820)

[Figure: mainboard with RIMM slots and Slot-1]

RAMBUS: basic read / write transactions

Separate control lines for row / column select
This enables pipelining of read and write accesses
In the ideal case the data lines are almost 100 % utilized
but only with suitable accesses (32-byte alignment)
Performance depends on the compiler

RAMBUS: Read

[Figure: timing diagram]
tRCD / tCC / tCAC depend on the module (PC-800 / PC-700 / etc.)
Additional latency clocks for modules "further back"
Additional latency clocks for temperature control
(1 chip is enough for the full data rate => higher load than with SDRAM)

RAMBUS: Write / Interleaved Write

[Figure: timing diagrams]
Correspondingly complex cycles for read as well
Additional bus cycles for refresh / power management / etc.

RAMBUS: vs. SDRAM / SDRAM-DDR [c't 04/00 p.232 / rambus.com]

[Figure: timing diagrams, summarized below]

                      SDRAM PC100-222    SDRAM DDR PC100-222    RAMBUS PC800
clk                   100 MHz            100 MHz                400 MHz
addr lines            ~16                ~16                    (in control)
control lines         ~10                ~10                    8
data lines            64                 64                     16+2 (data + parity)
row commands          1 (row activ.)     1 (row activ.)         2+2
column commands       col read           col read               col, col
row-to-column delay   tRCD = 2           tRCD = 2               tRCD = 7
column-to-data delay  tCL = 2            tCL = 2                tCAC = 8
data burst            4 x 8 Byte         4 * 8 Byte             16 * 2 Byte
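A rough way to compare the three diagrams is to add up the latencies and the burst time for one 32-byte access. The sketch below is a simplification under stated assumptions: latencies simply add, the burst streams without gaps, and the Rambus values tRCD = 7 and tCAC = 8 are counted in 400 MHz clocks. Real transactions include packet framing and controller overhead that this ignores. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Time in picoseconds from row activation to the last data word,
 * assuming row and column latencies add and the burst has no gaps. */
static uint64_t burst_time_ps(uint32_t clock_period_ps, uint32_t row_to_col_clks,
                              uint32_t col_to_data_clks, uint32_t transfers,
                              uint32_t transfers_per_clock)
{
    assert(transfers_per_clock > 0u);
    uint64_t latency = (uint64_t)(row_to_col_clks + col_to_data_clks) * clock_period_ps;
    uint64_t burst = (uint64_t)transfers * clock_period_ps / transfers_per_clock;
    return latency + burst;
}
```

Under these assumptions PC100-222 SDRAM needs 80 ns for the 32 bytes, the DDR variant 60 ns, and PC800 RDRAM 57.5 ns. The Rambus channel wins on burst time but spends it again on latency, which fits the mixed benchmark results reported on the following slides.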

RAMBUS: c't memcopy
RAMBUS: SPECint 2000 / SPECfp 2000

[Figures: benchmark bar charts, scale 25% to 80%, comparing Intel BX with SDRAM-100, Rambus 1x PC-800, and AMD-750 with SDRAM-100] [c't 24/99 p.118] [c't 16/2001 p.132]
Wide scatter even with the same chipset
Single-channel RAMBUS is not outstanding

RAMBUS: Office Benchmark [www.tomshardware.com]

[Figure: benchmark chart]
Is the memory stressed at all ?!
SDRAM 133 better than RDRAM ?!

RAMBUS: Conclusion . . . [c't / www.dell.com]

Is the new, expensive technology worth it?
Interesting and flexible concept
One chip is enough for the full data rate: suitable for the 1 Gbit generation
Full autoconfiguration and adaptive bus timing
Contradictory benchmark results
Single-channel RDRAM-800 hardly better than SDRAM-133
Dual-channel RDRAM-800 expensive but good
Role of SDRAM-DDR ?!
Currently only brand-name modules, no cheap "no name" parts
Prices not competitive so far
Latest Intel roadmap "unclear" (developer forum, Feb'00):
RDRAM (desktop) + SDRAM (mobile, server) + advanced DRAM

Cache: Motivation

DRAM is slow, SRAM is expensive
Locality: data is used repeatedly / neighbouring data is used
=> "cache": a small intermediate SRAM store
Cache hits run at SRAM performance
but there is overhead: misses are slower than without a cache

Parameters:       total size, block size, access time, miss time, ...
Organization:     where can a block be placed?          (direct-mapped)
Access:           how is a block found?                 (tags, valid bit)
Replacement:      which block is replaced on a miss?    (random, LRU)
Write strategy:   write back / write through / ...      (dirty bit, ...)

Cache: Parameters . . .

Example:
size              64 KByte
hit time          1 clk
miss time         50 clk
miss rate         1%
organization      fully associative

Architecture:
larger blocks             (less eviction, but lower capacity)
higher associativity      (but more complex management)
victim caches             (cheap and efficient)
HW prefetching            (e.g. instruction prefetch / branch prediction)
compiler prefetching      (for known access patterns)
critical word first       (x86: patented by Intel)
write buffer              (all current processors)
nonblocking caches        (efficient, but complex)

Cache: Principle

[Figure: CPU, cache controller with tags, valid bits and data, and main memory; example addresses 0x0000 .. 0x000c, 0x0100 and example data 0xaffe, 0xcafe]

Small, fast (e.g. 128 KB SRAM)
Managed automatically
The tags are compared; depending on the result, the access goes to the cache or to memory
- direct-mapped / set-associative
- block / line size (1x tags, Nx data)
- write-through / -back / -allocate
Separate I/D caches or a unified cache?

Cache: AlphaServer 8200 (300MHz 21164) [Patterson 97]

memory type    size         location    latency ns    cycles    bandwidth MB/s
I-Cache        8K           on chip     6.6           2         4800
D-Cache        8K           on chip     6.6           2         4800
L2-Cache       96K          on chip     20.0          6         4800
L3-Cache       ~ 4M         off chip    26.0          8         960
main memory    64M .. 4G    off chip    253.0         76        1200
single DRAM    16M          off chip    ~ 60.0        18        30..100

                    misses / 1000 instr.      fraction of time spent in
Program      CPI    I     D     L2     L3     µP      I       D       L2      L3
SPECint92    1.2    7     25    11     0      0.78    0.03    0.13    0.05    0.00
SPECfp92     1.2    2     47    12     0      0.68    0.01    0.23    0.06    0.02
database     3.6    97    82    119    13     0.23    0.16    0.14    0.20    0.27
sparse       3.0    0     38    36     23     0.27    0.00    0.08    0.07    0.58

Cache: Miss rate

Miss rate examples (SPEC 92, R2000, direct-mapped, 32-byte blocks):

Size     Instruction    Data       Unified
1K       3.06%          24.61%     13.34%
2K       2.26%          20.57%     9.78%
4K       1.78%          15.94%     7.24%
8K       1.10%          10.19%     4.57%
16K      0.64%          6.47%      2.87%
32K      0.39%          4.82%      1.99%
64K      0.15%          3.77%      1.35%
128K     0.02%          2.88%      0.95%

The values depend very strongly on the program,
on CPU / multi-user load / measurement time / ...

Cache: Compulsory / Capacity / Conflict [H&P p.384]

3 kinds of cache misses:

compulsory   (cold start / first reference)
             first access to a block
capacity     the cache is too small for all blocks that are needed;
             blocks have to be exchanged
             => enlarge the cache
conflict     (collision misses / interference misses)
             in direct-mapped / set-associative caches:
             several blocks are needed in the same set
             => improve the organization, e.g. 4-way associative
             => victim buffers

Cache: Miss rate: example [H&P p.385]

Memory accesses: 75% instruction, 25% data
avg. memory access time = hit time + (miss rate * miss penalty)
cache hit:   1 clock
cache miss:  50 clocks
Miss rates used (SPEC 92, direct-mapped): 16K instruction 0.64%, 16K data 6.47%, 32K unified 1.99%

16K I + 16K D cache:
  miss rate:  (75% * 0.64%) + (25% * 6.47%) = 2.10%
  tmac:       75%*(1+0.64%*50) + 25%*(1+6.47%*50)
              = (75%*1.32) + (25%*4.235) = 2.05

32K unified cache:
  load/store hit: 1 extra cycle (one port only)
  miss rate:  1.99%
  tmac:       75%*(1+1.99%*50) + 25%*(1+1+1.99%*50)
              = (75%*1.995) + (25%*2.995) = 2.24

=> the split I/D cache is faster (for this example)
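The comparison above is easy to reproduce in code. The sketch below uses the formula from the slide; the function names are invented here, and miss rates are passed as fractions rather than percentages.

```c
#include <assert.h>

/* Average memory access time in clocks:
 * hit time + miss rate * miss penalty. */
static double amat(double hit_clks, double miss_rate, double miss_penalty_clks)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_clks + miss_rate * miss_penalty_clks;
}

/* Weighted average over instruction fetches and data accesses. */
static double mixed_amat(double instr_fraction, double instr_amat, double data_amat)
{
    return instr_fraction * instr_amat + (1.0 - instr_fraction) * data_amat;
}
```

For the split caches, mixed_amat(0.75, amat(1, 0.0064, 50), amat(1, 0.0647, 50)) is about 2.05 clocks. For the unified cache the data accesses pay one extra hit cycle, so mixed_amat(0.75, amat(1, 0.0199, 50), amat(2, 0.0199, 50)) comes to about 2.24 clocks, even though its overall miss rate is lower.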

Cache: direct-mapped conflict misses [comp.arch posting]

    /* XRES and YRES are the image dimensions, defined elsewhere;
       the benchmark varies the array size (see the table below). */
    static void filterF(char* in1, char* out1) {
      register int i0, i1, i2;
      register int x, y;
      register char *in, *out;

      in = in1;
      out = out1;
      for( y=0; y < YRES; y++ ) {
        i0 = (int)in[0];
        i1 = (int)in[1];
        /* ignore boundary pixels, over/underflow for this benchmark */
        for( x=1; x < XRES-1; x++ ) {
          i2 = (int)in[x+1];
          out[x] = (char)( (i0 + (2*i1) + i2) / 4 );
          i0 = i1; i1 = i2;
        }
        in += XRES;
        out += XRES;
      }
    } /*filterF*/

- read a byte from one array, compute, store result in second array, a byte at a time.
- If the arrays line up on top of each other in a direct-mapped cache, there is massive cache-thrashing.

Execution time vs. array size:

SYS         511    512     513    1023   1024    1025   2047   2048     2049   cache
CRIM        0.2    0.3     0.2    0.8    7.3*    0.9    3.7    33.4*    3.4    D
INDIGO4K    0.2    0.3     0.2    0.8    9.4*    0.8    3.2    37.9*    3.2    D
IN4K-fix    0.2    0.2     0.2    0.8    0.8     0.8    3.3    3.2      3.2    D
HP 720      0.3    0.7     0.3    1.1    2.7*    1.0    4.2    10.8*    4.2    D
HP 735      0.1    0.6*    0.1    0.6    2.7*    0.6    2.4    11.1*    2.6    D
HP 735      0.1    0.7*    0.1    0.6    2.7*    0.6    2.2    10.8*    2.2    D
Gwy486-66   0.3    0.3     0.3    1.3    1.4     1.3    5.5    5.5      5.5    SA?
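The thrashing effect can be reproduced with a small direct-mapped cache model that replays the access pattern of the inner filter loop. This is a sketch under assumptions, not a model of any machine in the table: the 8 KB size, the 32-byte lines and all names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u    /* hypothetical line size */
#define NUM_LINES  256u   /* 256 * 32 bytes = 8 KB direct-mapped cache */

typedef struct {
    uint8_t  valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
    unsigned long misses;
} dm_cache;

/* Look up one byte address; on a miss the line is replaced.
 * Returns 1 on a hit, 0 on a miss. */
static int dm_access(dm_cache *c, uint32_t addr)
{
    uint32_t block = addr / LINE_BYTES;
    uint32_t index = block % NUM_LINES;   /* the only slot this block may use */
    uint32_t tag   = block / NUM_LINES;
    if (c->valid[index] && c->tag[index] == tag)
        return 1;
    c->valid[index] = 1;
    c->tag[index] = tag;
    c->misses++;
    return 0;
}

/* Misses for one row of the inner filter loop:
 * read in[x+1], then write out[x], for x = 1 .. n-2. */
static unsigned long row_misses(uint32_t in_base, uint32_t out_base, uint32_t n)
{
    dm_cache c;
    assert(n >= 3u);
    memset(&c, 0, sizeof c);
    for (uint32_t x = 1; x + 1 < n; x++) {
        dm_access(&c, in_base + x + 1);
        dm_access(&c, out_base + x);
    }
    return c.misses;
}
```

When the two arrays are exactly one cache size apart, as in row_misses(0, 8192, 1024), in[x+1] and out[x] keep evicting each other and nearly every access misses. Moving the output array by half a cache size, row_misses(0, 8192 + 4096, 1024), leaves only the 64 first-touch misses. This is the same jump the table shows at power-of-two array sizes on the direct-mapped machines.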

x86: ctkurve [c't 07/2000 p.71]

[Figure: measured cache transfer rate vs. block size (random access)]
The caches are clearly visible: Pentium 16K/256K, Athlon 64K/512K

x86: Pentium III Caches . . .
x86: Pentium III cache modes
[Figures not recoverable from the extracted text]

x86: AMD Duron: Cache

[Figure: Athlon: L1 (32K I, 32K D), L2 256K unified, L2 control, main memory. Duron: L1 (32K I, 32K D), L2 64K, main memory]

AMD Duron "exclusive" L2 cache:
L1 cache: as in the Athlon (32 KB + 32 KB)
L2 cache: only 64 KB instead of 256 KB
This would be pointless with conventional management (all data held twice)
Therefore the L2 cache only stores data that is not in L1
=> cf. "victim buffer"
Only approx. 10% performance loss

Memory hierarchy: overview, typical values [H&P p.471]

                        TLB          L1-Cache     L2-Cache       Virtual memory
size / byte             32 .. 8K     1 .. 128K    256K .. 16M    16M .. 8G
block size / byte       4..8         4..32        32..256        4K..16K
hit time / clk          1            1..2         6..15          10..100
miss penalty / clk      10..30       8..66        30..200        700K..6M
miss rate / %           0.1 - 2      0.5 .. 20    15 .. 30       0.000001 .. 0.001
backup                  L1           L2           DRAM           disks
block placement         FA           DM           DM / SA        FA
block identification    tags         tags         tags           table
block replacement       random       -            random         ~ LRU
write strategy          flush        WT / WB      WB             WB

FA/SA/DM = full/set associative/direct mapped
WB/WT = write back/write through

x86: Pentium II read access . . . [c't 03/2000, 260]

[Figure not recoverable from the extracted text]
Cache accesses: L1 typ. 1..2 clocks, L2 typ. 2..10 clocks
Memory accesses: approx. 100 clocks

x86: Pentium II write access . . . [c't 03/2000, 260]

[Figure not recoverable from the extracted text]

Memory hierarchy: Conclusion

The performance gap grows and grows
DRAM is inherently slow
=> the memory hierarchy becomes more and more important
Larger, deeper caches
More complex caches: fully associative, non-blocking, etc.
but these only pay off for "simple" applications
More intelligent cache management
Prefetching
Memory Type Range Registers: fast I/O, e.g. graphics card
Further levels (e.g. AGP GART) in the chipset ...
=> important research topics: computational RAM / IRAM / . . .

IRAM: Concept / "vision"

IRAM := µP + DRAM + I/O on one chip

[Figure: conventional system (CPU, I$, D$, L2$, bus, DRAM, I/O) next to an IRAM (CPU, I$, D$, bus /64 .. /128, DRAM, I/O on one chip)]

Close the CPU/memory performance gap:   latency 5-10x (20 ns instead of 200 ns), bandwidth 100x (TB/s)
Adapt the memory organization:          freely selectable: #bits, bus width, ...
Reduce energy consumption:              no DRAM bus: 2-4x
Reduce space:                           the CPU fits on the DRAM: 2-4x

IRAM: Patterson: performance gap "tax"

Caches: no value in themselves, only there to close the performance gap

Examples:

CPU            % area (~cost)    % transistors (~power)
Alpha 21164    37%               77%
ARM SA110      61%               94%
Pentium Pro    64%               88%

IRAM: Motivation

Several motivations:
Power consumption, space requirements, especially for mobile devices
Matching memory demand and memory organization
Closing the performance gap between processor and DRAM:
minimal latency, maximal on-chip bandwidth
New market strategy for DRAM producers (because of the price collapse)

Alternatives?
New, revolutionary chip-packaging technologies
More complex CPUs (out-of-order, multiple-issue, ...)
New DRAM standards (SDRAM, RAMBUS, ...)
Assessments given on the slide: unlikely / difficult / not in sight

IRAM: Architecture? [Berkeley IRAM group]

1G transistors are possible, but which computer architecture?

One processor + DRAM:
- benefit questionable, possibly slower than an optimized CPU + cache + DRAM
- wastes the high on-chip bandwidth, since #issues < 8
- not very innovative

SIMD or MIMD parallel computer?
- many processors, but only little RAM per processor
- programming is an unsolved problem
- all previous variants have failed

Graphics processors
- already on the market and established

I-VRAM := DRAM + RISC + vector processor

IRAM: "vanilla" approach?!

Implement an existing machine (Alpha 21164) in DRAM technology
Same architecture: same caches, simple DRAM, ...
Simulate the usual benchmarks

Logic in a DRAM process? Factors (optimistic - pessimistic):
Logic slower             1.3 - 2.0
SRAM (caches) slower     1.1 - 1.3
DRAM faster              10.0 - 5.0

Result (optimistic - pessimistic):
SPEC92           0.8 - 0.6    slower!
Database         1.1 - 0.9    equal
Sparse matrix    1.8 - 1.2    faster

Performance is not convincing, but power / space / cost are better

IRAM: V-IRAM 2 Floorplan

0.18 µm, 1G transistors: 80% DRAM, 4% vector, 3% CPU
Size and redundancy as for a 1Gb DRAM
[Figure: floorplan with two memory blocks (384 Mbits / 48 MBytes each), two memory crossbar switches, 8 vector units (+1 spare), CPU + I$ + D$, I/O]

IRAM: V-IRAM 2

0.18 µm, fast logic, 1 GHz, 96 MByte DRAM
16 GFLOPS (64b), 128 GOPS (8b)
Prototype expected in 2001
[Figure: block diagram with a 2-way superscalar RISC processor (8K I, 8K D), vector instruction queue, vector registers, vector units (+, x, %, load/store) configurable as 8 x 64 or 16 x 32 or 32 x 16 or 64 x 8, network interface, memory crossbar switch with 8 x 64 paths, and DRAM banks M]

IRAM: Summary

Moore's Law: 1% / week
The bottleneck is the performance gap between CPU and DRAM
Technology allows CPU and DRAM on one chip from 1998/1999 on
IRAM potential: bandwidth 100x, latency 5-10x, power 2-4x
V-IRAM as a technology demo? (graphics chips are already shipping!)
V-IRAM: 25-100 MB memory @ 20 ns, 4-16 GFLOPS, serial I/O
V-IRAM: 1 TB/s bandwidth, Smart-SIMMs = TFLOPS
Dramatic effects on the semiconductor market:
who supplies DRAM, who supplies microprocessors?
Radically new memory technologies are unlikely for the time being

IRAM - 03.02.98