CS 426 Parallel Computing
Lecture 01: Introduction
Ozcan Ozturk
http://www.cs.bilkent.edu.tr/~ozturk/cs426/
CS426 L01 Introduction.1
Course Administration
Instructor: Dr. Özcan Öztürk
Office Hours: Tuesday 10:30 - 12:30, or by appointment
Office: EA 421, Phone: 3444
WWW: http://www.cs.bilkent.edu.tr/~ozturk/
TA:
Burak Copur
Office Hrs: posted on the course web page
URL:
http://www.cs.bilkent.edu.tr/~ozturk/cs426/
Text:
Required: Parallel Computing
Slides:
PDF on the course web page after each lecture
Grading Information
Grade determinants
Midterm Exam: ~25%
- November ???, Location: TBD
Final Exam: ~25%
- January ???, Location: TBD
Projects (3-5): ~35%
- Due at the beginning of class (or, if it's code to be submitted electronically, by 17:00 on the due date). No late assignments will be accepted.
Class participation & pop quizzes: ~15%
Let me know about midterm exam conflicts ASAP
FZ Grade: 50% minimum grade average (midterm + project + quiz)
Grades will be posted on the course homepage
Why did we introduce this course?
Because the entire computing industry has bet on parallelism
There is a desperate need for all computer scientists and practitioners to be aware of parallelism
All major processor vendors are producing multicore chips
- Every machine will soon be a parallel machine
All programmers will be parallel programmers???
- Some parallelism may eventually be hidden in libraries, compilers, and high-level languages
- But a lot of work is needed to get there
Big open questions:
- What will be the killer applications for multicore machines?
- How should the chips be designed?
- How will they be programmed?
What is Parallel Computing?
Parallel computing: using multiple processors in parallel to solve problems more quickly than with a single processor, or with less energy
Examples of parallel machines
- A computer cluster: multiple PCs with local memories, combined by a high-speed network
- A Symmetric Multi-Processor (SMP): multiple processor chips connected to a single shared memory system
- A Chip Multi-Processor (CMP): multiple processors (called cores) on a single chip; also called a multi-core computer
The main motivation for parallel execution historically came from the desire for improved performance
Computation is the third pillar of scientific endeavor, in addition to Theory and Experimentation
But parallel execution has also now become a ubiquitous necessity due to power constraints, as we will see
Why Parallel Computing?
The real world is massively parallel
Why Parallel Computing?
Historically, parallel computing has been considered to be "the high end of computing", and has been used to model difficult problems in many areas of science and engineering:
- Atmosphere, Earth, Environment
- Physics - applied, nuclear, particle, condensed matter
- Bioscience, Biotechnology, Genetics
- Chemistry, Molecular Sciences
- Geology, Seismology
- Mechanical Engineering - from prosthetics to spacecraft
- Electrical Engineering, Circuit Design, Microelectronics
- Computer Science, Mathematics
Simulation: The Third Pillar of Science
Traditional scientific and engineering paradigm:
Do theory or paper design. Perform experiments or build system.
Limitations:
- Too difficult -- build large wind tunnels
- Too expensive -- build a throw-away passenger jet
- Too slow -- wait for climate or galactic evolution
- Too dangerous -- weapons, drug design, climate experimentation
Computational science paradigm:
- Use high-performance computer systems to simulate the phenomenon
- Based on known physical laws and efficient numerical methods
Why Parallel Computing?
Today, commercial applications provide an equal or greater driving force in the development of faster computers.
These applications require the processing of large amounts of data in sophisticated ways.
Why Parallel Computing?
For example:
- Databases, data mining
- Oil exploration
- Web search engines, web-based business services
- Medical imaging and diagnosis
- Pharmaceutical design
- Management of national and multi-national corporations
- Financial and economic modeling
- Advanced graphics and virtual reality, particularly in the entertainment industry
- Networked video and multi-media technologies
- Collaborative work environments
Why Use Parallel Computing?
Save time and/or money:
- In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings.
- Parallel computers can be built from cheap, commodity components.
Solve larger problems:
- Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory.
- For example: "Grand Challenge" (en.wikipedia.org/wiki/Grand_Challenge) problems requiring PetaFLOPS and PetaBytes of computing resources.
- Web search engines/databases processing millions of transactions per second.
Why Use Parallel Computing?
Provide concurrency:
- A single compute resource can only do one thing at a time; multiple computing resources can be doing many things simultaneously.
- For example, the Access Grid (www.accessgrid.org) provides a global collaboration network where people from around the world can meet and conduct work "virtually".
Use of non-local resources:
- Using compute resources on a wide area network, or even the Internet, when local compute resources are scarce.
Limits to serial computing:
Both physical and practical reasons pose significant constraints to simply building ever faster serial computers:
The Computational Power Argument
Moore's law, stated in 1965 by Gordon Moore (co-founder of Intel):
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."
The Computational Power Argument

[Figure: Growth in transistor counts and uniprocessor performance (vs. VAX-11/780), 1978-2016, for processors from the 8086 through the Itanium 2; performance grew at ~25%/year early on, then ~52%/year. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

He revised his rate of circuit complexity doubling to 18 months and projected from 1975 onwards at this reduced rate. (From David Patterson)
The Computational Power Argument
If one is to buy into Moore's law, the question still remains - how does one translate transistors into useful OPS (operations per second)?
The logical recourse is to rely on parallelism, both implicit and explicit.
Most serial (or seemingly serial) processors rely extensively on implicit parallelism.
Implicit vs. Explicit Parallelism

Implicit: managed by the hardware (e.g., superscalar processors)
Explicit: managed by the compiler (e.g., explicitly parallel architectures)
Pipelining Execution

IF: Instruction fetch   ID: Instruction decode   EX: Execution   WB: Write back

                     Cycles
Instruction #        1    2    3    4    5    6    7    8
Instruction i        IF   ID   EX   WB
Instruction i+1           IF   ID   EX   WB
Instruction i+2                IF   ID   EX   WB
Instruction i+3                     IF   ID   EX   WB
Instruction i+4                          IF   ID   EX   WB
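The timing pattern above can be sketched numerically. This is a toy model for illustration (not lecture code): on an ideal n-stage pipeline, the first instruction takes n cycles and each subsequent instruction retires exactly one cycle after its predecessor.

```python
def pipelined_cycles(n_instructions, n_stages=4):
    """Total cycles to finish n_instructions on an ideal n-stage pipeline.

    The first instruction occupies the pipeline for n_stages cycles;
    every later instruction completes one cycle after the previous one.
    """
    return n_stages + (n_instructions - 1)


def serial_cycles(n_instructions, n_stages=4):
    """Without pipelining, every instruction takes all n_stages cycles."""
    return n_instructions * n_stages


# Matches the table: 5 instructions through the 4-stage IF/ID/EX/WB
# pipeline finish in cycle 8, versus 20 cycles without overlap.
print(pipelined_cycles(5), serial_cycles(5))
```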
Super-Scalar Execution

                     Cycles
Instruction type     1    2    3    4    5    6    7
Integer              IF   ID   EX   WB
Floating point       IF   ID   EX   WB
Integer                   IF   ID   EX   WB
Floating point            IF   ID   EX   WB
Integer                        IF   ID   EX   WB
Floating point                 IF   ID   EX   WB
Integer                             IF   ID   EX   WB
Floating point                      IF   ID   EX   WB

2-issue super-scalar machine: an integer and a floating-point instruction issue together each cycle
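The same toy model extends to a w-issue machine (again an illustrative sketch, assuming no hazards or structural conflicts): up to w instructions enter the pipeline per cycle, so the number of issue groups shrinks by a factor of w.

```python
import math


def superscalar_cycles(n_instructions, n_stages=4, issue_width=1):
    """Total cycles on an ideal n-stage pipeline that issues up to
    issue_width instructions per cycle (no hazards assumed)."""
    issue_groups = math.ceil(n_instructions / issue_width)
    return n_stages + issue_groups - 1


# Matches the table: 8 instructions (4 integer/FP pairs) on a 2-issue,
# 4-stage machine finish in 7 cycles; a 1-issue machine would need 11.
print(superscalar_cycles(8, issue_width=2), superscalar_cycles(8))
```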
Why Parallelism is now necessary for Mainstream Computing
Chip density is continuing to increase (~2x every 2 years)
- Clock speed is not
- The number of processor cores has to double instead
There is little or no hidden parallelism (ILP) to be found
Parallelism must be exposed to and managed by software
Fundamental limits on Serial Computing: "Walls"

Power Wall: Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.
Power Consumption

[Figure: Power consumption (watts, log scale from 1 to 1000) of microprocessors from 1985 to 2007, including the Intel 386 through Pentium 4 and Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, and AMD K6/K7/x86-64.]
Parallelism Saves Power
Power = (Capacitance) * (Voltage)^2 * (Frequency)
Since voltage scales roughly linearly with frequency, Power ∝ (Frequency)^3
Baseline example: single 1 GHz core with power P
- Option A: increase clock frequency to 2 GHz -> Power = 8P
- Option B: use 2 cores at 1 GHz each -> Power = 2P
Option B delivers the same performance as Option A with 4x less power… provided the software can be decomposed to run in parallel!
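The arithmetic above can be checked with a short script. This is a sketch of the scaling argument only; the cubic rule assumes voltage scales linearly with frequency.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power dissipation: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


def relative_power(freq_scale, n_cores=1):
    """Power relative to a single baseline core, assuming voltage scales
    linearly with frequency, so per-core power grows as frequency cubed."""
    return n_cores * freq_scale ** 3


# Option A: one core at 2x the baseline clock  -> 8P
# Option B: two cores at the baseline clock    -> 2P
print(relative_power(2), relative_power(1, n_cores=2))
```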
Fundamental limits on Serial Computing: "Walls"

Frequency Wall: Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.
The March to Multicore: Uniprocessor Performance

[Figure: SPECint2000 performance of uniprocessors from 1985 to 2007 (Intel 386 through Pentium 4 and Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, AMD K6/K7/x86-64), on a log scale from 1 to 10000; growth flattens in the early 2000s.]
ILP (instruction-level parallelism) is becoming fully exploited
ILP is what superscalar architectures exploit (wider issue, pipelining)
Fundamental limits on Serial Computing: "Walls"

Memory Wall: On multi-gigahertz symmetric processors --- even those with integrated memory controllers --- latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.
Range of a Wire in One Clock Cycle

[Figure: As process size shrinks from 0.28 to 0.04 microns and clock rates rise from 700 MHz to 13.5 GHz (1996-2014), the fraction of a 400 mm^2 die that a signal can cross in one clock cycle shrinks dramatically. From the SIA Roadmap.]
DRAM Access Latency

- Access times are a speed-of-light issue
- Memory technology is also changing
  - SRAM is getting harder to scale
  - DRAM is no longer the cheapest cost/bit
- Power efficiency is an issue here as well

[Figure: Processor performance grows ~60%/yr (2x every 1.5 years) while DRAM performance grows ~9%/yr (2x every 10 years), 1980-2004, producing a widening processor-memory performance gap.]
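Compounding those two growth rates shows how quickly the processor-DRAM gap widens. This is an illustrative calculation using the ~60%/yr and ~9%/yr figures above; the function name is ours, not from the lecture.

```python
def perf_gap(years, cpu_rate=0.60, dram_rate=0.09):
    """Ratio of compounded processor to DRAM performance after `years`,
    assuming ~60%/yr processor and ~9%/yr DRAM improvement."""
    return (1 + cpu_rate) ** years / (1 + dram_rate) ** years


# After a decade the gap is roughly 46x; after two decades, over 2000x.
print(perf_gap(10), perf_gap(20))
```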
Important Issues in Parallel Computing

Task/Program Partitioning:
- How to split a single task among the processors so that each processor performs the same amount of work, and all processors work collectively to complete the task.
Data Partitioning:
- How to split the data evenly among the processors in such a way that processor interaction is minimized.
Communication/Arbitration:
- How we allow communication among different processors, and how we arbitrate communication-related conflicts.
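A minimal sketch of data partitioning using Python's standard library (names like `chunk_sum` and `parallel_sum` are illustrative, not from the lecture): the input is split evenly across workers, each worker computes independently on its own chunk, and communication happens only once, when the partial results are combined.

```python
from multiprocessing import Pool


def chunk_sum(chunk):
    # Each worker processes its own chunk with no interaction
    return sum(chunk)


def parallel_sum(data, n_workers=4):
    # Data partitioning: split the input into equal-sized chunks,
    # one per worker, to balance the work across processors
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Communication/arbitration is confined to one map + one reduce step
    with Pool(n_workers) as pool:
        return sum(pool.map(chunk_sum, chunks))


if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))  # same result as sum(range(1000))
```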
Challenges
Design of parallel computers that resolve the above issues.
Design, analysis and evaluation of parallel algorithms run on these machines.
Portability and scalability issues related to parallel programs and algorithms
Tools and libraries used in such systems.
Units of Measure in HPC
High Performance Computing (HPC) units are:
Flop: floating point operation
Flop/s: floating point operations per second
Bytes: size of data (a double precision floating point number is 8 bytes)
Typical sizes are millions, billions, trillions…

Mega   Mflop/s = 10^6 flop/sec    Mbyte = 2^20 = 1048576 ~ 10^6 bytes
Giga   Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ~ 10^9 bytes
Tera   Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ~ 10^12 bytes
Peta   Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ~ 10^15 bytes
Exa    Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ~ 10^18 bytes
Zetta  Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ~ 10^21 bytes
Yotta  Yflop/s = 10^24 flop/sec   Ybyte = 2^80 ~ 10^24 bytes
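The flop/s prefixes above can be applied with a small helper (an illustrative snippet, not part of the lecture material):

```python
# Decimal prefixes for flop/s rates, largest first
PREFIXES = [
    ("Yflop/s", 1e24), ("Zflop/s", 1e21), ("Eflop/s", 1e18),
    ("Pflop/s", 1e15), ("Tflop/s", 1e12), ("Gflop/s", 1e9),
    ("Mflop/s", 1e6),
]


def format_flops(rate):
    """Render a raw flop/s rate using the largest fitting HPC prefix."""
    for name, scale in PREFIXES:
        if rate >= scale:
            return f"{rate / scale:.2f} {name}"
    return f"{rate:.2f} flop/s"


print(format_flops(1.5e12))  # a 1.5 Tflop/s machine
```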
See www.top500.org for the current list of fastest machines
Who and What?
Top500.org provides statistics on parallel computing - the charts below are just a sampling.
Top 500 HPC Applications
The race is already on for Exascale Computing
What is a parallel computer?
Parallel algorithms allow the efficient programming of parallel computers.
This way the waste of computational resources can be avoided.
Parallel computer vs. supercomputer
A supercomputer is a general-purpose computer that can solve computationally intensive problems faster than traditional computers. A supercomputer may or may not be a parallel computer.
Parallel Computers: Past and Present
1980's: Cray supercomputer
- 20-100 times faster than the other computers in use (mainframes, minicomputers)
- The price of a supercomputer is ~10 times that of other computers
1990's: a "Cray"-like CPU is only 2-4 times as fast as a microprocessor
- The price of a supercomputer is 10-20 times that of a microcomputer, which makes no economic sense
The solution to the need for computational power is massively parallel computers, in which tens to hundreds of commercial off-the-shelf processors are used to build a machine whose performance is much greater than that of a single processor.
Sun Starfire (UE10000)
- Up to 64-way SMP using a bus-based snooping protocol
- 4 processors + memory module per system board
- Uses 4 interleaved address busses to scale the snooping protocol
- Separate data transfer over a high-bandwidth 16x16 data crossbar

[Diagram: processors and their caches sit on system boards joined by board interconnects; address traffic uses the interleaved busses, while data moves through the 16x16 crossbar to the memory modules.]
Case Studies: The IBM Blue-Gene Architecture

[Figure: The hierarchical architecture of Blue Gene.]
Next Week
Parallel Programming Platforms