Introduction to Multicore Architectures and Parallel Programming
Jan Balzer
October 18, 2011
Multicore Terminology
- physical core: an independent processing unit
- multicore CPU: a CPU chip with multiple cores

single-core < dual-core < quad-core < multicore < manycore
Our champions...
Intel Xeon X7550:
- 8 physical cores
- 2 GHz each
- 18 MB L3 cache (!)
The machine
Two connected motherboards act as one machine with...
- ...64 physical cores in total
- ...1 terabyte of RAM in total
Hardware structure
- 2 motherboards
- 4 processors per motherboard
- 8 cores per processor

One node
Memory management
Rule of thumb: use local variables; avoid global variables whenever possible (illustrated in the sketch below).
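The rule can be made concrete with a small sketch (my illustration, not from the slides; names like worker and STEPS are made up): each thread accumulates into a local variable and touches the shared global only once, under a lock, instead of synchronizing on every step.

#include <stdio.h>
#include <pthread.h>

#define STEPS 1000000L

long total = 0;                      /* shared global: touch it rarely */
pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    long local = 0;                  /* thread-local: no contention */
    long i;
    for (i = 0; i < STEPS; i++) {
        local += 1;
    }
    pthread_mutex_lock(&total_lock);
    total += local;                  /* one locked access instead of STEPS */
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

int main() {
    pthread_t threads[4];
    int i;
    for (i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, worker, NULL);
    }
    for (i = 0; i < 4; i++) {
        pthread_join(threads[i], NULL);
    }
    printf("%ld\n", total);          /* 4 * STEPS */
    return 0;
}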
Hyper-Threading
⇒ 128 virtual cores
Hyper-Threading cannot match the power of 128 physical cores; expect a speedup of about 10%.
Working on Shanghai
ssh shanghai.mpi-inf.mpg.de -l yourname
Enjoy!
Programs ready for you: ssh, scp, screen, emacs, vim, nano, gcc, javac, make, top, htop, time, latex
Need anything else (however exotic)? ⇒ Ask!
Parallel Programming...
...exploits the parallel architecture on a software level: it uses more than one core at once to speed up computation. Good for swarm algorithms.
Designing parallel programs is difficult; having passed a lecture on concurrent programming is highly recommended for this seminar.
Processes
- are scheduled individually
- do not share data
- may communicate via inter-process communication
Threads
- share data and instruction memory (program code)
- are still scheduled individually
- control flow and local variables may differ
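A side-by-side sketch of the difference (my illustration, not from the slides): a forked child process modifies only its own copy of a variable, while a thread modifies the memory it shares with its creator.

#include <stdio.h>
#include <unistd.h>      /* fork */
#include <sys/wait.h>    /* wait */
#include <pthread.h>

int x = 0;

void *thread_body(void *arg) {
    x = 42;              /* threads share data: visible in main */
    return NULL;
}

int main() {
    pthread_t t;

    if (fork() == 0) {   /* the child process gets a copy of x */
        x = 99;          /* changes the copy, not the parent's x */
        return 0;
    }
    wait(NULL);
    printf("after fork:   x = %d\n", x);   /* prints 0 */

    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after thread: x = %d\n", x);   /* prints 42 */
    return 0;
}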
Contents
- Our multicore architecture
- Parallel Programming
- Designing Parallel Programs
- Challenges in parallelization
- Languages and Frameworks
- Performance of parallel programs
- References: Tutorials, Publications

Designing Parallel Programs
Master-Slave Model
- one master thread controls many slave threads
- slave threads do the real work
- the most common approach
Pipeline Model
- the work is split into stages; each thread handles one stage and passes its output on to the next
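A minimal two-stage sketch (my illustration; the stage names and the one-slot buffer are made up): stage 1 produces values while stage 2 consumes them, so both stages work concurrently on different items.

#include <stdio.h>
#include <pthread.h>

int slot;                     /* one-slot buffer between the stages */
int full = 0;                 /* is the slot occupied? */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

void *stage1(void *arg) {     /* producer stage */
    int i;
    for (i = 0; i < 5; i++) {
        pthread_mutex_lock(&m);
        while (full)                      /* wait until stage 2 took the item */
            pthread_cond_wait(&c, &m);
        slot = i * i;                     /* stage 1's work */
        full = 1;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

void *stage2(void *arg) {     /* consumer stage */
    int i, v;
    for (i = 0; i < 5; i++) {
        pthread_mutex_lock(&m);
        while (!full)                     /* wait until stage 1 delivered */
            pthread_cond_wait(&c, &m);
        v = slot;
        full = 0;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
        printf("stage 2 got %d\n", v);    /* stage 2's work */
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}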
Challenges in parallelization
Things to avoid
- deadlocks: all threads are waiting for each other – the program freezes (a minimal sketch follows below)
  http://en.wikipedia.org/wiki/Deadlock
- race conditions: different threads manipulate the same memory location – bugs
  http://en.wikipedia.org/wiki/Race_condition
- bad load balancing: one thread does all the work while the other threads wait – the program is slow
  http://en.wikipedia.org/wiki/Load_balancing_(computing)
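A deliberately broken sketch of the first pitfall (my illustration, not from the slides): two threads acquire the same two locks in opposite order; with unlucky timing, each holds one lock and waits forever for the other.

#include <pthread.h>

pthread_mutex_t l1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t l2 = PTHREAD_MUTEX_INITIALIZER;

void *thread_a(void *arg) {
    pthread_mutex_lock(&l1);
    pthread_mutex_lock(&l2);   /* may wait forever for thread B */
    pthread_mutex_unlock(&l2);
    pthread_mutex_unlock(&l1);
    return NULL;
}

void *thread_b(void *arg) {
    pthread_mutex_lock(&l2);
    pthread_mutex_lock(&l1);   /* may wait forever for thread A */
    pthread_mutex_unlock(&l1);
    pthread_mutex_unlock(&l2);
    return NULL;
}

int main() {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);     /* with bad luck, never returns */
    pthread_join(b, NULL);
    return 0;
}

The standard fix is to agree on one global lock order, so that every thread acquires l1 before l2.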
Desirable properties
- mutual exclusion for resources: a resource is accessed by just one thread at a time
- fairness: each thread gets the computing time and resources it needs
Both properties ensure that a parallel program is efficient (distributed workload) and correct (no race conditions).
- mutual exclusion can be achieved using locks or semaphores (see the sketch after this list)
- fairness requires good design
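A minimal sketch of mutual exclusion with a POSIX semaphore instead of a mutex (my illustration, not from the slides): initialized to 1, the semaphore acts as a binary lock.

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

int counter = 0;
sem_t sem;

void *increment(void *arg) {
    int i;
    for (i = 0; i < 1000; i++) {
        sem_wait(&sem);        /* enter the critical section (lock) */
        counter++;
        sem_post(&sem);        /* leave the critical section (unlock) */
    }
    return NULL;
}

int main() {
    pthread_t threads[10];
    int i;
    sem_init(&sem, 0, 1);      /* 0: shared among threads; 1: binary semaphore */
    for (i = 0; i < 10; i++) {
        pthread_create(&threads[i], NULL, increment, NULL);
    }
    for (i = 0; i < 10; i++) {
        pthread_join(threads[i], NULL);
    }
    printf("%d\n", counter);   /* always 10000 */
    sem_destroy(&sem);
    return 0;
}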
Languages and Frameworks
Overview
- C with POSIX Threads
  - supports threads, mutexes, conditions
  - low-level
  - difficult
- C with OpenMP
  - preprocessor-based
  - use #pragmas to mark parts of the code as "parallel"
  - easy, but very high-level
- Java Threads
  - supports "synchronized" methods
  - easier than POSIX Threads, but probably slower
- MPI (Message Passing Interface)
  - many methods for communication between threads
  - can be combined with POSIX Threads
- ...another framework of your choice?
Example: Concurrent Counting
Goal
- 10 threads concurrently increment a counter indx until it reaches 42.
Challenges
- exclusive access to indx
- do not count further than 42
POSIX Threads

#include <pthread.h>                /* include the library */

int indx = 10;
int max = 42;
pthread_mutex_t lock;               /* locks (to guarantee mutual exclusion)
                                       have the type pthread_mutex_t */
void *counter(void *arg);

int main() {
    pthread_t threads[10];          /* threads are represented by variables
                                       of type pthread_t */
    int i;
    pthread_mutex_init(&lock, NULL);   /* create a new lock */

    /* initialize threads with: pthread_create([thread variable],
       [attributes], [address of starting routine], [arguments]) */
    for (i = 0; i < 10; i++) {
        pthread_create(&threads[i], NULL, counter, NULL);
    }

    /* join threads, i.e., wait for the slave threads to terminate */
    for (i = 0; i < 10; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}

void *counter(void *arg) {
    while (indx < max) {
        pthread_mutex_lock(&lock);   /* acquire the lock (blocks until
                                        the lock is available) */
        if (indx < max) {            /* critical section */
            indx++;
        }
        pthread_mutex_unlock(&lock); /* free the lock; another thread may
                                        enter the critical section */
    }
    return NULL;
}

Compile: gcc -pthread file.c
OpenMP

#include <omp.h>

int main() {
    int indx = 0;
    int max = 42;
    omp_set_num_threads(10);

    #pragma omp parallel shared(indx, max)
    {
        while (indx < max) {
            #pragma omp critical
            if (indx < max) {
                indx++;
            }
        }
    }
    return 0;
}

Compile: gcc -fopenmp file.c
Java Threads

public class ConCount extends Thread {
    private static int indx = 0;
    private static int max = 42;

    public void run() {
        while (indx < max) {
            // lock on the shared class object, not on "this":
            // each thread is its own instance, so synchronized(this)
            // would not give mutual exclusion
            synchronized (ConCount.class) {
                if (indx < max) {
                    indx++;
                }
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            new ConCount().start();
        }
    }
}

Compile: javac ConCount.java
Run: java ConCount
MPI – Message Passing Interface
- one source code file – many processes
- communication instead of shared memory
- a job can be distributed onto different machines in a network
MPI

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int max = 42;
    int indx = 0;
    MPI_Status stat;
    int numtasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* I am the master */
        int i;
        while (indx < max) {
            for (i = 1; i < numtasks; i++) {
                MPI_Send(&indx, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
                MPI_Recv(&indx, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
                printf("%d\n", indx);
                if (indx >= max) {
                    break;
                }
            }
        }
        /* send the final value once more so every slave can terminate */
        for (i = 1; i < numtasks; i++) {
            MPI_Send(&indx, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
        }
    } else {                    /* I am a slave */
        while (indx < max) {
            MPI_Recv(&indx, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &stat);
            indx++;
            MPI_Send(&indx, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
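MPI programs are not built with a plain gcc call: typically one compiles with the MPI wrapper compiler and starts the processes with a launcher, e.g. mpicc file.c followed by mpirun -np 10 ./a.out (assuming an MPI implementation such as Open MPI is installed).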
Performance of parallel programs
Variables and Measures
- #threads ∼ #cores
- speedup = sequential time / parallel time
- the speedup should scale with #threads
- the scheduling/organization overhead should be low: parallel time × #threads should ideally remain constant
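For example, a program that runs in 120 seconds sequentially and in 15 seconds on 8 threads achieves a speedup of 120/15 = 8; parallel time × #threads is then 15 s × 8 = 120 s, exactly the sequential time, i.e., no overhead.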
Tools
- top: displays a list of running processes
- htop: displays the load of individual (virtual!) cores
- time [command]: measures the run time of a command
- /usr/bin/time --output [file] [command]: writes run-time statistics to a file
- TikZ: LaTeX package to create nice graphical representations (\usepackage{tikz}; see the sketch below)
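A minimal TikZ sketch (my illustration; the coordinates are made-up measurements) for a time-vs-#threads chart like the ones on the next slides:

% A minimal TikZ chart; coordinates are made-up measurements
\documentclass{article}
\usepackage{tikz}
\begin{document}
\begin{tikzpicture}[x=0.04cm, y=0.002cm]
  % axes
  \draw[->] (0,0) -- (130,0) node[right] {\#threads};
  \draw[->] (0,0) -- (0,2100) node[above] {time (ms)};
  % one measurement curve
  \draw plot coordinates {(1,2000) (16,400) (64,150) (128,120)};
\end{tikzpicture}
\end{document}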
Performance chart: time and #threads
[Figure: measured run time in ms (y-axis, roughly 3,000 to 300,000 ms) against #threads from 1 to 128 (x-axis). Your time/#threads chart should look similar!]
Parallel speedup factor
[Figure: speedup (y-axis, 0 to 128) against #threads from 1 to 128 (x-axis).]
Scaled speedup
scaletime = #threads × time
In a perfect parallel program, scaletime should be constant.
[Figure: scaletime in ms (#threads × processing time, y-axis, 0 to 40,000 ms) against #threads from 1 to 128 (x-axis).]
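For example, if one thread needs 40,000 ms and 8 threads need 5,000 ms, scaletime is 1 × 40,000 ms = 8 × 5,000 ms = 40,000 ms in both cases: the parallelization adds no overhead.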
References: Tutorials
Tutorials from the Livermore Computing Center
- Introduction to Parallel Programming: https://computing.llnl.gov/tutorials/parallel_comp/
- POSIX Threads: https://computing.llnl.gov/tutorials/pthreads/
- OpenMP: https://computing.llnl.gov/tutorials/openMP/
- MPI: https://computing.llnl.gov/tutorials/mpi/
Java Threads
- The official Java Tutorials offer a comprehensive lesson on concurrency: http://download.oracle.com/javase/tutorial/essential/concurrency/index.html
- (in German) see "Java ist auch eine Insel", Chapter 14: http://openbook.galileocomputing.de/javainsel/
References: Publications
Books
A list of books recommended for the "concurrent programming" lecture:
http://infobib.cs.uni-sb.de/frames/vorlesungen/info-basic_nebenlaeufig.html