Introduction to Multicore Architectures and Parallel Programming

Jan Balzer

October 18, 2011

Multicore Terminology

- physical core: an independent processing unit
- multicore CPU: a CPU chip with multiple cores

single-core < dual-core < quad-core < multicore < manycore

Our champions...

Core: Intel Xeon X7550

- 8 physical cores
- 2 GHz each
- 18 MB L3 cache (!)

The machine

Two connected motherboards act as one machine with...

- ...64 physical cores in total
- ...1 Terabyte of RAM in total

Hardware structure

- 2 motherboards
- 4 processors per motherboard
- 8 cores per processor

One node

Memory management

Rule of thumb: use local variables; avoid global variables whenever possible.
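A minimal sketch (not from the slides) of why this matters: each thread below works in a local variable and touches the shared global only once at the end. Updating the global in every iteration would instead need locking on every step and cause cache traffic between cores. All names here are illustrative.

/* Illustrative sketch: prefer thread-local work over repeated
 * updates of a global variable. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000

long total = 0;                        /* global, shared by all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    long local = 0;                    /* local: private to this thread */
    long i;
    for (i = 0; i < N; i++) {
        local += 1;                    /* no locking, no sharing */
    }
    pthread_mutex_lock(&lock);         /* touch the global only once */
    total += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main() {
    pthread_t threads[4];
    int i;
    for (i = 0; i < 4; i++) pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < 4; i++) pthread_join(threads[i], NULL);
    printf("total = %ld\n", total);
    return 0;
}

Compile: gcc -pthread file.c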

Hyper-Threading

⇒ 128 virtual cores

Hyper-Threading cannot imitate the power of 128 physical cores; expect a speedup of only about 10%.
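As a small aside (not from the slides), a program can ask the operating system how many virtual cores it sees; a minimal sketch for Linux:

/* Illustrative sketch: query the number of online (virtual) cores.
 * On the machine described above this should report 128 when
 * Hyper-Threading is enabled. */
#include <stdio.h>
#include <unistd.h>

int main() {
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online (virtual) cores: %ld\n", cores);
    return 0;
}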

Working on Shanghai

ssh shanghai.mpi-inf.mpg.de -l yourname

Enjoy!

Programs ready for you: ssh, scp, screen, emacs, vim, nano, gcc, javac, make, top, htop, time, latex

Need anything else (exotic)? ⇒ Ask!

Parallel Programming...

...exploits the parallel architecture at the software level. Parallel programming uses more than one core at a time to speed up computation. It is well suited to swarm algorithms.

Designing parallel programs is difficult. Having passed a lecture on concurrent programming is highly recommended for this seminar.

Processes

- are scheduled individually
- do not share data
- may communicate via inter-process communication

Threads

- share data and instruction memory (program code)
- are still scheduled individually
- control flow and local variables may differ
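A minimal illustration (not from the slides) of the difference: threads created with pthread_create see the same global variable, so a write by one thread is visible to the others.

/* Illustrative sketch: threads share globals.
 * A child process created with fork() would get its own copy instead. */
#include <pthread.h>
#include <stdio.h>

int shared = 0;                      /* visible to every thread in the process */

void *writer(void *arg) {
    shared = 42;                     /* modifies the single shared copy */
    return NULL;
}

int main() {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(t, NULL);           /* wait, then read what the thread wrote */
    printf("shared = %d\n", shared); /* prints 42 */
    return 0;
}

Compile: gcc -pthread file.c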

Contents

- Our multicore architecture
- Parallel Programming
- Designing Parallel Programs
- Challenges in parallelization
- Languages and Frameworks
- Performance of parallel programs
- References
- Tutorials
- Publications

Designing Parallel Programs

Master-Slave Model

- one master thread controls many slave threads
- the slave threads do the real work
- most common approach

Pipeline Model

Challenges in parallelization

Things to avoid

- deadlocks: all threads wait for each other – the program freezes
  http://en.wikipedia.org/wiki/Deadlock
- race conditions: different threads manipulate the same memory location – bugs
  http://en.wikipedia.org/wiki/Race_condition
- bad load balancing: one thread does all the work while the other threads wait – the program is slow
  http://en.wikipedia.org/wiki/Load_balancing_(computing)
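To make the race-condition point concrete, here is a minimal sketch (not from the slides): two threads increment the same counter without any lock, so increments get lost and the final value is usually less than 2,000,000.

/* Illustrative sketch of a race condition: both threads do an
 * unsynchronized read-modify-write on `count`. */
#include <pthread.h>
#include <stdio.h>

int count = 0;

void *inc(void *arg) {
    int i;
    for (i = 0; i < 1000000; i++) {
        count++;                       /* not atomic: load, add, store */
    }
    return NULL;
}

int main() {
    pthread_t a, b;
    pthread_create(&a, NULL, inc, NULL);
    pthread_create(&b, NULL, inc, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("count = %d (expected 2000000)\n", count);
    return 0;
}

Compile: gcc -pthread file.c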

Desirable properties

- mutual exclusion for resources: a resource is accessed by just one thread at a time
- fairness: each thread gets the computing time and resources it needs

Both properties ensure that a parallel program is efficient (distributed workload) and correct (no race conditions).

- mutual exclusion can be achieved using locks or semaphores (see the semaphore sketch below)
- fairness requires good design
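The examples later in this deck use locks; as a companion, here is a minimal sketch (not from the slides) of the same counting task protected by a POSIX semaphore. A binary semaphore initialized to 1 lets at most one thread into the critical section at a time.

/* Illustrative sketch: mutual exclusion with a POSIX semaphore. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

int indx = 0;
int max = 42;
sem_t sem;

void *counter(void *arg) {
    for (;;) {
        sem_wait(&sem);                 /* enter critical section */
        if (indx >= max) {              /* done: leave and stop */
            sem_post(&sem);
            break;
        }
        indx++;
        sem_post(&sem);                 /* leave critical section */
    }
    return NULL;
}

int main() {
    pthread_t threads[10];
    int i;
    sem_init(&sem, 0, 1);               /* initial value 1 -> acts as a lock */
    for (i = 0; i < 10; i++) pthread_create(&threads[i], NULL, counter, NULL);
    for (i = 0; i < 10; i++) pthread_join(threads[i], NULL);
    printf("indx = %d\n", indx);
    sem_destroy(&sem);
    return 0;
}

Compile: gcc -pthread file.c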

Languages and Frameworks

Overview

- C with POSIX Threads
  - supports threads, mutexes, conditions
  - low-level
  - difficult
- C with OpenMP
  - preprocessor-based
  - use #pragmas to mark parts of the code as "parallel"
  - easy, but very high-level
- Java Threads
  - supports "synchronized" methods
  - easier than POSIX Threads, but probably slower
- MPI (Message Passing Interface)
  - many methods for communication between threads
  - can be combined with POSIX Threads
- ...another framework of your choice?

Example: Concurrent Counting

Goal
- 10 threads concurrently increment a counter indx until it reaches 42.

Challenges
- exclusive access to indx
- do not count further than 42

POSIX Threads

#include <pthread.h>                      /* include the library */

int indx = 10;
int max = 42;
pthread_mutex_t lock;                     /* locks (to guarantee mutual exclusion) have the type pthread_mutex_t */

void *counter(void *arg);

int main() {
    pthread_t threads[10];                /* threads are represented by variables of type pthread_t */
    int i;
    pthread_mutex_init(&lock, NULL);      /* create a new lock */

    /* initialize threads with:
       pthread_create([thread variable], [attributes],
                      [address of starting routine], [arguments]) */
    for (i = 0; i < 10; i++) {
        pthread_create(&threads[i], NULL, counter, NULL);
    }

    /* join threads, i.e., wait for the slave threads to terminate */
    for (i = 0; i < 10; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}

void *counter(void *arg) {
    while (indx < max) {
        pthread_mutex_lock(&lock);        /* acquire the lock (blocks until the lock is available) */
        if (indx < max) {                 /* critical section */
            indx++;
        }
        pthread_mutex_unlock(&lock);      /* free the lock, another thread may enter the critical section */
    }
    return NULL;
}

Compile: gcc -pthread file.c

OpenMP

#include <omp.h>

int main() {
    int indx = 0;
    int max = 42;
    omp_set_num_threads(10);

    #pragma omp parallel shared(indx, max)
    {
        while (indx < max) {
            /* only one thread at a time executes the if block */
            #pragma omp critical
            if (indx < max) {
                indx++;
            }
        }
    }
    return 0;
}

Compile: gcc -fopenmp file.c
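Beyond critical sections, a very common OpenMP pattern is to mark a loop as parallel and let a reduction clause combine per-thread partial results; a minimal sketch (not from the slides):

/* Illustrative sketch: parallel loop with a reduction instead of a
 * critical section. Each thread sums into its private copy of `sum`,
 * and OpenMP combines the copies at the end of the loop. */
#include <omp.h>
#include <stdio.h>

int main() {
    int i;
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000000; i++) {
        sum += i;
    }
    printf("sum = %lld\n", sum);
    return 0;
}

Compile: gcc -fopenmp file.c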

Java Threads

public class ConCount extends Thread {
    private static int indx = 0;
    private static int max = 42;

    public void run() {
        while (indx < max) {
            // synchronize on a monitor shared by all threads;
            // synchronized(this) would lock each Thread instance separately
            synchronized (ConCount.class) {
                if (indx < max) {
                    indx++;
                }
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            new ConCount().start();
        }
    }
}

Compile: javac ConCount.java
Run: java ConCount

MPI (Message Passing Interface)

- one source-code file – many processes
- communication instead of shared memory
- a job can be distributed onto different machines in a network

MPI

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int max = 42;
    int indx = 0;
    MPI_Status stat;
    int numtasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* I am the master */
        int i;
        while (indx < max) {
            for (i = 1; i < numtasks; i++) {
                MPI_Send(&indx, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
                MPI_Recv(&indx, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
                printf("%d\n", indx);
                if (indx >= max) {
                    break;
                }
            }
        }
        /* send the final value so every slave leaves its loop */
        for (i = 1; i < numtasks; i++) {
            MPI_Send(&indx, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
        }
    } else {                               /* I am a slave */
        while (indx < max) {
            MPI_Recv(&indx, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &stat);
            indx++;
            MPI_Send(&indx, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
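The slides give no build instructions for this one; as a rough guide (assuming an MPI implementation such as Open MPI is installed, exact commands depend on the installation):

Compile: mpicc file.c
Run: mpirun -np 8 ./a.out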

Contents

Our multicore-architecture Parallel Programming Designing Parallel Programs Challenges in parallelization Languages and Frameworks Performance of parallel programs References Tutorials Publications

Variables and Measures

#threads ∼ #cores

speedup = sequential time / parallel time

- speedup should scale with #threads
- scheduling/organization overhead should be low: parallel time × #threads should ideally remain constant
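A hypothetical example (numbers invented for illustration): if a program needs 120 s sequentially and 10 s with 16 threads, the speedup is 120/10 = 12 rather than the ideal 16, and parallel time × #threads = 160 s instead of the ideal 120 s; the difference is scheduling/organization overhead.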

Tools

- top: displays a list of running processes
- htop: displays the load of individual (virtual!) cores
- time [command]: measures execution time
- /usr/bin/time --output [file] [command]: writes statistics to a file
- TikZ: LaTeX package to create nice graphical representations (\usepackage{tikz})
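Besides the external tools above, elapsed wall-clock time can also be measured inside a program; a minimal sketch (not from the slides) using clock_gettime:

/* Illustrative sketch: measure elapsed wall-clock time around a
 * (here trivial) piece of work using the monotonic clock. */
#include <stdio.h>
#include <time.h>

int main() {
    struct timespec start, end;
    long i;
    volatile long sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < 100000000; i++) {       /* work to be timed */
        sink += i;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double seconds = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.3f s\n", seconds);
    return 0;
}

Compile: gcc file.c (older glibc versions may additionally need -lrt)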

Performance chart: time and #threads

[Chart: Time (ms) on the y-axis vs. #Threads (1–128) on the x-axis.]

Your time/#threads chart should look similar!

Parallel speedup factor

[Chart: Speedup on the y-axis vs. #Threads (1–128) on the x-axis.]

Scaled speedup

scaletime = #threads × parallel time

In a perfect parallel program, scaletime should remain constant.

[Chart: #Threads × Processing time (ms) on the y-axis vs. #Threads (1–128) on the x-axis.]

References: Tutorials

Tutorials from Livermore Computing Centre

- Introduction to Parallel Programming:
  https://computing.llnl.gov/tutorials/parallel_comp/
- POSIX Threads:
  https://computing.llnl.gov/tutorials/pthreads/
- OpenMP:
  https://computing.llnl.gov/tutorials/openMP/
- MPI:
  https://computing.llnl.gov/tutorials/mpi/

Java Threads

- The official Java Tutorials offer a comprehensive lesson on concurrency:
  http://download.oracle.com/javase/tutorial/essential/concurrency/index.html
- (in German) see "Java ist auch eine Insel", Chapter 14:
  http://openbook.galileocomputing.de/javainsel/

References: Publications

Books

A list of books recommended for the "concurrent programming" lecture:
http://infobib.cs.uni-sb.de/frames/vorlesungen/info-basic_nebenlaeufig.html
