Ver DJ2013-05 as of 16 Sep 2013 PowerPoint original available on request
Introduction to Parallel Computing Doug James
[email protected] Sep 2013
© The University of Texas at Austin, 2013 Please see the final slide for copyright and licensing information.
Overview
• Architectures and Programming Models
• Levels of Parallelism
• Practical and Theoretical Performance Limits
• Other Issues and Challenges
• Summary
Architectures and Programming Models
What is Parallel Programming? More than one paint brush!
Paint the fence faster… …or paint a bigger fence
Paint brushes = cores
The Adventures of Tom Sawyer, by Mark Twain, illustrated by Worth Brehm, 1910. In the public domain. From Beinecke Rare Book & Manuscript Library. http://brbl-dl.library.yale.edu/vufind/Record/3520172?image_id=1010069
What is Parallel Programming? More than one mower!
Mow the lawn faster… …or mow a bigger lawn
Lawn mowers = cores
Brett Chisum 2012 (Augusta National), via Wikimedia Commons. http://www.flickr.com/photos/brettchisum/7051114207
Shared Memory
• All cores share a common pool of memory (RAM)
• The programming challenge is coordination: how to avoid competing for access to the same puzzle pieces (memory)
• Principal programming model: OpenMP
• A single executable spawns independent threads and manages threads’ access to data
[Diagram: six cores all connected to one shared pool of memory (RAM). Jigsaw image: Octahedron80 2007, Wikimedia Commons, http://commons.wikimedia.org/wiki/File:Jigsaw_pieces_with_border.jpg]
Distributed Memory
• Each core* has its own memory (RAM), inaccessible to other cores
• The programming challenge is communication: how to share puzzle pieces (data)
• Principal programming model: MPI (Message Passing Interface)
• Every assigned core runs a separate copy of the same executable -- a “rank-aware” task
[Diagram: six cores, each with its own private RAM, connected by a network. Jigsaw image: Octahedron80 2007, Wikimedia Commons]
*we’ll modify this in a few slides
Hybrid Architecture
• Most large clusters are hybrids of these models
  – Each node (blade) is a multi-core shared-memory computer running its own (Linux) operating system
  – Many such nodes are connected in a distributed configuration
  – Each core sees only the memory on its own node!
[Diagram: several 16-core nodes, each with its own RAM, connected by a network]
Programming Hybrid Architectures
• Programming models vary
  – Pure MPI: ignore shared memory
  – Hybrid: mix MPI and OpenMP
  – Pure OpenMP: confine yourself to one node
[Diagram: several 16-core nodes, each with its own RAM, connected by a network]
Hybrid Architecture with MICs
Stampede’s Xeon Phi MICs present additional programming models:
– Native: MIC as a stand-alone shared-memory computer (OpenMP, MPI)
– Symmetric: MICs running MPI tasks alongside other MICs and the Sandy Bridge hosts
– Offload: MIC as a coprocessor serving the Sandy Bridge E5 host -- like a General-Purpose Graphics Processing Unit (GPGPU)
[Diagram: 16-core nodes, each with its own RAM and attached MIC coprocessors, each MIC with its own RAM]
Levels of Parallelism
Needle(s) in the Haystack(s)
• First approach: think top-down and coarse-grained
• Partition the work into essentially independent tasks
P.N.Alhucemas (Oruteta) 2009, Wikimedia Commons. http://commons.wikimedia.org/wiki/File:Almiar_(1).JPG
Paul Allison 2007 (http://www.geograph.org.uk/photo/602033), Wikimedia Commons. http://commons.wikimedia.org/wiki/File:Hay_Bales_-_geograph.org.uk_-_602033.jpg
Coarse-Grained Parallelism
[Diagram: tasks 0–3 assigned one-to-one to processors 0–3]
• Assign tasks to processors (nodes, cores, …)
• Also called task-based parallelism
Single Program Multiple Data (SPMD)
[Diagram: processors 0–3 each run “my app” on its own dataset 0–3]
• The same code operates on different data
• Logic within the program may differ across processors
• How much communication, coordination, synchronization?
Massive (Embarrassing) Parallelism
[Diagram: processors 0–3 each run “my app” on its own dataset 0–3]
• High degree of independence
• Little to no coordination, communication
Massive (Embarrassing) Parallelism
[Diagram: processors 0–3 each run “my app” on its own dataset 0–3]
• Important example: parameter sweeps
• We have tools that support this: launcher, pylauncher
Domain Decomposition
Key issues:
– Dependencies across ghost (halo/transition/boundary) regions
– Communication
– Load balancing
– Bookkeeping (code complexity)
Images: Wikimedia Commons 2010, https://commons.wikimedia.org/wiki/File:Elmer-pump-heatequation.png; Bal 79 on Wikimedia Commons 2008, http://commons.wikimedia.org/wiki/File:Z88v13_1.jpg; Doug James 2013; Ethan Hein 2008, http://www.flickr.com/photos/ethanhein/2352707753/; BryanBrandenburg.net 2012, Wikimedia Commons, http://commons.wikimedia.org/wiki/File:Bryan_Brandenburg_Big_Bang_Big_Bagel_Theory_Howard_Boom.jpg
Fine-Grained Parallelism: Vectorization
One combine, multiple rows of wheat
C. Holmes 2009, Wikimedia Commons. http://www.flickr.com/photos/inventorchris2/7723117886/
Fine-Grained Parallelism: Vectorization
One core, multiple calculations
Fine-Grained Parallelism: Vectorization
Think tight, long inner loops with a few familiar array calculations:

  /* C-style loop */
  for ( int i = 0; i < n; i++ )
      a[i] = b[i] + scalar * c[i];