April 1, 2009

LAPACK Past – Present – Future

Julie Langou
Center for Computational Science, University of Tsukuba, Japan

Toulouse, France
September 1999 – September 2003

Keywords: Airbus, Sun, Rugby, Soccer

Knoxville, TN, USA
ICL, University of Tennessee
October 2003 – August 2006

ICL - Jack Dongarra:
•  TOP500
•  LAPACK / ScaLAPACK
•  Open MPI, FT-MPI
•  PAPI
•  NetSolve / GridSolve

Keywords: Sunsphere, Vols, Smoky Mountains, ORNL

Denver, Colorado USA

Keywords: Rocky Mountains, Ski, Sun (300 days per year), Rockies


LAPACK
•  INTRODUCTION
•  PAST
   •  Motivation
   •  Success
•  PRESENT
   •  3.1 release
   •  3.2 release
   •  Future release
   •  Software Engineering
•  FUTURE
   •  Plasma
   •  Magma
•  CONCLUSION

INFRASTRUCTURE OF LAPACK
•  provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices.
•  supports real and complex arithmetic
•  supports single and double precision
•  interface in Fortran (accessible from C)
•  great care in handling NaNs, Infs, and denormals
•  large test suite
•  relies on the BLAS for high performance
•  standard interfaces
•  optimized versions from vendors (Intel MKL, IBM ESSL, AMD ACML, ...)
•  huge community support (contribution and maintenance)
•  insists on readability of the code, documentation, and comments
•  no memory allocation (but a workspace query mechanism)
•  error handling (the code tries not to abort whenever possible and returns an INFO value)
•  great portability
•  tunable software through ILAENV
•  open source code
•  exceptional longevity! (in particular in the context of ever-changing architectures)
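The INFO-based error handling is easiest to see through a binding. A small sketch using SciPy's low-level LAPACK wrapper (SciPy is an assumption here; it simply exposes the Fortran routine DGESV):

```python
import numpy as np
from scipy.linalg.lapack import dgesv  # LAPACK double-precision linear solver

a = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

# dgesv returns the LU factors, pivot indices, the solution, and INFO:
# INFO = 0 on success; INFO > 0 reports an exactly singular U(i, i)
# instead of aborting, as described above.
lu, piv, x, info = dgesv(a, b)
print(info, x)

# A singular system does not abort either -- LAPACK just sets INFO > 0.
sing = np.array([[1.0, 2.0], [2.0, 4.0]])
_, _, _, info_sing = dgesv(sing, b)
print(info_sing)
```

The same convention (an INFO output argument rather than a hard abort) holds across the driver and computational routines.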


LAPACK
•  INTRODUCTION
•  PAST
   •  Motivation
   •  Success
•  PRESENT
   •  3.1 release
   •  3.2 release
   •  Coming release
•  FUTURE
   •  Plasma
   •  Magma
•  CONCLUSION

LAPACK History

1.0           February 29, 1992
1.0a          June 30, 1992
1.0b          October 31, 1992
1.1           March 31, 1993
2.0           September 30, 1994
3.0           June 30, 1999
3.0 (update)  October 31, 1999
3.0 (update)  May 31, 2000
3.1           November 12, 2006
3.1.1         February 26, 2007
3.2           November 18, 2008

LAPACK Team
•  University of Tennessee [Jack Dongarra]
•  University of California, Berkeley [Jim Demmel]
•  University of Colorado Denver [Julien Langou]
•  NAG Ltd. [Sven Hammarling]

LAPACK Community

Core Team
•  University of Tennessee [Jack Dongarra]
•  University of California, Berkeley [Jim Demmel]
•  University of Colorado Denver [Julien Langou]
•  NAG Ltd. [Sven Hammarling]

Developers (LAPACK 3.2)
Deaglan Halligan (University of California at Berkeley, USA), Sven Hammarling (NAG Ltd. and University of Manchester, UK), Yozo Hida (University of California at Berkeley, USA), Daniel Kressner (ETH Zurich, Switzerland), Julie Langou (University of Tennessee, USA), Julien Langou (University of Colorado Denver, USA), Xiaoye Li (Lawrence Berkeley Laboratory, USA), Osni Marques (Lawrence Berkeley Laboratory, USA), Jason Riedy (University of California at Berkeley, USA), Edward Smyth (NAG Ltd., UK), Meghanath Vishvanath (University of California at Berkeley, USA), David Vu (University of California at Berkeley, USA), David Bailey (Lawrence Berkeley Laboratory, USA), Greg Henry (Intel, USA), Jimmy Iskandar (University of California at Berkeley, USA), William Kahan (University of California at Berkeley, USA), Anil Kapur (University of California at Berkeley, USA), Suh Y. Kang (University of California at Berkeley, USA), Soni Mukherjee (University of California at Berkeley, USA), Michael Martin (University of California at Berkeley, USA), Brandon Thompson (University of California at Berkeley, USA), Teresa Tung (University of California at Berkeley, USA), Daniel Yoo (University of California at Berkeley, USA)

External Contributors
Patrick Alken (University of Colorado at Boulder, USA); Penny Anderson, Bobby Cheng, Cleve Moler, Duncan Po, and Pat Quillen (MathWorks, MA, USA); Michael Baudin (Scilab, FR); Ralph Byers (University of Kansas, USA); Michael Chuvelev (Intel, USA); Phil DeMier (IBM, USA); Michel Devel (UTINAM institute, University of Franche-Comte, UMR CNRS, FR); Zlatko Drmac (University of Zagreb, Croatia); Peng Du (University of Tennessee, Knoxville, USA); Alan Edelmann (Massachusetts Institute of Technology, MA, USA); Carlo de Falco and all the Octave developers; Fernando Guevara (University of Utah, UT, USA); Fred Gustavson (IBM Watson Research Center, NY, USA); Christian Keil; Zbigniew Leyk (Wolfram, USA); Craig Lucas (University of Manchester / NAG Ltd., UK); Joao Moreira de Sa Coutinho; Lawrence Mulholland and Mick Pont (NAG, UK); Kresimir Veselic (Fernuniversitaet Hagen, Hagen, Germany); Jerzy Wasniewski (Technical University of Denmark, Lyngby, Copenhagen, Denmark); Clint Whaley (University of Texas at San Antonio, TX, USA); Mikhail Wolfson (MIT, USA); Vittorio Zecca

Users (bug reports / patches)

LAPACK
•  INTRODUCTION
•  PAST
   •  Motivation
   •  Success
•  PRESENT
   •  3.1 release
   •  3.2 release
   •  Future release
   •  Software Engineering
•  FUTURE
   •  Plasma
   •  Magma
•  CONCLUSION

LAPACK 3.1  [Release date: 11/12/2006]

1.  Hessenberg QR algorithm with the small-bulge multi-shift QR algorithm together with aggressive early deflation. This is an implementation of the 2003 SIAM SIAG/LA Prize winning algorithm of Braman, Byers and Mathias that significantly speeds up the nonsymmetric eigenproblem.
2.  Improvements of the Hessenberg reduction subroutines. These accelerate the first phase of the nonsymmetric eigenvalue problem. See the reference by G. Quintana-Orti and van de Geijn.
3.  New MRRR eigenvalue algorithms that also support subset computations. These implementations of the 2006 SIAM SIAG/LA Prize winning algorithm of Dhillon and Parlett are also significantly more accurate than the version in LAPACK 3.0.
4.  Mixed precision iterative refinement subroutines for exploiting fast single precision hardware. On platforms like the Cell processor that do single precision much faster than double, linear systems can be solved many times faster. Even on commodity processors there is a factor of 2 in speed between single and double precision. These are prototype routines in the sense that their interfaces might change based on user feedback.
5.  New partial column norm updating strategy for QR factorization with column pivoting. This fixes a subtle numerical bug dating back to LINPACK that can give completely wrong results. See the reference by Drmac and Bujanovic.
6.  Thread safety: removed all the SAVE and DATA statements (or provided alternate routines without those statements), increasing reliability on SMPs.
7.  Additional support for matrices with NaN/subnormal elements and optimization of the balancing subroutines, improving reliability.
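The mixed precision iterative refinement scheme described above can be sketched in a few lines. This is a schematic NumPy model of the idea behind DSGESV, not the routine itself; in particular, a real implementation factors the matrix once in single precision and reuses the LU factors, where this sketch re-solves for clarity:

```python
import numpy as np

def mixed_precision_solve(a, b, iters=5):
    """Schematic mixed precision iterative refinement:
    the O(n^3) solve runs in float32, the O(n^2) refinement
    steps compute residuals in float64."""
    a32, b32 = a.astype(np.float32), b.astype(np.float32)
    x = np.linalg.solve(a32, b32).astype(np.float64)  # cheap single-precision solve
    for _ in range(iters):
        r = b - a @ x  # residual in double precision
        # Correction computed with the single-precision matrix
        # (a real code would reuse the float32 LU factors here).
        d = np.linalg.solve(a32, r.astype(np.float32))
        x = x + d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
a = rng.standard_normal((100, 100))
b = rng.standard_normal(100)
x = mixed_precision_solve(a, b)
```

For a reasonably conditioned system the refined solution reaches double-precision accuracy while the expensive factorization ran entirely in single precision, which is exactly where the factor-of-2 (or more, on the Cell) speedup comes from.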


LAPACK 3.1 Contributors

1.  Hessenberg QR algorithm with the small-bulge multi-shift QR algorithm together with aggressive early deflation: Karen Braman and Ralph Byers, Dept. of Mathematics, University of Kansas, USA
2.  Improvements of the Hessenberg reduction subroutines: Daniel Kressner, Dept. of Mathematics, University of Zagreb, Croatia
3.  New MRRR eigenvalue algorithms that also support subset computations: Inderjit Dhillon, University of Texas at Austin, USA; Beresford Parlett, University of California at Berkeley, USA; Christof Voemel, Lawrence Berkeley National Laboratory, USA
4.  Mixed precision iterative refinement subroutines for exploiting fast single precision hardware: Julie Langou, UTK; Julien Langou, CU Denver; Jack Dongarra, UTK
5.  New partial column norm updating strategy for QR factorization with pivoting: Zlatko Drmac and Zvonimir Bujanovic, Dept. of Mathematics, University of Zagreb, Croatia
6.  Thread safety (removed all the SAVE and DATA statements, or provided alternate routines without those statements): Sven Hammarling, NAG Ltd., UK
7.  Additional support for matrices with NaN/subnormal elements, optimization of the balancing subroutines: Bobby Cheng, MathWorks, USA


Speed up Hessenberg QR [Braman, Byers and Mathias]

ARCH: Intel Pentium 4 (3.4 GHz)
F77: GNU Fortran (GCC) 3.4.4
BLAS: libgoto_prescott32p-r1.00.so (one thread)

LAPACK 3.2  [Release date: 11/18/2008]

Now requires a FORTRAN 90 compiler.

1.  Extra Precise Iterative Refinement: "guarantees" answers accurate to machine precision.
2.  XBLAS, the portable extra-precision and mixed-precision BLAS: http://www.netlib.org/xblas/
3.  Non-Negative Diagonals from Householder QR: the QR factorization routines now guarantee that the diagonal is both real and non-negative.
4.  High Performance QR and Householder Reflections on Low-Profile Matrices.
5.  New fast and accurate Jacobi SVD [Drmac / Veselic – no complex support; the input matrix must have M >= N]: a high-accuracy SVD routine for dense matrices that can compute tiny singular values to many more correct digits than SGESVD when the matrix has columns differing widely in norm, and usually runs faster than SGESVD too.
6.  Rectangular Full Packed format [Gustavson].
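The non-negative-diagonal QR guarantee can be mimicked outside LAPACK: any QR factorization can be normalized so that diag(R) >= 0 without changing the product. A small NumPy sketch of the effect (LAPACK 3.2 instead builds the property directly into the Householder reflectors):

```python
import numpy as np

def qr_nonneg_diag(a):
    """QR factorization post-processed so that diag(R) >= 0,
    mirroring the guarantee the LAPACK 3.2 QR routines make.
    numpy.linalg.qr itself makes no sign guarantee."""
    q, r = np.linalg.qr(a)
    # Flip the sign of each row of R (and matching column of Q)
    # wherever the diagonal entry is negative; Q @ R is unchanged
    # because the sign matrix S satisfies S @ S = I.
    signs = np.where(np.diag(r) < 0, -1.0, 1.0)
    return q * signs, signs[:, None] * r

a = np.random.default_rng(1).standard_normal((5, 5))
q, r = qr_nonneg_diag(a)
```

With a non-negative real diagonal, the QR factorization of a full-rank matrix is unique, which simplifies testing and cross-library comparisons.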

LAPACK 3.2  [Release date: 11/18/2008] (continued)

7.  Pivoted Cholesky.
8.  Mixed precision iterative refinement subroutines for exploiting fast single precision hardware.
9.  Variants of the one-sided factorizations: LU Crout, LU Left-Looking, LU Sivan Toledo's Recursive as Iterative, QR Left-Looking, Cholesky Right-Looking, Cholesky Top.
10. Bug fixes for the bidiagonal SVD routine that fix some rare convergence failures.
11. Better multishift Hessenberg QR algorithm with aggressive early deflation.
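Pivoted Cholesky (LAPACK's xPSTRF) handles positive semidefinite matrices by always pivoting the largest remaining diagonal entry to the front and stopping when the trailing block is numerically zero. A minimal unblocked NumPy sketch of the idea (the LAPACK routine is blocked and uses a more careful stopping criterion):

```python
import numpy as np

def pivoted_cholesky(a, tol=1e-12):
    """Unblocked pivoted Cholesky sketch: A[piv][:, piv] = L @ L.T.
    Returns L, the permutation, and the numerical rank."""
    a = a.copy().astype(float)
    n = a.shape[0]
    piv = np.arange(n)
    L = np.zeros_like(a)
    for k in range(n):
        # Bring the largest remaining diagonal entry to position k.
        j = k + int(np.argmax(np.diag(a)[k:]))
        a[[k, j]] = a[[j, k]]
        a[:, [k, j]] = a[:, [j, k]]
        L[[k, j], :k] = L[[j, k], :k]
        piv[[k, j]] = piv[[j, k]]
        if a[k, k] <= tol:  # trailing block is (numerically) zero: done
            return L, piv, k
        L[k, k] = np.sqrt(a[k, k])
        L[k + 1:, k] = a[k + 1:, k] / L[k, k]
        # Update the trailing Schur complement.
        a[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], L[k + 1:, k])
    return L, piv, n

# Rank-2 positive semidefinite 4x4 matrix: plain Cholesky would break down.
g = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 0.0], [3.0, 1.0]])
a = g @ g.T
L, piv, rank = pivoted_cholesky(a)
```

Unlike the unpivoted routine, this both reveals the numerical rank and produces a valid factorization of the permuted matrix.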


Expected Additions for the future

•  multishift QZ with aggressive early deflation
•  block reordering algorithm
•  extra-precise iterative refinement for overdetermined least squares
•  accurate and efficient Givens rotations - http://www.cs.berkeley.edu/~demmel/Givens/
•  BLAS 2.5 cache-efficient bidiagonalization
•  Level 3 BLAS upgrade
•  support for more matrix types in extra-precise iterative refinement
•  ...

References:
•  Bo Kagstrom and Daniel Kressner. Multishift variants of the QZ algorithm with aggressive early deflation. SIAM J. Matrix Anal. Appl., 29(1):199-227, 2006.
•  Daniel Kressner. Block algorithms for reordering standard and generalized Schur forms. ACM Trans. Math. Software, 32(4):521-532, 2006.
•  James Demmel, Yozo Hida, Xiaoye S. Li, and E. Jason Riedy. Extra-precise iterative refinement for overdetermined least squares problems. LAPACK Working Note 188, May 2007.
•  David Bindel, James Demmel, William Kahan, and Osni Marques. On computing Givens rotations reliably and efficiently. ACM Trans. Math. Software, 28(2):206-238, 2002.
•  Gary W. Howell, James Demmel, Charles T. Fulton, Sven Hammarling, and Karen Marmol. Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Trans. Math. Software, 34(3), 2008.
•  Fixes to dsytrs, ssytrs, zhetrs, chetrs, dsytri, ssytri, zhetri, chetri when called with multiple right-hand sides (MathWorks).
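The Givens rotation item above concerns computing c and s reliably. The textbook construction is short; the cited Bindel et al. paper is about making it robust across the full exponent range, which this sketch only partially addresses by using np.hypot to avoid overflow in sqrt(a^2 + b^2):

```python
import numpy as np

def givens(a, b):
    """Compute c, s, r such that [[c, s], [-s, c]] @ [a, b] = [r, 0].
    np.hypot evaluates sqrt(a^2 + b^2) without spurious overflow,
    one of the pitfalls a reliable implementation must handle.
    (Sign conventions differ from LAPACK's DLARTG when b == 0.)"""
    if b == 0.0:
        return 1.0, 0.0, a
    r = np.hypot(a, b)
    return a / r, b / r, r

c, s, r = givens(3.0, 4.0)
# Applying the rotation zeroes the second component:
v = np.array([[c, s], [-s, c]]) @ np.array([3.0, 4.0])
```

Rotations like this are the building block of QR updates and of the bidiagonal and tridiagonal eigenvalue/SVD iterations, which is why getting them both fast and reliable matters.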


Windows deployment

11 Microsoft HPC Institutes (2005-2008):
•  Cornell Computational Biology Service Unit
•  University of Utah
•  Southampton University
•  TACC - University of Texas, Austin
•  HLRS - University of Stuttgart
•  Nizhny Novgorod University
•  Shanghai Supercomputer Center
•  Shanghai Jiao Tong University
•  University of Virginia
•  University of Tennessee
•  Tokyo Institute of Technology

Windows deployment

•  Released nmake build with Intel and PGI Windows compilers.
•  Released pre-built libraries for 32 and 64 bits.
•  Released LAPACK Visual Studio Solution.
•  Released FULL package with installer.
•  Available from: http://icl.cs.utk.edu/lapack-for-windows/

[Screenshots: Installer, GUI Examples, GUI Testing]

Other Windows deployments

ScaLAPACK
•  Released nmake build with Intel and PGI Windows compilers.
•  Released pre-built libraries for 64 bits.
•  Released ScaLAPACK and BLACS Visual Studio Solutions (64 bits / 32 bits).
•  Available from: http://icl.cs.utk.edu/lapack-for-windows/

[Chart: DGESV, Windows vs. Linux – time in seconds vs. matrix dimension (100 to 2000) for ACML, MKL, and ATLAS on each OS. Linux: ATLAS 3.6.0 / MKL 9.0 / ACML 3.6.1. Windows: ATLAS 3.7.xx / MKL 8.1 / ACML 3.6.0. Same machine (dual boot), AMD Dual Core Opteron, 1.87 GHz.]


Other Windows deployments

CLAPACK 3.1.1
•  Objective: provide LAPACK to users who do not have a Fortran compiler on their machine.
•  CLAPACK is more or less an automagically C-translated LAPACK (via f2c).
•  Visual Studio Solution with the latest f2c library.
•  Pre-built libraries.

Software Engineering

•  ScaLAPACK installer [first version released December 2007; latest, 0.92, released August 6, 2008]
   •  Python script (inspired by the PETSc installation process)
•  LAPACK user support [users still need support!]
   •  Mailing list: [email protected]
   •  Forum: http://icl.cs.utk.edu/lapack-forum
   •  Topics: installation, algorithms, portability, languages

LAPACK
•  INTRODUCTION
•  PAST
   •  Motivation
   •  Success
•  PRESENT
   •  3.1 release
   •  3.2 release
   •  Future release
   •  Software Engineering
•  FUTURE
   •  Plasma
   •  Magma
•  CONCLUSION

Future… Something happening here…
From K. Olukotun, L. Hammond, H. Sutter, and B. Smith

A hardware issue just became a software problem:
•  In the "old days", each year processors would become faster.
•  Today the clock speed is fixed or getting slower.
•  Things are still doubling every 18-24 months.
•  Moore's Law reinterpreted: the number of cores doubles every 18-24 months.

Today's Multicores and GPUs

98% of Top500 systems are based on multicore:
Sun Niagara2, IBM Cell, Intel Polaris, SiCortex, IBM BG/P, AMD Opteron

Lessons for the future

•  Moore's Law reinterpreted: the number of cores per chip doubles every two years, while clock speed stays roughly stable.
•  Need to deal with systems with millions of concurrent threads.
•  MPI and programming languages from the 60's will not make it.

PLASMA

•  The Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) project aims to address the critical and highly disruptive situation facing the linear algebra and high-performance computing community due to the introduction of multi-core architectures.
•  PLASMA's ultimate goal is to create software frameworks that enable programmers to simplify the process of developing applications that can achieve both high performance and portability across a range of new architectures.
•  The development of programming models that enforce asynchronous, out-of-order scheduling of operations is the concept used as the basis for a scalable yet highly efficient software framework for computational linear algebra applications.

http://icl.cs.utk.edu/plasma

PLASMA: Tile Algorithms

[Figures: block algorithms (LAPACK) vs. tile algorithms (PLASMA)]
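The tile idea can be sketched in NumPy. This is a serial model of the data flow only: each statement in the loop body is one per-tile task (POTRF, TRSM, SYRK/GEMM) that PLASMA would instead hand to a dynamic scheduler running on the cores:

```python
import numpy as np

def tiled_cholesky(a, nb):
    """Right-looking tiled Cholesky, serial sketch of PLASMA's
    task decomposition. a must be SPD with size divisible by nb."""
    a = a.copy()
    n = a.shape[0]
    assert n % nb == 0
    t = n // nb
    T = lambda i, j: a[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]  # tile view
    for k in range(t):
        # POTRF task: factor the diagonal tile.
        T(k, k)[:] = np.linalg.cholesky(T(k, k))
        for i in range(k + 1, t):
            # TRSM task: A(i,k) <- A(i,k) @ L(k,k)^{-T}
            T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T
        for i in range(k + 1, t):
            for j in range(k + 1, i + 1):
                # SYRK/GEMM tasks: update the trailing tiles.
                T(i, j)[:] -= T(i, k) @ T(j, k).T
    return np.tril(a)

rng = np.random.default_rng(2)
m = rng.standard_normal((8, 8))
spd = m @ m.T + 8 * np.eye(8)  # well-conditioned SPD test matrix
L = tiled_cholesky(spd, nb=2)
```

Because each task touches only a few nb-by-nb tiles, the dependencies between tasks form the DAG shown on the next slide, and independent tasks can run on different cores.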

PLASMA: DAG Scheduling

[Figures: task DAGs for 6x6-tile Cholesky and QR factorizations]

MAGMA

•  Matrix Algebra on GPU and Multicore Architectures.
•  The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current "Multicore+GPU" systems.
•  MAGMA research is based on the idea that, to address the complex challenges of the emerging hybrid environments, optimal software solutions will themselves have to hybridize, combining the strengths of different algorithms within a single framework. Building on this idea, we aim to design linear algebra algorithms and frameworks for hybrid manycore and GPU systems that enable applications to fully exploit the power that each of the hybrid components offers.

Conclusion

•  Development of new libraries: PLASMA, MAGMA.
•  LAPACK is the backbone of these new libraries.
•  Contribution
•  Support