PeachPy: A Python Framework for Developing High-Performance Assembly Kernels

Author: Kellie Harris
PeachPy: A Python Framework for Developing High-Performance Assembly Kernels Marat Dukhan School of Computational Science and Engineering College of Computing Georgia Institute of Technology

Presentation at PyHPC 2013

Marat Dukhan (Georgia Tech)

PeachPy

PyHPC 2013

1 / 43

Outline

1. Motivation
2. PeachPy Foundations
3. PeachPy Basics
4. Assembly Programming Automation
5. Metaprogramming
6. Instruction Streams
7. Experimental Validation
8. Conclusion


The Problem

    dot_product:
        XORPD xmm0, xmm0
        XORPD xmm1, xmm1
        XORPD xmm2, xmm2
        SUB r8, 6
        JAE .restore
        align 16
    .batch:
        MOVUPD xmm3, [rcx]
        MOVUPD xmm7, [rdx]
        MULPD xmm3, xmm7
        ADDPD xmm0, xmm3
        MOVUPD xmm4, [rcx + 16]
        MOVUPD xmm7, [rdx + 16]
        MULPD xmm4, xmm7
        ADDPD xmm1, xmm4
        MOVUPD xmm5, [rcx + 32]
        MOVUPD xmm7, [rdx + 32]
        MULPD xmm5, xmm7
        ADDPD xmm2, xmm5
        ADD rcx, 48
        ADD rdx, 48
        SUB r8, 6
        JAE .batch
    .restore:
        ADD r8, 6
        JZ .reduce
    .remainder:
        MOVSD xmm3, [rcx]
        MOVSD xmm4, [rdx]
        MULSD xmm3, xmm4
        ADDSD xmm0, xmm3
        ADD rcx, 8
        ADD rdx, 8
        SUB r8, 1
        JNZ .remainder
    .reduce:
        ADDPD xmm0, xmm1
        ADDPD xmm0, xmm2
        HADDPD xmm0, xmm0
        RET

The Problem

The same double-precision SSE kernel needs a near-identical single-precision twin (Microsoft x64 calling convention):

    dot_product:
        XORPS xmm0, xmm0
        XORPS xmm1, xmm1
        XORPS xmm2, xmm2
        SUB r8, 12
        JAE .restore
        align 16
    .batch:
        MOVUPS xmm3, [rcx]
        MOVUPS xmm7, [rdx]
        MULPS xmm3, xmm7
        ADDPS xmm0, xmm3
        MOVUPS xmm4, [rcx + 16]
        MOVUPS xmm7, [rdx + 16]
        MULPS xmm4, xmm7
        ADDPS xmm1, xmm4
        MOVUPS xmm5, [rcx + 32]
        MOVUPS xmm7, [rdx + 32]
        MULPS xmm5, xmm7
        ADDPS xmm2, xmm5
        ADD rcx, 48
        ADD rdx, 48
        SUB r8, 12
        JAE .batch
    .restore:
        ADD r8, 12
        JZ .reduce
    .remainder:
        MOVSS xmm3, [rcx]
        MOVSS xmm4, [rdx]
        MULSS xmm3, xmm4
        ADDSS xmm0, xmm3
        ADD rcx, 4
        ADD rdx, 4
        SUB r8, 1
        JNZ .remainder
    .reduce:
        ADDPS xmm0, xmm1
        ADDPS xmm0, xmm2
        HADDPS xmm0, xmm0
        HADDPS xmm0, xmm0
        RET

The Problem

Both SSE kernels must be written again for the System V x86-64 calling convention, where the arguments arrive in rdi, rsi and rdx instead of rcx, rdx and r8.

Double precision, SSE, System V:

    dot_product:
        XORPD xmm0, xmm0
        XORPD xmm1, xmm1
        XORPD xmm2, xmm2
        SUB rdx, 6
        JAE .restore
        align 16
    .batch:
        MOVUPD xmm3, [rdi]
        MOVUPD xmm7, [rsi]
        MULPD xmm3, xmm7
        ADDPD xmm0, xmm3
        MOVUPD xmm4, [rdi + 16]
        MOVUPD xmm7, [rsi + 16]
        MULPD xmm4, xmm7
        ADDPD xmm1, xmm4
        MOVUPD xmm5, [rdi + 32]
        MOVUPD xmm7, [rsi + 32]
        MULPD xmm5, xmm7
        ADDPD xmm2, xmm5
        ADD rdi, 48
        ADD rsi, 48
        SUB rdx, 6
        JAE .batch
    .restore:
        ADD rdx, 6
        JZ .reduce
    .remainder:
        MOVSD xmm3, [rdi]
        MOVSD xmm4, [rsi]
        MULSD xmm3, xmm4
        ADDSD xmm0, xmm3
        ADD rdi, 8
        ADD rsi, 8
        SUB rdx, 1
        JNZ .remainder
    .reduce:
        ADDPD xmm0, xmm1
        ADDPD xmm0, xmm2
        HADDPD xmm0, xmm0
        RET

Single precision, SSE, System V:

    dot_product:
        XORPS xmm0, xmm0
        XORPS xmm1, xmm1
        XORPS xmm2, xmm2
        SUB rdx, 12
        JAE .restore
        align 16
    .batch:
        MOVUPS xmm3, [rdi]
        MOVUPS xmm7, [rsi]
        MULPS xmm3, xmm7
        ADDPS xmm0, xmm3
        MOVUPS xmm4, [rdi + 16]
        MOVUPS xmm7, [rsi + 16]
        MULPS xmm4, xmm7
        ADDPS xmm1, xmm4
        MOVUPS xmm5, [rdi + 32]
        MOVUPS xmm7, [rsi + 32]
        MULPS xmm5, xmm7
        ADDPS xmm2, xmm5
        ADD rdi, 48
        ADD rsi, 48
        SUB rdx, 12
        JAE .batch
    .restore:
        ADD rdx, 12
        JZ .reduce
    .remainder:
        MOVSS xmm3, [rdi]
        MOVSS xmm4, [rsi]
        MULSS xmm3, xmm4
        ADDSS xmm0, xmm3
        ADD rdi, 4
        ADD rsi, 4
        SUB rdx, 1
        JNZ .remainder
    .reduce:
        ADDPS xmm0, xmm1
        ADDPS xmm0, xmm2
        HADDPS xmm0, xmm0
        HADDPS xmm0, xmm0
        RET

The Problem

Targeting AVX doubles the count again: each precision and each calling convention needs its own AVX kernel, for eight near-identical kernels in total.

Double precision, AVX, Microsoft x64:

    dot_product:
        VXORPD ymm0, ymm0
        VXORPD ymm1, ymm1
        VXORPD ymm2, ymm2
        SUB r8, 12
        JAE .restore
        align 16
    .batch:
        VMOVUPD ymm3, [rcx]
        VMOVUPD ymm7, [rdx]
        VMULPD ymm3, ymm7
        VADDPD ymm0, ymm3
        VMOVUPD ymm4, [rcx + 32]
        VMOVUPD ymm7, [rdx + 32]
        VMULPD ymm4, ymm7
        VADDPD ymm1, ymm4
        VMOVUPD ymm5, [rcx + 64]
        VMOVUPD ymm7, [rdx + 64]
        VMULPD ymm5, ymm7
        VADDPD ymm2, ymm5
        ADD rcx, 96
        ADD rdx, 96
        SUB r8, 12
        JAE .batch
    .restore:
        ADD r8, 12
        JZ .reduce
    .remainder:
        VMOVSD xmm3, [rcx]
        VMOVSD xmm4, [rdx]
        VMULSD xmm3, xmm4
        VADDSD ymm0, ymm3
        ADD rcx, 8
        ADD rdx, 8
        SUB r8, 1
        JNZ .remainder
    .reduce:
        VADDPD ymm0, ymm1
        VADDPD ymm0, ymm2
        VEXTRACTF128 xmm1, ymm0, 1
        VADDPD xmm0, xmm1
        VHADDPD xmm0, xmm0
        RET

Single precision, AVX, Microsoft x64:

    dot_product:
        VXORPS ymm0, ymm0
        VXORPS ymm1, ymm1
        VXORPS ymm2, ymm2
        SUB r8, 24
        JAE .restore
        align 16
    .batch:
        VMOVUPS ymm3, [rcx]
        VMOVUPS ymm7, [rdx]
        VMULPS ymm3, ymm7
        VADDPS ymm0, ymm3
        VMOVUPS ymm4, [rcx + 32]
        VMOVUPS ymm7, [rdx + 32]
        VMULPS ymm4, ymm7
        VADDPS ymm1, ymm4
        VMOVUPS ymm5, [rcx + 64]
        VMOVUPS ymm7, [rdx + 64]
        VMULPS ymm5, ymm7
        VADDPS ymm2, ymm5
        ADD rcx, 96
        ADD rdx, 96
        SUB r8, 24
        JAE .batch
    .restore:
        ADD r8, 24
        JZ .reduce
    .remainder:
        VMOVSS xmm3, [rcx]
        VMOVSS xmm4, [rdx]
        VMULSS xmm3, xmm4
        VADDSS ymm0, ymm3
        ADD rcx, 4
        ADD rdx, 4
        SUB r8, 1
        JNZ .remainder
    .reduce:
        VADDPS ymm0, ymm1
        VADDPS ymm0, ymm2
        VEXTRACTF128 xmm1, ymm0, 1
        VADDPS xmm0, xmm1
        VHADDPS xmm0, xmm0
        VHADDPS xmm0, xmm0
        RET

Double precision, AVX, System V:

    dot_product:
        VXORPD ymm0, ymm0
        VXORPD ymm1, ymm1
        VXORPD ymm2, ymm2
        SUB rdx, 12
        JAE .restore
        align 16
    .batch:
        VMOVUPD ymm3, [rdi]
        VMOVUPD ymm7, [rsi]
        VMULPD ymm3, ymm7
        VADDPD ymm0, ymm3
        VMOVUPD ymm4, [rdi + 32]
        VMOVUPD ymm7, [rsi + 32]
        VMULPD ymm4, ymm7
        VADDPD ymm1, ymm4
        VMOVUPD ymm5, [rdi + 64]
        VMOVUPD ymm7, [rsi + 64]
        VMULPD ymm5, ymm7
        VADDPD ymm2, ymm5
        ADD rdi, 96
        ADD rsi, 96
        SUB rdx, 12
        JAE .batch
    .restore:
        ADD rdx, 12
        JZ .reduce
    .remainder:
        VMOVSD xmm3, [rdi]
        VMOVSD xmm4, [rsi]
        VMULSD xmm3, xmm4
        VADDSD ymm0, ymm3
        ADD rdi, 8
        ADD rsi, 8
        SUB rdx, 1
        JNZ .remainder
    .reduce:
        VADDPD ymm0, ymm1
        VADDPD ymm0, ymm2
        VEXTRACTF128 xmm1, ymm0, 1
        VADDPD xmm0, xmm1
        VHADDPD xmm0, xmm0
        RET

Single precision, AVX, System V:

    dot_product:
        VXORPS ymm0, ymm0
        VXORPS ymm1, ymm1
        VXORPS ymm2, ymm2
        SUB rdx, 24
        JAE .restore
        align 16
    .batch:
        VMOVUPS ymm3, [rdi]
        VMOVUPS ymm7, [rsi]
        VMULPS ymm3, ymm7
        VADDPS ymm0, ymm3
        VMOVUPS ymm4, [rdi + 32]
        VMOVUPS ymm7, [rsi + 32]
        VMULPS ymm4, ymm7
        VADDPS ymm1, ymm4
        VMOVUPS ymm5, [rdi + 64]
        VMOVUPS ymm7, [rsi + 64]
        VMULPS ymm5, ymm7
        VADDPS ymm2, ymm5
        ADD rdi, 96
        ADD rsi, 96
        SUB rdx, 24
        JAE .batch
    .restore:
        ADD rdx, 24
        JZ .reduce
    .remainder:
        VMOVSS xmm3, [rdi]
        VMOVSS xmm4, [rsi]
        VMULSS xmm3, xmm4
        VADDSS ymm0, ymm3
        ADD rdi, 4
        ADD rsi, 4
        SUB rdx, 1
        JNZ .remainder
    .reduce:
        VADDPS ymm0, ymm1
        VADDPS ymm0, ymm2
        VEXTRACTF128 xmm1, ymm0, 1
        VADDPS xmm0, xmm1
        VHADDPS xmm0, xmm0
        VHADDPS xmm0, xmm0
        RET

The Research Problem

This research addresses the problem of generating multiple similar assembly kernels:

- Kernels which perform similar operations
  - E.g. vector addition/subtraction
- Kernels which do the same operation on different data types
  - E.g. single- and double-precision dot product
- Kernels which target different microarchitectures or ISAs
  - E.g. dot product for AVX, FMA4, FMA3
- Kernels which use different ABIs
  - E.g. x86-64 on Windows and Linux



Assembly Compilation Process

Let's replace the macro processor with something

- More flexible
- More standardized
- More popular

Python!

Introducing PeachPy

PeachPy is...

- ...an automation and metaprogramming tool for assembly programming
- ...an assembly-like DSL: a PeachPy user is exposed to the same low-level details as an assembly programmer
- ...a Python framework: any PeachPy code is valid Python code

PeachPy is not...

- ...a compiler: PeachPy does not offer high-level programming abstractions
- ...an assembler: PeachPy does not generate machine code

PeachPy Philosophy

- PeachPy is for writing high-performance codes
  - No support for invoke, OOP, and other "high-level assembly"
  - No kernel-mode instructions
  - No system instructions
- PeachPy is for writing assembly codes
  - Not a replacement for a high-level compiler
- All optimizations that are possible in assembly should be possible in PeachPy
- Everything that can be automated in assembly programming should be automated


Minimal PeachPy Function

    import peachpy.c
    from peachpy.x64 import *

    abi = peachpy.c.ABI('x64-sysv')
    assembler = Assembler(abi)
    x_argument = peachpy.c.Parameter("x", peachpy.c.Type("uint32_t"))
    arguments = (x_argument,)
    function_name = "f"
    microarchitecture = "SandyBridge"

    with Function(assembler, function_name, arguments, microarchitecture):
        RETURN()

    print(assembler)

Modules

PeachPy functionality is concentrated in three Python modules:

- peachpy.c for C compatibility classes (C types and ABIs)
- peachpy.x64 for x86-64 assembly classes
- peachpy.arm for ARM assembly classes

Assembly modules are intended to be imported into the program workspace:

    # This will make the syntax of PeachPy codes
    # very similar to native assembly
    from peachpy.x64 import *

Assembler and Function

- Assembler
  - Container for functions
  - Contains only functions with the specified ABI
  - Normally may be saved as an assembly file to disk
- Function
  - Created using with syntax: with Function(...):
  - Creates an active instruction stream
- Microarchitecture
  - String parameter for the Function constructor
  - Restricts the set of supported instructions

Instructions

- All instructions are named in uppercase
- All instructions are Python objects
- When an instruction is called (as a Python object), it is generated and added to the active PeachPy function (as an assembly instruction)
- PeachPy verifies the correctness of instruction operands
- Most computational x86-64 and many ARM instructions are supported

Traditional Assembly

    .loop:
        ADDPD xmm0, [rsi]
        ADD rsi, 16
        SUB rcx, 2
        JAE .loop

PeachPy

    LABEL( "loop" )
    ADDPD( xmm0, [rsi] )
    ADD( rsi, 16 )
    SUB( rcx, 2 )
    JAE( "loop" )

Registers

PeachPy maps different types of architectural registers onto Python classes.

x86 register classes:

- GeneralPurposeRegister (base class)
- GeneralPurposeRegister8
- GeneralPurposeRegister16
- GeneralPurposeRegister32
- GeneralPurposeRegister64
- MMXRegister
- SSERegister
- AVXRegister

ARM register classes:

- GeneralPurposeRegister
- SRegister
- DRegister
- QRegister

Registers

- Architectural registers are represented in PeachPy as Python objects
- All register names are in lowercase

Traditional x86-64 Assembly

    MOVZX rax, al
    PADD mm0, mm1
    ADDPS xmm0, xmm1
    VMULPD ymm0, ymm1, ymm2

PeachPy

    MOVZX( rax, al )
    PADD( mm0, mm1 )
    ADDPS( xmm0, xmm1 )
    VMULPD( ymm0, ymm1, ymm2 )

Traditional ARM Assembly

    ADD r0, r0, r1
    VLD1.32 {d0[]}, [r2]
    VFMA.F32 q2, q1, q1

PeachPy

    ADD( r0, r0, r1 )
    VLD1.F32( (d0[:],), [r2] )
    VFMA.F32( q2, q1, q1 )


Register Allocation

Traditional x86-64 Assembly

    VMOVAPD ymm0, [rsi]
    VMOVAPD ymm1, ymm0
    VFMADD132PD ymm1, ymm13, ymm12
    VFMADD231PD ymm0, ymm1, ymm14
    VFMADD231PD ymm0, ymm1, ymm15

PeachPy

    ymm_x = AVXRegister()
    VMOVAPD( ymm_x, [xPointer] )
    ymm_t = AVXRegister()
    VMOVAPD( ymm_t, ymm_x )
    VFMADD132PD( ymm_t, ymm_t, ymm_log2e, ymm_magic_bias )
    VFMADD231PD( ymm_x, ymm_t, ymm_minus_ln2_hi, ymm_x )
    VFMADD231PD( ymm_x, ymm_t, ymm_minus_ln2_lo, ymm_x )
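To make the idea of virtual registers concrete, here is a deliberately naive sketch that binds symbolic register names to physical ones in first-use order. The `allocate` and `lower` helpers and the `v_` naming convention are assumptions made for this illustration. The slides do not describe PeachPy's allocation algorithm, and this sketch does no liveness analysis, register reuse, or spilling.

```python
def allocate(instructions, physical_registers):
    """Bind each virtual register (name starting with 'v_') to a physical one."""
    mapping = {}
    free = list(physical_registers)
    for _mnemonic, operands in instructions:
        for operand in operands:
            if operand.startswith("v_") and operand not in mapping:
                if not free:
                    raise RuntimeError("ran out of physical registers")
                mapping[operand] = free.pop(0)
    return mapping


def lower(instructions, mapping):
    """Rewrite the instruction list using the physical register names."""
    return [
        mnemonic + " " + ", ".join(mapping.get(op, op) for op in operands)
        for mnemonic, operands in instructions
    ]


program = [
    ("VMOVAPD", ["v_x", "[rsi]"]),
    ("VMOVAPD", ["v_t", "v_x"]),
    ("VMULPD", ["v_t", "v_t", "v_x"]),
]
mapping = allocate(program, ["ymm0", "ymm1", "ymm2"])
lowered = lower(program, mapping)
print("\n".join(lowered))
```

The benefit shown on the slide is the same as here: the kernel is written against meaningful names, and the choice of ymm0 or ymm1 is left to the tool.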

In-place Memory Constant Declarations

Traditional x86-64 Assembly

Right here:

    section .rdata rdata
    c0 dq 3.141592, 3.141592

In a galaxy far far away:

    section .text code
    MULPD xmm0, [c0]

PeachPy

    MULPD( xmm_x, Constant.float64x2(3.141592) )
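One way to picture in-place constants is a pool that hands out a memory operand at the point of use and later renders the data section. The `ConstantPool` class and its label scheme below are hypothetical and only illustrate the bookkeeping; they are not taken from PeachPy.

```python
class ConstantPool:
    """Collects constants declared at their point of use (illustrative only)."""

    def __init__(self):
        self.labels = {}  # tuple of values -> label, in insertion order

    def float64x2(self, value):
        # Two copies of the value fill one 128-bit SSE register.
        key = (float(value), float(value))
        if key not in self.labels:
            self.labels[key] = "c%d" % len(self.labels)
        return "[%s]" % self.labels[key]

    def render(self):
        lines = ["section .rdata"]
        for key, label in self.labels.items():
            lines.append("%s dq %s" % (label, ", ".join(repr(v) for v in key)))
        return lines


pool = ConstantPool()
code = ["MULPD xmm0, " + pool.float64x2(3.141592)]
data = pool.render()
print("\n".join(data + ["section .text"] + code))
```

Requesting the same constant twice returns the same label, so the data section holds each value once.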

Hexadecimal Floating-Point Constants

- Hexadecimal floating-point constants provide an accurate and portable way to specify floating-point constants without rounding errors
- Required by the C99 standard
- Supported by gcc, clang, icc, xlc, and NASM
- But not supported by GNU Assembler
- PeachPy lets programmers use hexadecimal floating-point constants on all supported platforms

The examples below use the constant 0x1.71547652B82FEp+0, which is log2(e) (about 1.4427), so it is named log2e here.

C99

    const double log2e = 0x1.71547652B82FEp+0;

ARM Assembly (GNU)

    log2e: .quad 0x3FF71547652B82FE

x86-64 Assembly (NASM)

    log2e dq 0x1.71547652B82FEp+0

PeachPy (x86-64 and ARM)

    log2e = Constant.float64('0x1.71547652B82FEp+0')
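The relationship between the hexadecimal literal and the raw 64-bit pattern used in the GNU Assembler listing can be checked in plain Python with the standard `float.fromhex` and `struct` facilities. This is only a verification sketch and does not use PeachPy.

```python
import math
import struct

# Parse the hexadecimal floating-point literal exactly (no decimal rounding).
value = float.fromhex("0x1.71547652B82FEp+0")

# Reinterpret the IEEE 754 double as a 64-bit unsigned integer.
bits = struct.unpack("<Q", struct.pack("<d", value))[0]

print("value = %r" % value)
print("bits  = 0x%016X" % bits)
print("close to log2(e): %s" % math.isclose(value, 1.0 / math.log(2.0)))
```

The integer printed for `bits` is what has to be written by hand for an assembler that lacks hexadecimal floating-point literals.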

Calling conventions: The Problem

Consider assembly implementations of the C function

    uint64_t add(uint64_t x, uint64_t y) {
        return x + y;
    }

Assembly for the Microsoft x86-64 calling convention

    add:
        LEA rax, [rcx + rdx * 1]
        RET

Assembly for the System V x86-64 calling convention

    add:
        LEA rax, [rdi + rsi * 1]
        RET

Calling conventions: PeachPy Approach

PeachPy code

    import peachpy.c
    from peachpy.x64 import *

    asm = Assembler(peachpy.c.ABI("x64-ms"))  # or "x64-sysv"
    x_arg = peachpy.c.Parameter("x", peachpy.c.Type("uint64_t"))
    y_arg = peachpy.c.Parameter("y", peachpy.c.Type("uint64_t"))

    with Function(asm, "add", (x_arg, y_arg), "Bobcat"):
        x = GeneralPurposeRegister64()
        LOAD.PARAMETER( x, x_arg )  # Does the magic!
        y = GeneralPurposeRegister64()
        LOAD.PARAMETER( y, y_arg )  # Does the magic!
        LEA( rax, [x + y * 1] )
        RETURN()
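The "magic" amounts to looking up where each ABI places its integer arguments. The sketch below shows that lookup in plain Python. The `ARGUMENT_REGISTERS` table and `generate_add` function are names invented for this illustration; the register orders themselves are those of the Microsoft x64 and System V AMD64 calling conventions.

```python
# Integer/pointer argument registers, in order, for each calling convention.
ARGUMENT_REGISTERS = {
    "x64-ms": ["rcx", "rdx", "r8", "r9"],
    "x64-sysv": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],
}


def generate_add(abi):
    """Emit the two-argument add kernel for the requested ABI."""
    x, y = ARGUMENT_REGISTERS[abi][:2]
    return [
        "add:",
        "    LEA rax, [%s + %s * 1]" % (x, y),
        "    RET",
    ]


for abi in ("x64-ms", "x64-sysv"):
    print("; " + abi)
    print("\n".join(generate_add(abi)))
```

A single source therefore yields both listings from the previous slide; only the table entry consulted differs.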

ISA-based runtime dispatching

- PeachPy knows the instruction set of each instruction
- PeachPy also collects ISA information about each function
- This helps to do fine-grained runtime dispatching
  - More efficient than recompiling the function for each ISA with a high-level compiler
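A minimal sketch of how per-instruction ISA knowledge could drive dispatching: each function's requirement is the union of the ISA extensions of its instructions, and the dispatcher picks the first variant the CPU supports. The `INSTRUCTION_ISA` table, `required_isa` and `dispatch` are hypothetical names for this illustration, and the table lists only four instructions.

```python
# ISA extension that introduced each instruction (small illustrative subset).
INSTRUCTION_ISA = {
    "ADDPS": "SSE",
    "HADDPS": "SSE3",
    "VADDPS": "AVX",
    "VFMADD231PS": "FMA3",
}


def required_isa(instructions):
    """Set of ISA extensions a function needs, derived from its instructions."""
    return frozenset(INSTRUCTION_ISA[name] for name in instructions)


def dispatch(variants, cpu_features):
    """Return the first variant whose requirements the CPU satisfies.

    variants is a list of (name, instruction list), ordered best first.
    """
    for name, instructions in variants:
        if required_isa(instructions) <= cpu_features:
            return name
    raise RuntimeError("no variant is supported on this CPU")


variants = [
    ("dot_fma3", ["VFMADD231PS", "VADDPS"]),
    ("dot_avx", ["VADDPS"]),
    ("dot_sse", ["ADDPS"]),
]
print(dispatch(variants, {"SSE", "SSE3", "AVX"}))
```

Because the requirement is computed per function rather than per build, a library can mix variants with different requirements in one binary.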


Parametrized Unroll

    with Function(asm, "dot_product", args, "SandyBridge"):
        xPointer, yPointer, zPointer, length = LOAD.PARAMETERS()
        reg_size = 32
        reg_elements = 8
        unroll_regs = 8
        acc = [AVXRegister() for _ in range(unroll_regs)]
        temp = [AVXRegister() for _ in range(unroll_regs)]
        ...
        LABEL( "process_batch" )
        for i in range(unroll_regs):
            VMOVAPS( temp[i], [xPointer + i * reg_size] )
            VMULPS( temp[i], [yPointer + i * reg_size] )
            VADDPS( acc[i], temp[i] )
        ADD( xPointer, reg_size * unroll_regs )
        ADD( yPointer, reg_size * unroll_regs )
        SUB( length, reg_elements * unroll_regs )
        JAE( "process_batch" )
        ...
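The effect of the unroll parameter is easiest to see by generating the text of the loop body directly. The sketch below does this in plain Python without PeachPy. The function name `unrolled_batch`, the fixed System V argument registers (rdi, rsi, rdx) and the manual register numbering are assumptions made for the illustration.

```python
def unrolled_batch(unroll_regs, reg_size=32, reg_elements=8):
    """Emit the text of one batch-loop body, unrolled unroll_regs times."""
    lines = ["process_batch:"]
    for i in range(unroll_regs):
        acc = "ymm%d" % i                   # accumulators: ymm0 .. ymm(n-1)
        temp = "ymm%d" % (unroll_regs + i)  # temporaries: ymm(n) .. ymm(2n-1)
        offset = i * reg_size
        lines.append("VMOVAPS %s, [rdi + %d]" % (temp, offset))
        lines.append("VMULPS %s, %s, [rsi + %d]" % (temp, temp, offset))
        lines.append("VADDPS %s, %s, %s" % (acc, acc, temp))
    lines.append("ADD rdi, %d" % (reg_size * unroll_regs))
    lines.append("ADD rsi, %d" % (reg_size * unroll_regs))
    lines.append("SUB rdx, %d" % (reg_elements * unroll_regs))
    lines.append("JAE process_batch")
    return lines


body = unrolled_batch(2)
print("\n".join(body))
```

Changing the single argument from 2 to 8 regenerates every offset and pointer increment consistently, which is the tedious part when the loop is unrolled by hand.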

Parametrization by Element Type

    reg_size = 32
    reg_elements = reg_size // element_size
    unroll_regs = 8
    SIMD_LOAD = {4: VMOVAPS, 8: VMOVAPD}[element_size]
    SIMD_MUL = {4: VMULPS, 8: VMULPD}[element_size]
    SIMD_ADD = {4: VADDPS, 8: VADDPD}[element_size]
    ...
    LABEL( "process_batch" )
    for i in range(unroll_regs):
        SIMD_LOAD( temp[i], [xPointer + i * reg_size] )
        SIMD_MUL( temp[i], [yPointer + i * reg_size] )
        SIMD_ADD( acc[i], temp[i] )
    ADD( xPointer, reg_size * unroll_regs )
    ADD( yPointer, reg_size * unroll_regs )
    SUB( length, reg_elements * unroll_regs )
    JAE( "process_batch" )
    ...
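The dictionary-lookup idiom on this slide can be tried without PeachPy by selecting mnemonic strings instead of instruction objects. `MNEMONICS` and `kernel_parameters` are hypothetical names for this sketch. Note the integer division: the element count per register must stay an integer.

```python
# Mnemonics keyed by element size in bytes (4 = float, 8 = double).
MNEMONICS = {
    4: {"load": "VMOVAPS", "mul": "VMULPS", "add": "VADDPS"},
    8: {"load": "VMOVAPD", "mul": "VMULPD", "add": "VADDPD"},
}


def kernel_parameters(element_size, reg_size=32):
    """Return the mnemonic set and the number of elements per SIMD register."""
    ops = MNEMONICS[element_size]
    reg_elements = reg_size // element_size  # integer division on purpose
    return ops, reg_elements


for size in (4, 8):
    ops, reg_elements = kernel_parameters(size)
    print(size, ops["mul"], reg_elements)
```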

Supporting Multiple Instruction Sets

    if Target.has_fma():
        VMLAPS = VFMADDPS if Target.has_fma4() else VFMADD231PS
    else:
        def VMLAPS(x, a, b, c):
            t = AVXRegister()
            VMULPS( t, a, b )
            VADDPS( x, t, c )
    ...
    LABEL( "processBatch" )
    for i in range(unroll_regs):
        VMOVAPS( temp[i], [xPointer + i * reg_size] )
        VMLAPS( acc[i], temp[i], [yPointer + i * reg_size], acc[i] )
    ADD( xPointer, reg_size * unroll_regs )
    ADD( yPointer, reg_size * unroll_regs )
    SUB( length, reg_elements * unroll_regs )
    JAE( "processBatch" )
    ...
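The same selection logic can be sketched as a plain function that returns the native instruction text for acc = a * b + acc on each target. The function `multiply_accumulate` and the feature-set strings are assumptions for this illustration. The operand orders follow the native encodings: four operands for FMA4's VFMADDPS and three for FMA3's VFMADD231PS.

```python
def multiply_accumulate(acc, a, b, temp, features):
    """Emit instructions computing acc = a * b + acc for the given ISA features."""
    if "FMA4" in features:
        # FMA4: destination = source1 * source2 + source3
        return ["VFMADDPS %s, %s, %s, %s" % (acc, a, b, acc)]
    if "FMA3" in features:
        # FMA3 '231' form: destination = source2 * source3 + destination
        return ["VFMADD231PS %s, %s, %s" % (acc, a, b)]
    # Plain AVX fallback: separate multiply and add through a temporary.
    return [
        "VMULPS %s, %s, %s" % (temp, a, b),
        "VADDPS %s, %s, %s" % (acc, acc, temp),
    ]


for features in ({"AVX", "FMA4"}, {"AVX", "FMA3"}, {"AVX"}):
    print(sorted(features), multiply_accumulate("ymm0", "ymm8", "[rsi]", "ymm9", features))
```

The loop body that calls the helper stays identical across targets, which is the point of defining a target-specific VMLAPS on the slide.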


Introduction

- Usually we want assembly instructions to appear in the same order as we write them.
- To increase IPC it is useful to interleave two parts of code that use different types of instructions. However, it might be convenient to write the code of those parts separately.
  - ARM Cortex-A9 can decode one SIMD instruction and one scalar instruction per cycle. By interleaving SIMD and scalar processing we can achieve higher performance.
  - On x86 we may use scalar instructions to detect special cases while SIMD units are busy doing calculations.
- For software pipelining we may want to skew the sequences of similar instructions relative to each other.
  - But we don't want to skew our code

Instruction Stream Objects

The Python with statement can be used to redirect generated instructions to an InstructionStream object.

    scalar_stream = InstructionStream()
    with scalar_stream:
        x = GeneralPurposeRegister64()
        MOV( x, [xPointer] )
        CMP.JA( x, threshold, "above_threshold" )
    vector_stream = InstructionStream()
    with vector_stream:
        ...

Instructions from an instruction stream can then be re-issued to the current instruction stream:

    while scalar_stream or vector_stream:
        scalar_stream.issue()
        vector_stream.issue()
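The interleaving loop can be tried out with a small mock. The `InstructionStream` class below is a hypothetical stand-in that stores instruction text in a list. Unlike the slide's `issue()`, it takes the output list as an explicit argument, because this sketch has no notion of a "current" stream.

```python
class InstructionStream:
    """Minimal stand-in: a queue of instruction strings (not PeachPy's class)."""

    def __init__(self, instructions=()):
        self.pending = list(instructions)

    def __bool__(self):
        # A stream is truthy while it still has instructions to issue.
        return bool(self.pending)

    def issue(self, output):
        # Move at most one instruction to the output; do nothing when empty.
        if self.pending:
            output.append(self.pending.pop(0))


scalar_stream = InstructionStream(["MOV rax, [rdi]", "CMP rax, rbx"])
vector_stream = InstructionStream([
    "VMOVAPS ymm0, [rsi]",
    "VMULPS ymm0, ymm0, ymm1",
    "VADDPS ymm2, ymm2, ymm0",
])

merged = []
while scalar_stream or vector_stream:
    scalar_stream.issue(merged)
    vector_stream.issue(merged)

print("\n".join(merged))
```

The two streams are written as separate, readable blocks, yet the merged output alternates scalar and vector instructions until the shorter stream runs out.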

Software Pipelining

Instruction streams are useful for implementing software pipelining.

    instruction_columns = [InstructionStream(), InstructionStream(), InstructionStream()]
    for i in range(unroll_regs):
        with instruction_columns[0]:
            VMOVDQU( ymm_x[i], [xPointer + i * reg_size] )
        with instruction_columns[1]:
            VPADDD( ymm_x[i], ymm_y )
        with instruction_columns[2]:
            VMOVDQU( [zPointer + i * reg_size], ymm_x[i] )
    with instruction_columns[0]:
        ADD( xPointer, reg_size * unroll_regs )
    with instruction_columns[2]:
        ADD( zPointer, reg_size * unroll_regs )
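One possible way to merge such columns into a skewed schedule is sketched below on plain lists of strings. The function `software_pipeline` and the short instruction labels are assumptions for this illustration. The slides do not show how PeachPy itself combines the columns.

```python
def software_pipeline(columns):
    """Skew stage columns against each other.

    columns[s][i] is the stage-s instruction of iteration i. At each step,
    stage s works on iteration (step - s), so later stages lag behind.
    """
    stages = len(columns)
    iterations = len(columns[0])
    schedule = []
    for step in range(iterations + stages - 1):
        for stage in range(stages):
            i = step - stage
            if 0 <= i < iterations:
                schedule.append(columns[stage][i])
    return schedule


loads = ["load0", "load1", "load2"]
adds = ["add0", "add1", "add2"]
stores = ["store0", "store1", "store2"]
schedule = software_pipeline([loads, adds, stores])
print(schedule)
```

Each iteration still executes load, add and store in order, but the load of a later iteration is placed ahead of the add and store of earlier ones, giving independent instructions room to overlap.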


Is It Worth The Effort?

Although PeachPy simplifies developing assembly kernels, PeachPy is still assembly, with all its drawbacks. For many HPC scientists, C code with compiler intrinsics is a viable alternative to writing assembly:

- C code with intrinsics is more portable than assembly
- Many of the problems targeted by PeachPy become irrelevant (e.g. calling conventions)
- A compiler could take into account more processor details than humans

We did a simple experiment to check whether PeachPy (and assembly in general) can deliver better performance than optimizing compilers.

Experimental Setup

- For the experiment we used branchless versions of the vector logarithm and exponential functions from the Yeppp! library. These are high-performance implementations originally developed and tuned using PeachPy.
- We converted the assembly instructions one-to-one to C++ intrinsics and compiled with modern C++ compilers.
- The C++ code is a nearly ideal input for a compiler:
  - Code is already vectorized with intrinsics.
  - Each function processes 40 elements and has only one branch.
  - The only parts left to the compilers are register allocation and instruction scheduling.
  - Initial instruction scheduling is close to optimal.
  - A lot of room for improving instruction scheduling: the original version contains 581 instructions for the log function and 400 instructions for exp.
- The produced codes are benchmarked on an Intel Core i7-4770K processor with the recent Haswell microarchitecture.

Benchmarking Results

[Figure: benchmarking results chart; not preserved in this transcript.]


Plans and Goals

The current priorities for PeachPy are:

- Support for PowerPC (including Blue Gene/Q) and Xeon Phi architectures
- Distribute PeachPy via the Python Package Index
- Enable generation of machine code directly from PeachPy
- Provide additional features for x86-64 and ARM architectures (e.g. table lookups)
- ARM64 and x86-32 ports

In the long term we hope that PeachPy will replace conventional assembly in the HPC workflow.

Public Availability

- The PeachPy repository is hosted on bitbucket.org/MDukhan/peachpy
- The primary user of PeachPy is the Yeppp! library (www.yeppp.info).
  - The codegen directory in the Yeppp! source tree contains a large number of PeachPy codes.

Funding

This research was supported in part by:

- National Science Foundation (NSF) under NSF CAREER award number 0953100.
- A grant from the Defense Advanced Research Projects Agency (DARPA) Computer Science Study Group program.

Disclaimer

Any opinions, conclusions or recommendations expressed in this presentation are those of the authors and do not necessarily reflect those of NSF or DARPA.
