ENGINEERS AND DEVICES WORKING TOGETHER

Author: Buddy Bruce

Runtime and compiler per Android release:

K (Android 4.4): Dalvik + JIT compiler
L (Android 5.0): ART + AOT compiler
M (Android 6.0): ART + AOT compiler
N (Android 7.0): ART + JIT/AOT compiler
O (Android 8.0): ART + JIT/AOT compiler + vectorization


All modern general-purpose CPUs support small-scale SIMD instructions (typically between 64-bit and 512-bit). A SIMD instruction performs a single operation on multiple operands in parallel; a 128-bit register holds, for example, 4x32-bit lanes, so one instruction performs 4x32-bit operations.

ARM: NEON Technology (128-bit)
Intel: SSE* (128-bit), AVX* (256-bit, 512-bit)
MIPS: MSA (128-bit)
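The "single operation on multiple packed lanes" idea can be illustrated in plain Java with a SIMD-within-a-register (SWAR) trick. This is a hypothetical helper for illustration only, not ART or NEON code: it adds four unsigned 8-bit lanes packed into one 32-bit int, keeping carries from crossing lane boundaries.

```java
// SWAR sketch (hypothetical, illustration only): four 8-bit lanes packed
// in one int, added lane-wise so carries never cross lane borders.
public class SwarDemo {
    static int addLanes(int x, int y) {
        int high = 0x80808080;               // top bit of each 8-bit lane
        int sum = (x & ~high) + (y & ~high); // add the low 7 bits per lane
        return sum ^ ((x ^ y) & high);       // patch the top bit per lane
    }

    public static void main(String[] args) {
        // 0x01020304 + 0x01010101 lane-wise -> 0x02030405
        System.out.printf("%08x%n", addLanes(0x01020304, 0x01010101));
    }
}
```

A real SIMD unit does the same thing in hardware, with wider registers and no masking gymnastics.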

Many vectorizing compilers were developed by supercomputer vendors. Intel introduced the first vectorizing compiler for SSE in 1999. Since the Android O release, the optimizing compiler of ART has joined the family of vectorizing compilers (www.aartbik.com).


for (int i = 0; i < 256; i++) {
  a[i] = b[i] + 1;
}

->

for (int i = 0; i < 256; i += 4) {
  a[i:i+3] = b[i:i+3] + [1,1,1,1];
}

A class hierarchy of general vector operations that is sufficiently powerful to represent SIMD operations common to all architectures:

VectorOperation (has vector length, has packed data type)
├── VectorBinOp
│   ├── VectorAdd
│   ├── VectorSub
│   └── ...
└── VectorMemOp (has alignment)
    ├── VectorLoad
    └── VectorStore
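A minimal Java sketch of such a hierarchy, following the names in the diagram above; this is illustrative only, not ART's actual IR classes.

```java
// Illustrative sketch of the vector-operation hierarchy; not ART's real IR.
abstract class VectorOperation {
    final int vectorLength;     // number of lanes
    final Class<?> packedType;  // packed data type, e.g. int.class
    VectorOperation(int vectorLength, Class<?> packedType) {
        this.vectorLength = vectorLength;
        this.packedType = packedType;
    }
}

abstract class VectorBinOp extends VectorOperation {
    VectorBinOp(int len, Class<?> type) { super(len, type); }
}

class VectorAdd extends VectorBinOp {
    VectorAdd(int len, Class<?> type) { super(len, type); }
}

class VectorSub extends VectorBinOp {
    VectorSub(int len, Class<?> type) { super(len, type); }
}

abstract class VectorMemOp extends VectorOperation {
    final int alignment;        // memory alignment in bytes
    VectorMemOp(int len, Class<?> type, int alignment) {
        super(len, type);
        this.alignment = alignment;
    }
}

class VectorLoad extends VectorMemOp {
    VectorLoad(int len, Class<?> type, int align) { super(len, type, align); }
}

class VectorStore extends VectorMemOp {
    VectorStore(int len, Class<?> type, int align) { super(len, type, align); }
}
```

Keeping lane count and packed type on the base class lets each backend lower the same node to NEON, SSE, or MSA encodings.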

for (int i = 0; i < 256; i += 4) {
  a[i:i+3] = b[i:i+3] + [1,1,1,1];
}

->

t = [1,1,1,1];
for (int i = 0; i < 256; i += 8) {
  a[i :i+3]  = b[i :i+3]  + t;
  a[i+4:i+7] = b[i+4:i+7] + t;
}

Generated ARM64 code:

  movi v0.4s, #0x1, lsl #0
  mov  w3, #0xc
  mov  w0, #0x0
Loop:
  cmp  w0, #0x100 (256)
  b.hs Exit
  add  w4, w0, #0x4 (4)
  add  w0, w3, w0, lsl #2
  add  w5, w3, w4, lsl #2
  ldr  q1, [x2, x0]
  add  v1.4s, v1.4s, v0.4s
  str  q1, [x1, x0]
  ldr  q1, [x2, x5]
  add  v1.4s, v1.4s, v0.4s
  str  q1, [x1, x5]
  add  w0, w4, #0x4 (4)
  ldrh w16, [tr]            ; suspend check
  cbz  w16, Loop
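The unroll-by-two step can also be written out as scalar Java; this is a sketch of what the transform does conceptually, using a hypothetical add-one loop rather than ART's internal representation.

```java
// Hypothetical scalar illustration of 2x loop unrolling: each iteration now
// does the work of two, halving loop overhead (compare, branch, suspend check).
public class UnrollDemo {
    static void addOneUnrolled(int[] a, int[] b) {
        for (int i = 0; i < 256; i += 2) {
            a[i]     = b[i]     + 1;   // first unrolled copy
            a[i + 1] = b[i + 1] + 1;   // second unrolled copy
        }
    }
}
```

In the vectorized version above, each "copy" is itself a 4-lane SIMD add, so one trip through the loop handles 8 elements.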

VecReplicateScalar(x) per backend:

ARM64:   dup v0.4s, w2
x86-64:  movq xmm0, rdx
         pshufd xmm0, xmm0, 0
MIPS64:  fill.w w0, a2

/**
 * Cross-fade byte arrays x1 and x2 into byte array x_out.
 */
private static void avg(byte[] x_out, byte[] x1, byte[] x2) {
  // Compute minimum length of the three byte arrays.
  int min = Math.min(x_out.length, Math.min(x1.length, x2.length));
  // Morph with rounding halving add (unsigned).
  for (int i = 0; i < min; i++) {
    x_out[i] = (byte) (((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1);
  }
}

SEQUENTIAL (ARMv8 AArch64):

L: cmp   w5, w0
   b.hs  Exit
   add   w4, w2, #0xc (12)
   add   w6, w3, #0xc (12)
   ldrsb w4, [x4, x5]
   ldrsb w6, [x6, x5]
   and   w4, w4, #0xff
   and   w6, w6, #0xff
   add   w4, w4, w6
   add   w6, w1, #0xc (12)
   add   w4, w4, #0x1 (1)
   asr   w4, w4, #1
   strb  w4, [x6, x5]
   add   w5, w5, #0x1 (1)
   ldrh  w16, [tr]        ; suspend check
   cbz   w16, L

SIMD (ARMv8 AArch64 + NEON Technology):

L: cmp    w5, w4
   b.hs   Exit
   add    w16, w2, w5
   ldur   q0, [x16, #12]
   add    w16, w3, w5
   ldur   q1, [x16, #12]
   urhadd v0.16b, v0.16b, v1.16b
   add    w16, w1, w5
   stur   q0, [x16, #12]
   add    w5, w5, #0x10 (16)
   ldrh   w16, [tr]       ; suspend check
   cbz    w16, L

Runs about 10x faster!

Sequential performance ≈20fps

SIMD performance (NEON 128-bit) ≈60fps
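The per-element formula in avg() can be checked in isolation. This sketch extracts it as a standalone helper; it matches the unsigned rounding halving add that the NEON urhadd instruction applies to each of its 16 byte lanes.

```java
// Per-element rounding halving add, as used in avg() above:
// treat bytes as unsigned 0..255, add with +1 rounding, halve.
public class RhaddDemo {
    static byte rhadd(byte a, byte b) {
        return (byte) (((a & 0xff) + (b & 0xff) + 1) >> 1);
    }
}
```

The & 0xff masks are what let the vectorizer prove the arithmetic fits in unsigned 8-bit lanes, enabling the 16-wide urhadd.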


Java code:

void mul_add(int[] a, int[] b) {
  for (int i = 0; i < 512; i++) {
    a[i] += a[i] * b[i];
  }
}

Autovectorization result:

L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.2s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.2s}, [x16]
   mul  v1.2s, v0.2s, v1.2s
   add  v0.2s, v0.2s, v1.2s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v0.2s}, [x16]
   add  w0, w0, #0x2
   ldrh w16, [tr]
   cbz  w16, L
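For reference when reading the assembly, here is a runnable form of the loop (with the length generalized from the fixed 512 used on the slide):

```java
// Scalar reference semantics that the vectorized code must preserve:
// a[i] += a[i] * b[i], i.e. each element becomes a[i] * (1 + b[i]).
public class MulAddDemo {
    static void mulAdd(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] += a[i] * b[i];
        }
    }
}
```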



Before:

L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.2s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.2s}, [x16]
   mul  v1.2s, v0.2s, v1.2s
   add  v0.2s, v0.2s, v1.2s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v0.2s}, [x16]
   add  w0, w0, #0x2
   ldrh w16, [tr]
   cbz  w16, L

After (68% perf boost):

L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.4s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.4s}, [x16]
   mul  v1.4s, v0.4s, v1.4s
   add  v0.4s, v0.4s, v1.4s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v0.4s}, [x16]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L





Before:

L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.4s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.4s}, [x16]
   mul  v1.4s, v0.4s, v1.4s
   add  v0.4s, v0.4s, v1.4s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v0.4s}, [x16]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L

After (11% perf boost):

L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.4s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.4s}, [x16]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v2.4s}, [x16]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L


Before:

L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.4s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.4s}, [x16]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v2.4s}, [x16]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L

After (23% perf boost):

L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, w0, lsl #2
   ldur q0, [x16, #12]
   add  w16, w2, w0, lsl #2
   ldur q1, [x16, #12]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   add  w16, w1, w0, lsl #2
   stur q2, [x16, #12]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L


Before:

L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, w0, lsl #2
   ldur q0, [x16, #12]
   add  w16, w2, w0, lsl #2
   ldur q1, [x16, #12]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   add  w16, w1, w0, lsl #2
   stur q2, [x16, #12]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L

After (10% perf boost):

   mov  w3, #0xc
L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L




Before:

L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L

After (2.5% perf boost):

L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L






Before:

L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L

After (12% perf boost):

L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w0, #0x4
   add  w0, w3, w0, lsl #2
   add  w5, w3, w4, lsl #2
   ldr  q0, [x1, x0]
   ldr  q1, [x2, x0]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x0]
   ldr  q0, [x1, x5]
   ldr  q1, [x2, x5]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x5]
   add  w0, w4, #0x4
   ldrh w16, [tr]
   cbz  w16, L




for (int i = 0; i < LENGTH; i++) {
  c[i] = (byte) (a[i] + b[i]);
}
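A runnable form of this loop; note that Java's narrowing (byte) cast wraps modulo 256, which is exactly the behavior of a packed 8-bit SIMD add per lane, so the vectorizer can map it directly.

```java
// Byte-wise add with Java's wrap-around narrowing cast.
public class ByteAddDemo {
    static void add(byte[] c, byte[] a, byte[] b) {
        for (int i = 0; i < c.length; i++) {
            c[i] = (byte) (a[i] + b[i]);  // wraps mod 256, like an 8-bit SIMD lane
        }
    }
}
```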





HIR for the vectorized loop body:

i87:  Add [i80,i79]
i102: IntermediateAddressIndex [i87,i98,i3]
i99:  IntermediateAddressIndex [i80,i98,i3]
d89:  VecLoad [l35,i102]
d84:  VecLoad [l35,i99]
d83:  VecLoad [l29,i99]
d88:  VecLoad [l29,i102]
d85:  VecAdd [d83,d84]
d90:  VecAdd [d88,d89]
d86:  VecStore [l27,i99,d85]
d91:  VecStore [l27,i102,d90]
i92:  Add [i87,i79]
v78:  Goto

Java code:

static final int LENGTH = 1024 * 256;  // 256K elements, 0x40000
static byte[] a = new byte[LENGTH];
static byte[] b = new byte[LENGTH];
static byte[] c = new byte[LENGTH];

(gdb) x/64u 0xefc0b000
0xefc0b000:   0  28 192  18   0   0   0   0    <- Object Header
0xefc0b008:   0   0   4   0 100 101 102 103    <- array length 0x40000, then data[0] at offset 12
0xefc0b010: 104 105 106 107 108 109 110 111    <- SIMD from here
0xefc0b018: 112 113 114 115 116 117 118 119
0xefc0b020: 120 121 122 123 124 125 126 127
0xefc0b028: 128 129 130 131 132 133 134 135
0xefc0b030: 136 137 138 139 140 141 142 143
0xefc0b038: 144 145 146 147 148 149 150 151

One VecLoad / VecStore covers 16 bytes, but data[0] starts at the unaligned offset 12: avoid SIMD from there.


Analyzable and flexible: CHECKED!
Embeddable: CHECKED!
Stable and reproducible: CHECKED!
Recognized: CHECKED!


LDR q1, [x16] + LDR q2, [x16, #16] -> LDP q1, q2, [x16]


Java:

void mul_add(int[] a, int[] b, int[] c) { for (int i=0; i