ENGINEERS AND DEVICES WORKING TOGETHER
K (Android 4.4): Dalvik + JIT compiler
L (Android 5.0): ART + AOT compiler
M (Android 6.0): ART + AOT compiler
N (Android 7.0): ART + JIT/AOT compiler
O (Android 8.0): ART + JIT/AOT compiler + vectorization
All modern general-purpose CPUs support small-scale SIMD instructions (typically between 64-bit and 512-bit). A SIMD instruction performs a single operation on multiple operands in parallel, e.g. 4x32-bit operations at once.
ARM: NEON Technology (128-bit)
Intel: SSE* (128-bit), AVX* (256-bit, 512-bit)
MIPS: MSA (128-bit)
Many vectorizing compilers were developed by supercomputer vendors. Intel introduced the first vectorizing compiler for SSE in 1999. Since the Android O release, the optimizing compiler of ART has joined the family of vectorizing compilers. (www.aartbik.com)
for (int i = 0; i < 256; i++) {
  a[i] = b[i] + 1;
}

->

for (int i = 0; i < 256; i += 4) {
  a[i:i+3] = b[i:i+3] + [1,1,1,1];
}
A class hierarchy of general vector operations that is sufficiently powerful to represent SIMD operations common to all architectures:

VectorOperation (has vector length, has packed data type)
├── VectorBinOp
│   ├── VectorAdd
│   └── VectorSub
└── VectorMemOp (has alignment)
    ├── VectorLoad
    └── VectorStore
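The hierarchy above can be sketched in plain Java (an illustration only: the class names follow the slide, but the fields, constructors, and accessors are assumptions, not ART's actual implementation):

```java
// Minimal sketch of a vector-operation class hierarchy (illustrative).
abstract class VectorOperation {
    final int vectorLength;      // number of lanes in the vector
    final Class<?> packedType;   // packed data type, e.g. int.class

    VectorOperation(int vectorLength, Class<?> packedType) {
        this.vectorLength = vectorLength;
        this.packedType = packedType;
    }
}

abstract class VectorBinOp extends VectorOperation {
    VectorBinOp(int len, Class<?> type) { super(len, type); }
}

class VectorAdd extends VectorBinOp {
    VectorAdd(int len, Class<?> type) { super(len, type); }
}

class VectorSub extends VectorBinOp {
    VectorSub(int len, Class<?> type) { super(len, type); }
}

abstract class VectorMemOp extends VectorOperation {
    final int alignment;         // memory alignment in bytes
    VectorMemOp(int len, Class<?> type, int alignment) {
        super(len, type);
        this.alignment = alignment;
    }
}

class VectorLoad extends VectorMemOp {
    VectorLoad(int len, Class<?> type, int align) { super(len, type, align); }
}

class VectorStore extends VectorMemOp {
    VectorStore(int len, Class<?> type, int align) { super(len, type, align); }
}
```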
for (int i = 0; i < 256; i += 4) {
  a[i:i+3] = b[i:i+3] + [1,1,1,1];
}

->

t = [1,1,1,1];
for (int i = 0; i < 256; i += 8) {
  a[i :i+3] = b[i :i+3] + t;
  a[i+4:i+7] = b[i+4:i+7] + t;
}
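The unrolled form computes exactly the same result as the simple loop, which can be checked with plain scalar Java (a sketch: the class and method names are invented, and the inner loops stand in for the 4-wide vector operations):

```java
// Scalar reference vs. a hand-unrolled version of a[i] = b[i] + 1.
class UnrollDemo {
    static int[] plusOne(int[] b) {
        int[] a = new int[b.length];
        for (int i = 0; i < b.length; i++) {
            a[i] = b[i] + 1;
        }
        return a;
    }

    // Mirrors the unrolled vector loop: two 4-wide "adds" per iteration
    // (assumes the length is a multiple of 8, as with 256 here).
    static int[] plusOneUnrolled(int[] b) {
        int[] a = new int[b.length];
        for (int i = 0; i < b.length; i += 8) {
            for (int j = 0; j < 4; j++) a[i + j] = b[i + j] + 1; // a[i :i+3]
            for (int j = 4; j < 8; j++) a[i + j] = b[i + j] + 1; // a[i+4:i+7]
        }
        return a;
    }
}
```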
t = [1,1,1,1];
for (int i = 0; i < 256; i += 8) {
  a[i :i+3] = b[i :i+3] + t;
  a[i+4:i+7] = b[i+4:i+7] + t;
}

->

      movi v0.4s, #0x1, lsl #0
      mov w3, #0xc
      mov w0, #0x0
Loop: cmp w0, #0x100 (256)
      b.hs Exit
      add w4, w0, #0x4 (4)
      add w0, w3, w0, lsl #2
      add w5, w3, w4, lsl #2
      ldr q1, [x2, x0]
      add v1.4s, v1.4s, v0.4s
      str q1, [x1, x0]
      ldr q1, [x2, x5]
      add v1.4s, v1.4s, v0.4s
      str q1, [x1, x5]
      add w0, w4, #0x4 (4)
      ldrh w16, [tr]        ; suspend check
      cbz w16, Loop
VecReplicateScalar(x)

ARM64:  dup v0.4s, w2
x86-64: movdq xmm0, rdx
        pshufd xmm0, xmm0, 0
MIPS64: fill.w w0, a2
/**
 * Cross-fade byte arrays x1 and x2 into byte array x_out.
 */
private static void avg(byte[] x_out, byte[] x1, byte[] x2) {
  // Compute minimum length of the three byte arrays.
  int min = Math.min(x_out.length, Math.min(x1.length, x2.length));
  // Morph with rounding halving add (unsigned).
  for (int i = 0; i < min; i++) {
    x_out[i] = (byte) (((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1);
  }
}
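The bit trick treats the bytes as unsigned and rounds up, matching NEON's rounding halving add. Its behavior can be checked with a tiny harness (a sketch: the wrapper class name and test values are invented):

```java
class CrossFade {
    /** Cross-fade byte arrays x1 and x2 into byte array x_out. */
    static void avg(byte[] x_out, byte[] x1, byte[] x2) {
        int min = Math.min(x_out.length, Math.min(x1.length, x2.length));
        for (int i = 0; i < min; i++) {
            // Rounding halving add, treating the bytes as unsigned.
            x_out[i] = (byte) (((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1);
        }
    }
}
```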
SEQUENTIAL (ARMv8 AArch64):

L: cmp w5, w0
   b.hs Exit
   add w4, w2, #0xc (12)
   add w6, w3, #0xc (12)
   ldrsb w4, [x4, x5]
   ldrsb w6, [x6, x5]
   and w4, w4, #0xff
   and w6, w6, #0xff
   add w4, w4, w6
   add w6, w1, #0xc (12)
   add w4, w4, #0x1 (1)
   asr w4, w4, #1
   strb w4, [x6, x5]
   add w5, w5, #0x1 (1)
   ldrh w16, [tr]        ; suspend check
   cbz w16, L

SIMD (ARMv8 AArch64 + NEON Technology):

L: cmp w5, w4
   b.hs Exit
   add w16, w2, w5
   ldur q0, [x16, #12]
   add w16, w3, w5
   ldur q1, [x16, #12]
   urhadd v0.16b, v0.16b, v1.16b
   add w16, w1, w5
   stur q0, [x16, #12]
   add w5, w5, #0x10 (16)
   ldrh w16, [tr]        ; suspend check
   cbz w16, L
Runs about 10x faster!
Sequential performance: ≈20 fps
SIMD performance (NEON 128-bit): ≈60 fps
Java code:

void mul_add(int[] a, int[] b) {
  for (int i = 0; i < 512; i++) {
    a[i] += a[i] * b[i];
  }
}
Autovectorization result:

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.2s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.2s}, [x16]
   mul v1.2s, v0.2s, v1.2s
   add v0.2s, v0.2s, v1.2s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v0.2s}, [x16]
   add w0, w0, #0x2
   ldrh w16, [tr]
   cbz w16, L
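For reference, the scalar semantics being vectorized can be pinned down with a small harness (a sketch: the wrapper class is invented, and the bound is taken from the array length rather than the slide's fixed 512):

```java
class MulAdd {
    // Same computation as the slide's mul_add: a[i] += a[i] * b[i].
    static void mulAdd(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] += a[i] * b[i];
        }
    }
}
```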
Before:

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.2s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.2s}, [x16]
   mul v1.2s, v0.2s, v1.2s
   add v0.2s, v0.2s, v1.2s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v0.2s}, [x16]
   add w0, w0, #0x2
   ldrh w16, [tr]
   cbz w16, L

After (68% perf boost):

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.4s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.4s}, [x16]
   mul v1.4s, v0.4s, v1.4s
   add v0.4s, v0.4s, v1.4s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v0.4s}, [x16]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L
Before:

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.4s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.4s}, [x16]
   mul v1.4s, v0.4s, v1.4s
   add v0.4s, v0.4s, v1.4s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v0.4s}, [x16]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L

After (11% perf boost):

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.4s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.4s}, [x16]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v2.4s}, [x16]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L
Before:

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.4s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.4s}, [x16]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v2.4s}, [x16]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L

After (23% perf boost):

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, w0, lsl #2
   ldur q0, [x16, #12]
   add w16, w2, w0, lsl #2
   ldur q1, [x16, #12]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   add w16, w1, w0, lsl #2
   stur q2, [x16, #12]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L
Before:

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, w0, lsl #2
   ldur q0, [x16, #12]
   add w16, w2, w0, lsl #2
   ldur q1, [x16, #12]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   add w16, w1, w0, lsl #2
   stur q2, [x16, #12]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L

After (10% perf boost):

   mov w3, #0xc
L: cmp w0, #0x200
   b.hs Exit
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L
Before:

L: cmp w0, #0x200
   b.hs Exit
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L

After (2.5% perf boost):

L: cmp w0, #0x200
   b.hs Exit
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L
Before:

L: cmp w0, #0x200
   b.hs Exit
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L

After (12% perf boost):

L: cmp w0, #0x200
   b.hs Exit
   add w4, w0, #0x4
   add w0, w3, w0, lsl #2
   add w5, w3, w4, lsl #2
   ldr q0, [x1, x0]
   ldr q1, [x2, x0]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x0]
   ldr q0, [x1, x5]
   ldr q1, [x2, x5]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x5]
   add w0, w4, #0x4
   ldrh w16, [tr]
   cbz w16, L
for (int i = 0; i < LENGTH; i++) {
  c[i] = (byte)(a[i] + b[i]);
}

Vectorized compiler IR:

i87  Add [i80,i79]
i102 IntermediateAddressIndex [i87,i98,i3]
i99  IntermediateAddressIndex [i80,i98,i3]
d89  VecLoad [l35,i102]
d84  VecLoad [l35,i99]
d83  VecLoad [l29,i99]
d88  VecLoad [l29,i102]
d85  VecAdd [d83,d84]
d90  VecAdd [d88,d89]
d86  VecStore [l27,i99,d85]
d91  VecStore [l27,i102,d90]
i92  Add [i87,i79]
v78  Goto
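The loop being vectorized here is a plain byte add with Java's narrowing cast, which wraps modulo 256. A scalar reference version (a sketch: the wrapper class is invented, and the bound is taken from the shorter array rather than a fixed LENGTH):

```java
class ByteAdd {
    // Same computation as the loop above: c[i] = (byte)(a[i] + b[i]).
    static byte[] add(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        byte[] c = new byte[n];
        for (int i = 0; i < n; i++) {
            // The narrowing cast keeps only the low 8 bits of the sum.
            c[i] = (byte) (a[i] + b[i]);
        }
        return c;
    }
}
```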
Java code:

static final int LENGTH = 1024 * 256;  // 256K elements, 0x40000
static byte[] a = new byte[LENGTH];
static byte[] b = new byte[LENGTH];
static byte[] c = new byte[LENGTH];

(gdb) x/64u 0xefc0b000
0xefc0b000:   0   28  192   18    0    0    0    0   ; Object Header
0xefc0b008:   0    0    4    0  100  101  102  103   ; length 0x40000, data[0] = 100 at offset 12
0xefc0b010: 104  105  106  107  108  109  110  111
0xefc0b018: 112  113  114  115  116  117  118  119
0xefc0b020: 120  121  122  123  124  125  126  127
0xefc0b028: 128  129  130  131  132  133  134  135
0xefc0b030: 136  137  138  139  140  141  142  143
0xefc0b038: 144  145  146  147  148  149  150  151
One VecLoad / VecStore moves 16 of these array bytes at a time.
(gdb) x/64u 0xefc0b000
0xefc0b000:   0   28  192   18    0    0    0    0
0xefc0b008:   0    0    4    0  100  101  102  103   <- avoid SIMD from here (data[0] at offset 12)
0xefc0b010: 104  105  106  107  108  109  110  111   <- SIMD from here
0xefc0b018: 112  113  114  115  116  117  118  119
0xefc0b020: 120  121  122  123  124  125  126  127
0xefc0b028: 128  129  130  131  132  133  134  135
0xefc0b030: 136  137  138  139  140  141  142  143
0xefc0b038: 144  145  146  147  148  149  150  151
Analyzable and flexible: CHECKED!
Embeddable: CHECKED!
Stable and reproducible: CHECKED!
Recognized: CHECKED!
LDR q1, [x16] + LDR q2, [x16, #16] -> LDP q1, q2, [x16]
Java:

void mul_add(int[] a, int[] b, int[] c) {
  for (int i = 0; i