FASTER PASSWORD RECOVERY WITH MODERN GPUs

Author: Ada Oliver
Andrey Belenko
ElcomSoft Co. Ltd.
Security Researcher

WHO ARE WE
§ Founded in 1990
§ Privately owned
§ Doing password recovery (software) since 1998
§ HQ and development in Moscow, Russia
§ Brought GPUs to password recovery in 2007
§ 5 US patents issued, more in queue
– 2 are about GPU-accelerated password recovery

3 | Faster Password Recovery with Modern GPUs | June 14, 2011

WHO NEEDS PASSWORD RECOVERY?
§ Ordinary users
– Passwords of their own
§ IT Departments
– Passwords of the employees
§ Security auditors, consultants and penetration testers
– Customer/contractor passwords
§ Law enforcement & government agencies
– Passwords of suspects
§ Hackers usually don’t!


WHY DOES SPEED COUNT?
§ Users and IT Departments:
– «We needed those passwords yesterday»
§ Auditors, consultants and pentesters:
– «Time is Money»
§ Law enforcement and investigators:
– Legal time limits


PASSWORD RECOVERY | The Loop
[Flowchart: Generate trial password → Transform password (compute hash or encryption key; the slow part) → Validate hash/key → Success, or on Failure: Try next password and repeat]

PASSWORD RECOVERY | The Slow Part
§ Designed to be slow
– A 50 ms verification time has no impact on usability but a HUGE impact on password recovery performance
§ Usually designed around well-known hash functions
– MD5 (old days)
– SHA-1 (most popular so far)
– SHA-2 (still exotic)
§ Thousands to millions of hash computations per password
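The loop and its slow step can be sketched in a few lines of Python. This is only an illustration, not ElcomSoft's implementation: the salt, the tiny alphabet, the 10,000-iteration count and the function names `transform` and `recover` are all assumptions made for the example, with PBKDF2-HMAC-SHA1 standing in for a deliberately slow transformation.

```python
import hashlib
import hmac
from itertools import product

ITERATIONS = 10_000  # assumed iteration count for this example


def transform(password, salt):
    """The slow part: derive a 20-byte key from a trial password."""
    return hashlib.pbkdf2_hmac("sha1", password.encode(), salt, ITERATIONS, dklen=20)


def recover(target_key, salt, alphabet, max_len):
    """Generate, transform, validate; return the password or None."""
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)                 # generate trial password
            key = transform(candidate, salt)           # thousands of hash computations
            if hmac.compare_digest(key, target_key):   # validate
                return candidate
    return None


salt = b"example-salt"           # hypothetical salt
target = transform("cab", salt)  # pretend this key came from a protected file
found = recover(target, salt, "abc", 3)
print(found)
```

Almost all of the running time is spent inside `transform`, which is why that step is the one worth accelerating.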


FAST PASSWORD RECOVERY | The CPU Way
Before the GPGPU era, most optimizations focused on:
§ SIMD (MMX, SSE, AVX)
§ Multi-core
§ Distributed computing (think distributed.net)
– Communication overhead
– Difficult to manage
– Not power-efficient


FAST PASSWORD RECOVERY | The GPU Way §Password recovery constitutes “embarrassingly parallel” workload §Each processing unit verifies own password, independently from other processing units §Linear scalability in practice Done by GPU Transform password Transform password Generate trial passwords

Transform password

Validate hashes/ keys

Transform password

Failure

Transform password Try next password 11 | Faster Password Recovery with Modern GPUs | June 14, 2011

Success

FAST PASSWORD RECOVERY | The GPU Way
[Diagram: CPU: Generate trial passwords → Passwords[] → PCIe → GPU: Passwords[] → Compute keys from passwords → Keys[] → PCIe → CPU: Keys[] → Validate keys]

LIMITATIONS
§ Works well for “slow” algorithms
§ For “fast” algorithms PCIe becomes the bottleneck
– e.g. for SHA-1 the theoretical limit is 8 GB/s / (20 bytes in + 20 bytes out) ≈ 214 million passwords per second (with 1 GB taken as 2^30 bytes)
§ Need to offload everything to the GPU
– Password generation and key validation on the GPU are bigger challenges than the crypto itself
– Especially so without OpenCL
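The PCIe bound for SHA-1 is simple arithmetic, worked through below. The figure of about 214 million passwords per second follows only if the bus moves 8 × 2^30 bytes per second, so that throughput is the assumption used here. The 20-byte buffers are the slide's numbers; the function name `pcie_limit` is made up for the example.

```python
PCIE_BYTES_PER_SEC = 8 * 2**30  # assumed bus throughput: 8 GiB/s
BYTES_IN = 20                   # password buffer sent to the GPU
BYTES_OUT = 20                  # SHA-1 digest sent back to the CPU


def pcie_limit(bytes_per_sec, bytes_in, bytes_out):
    """Upper bound on passwords per second when every password crosses the bus."""
    return bytes_per_sec / (bytes_in + bytes_out)


limit = pcie_limit(PCIE_BYTES_PER_SEC, BYTES_IN, BYTES_OUT)
print(f"{limit / 1e6:.1f} million passwords per second")  # 214.7 million
```

Any kernel that can hash faster than this rate is limited by the transfers, not by the computation.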


ALTERNATIVE WAY
[Diagram: CPU: Initial password → PCIe → GPU: Generate trial passwords → Passwords[] → Compute keys from passwords → Keys[] → Validate keys → PCIe → CPU: Result]
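One common way to let each GPU thread generate its own trial password from a single starting point is to treat the password space as numbers in a positional system. The Python sketch below models that idea on the CPU; it is an assumption about how such a generator could look, not the scheme used in the talk, and `index_to_password` and `batch` are hypothetical names.

```python
def index_to_password(index, alphabet, length):
    """Map a number in [0, len(alphabet)**length) to a fixed-length password."""
    base = len(alphabet)
    chars = []
    for _ in range(length):
        index, digit = divmod(index, base)
        chars.append(alphabet[digit])
    return "".join(reversed(chars))


def batch(start_index, batch_size, alphabet, length):
    """Model of one launch: thread `tid` derives candidate start_index + tid."""
    return [index_to_password(start_index + tid, alphabet, length)
            for tid in range(batch_size)]


print(batch(0, 4, "abc", 2))  # ['aa', 'ab', 'ac', 'ba']
```

With this mapping only the starting index has to cross PCIe, and each thread needs nothing but its own thread id to find its candidate.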

PASSWORD RECOVERY
[Diagram repeated: the CPU generates trial passwords, the GPU computes keys from passwords, and the CPU validates the keys; Passwords[] and Keys[] cross PCIe]

OVERLAPPING CPU AND GPU
§ In a straightforward implementation it may look like this:

CPU: Gen |         | Vfy | Gen |         | Vfy | Gen |         | Vfy
GPU:     | Compute |     |     | Compute |     |     | Compute |

§ But the CPU and GPU can work simultaneously, so overlap their operations:

CPU: Gen | Gen     | Vfy Gen | Vfy     | Vfy
GPU:     | Compute | Compute | Compute |

Profit!
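The overlap can be modelled in Python with a single worker thread standing in for the GPU. This is a sketch under stated assumptions: `compute` uses plain SHA-1 on the CPU as a placeholder for a real kernel launch, the passwords are dummy strings, and all function names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def generate(batch_no, size):
    """CPU stage (Gen): produce one batch of trial passwords."""
    return [f"pw{batch_no * size + i}" for i in range(size)]


def compute(passwords):
    """Stand-in for the GPU stage (Compute): one key per password."""
    return [hashlib.sha1(p.encode()).digest() for p in passwords]


def verify(passwords, keys, target):
    """CPU stage (Vfy): return the matching password, if any."""
    for password, key in zip(passwords, keys):
        if key == target:
            return password
    return None


def recover_overlapped(target, batches, size):
    with ThreadPoolExecutor(max_workers=1) as gpu:  # one simulated device
        current = generate(0, size)
        pending = gpu.submit(compute, current)
        for batch_no in range(1, batches + 1):
            # Generate the next batch while the device is still busy.
            upcoming = generate(batch_no, size) if batch_no < batches else None
            keys = pending.result()
            if upcoming is not None:
                pending = gpu.submit(compute, upcoming)
            # Verify the finished batch while the next one is being computed.
            found = verify(current, keys, target)
            if found is not None:
                return found
            current = upcoming
    return None


target = hashlib.sha1(b"pw7").digest()
print(recover_overlapped(target, batches=3, size=4))
```

The device always has a batch queued before the CPU starts verifying the previous one, which is the second timeline above.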

PERFORMANCE | PBKDF2-SHA1 x 10’000
Computations per second:
§ Intel i7-970: 3120
§ NVIDIA GTX 590: 23500
§ AMD HD 6990: 50300

HEY, WHY NO 100X SPEEDUP?
Be fair!
§ CPUs are not single core any more
– Even Atoms are not
§ Extended instruction sets were introduced for performance reasons
– So why ignore them?
§ Will usually get ~10x on comparable hardware for well-suited compute-bound tasks


CPU LAYOUT
[Die diagram: six cores, L3 cache, memory controller, queue, IO & QPI]
§ 1.2 billion transistors
– Most are L3/L2 caches
§ Less than 10% are in execution and/or ALU units

GPU LAYOUT
§ 3 billion transistors (2.5x)
§ About 30% are execution and/or ALU units (3x)
§ 7.5x more transistors dedicated to execution units
§ Core frequency is lower (~0.4x)
§ 3x estimated speedup

In a fair real-world comparison, this GPU is 4x faster than the CPU on a compute-bound task
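The 3x estimate is a product of three ratios, spelled out below with the figures from the CPU and GPU layout slides. The CPU's "less than 10%" is taken as exactly 10% here, which is an assumption; the variable names are made up for the example.

```python
cpu_transistors = 1.2e9
gpu_transistors = 3.0e9
cpu_exec_share = 0.10  # "less than 10%", taken as the upper bound
gpu_exec_share = 0.30  # "about 30%"
clock_ratio = 0.4      # GPU core clock relative to the CPU clock

transistor_ratio = gpu_transistors / cpu_transistors             # 2.5x
exec_ratio = transistor_ratio * gpu_exec_share / cpu_exec_share  # about 7.5x
estimated_speedup = exec_ratio * clock_ratio                     # about 3x

print(f"{transistor_ratio:.1f}x transistors, "
      f"{exec_ratio:.1f}x execution transistors, "
      f"{estimated_speedup:.1f}x estimated speedup")
```

The estimate ignores architecture and memory effects, so it is only a rough sanity check against the measured 4x.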


HEY, WHY NO 100X SPEEDUP?
In our case:
§ SSE2 code + processor-specific compiler optimizations
§ 12 threads to fully utilize 6 cores + HT
§ 16x over high-end CPU
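The 16x figure can be checked against the PBKDF2-SHA1 benchmark numbers from the performance slide. The short Python calculation below uses only those three rates; the dictionary layout is just for illustration.

```python
# Computations per second, PBKDF2-SHA1 x 10'000, from the performance slide.
rates = {
    "Intel i7-970": 3120,
    "NVIDIA GTX 590": 23500,
    "AMD HD 6990": 50300,
}

baseline = rates["Intel i7-970"]
speedups = {name: rate / baseline for name, rate in rates.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

This gives roughly 7.5x for the GTX 590 and 16.1x for the HD 6990 over the optimized, fully threaded CPU code.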



WHY IS AMD SO FAST?
§ Most password transformations are bounded by integer performance
– AMD cards exhibit awesome integer performance
§ Many password transformations (=crypto) make heavy use of bit rotations (=cyclic shifts)
– There is a special instruction for this!
– Cyclic shift in 1 instruction instead of 3, up to 30% overall speedup in practice
§ GPU code written in IL
– Utilize all GPU devices under Windows
– (Recent APP SDK versions allow this with OpenCL)
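The "1 instruction instead of 3" point can be illustrated with a small Python model. A 32-bit rotation normally costs two shifts and an OR. A funnel-shift style operation, which shifts a 64-bit pair of registers and keeps 32 bits, does the same job in one step when both inputs are the same word. `funnel_shift_right` is a hypothetical model written for this example and is not claimed to match the exact semantics of any AMD instruction.

```python
MASK32 = 0xFFFFFFFF


def rotl32_three_ops(x, n):
    """Rotate left with shift, shift, OR: three operations on most hardware."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def funnel_shift_right(hi, lo, n):
    """Hypothetical single operation: shift the 64-bit value hi:lo right by n
    and keep the low 32 bits."""
    n &= 31
    return (((hi << 32) | lo) >> n) & MASK32


def rotl32_one_op(x, n):
    """Rotate left by n, expressed as one funnel shift with both inputs equal."""
    return funnel_shift_right(x, x, (32 - n) & 31)


x = 0x80000001
print(hex(rotl32_three_ops(x, 1)), hex(rotl32_one_op(x, 1)))  # 0x3 0x3
```

SHA-1 rotates words in every round, so saving two operations per rotation adds up over thousands of hash computations per password.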


PERFORMANCE | bitalign
§ AMD IL Specification, section 7.13: Aligns bit data for video. This is a special instruction for multi-media video.
bitalign dst, src0, src1, src2
dst = (src0 >> src2.x) | (src1