FASTER PASSWORD RECOVERY WITH MODERN GPUs
Andrey Belenko, Security Researcher, ElcomSoft Co. Ltd.
WHO ARE WE
§Founded in 1990
§Privately owned
§Doing password recovery (software) since 1998
§HQ and development in Moscow, Russia
§Brought GPUs to password recovery in 2007
§5 US patents issued, more in queue
–2 are about GPU-accelerated password recovery
3 | Faster Password Recovery with Modern GPUs | June 14, 2011
WHO NEEDS PASSWORD RECOVERY?
§Ordinary users
–Passwords of their own
§IT Departments
–Passwords of the employees
§Security auditors, consultants and penetration testers
–Customer/contractor passwords
§Law enforcement & government agencies
–Passwords of suspects
§Hackers usually don’t!
WHY DOES SPEED COUNT?
§Users and IT Departments:
–«We needed those passwords yesterday»
§Auditors, consultants and pentesters:
–«Time is Money»
§Law enforcement and investigators:
–Legal time limits
PASSWORD RECOVERY | The Loop
[Diagram: Generate trial password → Transform password (compute hash or encryption key; this is the slow part) → Validate hash/key → Success, or Failure → Try next password and repeat]
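The loop on this slide can be sketched in a few lines of Python. This is only an illustration: the single SHA-1 call stands in for whatever transformation a real format uses, and the names `transform` and `recover`, the alphabet, and the target are all hypothetical.

```python
import hashlib
from itertools import product

def transform(password: str) -> bytes:
    # Stand-in for the slow step (assumption: one SHA-1 call, for illustration only).
    return hashlib.sha1(password.encode("utf-8")).digest()

def recover(target: bytes, alphabet: str, max_len: int):
    """Brute-force loop: generate, transform, validate, try next."""
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):  # generate trial password
            candidate = "".join(chars)
            if transform(candidate) == target:           # transform and validate
                return candidate                         # success
            # failure: fall through and try the next password
    return None

target = hashlib.sha1(b"cab").digest()
print(recover(target, "abc", 3))  # cab
```

Almost all of the running time goes into `transform`, which is why the rest of the deck concentrates on accelerating that step.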
PASSWORD RECOVERY | The Slow Part
§Designed to be slow
–A 50 ms verification time has no impact on usability but a HUGE impact on password recovery performance
§Usually designed around well-known hash functions
–MD5 (old days)
–SHA-1 (most popular so far)
–SHA-2 (still exotic)
§Thousands to millions of hash computations per password
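To get a feel for how the iteration count drives the cost of a single password check, here is a small timing sketch using Python's standard `hashlib.pbkdf2_hmac`. The salt, the password and the helper name `derive_key` are made up for the example, and the timings depend entirely on the machine.

```python
import hashlib
import time

def derive_key(password: bytes, salt: bytes, iterations: int) -> bytes:
    # PBKDF2 with HMAC-SHA1, as a representative iterated key derivation.
    return hashlib.pbkdf2_hmac("sha1", password, salt, iterations)

salt = b"example-salt"  # hypothetical value, for illustration only
for iterations in (1, 10_000):
    start = time.perf_counter()
    key = derive_key(b"secret", salt, iterations)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{iterations:>6} iterations: {elapsed_ms:.3f} ms, key starts with {key.hex()[:8]}")
```

A legitimate user pays this cost once per login; a recovery tool pays it once per trial password.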
FAST PASSWORD RECOVERY | The CPU Way
Before the GPGPU era, most optimizations focused on:
§SIMD (MMX, SSE, AVX)
§Multi-core
§Distributed computing (think distributed.net)
–Communication overhead
–Difficult to manage
–Not power-efficient
FAST PASSWORD RECOVERY | The GPU Way
§Password recovery is an “embarrassingly parallel” workload
§Each processing unit verifies its own password, independently of the other processing units
§Linear scalability in practice
[Diagram: Generate trial passwords → many Transform password steps running in parallel (done by GPU) → Validate hashes/keys → Success, or Failure → Try next password]
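The independence of the work items can be illustrated with a CPU-only sketch in which each worker checks its own slice of candidates and never talks to the others. Python threads are used here purely as a stand-in for GPU processing units (they will not show a real speedup), and `check_slice`, `parallel_recover` and the candidate list are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def check_slice(candidates, target):
    """Test one slice of candidates; needs nothing from any other worker."""
    for candidate in candidates:
        if hashlib.sha1(candidate.encode()).digest() == target:
            return candidate
    return None

def parallel_recover(candidates, target, workers=4):
    # Give every worker a disjoint, interleaved slice of the candidate list.
    slices = [candidates[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for found in pool.map(check_slice, slices, [target] * workers):
            if found is not None:
                return found
    return None

candidates = [f"pin{n:04d}" for n in range(10_000)]
target = hashlib.sha1(b"pin4242").digest()
print(parallel_recover(candidates, target))  # pin4242
```

Because no slice depends on another, adding workers only changes how the list is split, which is the property behind the slide's scalability claim.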
FAST PASSWORD RECOVERY | The GPU Way
[Diagram: CPU: Generate trial passwords → Passwords[] → PCIe → GPU: Passwords[] → Compute keys from passwords → Keys[] → PCIe → CPU: Keys[] → Validate keys]
LIMITATIONS
§Works well for “slow” algorithms
§For “fast” algorithms PCIe becomes the bottleneck
–e.g. for SHA-1 the theoretical limit is 8 GB/s / (20 bytes in + 20 bytes out) ≈ 214 million passwords per second
§Need to offload everything to the GPU
–password generation and key validation on the GPU are bigger challenges than the crypto itself
–especially so without OpenCL
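The PCIe limit quoted above can be reproduced with a few lines of arithmetic. The sketch assumes the bus moves 8 binary gigabytes (8 × 2^30 bytes) per second, which is the reading that yields the slide's 214 million; with 8 × 10^9 bytes per second the result would be 200 million.

```python
PCIE_BYTES_PER_SECOND = 8 * 2**30  # assumption: 8 GiB/s of usable bandwidth
BYTES_IN = 20                      # per password sent to the GPU (slide's figure)
BYTES_OUT = 20                     # per SHA-1 digest read back (160 bits)

limit = PCIE_BYTES_PER_SECOND / (BYTES_IN + BYTES_OUT)
print(f"{limit / 1e6:.1f} million passwords per second")  # 214.7 million passwords per second
```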
ALTERNATIVE WAY
[Diagram: CPU sends only the Initial password over PCIe → GPU: Generate trial passwords → Passwords[] → Compute keys from passwords → Keys[] → Validate keys → Result returned over PCIe to the CPU]
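One common way to generate trial passwords on the device is to let every thread derive its candidate from a starting position plus its own thread index, so that only the starting position has to cross the bus. The sketch below shows that mapping on the CPU; it is an assumption about how such a generator could look, not the scheme used in the talk, and `index_to_password` and `batch_for_device` are hypothetical names.

```python
def index_to_password(index: int, alphabet: str, length: int) -> str:
    """Map a keyspace index to a fixed-length password (base-N digits)."""
    chars = []
    for _ in range(length):
        index, digit = divmod(index, len(alphabet))
        chars.append(alphabet[digit])
    return "".join(reversed(chars))

def batch_for_device(start_index: int, batch_size: int, alphabet: str, length: int):
    # On a GPU, thread `tid` would compute only its own entry: start_index + tid.
    return [index_to_password(start_index + tid, alphabet, length)
            for tid in range(batch_size)]

print(batch_for_device(0, 4, "abc", 2))  # ['aa', 'ab', 'ac', 'ba']
```

With this layout the host transfers a single integer per batch instead of a full Passwords[] array.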
PASSWORD RECOVERY
[Diagram: CPU: Generate trial passwords → Passwords[] → PCIe → GPU: Passwords[] → Compute keys from passwords → Keys[] → PCIe → CPU: Keys[] → Validate keys]
OVERLAPPING CPU AND GPU
§In a straightforward implementation it may look like this:
[Timeline: CPU Gen → GPU Compute → CPU Vfy → CPU Gen → GPU Compute → CPU Vfy → ..., one step at a time]
§But CPU and GPU can work simultaneously, so overlap their operations:
[Timeline: CPU runs Gen and Vfy for neighboring batches while the GPU runs Compute]
Profit!
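The overlapped schedule can be sketched as a simple two-stage pipeline. A single background worker stands in for the GPU here, and `generate`, `compute`, `verify` and `recover_overlapped` are hypothetical names. With CPython threads the overlap is mostly conceptual; with a real GPU queue the device call would return immediately and the host would truly run in parallel.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def generate(batch_no, size=1000):           # CPU: Gen
    return [f"pw{batch_no * size + i}" for i in range(size)]

def compute(passwords):                      # stand-in for GPU: Compute
    return [hashlib.sha1(p.encode()).digest() for p in passwords]

def verify(passwords, keys, target):         # CPU: Vfy
    for password, key in zip(passwords, keys):
        if key == target:
            return password
    return None

def recover_overlapped(target, batches):
    with ThreadPoolExecutor(max_workers=1) as device:
        passwords = generate(0)
        pending = device.submit(compute, passwords)
        for batch_no in range(1, batches + 1):
            # While the device computes the current batch, prepare the next one.
            next_passwords = generate(batch_no) if batch_no < batches else None
            keys = pending.result()
            if next_passwords is not None:
                pending = device.submit(compute, next_passwords)
            # The device is already busy with the next batch while we verify.
            found = verify(passwords, keys, target)
            if found is not None:
                return found
            passwords = next_passwords
    return None

print(recover_overlapped(hashlib.sha1(b"pw2500").digest(), 5))  # pw2500
```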
PERFORMANCE | PBKDF2-SHA1 x 10’000
[Bar chart: computations per second, axis from 0K to 60K]
Intel i7-970: 3120
NVIDIA GTX 590: 23500
AMD HD 6990: 50300
HEY, WHY NO 100X SPEEDUP?
Be fair!
§CPUs are not single core any more
–Even Atoms are not
§Extended instruction sets were introduced for performance reasons
–So why ignore them?
§Will usually get ~10x on comparable hardware for well-suited compute-bound tasks
CPU LAYOUT
[Die diagram: six cores, L3 cache, memory controller, queue, IO & QPI]
§1.2 billion transistors
–Most are L3/L2 caches
§Less than 10% are in execution and/or ALU units
GPU LAYOUT
§3 billion transistors (2.5x)
§About 30% are execution and/or ALU units (3x)
§7.5x more transistors dedicated to execution units
§Core frequency is lower (~0.4x)
§3x estimated speedup
In a fair real-world comparison this GPU is 4x faster than the CPU on a compute-bound task
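The back-of-the-envelope estimate on the two layout slides multiplies out as follows. The CPU execution share is taken as exactly 10% here, although the CPU slide says "less than 10%", so the result is a rough estimate rather than a measurement.

```python
cpu_transistors = 1.2e9   # from the CPU layout slide
gpu_transistors = 3.0e9   # from the GPU layout slide
cpu_exec_share = 0.10     # assumption: slide says "less than 10%"
gpu_exec_share = 0.30     # "about 30%"
frequency_ratio = 0.4     # GPU core clock relative to CPU

transistor_ratio = gpu_transistors / cpu_transistors
share_ratio = gpu_exec_share / cpu_exec_share
exec_transistor_ratio = transistor_ratio * share_ratio
estimated_speedup = exec_transistor_ratio * frequency_ratio

print(f"transistors: {transistor_ratio:.1f}x")             # transistors: 2.5x
print(f"execution share: {share_ratio:.1f}x")              # execution share: 3.0x
print(f"execution transistors: {exec_transistor_ratio:.1f}x")  # execution transistors: 7.5x
print(f"estimated speedup: {estimated_speedup:.1f}x")      # estimated speedup: 3.0x
```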
HEY, WHY NO 100X SPEEDUP?
Be fair!
§CPUs are not single core any more
–Even Atoms are not
§Extended instruction sets were introduced for performance reasons
–So why ignore them?
§Will usually get ~10x on comparable hardware for well-suited tasks
In our case:
§SSE2 code + processor-specific compiler optimizations
§12 threads to fully utilize 6 cores + HT
§16x over high-end CPU
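The 16x figure can be checked against the PBKDF2-SHA1 chart values given in this deck; the short sketch below simply divides each rate by the CPU baseline.

```python
# Computations per second, as read from the PBKDF2-SHA1 x 10'000 chart.
rates = {"Intel i7-970": 3120, "NVIDIA GTX 590": 23500, "AMD HD 6990": 50300}
baseline = rates["Intel i7-970"]

speedups = {name: rate / baseline for name, rate in rates.items()}
for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x")
# Intel i7-970: 1.0x
# NVIDIA GTX 590: 7.5x
# AMD HD 6990: 16.1x
```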
WHY IS AMD SO FAST?
§Most password transformations are bound by integer performance
–AMD cards exhibit awesome integer performance
§Many password transformations (=crypto) make heavy use of bit rotations (=cyclic shifts)
–There is a special instruction for this!
–Cyclic shift in 1 instruction instead of 3, up to 30% overall speedup in practice
§GPU code written in IL
–Utilize all GPU devices under Windows
–(Recent APP SDK versions allow this with OpenCL)
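The "3 instructions" refers to the usual way of building a rotation out of generic operations: two shifts and an OR. A minimal Python sketch of a 32-bit left rotation is shown below; the final mask is only needed because Python integers are unbounded, and `rotl32` is a hypothetical helper name.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits (0 <= n < 32): shift, shift, OR."""
    return ((x << n) | (x >> (32 - n))) & MASK32

print(hex(rotl32(0x80000001, 1)))  # 0x3
print(hex(rotl32(0x12345678, 8)))  # 0x34567812
```

SHA-1, for example, rotates by 5 and by 30 bits in every round, so hardware that performs the rotation in a single instruction saves work in the innermost loop.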
PERFORMANCE | bitalign
§AMD IL Specification, section 7.13: Aligns bit data for video. This is a special instruction for multi-media video.
bitalign dst, src0, src1, src2
dst = (src0 >> src2.x) | (src1