Maximizing Application Acceleration with FPGAs
Shreyas Shah, Xilinx, Inc.
Santa Clara, CA
Agenda
▪ Market background and Data Center trends
▪ Change in compute architectures after 75 years!
▪ Workload-specific acceleration: Xilinx FPGA
▪ Increased Ethernet network performance: Xilinx FPGA
▪ Summary
Markets and Data Center Trends
▪ Software Defined Data Center: Physical → Virtual → Cloud
▪ Evolving standards
▪ Need for workload-specific acceleration: mentioned publicly by companies like Baidu, Microsoft, JPMorgan, and others
Exponential Growth: Servers, Storage, Network
▪ Servers, storage, and network capacity are all growing exponentially, driving the need for acceleration
Source: ONS2014 Keynote, Microsoft / Azure
Big Data: Big Impact on DC Infrastructure
▪ Workloads: video analytics, speech-to-text, targeted advertisements, OCR, key-value stores, DNN
▪ Infrastructure: NVDIMM, SSD, in-memory databases, acceleration in SSDs
▪ Acceleration in appliances: NIC card acceleration, storage HBA acceleration, ToR switch acceleration, core network acceleration
▪ Acceleration in NAS servers: flash as a cache (file IO)
▪ Acceleration in flash-based SSDs: caching, de-dup, compression/decompression, big data analytics
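The in-SSD functions listed above (caching, de-dup, compression) can be sketched in software. Below is a minimal, illustrative content-addressed de-duplication store with compression, assuming fixed-size blocks; the class and method names are hypothetical, not any Xilinx or vendor API:

```python
import hashlib
import zlib

class DedupStore:
    """Toy content-addressed block store: identical blocks are
    stored (compressed) only once, keyed by their SHA-256 digest."""

    def __init__(self):
        self.blocks = {}  # digest -> compressed block bytes

    def write(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blocks:        # de-dup: skip already-known blocks
            self.blocks[digest] = zlib.compress(data)
        return digest                        # caller keeps the digest as a "pointer"

    def read(self, digest: str) -> bytes:
        return zlib.decompress(self.blocks[digest])

store = DedupStore()
a = store.write(b"A" * 4096)
b = store.write(b"A" * 4096)   # duplicate block: not stored again
assert a == b and len(store.blocks) == 1
assert store.read(a) == b"A" * 4096
```

An FPGA offload would run the hashing and compression stages in hardware pipelines; the host-visible logic stays this simple.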
Evolving Architectures
▪ Power/thermal density is limiting Fmax scaling
• End of Dennard scaling ⇨ end of Moore's law
▪ CPU performance scaling is problematic
• Difficulties in exploiting task-level parallelism with multicore ⇨ dark silicon
▪ Heterogeneous computing ⇨ best of both worlds
• Higher performance and lower power
• Increased compute density
[Figure: CPU frequency (Fmax) and power scaling trends. Source: Intel, Wikipedia]
[Figure: Heterogeneous CPU / GPU computing. Source: ISCA2011, Xilinx]
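The end of Dennard scaling can be made concrete with the dynamic-power relation P ≈ C·V²·f: as long as supply voltage scaled down with feature size, frequency could rise at roughly constant power; once voltage scaling stopped, raising f raises power directly. A small back-of-the-envelope illustration with made-up capacitance and voltage values:

```python
def dynamic_power(c_eff, v_dd, f_hz):
    """Dynamic switching power: P = C_eff * Vdd^2 * f."""
    return c_eff * v_dd**2 * f_hz

C = 1e-9  # effective switched capacitance (F), illustrative only

# Dennard era: each generation scaled Vdd down ~0.7x while f rose ~1.4x,
# so total switching power actually fell.
p_old = dynamic_power(C, 1.0, 2e9)
p_new = dynamic_power(C, 0.7, 2.8e9)
print(p_new / p_old)  # ~0.69

# Post-Dennard: Vdd is stuck near ~1 V, so doubling f doubles power.
p_fast = dynamic_power(C, 1.0, 4e9)
print(p_fast / p_old)  # 2.0
```

This is why clock rates have sat at a few GHz for a decade (as the Summary notes) and why the deck argues for specialized hardware instead of higher frequency.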
Application Acceleration in Data Center
Cloud computing and big data analytics
▪ TCO-optimized processor architectures are emerging
▪ Clusters of workload-specific computing
• HPC in the cloud
– Personalized medicine
– Oil and gas exploration
• Big data analytics, event stream processing, data mining
– Database acceleration (in-memory databases)
– Personalized advertisements
• Other applications include
– Video analytics and image processing
– Ticker symbol processing
– Machine learning and analytics: image recognition, speech recognition, neural networks and deep learning
▪ IoT
▪ C-RAN
Traditional & Emerging Computer Architectures
▪ Traditional computer architectures
• Main processor
• Bridge chips (south, north, IO bridges)
• IO slots for graphics
• IO controllers
• DRAM memory DIMMs
• Hard disk
▪ Emerging architecture evolution
• Main processor: SoC
– Processor with integrated bridge chips
– IO controllers
– Graphics processing units
• Memory
– Processors in memories
– Memory appliances with large amounts of DRAM
– DRAM memory modules with flash
• Flash replaces the hard disk
Application Acceleration with FPGA : Hottest New Trend
Workload-Specific Acceleration: Xilinx FPGA
▪ Main processor with FPGA: two models emerging
• Inline model (inline acceleration, pre-processing): FPGA sits in the PCIe path between the processor and the IO controller
• Co-processor model: FPGA attached to the processor over PCIe
[Diagrams: inline model — Processor (w/ Mem) ⇨ PCIe ⇨ FPGA ⇨ PCIe ⇨ IO controller; co-processor model — Processor (w/ Mem) ⇨ PCIe ⇨ FPGA]
Application Acceleration: Xilinx FPGA
▪ Co-processor model
• IO bus based (FPGA on PCIe with DMA)
– DMA-based programming model
▪ Co-processor model
• Cache coherent interface (CCI)
– Load/Store programming model
[Diagrams: Processor (w/ Mem) ⇨ PCIe/DMA ⇨ FPGA (w/ Mem); Processor (w/ Mem) ⇨ CCI ⇨ FPGA (w/ Mem)]
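The two attachment options imply different host programming models. Here is a software mock of the difference, with entirely hypothetical names (no real driver API): the PCIe/DMA model copies buffers to and from device memory around each kernel invocation, while the cache-coherent model lets the accelerator load and store shared host memory in place.

```python
# DMA model: explicit copy in, run kernel, copy out.
class DmaFpga:
    def __init__(self):
        self.device_mem = {}

    def dma_write(self, addr, buf):          # host -> device copy
        self.device_mem[addr] = list(buf)

    def run_kernel(self, addr):              # kernel works on device memory
        self.device_mem[addr] = [x * 2 for x in self.device_mem[addr]]

    def dma_read(self, addr):                # device -> host copy
        return list(self.device_mem[addr])

# Cache-coherent model: accelerator loads/stores shared memory directly.
class CoherentFpga:
    def run_kernel(self, shared_buf):        # no copies; results visible in place
        for i, x in enumerate(shared_buf):
            shared_buf[i] = x * 2

data = [1, 2, 3, 4]

fpga = DmaFpga()
fpga.dma_write(0x1000, data)
fpga.run_kernel(0x1000)
assert fpga.dma_read(0x1000) == [2, 4, 6, 8]

shared = list(data)
CoherentFpga().run_kernel(shared)            # no explicit read-back step
assert shared == [2, 4, 6, 8]
```

The trade-off the slides point at: DMA amortizes well over large transfers, while the load/store model suits fine-grained sharing between CPU and FPGA.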
Network Acceleration: Xilinx FPGA
▪ ASSP with FPGA
• Inline model (inline FPGA with network acceleration, pre-processing): FPGA sits between the NIC ASSP and the Ethernet IO
▪ ASSP with FPGA
• Co-processor model: ASSP with FPGA on a side interface
[Diagrams: Processor (w/ Mem) ⇨ PCIe ⇨ NIC ASSP ⇨ FPGA ⇨ Ethernet IO; Processor (w/ Mem) ⇨ PCIe ⇨ NIC ASSP (FPGA on side interface) ⇨ Ethernet IO]
Xilinx Chips Used in Data Centers
▪ Compute
• Graph processing: 10-100x Perf/W
• String/pattern matching: 10-20x Perf/W
• Image/signal processing: 50x throughput
• DNN
▪ Storage
• Hybrid memory: latency hiding, 10x power saving
• Key-value stores: 36x RPS/Watt, 10x-100x latency reduction
• Compression/encryption: customized algorithms, sub-5 µs latency, 10x encryption rate
▪ Networking
• Secure sockets: sub-5 µs latency, 10x encryption rate
• TCP endpoint: sub-2 µs latency, 10x virtual circuits
• Packet switch: sub-100 ns latency, protocol choices
Source: Xilinx
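Perf/W figures like the key-value-store number above combine a throughput ratio with a power ratio. A quick sanity check on how such a figure is composed, using purely illustrative inputs (the deck does not give the underlying measurements):

```python
def perf_per_watt_gain(speedup, baseline_watts, accel_watts):
    """Perf/W gain = throughput speedup x (baseline power / accelerated power)."""
    return speedup * baseline_watts / accel_watts

# Illustrative only: 4x the requests/s at one-ninth the power -> 36x RPS/Watt.
gain = perf_per_watt_gain(speedup=4.0, baseline_watts=90.0, accel_watts=10.0)
print(gain)  # 36.0
```

The point for TCO: even a modest raw speedup becomes a large Perf/W gain when the accelerator also draws far less power than the server CPUs it offloads.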
Application Acceleration with Xilinx FPGA
▪ FPGA value proposition
• High-speed IO and SerDes (33 Gbps)
• High-speed memory connectivity
– DRAM, QDR SRAM, RL3
– Graphics memory
– HBM
– HMC, MoSys/GSI BE
• Large amount of on-chip memory
• Flash interfaces with error correction
– ONFi, Toggle, eMMC, SAS, SATA
• PCIe IO bus: Gen1/Gen2/Gen3/Gen4 and future Gen4 overclocked
▪ FPGA value proposition (cont'd)
• Ethernet connectivity: 100 Mbps to 400 Gbps
• Interlaken: 150 Gbps to 600 Gbps
• Processor blocks for optimized applications
• Large pool of DSPs
• Support for higher-level abstractions
– C/C++/OpenCL
• Application library components to serve the acceleration market
• Variety of other protocol support
Software Defined Development Environments
✓ SDAccel for OpenCL, C, and C++ enables up to 25x better performance per watt
✓ Provides C/C++/OpenCL programming down to bit files
✓ SDSoC: ASSP-like programming experience
✓ SDNet allows creation of 'Softly' Defined Networks
✓ Higher-level language to define the "fields" of interest to perform packet-processing tasks
Expand users to a broad community of software and systems engineers
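SDNet has its own high-level description language; as a language-neutral illustration of "defining the fields of interest", here is a Python sketch that extracts only the EtherType and IPv4 addresses from a raw frame by byte offset — conceptually what a generated packet-processing pipeline keys on. The function and frame here are illustrative, not SDNet syntax.

```python
import struct

def parse_fields(frame: bytes) -> dict:
    """Extract only the fields of interest from an Ethernet II / IPv4 frame."""
    dst, src, ethertype = struct.unpack_from("!6s6sH", frame, 0)
    fields = {"ethertype": ethertype}
    if ethertype == 0x0800:  # IPv4: addresses sit 12 bytes into the IP header
        src_ip, dst_ip = struct.unpack_from("!4s4s", frame, 14 + 12)
        fields["src_ip"] = ".".join(str(b) for b in src_ip)
        fields["dst_ip"] = ".".join(str(b) for b in dst_ip)
    return fields

# A hand-built frame: 14-byte Ethernet header + minimal IPv4 header.
frame = (
    b"\xaa\xbb\xcc\xdd\xee\xff"    # destination MAC
    b"\x11\x22\x33\x44\x55\x66"    # source MAC
    b"\x08\x00"                    # EtherType = IPv4
    + b"\x45\x00\x00\x14" + b"\x00" * 8   # IPv4 header up to the addresses
    + bytes([10, 0, 0, 1])         # source IP
    + bytes([10, 0, 0, 2])         # destination IP
)
out = parse_fields(frame)
assert out["ethertype"] == 0x0800
assert out["src_ip"] == "10.0.0.1" and out["dst_ip"] == "10.0.0.2"
```

In hardware the same field extraction runs at line rate; the designer only declares which fields matter, as the slide describes.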
Summary
▪ Processors are hitting the wall: increasing performance at reasonable power
• Stuck between 2-3 GHz for more than a decade
▪ Specific applications require specialized hardware blocks to be optimal
• Power, performance, and scalability ⇨ Xilinx FPGA is the answer
▪ TCO (OPEX) is the main focus of the Data Center for profitability
• TCO-optimized architectures
– CPU + FPGA on a PCIe/cache-coherent bus will drive application acceleration
• NIC ASSP + FPGA architectures are evolving
– Addressing the performance challenge of servers and Ethernet network connectivity
• Ethernet switch ASSP + FPGA evolving to solve network performance
Thank you for your attention!