Augustus: a CCN router for programmable networks
ACM ICN 2016, Kyoto
Davide Kirchner1∗, Raihana Ferdous2∗, Renato Lo Cigno3, Leonardo Maccari3, Massimo Gallo4, Diego Perino5∗, and Lorenzo Saino6
September 27, 2016
1 Google Inc., Dublin, Ireland; 2 Create-Net, Trento, Italy; 3 DISI – University of Trento, Italy; 4 Bell Labs – Nokia, Paris, France; 5 Telefonica Research, Spain; 6 Fastly, London, UK
∗ This work was done while D. Kirchner and R. Ferdous were at the University of Trento, and D. Perino and L. Saino at Bell Labs.
Outline
1. Introduction
2. The Augustus CCN router
3. Performance evaluation
4. Conclusions and lessons learned
Introduction
Objectives
The main goal is to explore the possibilities offered by modern general-purpose hardware in the context of information-centric networking:
• Implement a CCN data plane forwarder fully in software
• Run on a commodity x86_64 machine
• Performance-oriented, open-source and extensible
• Analyze the performance in a worst-case scenario
Why a software router?
Flexibility:
• Quicker development/deployment cycle and (re)configuration
• Hardware can be dynamically allocated to network functions
Tools:
• Off-the-shelf high-performance hardware
• High-speed packet I/O libraries [Int, Riz12]
• Software routing frameworks built on top [BSM15, KJL+15]
Forwarding flow
• Focus on the Content Centric Networking approach [JST+09]
• Interests hold full content name
• Similar to CCNx (vs NDN)
• CS and PIT: exact match
• Longest-prefix match at FIB
[Figure: example topology with nodes A, B and C and routers R1, R2 and R3; R2 has interfaces eth0, eth1 and eth2.]
Example: get /com/updates/sw/v4.2.5.tar.gz
State at router R2:
• Forwarding information base (FIB): /com/updates → eth0
• Pending Interest Table (PIT), while the Interest is pending: /com/updates/sw/v4.2.5.tar.gz → {eth1}
• Content Store (CS), once the Data has come back and the PIT entry has been cleared: /com/updates/sw/v4.2.5.tar.gz → (data...)
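The lookup order in the forwarding flow (exact match in CS and PIT, longest-prefix match in the FIB) can be sketched in a few lines of Python. This is an illustrative model only, not the Augustus code: the `Router` class, its method names, the return tuples and the use of plain dictionaries are all assumptions made for the example.

```python
class Router:
    """Toy CCN forwarder: exact-match CS and PIT, longest-prefix-match FIB."""

    def __init__(self, fib):
        self.fib = dict(fib)  # name prefix -> output interface
        self.pit = {}         # full content name -> set of requesting interfaces
        self.cs = {}          # full content name -> cached payload

    def fib_lookup(self, name):
        """Return the interface of the longest matching prefix, or None."""
        parts = name.strip("/").split("/")
        for n in range(len(parts), 0, -1):
            prefix = "/" + "/".join(parts[:n])
            if prefix in self.fib:
                return self.fib[prefix]
        return None

    def on_interest(self, name, in_iface):
        if name in self.cs:                 # CS hit: answer from the cache
            return ("data", in_iface, self.cs[name])
        if name in self.pit:                # PIT hit: aggregate the request
            self.pit[name].add(in_iface)
            return ("aggregated", None, None)
        out_iface = self.fib_lookup(name)   # PIT miss: consult the FIB
        if out_iface is None:
            return ("drop", None, None)
        self.pit[name] = {in_iface}
        return ("interest", out_iface, None)

    def on_data(self, name, payload):
        faces = self.pit.pop(name, None)    # the PIT entry is consumed
        if faces is None:                   # no pending Interest: discard
            return ("drop", set())
        self.cs[name] = payload             # cache the content
        return ("data", faces)


# The example from the slide, seen from router R2.
r2 = Router({"/com/updates": "eth0"})
name = "/com/updates/sw/v4.2.5.tar.gz"
step1 = r2.on_interest(name, "eth1")   # forwarded towards eth0, PIT entry created
step2 = r2.on_data(name, b"payload")   # sent back to eth1, stored in the CS
step3 = r2.on_interest(name, "eth2")   # now served directly from the CS
```

After `step1` the PIT holds the entry shown on the slide, and after `step2` that entry has moved to the CS.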
The Augustus CCN router
Design principles
• Exploit parallelism at all possible levels:
  • Hardware multi-queue at NIC
  • DRAM memory channels
  • Multiple cores on chip
  • Multiple NUMA sockets
• Data structures designed to match the x86 cache system
• Shared read-only FIB, duplicated in all NUMA sockets
• Sharded, thread-private CS and PIT
• Exploit NIC’s Receive Side Scaling capabilities to dispatch incoming packets to threads
• Zero-copy packet processing
• Based on DPDK for fast packet I/O [Int]
• Explored both sides of a trade-off: maximum performance or more flexibility
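One way to picture the sharded, thread-private CS and PIT: if every packet carrying a given content name reaches the same thread, that thread can keep its own tables without locks. The sketch below is an illustration under stated assumptions. It hashes the name in software with `zlib.crc32`, whereas Augustus relies on the NIC's Receive Side Scaling for dispatch, and `NUM_THREADS`, `shard_for` and `record_interest` are hypothetical names.

```python
import zlib

NUM_THREADS = 8  # hypothetical number of forwarding threads


def shard_for(name, num_threads=NUM_THREADS):
    """Map a content name to the thread that owns its CS/PIT shard."""
    return zlib.crc32(name.encode("utf-8")) % num_threads


# One private PIT per thread. No lock is needed because a given name
# always maps to the same shard, so only one thread ever touches it.
pit_shards = [dict() for _ in range(NUM_THREADS)]


def record_interest(name, in_iface):
    """Store a pending Interest in the owning shard and return its index."""
    shard = shard_for(name)
    pit_shards[shard].setdefault(name, set()).add(in_iface)
    return shard


owner = record_interest("/com/updates/sw/v4.2.5.tar.gz", "eth1")
```

The Data packet for the same name hashes to the same `owner`, so the matching PIT entry is found without consulting any other thread.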
Design - standalone
Low-level standalone C implementation:
• Based on low-level optimized APIs
• Pushes the platform to its limits
• Architecture based on Caesar [PVL+14]
Design - modular
• Based on (Fast)Click [KMC+00, BSM15]
• Easy to extend and experiment with
• Same optimized data structures
• Can be deployed alongside other routing components
[Figure: Click element graph. Elements: FromDPDKDevice(n), InputMux, Check ICNHeader, ICN_CS, ICN_PIT, ICN_FIB, Discard, OutputDemux, ToDPDKDevice(n). Edge labels: I = Interest packet, D = Data packet, with (hit) and (miss) outcomes.]
Performance evaluation
Experimental setup
• Two twin machines, each with two 10 Gbit/s Ethernet ports
• Measurements expressed in data packets per second
• Work in slight overload conditions
Worst-case assumptions:
• Every interest packet has a unique name: no CS hits, no PIT aggregation
• Minimal-sized packets, to stress the forwarding engine
[Figure: testbed. One machine runs the Augustus router (eth0, eth1); the other is the traffic generator and sink, running an Interest generator and an Echo server. Interest and data packets flow between them.]
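The first worst-case assumption can be illustrated with a generator that never repeats a name, so no request can be answered from the CS or merged into an existing PIT entry. The name format and the function name `unique_interest_names` are assumptions; the slides do not say how the actual traffic generator builds names.

```python
import itertools


def unique_interest_names(prefix="/com/updates", start=0):
    """Yield an endless stream of distinct content names under `prefix`."""
    for seq in itertools.count(start):
        yield f"{prefix}/obj{seq}"


# Take the first 1000 names of the stream; all of them are different.
names = list(itertools.islice(unique_interest_names(), 1000))
```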
Threads and core mapping
• Threads are pinned to processing cores
• Test servers: 2 sockets × 8 cores × 2 (hyperthreading)
[Figure: CPU topology of the test servers. Each core has private L1 I, L1 D and L2 caches; each socket has one shared L3. Logical CPUs are numbered in hyperthread pairs 0/16, 1/17, ..., 15/31.]
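The logical CPU numbers in the figure pair up as 0/16, 1/17, ..., 15/31. A small helper makes that layout explicit. It assumes, as the pairs suggest but the slide does not state, that sibling hyperthreads differ by 16 and that even-numbered physical cores sit on one socket and odd-numbered ones on the other. `describe_cpu` and `cpus_on_socket` are hypothetical names.

```python
SOCKETS = 2
CORES_PER_SOCKET = 8
THREADS_PER_CORE = 2
PHYSICAL_CORES = SOCKETS * CORES_PER_SOCKET        # 16
LOGICAL_CPUS = PHYSICAL_CORES * THREADS_PER_CORE   # 32


def describe_cpu(cpu):
    """Return socket, physical core and hyperthread sibling of a logical CPU."""
    if not 0 <= cpu < LOGICAL_CPUS:
        raise ValueError(f"logical CPU id out of range: {cpu}")
    core = cpu % PHYSICAL_CORES
    return {
        "socket": core % SOCKETS,  # ASSUMPTION: even cores on socket 0, odd on 1
        "core": core,
        "sibling": (cpu + PHYSICAL_CORES) % LOGICAL_CPUS,
    }


def cpus_on_socket(socket):
    """List every logical CPU that belongs to the given socket."""
    return [c for c in range(LOGICAL_CPUS) if describe_cpu(c)["socket"] == socket]
```

On Linux, a process could then be pinned to one of these logical CPUs with `os.sched_setaffinity(0, {cpu})`. The real mapping should be read from the machine itself rather than taken from this sketch.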
Standalone performance
[Figure: Data throughput [Mpps] and L3 cache misses ratio versus number of threads (1 to 32) for the standalone implementation. Series: Hyperthreading, Single Socket, Dual Socket.]
• 2 threads: large gap hyperthreaded vs physical cores
• Best performance: 4 threads (dual socket), 8 threads (single/dual)
Click module performance
[Figure: Data throughput [Mpps] and L3 cache misses ratio versus number of threads (1 to 32) for the Click module. Series: Hyperthreading, Single Socket, Dual Socket.]
• 1 thread: same cache miss ratio, half performance
• Best performance: 16 threads
FIB size scaling
[Figure: Data throughput [Mpps] and Cache miss ratio versus number of FIB buckets (2^12 to 2^26). Series: Standalone, 8 threads; Standalone, 4 threads; Click module, 16 threads; Standalone, 1 thread; Click module, 1 thread.]
Conclusions and lessons learned
Conclusions and lessons learned
We present Augustus, a CCN software router which:
• Forwards more than 10 million data packets per second, supports a FIB with up to 2^26 entries, and is able to saturate the 10 Gbit/s link with Ethernet payloads as small as 87 bytes
• Was tested with a thorough, worst-case oriented performance evaluation
• Runs either as a stand-alone system, achieving the best performance, or as a set of elements in the Click modular router framework
• Is open source and can be used in software-based networks for fast and incremental ICN deployment
Lessons learned:
• Manual configuration for best performance
• Abstraction hides critical low level properties
• Complex zero-copy in modular framework
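The 87-byte figure is consistent with a simple line-rate calculation. The sketch assumes the standard per-frame Ethernet cost on the wire: 8 bytes of preamble and start-of-frame delimiter, 14 bytes of header, 4 bytes of FCS and 12 bytes of inter-frame gap. That the authors derived the number this way is an assumption, and `line_rate_pps` is a hypothetical helper.

```python
LINK_BPS = 10_000_000_000            # 10 Gbit/s link
OVERHEAD_BYTES = 8 + 14 + 4 + 12     # preamble+SFD, header, FCS, inter-frame gap


def line_rate_pps(payload_bytes, link_bps=LINK_BPS):
    """Packets per second needed to fill the link with this payload size."""
    wire_bits = (payload_bytes + OVERHEAD_BYTES) * 8
    return link_bps / wire_bits


rate_87 = line_rate_pps(87)   # (87 + 38) * 8 = 1000 bits per frame
print(rate_87)                # prints 10000000.0
```

With 87-byte payloads each frame occupies 1000 bits on the wire, so filling the link takes 10 million packets per second, which matches the forwarding rate quoted above.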
Thanks for your attention
Bibliography
References
[BSM15] Tom Barbette, Cyril Soldani, and Laurent Mathy. Fast userspace packet processing. In Proceedings of the Eleventh ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ANCS ’15, pages 5–16, Washington, DC, USA, 2015. IEEE Computer Society.
[Int] Intel. DPDK: Data plane development kit. http://dpdk.org.
[JST+09] Van Jacobson, Diana K. Smetters, James D. Thornton, Michael F. Plass, Nicholas H. Briggs, and Rebecca L. Braynard. Networking named content. In Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies, CoNEXT ’09, pages 1–12, New York, NY, USA, 2009. ACM.
[KJL+15] Joongi Kim, Keon Jang, Keunhong Lee, Sangwook Ma, Junhyun Shim, and Sue Moon. NBA (Network Balancing Act): A high-performance packet processing framework for heterogeneous processors. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys ’15, pages 22:1–22:14, New York, NY, USA, 2015. ACM.
[KMC+00] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. The Click modular router. ACM Trans. Comput. Syst., 18(3):263–297, August 2000.
[PVL+14] Diego Perino, Matteo Varvello, Leonardo Linguaglossa, Rafael Laufer, and Roger Boislaigue. Caesar: A Content Router for High-speed Forwarding on Content Names. In Proceedings of the Tenth ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ANCS ’14, pages 137–148, New York, NY, USA, 2014. ACM.
[Riz12] Luigi Rizzo. netmap: A novel framework for fast packet I/O. In 21st USENIX Security Symposium (USENIX Security 12), pages 101–112, Bellevue, WA, August 2012. USENIX Association.