Augustus: a CCN router for programmable networks
ACM ICN 2016, Kyoto

Davide Kirchner¹*, Raihana Ferdous²*, Renato Lo Cigno³, Leonardo Maccari³, Massimo Gallo⁴, Diego Perino⁵*, and Lorenzo Saino⁶
September 27, 2016

¹ Google Inc., Dublin, Ireland; ² Create-Net, Trento, Italy; ³ DISI – University of Trento, Italy; ⁴ Bell Labs – Nokia, Paris, France; ⁵ Telefonica Research, Spain; ⁶ Fastly, London, UK

* This work was done while D. Kirchner and R. Ferdous were at the University of Trento, and D. Perino and L. Saino at Bell Labs.

Outline

1. Introduction
2. The Augustus CCN router
3. Performance evaluation
4. Conclusions and lessons learned

Introduction


Objectives

The main goal is to explore the possibilities offered by modern general-purpose hardware in the context of information-centric networking:
• Implement a CCN data-plane forwarder fully in software
• Run on a commodity x86_64 machine
• Performance-oriented, open-source and extensible
• Analyze the performance in a worst-case scenario

Why a software router? Flexibility:
• Quicker development/deployment cycle and (re)configuration
• Hardware can be dynamically allocated to network functions

Tools:
• Off-the-shelf high-performance hardware
• High-speed packet I/O libraries [Int, Riz12]
• Software routing frameworks built on top [BSM15, KJL+15]

Forwarding flow

• Focus on the Content Centric Networking approach [JST+09]
• Interests hold the full content name
• Similar to CCNx (vs NDN)
• CS and PIT: exact match
• Longest-prefix match at the FIB

Example: get /com/updates/sw/v4.2.5.tar.gz at router R2:

[Figure: topology with clients A, B, C and routers R1, R2, R3; R2 has interfaces eth0, eth1, eth2. R2's tables: the Forwarding Information Base (FIB) maps /com/updates → eth0, the Pending Interest Table (PIT) records /com/updates/sw/v4.2.5.tar.gz → {eth1}, and the Content Store (CS) caches /com/updates/sw/v4.2.5.tar.gz → (data…)]


The Augustus CCN router

Design principles

• Exploit parallelism at all possible levels:
  • Hardware multi-queue at the NIC
  • DRAM memory channels
  • Multiple cores on chip
  • Multiple NUMA sockets
• Data structures designed to match the x86 cache system
• Shared read-only FIB, duplicated in all NUMA sockets
• Sharded, thread-private CS and PIT
• Exploit the NIC's Receive Side Scaling capabilities to dispatch incoming packets to threads
• Zero-copy packet processing
• Based on DPDK for fast packet I/O [Int]
• Explored two trade-offs: maximum performance or more flexibility

Design – standalone

Low-level standalone C implementation:
• Based on low-level optimized APIs
• Pushes the platform to its limits
• Architecture based on Caesar [PVL+14]

Design – modular

• Based on (Fast)Click [KMC+00, BSM15]
• Easy to extend and experiment with
• Same optimized data structures
• Can be deployed alongside other routing components

[Figure: Click element pipeline — FromDPDKDevice(n) → InputMux → CheckICNHeader, which splits Interest (I) and Data (D) packets; packets then traverse the ICN_CS, ICN_PIT, and ICN_FIB elements (a CS hit returns cached Data, a PIT hit forwards Data downstream, a FIB hit forwards the Interest, a miss is discarded), ending at OutputDemux → ToDPDKDevice(n)]
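As a rough illustration of the modular deployment, a (Fast)Click configuration wiring elements like those in the figure might look like the fragment below. The element names are taken from the diagram, but the argument lists and port numbering are assumptions, not the actual Augustus element signatures:

```click
// Hypothetical wiring sketch, not the shipped Augustus configuration.
in  :: FromDPDKDevice(0);
hdr :: CheckICNHeader;   // splits Interests (port 0) and Data (port 1)
cs  :: ICN_CS;           // content store: exact match on full name
pit :: ICN_PIT;          // pending interest table: exact match
fib :: ICN_FIB;          // longest-prefix match on name components
out :: ToDPDKDevice(0);

in -> hdr;
hdr[0] -> cs;            // Interest: try the cache first
cs[0]  -> pit;           // CS miss: record the pending Interest
pit[0] -> fib;           // new PIT entry: look up the next hop
fib[0] -> out;           // FIB hit: forward the Interest upstream
cs[1]  -> out;           // CS hit: return cached Data directly
hdr[1] -> pit;           // Data: match against pending Interests
pit[1] -> out;           // PIT hit: forward Data downstream
```

The point of the modular variant is exactly this: elements can be rewired, replaced, or combined with other Click routing components without touching the optimized table implementations.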

Performance evaluation

Experimental setup

• Two twin machines, each with two 10 Gbps Ethernet ports
• Measurements expressed in data packets per second
• Work in slight overload conditions

Worst-case assumptions:
• Every Interest packet has a unique name: no CS hits, no PIT aggregation
• Minimal-sized packets, to stress the forwarding engine

[Figure: testbed — a traffic generator and sink (Interest generator on eth0, echo server on eth1) exchanges interest/data traffic with the Augustus router]

Threads and core mapping

Threads are pinned to processing cores.
Test servers: 2 sockets × 8 cores × 2 (hyperthreading) = 32 logical cores (0–31)

[Figure: cache hierarchy of the two sockets — each physical core hosts two hardware threads (0/16, 2/18, …, 14/30 on one socket; 1/17, 3/19, …, 15/31 on the other) with private L1 instruction/data and L2 caches, and a shared L3 per socket]


Standalone performance

• 2 threads: large gap between hyperthreaded and physical cores
• Best performance: 4 threads (dual socket), 8 threads (single/dual)

[Figure: data throughput (Mpps, top) and L3 cache-miss ratio (bottom) vs. number of threads (1–32), for hyperthreading, single-socket, and dual-socket placements]

Click module performance

• 1 thread: same cache-miss ratio, half the performance
• Best performance: 16 threads

[Figure: data throughput (Mpps, top) and L3 cache-miss ratio (bottom) vs. number of threads (1–32), for hyperthreading, single-socket, and dual-socket placements]

FIB size scaling

[Figure: data throughput (Mpps, top) and cache-miss ratio (bottom) vs. number of FIB buckets (2^12 to 2^26), comparing Standalone with 1, 4, and 8 threads and the Click module with 1 and 16 threads]

Conclusions and lessons learned


Conclusions and lessons learned

We presented Augustus, a CCN software router which:
• Forwards more than 10 million data packets per second, supports a FIB with up to 2^26 entries, and saturates a 10 Gbit/s link with Ethernet payloads as small as 87 bytes
• Was tested with a thorough, worst-case-oriented performance evaluation
• Runs either as a stand-alone system, achieving the best performance, or as a set of elements in the Click modular router framework
• Is open source and can be used in software-based networks for fast and incremental ICN deployment

Lessons learned:
• Manual configuration is needed for the best performance
• Abstraction hides critical low-level properties
• Zero-copy processing is complex in a modular framework


Thanks for your attention [email protected]

Bibliography

[BSM15] Tom Barbette, Cyril Soldani, and Laurent Mathy. Fast userspace packet processing. In Proceedings of the Eleventh ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ANCS '15, pages 5–16, Washington, DC, USA, 2015. IEEE Computer Society.

[Int] Intel®. DPDK: Data Plane Development Kit. http://dpdk.org.

[JST+09] Van Jacobson, Diana K. Smetters, James D. Thornton, Michael F. Plass, Nicholas H. Briggs, and Rebecca L. Braynard. Networking named content. In Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies, CoNEXT '09, pages 1–12, New York, NY, USA, 2009. ACM.

[KJL+15] Joongi Kim, Keon Jang, Keunhong Lee, Sangwook Ma, Junhyun Shim, and Sue Moon. NBA (Network Balancing Act): A high-performance packet processing framework for heterogeneous processors. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys '15, pages 22:1–22:14, New York, NY, USA, 2015. ACM.

[KMC+00] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263–297, August 2000.

[PVL+14] Diego Perino, Matteo Varvello, Leonardo Linguaglossa, Rafael Laufer, and Roger Boislaigue. Caesar: A content router for high-speed forwarding on content names. In Proceedings of the Tenth ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ANCS '14, pages 137–148, New York, NY, USA, 2014. ACM.

[Riz12] Luigi Rizzo. netmap: A novel framework for fast packet I/O. In 21st USENIX Security Symposium (USENIX Security 12), pages 101–112, Bellevue, WA, August 2012. USENIX Association.
