mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems

EunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm*, Dongsu Han, and KyoungSoo Park
KAIST

* Princeton University

The Need for Handling Many Short Flows

• Middleboxes: SSL proxies, network caches
• End systems: web servers

[Figure: CDF of flow count vs. flow size (4 KB to 1 MB) for 7 days of commercial cellular traffic; the distribution is heavily skewed toward small flows (annotations at 61% and 91%). Source: Comparison of Caching Strategies in Modern Cellular Backhaul Networks, MobiSys 2013]

Unsatisfactory Performance of Linux TCP
• Large flows: easy to fill up 10 Gbps
• Small flows: hard to fill up 10 Gbps regardless of the number of cores
  – Too many packets: 14.88 Mpps for 64 B packets on a 10 Gbps link
  – The kernel is not designed well for multicore systems

[Figure: TCP connection setup performance, connections/sec (x 10^5, 0 to 2.5) vs. number of CPU cores (1, 2, 4, 6, 8), annotated "performance meltdown". Setup: Linux 3.10.16, Intel Xeon E5-2690, Intel 10 Gbps NIC]
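As a sanity check on the 14.88 Mpps figure: on the wire, a minimum-size 64 B Ethernet frame also occupies an 8 B preamble and a 12 B inter-frame gap, i.e. 84 B (672 bits) per packet, so a 10 Gbps link carries at most

$$\frac{10 \times 10^{9}\ \text{bit/s}}{(64 + 8 + 12)\ \text{B} \times 8\ \text{bit/B}} = \frac{10^{10}}{672} \approx 14.88\ \text{Mpps}.$$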

Kernel Uses the Most CPU Cycles
CPU usage breakdown of a web server (Lighttpd) serving a 64-byte file on Linux 3.10:
• Application: 17%
• TCP/IP: 34%
• Kernel (without TCP/IP): 45%
• Packet I/O: 4%
83% of CPU usage (34% + 45% + 4%) is spent inside the kernel!

Performance bottlenecks (each removed by mTCP):
1. Shared resources
2. Broken locality
3. Per-packet processing

With these bottlenecks removed:
1) Efficient use of CPU cycles for TCP/IP processing: 2.35x more CPU cycles for the application
2) 3x to 25x better performance

Inefficiencies in the Kernel from Shared Resources
1. Shared resources
  – Shared listening queue: all cores contend on its lock
  – Shared file descriptor space: linear search for an empty slot

[Diagram: Receive-Side Scaling (H/W) already gives Cores 0 to 3 per-core packet queues, yet all four cores still serialize on a single locked listening queue and one shared file descriptor space]
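To see why the shared listening queue becomes the bottleneck, consider the conventional multi-threaded accept pattern (standard POSIX sockets, not mTCP code): every worker thread calls accept() on the same listening socket, so all cores funnel through the kernel's single listening queue and its lock.

```c
/* Toy multi-threaded accept loop: all threads share ONE listening socket,
 * so every accept() contends on the kernel's single listening queue. */
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_WORKERS 4

static int listen_fd;   /* shared by all workers: the bottleneck */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int c = accept(listen_fd, NULL, NULL);  /* serialized across cores */
        if (c < 0)
            continue;
        /* ... serve the connection ... */
        close(c);
    }
    return NULL;
}

int main(void)
{
    struct sockaddr_in addr;
    pthread_t tid[NUM_WORKERS];

    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 4096);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```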

Inefficiencies in the Kernel from Broken Locality
2. Broken locality
  – The core that handles the interrupt is not the core that accepts the connection

[Diagram: Receive-Side Scaling (H/W) delivers a flow's packets to one core's packet queue, where the interrupt is handled, while the application's accept(), read(), and write() for that connection may run on a different core (Cores 0 to 3)]

Inefficiencies in the Kernel from Lack of Support for Batching
3. Per-packet, per-system-call processing

[Diagram: the application thread crosses the user/kernel boundary for every accept(), read(), and write() (inefficient per-system-call processing, frequent mode switching, cache pollution); below the BSD socket and Linux epoll layers, the kernel TCP stack and packet I/O process packets one at a time, with per-packet memory allocation]
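For reference, this is what the per-system-call processing shown above looks like with the standard Linux API. epoll_wait() already returns events in batches, but every accept(), read(), and write() is still a separate system call, i.e. a user/kernel mode switch per event; this is the cost that mTCP amortizes by batching.

```c
/* A conventional epoll-based server loop (standard Linux API, not mTCP).
 * The epoll instance `ep` is assumed to have been created by the caller
 * (e.g. with epoll_create1). Each marked call is a separate kernel crossing. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 1024

void event_loop(int listen_fd, int ep)
{
    struct epoll_event ev, events[MAX_EVENTS];
    char buf[4096];

    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);        /* system call */

    for (;;) {
        int n = epoll_wait(ep, events, MAX_EVENTS, -1);  /* system call (batched) */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                int c = accept(listen_fd, NULL, NULL);   /* system call per connection */
                ev.events = EPOLLIN;
                ev.data.fd = c;
                epoll_ctl(ep, EPOLL_CTL_ADD, c, &ev);    /* system call */
            } else {
                ssize_t r = read(fd, buf, sizeof(buf));  /* system call per event */
                if (r > 0)
                    write(fd, buf, (size_t)r);           /* system call per event */
                else
                    close(fd);                           /* system call */
            }
        }
    }
}
```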

Previous Works on Solving Kernel Complexity

                           Listening queue | Connection locality | App <-> TCP comm.   | Packet I/O | API
Linux-2.6                  Shared          | No                  | Per system call     | Per packet | BSD
Linux-3.9 (SO_REUSEPORT)   Per-core        | No                  | Per system call     | Per packet | BSD
Affinity-Accept            Per-core        | Yes                 | Per system call     | Per packet | BSD
MegaPipe                   Per-core        | Yes                 | Batched system call | Per packet | Custom

Still, 78% of CPU cycles are spent inside the kernel!
How much performance improvement can we get if we implement a user-level TCP stack with all of these optimizations?
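For concreteness, the SO_REUSEPORT row corresponds to kernel support (Linux 3.9 and later) for giving each core its own listening socket on the same port, which removes the shared listening queue but, as the table shows, still leaves connection locality, per-system-call communication, and per-packet I/O unaddressed. A minimal sketch of one per-core worker using that option (standard Linux APIs, not mTCP code; the port number is arbitrary):

```c
#define _GNU_SOURCE
/* One worker per core: its own SO_REUSEPORT listening socket plus CPU pinning,
 * so accept() no longer serializes on a single shared listening queue. */
#include <netinet/in.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void *per_core_worker(void *arg)
{
    int core = (int)(long)arg;

    /* Pin this thread to its core. */
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(core, &cpus);
    pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);

    /* Each core creates its OWN listening socket on the same port. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 4096);

    for (;;) {
        int c = accept(fd, NULL, NULL);   /* no cross-core lock contention */
        if (c >= 0)
            close(c);                     /* ... serve the connection ... */
    }
    return NULL;
}
```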

Clean-slate Design Principles of mTCP
• mTCP: a high-performance user-level TCP stack designed for multicore systems
• Clean-slate approach that divorces TCP processing from the kernel's complexity

Problems
1. Shared resources
2. Broken locality
3. Lack of support for batching

Our contributions
• Each core works independently: no shared resources, resource affinity
• Batching throughout flow processing, from packet I/O to the user API
• Easily portable APIs for compatibility

Overview of mTCP Architecture

[Diagram: on each core (Core 0, Core 1), an application thread using the mTCP socket and mTCP epoll APIs is paired with an mTCP thread; all mTCP threads run on the user-level packet I/O library (PSIO), and only the NIC device driver remains at kernel level. Callouts 1 to 3 mark the three design points below.]

1. Thread model: pairwise, per-core threading
2. Batching from packet I/O to the application
3. mTCP API: easily portable API (BSD-like)

PSIO: [SIGCOMM'10] PacketShader: A GPU-accelerated software router, http://shader.kaist.edu/packetshader/io_engine/index.html
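To make design point 3 (an easily portable, BSD-like API) concrete, here is a sketch of one per-core application thread written against the mTCP API. The function names (mtcp_create_context, mtcp_socket, mtcp_epoll_wait, and so on) follow the public mTCP API, but the exact signatures, constants, and setup details are approximated, so treat this as an illustration rather than the definitive interface; mtcp_init() with a configuration file is assumed to have been called once in main() before the per-core threads start.

```c
/* Sketch of one per-core application thread using the mTCP API
 * (signatures approximated; see the mTCP paper/repository for the
 * authoritative definitions). */
#include <mtcp_api.h>
#include <mtcp_epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 1024

void *app_thread(void *arg)
{
    int core = (int)(long)arg;

    /* 1. Pin this thread and create the per-core mTCP context; this also
     *    spawns the paired mTCP thread on the same core. */
    mtcp_core_affinitize(core);
    mctx_t mctx = mtcp_create_context(core);

    /* 2. BSD-like socket setup, but against the per-core mTCP context. */
    int ep = mtcp_epoll_create(mctx, MAX_EVENTS);
    int lfd = mtcp_socket(mctx, AF_INET, SOCK_STREAM, 0);
    /* mtcp_bind(mctx, lfd, ...); */
    mtcp_listen(mctx, lfd, 4096);

    struct mtcp_epoll_event ev, events[MAX_EVENTS];
    ev.events = MTCP_EPOLLIN;
    ev.data.sockid = lfd;
    mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, lfd, &ev);

    /* 3. Event loop: these calls hand work to the per-core mTCP thread,
     *    not the kernel, and events are delivered in batches. */
    char buf[4096];
    for (;;) {
        int n = mtcp_epoll_wait(mctx, ep, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int sock = events[i].data.sockid;
            if (sock == lfd) {
                int c = mtcp_accept(mctx, lfd, NULL, NULL);
                ev.events = MTCP_EPOLLIN;
                ev.data.sockid = c;
                mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, c, &ev);
            } else {
                int r = mtcp_read(mctx, sock, buf, sizeof(buf));
                if (r > 0)
                    mtcp_write(mctx, sock, buf, r);
                else
                    mtcp_close(mctx, sock);
            }
        }
    }
    return NULL;
}
```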

1. Thread Model: Pairwise, Per-core Threading

[Diagram: on each core (Core 0, Core 1), an application thread using mTCP socket / mTCP epoll is paired with an mTCP thread; each core has its own file descriptor space, listening queue, and packet queue, fed by Symmetric Receive-Side Scaling (H/W) through the user-level packet I/O library (PSIO); only the device driver runs at kernel level]
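"Symmetric" Receive-Side Scaling means the NIC's flow hash is invariant to swapping source and destination, so both directions of a TCP connection land in the same core's queue and no connection state has to be shared across cores. The NIC computes this in hardware; the toy hash below only illustrates the symmetry property and is not mTCP's actual mechanism.

```c
/* Illustrative only: a toy symmetric flow hash. A hash that is invariant
 * under swapping (src, dst) sends both directions of a connection to the
 * same per-core queue, which is the property mTCP relies on from
 * symmetric RSS. Real NICs compute a Toeplitz hash in hardware. */
#include <stdint.h>

static uint32_t symmetric_flow_hash(uint32_t src_ip, uint32_t dst_ip,
                                    uint16_t src_port, uint16_t dst_port)
{
    /* XOR is commutative, so hash(a, b) == hash(b, a). */
    uint32_t ip_key   = src_ip ^ dst_ip;
    uint32_t port_key = (uint32_t)(src_port ^ dst_port);
    return ip_key * 2654435761u ^ port_key * 40503u;   /* simple mixing */
}

/* Pick the core (RSS queue) for any packet of this flow. */
static int rss_queue_for_flow(uint32_t sip, uint32_t dip,
                              uint16_t sp, uint16_t dp, int num_cores)
{
    return (int)(symmetric_flow_hash(sip, dip, sp, dp) % (uint32_t)num_cores);
}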

From System Call to Context Switching

[Diagram comparing the two designs. Linux TCP: the application thread issues system calls that cross the user/kernel boundary into the BSD socket and Linux epoll interfaces, the kernel TCP stack, and packet I/O, down to the NIC device driver. mTCP: the application thread instead context-switches to the paired mTCP thread in user space, which implements mTCP socket / mTCP epoll, the TCP stack, and the user-level packet I/O library; only the NIC device driver stays in the kernel]
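The mTCP side of this comparison can be pictured as two user-level threads on the same core handing each other batches of work through shared memory. The sketch below is a conceptual illustration only, not mTCP's actual data structures (mTCP uses lock-free per-core queues and buffers); it only shows the structure: requests accumulate in a shared queue and are handed over in batches, so one wakeup and thread switch covers many events instead of one system call per event.

```c
/* Conceptual sketch (NOT mTCP's actual implementation): an application
 * thread and a TCP thread on the same core exchange batches of requests
 * through a shared in-memory queue. A mutex/condvar pair is used here only
 * to keep the example short. */
#include <pthread.h>

#define BATCH 64

struct batch_queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             reqs[BATCH];
    int             count;
};

/* Application side: queue up a whole batch, then wake the TCP thread once. */
void submit_batch(struct batch_queue *q, const int *reqs, int n)
{
    pthread_mutex_lock(&q->lock);
    for (int i = 0; i < n && q->count < BATCH; i++)
        q->reqs[q->count++] = reqs[i];
    pthread_cond_signal(&q->nonempty);   /* one wakeup for many requests */
    pthread_mutex_unlock(&q->lock);
}

/* TCP-thread side: drain everything that accumulated while it was running. */
int drain_batch(struct batch_queue *q, int *out)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->lock);
    int n = q->count;
    for (int i = 0; i < n; i++)
        out[i] = q->reqs[i];
    q->count = 0;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```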
