mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems

EunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong,
Sunghwan Ihm*, Dongsu Han, and KyoungSoo Park
KAIST
* Princeton University
The Need for Handling Many Short Flows
• Middleboxes – SSL proxies, network caches
• End systems – Web servers
[Figure: CDF of flow sizes (x-axis: flow size in bytes, 0–1M; y-axis: CDF of flow count) for commercial cellular traffic captured over 7 days. 61% of flows are smaller than 4 KB and 91% are smaller than 32 KB. Source: "Comparison of Caching Strategies in Modern Cellular Backhaul Networks," MobiSys 2013.]
Unsatisfactory Performance of Linux TCP

• Large flows: easy to fill up 10 Gbps
• Small flows: hard to fill up 10 Gbps regardless of the number of cores
  – Too many packets: 14.88 Mpps for 64B packets on a 10 Gbps link (see the arithmetic below)
  – The kernel is not designed well for multicore systems
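For reference, the 14.88 Mpps figure is simply the 10 Gbps line rate divided by the minimum on-wire frame size (64 B frame + 8 B preamble + 12 B inter-frame gap):

    10 × 10^9 b/s ÷ ((64 + 8 + 12) B × 8 b/B) = 10^10 / 672 ≈ 14.88 M packets/s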
[Figure: TCP connection setup performance. y-axis: connections/sec (× 10^5), 0–2.5; x-axis: number of CPU cores, 1–8. Setup: Linux 3.10.16, Intel Xeon E5-2690, Intel 10 Gbps NIC. Annotation: "Performance meltdown" — connection accept throughput stops scaling and degrades as cores are added.]
Kernel Uses the Most CPU Cycles

CPU usage breakdown of a web server (Lighttpd) serving a 64-byte file on Linux 3.10:
  – Application: 17%
  – TCP/IP: 34%
  – Kernel (without TCP/IP): 45%
  – Packet I/O: 4%
→ 83% of CPU usage is spent inside the kernel!

Performance bottlenecks
1. Shared resources
2. Broken locality
3. Per-packet processing

Bottlenecks removed by mTCP
1) Efficient use of CPU cycles for TCP/IP processing → 2.35x more CPU cycles for the application
2) 3x ~ 25x better performance
Inefficiencies in Kernel from Shared FD

1. Shared resources
   – Shared listening queue
   – Shared file descriptor space

[Figure: Cores 0–3 each get their own packet queue from Receive-Side Scaling (H/W), yet all of them contend for a single lock-protected listening queue and a shared file descriptor space that requires a linear search to find an empty slot.]
Inefficiencies in Kernel from Broken Locality

2. Broken locality
   – Interrupt-handling core != accepting core

[Figure: Receive-Side Scaling (H/W) delivers a flow's packets to the per-core packet queue of one core, which handles the interrupt, while accept()/read()/write() for that flow run on a different core.]
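Restoring locality means keeping accept()/read()/write() on the core that received the flow's packets. As a hypothetical illustration outside the slides, pinning each worker thread to a fixed core is the usual first step (mTCP does the equivalent internally with its per-core threads):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to one core so that connection processing
     * stays where RSS steered the flow (error handling omitted). */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }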
Inefficiencies in Kernel from Lack of Support for Batching

3. Per-packet, per-system-call processing

[Figure: the application thread sits in user space and calls accept()/read()/write() across the user/kernel boundary into the BSD socket and Linux epoll layers — inefficient per-system-call processing, frequent mode switching, and cache pollution; below them, the kernel TCP layer and packet I/O perform inefficient per-packet processing with per-packet memory allocation.]
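For concreteness, a minimal sketch (not from the talk) of the conventional epoll-based server loop being criticized: every accepted connection and every read/write is its own system call, so short flows pay the mode-switch tax over and over (error handling trimmed for brevity).

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void event_loop(int listen_fd)
    {
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            struct epoll_event events[1024];
            int n = epoll_wait(ep, events, 1024, -1);          /* 1 syscall            */
            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    int c = accept(listen_fd, NULL, NULL);      /* 1 syscall per conn   */
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = c };
                    epoll_ctl(ep, EPOLL_CTL_ADD, c, &cev);      /* 1 syscall per conn   */
                } else {
                    char buf[4096];
                    ssize_t r = read(events[i].data.fd, buf, sizeof(buf));  /* 1 syscall */
                    if (r > 0)
                        write(events[i].data.fd, buf, r);       /* 1 syscall            */
                    else
                        close(events[i].data.fd);
                }
            }
        }
    }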
Previous Works on Solving Kernel Complexity

                              Listening queue   Connection locality   App–TCP comm.         Packet I/O   API
    Linux-2.6                 Shared            No                    Per system call       Per packet   BSD
    Linux-3.9 (SO_REUSEPORT)  Per-core          No                    Per system call       Per packet   BSD
    Affinity-Accept           Per-core          Yes                   Per system call       Per packet   BSD
    MegaPipe                  Per-core          Yes                   Batched system call   Per packet   custom
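To make the Linux-3.9 row concrete, a hedged sketch of per-core listening sockets with SO_REUSEPORT: each worker thread opens its own socket bound to the same port, so the kernel keeps one accept queue per socket instead of a single shared queue (error checking omitted).

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>

    /* Requires Linux >= 3.9 for SO_REUSEPORT. Call once per worker thread. */
    static int open_percore_listener(int port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in addr;

        /* SO_REUSEPORT lets every worker bind the same address:port. */
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 4096);
        return fd;   /* each core accept()s only from its own queue */
    }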
Still, 78% of CPU cycles are used in the kernel! How much performance improvement can we get if we implement a user-level TCP stack with all of these optimizations?
Clean-slate Design Principles of mTCP

• mTCP: a high-performance user-level TCP stack designed for multicore systems
• Clean-slate approach that divorces the application from the kernel's complexity

Problems                            Our contributions
1. Shared resources                 Each core works independently
2. Broken locality                    – No shared resources
                                      – Resource affinity
3. Lack of support for batching     Batching in flow processing, from packet I/O up to the user API
                                    Easily portable APIs for compatibility
Overview of mTCP Architecture

[Figure: two cores shown. On each core, an application thread (Application Thread 0/1) talks to its own mTCP thread (mTCP thread 0/1) through the mTCP socket and mTCP epoll interfaces (1, 3); the mTCP threads run on top of a user-level packet I/O library (PSIO) (2), which sits above the NIC device driver. Only the driver is kernel-level; everything above it is user-level.]

1. Thread model: pairwise, per-core threading
2. Batching from packet I/O to application
3. mTCP API: easily portable API (BSD-like)

• [SIGCOMM'10] PacketShader: a GPU-accelerated software router, http://shader.kaist.edu/packetshader/io_engine/index.html
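A minimal sketch of one per-core mTCP worker, assuming the BSD-like mtcp_* API described on this slide (names follow the public mTCP release's mtcp_api.h / mtcp_epoll.h; exact signatures may differ slightly): each core pairs an application thread with its own mTCP thread and context, and events are drained in batches without user/kernel mode switches.

    #include <mtcp_api.h>
    #include <mtcp_epoll.h>
    #include <sys/socket.h>

    void *worker(void *arg)
    {
        int core = *(int *)arg;

        mtcp_core_affinitize(core);                 /* pin app thread to its core    */
        mctx_t mctx = mtcp_create_context(core);    /* spawns the paired mTCP thread */

        int ep  = mtcp_epoll_create(mctx, 4096);
        int lfd = mtcp_socket(mctx, AF_INET, SOCK_STREAM, 0);
        /* ... mtcp_bind()/mtcp_listen() on lfd, register lfd with MTCP_EPOLLIN ... */

        struct mtcp_epoll_event events[4096];
        for (;;) {
            /* Events arrive in batches from this core's private queues. */
            int n = mtcp_epoll_wait(mctx, ep, events, 4096, -1);
            for (int i = 0; i < n; i++) {
                int s = events[i].data.sockid;
                if (s == lfd) {
                    int c = mtcp_accept(mctx, lfd, NULL, NULL);
                    struct mtcp_epoll_event cev;
                    cev.events = MTCP_EPOLLIN;
                    cev.data.sockid = c;
                    mtcp_epoll_ctl(mctx, ep, MTCP_EPOLL_CTL_ADD, c, &cev);
                } else {
                    char buf[4096];
                    int r = mtcp_read(mctx, s, buf, sizeof(buf));
                    if (r > 0) mtcp_write(mctx, s, buf, r);
                    else       mtcp_close(mctx, s);
                }
            }
        }
        return NULL;
    }

The point of the pairwise threading model is that the context returned by mtcp_create_context() is private to the core, so no locks or shared descriptor tables are needed across workers.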