Limits to Low-Latency Communication on High-Speed Networks

Limits to Low-Latency Communication High-Speed Networks CHANDRAMOHAN A. THEKKATH University of Washington, Seattle The throughput of local ATM net...
Author: Crystal Russell
1 downloads 0 Views 2MB Size
Limits to Low-Latency Communication High-Speed Networks CHANDRAMOHAN A. THEKKATH University of Washington, Seattle

The throughput

of local

ATM

networks

Other

network

few years. often

and

mented this

system

procedure

call

to several

hardware

platforms

to achieving

low

system,

Our

RPC

network We

(ATM,

impact

latency,

and

which

networks

demonstrate

and

that

controllers

RPC

times

more

than

ever

to achieving

Subject

Descriptors:

crucial

Categories

and

put/Output

to near

Devices—channels

Network

Protocols—protocol

tributed

Systems—distributed

tems]:

Systems]:

General Additional

Key

munication,

This

Design,

work Digital Intel

Authors’

WA

Permission not made of the

with

us to

respect

architecture

and

(170

pseconds

those

and

on an ATM

components

software

design

network

of next-

communication.

and

can

controller

reduce design

and

Data

Communications]

[Computer-Communication operating

sending,

Networks]:

systems;

network

In-

Networks]:

D.4.4

Dis-

[Operating

communication;

Sys-

D.4.7

[Oper-

systems

Performance networks,

Systems

level

in part

of Washington,

host-network

interfaces,

interprocess

com-

protocols

Science

Foundation

by the

Research

Computer.

is supported

University

host

of low-latency

making

Output

CCR-9200832,

and Apple

H. Levy

under

Washington Center

C. Thekkath by a Fullbright Department

and

grants

Technology External

is supported research

Research

in part award

of Computer

CCR-8619663, Center,

Science

and

by

Program,

by a fellowship

and by the INRIA. and Engineering,

98195. to copy without

or distributed

publication

Association specific

Corp.,

Corp.

differ-

allows

communication.

network

transport

Corporation

several

systems

C.2.2 [Computer-Communication

by the National

and

impleported

system

the

times

way

technology limits,

[Input/

ATM

calls,

in part

Equipment

address:

Seattle,

Phrases:

CCR-8907666,

Hewlett-Packard from

and

procedure

was supported

CCR-8703049, the

Words

remote

and

We then

with

these

to isolate

in the

Design—dwtributed

Measurement,

call

us

stand

Management—message and

latency.

SPARCstation)

controller,

reduced

C.2.4

is

high-speed

software.

low-latency

applications,

in a

throughput

we designed

low

Comparing

network

processor

B.4.2

of magnitude

of newer

in the communication

allow

and controllers;

Organization

Terms:

the

network-imposed truly

effects

of new

of Ethernets.

increased

software,

and

runtime

still

architecture;

Communications

ating

that

new-generation

small-packet

and

designs

hosts),

that

order

than

at providing

(DECstation

substantially

5000/200

the bandwidth than

communications.

and Ethernet).

network,

rather

system-level

targeted

user-level

achieves

DECstation

system FDDI,

the and

the

hardware

of alternative

e.g.,

the kernel

system,

using

generation

both

example, greater

of yet another

latency

cross-machine

of influences,

and controllers

increase

lowered

examines

on low-latency,

the performance

cache

paper

For

of magnitude

a bandwidth

remote

ent networks

increasing.

is an order

systems,

This

a number

a new

explore

promise

concern.

and HENRY M. LEVY

is rapidly

rings

in distributed

technologies

To evaluate

networks

token

technologies

However,

of primary

network

area

FDDI

on

and

fee all or part

for direct its

for Computing

date

of this

commercial

appear,

Machinery.

and

material

is granted

advantage,

the ACM

notice

is given

To copy otherwise,

that

provided copyright

copying

or to republish,

that notice

the copies

is by permission requires

are

and the title of the

a fee and/or

permission.

!3 1993 ACM

0734-2071/93/0500-0179 ACM

Transactions

$01.50 on Computer

Systems,

Vol. 11, No. 2, May 1993, Pages 179-203.

180

.

C. A, Thekkath and H M. Levy

1. IINTRODUCTION In

modern

computer

systems, slow communication links and softwarehave imposed a stiff penalty for cross-machine commucosts limit the extent to which a network of computers

processing overheads nication. These high can

be used

related

Several way

as a solution

problems recent

in which

for

either

structuring-related

developments,

networks

improvement over Ethernet ment to 1 gigabit per second

however,

have

the

potential

such as FDDI (ATM), offer

expected

the bandwidth

Taken

together,

communication

lying

on

these

system

technologies

systems

other

These

operating system system mechanisms reduce the software

the

that

should

can

exploit

of networked

nodes

[2], and B-ISDN [34] a 10-fold bandwidth

could

be

improve-

have already given us power, with 100 MIPS

processors

will

be capable

of

provide. research has led to extremely low [4, 22, 27, 33]. Such low-overhead latency that has typically limited

primitives.

be possible, example, consider

[ 11]. In a cluster idle

the year.

new networks

of operating

would not otherwise As a motivating server

within that

—Performance-oriented overhead operating mechanisms greatly the performance

to change

[23]. Another order-of-magnitude seems close behind.

—New processor technologies and RISC architectures an order-of-magnitude improvement in processing using

performance-

are used:

—New local area network technologies, using Asynchronous Transfer Mode

performance

or

in applications.

the

allow novel design

us to produce distribution

that

of a distributed-memory

workstations,

paging

much

than

faster

low-latency schemes

to spare paging

to

memory disk.

A

distributed server could manage the idle memory in the network; on a page fault, the page fault handler would communicate with the server and transfer pages to or from remote nodes. The effectiveness of this scheme is highly influenced by the communication latency between client and server. As another example, reduced-latency communication would greatly facilitate the use of networked workstations as a loosely coupled multiprocessor or shared virtual-memory system [7, 21]. Such configurations would have significant cost performance, flexibility, and scalability benefits over tightly coupled systems or over dedicated message-based multiprocessors (e.g., cubes). These systems are characterized by frequent exchanges of small synchronization packets and data transfers—attributes that are well served by low communication latencies. The common thread in these examples is that latency and CPU overhead are at least as critical as network throughput. With the advent of very high-bandwidth networks, the throughput problem relatively more tractable, leaving latency as the preeminent issue. 1.1 Paper

Organization

The objectives (1) To evaluate ACM

TransactIons

of this

and Goals paper

are threefold:

the fundamental, on Computer

Systems,

underlying Vol.

11. No. 2, May

message 1993,

costs for small-packet

is

Low-Latency Communication communication

on new-technology

on High-Speed

networks,

when

Networks

compared

181

. with

Ether-

net technology. (2)

To describe and evaluate the the face of different networks,

design of a new low-latency RPC controllers, and architectures.

(3) To draw lessons for network and ence. We believe (as do others) controllers are often poorly tributed systems, particularly Overall,

our

generation interface.

objective

is

to

take

system-level

of modern disRPC systems.

view

of RPC

FDDI,

connected

by an Ethernet

and

ATM

applications;

(2) general

new-

local

area

networks,

and

on the

combinations, connected by SparcStation

I

network.

We have used Remote Procedure model for several reasons: (1) it known [5, 27]; communication

on

in particular the software and hardware designed, implemented, and measured a

RPC on several different network and workstation on the MIPS R3000-based DECstation 5000/200

Ethernet,

cantly;

a

in

design based on our experinew generation of network

matched to the demands in the face of low-overhead

networks, examining To do this, we have

low-latency specifically,

controller that the

system

Call (RPC) as the remote is a dominant paradigm

techniques

for achieving

communications for distributed

good performance

are well

(3) it is user-to-user, in contrast to kernel-to-kernel memory that tends to underestimate communication latencies signifi-

and (4) it is close in spirit

to other

communication

models,

such

as V

our observations widely applicable. Our objective is [81 or CSP [17], making not to attempt a completely new RPC design, but rather to assess the latency impact

of modern

required

the

networks,

design

and

controllers,

processors,

implementation

and software.

of a new

RPC

system,

To do this because

existing high-performance RPC systems (such as SRC RPC [27]) do not run on the experimental hardware base we required. The paper is organized around the objectives listed previously. Section 2 describes the network cost for cross-machine ines the RPC design ment

of our RPC

of the controller nication existing

principles

system

Complex appear

that we used and details the minimum in these environments. Section 3 exam-

that

make

on a variety

has a significant

overheads. machines

environments communication

low-latency

RPC feasible.

of controllers

influence controllers

indicates

on efficient and

to add to the overall

how

cross-machine

host/controller overhead

Measurethe design commu-

interfaces

on

in two ways—latency

inherent in the controller and latency required by the host software to service the controller. Finally, Section 4 examines in greater detail the implications of low-latency communication for networks, controllers, and operating system abstractions. For example, we show the latency effects interrupt handling, and data transfer techniques. Our

experience

demonstrates

that

new-generation

of cache management, processor

technology

and performance-oriented software design can reduce RPC times for small packets to near networkand architecture-imposed limits. For example, using DECstation 5000/200 workstations on an ATM network, we achieve a simple user-to-user that RPC

round-trip RPC in 170 microseconds. Our experiments software overhead contributed by marshaling, stubs, ACM

Transactions

on Computer

Systems,

suggest and the

Vol. 11, No. 2, May 1993

182

C. A, Thekkath and H. M, Levy

.

packet exchange protocol need not be the bottleneck to low-latency remote communication on modern microprocessors. However, as software overheads decrease, network controller design becomes more crucial to achieving truly low-latency communication. 2. LOWER

BOUNDS

This

section

from

two important

FOR CROSS-MACHINE

evaluates

the

overhead

sources:

COMMUNICATION

added

to cross-machine

(1) the network

hardware

(the

communications link) and (2) the controller, memory, This allows us to examine our RPC software (the stubs, protocol,

and

runtime

support)

costs for cross-machine network and processor nents

may

change;

relative

to the

by isolating

the

lower-bound

components,

to extend

networks,

and

other

and

with

ware

and

explain

those

second

measurements,

to compare

Ethernet. our

In the

first

different

experimental

past

we briefly

testbed

2.1 Overview of the Networking subsection

capabilities Ethernet. which controller

is

summarizes

and

The [1].

network

Ethernet

accessed

on

However,

networks

controller

DECstations

the

[ 18, 27]. newer

However, technology

networks

with

characterize

measurement

each

our hardmethodology.

controller and network communication.

controller

we used and the particular

used to access each network.

is a 10 Mbit/see

our

the perfor-

Environment

the various

of the specific

Given different of these compo-

to examine

Then we analyze our performance results and discuss issues that specifically impact latency in cross-machine

This

message-passing

we can see how

high-speed

following

and the

and CPU interfaces. RPC packet exchange

communication on our hardware. technologies, the relative importance

mance of each layer scales with technology change. Similar measurements have been reported in the we wish

communication controller

CSMA/CD and

local

SparcStations

is packaged

area by

network, a LANCE

differently

on the

two

machines. On the DECstations, the controller cannot do DMA to or from host memory; instead, it contains a 128-Kbyte on-board packet buffer memory into which the host places a correctly formatted packet for transmission [ 14]. Similarly, the controller places incoming packets in its buffer memory for the host to copy. In the case of the SparcStation, the controller uses DMA to transfer

data

to and from

host memory.

In this

case, a cache flush

is done on receives to remove old data from the cache. packets are described by special descriptors initialized by send) or the controller (on receive). Descriptors are kept the SparcStations while they are in the special on-board

operation

On both machines, either the host (on in host memory on packet memory on

the DECstations. Two message sizes were used in our experiments, a minimum-sized (60 bytes) send and receive, and a maximum-sized (1514 bytes) send and receive. FDDI.

FDDI

is a 100-Mbit/sec

fiber

token

tion by the DEC FDDI controller. Like the the FDDI controller cannot perform DMA ACM

TransactIons

on Computer

Systems,

Vol

11, No

ring,

accessed

DECstation from host

2, May

1993.

on the DECsta-

Ethernet memory

controller, on message

Low-Latency Communication transmission;

instead,

ever, the FDDI

it relies

controller

on an on-board

can perform

on reception of a packet from share descriptors as described an unloaded

private

token

passing

ment,

token-passing

Packet with

FDDI

is kept

DMA

ring

delay

packet

transfers

Networks buffer

with

two

would

have

1514 bytes

hosts.

the

overhead

in a more

to be added

were

chosen

How-

to host memory

and host software Ethernet. We used

Thus,

minimum;

183

.

memory.

directly

the network. The controller above for the DECstation

to an absolute

sizes of 60 and

on High-Speed

due

realistic

to the

to facilitate

overall

direct

to

environlatency.

comparison

the Ethernet.

ATM.

ATM

munication exchanged order

(Asynchronous

standard

Transfer

Mode)

used to implement

between

entities

of a few tens

in fixed-length

of bytes.

is an international

B-ISDN.

An ATM

In an ATM

parcels network

called

telecom-

network,

data

cells, usually

typically

consists

is

on the

of a set of

hosts connected by a mesh of switches that form the network. In an ATM network, user-level data is segmented into cells, routed, and then reassembled at the destination The particular ATM

using header information we used has 140-Mbit/sec

contained in the cells. fiber optic links and

cell

sizes of 53 bytes and is accessed using FORE Systems’ ATM controller [16]. The controllers on the two DECstation hosts were directly connected without going through a switch; thus there is no switch delay. Unlike the Ethernet and FDDI controllers, the ATM controller uses two FIFOS, one for transmit and

the

simply

other

for

receive.

reads/writes

tions

that

cells

arrive

choosing

correspond in

how

The

complete to the

the

receive

often

controller ATM

FIFOS. FIFO.

it should

has

cells The The

no DMA

by accessing host

is notified

host

be interrupted.

facilities. certain

has

via interrupt

considerable

Further,

The

host

memory

locawhen

flexibility

the controller

in

does not

provide any segmentation or reassembly of cells; that is the responsibility of the host software. Each ATM cell carries a payload of 44 bytes; in addition, there are 9 bytes of ATM and segmentation-related headers and trailers. In our experiment

we chose packet

well

close

as being

enough

sizes that to

the

were an integral

Ethernet

and

FDDI

number packet

of cells as sizes

for

link,

we

comparison. 2.2 The Testbed In order

and Measurement

to isolate

built a minimal on the network.

the performance

stand-alone testbed, There is no operating

software is executing workstations (either through processor

an isolated rated

at

Methodology of the controller

which simply sends and receives packets system intervention since only minimal

on each machine. two DE Citations network. 18.5

The

SPECmarks,

and the network

The testbed hardware consists of two or two SparcStations) connected

DECstation and

the

uses

a 25 MHz

SparcStation

MIPS

I uses

R3000 a Spare

processor rated at 24.4 SPECmarks. The DECstations were connected in turn to an Ethernet, an FDDI ring, and an ATM network. The SparcStations were connected to an Ethernet. The DECstation network devices are connected on the 25 MHz TURBOChannel [13] while SparcStations use the 25 MHz SBUS [29]. We measured

the performance ACM

of each configuration

Transactions

on Computer

Systems,

in sending

a source

Vol. 11, No. 2, May 1993.

184

C. A, Thekkath and H, M.

.

packet from one node to the Packets are sent and received data

over the host

are enabled, packet dedicated

the

While

fashion

other from

bus is included

so both

arrival.

Levy

and

the

is generally

by disabling

in receiving a packet in response. memory, so the cost of moving the

in our measurements.

sender

it

and host

receiving

possible

network

Network

hosts

to

access

interrupts,

provide

periodic

interrupts.

These

on the DECstations (about SparcStations (once every

were

on the

Wire.

the propagation

This

time

clock

set to provide

chips

interrupts

Measurements trip

can be broken

is the transmission

because

were

time

it is negligible

on

network

in

a

time-sharing

once every 244 microseconds) and 625 microseconds). No significant

involved in fielding a timer interrupt. least 10,000 successful repetitions. The component costs for the round —Time

the

conventional

access will involve interrupt-processing overheads. Both the DECstation and the SparcStation have

interrupts

are interrupted

that

can

at 4096

Hz

1600 Hz on the processing is averaged

over

at

up as follows:

of the packet.

for the length

We ignore

of cable

we are

using. —Controller the sending has made the data

Latency. controller the data

This is the sum of two times: (1) the time taken by to begin data transfer to the network once the host available

at the receiving

to it, and (2) the delay

controller

and the time

between

the arrival

it is available

of

to the host.

—Control / Data Transfer. Data has to be moved at least once over the host bus between host memory and the network. Some of our controllers use DMA incurs transfer describe the

to move

no

data

data

over

transfer

the

host

overhead.

bus to the

However,

network;

the

overhead because it has to use special the location of the data to the controller.

actual

Latency

time

to do the

data

transfer

CPU

thus incurs

the

CPU

a control

memory descriptors to With such controllers,

is captured

in

the

Controller

item.

—Vectoring the packet

the Interrupt. arrival interrupt

The overhead —Interrupt

involved

Service.

some essential

On the receive side, host to the interrupt handler

is a function On taking

controller-specific

of the CPU

an interrupt, bookkeeping

software must vector in the device driver.

architecture.

host

software

before

the

must interrupt

perform can be

dismissed. 2.3

Performance

To determine

Analysls the latency

of the controller

between a pair of hosts controller’s status registers of data, expected simply

we ran separate

experiments

host polled the indicated arrival

a new transmission was begun. In the cases where the to copy data to the controller’s memory before transmission, programmed

giving it any data. data for the host ACM

itself,

with interrupts disabled. Each in a loop. As soon as the register

Transactions

the controller Similarly, to copy,

on Computer

to start

the data

transfer,

without

host was the host actually

when the controller indicated the arrival of new the host ignored the data and began the next

Systems,

Vol.

11, No. 2, May

1993.

Low-Latency Communication transfer.

In

started.

addition,

In these

by the host

descriptors

circumstances

per round

trip.

were very

on High-Speed prefilled

before

few machine

The time

required

Networks the

data

instructions

to execute

185

.

transfer

are executed

these

as well

as the

time This

spent on the wire was subtracted from the total measured round trip. method gives satisfactory results in most cases but has the disadvantage

that

it does not account

This

artifact

multicell

for any pipelining

is particularly

packets

transmission

are exchanged.

of data

transmission instance, in

visible

between

that

in the Typically,

its

the controller

a controller

internal

buffer

the

perform.

controller

chip

and

between host (or on-board memory) sending a multicelled packet through

host can fill the FIFO with a cell while from the FIFO onto the network.

might

case of our ATM

when

can overlap

network,

the

with

the

and the chip itself. the ATM controller,

the controller

sends

For the

the previous

cell

Table I shows the cost of round-trip message exchanges on the host/network combinations described above. A few points of clarification are in order here. First, which

in the

case of the

perform

flushing

the

overhead. from

DMA cache

While

higher-level

is kept

software

data. Second,

arrived

only

two interrupts. row

in the named

Most of the microseconds have

cache flushes

network,

it interrupted

Sum

of

in

the

necessary

locations,

means

this

Table

reported

include

In addition, each

is therefore

is the

if data that

the

cost of accessing

I does not

experiments,

cost of

Service

are not strictly

packets.

Components

the

Interrupt

the host only when

in our

controller,

receives,

pay a performance

multicell

Thus,

SparcStation

sum

this

the cost

the controller

a complete

packet

round

incurs

trip

a lower

of the

bound.

rows

above

it.

time, the sum of our component measurements is within 1–9 (about 270) of the observed round-trip time. The only exception

case of the

transfers

2.3.1

that

The performance

overestimated

ATM

network

in sending

the round-trip

time

of the amount

of overlap

Latency

and

on the

packets,

12%. The most

between

where likely

we

cause

the host memory–FIFO

transfer. It

Latency.

to Time

multicell

by about

and the FIFO-network

Throughput

Controller

is included

eventually

FIFO.

is an underestimate data

DMA

the

on packet

in uncached-memory

will

so that

had

and

memory

the

and reassembling

was programmed

is in the

controller

to host

in the case of the ATM

for segmenting

The

after

it is true

the network

FDDI

directly

is

interesting

Wire.

to

For small

compare

packets,

the

this

ratio

is 0.4 on

the DECstation Ethernet, 0.8 on the SparcStation, 7 for the FDDI controller, and 3.2 for the ATM network. For larger packets of approximately 1514 bytes, these ratios are respectively 0.02, 0.04, 0.9, and 1.0. While these numbers are specific to the controllers we use, we believe they are indicative of a trend: The

tant

bandwidths

packet

difference

exchange

between

are improving times

on

throughput

packets is the goal, then exchange on our DECstation

dramatically

Ethernet,

we can Ethernet

and

FDDI,

latency.

while and

If

latencies

ATM

show

low

latency

achieve a round-trip in only 253 pseconds;

ACM TransactIons

on Computer

Systems,

are not. the

for

impor-

small

60-byte message FDDI on similar Vol. 11, No. 2, May 1993,

186

.

C. A. Thekkath and H. M. Levy Table

I.

Hardware-Level

Round-Trip

Component Ethernet (DIK) 60/60 1514/1514 Time

on the Wue

Controller

Latency

Control/Data Vecmrmg

Transfer the Intemupt

Interrupt

Service

Sum of Components

Round Tnp Time

Measured

hardware,

despite

single

cell

lower

bound

checking

in ,useconds

ATM (DEC) 53/53 1537/1537

2442

115

2442

13

245

6

176

53

89

103

97

230

16

161

40

600

6

6

40

253

17

45s

25

25

12

12

25

25

25

25

26

26

42

42

92

140

9

20

257

3146

264

2605

267

893

73

840

253

3137

263

2611

263

894

73

7.$6

10-fold

bandwidth

is

in

about

because

whether

Times

51

which

transfers

Exchange

115

its

the same operation,

Packet

Round-Trip Time (f(seconds) Packet Size in bytes (serrdk’ecv) FDD1 (DEC) Etbcrnct (Spare) 60/60 1514/1514 60/60 1514/1514

we

longer.

4~0

73

have

the cell is part

advantage, The ATM

~seconds,

ignored

takes network

However

switching

of a larger

message

263

pseconds

is capable

this

delays, that

for

of doing

is an optimistic and

the

needs

cost

of

fragmenta-

tion/reassembly. On the other hand, the high bandwidth of ATM and FDDI is significant for large packets. For example, latency for a 1514-byte packet on the DECstation Ethernet

is 4.2 times

packets

even

larger

worse than

than 1514

ATM

and

3.5 times

worse

bytes,

the

Ethernet

situation

than

FDDI.

For

is relatively

worse, because FDDI will require fewer packet transmissions. The ATM comparison is slightly more complex; as pointed out earlier, we have ignored ATM segmentation and reassembly costs in arriving at the figures in Table I. The particular software implementation of the segmentation/reassembly we use requires

about

11 ~seconds

per cell.

If this

is included,

then

a 1537-byte

packet (29 cells) takes about 1065 ~seconds, which is about 1.2 times the FDDI time. The situation further improves in favor of FDDI for packet sizes beyond

this

2.3.2

limit.

Controller

Structure

and

Latency.

As noted

earlier,

both

the

DEC-

station and the SparcStation use the same Ethernet controller. However, their performance is not identical. Recall that on the SparcStation the controller uses DMA transfers on the host bus. The overhead for this is included in Controller Latency; the Control / Data Transfer costs include

only

the

cost

of the

instructions

to program

troller is able to overlap the data transfer the network. Thus the sum of Controller Transfer is relatively unchanged by the

the

controller.

The

con-

over the bus with the transfer on Latency and Control/ Data packet size. In contrast, the DEC-

station controller incurs a heavy latency overhead because the host first has to copy the data over the bus before the controller can begin the data transfer. Consequently, for larger packets, the SparcStation controller performance is likely to be better even though for small packets both have comparable round-trip times. Controllers that use on-board packet buffers instead of DMA or FIFO will generally incur this limitation. However, controllers with DMA or FIFO are not without problems; we will defer a more detailed discussion ACM

until

TransactIons

Section on Computer

4. Systems,

Vol.

11, NrI

2, May

1993

Low-Latency Communication It

is

also

interesting

to compare Both

various controllers. interrupt-handling

times.

the The

on High-Speed

the

Networks

interrupt-servicing

Ethernet

latency

controllers

SparcStation

figure

.

have

is slightly

187 on the

comparable

higher

because

a cache flush operation is included in the cost due to the DMA transfer. except the ATM controller use descriptor-based interfaces between the and the interface. than

Programming

the simpler

FIFO

this

interface

requires

many

on the ATM.

more

The FDDI

All host

accesses to memory controller

is the most

complex one to service and is 7– 10 times as expensive as the ATM. While figure is for a specific pair of controllers, we believe that FIFO-based trollers

will

in general

reduce

Our experiments with lowered latency, simple the more

traditional

of controllers

programming

overhead.

low-level message passing seem to suggest that for FIFO-based controllers have some advantages over

types

of controllers.

on user-to-user

However,

cross-machine

we describe the impact

a higher-level of controllers

message-passing

performance,

user-level

to

with

concerned

additional

ultimately

it is the impact

communication

the next section, RPC and explore be

this con-

that

communication in this context. cross-machine

issues

like

is crucial.

system Unlike

communication

protection

and

In

based on low-level has

input

packet

demultiplexing. 3, THE DESIGN,

LOW-LATENCY The

IMPLEMENTATION, RPC

performance

system

of

the

hardware

are two of the three

If the

performance

poor,

this

twofold. systems

of the

can easily

and

message-passing

components third

dominate

Second,

of the

low-level

of user-level

component, the overall

we wish

used

by higher-level

software

ture

of the network

controller.

the

high-level

how

might

system,

is

section

is

of this

exists for building RPC in the previous section

some of the low-latency these

latency.

RPC

cost. The purpose

to outline and

communication

interact

techniques

with

the

struc-

Design and Implementation

The RPC system tion

OF A

First, we wish to show that the technology that are so efficient that the costs described

are significant.

3.1

AND PERFORMANCE

[30]

[27] built

has demonstrated

optimization. Our where performance tions

system

system

differs

(1) We differ performed: promising RPC. (2) Our

system

for the DEC the

SRC Firefly

latency-reducing

multiprocessor

effect

of a large

RPC system closely follows the SRC is a goal, many details in the structure

must

be

from

SRC RPC in several

dictated

by

in the way stubs we use a scheme protection differs

kernel

in the way but ACM

hardware

of

design; however, of a communica-

environment.

Thus,

our

respects:

are organized that minimizes

between

(3) We do not use UDP/IP,

the

workstanumber

and

in which

and the copying user

control

instead

use network

Transactions

on Computer

way marshaling is costs without com-

spaces

as is done

transfer

is done.

datagrams Systems,

in SRC

directly.

Vol. 11, No. 2, May 1993.

188

.

These

C. A. Thekkath and H M. Levy

differences

Our

are described

low-latency

ning

the Ultrix

The

system

RPC operating

is

in assembly RPC

system

is entirely

during

face previously common byte

the

binding

by the

to impact

Multipacket

SunOS other

that

is linked

process,

server. transfers

code. The

the user’s

address

on different

imports

the

inter-

optimized

machines

for the

with

the same

code path

so as not

is reduced

by ensuring

that only the client, but not the server, swaps byte order when needed. The underlying communication in the RPC relies on a simple request response

packet

however,

there

beyond what data copying,

exchange

similar

are

fundamental

three

is required for con trol transfer,

to that

described aspects

dis-

the need to program

specially

overhead

or

high-quality

by a separate

case. Byte-swapping

systems

networked

client

between

run-

SunOS.

operating

into

the

We have

RPCS are handled

the common

not felt

generate

5000

I running

that is integrated into the kernel. clients and servers can be placed

case [4] of single-packet

order.

and

in C; we have

subsections.

DECstation

impacting

our compilers

runtime

exported

on the

Ultrix

component

space and another component Like other RPC systems, machines;

the without

because

has a runtime

in the following

and on the SparcStation

into

machines

language,

detail

is prototype

system

integrated

executing on these tributed services. Our implementation

in more

system

in Section

to RPC

and

2. In addition,

that

add

overhead

a simple message exchange: marshaling and and protocol processing. Our system achieves

its low latency by optimizing are somewhat interrelated, and

each of these areas. The optimizations we consider them in turn in the following

subsections. 3.1.1 simple specific

Marshaling procedural and depend

procedures

marshal

and

Data

interface

Copying.

for the

client

on the particular the

call

RPC

stubs

and

server.

service

arguments

into

function and

out

create Stubs

the

illusion

of a

are application-

that

is invoked.

of the

message

These packet.

Even in highly optimized RPC systems such as SRC RPC, marshaling time is significant. Marshaling overhead depends on the size and complexity of the arguments. Typically the arguments are simple—integer, booleans, or bytes; more involved data structures are used less frequently [4]; therefore simple byte copying is sufficient. Related to the cost of marshaling

is the cost of making

the network

packet

available to the controller in a form suitable for transmission. A complete transmission packet contains network and protocol headers, which must be constructed by the operating system, and user-level message text, which is assembled by the application and its runtime system. There are several strategies for assembling the packet for transmission, with the cost depending on the capability of the controller and the level of protection required in the kernel. With a controller that does scatter-gather DMA over the bus, like the SparcStation controller, the data can be first marshaled in host memory by the host (in one or more locations) and then moved over the bus by the controller. This is the scheme we use on the SparcStations—the data packet comes from user space, and the header is taken from kernel space, a techACM

Transactions

on Computer

Systems,

Vol. 11, No. 2, May

1993

Low-Latency Communication nique way

commonly

used

SparcStation

and from

in such

DMA

a fixed

range

on High-Speed

an environment

is architected, of kernel

the

virtual

is more

expensive

than

our RPC design uses mapping One general drawback with

addresses.

accesses packet

of the data and

Likewise,

again given

performs

This

the

when

the

controller with

copying

it

of the user

over

to the

approach is to relax kernel/user user space and allow the user copies

to one. However,

there

controller

ment-marshaling into

the

To do this

kernel

generate

and

optimized

rather

executed.

to the

Code for

network.

or a FIFO, into

two

technique

Another

retains

synthesis

specific

which

has been

situations

all the

used

argu-

space,

is then in the

to achieve

and

buffer into number of

address

code on the fly,

the

a buffer

copies.

that

in the user’s

two

of the

the cost of two copies. call arguments, we perform

than

we synthesize

routines

the data requires

is an alternative

in the kernel

conventional.

pieces

conthis

at least

the pieces

memory

to

a threshold,

protection and to map the packet direct access. This reduces the

benefits of the protection without incurring In an effort to minimize copying of the

only mapping

accessible to the primitives for

it involves

the

on-board

marshaling

due to the

necessitates

the host builds

transfers

special

189

DMA

sizes below

is that

over the bus: once when

technique

kernel

for packet

only selectively. a DMA scheme

a controller

straightforward

copying

.

[18, 25]. However, controller

the user’s data buffer into kernel addresses that are troller. Since the cost of using SunOS virtual-memory mapping

Networks

as is linked

past

high

to

perfor-

mance [20, 22]. Our focus is slightly different: we are more concerned with avoiding the copy cost than with generating extremely efficient code for a special situation. At bind time, when the client imports the server’s interface, the client calls into

the

directly

kernel

pointers

1 As only

procedure instructions. each input

a template

of the

simple-valued

to bytes.

procedure. involves

with

supports

Using

this

mentioned

assignments

is nothing

earlier,

more

than

passed

procedure.

as words, kernel

assembling

the task

the

procedure

contains This

right

simple

synthesized

code to check

procedure

and a

of primitive the validity

has the benefit

and reply are known in advance, the can be avoided if arguments and results

the

and

of synthesizing

sequence

approach

kernel

bytes,

a marshaling

is typically

Thus,

The

halfwords,

synthesizes

marshaling

copying.

at runtime.

since the sizes of the request general multipacket code path a single network packet. The kernel then installs

the

the

and byte

marshaling

such

template

The marshaling argument

types

as a system

of that

more fit in

call

for

subsequent use by this specific client. Thus, the stubs linked into the user’s address space do not do marshaling; they merely serve as wrappers to trap into the kernel where the marshaling is done. A client RPC sees a regular system call interface with approach has the benefit required

without

1 We do not yet have

all the usual of performing

compromising

a template

the

compiler, ACM

protection rules the minimum

safety

of a “firewall”

and so our kernel

Transactions

that go with it. This amount of copying

stubs

on Computer

between

are currently

Systems,

hand

the

user

generated.

Vol. 11, No. 2, May 1993.

190

.

C. A. Thekkath and H, M, Levy

and the kernel, or the user and the network controller. impose the overhead of probing the validity of pointers copied,

but

this

In addition RPC

system

is not a significant scheme,

provides

another

interface

into

a specialized

case. Instead

of calling

which

client calls into a fixed-kernel tors. Each descriptor describes type—IN descriptors space ing

and the

done using cally

handles

controller’s

exports

memory

compromising

this

several

cases, our general

procedure,

the

The kernel can use these directly between the user

thereby

eliminating

Unmarshaling

on the

server

above.

A server

interface

with

received packets. We have used kernel-level

common

for a more

kernel-marshaling

or FIFO,

procedures

the kernel

of the

is designed

supported. arguments

safety.

general-purpose

provides

RPCS.

most

that

scheme does data can be

entry point, passing an array of data descripa primitive data type (including its parameter

or OUT) that is directly to marshal and unmarshal

cost without

server

cost for most

to this

Our before

as described

with

different

a generic

types

template

marshaling

that

of

extra

is

typi-

arguments.

applies

and unmarshaling

copyside

The

to all types

with

of

the DECsta-

tion Ethernets, but a hybrid scheme is used with the FDDI controller. On transmissions the usual scheme is used, but on receives, the controller’s DMA engine copies the data over the host bus and hands it to the kernel. The kernel unmarshals the data either by copying or by virtual-memory mapping if the

alignments

controller. driver

performs

could

read

required, fact,

are

In this

suitable.

case, before

reassembly

only

the

A similar the

RPC

from

the

network

accessed

method, but reassembly is such ATM that we did not feel justified

is used

receives

in a page-aligned

header

and if not let the RPC layer

in a non-ATM

approach

layer

buffer.

FIFO

and

the data

via FIFOS,

this

and this

the the

Furthermore,

determine

unmarshal

a common in making

with

the packet,

ATM device

the driver

if reassembly from

would

the FIFO.

is In

be the preferred

frequent operation optimization.

on the

3.1.2 Control Transfer. Researchers have reported that context switching causes a significant portion of the overhead in RPC [27]; in addition, there is a substantial impact on processor performance due to cache misses after a context

switch

switching and finally the

[24].

the client switching

server

out—can

An

RPC

call

typically

requires

four

context

out, switching the server in, switching the client back in. Two of these—switching be

overlapped

with

the

transmission

the

switches:

server out, the client or

of the

packet.

Systems with high-performance RPC usually have lightweight processes that can be context switched at low cost, but unless there is more work to do in the client and the server, or no work elsewhere, a process context switch usually occurs. Both the DECstation that can be significant context

switches,

our

and the SparcStation have context-switching times to the latency of a small packet. To reduce the cost of RPC

system

defers

blocking

the

client

thread

on the

call. Instead, the client spin-waits for a short period before it is blocked on the sleep queue. If the service response time is very small, the reply from the server is received before the client’s spin-waiting period has expired. When ACM

TransactIons

on Computer

Systems,

Vol

11, No

2, May

1993

Low-Latency Communication there

is no other

no penalty done,

work

to be done,

to spinning

the

caller

the caller

spins

for

i.e., when

round-trip

time

are

be to block

a short

greater

than

otherwise. using

available.

the

time

relative

An estimate

an estimate.

hint,

In general,

are processes

control

trip

technique

queue

switch

a thread

This

leads

often

with

In

contrast,

packet

directly

is done

from

layer.

performance.

the lowest

directly

within

the

Protocol

Processing.

layer

trip

is

and to spin statically

response

by

times

for latency

as

if there

protocol

handler,

layering.

The

the incoming packet on the interrupts are then used to to a modular we

to the destination

interrupt

scheme

round

either

past

from

wake

in that

is

to be

turn.

arises to queue software

of control

there quantum

penalty

throughput

traditional approach is for each layer input queue of the next higher layer; unacceptable

191

work

current

expected

be obtained

their

overhead

to the

by using

trades

waiting

transfer

is empty, is useful

round-robin

if the

could

or dynamically,

this

on the run

Additional

extension

to the context

of the round

a user-supplied

to its

spinning

related

there

.

of the server ensures that the process can be improved if estimates of the

without

some threshold

Networks

queue

When

A simple

caller

the run

indefinitely.

before blocking. In most cases the low response time is never put to sleep. This approach would

on High-Speed

try

approach

but

to dispatch

the

process.

yielding

This

a path

dispatch

of very

low

latency. 3.1.3 nate

communication

overhead

of protocol

processing

costs if general-purpose

The

protocols

are used

the usual case of a homogeneous and low service response times, employed to optimize There are typically protocol

like

specialized procedure

call semantics

Several multiple

that

protocol

aspects layers

increasing

layers

provides

to provide

in RPC

a basic a close

systems:

unreliable

a transport-level transport,

approximation

and

a

of conventional

for RPC.

of protocols

of protocol

the number

In

environment, with frequent remote requests special-purpose protocols can be effectively

the latency. two protocol

UDP/IP RPC

can domifor RPC.

contribute

tend

to RPC latency.

to add to the overhead

of context

switches

As mentioned of RPC

and increasing

above,

in two

the number

ways: of data

copies between layers. Further, the primary cost of using protocols such as UDP/IP is the cost of checksums. Calculating checksums in the absence of hardware support involves manipulating a packet as a byte stream; this can nullify any advantage gained by the controller or host processor in assembling the packet using scatter-gather or wordlength operations. For efficient RPC implementations, then, the checksum must be either calculated in hardware themselves transmission

or made

optional.

to the former medium

Most

conventional

approach.

The latter

is reliable

and that

protocol approach

the transport

formats

do not lend

presupposes protocol

that

is used

the only

for routing, not for reliability. These factors argue for the use of a simpler protocol and hardware checksumming. Our RPC uses a simple and efficient unreliable transport protocol and relies on a specialized RPC protocol to provide robustness, which is similar to ACM

Transactions

on Computer

Systems,

Vol. 11, No. 2, May 1993.

192

.

C. A. Thekkath and H. M, Levy

that used by the protocols reflects the

SRC or Xerox [5] RPC systems. In general, the choice of a set of assumptions about the location of clients and

servers

and

servers

are expected

area networks heavy

loads

error

characteristics

have good packet due to overruns.

the same LAN,

raw

by the controller. controller. If the

of the

to be on the same local

network.

loss characteristics,

In the case when

network

datagrams

Typically

area network.

clients

dropping

packets

the RPC destination

are used with

and

Furthermore,

local at only is within

the checksum

provided

An erroneous packet is simply dropped by the receiving target is not on the local network, it is a straightforward

extension to use UDP/IP without checksums. The choice can be made at bind time when the client/server connect is established, and the marshaling code can be generated In addition protocol

the appropriate imposed

per se adds to the latency.

to provide achieve

to include

to the overhead

a natural

high

The primary

set of semantics

performance,

header.

by transport-level reminiscent

our protocol

on the critical fast-path of the code. Under normal error-free operation,

protocols,

purpose of simple

was implemented the server’s

of RPC

the RPC protocols

procedure

is

call.

To

so as not to intrude

response

to a call from

the

client acknowledges it at the client end. Similarly the next call from the client acknowledges the previous response. The state of an RPC is maintained by the client and server using a “call/response” record. Call records are preallocated at the client at bind time. These contain a header, most of whose fields do not change the latency

with

side and contain In

order

packets

each call.

at call time. to

header recover

on the client

These

Similarly,

are therefore

response

information from

from

dropped

and the server

prefilled

records a previous

packets,

call that

our

RPC

side. On the client

blocked for the duration of the call, retransmission the client address space, as with the original call.

so as to minimize

are retained

at the server can be reused.

transport

buffers

side, since the client

is

proceeds from the data in Thus no latency is added

due to buffering when the call is first transmitted. On the server side, the reply is nonblocking; hence a copy of the data has to be made before returning from the kernel. The copy is overlapped with the transmission on the network. Once again no latency is added to the reply path. One alternative to this would be to use Copy-on-Write, which would not affect latency but can potentially 3.2

save buffer

RPC Performance

space for large

multipacket

RPCS.

Measurements

This section examines the performance of our RPC implementation. Our goal is to show that structuring techniques, such as those we have used, yield an RPC software system so effective that the hardware costs shown in Table I are significant. We gather the data to support our analysis of controller design in the next section. Table II shows the time in microseconds for RPC calls on the various platforms.

Two

procedures

takes two integer two arguments—an an integer ACM

result.

Transactions

called

Minus

and

MaxArg

were

timed.

Minus

arguments and returns an integer result. MaxArg takes integer parameter and a variable-length array, returning The exact

on Computer

sizes of the packets

Systems,

Vol. 11, No, 2, May

exchanged 1993

varies

depending

Low-Latency Communication Table II

II.

Allocation

on High-Speed

of RPC Time

.

193

in Microseconds

I Activity

Networks

(sccont

fltherr h4inus

Fim7cr

Ethcrrrl

(Spwc

MaxArg

Minus

Maxar

Minus

DEC) Maxarg

Fl)r

All

DEC)

Minus

=MaxArg

Chen[ Call Controller Latency

28

145

59

137

46

173

25

159

26

27

45

54

48

82

8

44

Time

58

1221

58

1221

4

122

3

88

25

25

27

27

56

70

17

20

Server Packet Receipt

39

470

59

169

42

42

29

347

Server Reply

27

2-1

46

46

36

36

25

25

Controller

25

25

44

44

~ 49

82

8

44

57

57

57

57

5

5

3

3

26

27

56 ’29

17

17

49

27 49

56

29

26 29

340

2052

471

1831

371

340

2070

496

1997

380

on the WUC for Call

Intcrntpt

Time

Handhng

Latency

on the Wue

Interrupt Chcnt

on Server

for Reply

Handling Reply

hcelpl

Total Attributed Measured

on CIIcn!

T]mc

Time

on the network 48 bytes

type.

on FDDI,

Minus

causes

and 53 bytes

60 bytes

single-user

rows

which

showed

are explained

below:

—Client Call. This is the total time out a call packet. It consists of five up the argument time to validate dure,

—Time

the packet on the

based

for setting

Latency.

getting

MaxArg

Wire

This for

Call.

major

parts:

locate

the cost

2

675

on the Ethernet,

transmits

to the server

on the ATM. on the three

The reply networks.

required on the client side for sending major components: the time for setting

and

of this

server

to do a transmit.

represents

the wire.

the

This

An estimate

controller’s

is computed

of the packet

latency

in

as in Section

2.

transmission

time

after

This is a sum of the times to vector the handler in the device driver, the time in the interrupt

is dismissed.

This is the time spent on the server machine before it is handed to user code. It consists

the cost of examining correct

unmarshaling

170

of the network.

and to return

—Server Call Receipt. the call packet arrives

The

figure

Handling on Server. interrupt to the interrupt

the handler,

693

less variance.

up the controller

to and from

on the bandwidth

—Interrupt network

29 776

to the system call, the time to perform a kernel entry, the the arguments, the time spent in the marshaling proce-

and the time

—Controller

29

164

both in single-user and in multiuser modes. The different; the times reported in the tables are the

measurements

The various

29

697

to be exchanged

on the ATM.

1514 bytes on Ethernet and FDDI, and 1537 bytes from the server is 60, 48, and 53 bytes respectively Measurements were made times were not significantly

s +

and

the packet

dispatch

copying/mapping component

varies

the data

and kernel packet,

into

depending

the

data

and

the

server’s

on the

when of two

structures time

spent

address

capabilities

to in

space. of the

controller. —Server needed

Reply. This is quite similar to Client Call. This includes the time to call the correct server procedure and to set up the results for the

system call, the time spent in the kernel in checking the time to set up the controller for transmit. ACM

Transactions

on Computer

Systems,

the

arguments,

and

Vol. 11, No. 2, May 1993.

194

.

C. A. Thekkathand

H. M. Levy

—Client ceipt

Reply Receipt. This is the counterpart and takes place on the client side.

—Total

Attributed

we have been measurements A few points

Time.

This

is the total

able to attribute or by estimating about

Table

of the

of the components.

to the various from functional

II are in order

Server

Call

It is the time

activities, either specifications.

here.

First,

Re-

though

by direct

the SparcSta-

tion and the DECstationv 5000 have similar CPU performance and use identical Ethernet controllers, the software cost for doing comparable tasks varies. In general, our experience has been that SunOS extracts a higher penalty than Ultrix. We believe this is primarily due to the richer SunOS virtual-memory architecture rather than to the SparcStation’s architectural peculiarities ATM

such

network,

packet

into

Similarly,

as register

windows.

Second,

29 cells are sent to the server.

29 cells at the sender

is included

the cost of reassembling

in the case of MaxArg

The cost of segmenting in the row entitled

the cells is included

on the the user

Client

in the Server

Call. Packet

Receipt row. In the case of single-cell transmission, the segmentation and reassembly code is completely bypassed. Finally, we exploited the flexible interrupt structure of the ATM controller to ensure that in the normal case only the last cell in a multicell the Total Attributed Time Section

2, in all cases except

in the range

one, the difference

of 1–8Y0. The only

on the ATM,

where

tables,

exception

for the reasons

overestimated the total Overall, as we can Ethernet

packet caused a host interrupt. In comparing with the Measured Time we note that as in

the

experimental

error,

is in the case of a multicell

packet

mentioned

time required. see by comparing

latency

for

is within

the

earlier

the

small

in Section

FDDI

Minus

and

RPC

the

2, we have DECstation

request

on FDDI

is

12% higher than for Ethernet. On the other hand, as expected, the much higher bandwidth on FDDI becomes evident for the larger packet sizes; the MaxArg RPC request takes nearly three times longer on Ethernet than on FDDI. Comparing the performance on the ATM network with the others, we see how effective a simple controller design is for reducing latency for small packets. is a factor

However,

for larger

in reducing

tion

and reassembly

are

being

built

[12,

packets,

performance. either 32].

the software ATM

in hardware It

would

fragmentation/reassembly

controllers

that

be interesting

characteristics of these controllers with FORE controller we use. As noted earlier in Section 2, for small

perform

or by dedicated the

to

simpler

packets,

fragmenta-

on-board

processors

compare approach

the taken

the cost of “doing

latency in the

business”

(i.e., controller latency) on the network has increased with FDDI. However, compared to the low-level software latency imposed by the packet exchange, the higher-level RPC functions (stubs, context switching, and protocol overhead) add only 87 pseconds for Ethernet, pseconds for ATM. Higher-level RPC functions cost more relating to the DMA interface ACM TransactIons

controller/host the high-level on Computer

117 ~seconds on

interface. First, cost of dispatching

Systems,

Vol

11, No. 2, May

FDDI

for

for two

FDDI, reasons,

and

97

both

because of the more complex the server process and data 1993.

Low-Latency management is greater Ethernet, the code path

Communication

for the required

on High-Speed

Networks

DEC FDDI. Second, for both for RPC for these controllers

different inherent

from that used for a simple packet exchange. overhead added by RPC. However, the overhead

of FDDI

than

for the

controller/host While

Ethernet

because

of some of the

the absolute

increase

traps and context However, the bulk

contributed

switching that of the overhead

main

factors

that

use of preallocation, exploiting

simple

Our

experience

that

the

ments

and

with

also determine in Tables

Our experiments, ordinary

network

latency-impacting

AND NETWORK

RPC

on a variety

has

latency

latency.

experienced the

fact

In this

and the network

that

influence

included

controllers

and no DMA,

tradeoffs

features the

make

minimum

Transfer. the

between

allow

by an RPC.

Our

a faster

achievable.

posed by the lower the communication

access

measure-

network

we discuss

that

are capable

us to examine

these

The

the data

details

capabilities

it difficult

efficient

network

does not

the

effects

of

I/ O

(2) the overhead of transferring in more detail below.

on

indicates on

on latency.

different

sor moves the data to and from the bus, transfer primitive). These issues are (1) the

4.1.1 Data

the

DESIGN

controller

to restrict Excessive

to the user.

and

number

on the Certain

of data

unnecessary

each of these

system

movements data

movement

on Computer

Systems,

of

to one, im-

performance of between con-

troller types and copying costs. Table III summarizes the number of copies that the controller, and the user need to perform, respectively, for each type of device: Transactions

bus vary

combinations

levels of the system can lead to bad overall system. We discuss below the interaction

ACM

There

using a block the device and

We discuss

data

controller.

the

types.

controllers that (i.e., the proces-

usually without cost of servicing

of moving

of the

of scatter-gather

some of the essential

are basically two issues to be considered in choosing between provide DMA and those that support programmed 1/0 (PIO)

depending

for

different

section,

over-

are (1) the

of controllers

a significant Similarly,

the

transmission,

optimizing

reinforce

which

performance

and network

processors.

II

DMA,

is to keep

controller,

a lower

4.1 DMA versus Programmed

reduction

system’s

overheads.

the

I and

imply

the controller

DMA,

the

controller

for

to the cost of features like

(4) the speed of the host

low-latency

of the

this

computation

of

case,

communication

necessarily both

of the

is comparable

RPC adds about 130% is due to architectural

to the

FOR CONTROLLER

design

cross-machine

contribute

peculiarities

common

4. IMPLICATIONS

protocols

characteristics

may not scale with processor speeds. can be reduced with increasing CPU

(2) overlapping

the

and

Thus, there is an is more in the case

by RPC functions

as a percentage, of this overhead

speeds. A significant factor in achieving heads of memory copying to a minimum.

(3)

FDDI and is slightly

interface.

both ATM and Ethernet, low-level messages. Part

The

195

.

the kernel, PIO, DMA,

Vol. 11, No. 2, May 1993.

196

.

C. A. Thekkathand Table

III.

H. M, Levy

Number

Number

of Copies

Controller

Types

of Copies (Device/KemelAJser)

PIO (KM)

Fragment

for Various

I PIO

I DMA

I DMA (S-G)

Send Header Data

otlm 0/1/0

Ollp

I/lp

lPIO

0/1/1

1/0/1

lPIO

OIIP OIIP

0/1/0

1/1/0

1/0/0

0/1/1

Ip/1

lPP

Receive Header Data

and DMA with scatter-gather. The some form of kernel-level marshaling FRPC

running

represents

on the ATM

scatter-gather

Certain

implicit

controller

assumptions

made

the cost of copies

the

internal

either by using that with PIO, into

multiple

copied)

user

use a contiguous DMA,

the

data

over

address

spaces,

the

while

with

and

that

on

tiguous

locations

of allowable (or

a few)

in memory.

location(s),

Thus, and

cost

can

user

scatter

Most

data

DMA

S-G

be made

data

and

negligible

with

(instead

of

the DMA

can

scatter-gather

small

controller

amounts can

goes to the correct

of

perform

destination.

the number of times data is moved on and the host memory. With PIO and to keep

the

number

scatter-gather

the user

of copies

to one

DMA this would not be possible, in could be located in multiple nonconcontrollers

for each segment

then

so that that

the

First,

Second, we assume be reliably mapped

arbitrarily

DMA,

below.

the network

is mapped

memory,

we assume

With data

size requirements segments.

this

DMA,

it is possible

compromising protection. because application-level

minimum

between

incoming

without general, have

are clarified

occur

of transferring

so that

marshaling

table

might

in kernel

one would like to minimize bus between the network

kernel-level

in the

Finally,

is capable

bus

The column

or some similar technique. memory, or FIFO, cannot

set of addresses.

controller

its FIFOS.

Typically

to the header

demultiplexing

Ideally, the host

that

buffer.

video RAMs the on-board

to be adjacent

with

DMA,

we are ignoring controller’s

column PIO (KM) represents PIO with as described in Section 3, for instance,

will

have

have

the

to marshal

controller

(e.g.,

and a maximum the data

move

it.

LANCE) number into

While

one it

is

possible to build controllers to overcome this restriction. Using several small segments to gather data comes at a price, because the controller has to set up multiple DMA transfers. A similar situation is true on the receiving side as well. As shown in Table III, using PIO with kernel-level marshaling allows the data-copying cost to be kept to the minimum possible. However, one aspect of copying that is not captured in the table is the different rate at which data is

z the Autonet Sept. ACM

controller

being

built

at DEC

SRC;

personal

communication,

1991. Transactions

on Computer

Systems,

Vol. 11, No

2, May

1993

Charles

P. Thacker,

Low-Latency Communication

on High-Speed

Networks

.

197

moved over the bus for PIO and DMA. Typically, word-at-a-time PIO accesses over the bus are slower than block DMA accesses; this is the case on both the DECstation TURBOChannel bus and the SparcStation SBUS. While PIO versus

DMA

point

is of limited

beyond

which

is required

concern

PIO will

to touch

each byte

across the bus (for instance, more

efficient

4.1.2 and

Cache

and

with DMA

current might

TLB

the total

bookkeeping

cache lines

corresponding

IV,

of cache

flushes

flush

Newer

it

adequate

support

data

transfers

If the cache does not snoop

by the

to the data are

that

a serious

the situation

for

DMA

and

The cache flush required

processor/cache

coherence

without

controllers.

costs

designs In

penalty

interruptcontroller-

the time

architecture

taken

to flush

memory

the table

from

architec-

suggests,

because

cache flush instructions, by destroying locality. this

absence

the

As is evident

on current

than

recognize the

The total essential

was transferred.

is even worse

[15].

the

cost is simply host

in addition to the costs of executing additional flushes have a negative impact on performance ory

moving it is usually

subsystem with This subsection

controller-initiated

due to cache effects.

instructions

cache

above,

subsystem,

overhead.

the

However,

in software),

size.

cost of our DMA

sum

to execute

tures.

the processor than

blocks could be left incoherent as a result of the requiring cache flushes. If the cache is write-back,

interrupt-handling

Table

other

need to be purged before a DMA operation from memory. the contribution of the cache flush cost as a percentage of

cost is the

related

a certain

detail. costs outlined

of overhead

on 1/0 operations, cache DMA operation to memory,

handling

for reasons

a checksum

is a break-even

unless

interaction of the memory overall penalty than PIO.

and processor

dirty entries may Table IV shows

data

there

Thus,

While Table 111 seems to indicate that PIO can perform the same number of copies, very often

architectures the extract a heavier

be a source

packets, DMA.

Effects.

DMA

the memory

could

of the

beyond

describes this effect in more In addition to the copying from

short than

to generate

to use DMA

scatter-gather

for

be slower

problem

and

of snooping

cache

provide

caches,

mem-

the

usual

“solution” to this problem is to allocate buffers temporarily in uncached memory before they are copied to user space. This approach, used in the stock SunOS

Ethernet

driver,

either

incurs

an extra

benefit of cached accesses to frequently used Another cost of DMA is the manipulation necessary. On packet arrival, the controller however, there is generally no way to guarantee the

correct

option this

destination

of either remapping

address

remapping

space.

Thus

or performing

can require

an explicit

copy

overhead

data. of page

tables

or loses that

the

is often

stores the data on a page; that the page is mapped into the

an extra

kernel

is faced

with

the

copy. On a multiprocessor,

and expensive

TLB

coherency

opera-

tion [6]. To summarize our experience with the various controllers, we believe that DMA capabilities without adequate support from the cache and memory subsystem can be bad for performance in modern RISC processors. We also believe

that

controllers

that

have

ACM

simpler

Transactions

interfaces

on Computer

to the

Systems,

host

have

the

Vol. 11, No. 2, May 1993.

198

C. A. Thekkathand

.

H. M Levy Table

IV.

Cache

Flush

Cost

Received Packet Size in bytes FDDI (DEC) Ethernet (Spare) 1514 60 60 1514 Total Interrupt Cache

Percentage

,

potential simple 4.2

Overhead

for reducing

overheads.

controller/host

are two

controller.

common buffer

in memory.

other

ways

and

14

48

15

43

next

subsection

argues

data

a range

between

(possibly

the

data

to the

overhead

FIFO

user.

We have

shall

therefore

and

costs of different controller types. with the two types of controllers

interface

such

component.

as that

found

processors

Our in the

objective ATM

shows

the

cost

and

already

of servicing

through a copy or a remapping here is to compare the interfaces

the to be

descriptors

for transmits

and

discussed

in an the

restrict

ourselves

mentioned

above

to indi-

the

and (2) the controllerresearch has studied

[3, 26], and so we shall is with

interface such as that found in the Ethernet We reproduce some of the measurements table

host

overhead can be significantly reduced cost is composed of two components:

costs on RISC

second

the

and the host share

(1) the CPU-dependent cost of vectoring the interrupt dependent cost of servicing the interrupt. Previous only

use of

all) of host memory

is to use a simple

cate that the interrupt-handling by using a FIFO. Interrupt-handling

interrupt-vectoring

for the

host access it directly. A significant overhead to the processor system is the cost of servicing the

movement

interrupt-handling Our experiments

70 30

and have the controller

getting

of data

7

of transferring

alternative

receives and have the interfacing the controller interrupt

46

10

Memory Interface

used as a packet

role

21

3

The

One way is to designate The

21

interfaces.

Host / Controller

There

Time

Flush Time



to

compare

a more

elaborate

or the FDDI from Section

interrupt

and

examine

a simple

controllers. 2 in Table

transferring

FIFO

descriptor V. The the

data

operation, as appropriate. Since our intent and not the network, we have ignored the

reassembly overhead on the ATM controller. As the table indicates, for transferring small

amounts

of data

to the user,

the overhead of the FIFO-based controller (10 ~seconds) is less than half that of the best-case descriptor-based controller (24 ~seconds for the DEC Ethernet). For larger packets, the balance is in favor of controllers that allow the kernel to perform page-mapping operations, because the copying cost dominates the interrupt-handling cost. The ability to map data into user spaces at low cost is one alternative to kernel-level marshaling with PIO, which retains the benefits of protection and reduced data movement without the need to synthesize code. A typical limitation with FIFO-based controllers, such as the FORE ATM, is that there is no easy way to map the memory in a protected manner simultaneously into multiple user spaces. ACM

TransactIons

on Computer

Systems,

Vol.

11, No

2, May

1993

Low-Latency Communication on High-Speed Networks Table r

V.

Interrupt-Handling

(DEC) 1514

Ethernet

60

199

.

Cost

Ethernet (Spare) 60 1514

FDDI (DEC) 60 1514

Intenupt Time

13

13

21

21

46

10

11

201

9

39

10

70 10

5

Copy/Mapin

5

138

24

214

30

60

56

80

10

148

Time

Toral

In contrast, to support

with

descriptor-based

address

mapping

packet

irrespective

memory,

of DMA

it is possible

support.



ATM (DEC) 1537 53

in principle

However,

typically,

on descriptor-based PIO controllers, the existence of a small amount of on-board buffer memory makes it difficult to provide address-mapping support, because that memory is a scarce resource that must be managed sparingly. For instance, the DECstation’s Ethernet controller has only 128 Kbytes of buffer memory to be used for both send and receive buffers. Mapping pages of the buffer memory into user space would be costly, because the

smallest

Reducing dropped tively

units

that

the number packets

during

managing

the

can

be

individually

of available

buffers

periods

scarce

of high

buffer

mapped

this

way could

load.

On the

resource

results

are

4-Kbyte

pages.

lead to delays other

in the

hand, kernel

due to

conservamaking

extra copy of the data from user space into the packet buffer. With a trivial amount of controller hardware support, it is possible

an

to solve

the protection granularity problem, providing a larger number of individually protected buffers in controller memory. The basic idea is to populate only a fraction of each virtual page that refers to controller memory. As a concrete example, Figure 2-Kbyte

we consider an alternative design to a DEC Ethernet 1 shows the sketch of the design. Controller memory buffers,

us to have write troller

each of which

64 buffers

directly

in our

to controller

ignores

will

hold

128-Kbyte memory

the high-order

the physical page number the TLB. This has the

bit

controller. without

of the

Our larger based

experience simple

host/controller

FIFO-based

packets packet

with

would buffer.

same controllers.

packet; To allow

sacrificing

page-offset

this

would

user

processes

protection,

and

as

allow

the

concatenates

to con-

it with

(PFN) field of each physical address presented by effect of causing each 2-Kbyte physical page of

controller memory to be doubly mapped half of a 4-Kbyte process virtual page. while

an Ethernet

controller. is organized

are

served

it might

To our knowledge,

interfaces on board is the VMP-NAB in the absence of memory system

both

the top half

interfaces

controllers

be better Thus,

into

leads

ideally

with

a more

be beneficial the only

suited

us to believe for

small

conventional

to support

controller

[19]. Our support,

and the bottom

that

experiments DMA may

both

that

packets, descriptor-

forms

has multiple

on the host

also suggest that incur the cost of

additional copies and/or cache-flushing overheads. In such cases, it would be advantageous to use PIO with a descriptor-based packet memory that the host CPU can either copy or map into user space. This could be done either by providing enough buffers or with hardware support (e.g., as mentioned above). The decision to copy or map in will depend on whether the processor is required to touch every byte of the packet or not. For instance, if it is ACM

Transactions

on Computer

Systems,

Vol. 11, No. 2, May 1993

200

.

C. A. Thekkath and H, M, Levy

. : :.

. Processor Virtual

: .

Address Space

: :

: :. : :

Processor

Fig.

1.

Address

Controller

mappmgfor

Memory

Controller

Ethernet

buffers.

necessary to calculate a software checksum, then an integrated copy and checksum loop (as proposed in [9] and [ 10]) would suggest that mapping is of limited benefit. This consideration will be even more important with future memory transfers. initiated adequate 4.3

subsystem designs If the processor

that will provide support for I/O-initiated does need to mediate the transfer, then

data moves would be of benefit support for cache consistency.

Network

if

the

memory

system

data 1/’0-

provides

Types

Compared to an Ethernet, a high-speed token ring like FDDI offers greater bandwidth. However, token rings have a latency that increases with the number of stations on the network. Consider a network where the offered load is quite small. That is, on the average, only a few nodes have data to transmit at a given time. As stations are added to the network, the Ethernet latency remains practically constant while the latency of a token ring will increase linearly due to the delay introduced by each node in reinserting tokens. This implies that even on a lightly loaded, moderately-sized token network, achieving low-latency cross-machine communication is difficult. As an example, if each station introduces a one-microsecond token rotation delay, a network of 100 stations would make it infeasible to provide lowlatency communication. As load is increased, both the Ethernet and the FDDI token ring will experience greater Iatencies, with the FDDI reaching an asymptotic value [32]. Thus, on balance, it appears that low-latency communications are not well served by a token ring, despite high bandwidth. We ACM

TransactIons

on Computer

Systems,

Vol

11, No

2, May

1993

Low-Latency

Communication

on High-Speed

Networks

201

.

should point out that in our experiments with RPC, token rotation latency was not a problem because we used a private ring with two nodes on it. However, if we added more nodes to the ring, we would expect to see a degradation

in the latency

ATM-style the

networks

delay

instance,

of RPC.

that

fragment

and

of

fragmenting

and

reassembly

our

experience

indicates

that

interleave of

with

packets

have

medium-sized software

to incur

packets.

reassembly,

For

latency

begins to be impacted with packet sizes in the range of 1500 bytes. It is not immediately clear how adding fragmentation and reassembly support in hardware overall

or in an on-board latency,

5. SUMMARY Modern

processor,

even though

as provided

in [12] and [31],

fragmentation/reassembly

will

affect

is fast.

AND CONCLUSIONS

distributed

systems

require

both

high

throughput

and

low

latency.

While faster processors help to improve both throughput and latency, it is high throughput, and not low latency, that has been the target of most newer networks In this cations

and controllers. paper

we have explored

on new-generation

implemented designs

a low-latency

in

addition

avenues

networks RPC

to our

own.

for achieving

(specifically, system

Using

using

newer

low-latency

FDDI

techniques RISC

communi-

and ATM).

We have

from

processors

previous

and

perfor-

mance-oriented software structures, our system achieves small-packet, round-trip, user-to-user RPC times of 170 pseconds on ATM, 340–496 ~seconds on Ethernet, and 380 ~seconds on FDDI. Our RPC system demonstrates that it is possible to build an RPC system whose overhead is only 1.5 times the bare

hardware

cost of communication

for small

packets.

Our experiments indicate that controllers play an increasingly crucial role in determining the overall latency in cross-machine communications and can often

be the

bottleneck.

However,

we believe

controller design that can provide lowered niques that achieve excellent performance. us to believe

that

FIFO-based

packets

and

that

DMA-

hidden trollers

costs that

and

network

that

there

are alternatives

latency, facilitating software Specifically, our experience interfaces

descriptor-based

are well controllers

suited may

depending on the memory system architecture. provide multiple host interfaces appear to be

alternative to current factor for performance.

for have

to techleads small many

Hybrid conan attractive

designs. Of course, the network itself is an important For instance, both of our high-throughput networks

have some peculiarities that could affect latency of packets: e.g., the token rotation latency in FDDI network and the fragmentation and reassembly in the ATM Finally,

network. we believe

that

with

careful

design

at all levels

tion system, communications latencies can be substantially entirely new approaches and applications for distributed

of the communicareduced, systems.

enabling

ACKNOWLEDGMENTS

The authors thank Ed Lazowska for his numerous patient readings of earlier drafts of this paper and his comments, which added greatly to its clarity of ACM

Transactions

on Computer

Systems,

Vol. 11, No. 2, May 1993

202

.

C. A. Thekkathand

presentation.

Thanks

suggestions of the

are also

for improving

FDDI

helping

H.M.

Levy

due

to Tom

the paper.

controller

from

us understand

DEC,

many

Anderson

Finally, who

for

we wish

gave

his

comments

to thank

so willingly

and

the designers

of their

time

in

of its details.

REFERENCES

1.ADVANCED

MICRO

Advanced 2. Ross,

F. E.

Devices,

FDDI—A

3. ANDERSON, T. E., architecture

Am7990

DEVICES.

Micro

Sunnyvale, tutorial.

LEVY,

H.

and operating

on Architectural

Support

Local

Area

Calif.,

1986.

IEEE

Network

Cornmun.

M.,

BERSHAD,

system

design.

Mug.

B. N.,

Controller

24, 5 (May

1986),

10-17.

AND LAZOWSBA,

E. D.

The

In Proceedings

for Programming

Languages

Trans.

5. BIRRELL,

Comput.

A.

Comput.

D.,

Syst.

6. BLACK,

D.,

tural

B.

39-59.

1984),

R.,

GOLUB,

J.

D.,

and

Implementing HILL,

approach.

In

for Programming

interaction

Operating

Systems

Lightweight

of

Conference (Apr.

remote

1991).

procedure

call.

1990).

1 (Feb.

A software

Support

8, 1 (Feb.

AND NELSON,

2,

RASHID,

consistency: York,

Syst.

(LANCE).

of the 4th International

4. BERSHAD, B., ANDERSON, T., LAZOWSKA, E., AND LEVY, H. ACM

for Ethernet

remote

C., AND BARON,

Proceedings

Languages

procedure R.

Translation

of the 3rd

and

calls.

ACM

Operating

Systems

LEVY,

M.,

ACM

Trams

lookaside

buffer

Conference (Apr.

on Architec-

1989).

ACM,

New

113–122.

7. CHASE, J. S., AMADOR, Amber

system:

ACM

12th

LAZOWSKA,E. D.,

F. G.,

Parallel

programming

Symposium

on

on a network

Operatzng

Systems

H.

AND LITTLEFIELD,

of multiprocessors. Principles

(Dec.

In 1989).

R. J.

The

Proceedings

of the

ACM,

York,

New

147-158. 8

CHERITON, (Apr.

D. R

1984),

The

V kernel:

A software

9. CLARK, D. D., AND TENNENHOUSE, protocols. tures

In

and

10. CLARK,

Protocols IEEE

11. COMER, model.

1990).

Commun.

B. S.

SIGCOMM New

(Sept.

Proceedings

ACM,

New

ROMKEY,

Msg.

27,

J.

York,

6 (June

A new

A host-network

for

distributed

design 1990

interface

USENIX

An

1, 2

for a new generation

for

ATM.

Architectures

analysis

of

Architec-

of TCP

systems:

Conference

architecture

on Communications

In

and

The

(June

remote

1990),

memory

127–135.

Proceedings

Protocols

processing

of the

(Sept.

1991).

1991 ACM,

307-315. TURBOChannel

CORPORATION.

PMADD-AA

Specification,

Rev.

1.2. Workstation

EQUIPMENT SYSTEMS. C. A. R.

Systems

CORPORATION.

TCA-1OO

Pittsburgh,

HOARE,

Softw.

on Communications

H.

23–36.

CORPORATION.

17

IEEE

200–208.

1989),

EQUIPMENT

Systems,

considerations

J., AND SALWEN,

14. DIGITAL DIGITAL

systems.

Symposium

EQUIPMENT

16, FORE

distributed

Architectural

13. DIGITAL

15

for

SIGCOMM

of the Summer

Symposium

York,

D. L.

of the 1990

D., AND GRIFFIOEN, In

12. DAVIE,

Proceedings

D. D., JACOBSON, V.,

overhead.

base

19-42.

Alpha

Hardware

Engineering, ATM

Ethernet

1991.

Module

Functional

1990.

Architecture

TURBOchannel

Speczficatzon.

TurboChannel Reference

Computer

Man ual,

Interface,

1992.

User’s

Manual.

FORE

Pa. 1992 Communicating

sequential

processes.

Comm un

ACM

21,

8 (Aug.

1978),

666-677. 18, JOHNSON, D. B., AND ZWAENEPOEL, Rep. 19

COMP

KANAKIA, mance

TR91-152,

H.,

mm

network

Symposmm

Dept.

CHERITON,

W.

The

of Computer D. R.

communications

The

Peregrine Science,

VMP

high

network

for multiprocessors.

on Comm unzcatLons

Architectures

performance

Rice Univ., adapter

In

and

board

Proceedings

Protocols

RPC

system.

Tech,

1991. (NAB):

of the 1988

(Aug.

1988).

ACM,

High-perforSIGCOMM New

York,

175-187. 20

KEPPEL, ings

guages 21

and

A portable 4th

Operating

Syst.

Transactions

interface

International

LI, K., AND HUDAK, Comput.

ACM

D.

of the

Systems P.

Memory

7, 4 (Nov. on Computer

for

on-the-fly

Conference

1989),

(Apr.

1991).

coherence

instruction

on Architectural ACM,

New

in shared

York,

virtual

321-359.

Systems,

Vol.

11, No. 2, May

space

modification.

Support

1993

for

In

Programming

ProceedLan-

86-95. memory

systems.

ACM

Trans.

Low-Latency Communication 22. MASSALIN,

H., AND Pu,

ings

of the 12th

York,

191-201.

23. METCALFE, computer

R.

M.,

J.

C.,

Proceedings Languages

AND BORG,

and

Compzd.

Proceedings

J. K.

input/output

R.

Ethernet:

effect (Apr.

Newsletter

31. BRENDAN,

aren’t 1990

IEEE C.,

INC.

operating

systems

USENIX

Conference

New

switching

for

local

on

performance.

In

New

cache

Support

York,

for

Programming

77-84.

of a capability-based getting

Performance

results.

Systems

SBUS Specification

tures

Protocols

and J. N.

Proceedings

Trans.

TRAW,

In

(June

operating

faster

as fast

1990),

247-256.

of Firefly

RPC.

Performance

B.O. Sun

S.,

Comput.

37, 8 (Aug.

AND SMITH,

Proceedings

of the 1991

(Sept.

1991).

A timed-token

ring

of the 7th IEEE

ACM, local

Conference

J.

M.

1988), A

system.

as hardware?

ACM

Evaluation

Microsystems,

Trans.

In

Comput.

Cooperative, Inc.,

1990.

Mountain

View,

New area

E. H.,

JR.

Firefly:

Sympos~um

York,

317–325.

network

and

on Local

A multiprocessor

909–920.

high-performance

SIGCOMM

host

interface

on Communications

its

Computer

performance

Networks

for

ATM

Architec-

characteristics.

(Feb.

1982).

IEEE,

In New

50-56.

33. VAN RENESSE, R., VAN STAVEREN, H., AND TANENBAUM, distributed 34. MINZER,

Received

Proceed-

ACM,

1-17.

benchmark

networks.

Mug.

In

1989).

1990.

workstation.

York,

switches

ACM,

30. THACISSR, C. P., STEWART, L. C., AND SATTERTHWAITE,

32. ULM,

packet

on Architectural

The design

kernel.

(Dec.

395–404.

context

1991).

A. S.

Synthesis

203

.

289-299.

Why

1990),

of

Networks

Principles

Distributed 1976),

Conference

D., AND BURROWS, M.

1 (Feb.

in the

Systems

19, 7 (July

The

Systems

of the Summer

29. SUN MICROSYSTEMS, Calif.,

A.

Operating

27. SCHROEDER, M. 28. SPEC.

D.

ACM

International

J. 29, 4 (1986),

8,

and

on Operating

S. J., AND TANENBAUM,

26. OUSTERHOUT,

Syst.

Commun.

of the 4th

25. MULLENDER,

Threads

Symposium

AND BOGGS,

networks.

24. MOGUL,

C.

ACM

on High-Speed

operating S. E.

27, 9 (Sept.

July

1991;

system.

Broadband 1989),

revised

Sofku—Pratt. ISDN

17-24,

and

Exp.

asynchronous

A. S.

The performance

19, 3 (Mar. transfer

1989), mode

of the Amoeba

223-234. (ATM).

IEEE

Commun.

57.

November

ACM

1992;

accepted

Transactions

January

on Computer

1993

Systems,

Vol. 11, No. 2, May 1993.