Programmable Peripheral Devices

Patrick Crowley
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98043

1 Introduction

Many important server applications are I/O bound. For example, large-scale database mining and decision support applications are limited by the performance of the storage I/O subsystem, and, likewise, web servers and Internet backbone routers are constrained by the capabilities of the network I/O system. For this reason, many proposals have been made in recent years to migrate functionality from servers onto programmable storage and network devices in order to improve application performance. This report surveys these proposals and evaluates research areas concerning these programmable peripheral devices. Programmable disks can be used to: scale processing power with the size of large scan-intensive database problems, build scalable, secure, and cost-efficient storage systems, and implement sophisticated storage optimizations at the disk. Programmable network interfaces (NIs) can be used to: unburden the host from managing data transfers in fast networks, scale processing power with the number of links in a network, and enable complex packet processing, such as the aggressive program-in-a-packet Active Network proposal, at network speeds. Both types of programmable peripherals contain all the components found in computer systems: processor, memory, and communications subsystem. Based on this observation, this report concludes with a set of common issues, including technical vulnerabilities and areas for future research.

This report is organized as follows. Section 2 contains background information and a historical perspective concerning programmable peripherals. Sections 3 and 4 discuss the designs for and applications of programmable disks and network interfaces, respectively. Examples of other programmable peripherals are briefly discussed in Section 5. A set of issues common to both programmable disks and network interfaces is presented in Section 6. The report concludes with a summary and brief set of research proposals in Section 7.

2 Background

The seasoned reader will note that programmable I/O devices are, in fact, far from a new idea. Programmable peripherals have been implemented and abandoned for good reasons in the past. To consider the arguments against programmability in these devices, we first recall a Pitfall and Fallacy from the storage and network I/O chapters, respectively, of a popular computer architecture textbook [Hennessy and Patterson 1996].

Pitfall: Moving functions from the CPU to the I/O processor to improve performance. An I/O processor, in this context, is a direct memory access (DMA) device that can do more than shuffle data. The authors are recalling the programmable I/O processors found in classic machines such as the IBM 360, which, in the 1960s, had programmable I/O channels [Amdahl et al. 1964, Cormier et al. 1983]. One application of this programmability was support for linked-list traversal at the I/O processor. (Interestingly, one group recently proposed the addition of execution engines at each level in the CPU cache memory hierarchy to enable the overlap of computation and communication in linked-list traversals [Yang and Lebeck 2000].) This relieved the host CPU of the traversal task, so it was free to do other work. The argument against this usage was that the advances in host CPU performance would prove far greater than the performance advances of the I/O controller in the next generation. Thus, applications that used and benefited from this optimization in generation N actually saw decreased performance when running on the machine of generation N+1. Put plainly, the host CPU was by far the most expensive and powerful compute element in the system; to bet against it was folly.

Fallacy: Adding a processor to the network interface card improves performance. In elaborating on this fallacy, the authors argue essentially the same point: the advantages of host CPU speed.

The issues raised by H&P are answered by the state of the art in embedded microprocessors today. Rather than being far slower than high-performance desktop CPUs, embedded microprocessors now deliver integer performance within a factor of two of their desktop counterparts [Keeton et al. 1998] (note the co-author on this reference). This change in relative processor performance has been a consequence of Moore’s Law [Moore 1965]; increasingly, and at all levels of abstraction, communication is a far scarcer resource than computation.

Generally speaking, increasing I/O performance in a computer system is a matter of cost: improvements can be achieved by spending more money. Improvements in I/O performance are costly because I/O components are generally standards-based commodities, with many companies offering competing, compatible products; for example, any new PC interconnect technology must gain wide acceptance to achieve the economies of scale necessary to be cost-effective. Thus, only the most urgent and important problems get solved with expensive, customized I/O systems. Furthermore, I/O subsystems comprise the bulk of the cost of modern computer systems, despite being built with commodity components [Hill et al. 2000, editors' introduction to Ch. 7]. I/O systems are relatively costly since they, generally speaking, do not benefit from Moore’s Law as do semiconductor devices like processors and memories. Thus, many I/O advances aim either to reduce the cost for a given level of performance or to improve performance for a given level of cost.

Programmable microprocessors are an increasingly cost-effective solution for providing sophisticated control in I/O devices. Advances in VLSI technology have made embedded microprocessors small, powerful, and relatively inexpensive. In fact, most peripherals common to today's computer systems, including disks, graphics accelerators, and network interface cards, are built around microprocessors. As mentioned previously, this situation has sparked numerous research efforts attempting to exploit any excess compute power at peripheral devices, particularly on devices related to storage I/O and network I/O, to speed I/O intensive applications in a cost-effective manner. This report surveys the research efforts under way for programmable disks and programmable networks, unifies and gives context for their common problems, and identifies areas of future research.

3 Programmable Disks

Compared to modern processors, disks are slow as a result of the physical motion needed to access data. This fact enables disk manufacturers to implement much of the disk control logic in software/firmware executed on a microprocessor. Software-based control reduces the number of electronic and ASIC components on the disk, and therefore reduces cost. In this section, we consider the design of modern disks and survey the approaches taken by researchers to leverage this programmability in order to increase application performance.

Figure 1. Mechanical components of a disk drive. Source: [Ruemmler and Wilkes 1994].

3.1 Basic Operation

Magnetic disk drives store information in the form of magnetic flux patterns. This encoded data is arranged on sectors within tracks on a platter, as shown in Figure 1. To store information, the drive receives blocks of digital data through a host interconnect channel, such as SCSI, maps block addresses to physical sectors, moves the read/write head over the appropriate disk sector, and encodes the data as flux patterns that are recorded onto the magnetic surface. Information retrieval is similar, except data is sensed and decoded rather than encoded and written. As noted by [Ruemmler and Wilkes 1994], modern disk drives contain a mechanism, which includes the recording and positioning components shown in Figure 1, and a disk controller, which consists of a microprocessor, memory, and host interface, among other components, as shown in Figure 2.

Recording and positioning components. The overall performance of the disk is dominated by the engineering tradeoffs found in the disk mechanism. Two different but intimately related aspects contribute to disk performance: media transfer rate and storage density. The media transfer rate for a fixed storage density is primarily determined by two common performance measures: spindle rotation speed and seek time. Very fast spindle rotation requires a powerful motor, which consumes more energy, and high-quality bearings, which are more expensive. Seek time refers to the time needed to position the head over a particular cylinder. This time is limited by the power of the motor that rotates the arm and the stiffness of the arm itself. The storage density for a fixed media transfer rate is a consequence of two forms of density: linear recording density and track density. The former is constrained by the maximum rate of magnetic phase change that can be recorded and sensed. Track density refers to how closely tracks may be packed together on the platter and is the primary source of density improvement. Track density is influenced heavily by the precision provided by the head positioning and media sensing mechanism. Both linear and track density are influenced by, and in turn influence, the speed of the encoding process. The read-write data channel encodes and decodes the data stream into or from a pattern of magnetic phase changes. Error correction is built into the encoded data stream (and DSP techniques can be used to increase data channel speed), and positioning information is recorded onto the disk surface by the manufacturer to help determine the location of the head.
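
To make the relationship between these mechanical parameters and overall performance concrete, the following back-of-the-envelope sketch estimates the average time to service a small random request as the sum of seek time, rotational latency, and transfer time. The drive parameters are illustrative assumptions, not measurements of any particular product.

    #include <stdio.h>

    /* Illustrative drive parameters (assumed values, not a specific product). */
    #define AVG_SEEK_MS     8.0     /* average seek time, milliseconds       */
    #define SPINDLE_RPM     7200.0  /* spindle rotation speed                */
    #define MEDIA_RATE_MBS  20.0    /* sustained media transfer rate, MB/s   */

    /* Estimated average time to service one random request of 'bytes' bytes. */
    static double avg_access_ms(double bytes)
    {
        double revolution_ms = 60000.0 / SPINDLE_RPM;  /* time for one turn  */
        double rot_latency   = revolution_ms / 2.0;    /* wait half a turn on average */
        double transfer_ms   = bytes / (MEDIA_RATE_MBS * 1e6) * 1000.0;
        return AVG_SEEK_MS + rot_latency + transfer_ms;
    }

    int main(void)
    {
        printf("8 KB random read: about %.1f ms\n", avg_access_ms(8.0 * 1024.0));
        return 0;
    }

Under these assumed numbers the mechanical terms dominate: seek plus rotational latency account for roughly 12 ms, while the transfer itself is well under a millisecond, which is why faster control processors alone do little for raw disk performance.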

3.2 Disk Controller

The disk controller governs the operation of the mechanism described above. The controller receives and interprets SCSI requests, manages the media access mechanism, manages data transfers, and controls the cache. The heart of the controller is the microprocessor. The current trend is to reduce cost and improve performance by replacing electronic components with software/firmware, augmenting the processor with DSP capabilities, and tightly integrating the interfaces to hardware, which permits direct control.

Figure 2. The structure of a disk controller and integration trends. Source: [Riedel 1999].

3.2.1 Processor

Since disk performance is limited by the media access rate, which is slow relative to microprocessor speeds, the control processor does not need to be particularly fast. However, embedded microprocessor price/performance continues to improve, so, increasingly, disk control logic, which controls spindle rotation and arm actuation, is being moved into software executed by the control processor. [Adams and Ou 1997] describe their experience in doing so. Chip-level system integration is also having an impact. Cirrus Logic sells a system-on-a-chip disk controller, called 3Ci, that integrates a 66 MHz ARM7 32-bit RISC processor core, disk control logic, a DSP-based integrated read/write channel (PRML), 48 KB SRAM, 128 KB ROM, and a memory controller for off-chip Flash, SRAM and DRAM memory [Cirrus Logic 2000]. The next generation of this device will include a 200 MHz ARM core with more on-chip memory.

3.2.2 Memory System

There is considerable semiconductor-based buffer storage (between 64 KB and 1 MB) on disks today, and future devices will have even more. Originally, buffer memory performed only rate matching between the media access rate and the host transfer rate. Today, data caching is used and, in some cases, provides excellent improvements. Read caching can be performed optimistically since there is no on-disk penalty associated with reading unnecessary data as the head moves across a platter; it can simply be discarded. Outside of the disk, however, host-based prefetching may be affected if the host makes cache content assumptions based on its own reference pattern. Write caching permits the disk to organize data before writing and to reorganize disk blocks during operation without interrupting the host CPU. For reliability, however, write caching is generally implemented with non-volatile memory to avoid loss of information if power fails before the cached data can be written. IRAM [Patterson et al. 1997] [Keeton et al. 1997] has been proposed [Patterson and Keeton 1998] as a good integrated processor and memory system architecture due to its potential low-latency and high-bandwidth characteristics. Cirrus, by virtue of their 3Ci device, agrees with this call for integration.

3.2.3 Communication

All high-performance disk drives use the small computer system interface (SCSI). The SCSI standard defines both an interconnection fabric and programming interface. SCSI interconnects are parallel busses shared by several devices. Historically, bus-based interconnects have been the standard for connecting hosts and storage devices. Following the trend seen in LANs, however, high-performance, serial, point-to-point interconnect technologies like Fibre Channel [FCIA 2000] are rapidly replacing SCSI in server systems. Fibre Channel is a serial interconnect technology that uses fewer wires than SCSI (4 rather than the 25, 50, 68, or 80 used in various SCSI generations) and, therefore, has a smaller connector, and is considerably faster (125 MBps vs. 80 MBps for SCSI-2). SCSI, the programming interface, can be implemented on top of Fibre Channel.

The SCSI interface has proven to be a successful abstraction between hosts and storage devices. The SCSI interface frees programmers and the host from having to manage the storage device and, furthermore, permits the storage device to implement optimizations beneath the interface. RAID [Patterson et al. 1988] is an example of an optimized system that presents itself to the host as a standard SCSI device. However, SCSI is a low-level interface, and one recommendation of the network-attached secure disk drive (NASD) [Gibson et al. 1997] project is to replace it with a higher-level, object-based interface to permit devices to better manage data, meta-data and security. With an object-based interface, the device would manage the storage of blocks that belong to a particular object. Hence, when a request comes for that object, the drive has knowledge of all the blocks that are of potential interest. Presently, the interface permits no expression of relationships between blocks. The object-based interface also simplifies security concerns, which are paramount for disks that can be accessed directly across a network by multiple hosts, by associating capabilities with host/object pairs.
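
The contrast between today's block-level requests and the capability-protected, object-based requests advocated by NASD can be sketched as follows. The structure and function names are illustrative inventions for exposition, not the actual NASD interface.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Illustrative sketch of an object-based, capability-checked drive request. */
    enum { CAP_READ = 1, CAP_WRITE = 2 };

    typedef struct {
        uint64_t object_id;   /* object the rights apply to                   */
        uint32_t rights;      /* access rights granted by the file manager    */
        uint8_t  mac[16];     /* keyed MAC the drive would verify (omitted)   */
    } capability_t;

    /* The drive checks the capability before touching any blocks of the
     * object. In a real system the MAC would be verified with a secret
     * shared between the drive and the file manager. */
    static int drive_object_read(uint64_t object_id, uint64_t offset,
                                 void *buf, size_t len, const capability_t *cap)
    {
        if (cap->object_id != object_id || !(cap->rights & CAP_READ))
            return -1;                   /* request rejected at the drive    */
        memset(buf, 0, len);             /* stand-in for the real media read */
        (void)offset;
        return 0;
    }

    int main(void)
    {
        capability_t cap = { .object_id = 42, .rights = CAP_READ, .mac = {0} };
        char buf[512];
        printf("read %s\n",
               drive_object_read(42, 0, buf, sizeof buf, &cap) == 0
                   ? "granted" : "denied");
        return 0;
    }

The key point is that the access check and the knowledge of which blocks belong to the object both live on the drive, so a centralized file manager only needs to be consulted when capabilities are issued, not on every transfer.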

3.3 Control Software

Modern disks are built around microprocessors, and, accordingly, a software control system is responsible for governing the operation of the device. The control software is not exposed to the programmer, and it generally resides on a disk-resident ROM or EEPROM. Disk-based operating systems proposed in the research literature will be discussed in Section 3.4.1.

3.4 Applications of Programmability

We have examined the design of modern disks and the factors that have made them programmable. In this section, we survey the manner in which this programmability has been exploited to solve problems. These proposals fall into two categories: storage systems and distributed disk-centric applications.

3.4.1 Storage Systems (NASD, Virtual Log-based FS)

The aforementioned NASD project describes a cost-effective scalable storage architecture with network-attached and secure programmable disks [Gibson et al. 1997]. Disks directly attached to the network require changes in the programming interface and security model, as mentioned in Section 3.2.3. The NASD project addresses these issues, and proposes the following four characteristics: direct data transfer between drive and client, a capability-based access control system permitting asynchronous oversight by a centralized file manager, cryptographic integrity, and an object-based interface. The NASD work culminates in the demonstration of a parallel distributed file system, built on a prototype NASD, that provides file system support to a parallel data mining application. Application performance in their prototype system scales linearly with the number of NASDs.

Another proposal implements a virtual log-based file system on a programmable disk [Wang et al. 1999]. Wang’s file system uses a virtual log, that is, a disk-based log with non-contiguous entries, to achieve good performance for small synchronous writes while retaining all the benefits of a log-based file system, including transactional semantics. The technique involves migrating some of the low-level file system implementation to the disk and performing small atomic writes near the location of the disk head when data arrives. The authors note that these techniques do not necessitate a programmable disk; the technique only requires a file system with precise knowledge of disk state.

3.4.2 Distributed Disk-bound Applications (Active Disks, IDISK)

A number of researchers [Acharya et al. 1998, Gray 1998, Keeton et al. 1998, Riedel et al. 1998] have proposed executing application-level code on programmable disks, in particular NASDs or IDISKs, as a means of scaling processing power with the size of very large data sets in certain scan-intensive database problems. While database machines, which scaled processing power
with the number of read/write heads on a single disk, failed in the 80s, these researchers contend that there are now important applications that need to scale processing power with data set size. Researchers from CMU describe target applications as those that: 1) are able to leverage the parallelism available in systems with many disks, 2) operate with a small amount of state, processing data as it "streams" past, and 3) execute few instructions per byte of data [Riedel et al. 1998]. Most of these applications are scan-intensive database operations used in data-mining where the same queries are run over all data, producing a result set that requires further processing. This approach makes use of the processors in all disks, and does not require all data to be sent across a host I/O bus for processing.

The Active Disk literature proposes a programming model [Acharya et al. 1998] and an analytical model [Riedel et al. 1998]. The programming model proposed by the group from UC Santa Barbara/Univ. of Maryland is simple and calls for stream-based message passing. In this model, the disk runs DiskOS, a disk-resident OS which handles memory management and stream communication, and the application developer partitions code between the host and the disks. This approach is fine for big problems that justify customization, as is the case for certain problems solved by message-passing multiprocessors, but is far from a comprehensive programming model. We return to this issue and the larger issue of software models in Section 6.2. The analytical model from CMU is designed to give intuition about the performance of an Active Disk system compared to a traditional server [Riedel et al. 1998]. This model, in addition to most of the arguments from the Active Disk literature, primarily speaks to and argues for distributing these computations and, therefore, applies to other scalable approaches as well.

Clusters of inexpensive machines are another way of doing this [Arpaci-Dusseau et al. 1998]. In fact, the arguments for a distributed, serverless file system were laid out in xFS [Anderson et al. 1996], which organized all workstations on a network as peers providing file system services. The question is whether it is more cost-effective to run the software at each disk, or on a PC that manages a few disks. The Santa Barbara group compared clusters to active disks for a set of target applications and found that they were equivalent in terms of performance [Uysal et al. 2000]. However, the active disk solution was 60% less expensive, given late ’99 prices and the authors’ bargain-hunting skills.

The IDISK project from Berkeley specifically argues for independent disks with considerable processing power and memory that are capable of autonomous communication; in particular,
they state the case against clusters with respect to IDISKs. They point out four weaknesses in cluster architectures: 1) the I/O bus bottleneck, 2) system administration challenges, 3) packaging and cost difficulties, and 4) inefficiency of desktop microprocessors for database applications. The first three items are clear. The fourth points out that desktop microprocessors are slightly more powerful than embedded microprocessors when executing database codes, but are far more costly. We return to this point in the discussion of the future of programmability in Section 6.1. The IDISK work distinguishes itself from Active Disks primarily by arguing for considerable resources on each disk; the current Active Disk proposals seek out applications that require minimal computation per byte.
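
To make the stream-oriented partitioning behind these proposals concrete, the following sketch shows a disk-resident filter in the spirit of the Active Disk programming model: the host ships a small amount of filter state to the drive, the drive applies the filter to records as they stream off the media, and only the (much smaller) result set crosses the I/O bus. The names are illustrative and do not correspond to the DiskOS interface.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative record layout and filter state; not the actual DiskOS model. */
    typedef struct { uint32_t key; uint32_t value; } record_t;
    typedef struct { uint32_t min_key; } filter_state_t;  /* tiny per-stream state */

    /* Disk-resident filter: runs over each block as it streams past the head
     * and emits only matching records to the host-bound output stream. */
    static size_t disklet_filter(const filter_state_t *st,
                                 const record_t *in, size_t nrecords,
                                 record_t *out)
    {
        size_t kept = 0;
        for (size_t i = 0; i < nrecords; i++)
            if (in[i].key >= st->min_key)
                out[kept++] = in[i];     /* result set is much smaller than input */
        return kept;
    }

    int main(void)
    {
        record_t in[4] = { {1, 10}, {5, 50}, {9, 90}, {3, 30} };
        record_t out[4];
        filter_state_t st = { .min_key = 4 };
        size_t n = disklet_filter(&st, in, 4, out);
        return (int)n;                   /* 2 records survive the on-disk scan */
    }

The sketch also illustrates why the CMU criteria matter: the filter carries almost no state, touches each byte once, and parallelizes trivially across drives.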

3.5 Summary

In this section, we have surveyed the motivations, designs and applications for programmable disks. The most compelling motivation for executing application code on these devices is the need to scale processing power with problem size; data-set sizes for important disk-bound problems continue to grow rapidly. However, cluster-based systems are already in use for this purpose, and they must be displaced for Active Disks to become a reality. Special processor and memory architectures (other than IRAM) have not been investigated for disks, as they have for network interfaces, as we will see, because disk performance is already limited by the mechanical speeds of the spindle and arm. Any microprocessor performance increase will have a marginal effect on the overall performance of the disk, unless the device is tailored to improve the physical properties of the disk, as in using DSP techniques to increase effective density [Smotherman 1989]. The IDISK proposal contends that there is clear benefit, however, in improving the computational resources afforded to application-level code executing at the disk, presuming, as do the Active Disk proponents, that additional processing power can be added for marginal cost.

4 Programmable Network Interfaces

Network performance is increasing dramatically, outpacing the increase in memory speeds, with no end anticipated in the near future [Schoinas and Hill 1998]. This fact, coupled with the limitations in server I/O bus performance described in the previous section, has motivated high-performance NI designs built around powerful microprocessors that require minimal host CPU interaction.

Network interface design issues have traditionally been categorized according to network type: local-area networks (LANs), system-area networks (SANs), and massively parallel processor (MPP) networks. Since the reliability, bandwidth, and latency characteristics of these network types are converging, the primary distinction that remains, and one that is a large performance factor, is the location of the NI/host connection. LAN and SAN NIs typically connect to the host I/O bus. MPP NIs ordinarily attach to the node processor’s memory bus or processor datapath [Dally et al. 1992]. In this report, we focus on the LAN/SAN type of NI. However, as we shall see, the integration of network processors on these devices raises many of the same issues confronting MPP NIs.

While no two NIs are identical, in the following discussion of NI operation and design we use the Myrinet [Boden and Cohen 1995] host interface, as depicted in Figure 3, as a running example. The Myrinet system area network (SAN) was a ground-breaking advance in interconnect technology. It was the product of two academic research projects, namely Caltech’s Cosmic Cube [Seitz 1985] and USC’s Atomic LAN [Cohen et al. 1992]. The Myrinet host interface is similar in the important ways to other high-performance interfaces, and the bulk of the differences lie in the relative sophistication (or lack thereof) of the network processor.

4.1 Basic Operation

A switched network such as Myrinet consists of host network interfaces and switches. Myrinet switches range in size from 4 to 16 ports. Myrinet is self-configuring and source routed; it uses the blocking cut-through (wormhole) routing technique found in MPP systems such as the Intel Paragon and the Cray T3D. The switches have no hard state or software; they only steer the source-routed packets. Moore's Law has helped make switch-based networks economical since switches and crossbars can be implemented on a single chip. Upon packet arrival, the link, or packet, interface handles framing, error detection, and the media access protocol. The link interface accepts a frame via the media access protocol, checks for errors via cyclic redundancy checks (CRCs), and writes the frame into the buffer memory. Minimally, this buffer is used to cope with asynchrony between the network link and the host interface. In many high-performance NIs, additional processing of the packet takes place, as discussed further in Section 4.2.1. Once all processing is complete, the packet is moved via DMA into the host CPU’s memory.
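
A minimal sketch of this receive path, written from the point of view of the NI firmware, is shown below: accept a frame from the link, verify its CRC, stage it in NI buffer memory, and DMA it into host memory. The hardware operations are placeholders, not actual LANai or Myrinet firmware calls.

    #include <stdint.h>
    #include <stddef.h>

    /* Placeholder hardware operations; in real firmware these touch device
     * registers. Everything below is an illustrative sketch, not LANai code. */
    static size_t link_receive(uint8_t *buf, size_t max) { (void)buf; return max; }
    static int    crc_ok(const uint8_t *buf, size_t len) { (void)buf; (void)len; return 1; }
    static void   dma_to_host(const uint8_t *buf, size_t len) { (void)buf; (void)len; }

    #define FRAME_MAX 2048
    static uint8_t nic_buffer[FRAME_MAX];    /* on-NI staging memory */

    /* One iteration of the receive loop run by the NI processor. */
    static void rx_poll(void)
    {
        size_t len = link_receive(nic_buffer, sizeof nic_buffer);
        if (len == 0)
            return;                      /* nothing arrived on the link       */
        if (!crc_ok(nic_buffer, len))
            return;                      /* drop corrupted frames at the NI   */
        /* Optional protocol processing (classification, IPsec, ...) goes here. */
        dma_to_host(nic_buffer, len);    /* hand the packet to the host CPU   */
    }

    int main(void) { rx_poll(); return 0; }

Everything between the CRC check and the DMA is where a programmable NI can add value; the more work inserted there, the more the processor and memory of the NI matter, as the following paragraphs discuss.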

Unlike disks, networks are fast compared to modern processors; high-performance NIs are too fast for host I/O busses. For example, one link in a Myrinet LAN has full-duplex bandwidth of 110 MBps (1.28 Gbps), which exceeds the peak bandwidth available on host PCI I/O busses, a bandwidth that must also be shared by all I/O devices. As previously mentioned, networks have been increasing in speed and bandwidth faster than memory [Schoinas and Hill 1998]. Consequently, the performance of the microprocessor, memory and operating system that run the interface card can have tremendous influence on the capabilities of the device for certain applications. The design of high-performance processors and execution environments has become a research issue, and, in the case of network processors, a fledgling industry full of start-up companies and established semiconductor vendors.

A major push behind these efforts is the need to meet the increasing bandwidth and functionality requirements of the Internet. Traditionally, the middle of the network has been kept simple and fast, with sophistication being implemented at the edges. However, to meet these demands, functionality is being pushed from servers on the edge of the network onto internal network nodes. Sophistication is moving into the network in the form of application data caching, tunneling, content distribution techniques, etc. There remain proponents on both sides of this issue. However, it seems unlikely that the services being deployed in the network today can ever be reined back in. The tremendous momentum behind Internet-related technologies has inspired much research in network interface design, communication-oriented operating systems, and large-scale systems that implement network services.

In this section, we first consider the design of modern network interfaces, including the design alternatives investigated in the research literature. Then, we survey various proposals for exploiting the programmability in network interfaces.

4.2 NI Organization

As was the case with disks, modern NIs have all the components found in computer systems: processor, memory, and a communications subsystem. Figure 3 depicts how the Myrinet host interface and most NIs are organized. In this section, we consider each of the major components individually.

4.2.1 Processor

The most important function of the processor is to manage packet delivery and protocol specific tasks. In Myrinet, for example, each network has a manager (chosen manually or automatically by the network) that is responsible for continuously mapping the network by sending messages to all hosts. This mapping enables source routing. So, in addition to managing host packets, the processor must also adhere to the control protocol of the network. The LANai processor, found on the Myrinet NI, is a simple, 32-bit RISC processor clocked at 33 MHz with integrated link and host interfaces. The LANai is a relatively meager processor compared to processors found on other high-end devices. For example, the 3Com 3CR990 ethernet network interface is built around a 200 MHz ARM9 processor core; this device aggressively handles security (IPsec) and TCP segmentation and reassembly completely on the NI [3Com 2000]. Research directions in NI processors can be grouped in two categories: communication processors and network processors. The first category emphasizes low-cost interrupt and message handling, and the second focuses on higher-level packet processing.

Figure 3. Myrinet NI structure. Source: [Boden and Cohen 1995].

Communication Processors. The I/O processors described earlier unburdened the host CPU from handling all the details of data transfers. Similarly, communication processors are used on NIs to manage data movement. Rather than using polling or interrupts on the host (or network processor), NIs use smaller, less powerful communication processors to poll the network for data. Significant work on programmable communication processors has been done in the context of message-passing MPPs. For example, the Stanford FLASH multiprocessor project developed a communication processor called MAGIC to manage the movement of all data between host CPU, memory and the network [Kuskin et al. 1994]. MAGIC managed all data movement on a processor node; thus, in addition to unburdening the host CPU, cache coherence and
communication mechanisms were implemented in software. In the following paragraphs, we consider the effectiveness of communication processors, the feasibility of zero-overhead application message handling on communication processors, and design concerns specific to communication processors built around general-purpose microprocessors.

Recently, [Scheiman and Schauser 1998] evaluated network performance in an MPP both with and without a communication processor using the Meiko CS-2 multiprocessor. Results indicate that implementing application, or user, level message handlers on a communication processor, despite being slower than the host CPU, improves latency. The authors report that the improvement is due to (1) the faster response time of the communication processor, and (2) the task offloading that frees the main processor from polling or handling interrupts. Similar performance evaluations have been published concerning the CMU Nectar communication processor [Menzilcioglu and Schlick 1991, Steenkiste 1992].

Much work has been done on the support needed to perform zero-copy application and user-level messaging on high-speed NIs [Chien et al. 1998, Mukherjee 1998, von Eicken and Vogels 1998]. In one case, [Schoinas and Hill 1998] show that it is possible to perform zero-copy application-level messaging in software on a communication processor. This minimal messaging attempts to move data directly from the NI into the host data structures in main memory. (Here, host refers to either the host CPU or the network processor, depending on which is receiving the data.) The key issue is providing efficient virtual/physical memory address translation [Chen et al. 1998] on the network interface.

Finally, some recent work describes the support needed to implement low-level operations [Cranor et al. 1999] involved with network-specific data transfer on microprocessor-based communication processors. Specifically, they use multiple thread contexts to limit the overhead involved with servicing DMA completion interrupts. This helps reduce overall message latency when using a general-purpose embedded processor. This technique improves communication processor performance by replacing costly polling with low-overhead interrupts. In cases where additional packet processing requirements are low, this approach can remove the need for separate communication and network processors.

Network Processor. The distinction between a communication processor and a network processor is far from settled. A general description would be that communication processors handle low-level data-link protocol details (e.g., ethernet or Myrinet specifics) and message
handling, while network processors perform high-level, network and transport layer processing (e.g., IP and TCP/UDP processing). The tasks carried out by the communication processor are the tasks traditionally performed by NIs. The network processor implements functionality found in host device drivers and applications. This distinction is relatively new, but as the processors found on NIs increase in power, the need for a communication processor to unburden the network processor will grow for the very reasons discussed above.

Industry is producing numerous network processors, most of which employ chip-multiprocessor or fine-grained multithreaded processor architectures, to provide high performance on the NI [Crowley et al. 2000]. For example, the recently announced Prism network processor from Sitera [Sitera 2000] is a 4-processor chip-multiprocessor with hardware support for packet classification and quality of service. Our work, performed here at UW, made the following contributions to network processor research: 1) identified a set of network processor workloads, 2) showed that chip-multiprocessors (CMP) [Nayfeh et al. 1996] and simultaneous multithreaded architectures (SMT) [Tullsen et al. 1995] can exploit packet-level parallelism, while aggressive superscalar and fine-grained multithreaded architectures cannot, 3) showed that packet classification can be performed economically in software on network processors, and 4) showed that SMT adapts better to the variability in real multiprogrammed workloads [Crowley et al. 2000, a paper describing parts 3 and 4 is currently under review].

A problem with this work is that it only reports throughput. The results give no intuition about how these processor designs impact latency, an oversight warned against by [Hennessy and Patterson 1996]. It is likely that latency considerations are slight, given the wide-area applications considered, but the subject should have been addressed.

4.2.2 Memory System

Memory serves as the bridge between the processor and the network. Network interfaces typically use high-speed SRAM to buffer packets. The bandwidth and latency characteristics of the memory system figure prominently in the amount of processing that can be performed on each packet at network speeds. Surprisingly, there has been little reported in the research literature on memory systems for high-performance network interfaces. The general question of how to architect the memory system is open, although there is significant discussion of this in the trade news. For example, the Prism network processor from
Sitera uses optional SRAM and is the first network processor to integrate a RAMBUS memory controller [Crisp 1997]. Recent proposals have appeared in the trade news for integrated DRAM memories on network processors. The idea of using IRAM to buffer packets is somewhat obvious, but unexplored. However, latency is a very important consideration, and standard approaches of integrating processors and memory may not help [Cuppu et al. 1999].

One related proposal uses a standard CPU cache memory to implement single-cycle IP route lookups [Chiueh and Pradhan 1999]. Follow-on work by the same authors includes cache modifications to increase the effectiveness of this technique [Chiueh and Pradhan 2000]. This work is rooted in finding longest-matching prefixes with the dynamic prefix trie ("trie" as in retrieval) data structure [Doeringer et al. 1996]. The technique uses the cache as a hardware assist for performing fast matches between addresses and next hop values; in essence, IP addresses are treated as virtual memory addresses.
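
The operation being accelerated here is longest-matching-prefix lookup: given a destination address, find the most specific route whose prefix matches. The brute-force version below fixes the semantics; trie-based and cache-based schemes such as those cited above accelerate exactly this search. The route table contents are illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /* A route entry: 'prefix' covers the top 'len' bits of a 32-bit IPv4 address. */
    typedef struct { uint32_t prefix; uint8_t len; uint32_t next_hop; } route_t;

    /* Longest-matching-prefix lookup over a small table (linear scan for
     * clarity; real routers use tries, compressed tables, or, as above, the
     * CPU cache). Returns 0 and sets *hop on a match, -1 otherwise. */
    static int lpm_lookup(const route_t *tbl, size_t n, uint32_t addr, uint32_t *hop)
    {
        int best = -1, best_len = -1;
        for (size_t i = 0; i < n; i++) {
            uint32_t mask = tbl[i].len ? ~0u << (32 - tbl[i].len) : 0;
            if ((addr & mask) == tbl[i].prefix && tbl[i].len > best_len) {
                best = (int)i;
                best_len = tbl[i].len;
            }
        }
        if (best < 0) return -1;
        *hop = tbl[best].next_hop;
        return 0;
    }

    int main(void)
    {
        route_t tbl[] = {
            { 0x0A000000, 8,  1 },   /* 10.0.0.0/8  -> next hop 1 */
            { 0x0A010000, 16, 2 },   /* 10.1.0.0/16 -> next hop 2 */
        };
        uint32_t hop;
        if (lpm_lookup(tbl, 2, 0x0A010203, &hop) == 0)   /* 10.1.2.3 */
            printf("next hop %u\n", hop);                /* prints 2: longest match wins */
        return 0;
    }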

4.2.3 Communication

Network interfaces connect to the network medium on one side and the host interface on the other. With an eye toward large-scale routers, several companies are developing very fast back-end interconnects that permit high-bandwidth, low-latency transfers between network interfaces. The proposals from the common switch interface (CSIX) consortium [CSIX 2000] and the IX architecture forum [LevelOne 1999] are seeking to standardize these interfaces. Not surprisingly, research on interconnects for message-passing multiprocessors has inspired these efforts. For example, Avici Systems, a start-up founded by Bill Dally from Stanford/MIT, basically uses the J-Machine, and in particular its interconnect, to do terabit routing [Avici 2000].

4.3 Control Software

Control on the Myrinet interface is the responsibility of the Myrinet control program (MCP) that is loaded into device memory on boot-up. The MCP implements network-specific control processing, such as network mapping, and handles DMA requests both within the interface and into host memory. The research community has embraced Myrinet due in large part to its open interfaces and open, modifiable MCP [Bhoedjang et al. 1998].

Key challenges in the design of control software include supporting low-overhead user-level messages, because going through the OS for permission checking on every message is too slow, and minimizing the number of copy operations [Eicken et al. 1995]. A good survey of efficient techniques for user-level messages is provided by [Bhoedjang et al. 1998]. Communication in general is very latency sensitive; in many cases latency matters just as much as throughput. SPINE describes a safe, extensible system for executing application-specific code on programmable NIs [Fiuczynski et al. 1998].
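
One common way to obtain low-overhead user-level sends, sketched here under simplifying assumptions (a pinned, shared message buffer and a single sender), is a descriptor ring that the application writes directly and the NI processor polls; no system call or kernel copy is needed on the send path. The structure is illustrative and is not the U-Net or SPINE interface.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative send ring shared (e.g., via mmap) between an application
     * and the NI processor; not an actual U-Net/SPINE data structure. */
    #define RING_SLOTS 64

    typedef struct {
        uint64_t buf_addr;               /* address of pinned message buffer  */
        uint32_t len;                    /* message length in bytes           */
        volatile uint32_t ready;         /* set by host, cleared by NI        */
    } send_desc_t;

    typedef struct {
        send_desc_t slot[RING_SLOTS];
        uint32_t head;                   /* next slot the host will fill      */
    } send_ring_t;

    /* Host side: posting a send is just a few stores into shared memory;
     * no system call and no kernel copy are involved. */
    static int post_send(send_ring_t *r, uint64_t buf, uint32_t len)
    {
        send_desc_t *d = &r->slot[r->head % RING_SLOTS];
        if (d->ready)
            return -1;                   /* ring full; caller retries later   */
        d->buf_addr = buf;
        d->len = len;
        d->ready = 1;                    /* the NI's polling loop picks this up */
        r->head++;
        return 0;
    }

    int main(void)
    {
        static send_ring_t ring;         /* zero-initialized: all slots free  */
        char msg[] = "hello";
        return post_send(&ring, (uint64_t)(uintptr_t)msg, sizeof msg);
    }

Protection checks in such schemes are moved out of the per-message path: the OS validates and pins the shared region once at setup time, which is precisely the design point the user-level messaging literature explores.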

4.4 Applications

We have considered the motivations for and designs of programmable NIs. In this section, we survey the applications and techniques that have been proposed for exploiting this programmability.

4.4.1 Fast LANs/System Area Networks

As mentioned previously, Myrinet employs host controllers built around microprocessors. Since the introduction of Myrinet, manufacturers have produced solutions for fast LANs employing the same technique. For example, Asanté’s GigaNIX gigabit ethernet NI is built around two 32-bit embedded RISC processors [Asanté 2000]. Programmability is generally required in these devices since the network-specific control (i.e., network mapping, flow control) is more easily and economically implemented in software on cost-effective embedded microprocessors.

4.4.2 Computing at Network Speeds

Emerging network applications and services require a fast path that does not involve the latency penalty associated with crossing the I/O bus to get to the host CPU. Examples of such services include IPsec, routing, server load-balancing, and quality-of-service (QoS). The additional latency to get to the host CPU makes these services infeasible at network speeds. Hence, a network processor is included on the network interface to execute these applications. These applications are indicative of the general trend of pushing more computation and sophistication into the network, as discussed in Section 4.2.1. Other examples of this trend include web caching, network-address translation (NAT), firewalls, and virtual private networks
(VPNs). This trend has the potential to radically increase the computational resources required at each link in the network. The execution of many applications at network speeds requires a significant amount of processing power that, furthermore, scales with the number of network connections in a computer system. This trend has particular significance for services running at the backbone of large internetworks. A special class of machines, traditionally called routers, services many network links simultaneously. It is particularly necessary to execute network services on network interfaces in these devices so that processing power, and hence overall service performance, can scale with the number of links. A number of researchers have proposed the use of high-performance programmable network interfaces connected via a fast interconnect to implement large-scale routing systems [Peterson et al. 1999, Walton et al. 1998]. This proposal closely matches what is actually taking place in industry.

4.4.3 Active Networks

Active networking is a new approach to network design that provides a customizable infrastructure to support the rapid evolution of new transport and application services by enabling users to upload code into programmable network nodes. This is the most aggressive example of computing at network speeds: each packet can contain a unique program. The last few years have seen considerable coverage of active network research. Directions of inquiry have included designs for software platforms and programming models [Wetherall et al. 1999] [Hicks et al. 1999], active network node architectures [Decasper et al. 1999] [Nygren et al. 1999], and operating systems for active nodes [Merugu et al. 2000], including emphases on QoS [Alexander et al. 2000] and security [Campbell et al. 2000]. Two recent articles comment on the results thus far [Smith et al. 1999] and lessons learned [Wetherall 1999]. This proposal poses big challenges in safety, performance, and management.

4.5 Summary

The preceding section surveyed the motivations, designs and applications for programmable network interfaces. More so than for disks, innovative designs and research proposals for programmable NIs are being investigated to meet the growing performance and functionality requirements of next-generation networks.

5 Other Examples

There are other examples of peripheral devices that are now programmable, including graphic display adapters and printers. Graphic display adapters have for many years implemented graphics pipelines and other display primitives at the device. Considerable work has been done on graphics- and media-specific programmable architectures [Basoglu et al. 1999, Rixner et al. 1998]. The relatively high bandwidth required for graphics on consumer PCs led Intel to devise the accelerated graphics port (AGP) [Intel 2000]. The AGP bus uses the same I/O “switch” as the processor and main memory in order to “fatten and shorten the pipe” between the processor on the graphics card and main memory. Intel had noticed that graphics cards were beginning to ship with significant amounts of memory, forcing the graphics-oriented multimedia processors on those cards to manage memory in a fashion similar to the host CPU. This extension also allows Intel’s MMX extensions to speed graphics processing in ways that were infeasible across the standard peripheral I/O bus.

PostScript printers have been programmable I/O devices from the beginning [Tennenhouse and Wetherall 1996]. A PostScript document is, in fact, a program generated by an application that is sent to the printer and interpreted by the printer's control microprocessor.

6 Common Issues

In this section, we consider a set of issues common to both programmable disks and programmable NIs.

6.1 The Future of Programmability

Is programmability here to stay? There are at least two reasons why these devices may cease to be programmable: 1) ASICs become more cost-effective at implementing the required functionality, or 2) vastly faster host CPUs connected to passive peripherals via fast, switched I/O networks make relatively slow embedded processors a performance liability. The first issue presumes that these devices do not need the flexibility offered by software, and is, to a large extent, answered by the state of the industry today. The integration of a microprocessor core with device specific hardware and interfaces seems to be the preferred solution for the time being. However, if application specific hardware design were to unexpectedly become fast and cost-effective, this could change. A general framework for
intelligent I/O devices is gaining industry support, and will likely help keep these devices programmable [I2O Special Interest Group 1997].

The second issue is a more serious challenge. If the performance gap between desktop CPUs and embedded processors begins to widen, and desktop machines adopt a fast switched I/O interconnect between the CPU and peripherals [Mukherjee and Hill 1997], then clustering passive, low-end disks with powerful CPUs may be more cost-effective than a system of high-end active peripherals. [Keeton et al. 1998] do not expect this to happen. They cite cost/power/price differences between desktop and embedded microprocessors that range between 5X and 20X but translate to SPECint 95 performance differences of only 1.5X to 2X. Desktop CPU markets can afford to pay heavily for marginal improvements in performance on SPECint 95. However, these improvements are not justified or beneficial in embedded systems. [Uysal et al. 2000] show that performance is comparable between clusters and active disk systems on the workloads that inspired active disks; the only difference is cost. These researchers report active disk system costs at less than half of the cluster system cost. Regardless, since people the world over are currently programming clusters to solve real problems, and active disks do not exist yet, this issue remains a serious challenge.

In any case, the trend of system-level integration, which leads to systems-on-a-chip (SOCs), is not likely to stop any time soon, particularly given that communication, and wires, will continue to be the expensive resource going forward as memory and compute resources become ever more plentiful. This trend favors any solution that involves a high computation-to-communication ratio.

6.2 Research Directions

In this section, we propose a number of research issues that face programmable peripherals. As distributed systems go, the environmental conditions for systems of network peripherals are pleasant: a high-performance and reliable interconnect, reasonable processing and memory resources, and a single administrative domain. Given this environment, the items listed here can be considered properties necessary or desirable in systems comprised of programmable peripherals.

6.2.1 Comprehensive programming model

Programming model concerns include safe extensions, reasonable host/peripheral interaction, reasonable host/host interaction, and reasonable support for scalable software. A number of proposals have recommended programming models for individual device functions. One active disk proposal [Acharya et al. 1998] describes a programming model for partitioning certain database applications between hosts and disks, but the model does not address interactions with or support for other types of applications. Furthermore, the programmer manages all communication. This recalls the programming models for message-passing multiprocessors, which are hard to program since all applications require heavy customization. For network interfaces, pattern-based languages [Begel et al. 1999, Engler and Kaashoek 1996] and object-oriented systems [Morris et al. 1999] for packet classification, filtering and routing have been proposed. These are heavily used, but no proposals have been made to integrate these functions into a comprehensive programming model for network interfaces. The software system also needs to provide protection from untrusted, malicious and faulty code. The SPINE operating system [Fiuczynski et al. 1998] advocates the use of safely extensible operating systems to govern the operation of these programmable peripherals. This notion is descended from the work developed in user-extensible operating systems such as SPIN [Bershad et al. 1995] and Exokernel [Engler et al. 1995].
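
Any comprehensive model for network interfaces would need to subsume the packet classification these systems perform. For concreteness, the sketch below shows the core operation of rule-based classification (matching header fields against an ordered rule list); the field and rule layouts are assumptions for illustration, not a proposed interface.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative 5-tuple-style header summary and classification rule. */
    typedef struct { uint32_t src, dst; uint16_t dport; uint8_t proto; } hdr_t;
    typedef struct {
        uint32_t dst, dst_mask;          /* destination address and mask      */
        uint16_t dport;                  /* 0 means "any port"                */
        uint8_t  proto;                  /* 0 means "any protocol"            */
        int      action;                 /* e.g., queue id, drop, forward     */
    } rule_t;

    /* First-matching-rule classification, in the spirit of filter languages
     * such as BPF-style pattern matchers; returns the matching rule's action
     * or -1 if nothing matches. */
    static int classify(const rule_t *rules, size_t n, const hdr_t *h)
    {
        for (size_t i = 0; i < n; i++) {
            if ((h->dst & rules[i].dst_mask) != rules[i].dst) continue;
            if (rules[i].dport && rules[i].dport != h->dport) continue;
            if (rules[i].proto && rules[i].proto != h->proto) continue;
            return rules[i].action;
        }
        return -1;
    }

    int main(void)
    {
        rule_t rules[] = {
            { 0x0A000000, 0xFF000000, 80, 6, 1 },   /* TCP port 80 to 10/8 -> action 1 */
            { 0x00000000, 0x00000000,  0, 0, 0 },   /* default             -> action 0 */
        };
        hdr_t h = { 0xC0A80001, 0x0A000001, 80, 6 };
        return classify(rules, 2, &h);              /* matches the first rule */
    }

A comprehensive programming model would express such rules and their actions within the same framework used for the rest of the application, rather than in a separate special-purpose language, which is exactly the integration gap noted above.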

6.2.2 Platform independence

Traditional disks and network interfaces are integrated into server and workstation operating systems through device drivers. As application code migrates onto these peripherals, however, application code compatibility becomes an issue. Across-the-board device compatibility is necessary to keep the programmable peripheral market a commodity market, and therefore price competitive with passive peripherals. This can be achieved via object-based interfaces, such as CORBA and COM, or through the use of platform independent binary code executed on virtual machines such as the Java VM [Sirer et al. 1999]. Platform independence is tightly integrated with the overall design of the execution environment.

6.2.3 Support for unbalanced performance

Applications commonly exhibit "hot spots" in which certain portions of the data require a relatively greater amount of work. This issue has not been raised with the full-scan database workloads initially considered for Active Disks; however, it will be a concern for more general-purpose applications of programmable disks. Similarly, as components fail, it is likely that older devices will be replaced with newer ones with greater performance. This introduces an imbalance in the ideal partitioning of work onto devices. The general need is for execution-environment support for load balancing across applications and devices with varying performance characteristics.
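
One simple form such support could take, assuming each device advertises or is measured to have a relative performance weight, is to partition work in proportion to those weights rather than evenly; a real execution environment would also need to re-partition as hot spots appear or devices are replaced. The sketch below illustrates the proportional split.

    #include <stddef.h>
    #include <stdio.h>

    /* Split 'total_units' of work across devices in proportion to their
     * measured (or advertised) performance weights. Purely illustrative; a
     * real environment would also rebalance dynamically as hot spots appear. */
    static void partition(const double *weight, size_t ndev,
                          long total_units, long *share)
    {
        double sum = 0.0;
        for (size_t i = 0; i < ndev; i++) sum += weight[i];
        long assigned = 0;
        for (size_t i = 0; i < ndev; i++) {
            share[i] = (long)(total_units * (weight[i] / sum));
            assigned += share[i];
        }
        share[ndev - 1] += total_units - assigned;  /* rounding leftovers to last device */
    }

    int main(void)
    {
        double w[3] = { 1.0, 1.0, 2.0 };            /* one newer, faster device */
        long s[3];
        partition(w, 3, 1000, s);
        printf("%ld %ld %ld\n", s[0], s[1], s[2]);  /* prints 250 250 500 */
        return 0;
    }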

6.2.4 Support for multiprogrammed workloads

The execution environment must also support multiple tasks simultaneously. The initial Active Disk proposals have avoided this completely, limiting their studies to single applications. SPINE has started to address this for NIs. In addition to balancing between applications, devices such as disks and network interfaces have real-time constraints that must be managed. A disk controller, for example, must be able to schedule disk arm movements, client requests, and buffering tasks simultaneously. It is unclear that the resource sharing techniques implemented in standard operating systems will work well under these conditions. With network interfaces, certain applications, such as guaranteeing a particular quality of service, may require a tighter integration between the mechanisms allocating processor resources and the mechanisms allocating network resources.

6.2.5 Support for centralized control

As noted in [Sirer et al. 1999], centralized control makes some difficult problems much easier. These problems include managing software versions, security, auditing, and performance analysis. Centralized control that does not limit performance is a nice property in distributed systems; an execution environment for programmable devices should provide it. [Keeton et al. 1998] contend that IDISKs will do away with much of the administrative cost associated with clusters by assuming that IDISKs will incur the support and maintenance costs of disks rather than cluster nodes. This may be the case since, for example, diagnostic checks may be simpler in integrated devices as compared to desktop-like systems where many components may fail and need to be tested separately. In general, taking steps to ease the
administrative problem in distributed systems is a valuable line of research with enormous potential impact.

7 Conclusion

This report has surveyed the motivations, designs and applications for programmable disks and network interfaces. These programmable peripherals have all the basic components found in computer systems: a microprocessor, memory, and a communications subsystem. Given today’s technology trends, embedded processors, and, hence, programmability, are likely to have a permanent place in these devices. In order to improve I/O performance, a number of proposals have been made for migrating data- and application-specific functions from servers onto these devices. Initial implementations have provided functionality and support for specific tasks, such as a decision support database environment for disks and language and programming systems for packet classification on network interfaces. Based on these findings, a remaining challenge for this general approach is to provide a software model that incorporates a comprehensive programming model and the right set of libraries and OS services, including security and resource management, needed in these peripheral environments.

This report concludes with some specific areas of future research. These directions focus on programmable network interfaces, and, in particular, network processor design, which is the author’s current field of research.

1. Memory system design for network processors. This study compares the performance of modern and proposed memory technologies and cache hierarchies for a selection of network interface organizations.

2. Analytical performance model for network processors. This model incorporates the parallelism found in network workloads, the parallelism exploited by modern network processor architectures, and includes memory system parameters. The purpose is to guide network interface provisioning and give intuition concerning the relative importance of processor and memory improvements.

3. Thread scheduling on SMT for network processor workloads. Initial results have suggested that more sophisticated thread scheduling policies on SMT may be beneficial for network processor workloads. This study examines ideal resource allocation for these workloads and investigates scheduling policies on SMT that approximate the ideal.

References

[3Com 2000] 3Com. The EtherLink® 10/100 PCI NIC with 3XP Processor. http://www.3com.com/technology/tech_net/tech_briefs/500907.html, 2000.
[Acharya et al. 1998] A. Acharya, M. Uysal, and J. Saltz. Active Disks: programming model, algorithms and evaluation. Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 81-91. San Jose, November 1998.
[Adams and Ou 1997] L. Adams and M. Ou. Processor Integration in a Disk Controller. IEEE Micro, vol. 14, no. 4, July 1997.
[Alexander et al. 2000] D.S. Alexander, W.A. Arbaugh, A.D. Keromytis, S. Muir, and J.M. Smith. Secure Quality of Service Handling: SQoSH. IEEE Communications, vol. 38, no. 4, pp. 106-112, April 2000.
[Amdahl et al. 1964] G.M. Amdahl, G.A. Blaauw, and F.P. Brooks, Jr. Architecture of the IBM System/360. IBM Journal of Research and Development, April 1964.
[Anderson et al. 1996] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli, and R.Y. Wang. Serverless Network File Systems. ACM Transactions on Computer Systems, vol. 14, no. 1, pp. 41-79, Feb. 1996.
[Arpaci-Dusseau et al. 1998] R.H. Arpaci-Dusseau, A.C. Arpaci-Dusseau, D.E. Culler, J.M. Hellerstein, and D.A. Patterson. The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs. Proceedings of HPCA. Las Vegas, 1998.
[Asanté 2000] Asanté. GigaNIX Gigabit Ethernet Adapter. http://www.asante.com/new/2000/GigaNIX.html, 2000.
[Avici 2000] Avici. The Avici Terabit Switch Router. http://www.avici.com, 2000.
[Basoglu et al. 1999] C. Basoglu, R. Gove, K. Kojima, and J. O'Donnell. Single-Chip Processor for Media Applications: The MAP1000™. International Journal of Imaging Systems and Technology, 1999.
[Begel et al. 1999] A. Begel, S. McCanne, and S.L. Graham. BPF+: Exploiting Global Data-Flow Optimization in a Generalized Packet Filter Architecture. Proceedings of the ACM Communication Architectures, Protocols, and Applications (SIGCOMM ’99), 1999.
[Bershad et al. 1995] B.N. Bershad, S. Savage, P. Pardyak, E.G. Sirer, M.E. Fiuczynski, D. Becker, S. Eggers, and C. Chambers. Extensibility, Safety and Performance in the SPIN Operating System. Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (SOSP), December 1995.
[Bhoedjang et al. 1998] R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal. User-level network interface protocols. Computer, vol. 31, no. 11, pp. 53-60, Nov. 1998.
[Boden and Cohen 1995] N. Boden and D. Cohen. Myrinet: A Gigabit-per-Second Local-Area Network. IEEE Micro, vol. 15, no. 1, pp. 29-36, 1995.
[Campbell et al. 2000] R.H. Campbell, L. Zhaoyu, M.D. Mickunas, P. Naldurg, and Y. Seung. Seraphim: dynamic interoperable security architecture for active networks. Proceedings of the IEEE 3rd Conf. on Open Architectures and Network Programming, pp. 55-64, 2000.
[Chen et al. 1998] Y. Chen, C. Dubnicki, S. Damianakis, A. Bilas, and K. Li. UTLB: A Mechanism for Address Translation on Network Interfaces. Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1998.
[Chien et al. 1998] A.A. Chien, M.D. Hill, and S.S. Mukherjee. Design Challenges for High-Performance Network Interfaces. IEEE Micro, pp. 42-44, 1998.
[Chiueh and Pradhan 1999] T.-C. Chiueh and P. Pradhan. High-Performance IP Routing Table Lookup Using CPU Caching. Proceedings of INFOCOM ’99, pp. 1421-1428, 1999.
[Chiueh and Pradhan 2000] T.-C. Chiueh and P. Pradhan. Cache Memory Design for Network Processors. Proceedings of the 6th Int’l Symp. on High-Performance Computer Architecture, January 2000.
[Cirrus Logic 2000] Cirrus Logic. New Open-Processor Platform Enables Cost-Effective System-on-a-Chip Solutions for Hard Disk Drives. http://www.cirrus.com/3ci, 2000.
[Cohen et al. 1992] D. Cohen, G. Finn, R. Felderman, and A. DeSchon. The ATOMIC LAN. Proceedings of the IEEE Workshop on the Architecture and Implementation of High Performance Communication Subsystems, 1992.
[Cormier et al. 1983] R.L. Cormier, R.J. Dugan, and R.R. Guyette. System/370 Extended Architecture: The Channel Subsystem. IBM Journal of Research and Development, vol. 27, no. 3, pp. 206-218, 1983.
[Cranor et al. 1999] C.D. Cranor, R. Gopalakrishnan, and P.Z. Onufryk. Architectural Considerations for CPU and Network Interface Integration. Proceedings of Hot Interconnects. Stanford, CA, 1999.


[Crisp 1997] R. Crisp. Direct RAMbus Technology: the New Main Memory Standard. IEEE Micro, vol. 17, no. 6, pp. 18-28, March 1997.
[Crowley et al. 2000] P. Crowley, M.E. Fiuczynski, J.-L. Baer, and B.N. Bershad. Characterizing Processor Architectures for Programmable Network Interfaces. Proceedings of the International Conference on Supercomputing, pp. 54-65. Santa Fe, N.M., May 8-11, 2000.
[CSIX 2000] CSIX. CSIX: The Common Switch Interface Consortium. http://www.csix.org/, 2000.
[Cuppu et al. 1999] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A Performance Comparison of Contemporary DRAM Architectures. Proceedings of the 26th Int’l Symp. on Computer Architecture, pp. 222-233, 1999.
[Dally et al. 1992] W.J. Dally, J.A.S. Fiske, J.S. Keen, R.A. Lethin, M.D. Noakes, P.R. Nuth, R.E. Davison, and G.A. Fyler. The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro, pp. 23-39, 1992.
[Decasper et al. 1999] D.S. Decasper, B. Plattner, G.M. Parulkar, C. Sumi, J.D. DeHart, and T. Wolf. A Scalable High-Performance Active Network Node. IEEE Network, vol. 13, no. 1, pp. 8-19, Jan.-Feb. 1999.
[Doeringer et al. 1996] W. Doeringer, G. Karjoth, and M. Nassehi. Routing on Longest-Matching Prefixes. IEEE/ACM Transactions on Networking, vol. 4, no. 1, pp. 86-97, Feb. 1996.
[Eicken et al. 1995] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. Proceedings of the 15th ACM Symp. on Operating Systems Principles, pp. 40-53, 1995.
[Engler and Kaashoek 1996] D.R. Engler and M.F. Kaashoek. DPF: Fast, Flexible Message Demultiplexing using Dynamic Code Generation. Proceedings of the ACM Communication Architectures, Protocols, and Applications (SIGCOMM ’96), 1996.
[Engler et al. 1995] D.R. Engler, M.F. Kaashoek, and J. O’Toole. Exokernel: an operating system architecture for application-level resource management. Proceedings of the 15th ACM Symp. on Operating Systems Principles, pp. 251-266, 1995.
[FCIA 2000] FCIA. Fibre Channel Technology Overview. The Fibre Channel Industry Association, http://www.fibrechannel.org/, 2000.
[Fiuczynski et al. 1998] M.E. Fiuczynski, R.P. Martin, T. Owa, and B.N. Bershad. SPINE: An Operating System for Intelligent Network Adapters. Proceedings of the Eighth ACM SIGOPS European Workshop, pp. 7-12. Sintra, Portugal, September 1998.
[Gibson et al. 1997] G. Gibson, D. Nagle, K. Amiri, F.W. Chang, E. Feinberg, H. Gobioff, C. Lee, B. Ozceri, E. Riedel, D. Rochberg, and J. Zelenka. File Server Scaling with Network-Attached Secure Disks. Proceedings of SIGMETRICS, June 1997.
[Gray 1998] J. Gray. Put Everything in the Disk Controller. ’98 NASD Workshop, http://research.microsoft.com/~gray/talks/Gray_NASD_Talk.ppt, 1998.
[Hennessy and Patterson 1996] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach, Second Edition. Morgan Kaufmann Publishers, 1996.
[Hicks et al. 1999] M. Hicks, J.T. Moore, D.S. Alexander, C.A. Gunter, and S.M. Nettles. PLANet: An Active Internetwork. Proceedings of INFOCOM, pp. 1124-1133, 1999.
[Hill et al. 2000] M.D. Hill, N.P. Jouppi, and G.S. Sohi. Readings in Computer Architecture, First ed. Morgan Kaufmann, 2000.
[I2O Special Interest Group 1997] I2O Special Interest Group. Intelligent I/O (I2O) Architecture Specification v1.5. Available from www.i2osig.org, March 1997.
[Intel 2000] Intel. Accelerated Graphics Port Technology. http://www.intel.com/technology/agp/index.htm, 2000.
[Keeton et al. 1997] K. Keeton, R. Arpaci-Dusseau, and D.A. Patterson. IRAM and SmartSIMM: Overcoming the I/O Bus Bottleneck. Proceedings of the Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.
[Keeton et al. 1998] K. Keeton, D.A. Patterson, and J.M. Hellerstein. A Case for Intelligent Disks (IDISKs). SIGMOD Record, vol. 27, no. 3, Nov. 1998.
[Kuskin et al. 1994] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH Multiprocessor. Proceedings of the 21st Int’l Symp. on Computer Architecture, pp. 302-313, April 1994.
[LevelOne 1999] LevelOne. IX Architecture Whitepaper. An Intel Company, 1999.
[Menzilcioglu and Schlick 1991] O. Menzilcioglu and S. Schlick. Nectar CAB: a high-speed network processor. Proceedings of the 11th Int’l Conf. on Distributed Computing Systems, pp. 508-515, 1991.


[Merugu et al. 2000] S. Merugu, S. Bhattacharjee, E. Zegura, and K. Calvert. Bowman: A Node OS for Active Networks. Proceedings of INFOCOM 2000, pp. 1127-1136, 2000.
[Moore 1965] G.E. Moore. Cramming more components onto integrated circuits. Electronics, pp. 114-117, April 1965.
[Morris et al. 1999] R. Morris, E. Kohler, J. Jannotti, and M.F. Kaashoek. The Click Modular Router. Proceedings of the 17th ACM Symp. on Operating Systems Principles, Dec. 1999.
[Mukherjee 1998] S.S. Mukherjee. Design and Evaluation of Network Interfaces for System Area Networks. Doctoral Dissertation, Computer Science, University of Wisconsin, Madison, 1998.
[Mukherjee and Hill 1997] S.S. Mukherjee and M.D. Hill. A Case for Making Network Interfaces Less Peripheral. Proceedings of Hot Interconnects V. Stanford, August 1997.
[Nayfeh et al. 1996] B.A. Nayfeh, L. Hammond, and K. Olukotun. Evaluation of Design Alternatives for a Multiprocessor Microprocessor. Proceedings of the 23rd International Symposium on Computer Architecture, pp. 67-77, May 1996.
[Nygren et al. 1999] E.L. Nygren, S.J. Garland, and M.F. Kaashoek. PAN: A High-Performance Active Network Node Supporting Multiple Mobile Code Systems. Proceedings of the 2nd Conf. on Open Architectures and Network Programming, pp. 78-89, 1999.
[Patterson et al. 1997] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A Case for Intelligent RAM: IRAM. IEEE Micro, pp. 34-44, 1997.
[Patterson and Keeton 1998] D. Patterson and K. Keeton. Hardware Technology Trends and Database Opportunities. Slides from SIGMOD ’98 Keynote Address, 1998.
[Patterson et al. 1988] D.A. Patterson, G. Gibson, and R.H. Katz. A case for redundant arrays of inexpensive disks (RAID). Proceedings of the ACM SIGMOD Conference. Chicago, IL, June 1988.
[Peterson et al. 1999] L. Peterson, S. Karlin, and K. Li. OS Support for General-Purpose Routers. Proceedings of the HotOS Workshop, March 1999.
[Riedel 1999] E. Riedel. Active Disks - Remote Execution for Network-Attached Storage. Doctoral Dissertation, Carnegie Mellon University, Tech. Report CMU-CS-99-177, Pittsburgh, PA, Nov. 1999.
[Riedel et al. 1998] E. Riedel, G. Gibson, and C. Faloutsos. Active Storage for Large-Scale Data Mining and Multimedia. Proceedings of VLDB, Aug. 1998.
[Rixner et al. 1998] S. Rixner, W.J. Dally, U.J. Kapasi, B. Khailany, A. López-Lagunas, P.R. Mattson, and J.D. Owens. A Bandwidth-Efficient Architecture for Media Processing. Proceedings of the 31st Int’l Symp. on Microarchitecture, pp. 3-13, Nov. 1998.
[Ruemmler and Wilkes 1994] C. Ruemmler and J. Wilkes. An introduction to disk drive modeling. IEEE Computer, vol. 27, no. 3, pp. 17-28, 1994.
[Scheiman and Schauser 1998] C.J. Scheiman and K.E. Schauser. Evaluating the Benefits of Communication Coprocessors. Journal of Parallel and Distributed Computing, vol. 57, no. 2, pp. 236-256, 1998.
[Schoinas and Hill 1998] I. Schoinas and M.D. Hill. Address Translation Mechanisms in Network Interfaces. Proceedings of the 4th Int’l Symp. on High Performance Computer Architecture, 1998.
[Seitz 1985] C.L. Seitz. The cosmic cube. Communications of the ACM, vol. 28, no. 1, pp. 22-33, 1985.
[Sirer et al. 1999] E.G. Sirer, R. Grimm, A.J. Gregory, and B.N. Bershad. Design and implementation of a distributed virtual machine for networked computers. Proceedings of the 17th ACM Symp. on Operating Systems Principles, pp. 202-216, Dec. 1999.
[Sitera 2000] Sitera. The PRISM IQ2000 Network Processor Family. http://www.sitera.com, 2000.
[Smith et al. 1999] J.M. Smith, K.L. Calvert, S.L. Murphy, H.K. Orman, and L.L. Peterson. Activating Networks: A Progress Report. IEEE Computer Magazine, vol. 32, no. 4, pp. 32-41, April 1999.
[Smotherman 1989] M. Smotherman. A Sequencing-Based Taxonomy of I/O Systems and Review of Historical Machines. Computer Architecture News, vol. 17, no. 5, pp. 5-15, 1989.
[Steenkiste 1992] P. Steenkiste. Analysis of the Nectar Communication Processor. Proceedings of the IEEE Workshop on the Architecture and Implementation of High Performance Communication Subsystems, pp. 1-3, 1992.
[Tennenhouse and Wetherall 1996] D.L. Tennenhouse and D.H. Wetherall. Towards an Active Network Architecture. ACM Computer Communications Review, vol. 26, no. 2, pp. 5-18, 1996.
[Tullsen et al. 1995] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 392-403. Santa Margherita Ligure, Italy, June 1995.
[Uysal et al. 2000] M. Uysal, A. Acharya, and J. Saltz. Evaluation of Active Disks for Decision Support Databases. Proceedings of the 6th Int’l Symp. on High-Performance Computer Architecture, pp. 337-348, 2000.


[von Eicken and Vogels 1998] T. von Eicken and W. Vogels. Evolution of the Virtual Interface Architecture. IEEE Micro, pp. 61-68, November 1998.
[Walton et al. 1998] S. Walton, A. Hutton, and J. Touch. Efficient High-Speed Data Paths for IP Forwarding using Host Based Routers. Proceedings of the Ninth IEEE Workshop on Local and Metropolitan Area Networks, May 1998.
[Wang et al. 1999] R.Y. Wang, T.E. Anderson, and D.A. Patterson. Virtual Log Based File Systems for a Programmable Disk. Proceedings of the Third USENIX Operating System Design and Implementation Conference. New Orleans, LA, February 1999.
[Wetherall 1999] D. Wetherall. Active network vision and reality: lessons from a capsule-based system. Proceedings of the 17th ACM Symp. on Operating Systems Principles, pp. 64-79, Dec. 1999.
[Wetherall et al. 1999] D. Wetherall, J. Guttag, and D. Tennenhouse. ANTS: Network Services Without the Red Tape. IEEE Computer Magazine, vol. 32, no. 4, April 1999.
[Yang and Lebeck 2000] C.-L. Yang and A.R. Lebeck. Push vs. Pull: Data Movement for Linked Data Structures. Proceedings of the International Conference on Supercomputing, pp. 176-186. Santa Fe, N.M., May 8-11, 2000.
