A TeraFLOP Supercomputer in 1996: the ASCI TFLOP System Timothy G. Mattson, David Scott, Stephen Wheat Intel Corporation, Enterprise Server Group Beaverton, OR, 97006 {tgm, dscott, srwheat}@ssd.intel.com

Abstract

To maintain the integrity of the US nuclear stockpile without detonating nuclear weapons, the DOE needs the results of computer simulations that overwhelm the world’s most powerful supercomputers. Responding to this need, the DOE initiated the Accelerated Strategic Computing Initiative (ASCI), a program that accelerates the development of new scalable supercomputers, beginning with a TeraFLOP computer before the end of 1996. In September of 1995, the DOE announced that it would work with Intel Corporation to build the ASCI TFLOP supercomputer. This system uses Commodity Commercial Off The Shelf (CCOTS) components to keep the price under control and will contain over 9,000 Intel Pentium Pro processors. In this paper, we describe the hardware and software design of this supercomputer.

1

Introduction

Nuclear deterrence is a cornerstone of US national security. A nuclear stockpile can only serve its deterrence role if its integrity is without question. Traditionally, this assurance came from nuclear testing. For a variety of political, economic, and technical reasons, however, nuclear testing has been eliminated, forcing the US Department of Energy (DOE) to find other ways to ensure the integrity of its nuclear stockpile. The DOE believes it can maintain the stockpile’s integrity through a combination of laboratory experiments and massive computer simulations. These simulations are so complex that today’s largest supercomputers cannot run them. While the supercomputer industry will eventually develop systems with the required performance, the DOE cannot afford to wait. To accelerate the availability of sufficiently powerful supercomputers, the DOE initiated an aggressive technology development program called the

Accelerated Strategic Computing Initiative (ASCI). The goal of the ASCI program is to deploy a 1 TFLOP (Trillion Floating Point Operations per second) supercomputer by the end of 1996, a 10 TFLOP system by 1999, and a 100 TFLOP system by 2002. Ideally, these systems should be available at similar costs. The first phase of the program is the 1 TFLOP system. This system must satisfy the following specific performance goals:
• Deliver a sustained TeraFLOP on MP-LINPACK before the end of 1996.
• Run a yet-to-be-defined ASCI application using all the memory and all the nodes by June of 1997.
One of the vendors capable of building a TFLOP computer is the Enterprise Server Group at Intel Corporation. Intel has a long track record in building large, scalable supercomputers, the most recent example being the Paragon supercomputer. Paragon supercomputers use a 2D mesh interconnection facility (ICF) that can move messages at a peak uni-directional bandwidth of 200 MBytes per second. Each Paragon node contains two (the GP node) or three (the MP node) Intel i860 processors. The system consists of three scalable partitions:
• Service Partition: interactive login sessions, application development, and network connections.
• I/O Partition: disk I/O and parallel file systems.
• Compute Partition: where the actual computing takes place.
The OS presents a single system image to the user with a full OSF/1 distributed Unix (the Paragon OS) on each node. The Paragon architecture is scalable to large numbers of nodes. Last December, for example, Intel worked with scientists from Sandia and Oak Ridge National Laboratories to set the MP-LINPACK world record of 281 GigaFLOPS [1] on a Paragon XP/S supercomputer containing over 2,000 nodes.

Using the Paragon architecture as a starting point for its TFLOP supercomputer design, Intel submitted the winning proposal to build the 1 TFLOP ASCI supercomputer. The system design uses commodity commercial off the shelf (CCOTS) components throughout. These components keep the overall system costs low. More importantly, the use of CCOTS components puts the ASCI program on the price/performance curve of the commodity microprocessor market.

Figure 1: ASCI TFLOP System Floorplan. The size of Sandia’s current Paragon XP/S 140 supercomputer is shown by the shaded region. The floorplan callouts show: 74 Compute Cabinets (64 nodes per cabinet, 2 x 8 x 2 ICF, populated with 4,536 Compute Nodes at 200 MHz); 1 Service & Spares Cabinet (64 nodes per cabinet, populated with 32 Service Nodes and 16 on-line "hot" spare nodes); 8 RAID Cabinets (256 GBytes per cabinet; 4 black and 4 red); the ICF switch ("color change"); black and red access and data areas; and the black and red I/O subsystems, each with 1 boot/Ethernet node, 1 Node Station (RAS) node and "hot" spare, 3 double ATM nodes (5 OC12 channels), 16-20 RAID nodes (2 FW-SCSI channels each), 8 I/O node locations for spares/expansion, and 1 Ethernet & FDDI node.

In Figure 1, we show a schematic representation of Intel’s ASCI TFLOP supercomputer. The system occupies approximately 1,600 sq. ft. of floor space, including the access area required on all sides of the system. The system’s 9,216 Pentium Pro processors with 297 GBytes of RAM are organized into a 38 x 32 x 2 mesh. The system has a peak computation rate of 1.8 TFLOPS and a cross-section bandwidth (measured across the two 32 x 38 planes) of over 51 GB/sec.

Different parts of the ASCI TFLOP system will run different system software. The nodes involved with computation (compute nodes) will run an efficient, small operating system called the Light Weight Kernel (LWK). The nodes that support interactive user services (service nodes) and overall system management (system nodes) will run a distributed UNIX that provides a single system image.

The ASCI TFLOP system will support two modes of operation: classified mode and open mode. In classified mode, the system must be completely isolated, with data that is never available to outside computer networks. To support easy transition between these modes, two physically independent disk subsystems are provided. Each disk subsystem provides 1 TByte of storage capacity and is accessed by 16 I/O Nodes to deliver a combined 1 GB/sec in both "on-line" and "stand-alone" modes. These disk subsystems can run independently from the rest of the ASCI TFLOP supercomputer, so the data is available even when the subsystem is not connected to the compute cabinets. The two disk subsystems are connected to and disconnected from the rest of the system through a simple mechanical interconnect (the ICF switch), represented by the gray ellipse in Figure 1.

When scaling to so many nodes, even low-probability points of failure can become a major problem. To build a robust system with so many nodes, the hardware and software must provide RAS: Reliability, Availability and Serviceability. All major system components are hot-swappable and repairable while the system remains under power and, in most cases, while the system continues to run. Of the 4,536 Compute Nodes and the 16 on-line hot spares, for example, all can be replaced without having to cycle the power of any other module. Similarly, system operation can continue if any of the 308 patch service boards (which support RAS functionality), 640 disks, 1,540 power supplies, or 616 ICF backplanes should fail; each such failure reduces system capacity by less than 1%.

For the remainder of this paper, we present an overview of the ASCI TFLOP system design. We begin with the microprocessor used: the Intel Pentium Pro processor. We then describe the ASCI TFLOP ICF and the various node boards that will be used in the system. Next, we briefly outline the system software. We close the paper with a discussion of performance benchmarks on the Pentium Pro processor and the expected performance of MP-LINPACK on the ASCI TFLOP system.

2.

The Pentium Pro Processor

The CPU used on all the nodes of the ASCI TFLOP supercomputer will be the Intel Pentium Pro processor. Detailed descriptions of the CPU can be found in the literature [2], so we will only emphasize the highlights in this discussion.

The Pentium Pro processor is both a CISC and a RISC chip. The instruction set is essentially the same as that of the original Pentium processor. At runtime, however, these complex instructions are broken down into simpler RISC instructions called micro-operations (or uops). The uops execute in a RISC core inside the Pentium Pro processor, with the order of execution dictated by the availability of data. This lets the CPU continue with productive work while other uops are waiting for data or functional units. This out-of-order execution is combined with sophisticated branch prediction and register renaming to provide what Intel calls "dynamic execution".

The Pentium Pro processor’s RISC core can execute a burst rate of up to five uops per cycle. These correspond to the five functional units:
• Store Data Unit
• Store Address Unit
• Load Address Unit
• Integer ALU
• Floating Point/Integer Unit
Up to 3 uops can be retired per cycle, of which only one can be a floating point operation. Hence, the peak floating point rate for the Pentium Pro processor is 200 MFLOPS at 200 MHz. The floating point unit, however, can only start a multiply every second clock, for a peak multiply rate of 100 MFLOPS at 200 MHz. Fortunately, multiply and add operations can be interleaved, so it is possible to approach the peak one-flop-per-clock rate with real applications.

The Pentium Pro processor includes separate on-chip data and instruction L1 caches (each of which is 8 KBytes) and an L2 cache (256 KBytes) packaged with the CPU in a single dual-cavity PGA package. The L1 Data Cache is dual-ported and non-blocking, supporting one load and one store per cycle (a peak bandwidth of 3.2 GB/sec at 200 MHz). The L2 cache interface runs at the full CPU clock speed and can transfer 64 bits per cycle (1.6 GB/sec on a 200 MHz Pentium Pro). The external bus is also 64 bits wide and supports a data transfer every bus cycle.

The Pentium Pro processor bus offers full support for memory and cache coherency for up to 4 Pentium Pro processors. It has 36 bits of address and 64 bits of data. Bus efficiency is enhanced through the following features:
• A bus pipeline with a depth of 8.
• The capability to defer long transactions.
• Bus arbitration on a cycle-by-cycle basis.
The bus can support up to 8 pending transactions, while the Pentium Pro processor and the memory controller can each have up to 4 pending transactions. Memory controllers can be paired up to match the bus’s support for 8 pending transactions. The bus can sustain data on every clock cycle, so at 66 MHz the peak data rate is 533 MB/sec. Unlike most CCOTS processor buses, which can only detect data errors by using parity coverage, the Pentium Pro data bus is protected by ECC. Control and address signals are protected by parity.
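The one-flop-per-clock argument made earlier in this section can be made concrete with a small kernel. The C fragment below is our own illustrative sketch, not code from the ASCI TFLOP software: a dot-product loop unrolled so that independent multiplies and adds are always available to the out-of-order core. With the data in cache, a loop of this shape can approach the 200 MFLOPS peak of a 200 MHz Pentium Pro.

    /* Illustrative sketch only: a dot-product kernel arranged so that
     * multiplies and adds interleave.  Unrolling by four with independent
     * partial sums gives the out-of-order core enough ready work to keep
     * the floating point unit busy on (nearly) every clock. */
    double ddot(int n, const double *x, const double *y)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;

        for (i = 0; i + 3 < n; i += 4) {   /* unrolled by 4 */
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        for (; i < n; i++)                 /* clean-up loop */
            s0 += x[i] * y[i];

        return s0 + s1 + s2 + s3;
    }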

3

The ASCI TFLOP System

The ASCI TFLOP system is a distributed memory Multiple-Instruction / Multiple-Data (MIMD) system. All aspects of this system architecture are scalable, including communication bandwidth, main memory, internal disk storage capacity, and I/O. The ASCI TFLOP system derives its partition model from the Intel Paragon supercomputer. Four partitions are defined in Figure 2: Compute, Service, System, and I/O. The Service Partition provides an integrated, scalable "host" that supports interactive users, application development, and system administration. The I/O Partition’s nodes implement scalable file and network services. System Nodes (not present on the Paragon supercomputer) support system RAS capabilities. Finally, the Compute Partition contains nodes optimized for floating point performance and is where parallel applications execute. In the following subsections, we will describe the major subsystems of the ASCI TFLOP supercomputer:
• Interconnection Facility (ICF)
• Compute Nodes
• Service Nodes
• I/O Nodes
• System Nodes
• Software

Figure 2: ASCI TFLOP System Block Diagram. This system uses a split-plane mesh topology and has 4 partitions: System, Service, I/O and Compute. The diagram shows the two mesh planes (Plane A and Plane B) populated with Compute Nodes, Service Nodes, and I/O Nodes (PCI, Ethernet, and Boot Nodes with the Boot RAID), along with the Node Station used by the operators and external ATM, HiPPI, Ethernet, and tape connections; Compute and Service Nodes use Kestrel boards, while I/O and System Nodes use Eagle boards.

3.1

Interconnection Facility

The ASCI TFLOP system uses Intel’s next-generation ICF, which doubles the Paragon ICF data transfer rate to 800 MB/sec (400 MB/sec in each direction). For improved system maintainability, this ICF uses a split-plane topology (shown in Figure 3) with the mesh implemented in two parallel planes. The ICF implementation uses two custom VLSI components: the Mesh Routing Component (MRC) and the Network Interface Component (NIC). The MRCs are responsible for routing within the ICF and are located within routing planes positioned behind the card cages that hold the node boards. Each node has a high-performance, bi-directional NIC on its own local bus and may route to any other node via either of the ICF planes. High, sustainable throughput rates are achieved in the face of heavy traffic through the use of four virtual ICF channels. These lanes give each message guaranteed ICF-link access even when there is network contention on a different lane.

Figure 3: ASCI TFLOP 2 Plane Interconnection Facility (ICF). Bi-directional bandwidths are given on the left side of the figure while uni-directional bandwidths are given on the right side. In both cases, sustainable (as opposed to peak) numbers are given in parentheses.

3.2

The Compute Node

The Pentium Pro processor bus can support up to four processors in a shared memory configuration. For the ASCI TFLOP Compute Node, however, we put only two Pentium Pro processors on each bus. This was done to increase the memory bandwidth available to parallel applications. The compute node memory subsystem is implemented using Intel’s CCOTS Pentium Pro processor support chipset. It is structured as two rows of four independently controlled, sequentially interleaved banks of DRAM to produce up to 533 MB/sec of data throughput. Each bank of memory is 72 bits wide, allowing for 64 data bits plus 8 bits of ECC, which provides single-bit error correction and multiple-bit error detection. The banks are implemented as two 36-bit SIMMs, so that industry-standard SIMM modules may be used to populate the memory. Using commonly available 4 and 8 MByte SIMMs (based on 1Mx4 DRAM chips) and 16 and 32 MByte SIMMs (based on 4Mx4 DRAM chips), 32 MB to 256 MB of memory per node is supported.

Each Compute Node is paired with another Compute Node on a single "Double Compute Node" called the Kestrel board. Local I/O support includes a serial port called the "Node Maintenance Port". This port interfaces to the system’s internal Ethernet through a terminal concentrator to facilitate RAS as well as system bootstrap and diagnostics. The two NICs on each Kestrel board are connected to each other through the NIC’s integrated cross-bar router (a subset of the MRC). One of the NICs makes the board’s single connection to the backplane. Each node on the Kestrel board also incorporates its own boot support (FLASH ROM and simple I/O devices) through a PCI bridge on its local bus. A connector is provided to allow testing of each node through this PCI bridge. The FLASH ROM contains the Node Confidence Tests, BIOS, and other support needed to diagnose board failures and to load a variety of operating systems.

3.3

The Service Node

Service Nodes are used to support login, software development, and other interactive operations on the ASCI TFLOP system. These nodes use the Kestrel board developed for the Compute Node. Each of the four Pentium Pro processors provides the equivalent performance of a high-end workstation and can easily support 1-5 interactive users.

Figure 4: The ASCI TFLOP Kestrel Board. This board includes two compute nodes daisy-chained together through their NICs. Only one of the NICs connects to the MRC. Each of the board’s two nodes has two Pentium Pro processors with L2 caches, a memory controller with SIMMs, an I/O bridge with boot support, a PCI bus, and an expansion connector, and a NIC on its own 64-bit local bus; the board’s ICF link is made through one of the two NICs.

3.4

The I/O Node

The I/O Nodes (Eagle nodes) are shown in Figure 5 and include two Pentium Pro processors configured with 64 MB to 1 GB of local memory. These two processors support two on-board PCI interfaces that each provide 133 MB/sec of I/O bandwidth. One of the two buses can support two PCI cards through the use of a 2-level riser card. Thus, within a single slot, an I/O Node can be configured with up to 3 long-form PCI adapter cards. CCOTS PCI adapter boards can be inserted into these interfaces to provide Ultra-SCSI, HiPPI, ATM, FDDI, and numerous other custom and industry-standard I/O capabilities. In addition to the add-in card capabilities, there are base I/O features built into every board that are accessible through the front panel. These include RS-232, 10 Mbit Ethernet, and differential FW-SCSI. As with the Kestrel nodes, each Eagle node incorporates a NIC to provide a high-performance, bi-directional connection to the ICF.

Each I/O Node provides ample processing capacity and throughput to support a wide variety of high-performance I/O devices. The throughput of each PCI bus is dictated by the type of interface supported by the PCI adapter used, the driver software run on the I/O Node, and the higher-level protocols used by the application and the "other end" of the interface. The data rates associated with common I/O devices fit well within the throughput supported by the PCI bus. Ultra-SCSI, for example, provides a hardware rate of 40 MB/sec, and ATM OC-12 provides a peak interface bandwidth of approximately 77 MB/sec. Both are supported at these peak rates by CCOTS PCI adapters.
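As a quick illustration of this headroom, the following C fragment is our own arithmetic: it compares the peak bandwidth of a standard 32-bit, 33 MHz PCI bus (the usual commodity figures, not a machine-specific specification) with the Ultra-SCSI and ATM OC-12 rates quoted above.

    #include <stdio.h>

    /* Illustrative arithmetic only: peak bandwidth of a commodity 32-bit,
     * 33 MHz PCI bus versus the device rates quoted in the text. */
    int main(void)
    {
        double pci_peak   = 4.0 * 33.3;  /* 4 bytes/transfer * 33.3 MHz ~ 133 MB/sec */
        double ultra_scsi = 40.0;        /* MB/sec, quoted hardware rate             */
        double atm_oc12   = 77.0;        /* MB/sec, quoted peak interface bandwidth  */

        printf("PCI peak %.0f MB/sec; Ultra-SCSI %.0f; ATM OC-12 %.0f\n",
               pci_peak, ultra_scsi, atm_oc12);
        printf("Each device fits within one PCI bus: %s\n",
               (ultra_scsi < pci_peak && atm_oc12 < pci_peak) ? "yes" : "no");
        return 0;
    }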

3.5

The System Node

Booting and RAS services are provided by Eagle boards. The Boot Node is responsible for initial system booting and has its own independent RAID and Ethernet connections. In addition, it provides additional single-system image support to the Service and I/O Nodes. Another Eagle node, configured as part of the I/O Partition, is cross-connected with the boot RAID to provide an on-line spare for the Boot Node. The Node Station uses an Eagle node to support the system’s RAS services. An extra Eagle node is configured into the system as an on-line spare for the Node Station.

3.6

The ASCI TFLOP Software

Our experience with the Paragon system has taught us a great deal about building operating systems and message passing software that scale to thousands of nodes. To capitalize on this experience, the software for the ASCI TFLOP system is an evolution of the scalable environment for the current Paragon system.

Figure 5: The ASCI TFLOP I/O Node (Eagle Board). The two Pentium Pro processors (each with an L2 cache), the memory control and SIMMs, and the NIC with its ICF link share a 64-bit local bus; two I/O bridges provide the PCI buses, expansion connectors, and boot support.

The ASCI TFLOP software is complex. Describing the software in detail would require more space than is available in this paper. Therefore, we will limit this discussion to an overview of the Operating Systems and message passing software.

3.6.1 The ASCI TFLOP Operating Systems

The requirements placed on an operating system vary from one partition type to another. The Service, System, and I/O Partitions run the Paragon operating system. The Paragon OS is a distributed UNIX operating system (POSIX 1003.1 and XPG3, AT&T System V.3 and 4.3 BSD Reno VFS) that presents a single system image to the user. This means that users and applications see the system as a standard, single-processor UNIX machine despite the fact that the operating system is running on a distributed collection of Service and I/O Nodes.

The Compute Nodes run the Light Weight Kernel (LWK) operating system. The Paragon operating system interacts with the LWK through a high-performance interface. This interface includes a fast, direct communication path, an enhanced programming interface for UNIX applications, and a Parallel File System for high-speed parallel access to large disk files.

LWK is based on the Puma operating system developed at Sandia National Laboratories and the University of New Mexico [3]. Puma is designed for high performance, TFLOP-level scalability, low complexity, and small size (less than one MByte). The Puma architecture uses a message passing kernel to provide multi-tasking, memory management, and trap handling facilities. It is composed of two main entities: the Quintessential Kernel (Q-Kernel or QK) and the Process Control Thread (PCT).

The Q-Kernel is the lowest level in the LWK architecture. It is responsible for controlling access to the physical resources of a node. It is the only entity that has direct access to the address mapping and communication hardware. The Q-Kernel provides basic computation facilities, communication facilities, and address space protection. It supports the direct transfer of message payloads from memory via portals, thus achieving high utilization of the ICF hardware.

The PCT is one level above the Q-Kernel. The PCT provides process management (e.g., process creation and scheduling), naming services for finding a server process and initiating inter-group communication, and communication capabilities. The PCT is invoked whenever an application process completes its allocated quantum of work, either because of a timer interrupt, because the application gave up the processor, or due to a preemptive event. On invocation, the PCT checks for requests from the application process whose execution was suspended, handles requests from other PCTs, and then tries to execute the highest-priority runnable application or server process.
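The PCT invocation sequence can be summarized with a small sketch. The C fragment below is our own hypothetical illustration of the behavior just described; the types and function names are invented for this paper and are not the actual Puma/LWK interfaces.

    #include <stddef.h>

    /* Hypothetical sketch of the PCT invocation sequence described above.
     * All names and types are invented for illustration; they are not the
     * actual Puma/LWK interfaces. */

    struct process { int priority; int runnable; };

    #define MAX_PROCS 8
    static struct process proc_table[MAX_PROCS];

    /* Placeholder: handle requests left by the process that just stopped. */
    static void handle_local_requests(struct process *suspended) { (void) suspended; }

    /* Placeholder: service requests that arrived from PCTs on other nodes. */
    static void handle_remote_requests(void) { }

    /* Pick the highest-priority runnable application or server process. */
    static struct process *highest_priority_runnable(void)
    {
        struct process *best = NULL;
        int i;
        for (i = 0; i < MAX_PROCS; i++)
            if (proc_table[i].runnable &&
                (best == NULL || proc_table[i].priority > best->priority))
                best = &proc_table[i];
        return best;
    }

    /* The PCT is invoked when a process exhausts its quantum, yields, or is
     * preempted: check local requests, then remote requests, then choose the
     * next process for the Q-Kernel to run. */
    struct process *pct_invoke(struct process *suspended)
    {
        handle_local_requests(suspended);
        handle_remote_requests();
        return highest_priority_runnable();
    }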

3.6.2 The ASCI TFLOP System Message Passing

The ASCI TFLOP software provides a rich message-passing environment. A variety of message passing protocols are available, including Portals, MPI, and NX. All of these are accessible to the application programmer. Based on these lower-level protocols, third-party support for PVM and other common APIs is available.

Portals [3] are part of the Puma operating system and offer very low latency and high bandwidth. Specifically designed for scientific codes, Portals are ideally suited to applications that use regular message passing patterns. Bulk data is taken directly from source memory and deposited directly into destination memory. This removes the overhead of memory-to-memory copying. Portals also provide high-performance asynchronous transfers, with buffering provided in the application data space.

MPI is the standard message passing library on the ASCI TFLOP system. This will be a full implementation of the MPI 1.0 specification and will incorporate new components as new MPI standards become established. To provide portability for existing NX-based applications, an NX compatibility library built on top of MPI will be provided. The most critical NX routines map directly onto chained MPI functions. Therefore, latency and bandwidth numbers will be comparable for similar operations regardless of whether they are implemented through MPI or NX.
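As a concrete illustration of the interface most applications will use, the fragment below is a minimal MPI 1.0 point-to-point exchange written in C. It is generic MPI code, not code taken from the ASCI TFLOP implementation; the buffer size and tag are arbitrary.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI 1.0 point-to-point example: rank 0 sends a buffer of
     * doubles to rank 1.  Generic MPI code, not ASCI TFLOP software. */
    #define N 1024

    int main(int argc, char **argv)
    {
        int rank, i;
        double buf[N];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (i = 0; i < N; i++)
                buf[i] = (double) i;
            MPI_Send(buf, N, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d doubles\n", N);
        }

        MPI_Finalize();
        return 0;
    }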

3.7

The ASCI TFLOP RAS Capabilities

The ASCI TFLOP system will include state-of-the-art capability for Reliability, Availability, and Serviceability (RAS). These features will let the system continue to operate in the face of failures in all major system components. RAS is implemented through a combination of component-level reliability features, specialized switches for coupling and decoupling the disk subsystems, hardware redundancy, hot-swap capability for field replaceable units (FRUs), and a dual-plane ICF. These RAS capabilities are driven by an active, scalable Monitoring and Recovery Subsystem (MRS). Over the course of the ASCI program, each of these RAS features will be integrated into an intelligent diagnostic system, resulting in system stability uncharacteristic of such large scalable supercomputers.

These RAS capabilities are based on Intel’s work on its next-generation scalable computer, the Common Base Platform (CBP-II). This system was designed for the commercial market and uses many of the well-known techniques for building high-availability computers for commercial applications.

4.

System Performance

System performance can be described in two parts: communication performance and single node performance.

4.1

Communication Performance

The ICF hardware latency is roughly 2 µsec plus an additional 50 nsec for each segment traversed within the ICF. Therefore, the hardware latencies should range from 2.05 µsec to 5.25 µsec for the ASCI TFLOP system configuration (a small worked example follows the list below). When software overheads are included, we expect the following communication performance:
• The latency for one-way trips with Portals should be 5 µsec, with a total worst case of 10.25 µsec. Worst-case point-to-point asynchronous MPI message passing latency (long message to far corners) should be less than 20.5 µsec.
• Point-to-point, asynchronous MPI message passing bandwidth should be 49 MB/sec and 207 MB/sec for 1 KByte and 8 KByte message lengths, respectively.
• Using Portals, the inter-node uni-directional bandwidth should be 76 MB/sec, 254 MB/sec, and 380 MB/sec for 1 KByte, 8 KByte, and 1 MByte message lengths, respectively.
• Using Portals, the aggregate communication bandwidth should be 1.02 TB/sec, as measured by all Compute Nodes sending and receiving messages from their nearest neighbors. This rate is obtained by multiplying the number of Compute Nodes (2 nodes per Kestrel board) by the achievable bi-directional data rate between two nodes on a single Kestrel board (500 MB/sec).
• The use of local memory for message passing buffers is under strict control of the application. Applications can be written using only "write-memory" and "read-memory" portals, without utilizing any additional buffering memory beyond that used by the constant-sized "comm-space" portal used for file and host service interactions. The MPI implementation utilizes the Portals infrastructure to provide "sender-buffered" messaging. Therefore, the message passing buffer space imposed on an application by the system scales at no more than a logarithmic rate.
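As the small worked example promised above, the C fragment below is our own illustration of the linear latency model quoted at the start of this subsection: a fixed cost of roughly 2 µsec plus about 50 nsec per ICF segment. The 65-segment case is simply the path length implied by the quoted 5.25 µsec upper bound; it is not an official figure.

    #include <stdio.h>

    /* Illustrative latency model using only the numbers quoted in the text:
     * about 2 usec of fixed hardware latency plus roughly 50 nsec per ICF
     * segment traversed.  A worked example, not a measured model. */
    static double icf_latency_usec(int segments)
    {
        return 2.0 + 0.050 * segments;
    }

    int main(void)
    {
        printf("1 segment   : %.2f usec\n", icf_latency_usec(1));   /* 2.05 usec */
        printf("65 segments : %.2f usec\n", icf_latency_usec(65));  /* 5.25 usec */
        return 0;
    }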

4.2

MP-LINPACK Performance

The best implementations of MP-LINPACK use a block-oriented data decomposition with blocks assigned to processors in a two-dimensional, Cartesian wrap-mapped order [4]. This mapping is logical; i.e., the physical processor arrangement need not correspond to the mapping used in the algorithm. An Intel Paragon system holds the current MP-LINPACK world record [1]. This research showed that for very large problems, at least 93% of MP-LINPACK’s runtime is consumed by the BLAS-3 matrix multiplication code (DGEMM) and that overall performance can be predicted based on the performance of DGEMM. A Pentium Pro processor can start one floating point instruction every clock, with the caveat that two multiplies cannot be started on successive clocks. The DOT product kernel of DGEMM contains (almost) the same number of adds and multiplies. Therefore, it should be possible to approach a rate of one flop per clock (i.e., 200 MFLOPS at 200 MHz). In practice, it is necessary to arrange the inner loop to hide all of the memory references and loop control. Outer loop overhead and inner loop startup costs limit DGEMM performance to about 85% of peak. Combined with the 93% parallel efficiency of MP-LINPACK (pivoting, communication, etc.), we estimate about 1.4 TFLOPS for MP-LINPACK. Putting this in more pragmatic terms, the available memory on the ASCI TFLOP system will hold a problem of size 180,000, which the machine will solve in about 45 minutes.
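The 1.4 TFLOPS figure follows directly from these factors. The short C fragment below is our own back-of-the-envelope arithmetic using only the numbers quoted above; it is an illustrative check, not a performance model of the machine.

    #include <stdio.h>

    /* Back-of-the-envelope check of the MP-LINPACK estimate, using only
     * the numbers quoted in the text. */
    int main(void)
    {
        double peak_tflops  = 9216 * 200.0e6 / 1.0e12; /* 9,216 CPUs at 200 MFLOPS */
        double dgemm_frac   = 0.85;  /* DGEMM reaches about 85% of peak   */
        double parallel_eff = 0.93;  /* MP-LINPACK parallel efficiency    */

        printf("Estimated MP-LINPACK rate: %.2f TFLOPS\n",
               peak_tflops * dgemm_frac * parallel_eff);  /* ~1.46, i.e., about 1.4 */
        return 0;
    }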

4.3

Pentium Pro Benchmarking Results

A general, quantitative ranking of microprocessors is difficult and not really necessary for our purposes. Our goal is to show that the performance of the Pentium Pro processor is competitive with the RISC processors found in high-end workstations. We begin with the SPEC95 benchmarks. Both SPECint95 and SPECfp95 numbers are shown in Table 1 for several modern CPUs.

Processor                MHz   SPECint95   SPECfp95   System
Pentium Pro Processor    200   8.04        5.82       Intergraph TDZ-300 GLZ1T
Pentium Pro Processor    150   6.09        4.53       Intergraph TDZ-300 GLZ1T
DEC A21164               300   7.33        11.59      AlphaSta. 600 5/300
DEC A21064A              266   4.18        5.78       AlphaSta. 250 4/266
PA7200                   120   4.37        7.54       HP 9000/J210
PA7100                   125   4.04        4.55       HP 9000/735
MPC604                   133   4.45        3.31       IBM 43P

Table 1: SPECint95 and SPECfp95 benchmark results [5] for several microprocessors.

From this table we see that the Pentium Pro processor has one of the industry’s leading SPECint95 numbers. Since the Pentium Pro processor has only one floating point pipe, its SPECfp95 numbers are respectable but well behind those of the leading RISC processors. For applications dominated by floating point computation, these SPECfp95 benchmarks suggest the Pentium Pro will lag behind the leading RISC processors. Many, if not most, applications involve a mix of integer and floating point computation. For these cases, the Pentium Pro processor should provide competitive performance.

We have verified this expectation by looking at many different workstation applications [5]. As a typical example, consider the NASTRAN program from the MacNeal-Schwendler Corp. This is a leading finite element code for modeling a wide range of structural engineering problems. Mathematically, the computation consists of solutions of sparse linear systems of equations, computations that involve a mix of integer and floating point work. Results are given in Table 2 for a vibrational analysis of a power train model. This computation found the natural frequencies in the range of 0 to 100 Hz: a 36,000 degree-of-freedom computation involving two decompositions, 18 solves, and 18 mode extractions [5]. The I/O time is largely a function of the amount of RAM on the system and is therefore not a valid point of comparison. Looking at the compute time, the Pentium Pro system from Intergraph is slightly less than twice as slow as a Cray C90 supercomputer. Of the RISC workstations for which we have data, the Pentium Pro is the leading system. This conclusion has been confirmed for a range of applications [5]. We understand that this is not a systematic study and does not lead to a quantitative ranking of microprocessor performance. What these results do demonstrate, however, is that systems based on the Pentium Pro processor perform on par with workstations commonly used for scientific computing.

System                                                                    Compute Time (sec)   I/O Time (sec)
Pentium Pro Processor, 200 MHz, 512 KByte L2 Cache, 512 MBytes RAM        579.7                85.3
Intergraph IDZ-300, Pentium Pro Processor, 200 MHz, 256 KByte L2 Cache    638.4                710.2
Cray C90                                                                  360.0                184
HP 9000/819                                                               913                  120
IBM RS/6000 model 375                                                     1098.9               1338

Table 2: NASTRAN runtimes (version 68.2) for the 36,000 degree-of-freedom Power Train vibrational model benchmark [5].

5.

Conclusion

In this paper we have described the ASCI TFLOP supercomputer. By the time this system is installed at Sandia National Laboratories, it will be the largest and fastest supercomputer in the world. While this will be an expensive supercomputer, it fits within the ASCI program’s budget due to the use of Commodity Commercial Off The Shelf (CCOTS) technology. Perhaps equally important, CCOTS puts this generation of supercomputers on the price/performance curves of the microprocessor industry. These price/performance curves suggest that over the next 10 years, we should be able to use this design to build 10 TFLOP and even 100 TFLOP systems for a cost similar to that of the ASCI TFLOP system (ignoring costs associated with additional memory).

The ASCI TFLOP system is far too complex to describe adequately in a single paper. We have presented a fairly complete description of the system hardware. The software, however, was only minimally described. In addition, the ASCI TFLOP computer includes sophisticated technology developed by Intel to support commercial scalable computing markets where systems must be up 24 hours per day, every day of the year. These RAS features were also only minimally described. Future papers will address these and other issues.

Acknowledgments Peter Wolochow provided the figures in this paper. All trademarks are the property of their respective owners.

6.

References

[1] J. Dongarra, “Performance of Various Computers Using Standard Linear Equation Solvers”, ORNL Technical Report CS-89-85, 1995.

[2] L. Gwennap, “Intel’s P6 Uses Decoupled Superscalar Design”, Microprocessor Report, Feb. 16, 1995.

[3] S.R. Wheat, R. Riesen, A.B. Maccabe, D.W. van Dresser, and T.M. Stallcup, “Puma: An Operating System for Massively Parallel Systems”, Proceedings of the 27th Hawaii International Conference on System Sciences, Vol. II, p. 56, 1994.

[4] L. Fiske, G. Istrail, C. Jong, R. Riesen, L. Shuler, J. Bolen, A. Davis, B. Dazey, S. Gupta, G. Henry, D. Robboy, G. Schiffer, M. Stallcup, A. Taraghi, and S. Wheat, “Massively Parallel Distributed Computing”, Proceedings of the Intel Supercomputing User’s Group, http://www.cs.sandia.gov/ISUG/program.html, 1995.

[5] “Pentium Pro Processor Workstation Performance Brief: Version 1.0”, available on-line at http://www.intel.com:80/procs/perf/doc/pprows.ps, November 1995.