US005301342A

United States Patent [19]                         [11] Patent Number:  5,301,342
Scott                                             [45] Date of Patent: Apr. 5, 1994

[54] PARALLEL PROCESSING COMPUTER FOR SOLVING DENSE SYSTEMS OF LINEAR
     EQUATIONS BY FACTORING ROWS, COLUMNS, AND DIAGONAL, INVERTING THE
     DIAGONAL, FORWARD ELIMINATING, AND BACK SUBSTITUTING

[75] Inventor:  David S. Scott, Portland, Oreg.
[73] Assignee:  Intel Corporation, Santa Clara, Calif.
[21] Appl. No.: 632,462
[22] Filed:     Dec. 20, 1990
[51] Int. Cl.5  ................. G06F 15/347; G06F 15/32
[52] U.S. Cl.   ................. 395/800; 364/735; 364/736; 364/754; 364/194;
                                  364/937.1; 364/931.03; 364/931.42; 364/931.01;
                                  364/931.41; 364/DIG. 2
[58] Field of Search ........... 395/800; 364/133, 194, 719, 735, 736, 754

[56] References Cited
     U.S. PATENT DOCUMENTS
     4,787,057  11/1988  Hammond ............... 364/754
     4,914,615   4/1990  Karmarkar et al. ...... 364/754
     4,947,480   8/1990  Lewis ................. 364/572
     5,136,538   8/1992  Karmarkar et al. ...... 364/754
     5,163,015  11/1992  Yokota ................ 364/578

Primary Examiner-Thomas C. Lee
Assistant Examiner-Paul Harrity

[57] ABSTRACT

A parallel processing computer system for solving a system of linear equations having coefficients residing in a first matrix and right-hand sides of the linear equations residing in a first vector. The first matrix is divided into a plurality of ND row disk sections, a plurality of ND column disk sections and ND diagonal disk sections. Each of these sections, in a preferred embodiment, are known as disk sections, and are stored on non-volatile media such as magnetic and/or optical disks. Further, the equations are defined by the first vector, the first vector comprising ND sections. Each of the plurality of j row sections and j column sections is factored. Then, the j diagonal section is factored and inverted. In a preferred embodiment, the inversion uses a Gauss-Jordan technique. These steps are repeated for all values of j that range between 1 and ND. Then, forward elimination is performed for all sections in the first vector using the first matrix, and back substitution is performed for all sections in the first vector using the first matrix.

11 Claims, 15 Drawing Sheets

[Representative drawing: matrices A, B, and C partitioned into 4 x 4 arrays of node sections A11-A44, B11-B44, and C11-C44, with certain sections shaded.]
[Drawing sheets 1-15 (U.S. Patent, Apr. 5, 1994, 5,301,342); only the figure captions and a few flowchart labels are recoverable from the scan:

FIG. 1 (Sheet 1) - parallel processing system 100: compute nodes 110 and 111, SRM 151, I/O node 150, channels 130, 140, 160 and 170, and SCSI I/O devices on bus 180.
FIG. 2 (Sheet 2) - one compute node 110: processing unit 201, 8 megabytes of memory 202, network interface 203, router 204, bus 205.
FIG. 3 (Sheet 3) - a matrix divided into sections (elements 321, 323).
FIGS. 4a-4c (Sheets 4-6) - flowcharts "DO ITH ROW" and "DO ITH DIAGONAL" for factoring row, column, and diagonal sections.
FIGS. 5a-5b (Sheets 7-8) - flowcharts for the forward elimination and back substitution phases.
FIG. 6 (Sheet 9) - matrix 600 showing the order (1 through 16) in which disk sections are factored.
FIGS. 7a-7d (Sheets 10-13) - flowchart 700 for inverting a diagonal block, including a pivot search and the exit "STOP: MATRIX IS SINGULAR".
FIG. 8 (Sheet 14) - the node sections required by a compute node and the order in which they are passed to other nodes.
FIG. 9 (Sheet 15) - flowchart 900 for the node-section matrix-matrix product.]

PARALLEL PROCESSING COMPUTER FOR SOLVING DENSE SYSTEMS OF LINEAR EQUATIONS BY FACTORING ROWS, COLUMNS, AND DIAGONAL, INVERTING THE DIAGONAL, FORWARD ELIMINATING, AND BACK SUBSTITUTING

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an apparatus and method for solving large dense systems of linear equations on a parallel processing computer. More particularly, this invention relates to an efficient parallel processing method which solves dense systems of linear equations (those containing approximately 25,000 variables) wherein the system efficiently utilizes the computing and input/output resources of the parallel processing computer and minimizes idle time of those resources.

2. Prior Art

For some applications, such as determining the radar cross-section of aircraft, or simulating flow across an airfoil, very large systems of linear equations are used. These linear equations sometimes range on the order of 25,000 variables or more. It therefore is necessary to solve very large matrices comprising approximately 25,000 x 25,000 elements or more for solving these linear equations. This is typically difficult to accomplish on prior art linear algebra computers due to their inherent limitations, such as processing power, or input/output device speed. Also, general limitations have prevented the solutions for these problems, such as bandwidth and the cost of super-computing resources. Parallel computing architectures have been used for solving certain types of problems, including systems of linear equations, since many small processors may perform certain operations more efficiently than one large high-speed processor, such as that present in a typical supercomputer. The practical limitations in partitioning the problem to be solved in the parallel-processing machines, however, have hindered the usefulness of this architecture.

In a parallel processing machine, it is necessary to break down a problem (such as the operations to solve a large matrix representing a system of linear equations) into a series of discrete problems in order for the system to generate a solution. Segmenting such a problem is a nontrivial task, and must be carefully designed in order to maximize processing by each of the parallel nodes, and minimize input/output operations which must be performed. This is done so that the system does not spend the majority of the time performing input/output operations (thereby becoming "I/O bound") while the computing resources in the system remain substantially idle. Therefore, one goal of designing systems and problems is to maximize the amount of time that the system is processing. Another goal of parallel processing architectures in implementing very large matrix solutions is to balance the I/O capacity of the system so that the system does not become totally "compute-bound" (e.g. processing only, no input/output operations) while the I/O units remain idle.

Another problem with parallel processing computers is that large chunks of a matrix (perhaps the entire matrix) need to be loaded into the main memory of each parallel processing node in order to compute the solution. Given that very large matrices require vast amounts of computer main memory, certain systems perform matrix solutions "out-of-core." In other words, elements within the matrix that are not needed at a particular time may be written off to disk, and/or spooled to tape, for later retrieval and operation upon in the main memory of each node in the parallel processing system. Such an arrangement provides an easy system of maintaining backups (since they are constantly being spooled off to tape, disk or other similar media), as well as providing a natural check-point at which computation may resume if computer system power is lost, a system malfunction occurs or some other event occurs which halts processing. Given the long time period for solving very large matrices (up to a full real-time month for some very large systems, for example, those containing 100,000 x 100,000 elements), the possibility of losing system power or another type of malfunction may prove fatal to solving the matrix. Such a system may be termed an "out of core solver."

SUMMARY AND OBJECTS OF THE INVENTION

One object of the present invention is to provide for a method which is especially tuned for solving large matrix problems on parallel processing computers.

Another object of the present invention is to provide a method for breaking down a system of large linear equations for an efficient solution by a parallel processing computer.

These and other objects of the invention are provided for by a method in a parallel processing computer of solving a system of linear equations having coefficients residing in a first matrix and right-hand sides of the linear equations residing in a first vector. The first matrix is divided into a plurality of ND row sections, a plurality of ND column sections and ND diagonal sections. Each of these sections, in a preferred embodiment, are known as disk sections, and are stored on non-volatile media such as magnetic and/or optical disks. Further, the equations are defined by the first vector, the first vector comprising ND sections. Each of the plurality of j row sections and j column sections is factored. Then, the j diagonal section is factored and inverted. In a preferred embodiment, the inversion uses a Gauss-Jordan technique. These steps are repeated for all values of j that range between 1 and ND. Then, forward elimination is performed for all sections in the first vector using the first matrix, and back substitution is performed for all sections in the first vector using the first matrix.
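The factor/invert/eliminate/substitute flow summarized above can be illustrated with a small serial, in-core Python sketch. This is not the patented out-of-core parallel method: it keeps every block in memory, performs no pivoting, omits the disk-section and node-section distribution described later, and uses np.linalg.inv in place of the Gauss-Jordan inversion the summary calls for. The block layout (an ND x ND nested list of equally sized square NumPy arrays) is likewise an assumption chosen only to make the ordering of the row, column, and diagonal factoring steps concrete.

    import numpy as np

    def block_factor(A):
        """Factor an ND x ND grid of square NumPy blocks in place.

        Off-diagonal blocks end up holding the factored row/column sections;
        each diagonal block ends up holding the INVERSE of its factored
        diagonal section (np.linalg.inv stands in for Gauss-Jordan)."""
        ND = len(A)
        for j in range(ND):
            # Factor the j-th column sections (blocks below the diagonal).
            for i in range(j + 1, ND):
                for k in range(j):
                    A[i][j] -= A[i][k] @ A[k][k] @ A[k][j]
            # Factor the j-th row sections (blocks to the right of the diagonal).
            for i in range(j + 1, ND):
                for k in range(j):
                    A[j][i] -= A[j][k] @ A[k][k] @ A[k][i]
            # Factor the j-th diagonal section, then invert it.
            for k in range(j):
                A[j][j] -= A[j][k] @ A[k][k] @ A[k][j]
            A[j][j] = np.linalg.inv(A[j][j])

    def block_solve(A, b):
        """Forward elimination then back substitution on a block right-hand
        side b (a list of ND vectors). Assumes block_factor(A) has run."""
        ND = len(A)
        z = [None] * ND
        for j in range(ND):                      # forward elimination
            t = b[j].copy()
            for k in range(j):
                t -= A[j][k] @ z[k]
            z[j] = A[j][j] @ t                   # diagonal block holds its inverse
        x = [None] * ND
        for j in reversed(range(ND)):            # back substitution
            t = np.zeros_like(z[j])
            for k in range(j + 1, ND):
                t += A[j][k] @ x[k]
            x[j] = z[j] - A[j][j] @ t
        return x

For example, with ND = 2 and 1 x 1 blocks holding the matrix [[4, 2], [1, 3]] and right-hand side (10, 5), calling block_factor(A) and then block_solve(A, b) returns the solution blocks (2, 1). In this layout a row or column section is factored using only sections produced at earlier steps, so the inverse of the current diagonal section is not needed until its own factoring step, which mirrors the ordering claimed in the summary.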

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements and in which:

FIG. 1 shows the parallel processing architecture of the preferred embodiment.

FIG. 2 shows the internal architecture of one compute node of the preferred embodiment.

FIG. 3 shows a matrix which is to be solved by being divided into node and disk sections.

FIGS. 4a, 4b and 4c show a factoring method for a matrix used in the present invention.

FIGS. 5a and 5b show the forward elimination and back substitution phases for the solve method of the preferred embodiment.

FIG. 6 shows a portion of a matrix and the order in which particular disk sections are factored in the preferred embodiment.

FIGS. 7a, 7b, 7c, and 7d show a method used for inverting a block in a matrix.

FIG. 8 shows which node sections of the matrix are required by a compute node at a given time and the order in which those node sections are passed to other nodes in the parallel processing computer.

FIG. 9 shows the method used for a node section matrix-matrix multiply operation which is performed by each compute node in the preferred embodiment.

DETAILED DESCRIPTION

A method for solving a very large dense system of linear equations in a parallel processing computer is described. In the following description, for the purposes of explanation, specific memory lengths, input/output devices, architectures, and matrix sizes are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well known circuits and devices are shown in block diagram form in order to not unnecessarily obscure the present invention.

Referring to FIG. 1, the apparatus on which the method of the preferred embodiment operates is shown generally as 100 in FIG. 1. System 100 is a parallel processing machine known as the Intel iPSC®/860 computer manufactured by Intel® Corporation's Supercomputer Systems Division of Beaverton, Oreg. (Intel and iPSC are registered trademarks of Intel Corporation of Santa Clara, Calif.). System 100 is also known generically as the "hypercube." Note that system 100 generally comprises a series of nodes such as 110 and 150 shown in FIG. 1 which are inter-connected by a series of channels such as 130, 140 and 160 within system 100. A node may be an input/output (I/O) node such as 150 or a compute node such as 110 shown in FIG. 1. System 100 as shown in FIG. 1 implements a distributed computing system wherein input/output devices are available to all other nodes in system 100 through their respective I/O nodes such as 150 and their compute nodes such as 110.

I/O node 150 is coupled to compute node 110 through a line 170. It is also coupled to a series of I/O devices such as disk drives 181 and 182, tape drive 183 or optical disk carousel 184. These devices are coupled to I/O node 150 through bus 180 shown in FIG. 1. In the preferred embodiment, each of the I/O devices 181-184 are small computer system interface (SCSI) devices allowing transfer rates of up to 1 megabyte per second to I/O node 150. In the preferred embodiment, disks 181 and 182 are Winchester class sealed SCSI hard disk drives which, at this time, have a maximum storage capacity of approximately 760 megabytes. Tape drive 183, in the preferred embodiment, is an 8 millimeter helical scan digital tape drive typically used in video tape applications. It has an approximate capacity equal to three disk drives such as 181 and 182. Tape drive 183 allows the spooling of entire disk drives such as 181 and 182 onto the tape drive for later processing, or it may be used for performing backup operations. Lastly, I/O node 150 may be coupled to an optical disk drive carousel unit 184. 184 may be an optical disk drive carousel ("a CD ROM jukebox") or may be a very large single optical drive unit in an alternative embodiment. Optical disk drive 184 has a mechanism which allows access to any one of many rewritable optical disks contained within unit 184. This gives system 100 access to many gigabytes of data since these optical disks have a capacity of approximately half a gigabyte. In systems having more than 25,000 variables, the use of large amounts of fixed media storage, such as optical disk carousels, is required. In the preferred embodiment, I/O node 150 is an 80386 microprocessor manufactured by Intel Corporation of Santa Clara, Calif.

I/O node 150 is interconnected with its corresponding compute node such as 110 in system 100 shown in FIG. 1 through a high speed network known as the Direct-Connect™ network manufactured by Intel Corporation of Santa Clara, Calif. Each Direct-Connect™ channel such as 170 shown in FIG. 1 is a high capacity dedicated connection allowing bidirectional transfers at a rate of up to 5.6 megabytes per second. This channel allows access to all nodes in system 100 by each of the other nodes. All I/O nodes such as 150 and I/O devices such as 181-184 are treated on system 100 as one logical entity using the Concurrent File System (CFS), also a product of Intel Corporation of Santa Clara, Calif. In other words, even if access is made to different I/O nodes and devices by computing nodes in system 100 other than 110, these devices appear as one logical entity to the node requesting the access. The CFS of the preferred embodiment exploits parallel access to all I/O devices and all processors for efficient, automatic routing of I/O requests and data over the network. When large files are created by a single compute node, CFS automatically distributes the files among the available I/O devices. When many small files are accessed, the load is typically distributed among available I/O devices, available I/O channels, and the internal communication channels within system 100.

The Direct-Connect™ communications system shown generally in FIG. 1, comprising input/output nodes such as 150 and compute nodes such as 110, is connected in a topology known as the "hypercube." Each compute node such as 110 is connected with individual channels such as 130, 140, and 160 wherein each node has N neighbors. The maximum value of N, in the preferred embodiment, is 7. Therefore, in the hypercube topology of the preferred embodiment, a maximum of 128 compute nodes (2^N where N = 7) may reside in the system. Each compute node can connect to at most one I/O node such as 150. Although FIG. 1 only shows eight compute nodes for illustration purposes, in the preferred embodiment, 64 compute nodes reside in system 100. In alternative embodiments, this value may be another power of two. In the preferred embodiment, system 100 has only eight I/O nodes, most of the compute nodes not being coupled to a corresponding I/O node. Also, one compute node 111 in system 100 does not have a corresponding I/O node. This is because compute node 111 is coupled to a front-end unit known as the System Resource Manager (SRM) 151 as shown in FIG. 1. The topology of system 100 allows nodes residing anywhere in the hypercube to communicate with other nodes residing in the hypercube, thus sharing resources such as disk drives, tape drives, and other computing resources as if they were directly connected to the node. System 100 allows bidirectional transfers at rates of up to 5.6 megabytes per second within the hypercube. Even if a resource requested is coupled to another compute node, the response from the I/O request given by a node is as if the resource is directly connected to the node. A detailed description of the Direct-Connect™ communications technology used in system 100 of the preferred embodiment may be found in U.S. patent application Ser. No. 07/298,551, now abandoned, whose inventor is Stephen F. Nugent.
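For readers unfamiliar with the topology just described, the neighbor relation in a hypercube can be made concrete with a few lines of Python. The patent states only that each node has N neighbors and that up to 2^N nodes (128 when N = 7) may reside in the system; the single-bit-difference labeling below is the conventional hypercube numbering and is assumed here purely for illustration.

    def hypercube_neighbors(node_id, dimension=7):
        """Node labels that differ from node_id in exactly one bit
        (conventional hypercube numbering; an assumption, not taken
        from the patent)."""
        return [node_id ^ (1 << d) for d in range(dimension)]

    # In a dimension-3 (eight-node) cube like the one drawn in FIG. 1,
    # node 0 has channels to nodes 1, 2, and 4.
    print(hypercube_neighbors(0, dimension=3))   # [1, 2, 4]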

As shown in FIG. 2, each compute node comprises an i860™ RISC (reduced instruction set computer) central processing unit 201 manufactured by Intel Corporation of Santa Clara, Calif. Each compute node such as 110 comprises a bus 205 which is coupled to the i860™ central processing unit. Further, bus 205 provides communication between central processing unit 201 and 8 megabytes of random access memory 202 for storing and processing of variables assigned to node 110. Further, bus 205 is coupled to network interface 203 which allows communication with other nodes in system 100.

... sections are particular portions of the disk sections which is solved by a particular compute node such as 110 shown in FIG. 1. Each disk section size (L) is determined by the number of compute nodes in system 100, and the memory and fixed storage capacity of each of these nodes. In system 100, each disk section such as 310 comprises 64 M x M node sections because there are 64 computing nodes in the preferred embodiment. In FIG. 3, however, only sixteen node sections are shown for illustration purposes. In the preferred embodiment, taking into account double buffering and the three variables required from each node block, each node section size is limited to 0.9 megabytes of each computing node's memory. This translates into a node block of dimensions 224 elements ...