United States Patent [19]
Scott

[11] Patent Number: 5,301,342
[45] Date of Patent: Apr. 5, 1994

[54] PARALLEL PROCESSING COMPUTER FOR SOLVING DENSE SYSTEMS OF LINEAR EQUATIONS BY FACTORING ROWS, COLUMNS, AND DIAGONAL, INVERTING THE DIAGONAL, FORWARD ELIMINATING, AND BACK SUBSTITUTING

[75] Inventor: David S. Scott, Portland, Oreg.
[73] Assignee: Intel Corporation, Santa Clara, Calif.
[21] Appl. No.: 632,462
[22] Filed: Dec. 20, 1990
[51] Int. Cl.5 .......... G06F 15/347; G06F 15/32
[52] U.S. Cl. .......... 395/800; 364/735; 364/736; 364/754; 364/194; 364/937.1; 364/931.03; 364/931.42; 364/931.01; 364/931.41; 364/DIG. 2
[58] Field of Search .......... 395/800; 364/133, 194, 719, 735, 736, 754

[56] References Cited
U.S. PATENT DOCUMENTS
4,787,057 11/1988 Hammond .......... 364/754
4,914,615  4/1990 Karmarkar et al. .......... 364/754
4,947,480  8/1990 Lewis .......... 364/572
5,136,538  8/1992 Karmarkar et al. .......... 364/754
5,163,015 11/1992 Yokota .......... 364/578

Primary Examiner-Thomas C. Lee
Assistant Examiner-Paul Harrity

[57] ABSTRACT

A parallel processing computer system for solving a system of linear equations having coefficients residing in a first matrix and right-hand sides of the linear equations residing in a first vector. The first matrix is divided into a plurality of ND row disk sections, a plurality of ND column disk sections and ND diagonal disk sections. Each of these sections, in a preferred embodiment, are known as disk sections, and are stored on non-volatile media such as magnetic and/or optical disks. Further, the equations are defined by the first vector, the first vector comprising ND sections. Each of the plurality of j row sections and j column sections is factored. Then, the j diagonal section is factored and inverted. In a preferred embodiment, the inversion uses a Gauss-Jordan technique. These steps are repeated for all values of j that range between 1 and ND. Then, forward elimination is performed for all sections in the first vector using the first matrix, and back substitution is performed for all sections in the first vector using the first matrix.

11 Claims, 15 Drawing Sheets
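The factor/solve sequence described in the abstract can be sketched as a dense blocked solver. This is an illustrative in-memory sketch, not the patent's out-of-core implementation: the name `block_solve`, the unpivoted block update order, and the use of `np.linalg.inv` in place of the patent's Gauss-Jordan step are assumptions of this sketch.

```python
import numpy as np

def block_solve(A, b, nd):
    """Factor the j-th row, column, and diagonal sections for j = 1..nd,
    invert each diagonal section, then forward-eliminate and back-substitute.
    A and b are modified in place; the solution vector is returned."""
    n = A.shape[0]
    m = n // nd                        # section size (assumes nd divides n)
    blk = lambda i, j: A[i*m:(i+1)*m, j*m:(j+1)*m]   # view of block (i, j)
    vec = lambda i: b[i*m:(i+1)*m]                   # view of section i of b
    dinv = [None] * nd                 # inverted diagonal sections
    for j in range(nd):
        # Factor the j-th column section (includes the diagonal block).
        for i in range(j, nd):
            for k in range(j):
                blk(i, j)[:] -= blk(i, k) @ blk(k, j)
        # Factor the j-th row section.
        for i in range(j + 1, nd):
            for k in range(j):
                blk(j, i)[:] -= blk(j, k) @ blk(k, i)
        # Invert the diagonal section (the patent uses a Gauss-Jordan
        # technique with pivoting; np.linalg.inv stands in for it here).
        dinv[j] = np.linalg.inv(blk(j, j))
        for i in range(j + 1, nd):     # finish the sub-diagonal (L) blocks
            blk(i, j)[:] = blk(i, j) @ dinv[j]
    # Forward elimination (the L blocks have unit diagonal sections).
    for i in range(nd):
        for k in range(i):
            vec(i)[:] -= blk(i, k) @ vec(k)
    # Back substitution using the stored inverted diagonal sections.
    x = np.zeros_like(b)
    for i in reversed(range(nd)):
        t = vec(i).copy()
        for k in range(i + 1, nd):
            t -= blk(i, k) @ x[k*m:(k+1)*m]
        x[i*m:(i+1)*m] = dinv[i] @ t
    return x
```

The sketch assumes the block factorization succeeds without inter-block pivoting, which holds for well-conditioned (e.g. diagonally dominant) systems; the patent's FIGS. 7a-7d handle pivoting within the diagonal inversion itself.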
[Cover figure: three block-partitioned matrices A (A11-A44), B (B11-B44), and C (C11-C44), illustrating the division of a matrix into row, column, and diagonal sections; the graphics are not recoverable from the text extraction.]
[Drawing sheets 1-15, U.S. Patent 5,301,342, Apr. 5, 1994. The sheets are block diagrams and flowcharts whose graphics did not survive text extraction; the recoverable labels are:
FIG. 1 (Sheet 1): system 100, with SRM 151, compute nodes 110 and 111, I/O node 150, channels 130, 140, 160 and 170, bus 180, and SCSI I/O devices including unit 184.
FIG. 2 (Sheet 2): one compute node 110: processing unit 201, 8 megabytes of memory 202, network interface 203, network router, and bus 205.
FIG. 3 (Sheet 3): a matrix divided into sections (items 321, 323).
FIGS. 4a-4c (Sheets 4-6): "DO ITH ROW" and "DO ITH DIAGONAL" factoring flowcharts, including the step "Aii = Inverse of Aii".
FIGS. 5a-5b (Sheets 7-8): forward elimination and back substitution flowcharts.
FIG. 6 (Sheet 9): the numbered order (1-16) in which disk sections are factored, item 600.
FIGS. 7a-7d (Sheets 10-13): Gauss-Jordan inversion flowchart, including a pivot search (MAX), the exit "STOP: MATRIX IS SINGULAR", a reciprocal-pivot step (TEMP = Aii; Aii = 1/Aii), and elimination steps.
FIG. 8 (Sheet 14): the order in which node sections are passed among compute nodes.
FIG. 9 (Sheet 15): "MATRIX-MATRIX PRODUCT" flowchart 900, with an accumulation step of the form TEMPj = TEMPj + a * Bjk.]
PARALLEL PROCESSING COMPUTER FOR SOLVING DENSE SYSTEMS OF LINEAR EQUATIONS BY FACTORING ROWS, COLUMNS, AND DIAGONAL, INVERTING THE DIAGONAL, FORWARD ELIMINATING, AND BACK SUBSTITUTING

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an apparatus and method for solving large dense systems of linear equations on a parallel processing computer. More particularly, this invention relates to an efficient parallel processing method which solves dense systems of linear equations (those containing approximately 25,000 variables) wherein the system efficiently utilizes the computing and input/output resources of the parallel processing computer and minimizes idle time of those resources.

2. Prior Art

For some applications, such as determining the radar cross-section of aircraft, or simulating flow across an airfoil, very large systems of linear equations are used. These linear equations sometimes range on the order of 25,000 variables or more. It therefore is necessary to solve very large matrices comprising approximately 25,000 x 25,000 elements or more for solving these linear equations. This is typically difficult to accomplish on prior art linear algebra computers due to their inherent limitations, such as processing power or input/output device speed. Also, general limitations have prevented the solutions for these problems, such as bandwidth and the cost of super-computing resources. Parallel computing architectures have been used for solving certain types of problems, including systems of linear equations, since many small processors may perform certain operations more efficiently than one large high-speed processor, such as that present in a typical super-computer. The practical limitations in partitioning the problem to be solved on parallel-processing machines, however, have hindered the usefulness of this architecture.

In a parallel processing machine, it is necessary to break down a problem (such as the operations to solve a large matrix representing a system of linear equations) into a series of discrete problems in order for the system to generate a solution. Segmenting such a problem is a nontrivial task, and must be carefully designed in order to maximize processing by each of the parallel nodes and minimize the input/output operations which must be performed. This is done so that the system does not spend the majority of its time performing input/output operations (thereby becoming "I/O bound") while the computing resources in the system remain substantially idle. Therefore, one goal of designing systems and problems is to maximize the amount of time that the system is processing. Another goal of parallel processing architectures in implementing very large matrix solutions is to balance the I/O capacity of the system so that the system does not become totally "compute-bound" (e.g., processing only, no input/output operations) while the I/O units remain idle.

Another problem with parallel processing computers is that large chunks of a matrix (perhaps the entire matrix) need to be loaded into the main memory of each parallel processing node in order to compute the solution. Given that very large matrices require vast amounts of computer main memory, certain systems perform matrix solutions "out-of-core." In other words, elements within the matrix that are not needed at a particular time may be written off to disk, and/or spooled to tape, for later retrieval and operation upon in the main memory of each node in the parallel processing system. Such an arrangement provides an easy system of maintaining backups (since they are constantly being spooled off to tape, disk or other similar media), as well as providing a natural check-point at which computation may resume if computer system power is lost, a system malfunction occurs, or some other event occurs which halts processing. Given the long time period for solving very large matrices (up to a full real-time month for some very large systems, for example, those containing 100,000 x 100,000 elements), the possibility of losing system power or another type of malfunction may prove fatal to solving the matrix. Such a system may be termed an "out of core solver."

SUMMARY AND OBJECTS OF THE INVENTION

One object of the present invention is to provide for a method which is especially tuned for solving large matrix problems on parallel processing computers. Another object of the present invention is to provide a method for breaking down a system of large linear equations for an efficient solution by a parallel processing computer.

These and other objects of the invention are provided for by a method in a parallel processing computer of solving a system of linear equations having coefficients residing in a first matrix and right-hand sides of the linear equations residing in a first vector. The first matrix is divided into a plurality of ND row sections, a plurality of ND column sections and ND diagonal sections. Each of these sections, in a preferred embodiment, is known as a disk section, and is stored on non-volatile media such as magnetic and/or optical disks. Further, the equations are defined by the first vector, the first vector comprising ND sections. Each of the plurality of j row sections and j column sections is factored. Then, the j diagonal section is factored and inverted. In a preferred embodiment, the inversion uses a Gauss-Jordan technique. These steps are repeated for all values of j that range between 1 and ND. Then, forward elimination is performed for all sections in the first vector using the first matrix, and back substitution is performed for all sections in the first vector using the first matrix.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which:

FIG. 1 shows the parallel processing architecture of the preferred embodiment.
FIG. 2 shows the internal architecture of one compute node of the preferred embodiment.
FIG. 3 shows a matrix which is to be solved by being divided into node and disk sections.
FIGS. 4a, 4b and 4c show a factoring method for a matrix used in the present invention.
FIGS. 5a and 5b show the forward elimination and back substitution phases for the solve method of the preferred embodiment.
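The Gauss-Jordan inversion of a diagonal section named in the summary, and flowcharted in FIGS. 7a-7d with a pivot search and a singularity check, can be sketched as below. This is a reconstruction from the flowchart fragments, not the patent's exact in-place procedure; the augmented-matrix layout and the name `gauss_jordan_inverse` are assumptions of the sketch.

```python
import numpy as np

def gauss_jordan_inverse(d):
    """Invert a square block by Gauss-Jordan elimination with partial
    pivoting, in the spirit of the diagonal-section step of FIGS. 7a-7d."""
    n = d.shape[0]
    aug = np.hstack([d.astype(float), np.eye(n)])  # augmented system [D | I]
    for j in range(n):
        # Pivot search down column j (FIG. 7a: find the MAX entry).
        p = j + int(np.argmax(np.abs(aug[j:, j])))
        if aug[p, j] == 0.0:
            # FIG. 7b exit: "STOP: MATRIX IS SINGULAR".
            raise ValueError("matrix is singular")
        aug[[j, p]] = aug[[p, j]]      # swap the pivot row into place
        aug[j] /= aug[j, j]            # scale row j by the reciprocal pivot
        for k in range(n):             # eliminate column j from other rows
            if k != j:
                aug[k] -= aug[k, j] * aug[j]
    return aug[:, n:]                  # right half now holds D^-1
```

A zero pivot column triggers the flowchart's singular-matrix exit, raised here as a `ValueError`; the input block is left unmodified.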
FIG. 6 shows a portion of a matrix and the order in which particular disk sections are factored in the preferred embodiment.
FIGS. 7a, 7b, 7c, and 7d show a method used for inverting a block in a matrix.
FIG. 8 shows which node sections of the matrix are required by a compute node at a given time and the order in which those node sections are passed to other nodes in the parallel processing computer.
FIG. 9 shows the method used for a node section matrix-matrix multiply operation which is performed by each compute node in the preferred embodiment.

DETAILED DESCRIPTION

A method for solving a very large dense system of linear equations in a parallel processing computer is described. In the following description, for the purposes of explanation, specific memory lengths, input/output devices, architectures, and matrix sizes are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well known circuits and devices are shown in block diagram form in order to not unnecessarily obscure the present invention.

Referring to FIG. 1, the apparatus on which the method of the preferred embodiment is implemented is shown generally as 100. System 100 is a parallel processing machine known as the Intel iPSC(R)/860 computer manufactured by Intel(R) Corporation's Supercomputer Systems Division of Beaverton, Oreg. (Intel and iPSC are registered trademarks of Intel Corporation of Santa Clara, Calif.). System 100 is also known generically as the "hypercube." Note that system 100 generally comprises a series of nodes such as 110 and 150 shown in FIG. 1 which are interconnected by a series of channels such as 130, 140 and 160. A node may be an input/output (I/O) node such as 150 or a compute node such as 110. System 100 as shown in FIG. 1 implements a distributed computing system wherein input/output devices are available to all other nodes in system 100 through their respective I/O nodes such as 150 and their compute nodes such as 110.

I/O node 150 is coupled to compute node 110 through a line 170. It is also coupled to a series of I/O devices such as disk drives 181 and 182, tape drive 183, or optical disk carousel 184. These devices are coupled to I/O node 150 through bus 180 shown in FIG. 1. In the preferred embodiment, each of the I/O devices 181-184 is a small computer system interface (SCSI) device allowing transfer rates of up to 1 megabyte per second to I/O node 150. In the preferred embodiment, disks 181 and 182 are Winchester class sealed SCSI hard disk drives which, at this time, have a maximum storage capacity of approximately 760 megabytes. Tape drive 183, in the preferred embodiment, is an 8 millimeter helical scan digital tape drive typically used in video tape applications. It has an approximate capacity equal to three disk drives such as 181 and 182. Tape drive 183 allows the spooling of entire disk drives such as 181 and 182 onto tape for later processing, or it may be used for performing backup operations. Lastly, I/O node 150 may be coupled to an optical disk drive carousel unit 184. Unit 184 may be an optical disk drive carousel (a "CD-ROM jukebox") or, in an alternative embodiment, a very large single optical drive unit. Optical disk drive 184 has a mechanism which allows access to any one of many rewritable optical disks contained within unit 184. This gives system 100 access to many gigabytes of data, since these optical disks have a capacity of approximately half a gigabyte. In systems having more than 25,000 variables, the use of large amounts of fixed media storage, such as optical disk carousels, is required. In the preferred embodiment, I/O node 150 is an 80386 microprocessor manufactured by Intel Corporation of Santa Clara, Calif.

I/O node 150 is interconnected with its corresponding compute node such as 110 in system 100 shown in FIG. 1 through a high speed network known as the Direct-Connect(TM) network manufactured by Intel Corporation of Santa Clara, Calif. Each Direct-Connect(TM) channel such as 170 shown in FIG. 1 is a high capacity dedicated connection allowing bidirectional transfers at a rate of up to 5.6 megabytes per second. This channel allows access to all nodes in system 100 by each of the other nodes. All I/O nodes such as 150 and I/O devices such as 181-184 are treated on system 100 as one logical entity using the Concurrent File System (CFS), also a product of Intel Corporation of Santa Clara, Calif. In other words, even if access is made to different I/O nodes and devices by computing nodes in system 100 other than 110, these devices appear as one logical entity to the node requesting the access. The CFS of the preferred embodiment exploits parallel access to all I/O devices and all processors for efficient, automatic routing of I/O requests and data over the network. When large files are created by a single compute node, CFS automatically distributes the files among the available I/O devices. When many small files are accessed, the load is typically distributed among available I/O devices, available I/O channels, and the internal communication channels within system 100.

The Direct-Connect(TM) communications system shown generally in FIG. 1, comprising input/output nodes such as 150 and compute nodes such as 110, is connected in a topology known as the "hypercube." Each compute node such as 110 is connected with individual channels such as 130, 140, and 160, wherein each node has N neighbors. The maximum value of N, in the preferred embodiment, is 7. Therefore, in the hypercube topology of the preferred embodiment, a maximum of 128 compute nodes (2^N where N = 7) may reside in the system. Each compute node can connect to at most one I/O node such as 150. Although FIG. 1 only shows eight compute nodes for illustration purposes, in the preferred embodiment, 64 compute nodes reside in system 100. In alternative embodiments, this value may be another power of two. In the preferred embodiment, system 100 has only eight I/O nodes, most of the compute nodes not being coupled to a corresponding I/O node. Also, one compute node 111 in system 100 does not have a corresponding I/O node. This is because compute node 111 is coupled to a front-end unit known as the System Resource Manager (SRM) 151 as shown in FIG. 1. The topology of system 100 allows nodes residing anywhere in the hypercube to communicate with other nodes residing in the hypercube, thus sharing resources such as disk drives, tape drives, and other computing resources as if they were directly connected to the node. System 100 allows bidirectional transfers at rates of up to 5.6 megabytes per second within the hypercube. Even if a resource requested is coupled to another compute node, the response from the I/O request given by a node is as if the resource is directly connected to the node. A detailed description of the Direct-Connect(TM) communications technology used in system 100 of the preferred embodiment may be found in U.S. patent application Ser. No. 07/298,551, now abandoned, whose inventor is Stephen F. Nugent.

As shown in FIG. 2, each compute node comprises an i860(TM) RISC (reduced instruction set computer) central processing unit 201 manufactured by Intel Corporation of Santa Clara, Calif. Each compute node such as 110 comprises a bus 205 which is coupled to the i860(TM) central processing unit. Further, bus 205 provides communication between central processing unit 201 and 8 megabytes of random access memory 202 for storing and processing of variables assigned to node 110. Further, bus 205 is coupled to network interface 203 which allows communication with other nodes in system 100.

Node sections are particular portions of the disk sections, each of which is solved by a particular compute node such as 110 shown in FIG. 1. Each disk section size (L) is determined by the number of compute nodes in system 100, and the memory and fixed storage capacity of each of these nodes. In system 100, each disk section such as 310 comprises 64 M x M node sections because there are 64 computing nodes in the preferred embodiment. In FIG. 3, however, only sixteen node sections are shown for illustration purposes. In the preferred embodiment, taking into account double buffering and the three variables required from each node block, each node section size is limited to 0.9 megabytes of each computing node's memory. This translates into a node block of dimensions 224 elements x