Scalability of the Cedar System

Stephen W. Turner

Alexander V. Veidenbaum

Center for Supercomputing Research and Development, University of Illinois, Urbana, IL 61801

Abstract

Cedar scalability is studied via simulation and measurement. The simulation methodology is verified by comparing simulated performance with that of the real machine. The performance scalability of the interconnection networks and memory modules which compose Cedar's shared memory system is then examined in detail. The system is shown to be basically scalable in performance, but not perfectly so. A "brute force" approach to increasing scalability, doubling the clock speed of the memory subsystem, is shown to be only moderately effective at improving scalability. Finally, by limiting traffic in the network, the scalability of the system is increased significantly at very little cost.

1 Introduction

Cedar was designed as a prototype of a scalable shared-memory multiprocessor. In this paper, we evaluate how successful that design is at achieving "scalable" performance. We define scalable performance via the execution time of benchmark kernels. As used here, scalable performance implies that the benchmark run times decrease linearly with an increasing number of processors, for a given range of system sizes. We are concerned here with systems containing between 32 and 512 processors. We also evaluate methods for improving the scalability of performance. We do so by examining the behavior of Cedar via hardware monitoring and by simulating larger versions of it. These simulations are used to assess improvements to the design and to confirm the results of our previous work regarding the performance of MINs [1, 2, 3, 4].
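For concreteness, one way to state this criterion in terms of speedup is sketched below; the notation is introduced here only for illustration and is not taken from the original text.

% T(P): run time of a kernel on a P-processor system; P0: a reference size,
% for example the 32-processor base configuration (an assumption of this sketch).
\[
  S(P) \;=\; P_{0}\,\frac{T(P_{0})}{T(P)} ,
  \qquad 32 \le P \le 512 .
\]
Performance is then scalable over the range if $S(P) \approx \alpha P$ for some constant $\alpha > 0$, and perfectly scalable if the speedup curve has unity slope, $\alpha = 1$.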

This research was supported by the Department of Energy under Grant No. DE-FG02-85ER25001, by the National Science Foundation under Grants No. US NSF-MIP-8410110 and NSF-MIP-89-20891, by IBM Corporation, and by the State of Illinois.

A major issue in any simulation study is verification: do the results match reality? Since scalability is usually studied via simulation, this type of study requires particular care. Accordingly, a major part of this work concentrates on developing a methodology which uses the Cedar system to verify simulations against measurements. The next section describes this methodology, as well as the Cedar system and the benchmarks used to evaluate it. Following this, the performance scalability of the underlying architecture is examined and the benefit of a simple traffic-limiting mechanism is demonstrated. Finally, we summarize the main results and conclude.

2 Methodology

In this study, traces of program behavior are used to drive a network and shared-memory simulator configured to model Cedar. To verify the accuracy of the simulator, its results are compared to the performance of Cedar itself. The performance scalability of Cedar's memory system design is then explored by simulating larger versions of the machine. This methodology also allows assessment of the effectiveness of design alterations, two of which are described here. This section contains descriptions of the Cedar system, the technique used to gather program traces, the benchmark programs which were traced and simulated, and the verification of the accuracy of those simulations.

2.1 The Cedar architecture

Cedar is a hierarchical shared-memory multiprocessor consisting of four clusters of vector processors connected to a 32-bank, word-interleaved shared memory via two unidirectional multistage shuffle-exchange networks (see Figure 1).

Figure 1: Cedar Architecture (clusters #0 through #N-1 connected through crossbar switches to the global memory modules)

Each cluster is a modified Alliant FX/8, containing eight vector processors, a multiported shared cache and 32 megabytes of cluster memory. Each vector processor is provided with a separate port into the shared memory system. For a more detailed description of Cedar see [5]. The experiments described here concentrate on the performance of the shared memory system: virtually all benchmark data is located in globally shared memory, with cluster memory being used only for compiler temporaries, loop indices and other such scalar variables.

The traces used in this paper are derived from the Hardware Performance Monitor (HPM) facility of Cedar. This facility includes a data probe for each processor and buffers to store data gathered by those probes. The probes monitor memory-mapped locations and transfer data written to those locations into the buffers, along with a 50 ns-resolution, 48-bit timestamp.

2.2 Tracing technique

The Cedar FORTRAN compiler (cf) is used to produce optimized Cedar assembly code from FORTRAN versions of the benchmark kernels. Code fragments which write to the HPM are then inserted before and after each vectorized global-memory access by an automatic instrumentation script.

When the instrumented code is executed, the resulting events are captured and timestamped by the HPM. Because there is an HPM event immediately before and after each vector access, the time between successive accesses is easily determined. This time is spent in local computation and local memory operations. Because this local time is very nearly independent of the behavior of the shared-memory system, our simulator does not need to emulate local operations; the trace used to drive the simulator includes them as fixed inter-burst delays. The HPM events describe the type of each vector access (fetch or write), its base address, stride and length. These shared-memory accesses are issued into a register-transfer-level simulator of the shared memory and networks. In all the experiments discussed here, the benchmarks use double-precision real values, and hence the basic access type is assumed to be one 8-byte word; all of the traced values are in multiples of that size.

The trace also includes events which correspond to the beginning and end of each loop and each loop iteration. This information is used to partition the loops and statically schedule the iterations onto the processors of the architecture being simulated. The benchmarks are compiled with 64 large-grain tasks (SDOALL iterations, in Cedar FORTRAN) which are partitioned over the 8-processor clusters which make up the target machine. Each large-grain task contains medium-grain parallelism in the form of parallel (CDOALL) loops and fine-grain parallel vector operations within those loops. The CDOALL loop iterations are scheduled across the processors within each cluster, and the vector operations within each such iteration result in burst memory accesses overlapped with ALU operations on each processor.
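To make the trace-driven approach concrete, the sketch below shows one possible shape of a trace record and of the replay loop that expands each burst into word requests after charging the local time as a fixed inter-burst delay. The record layout, field names and the advance_time()/issue_word() hooks are our own illustrative assumptions, not the actual HPM format or Cedar simulator interface.

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative trace record: one vectorized global-memory access.
 * Field names are assumptions for this sketch, not the HPM format. */
typedef enum { VEC_FETCH, VEC_WRITE } access_type_t;

typedef struct {
    access_type_t type;   /* fetch or write                         */
    uint64_t base;        /* base byte address of the burst         */
    int64_t  stride;      /* stride in 8-byte words                 */
    uint32_t length;      /* number of 8-byte words in the burst    */
    uint64_t local_delay; /* cycles of local work since last burst  */
} trace_record_t;

/* Hypothetical hooks into a register-transfer-level memory simulator. */
static void advance_time(uint64_t cycles) { (void)cycles; /* model local work */ }
static void issue_word(access_type_t t, uint64_t addr) { (void)t; (void)addr; }

/* Replay one processor's trace: local time is charged as a fixed
 * inter-burst delay, then the burst is expanded into 8-byte word requests. */
static void replay(const trace_record_t *trace, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const trace_record_t *r = &trace[i];
        advance_time(r->local_delay);
        for (uint32_t w = 0; w < r->length; w++)
            issue_word(r->type, r->base + (uint64_t)((int64_t)w * r->stride * 8));
    }
}

int main(void)
{
    /* Tiny hand-made example: a 4-word fetch followed by a 4-word write. */
    trace_record_t t[] = {
        { VEC_FETCH, 0x1000, 1, 4, 25 },
        { VEC_WRITE, 0x2000, 1, 4, 10 },
    };
    replay(t, sizeof t / sizeof t[0]);
    puts("replayed 2 burst records");
    return 0;
}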

2.3 The benchmarks

We use four kernels to evaluate the performance of the shared-memory system. CG is an inner loop of a conjugate gradient solver. RankK multiplies a square matrix by a set of vectors, using a core loop of hand-coded Cedar Assembly language. PM3 is the core loop of a tridiagonal matrix solver. VFVW is a vector copy implemented via burst fetches followed by burst writes. These four benchmarks are chosen to represent a type of workload typical of large-scale shared-memory machines: memory-intensive scientific codes. While they are all numeric codes that make heavy use of the shared memory system, they are quite different from each other in many respects. CG contains complex local loop control and very irregular memory accesses, including a scatter-gather operation. RankK makes very regular use of memory and has minimal loop overheads, because it has been extensively hand-optimized to fit the Cedar system. PM3 is a "typical" core loop of a numeric solver. It has not been hand-optimized, but it does exhibit regular memory access patterns. PM3 also has a moderate amount of local computation between memory references. VFVW has virtually no local time between memory references, and is intended as a "stress test" for the memory system. Because of all these variations the benchmarks cover a broad range of suitable applications, and so their results should give a good indication of system performance.

All four benchmarks are scheduled statically across Cedar's four clusters, and are run with as large a data set as is possible while still avoiding TLB faults. TLB faults result in large (circa 7.5 ms) gaps of inactivity in the trace, and in order to observe "steady-state" behavior in the presence of these gaps, traces of much longer runs (1 second or more) would need to be gathered and simulated. Since this is impractical, we choose instead to concentrate our attention on the behavior of the system in the absence of such system overheads. These overheads should be relatively small in relation to the execution of entire programs, and so can be ignored in practice.

The address distribution of the shared-memory accesses is an important differentiator of the various kernels. This is particularly important because non-uniformity in the number of requests destined for each module can directly affect the scalability of the benchmarks. As the size of the system grows, the number of memory modules increases, and the effect of non-uniform distributions of memory requests also increases. To illustrate this, Figure 2 shows the distribution of accesses directed to each of 32 interleaved memory modules for the CG benchmark. Figure 3, on the other hand, shows the same addresses distributed over 512 memory modules. In each case the value shown is the number of accesses to a specific module divided by the average number of accesses per module. Clearly, this increasingly unbalanced distribution of addresses can cause decreased performance as the system is scaled up. A larger problem size might help to alleviate the problem, but inevitably there will be some non-uniformity in addressing, and its effect can increase as the number of independent modules increases.

Figure 2: Address Distribution for CG on 32 modules (deviation from uniform distribution, per memory module)

Figure 3: Address Distribution for CG on 512 modules (deviation from uniform distribution, per memory module)
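The non-uniformity metric plotted in Figures 2 and 3 (accesses to a module divided by the average number of accesses per module) can be computed as in the following sketch. The word-interleaved mapping addr/8 mod M and the synthetic address stream are assumptions made for illustration; the real Cedar mapping and the CG addresses are not reproduced here.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Deviation from a uniform distribution for a word-interleaved memory:
 * module = (byte address / 8) mod num_modules, and each module's count is
 * reported relative to the per-module average (1.0 == perfectly uniform). */
static void module_deviation(const uint64_t *addrs, size_t n,
                             unsigned num_modules, double *out)
{
    uint64_t *count = calloc(num_modules, sizeof *count);
    if (!count)
        return;
    for (size_t i = 0; i < n; i++)
        count[(addrs[i] / 8) % num_modules]++;
    double avg = (double)n / num_modules;
    for (unsigned m = 0; m < num_modules; m++)
        out[m] = count[m] / avg;
    free(count);
}

int main(void)
{
    /* Synthetic example: a two-word stride that loads only even modules. */
    enum { N = 1024, M = 32 };
    uint64_t addrs[N];
    for (int i = 0; i < N; i++)
        addrs[i] = (uint64_t)i * 16;
    double dev[M];
    module_deviation(addrs, N, M, dev);
    for (int m = 0; m < M; m++)
        printf("module %2d: %.2f\n", m, dev[m]);
    return 0;
}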

2.4 Simulator verification

Our simulator is intended to model the behavior of Cedar's networks and shared-memory system with very high fidelity, and the traces gathered via the HPM system contain a record of all important shared-memory events. However, there is always some doubt as to whether a simulation is accurately modeling the system under study. In order to demonstrate the accuracy of our tracing and simulation technique, the system's simulated performance is compared to the measured performance of Cedar. To provide a baseline for comparison, the execution time (as well as other performance metrics) of uninstrumented versions of the kernels is measured via the HPM directly.

The simulations are verified in two ways. First, the method is tested by tracing the codes while running them on all 32 of Cedar's processors. These traces are then run through a 32-processor version of the simulator, and the results are compared with those determined by hardware measurement. These are found to be in good agreement. Second, in order to test that trace scheduling does not introduce distortions that invalidate the results, the codes are traced while running on 2 of Cedar's 4 clusters. The resulting traces are then scheduled and run through a 32-processor simulation. The results of these simulations are again found to be in good agreement with the numbers measured on the full machine, thus verifying the scheduling methodology.

The results of these comparisons are shown in Table 1. The measured results are determined by timing uninstrumented versions of the kernels several times and averaging the results. Clearly the match between run times is good, particularly for the two benchmarks which have simpler structures: PM3 and VFVW. This shows that the tracing and simulation method used here is capable of producing a very accurate reflection of reality.

The simulator used here is a new version of those used in previous work, but it does not model any additional hardware features. We have previously verified those earlier versions against the performance of Cedar. Because both simulators match the performance of the real machine, we have high confidence in their results, and we do not repeat the previous experiments here. Instead we will summarize their key aspects here, as necessary to support our current study.

Table 1: Measured vs. Simulated Performance

Run Time     Measured    Simulated (full Cedar trace)    Simulated (16-of-32 trace)
CG             392319        383461                          383246
PM3            459717        458781                          459224
RankK         2165716       2137913                         1995231
VFVW           105473        103090                          105910

Figure 4: Performance of scaled-up Cedar (speedup vs. system size for RankK, VFVW, PM3 and CG, with the ideal curve)
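As a quick check on the agreement claimed above, the run times in Table 1 can be compared directly; the small program below simply derives the relative differences from the table's own numbers.

#include <stdio.h>

/* Relative difference between measured and simulated run times, Table 1. */
int main(void)
{
    const char  *name[]     = { "CG", "PM3", "RankK", "VFVW" };
    const double measured[] = {  392319,  459717, 2165716, 105473 };
    const double full[]     = {  383461,  458781, 2137913, 103090 };
    const double half[]     = {  383246,  459224, 1995231, 105910 };

    for (int i = 0; i < 4; i++)
        printf("%-6s  full-trace sim: %+.1f%%   16-of-32 sim: %+.1f%%\n",
               name[i],
               100.0 * (full[i] - measured[i]) / measured[i],
               100.0 * (half[i] - measured[i]) / measured[i]);
    return 0;
}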

3 Scalability of Cedar

We now use the simulation method described above to evaluate the performance scalability of the Cedar system. We call a system "scalable" if its performance is proportionate to the number of processors it contains. This is indicated by linear slopes in the speedup graphs of simulated benchmark performance. We call systems which achieve speedups with a unity slope on all benchmarks "perfectly scalable". We assume that the system organization remains unchanged with increasing size except for the change in the number of processors and memory modules and in the number and connectivity of the crossbars in the network. In other words, all system components retain the same basic organization. We also assume the software and its overheads remain the same, except for startup and iteration-scheduling overheads in parallel loops. These overheads are adjusted during simulation, and are assumed to increase linearly with increasing system size. Because these overheads are a small fraction of total run time (less than 1%, even for the largest system), this should not be an important consideration.

It can be argued that such simple scaling of the existing 32-processor implementation may not be practical in reality. We are not arguing that it is. The question we want to answer is: if this system were scaled as described above, how would its performance scale? For those readers who want to argue that this implementation cannot be scaled up, we note that other implementations of this architecture are possible in the range of system sizes we are concerned with: 32 to 512 processors. We do not model these here because in doing so we would lose the advantage of verification. For an examination of implementation issues and the effect of some possible alternative designs, see [6].

The speedups attained by simulating the execution of the benchmarks on larger versions of Cedar are graphed in Figure 4. The performance of two of the four benchmarks increases sublinearly with increasing system size. Even RankK and VFVW, which do exhibit nearly linear increases, do not achieve an ideal slope of unity. The curves are not perfectly straight lines in any of the cases. This is due in part to non-linearities in the way the system components are changed with increasing system size. For example, the switches in the third stage of the interconnection networks are 2x2, 4x4 and 8x8 for 128-, 256- and 512-processor systems respectively. Also, as discussed previously, the mapping of memory addresses to memory banks alters. These changes introduce second-order effects on system performance, which subtly alter the shape of the speedup curves. Clearly, though, all four benchmarks achieve sub-optimal speedups.

Table 2: Memory use as a function of system size

System      Memory Utilization (%)
Size        CG    PM3   RankK   VFVW
 32         51    39     43      61
 64         44    37     43      59
128         30    33     42      50
256         27    31     41      45
512         22    29     51      52

Figure 5: Effect of faster memory (speedup vs. system size with the memory-system clock doubled)
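The stage-by-stage scaling described above (third-stage switches of 2x2, 4x4 and 8x8 at 128, 256 and 512 processors) can be sanity-checked with a small sketch. It assumes, purely for illustration and not as a description of an actual implementation, that the first two network stages keep 8x8 crossbars, so that a third stage of radix N/64 is needed once the system grows beyond 64 ports.

#include <stdio.h>

/* Assumed scaled-Cedar network: two stages of 8x8 crossbars (64 ports),
 * plus a third stage of radix N/64 for larger systems.  This is an
 * illustrative model only. */
static int third_stage_radix(int nprocs)
{
    if (nprocs <= 64)
        return 1;            /* no third stage assumed necessary */
    return nprocs / 64;      /* 128 -> 2x2, 256 -> 4x4, 512 -> 8x8 */
}

int main(void)
{
    const int sizes[] = { 32, 64, 128, 256, 512 };
    for (int i = 0; i < 5; i++) {
        int k = third_stage_radix(sizes[i]);
        if (k == 1)
            printf("%3d processors: two 8x8 stages, no third stage assumed\n", sizes[i]);
        else
            printf("%3d processors: two 8x8 stages + one %dx%d stage\n", sizes[i], k, k);
    }
    return 0;
}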

Where does the fault lie? In previous work [2], we showed that there is no single element of the shared-memory system which can be pointed to as the bottleneck. The performance of both networks (forward and reverse) and of the memory modules is insufficient and must be increased to realize a significant overall performance gain. A simple, though expensive, way to improve the performance of the entire shared-memory system is to double its clock speed. The speedups attained by doing so are plotted in Figure 5. Here, the traces are executed exactly as before, but the network and memory modules are assumed to have a clock period which is one half that of the real machine.

CG still fails to achieve very high speedups. Although there is very little sequential code in any of the benchmarks discussed here, CG has the most. This sequential portion accounts for some of the reduction in speedup, but its uneven distribution of memory requests (shown in Figure 3) is the primary reason that this benchmark fails to achieve scalable speedup. Table 2 shows the effect of this distribution on the memory usage of CG. Whereas the other three benchmarks lose at most 10% of their possible effective bandwidth because of this effect, CG loses almost 30%, an intolerable hardship if this code is to scale in performance.

There are two possible solutions: either the code must be rewritten to better spread the memory addresses across the memory modules, or the shared-memory system must include an address-randomizing function within itself. The latter approach was explored in earlier work [2]. Although it helped smooth out the differences in performance between different benchmarks, the overall gain was not shown to be very significant. In any case, hardware address randomization eliminates the ability of the compiler and/or programmer to optimize an application's address distribution for a particular architecture.
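One common form of hardware address randomization, shown below only as a generic illustration and not as the scheme evaluated in [2], is to XOR-fold the higher-order word-address bits into the module index; this tends to break up the regular strides that concentrate requests on a few modules.

#include <stdio.h>
#include <stdint.h>

/* Generic XOR-folding bank randomization (illustrative only): the module
 * index is the plain interleaved index XORed with higher-order groups of
 * word-address bits.  num_modules must be a power of two. */
static unsigned randomized_module(uint64_t byte_addr, unsigned num_modules)
{
    uint64_t word = byte_addr / 8;          /* 8-byte word index    */
    uint64_t idx  = word;
    unsigned m    = 0;
    while (idx) {                           /* fold all digit groups */
        m   ^= (unsigned)(idx & (num_modules - 1));
        idx /= num_modules;
    }
    return m;
}

int main(void)
{
    /* A stride that maps every access to module 0 under plain interleaving
     * is spread over several modules by the XOR folding. */
    enum { M = 32 };
    for (int i = 0; i < 8; i++) {
        uint64_t addr = (uint64_t)i * 32 * 8;   /* stride of 32 words */
        printf("addr %6llu -> plain %2u, randomized %2u\n",
               (unsigned long long)addr,
               (unsigned)((addr / 8) % M),
               randomized_module(addr, M));
    }
    return 0;
}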

The memory utilization of RankK actually increases with increasing system size, in part because RankK uses a hand-coded assembly inner loop which does just that: it exhibits a very even distribution of reference addresses at all system sizes. The system can be considered scalable, though not perfectly so, as we have shown previously. Achieving this goal with an inexpensive extension of the current organization is still a challenge.

4 Limiting network traffic

In this section we examine an inexpensive yet effective technique for increasing the performance of the Cedar shared-memory system. In previous studies we have shown that Cedar's shared-memory system can be improved by increasing the size of the buffers in the switches which make up the interconnection networks. Without a concomitant increase in the speed of the memory modules, however, the performance improvement from such improved switches is minimal. Similarly, the switches can be redesigned with O(n^2) buffers organized in a non-blocking fashion, which improves the throughput of the individual switches. Again, however, if the throughput of the memory modules is not improved, by increasing their number via a dilated network, increasing their speed, or some other similar (and similarly expensive) technique, then the throughput of the entire memory system does not improve appreciably. None of the network changes examined before (address randomization, changing the amount of data transferred in each packet, even greatly increasing the size of the buffers on the memory modules themselves, although that last technique provides the best performance improvement at relatively low cost) yields a system which approaches the goal of speedups with a slope of unity. In the end, a modification of the network interface to "throttle" the traffic, with virtually no cost if included in the design, is shown to bring the system quite near to this ideal.

Traffic throttling is not a new idea. In the WAN world it has long been commonplace for network routers to examine round-trip latencies and throttle back traffic on congested routes accordingly. Scott and Sohi [7], as well as Farrens, Wetmore and Woodruff [8], propose using feedback from "hot" memory modules to restrict the injection of traffic destined for those modules. We are not proposing to add any such complex mechanisms, though.

More recently, Dally and Aoki [9] have suggested that since their network design is "unstable" at high loads, nodes should be restricted to injecting new packets into a subset of the available virtual channels. This reduces peak throughput at moderately high loads, while preserving it at maximum load (where, without such throttling, network throughput would be reduced to a small fraction of its capacity). Our traffic-limitation proposal imposes no reduction of throughput at any load level and dramatically improves it at high loads.

By setting a limit on the number of outstanding requests that any given processor may have in the shared memory system, we can reduce the amount of contention in the system. And because there is a limit on the number of parallel requests which can be usefully processed, this can increase overall system performance. The benchmarks used in this study spend a large fraction of their time accessing global memory, so limits which help reduce contention (and so increase memory system throughput) can improve their performance. This improvement comes with a possible cost; however, if the limits are appropriately set then this cost can be eliminated. The cost is reduced pipelining in the memory system: if fewer requests are allowed to be in the network simultaneously, then the degree to which memory requests are pipelined is reduced. But there is a limit to how much pipelining can improve throughput. Consider an ideal functional unit with a latency Tl and a pipeline time Ti. Allowing more than No = Tl/Ti operations to be in the functional unit at one time does not improve performance. Once No operations are in progress, the unit is able to produce results at the maximum pipeline rate. Any less, and the effective pipeline time will be Tl/N for N operations in progress; any more will not reduce the pipeline time below Ti. In our case, Tl is the latency and Ti is the interarrival time, or time between the return of two successive word requests. The situation is complicated by the fact that Tl and Ti are dependent on No, but the basic principle still holds. If each processor is allowed at least No outstanding requests, then no penalty due to reduced pipelining will be seen. In our case, Tl is approximately 20 cycles and Ti is nearly 4 under heavy load. The analysis above suggests that benefits due to pipelining should cease after 5 requests per processor are in the network at once. The real minimum is slightly higher than this, which is explained by the fact that there are multiple independent servers: increasing the number of outstanding requests also increases the memory utilization, and so effectively decreases Ti, at least until contention effects increase it again.

The effect of changing No on interarrival time is shown in Figure 6. Contention in the forward network is directly related to the percentage of time that the processor is blocked from introducing new requests into the network; the effect of contention on processor blocking is shown in Figure 7. Notice that while the different benchmarks exhibit increased contention starting at different numbers of outstanding requests, they all have significantly reduced contention at 8 outstanding requests, which is sufficient to reduce the pipeline time to the minimum and to keep sufficient memory modules busy.

Figure 6: Interarrival time (cycles) vs. outstanding requests, with traffic limiting

Figure 7: Processor blocking (% of cycles) vs. outstanding requests, with traffic limiting

One may wonder what would happen if the traffic intensity were significantly reduced from the levels exhibited by our benchmarks. Perhaps a lightly loaded memory system would benefit from greater memory utilization if contention with packets from other processors were not an issue? This is not the case. If the network is lightly loaded, then the vector request latency takes on its minimum value (14 cycles). Similarly, the interarrival time is reduced to its minimum (2 cycles). By the argument above, introducing more than 7 packets into the network at a time will not reduce the pipeline "chime" below 2 cycles, so again 8 packets in the network is sufficient.
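A minimal sketch of the kind of throttle this analysis suggests is shown below: each processor-side interface keeps a counter of outstanding requests and refuses to inject a new one while the counter is at the limit. The structure and function names are illustrative assumptions, not the actual GIB design; the limit of 8 follows from No = Tl/Ti rounded up with a small margin, as argued above.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative per-processor throttle: at most 'limit' requests may be
 * outstanding in the shared-memory system at once. */
typedef struct {
    unsigned outstanding;   /* requests currently in the network/memory */
    unsigned limit;         /* e.g. 8, from No = Tl/Ti plus a margin    */
} throttle_t;

/* Called when the processor wants to inject a request; returns false
 * (processor stalls) if the limit has been reached. */
static bool throttle_try_issue(throttle_t *t)
{
    if (t->outstanding >= t->limit)
        return false;
    t->outstanding++;
    return true;
}

/* Called when a reply returns from the memory system. */
static void throttle_complete(throttle_t *t)
{
    if (t->outstanding > 0)
        t->outstanding--;
}

int main(void)
{
    throttle_t gib = { 0, 8 };
    unsigned issued = 0, blocked = 0;

    /* Toy demonstration: try to issue 12 requests before any reply returns. */
    for (int i = 0; i < 12; i++)
        throttle_try_issue(&gib) ? issued++ : blocked++;
    throttle_complete(&gib);                  /* one reply frees a slot */
    if (throttle_try_issue(&gib)) issued++;

    printf("issued %u, blocked %u, outstanding %u\n",
           issued, blocked, gib.outstanding);
    return 0;
}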

Our conclusion from this is that it would improve system performance if Cedar's Global-memory Interface Board (GIB) issued only 8 requests into the network at a time; more than this increases contention without a corresponding decrease in interarrival time. Though this was a simple experiment, it reveals an important design alteration which can significantly improve performance with virtually no hardware cost. To demonstrate this improvement, the speedups obtained by simulations of Cedar with this modification are shown in Figure 8. With traffic limiting in effect, system performance is almost as high as that achieved by the clock-doubled memory system.

One possible variation on this scheme is to allow dynamic adjustment of the number of outstanding requests. This could realize the minimum interarrival time for each benchmark, which can be seen in Figure 6 to vary between applications. With this optimization, the performance of the shared-memory system could be even higher, still at very little cost.

Figure 8: Speedups with traffic limiting (speedup vs. system size for RankK, VFVW, PM3 and CG, with the ideal curve)

5 Conclusion

We have used a verified simulation methodology to evaluate the scalability of the Cedar shared-memory system. We showed that the system "as-is" is basically scalable in performance, but that it fails to achieve the goal of unity-slope speedups on memory-bound benchmarks. In previous work we demonstrated that improving the performance of any single component of the system was insufficient to realize this goal; several key components needed to be modified simultaneously in order to significantly improve system performance. We then presented a very low-cost modification to the existing hardware which can be used to dramatically improve the performance of the system. The general concept is to limit the number of memory requests to the number that can be profitably processed in parallel. By doing so, we still do not realize our goal of unity-slope speedups, but we do come as close to the goal as was possible via much more expensive modifications. Future work will examine the impact of limiting traffic in combination with other improvements, as well as evaluating the effectiveness of such concepts for other system organizations.

6 Acknowledgements

We would like to thank Hoichi Cheong and Yung-Chin Chen for their contributions to the automatic instrumentation and tracing tools. We are grateful to Sunil Kim and Carl Beckmann, who were responsible for developing the initial Cedar simulator. Also, our thanks go to Kyle Gallivan and Ulrike Meier Yang for their help with the benchmark kernel codes.

References

[1] S. W. Turner and A. V. Veidenbaum, "Performance of a shared memory system for vector multiprocessors," in Proceedings of the International Conference on Supercomputing, pp. 315-325, July 1988.

[2] S. W. Turner, "Shared memory and interconnection network performance for vector multiprocessors," Master's thesis, University of Illinois, Urbana-Champaign, May 1989. CSRD-TR No. 876.

[3] E. D. Granston, S. W. Turner, and A. V. Veidenbaum, "Design of a scalable, shared-memory system with support for burst traffic," in Proceedings of the 1st Annual Workshop on Scalable Shared-Memory Architectures (M. Dubois, ed.), Kluwer & Assoc., Feb. 1991.

[4] D. Kuck et al., "The Cedar system and an initial performance study," in Proceedings of the 20th International Symposium on Computer Architecture, pp. 213-223, May 1993.

[5] J. Konicek, T. Tilton, A. Veidenbaum, C. Zhu, E. Davidson, R. Downing, M. Haney, M. Sharma, P. Yew, P. Farmwald, D. Kuck, D. Lavery, R. Lindsey, D. Pointer, J. Andrews, T. Beck, T. Murphy, S. Turner, and N. Warter, "The organization of the Cedar system," in Proceedings of the International Conference on Parallel Processing (ICPP), vol. I, pp. 49-56, 1991.

[6] S. W. Turner, System Design and Its Performance Implications for Shared-Memory Multiprocessors. PhD thesis, University of Illinois, Urbana-Champaign, 1994. In preparation.

[7] S. L. Scott and G. S. Sohi, "The use of feedback in multiprocessors and its application to tree saturation control," IEEE Transactions on Parallel and Distributed Systems, vol. 1, pp. 385-399, Oct. 1990.

[8] M. Farrens, B. Wetmore, and A. Woodruff, "Alleviation of tree saturation in multistage interconnection networks," in Proceedings of Supercomputing '91, pp. 400-409, 1991.

[9] W. J. Dally and H. Aoki, "Deadlock-free adaptive routing in multicomputer networks using virtual channels," IEEE Transactions on Parallel and Distributed Systems, vol. 4, pp. 466-475, Apr. 1993.
