Computational Characteristics of Production Seismic Migration and its Performance on Novel Processor Architectures

In Proceedings of the 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Oct. 2007. Available from http://www.cos.ufrj.br/~monnerat

Jairo Panetta, Paulo R. P. de Souza Filho, Carlos A. da Cunha Filho, Fernando M. Roxo da Motta, Silvio Sinedino Pinheiro, Ivan Pedrosa Junior, Andre L. Romanelli Rosa, Luiz R. Monnerat, Leandro T. Carneiro and Carlos H. B. de Albrecht
Petróleo Brasileiro SA, Petrobras
[email protected], {prps, ccunha}@petrobras.com.br, [email protected], {sinedino, ivanjr, aromanelli, luiz.monnerat, ltcarneiro.atos_origin, albrecht}@petrobras.com.br

Abstract

We describe the computational characteristics of the Kirchhoff prestack seismic migration currently used in daily production runs at Petrobras and its port to novel architectures. Fully developed in house, this portable and fault-tolerant application has high sequential and parallel efficiency, with parallel scalability tested up to 8192 processors on the IBM Blue Gene without exhausting parallelism. The production load comprises thousands of jobs per year, consuming the installed park of a few thousand x86 CPU cores, with top production runs continuously using up to 1000 dedicated processors for 20 days. A built-in mechanism automatically produces, collects and stores job performance data, allowing single-job performance analysis and multi-job performance statistics. Ports to quad-core x86 and the Sony PlayStation 3 achieved very high price/performance and performance/watt gains over single-core x86 machines. The port to the PS3 is described in detail. Experimental performance data on a modest PS3 cluster is also presented.

1. Introduction

Finding hydrocarbons in the Earth's subsurface and transforming exploratory drills into production sites demand vast amounts of resources worldwide. A late 2006 version of the Petrobras investment plan for the 2007-2011 period reserves US$ 28 billion for exploration and production [1]. Since a single exploratory drill costs tens of millions of dollars and the success rate in finding reservoirs is low worldwide, the careful selection of drill locations is central to reducing uncertainty and optimizing investments.

The central tool for drill location selection is the seismic method: probe the subsurface with elastic waves and record the reflected signals (data acquisition), process the recorded signals to generate the desired information (seismic processing) and interpret the results (interpretation). Seismic processing aims at producing high quality subsurface information from data surveys, and is composed of the cascaded application of tens to hundreds of procedures. Seismic migration is the most time consuming of these procedures – by the end of 2005 Petrobras had in excess of five thousand x86 CPUs dedicated to seismic migration jobs.

This paper describes the Kirchhoff prestack seismic migration currently used in daily production runs at Petrobras. Our objectives are to present an industrial-strength HPC application (section 2), its development and daily production use (section 3), as well as its computational characteristics and performance (section 4). Furthermore, new trends in processor architecture, forced by power consumption and dissipation concerns, require testing the adequacy of new architectures to the application. We present test results on the IBM Blue Gene (section 5), multi-core x86 boards (section 6) and the Sony PS3 (section 7). Conclusions are drawn in section 8.

2. Seismic Migration

We summarize the pertinent characteristics of the seismic method and the Kirchhoff migration. For an in-depth presentation, see [2] and [3]. Figure 1 depicts data acquisition at sea. The source periodically generates waves that produce reflections at subsurface boundary layers, which are later collected by receivers. The set of signals originated by a single wave and collected by a single receiver is called a trace. A trace is composed of amplitude values, denominated samples, collected at discrete times.

[Figure 1 sketch: source, receivers, sea bottom and reflecting surface]
Figure 1: Seismic data acquisition

The set of all samples collected by a survey is represented by S(t,x,y,o), where t is the signal travel time (from source to subsurface and back to receiver), x and y are the surface coordinates of the source-to-receiver midpoint and o is the distance from source to receiver, known as offset. Since the ship moves and periodically triggers waves, data acquisition has redundancy: a single midpoint is associated with multiple traces, corresponding to multiple offsets, originated by waves generated at distinct source positions and received by distinct receivers. Data redundancy is desired, since it can be used to improve signal quality. A typical survey contains tens to hundreds of terabytes of data.

Seismic processing extracts subsurface information from data surveys. The oil industry worldwide uses commercially available software packages (e.g. [5,6,7,8]) containing up to tens of millions of source lines, comprising hundreds of seismic modules (e.g. signal-to-noise enhancement, reverberation suppression) and desired functionalities such as a project database history, a tool to cascade seismic modules, etc. A seismic processing professional selects the modules that are most appropriate for the survey and the target area. Processing a survey demands months of team work, but its execution time is fully dominated by the seismic migration module.

Seismic migration is the process of producing a subsurface image (i.e., positioning reflecting surfaces) that is consistent with the acquired data. It is an inverse problem [4], since it produces model parameters from observed data. As with many inverse problems, a consistent solution may require multiple executions, since output-dependent information is required on input, such as the propagation speed at each layer (with a reasonable estimate of the reflecting surface position). The migration algorithm may use a representative simplified velocity field or a more complex field that accounts for wavefront deformation details. In the

former case, the process is called time migration and produces an image T(τ,x,y,o), where τ represents vertical travel time, while in the latter case the process is called depth migration and produces an image T(z,x,y,o), where z represents depth.

Kirchhoff migration uses the Huygens-Fresnel principle to collapse all possible contributions to an image point. Wherever the sum of contributions causes constructive interference the image is heavily marked, remaining blank or loosely marked under destructive interference. A contribution to an image point T(z,x,y,o) is generated by an input sample S(t,x',y',o) whenever the measured signal travel time t matches the computed travel time from the source to the (z,x,y,o) subsurface point and back to the receiver. The set of all possible input traces (x',y',o) that may contribute to an output trace (x,y,o) lies within an ellipse with axes ax and ay (the apertures) centered at (x,y,o). Input traces are filtered to attenuate spatial aliasing. Sample amplitudes are corrected to account for amplitude spreading during travel. Kirchhoff depth migration is the computation, for each output sample (z,x,y,o), of

\[
T(z,x,y,o) \;=\; \sum_{(x',y',o)} \beta \left[\, \alpha\, S^{f}(t,x',y',o) \;+\; (1-\alpha)\, S^{f}(t+\Delta t,\,x',y',o) \,\right]
\]

where the sum of contributions is taken over all input traces (x',y',o) within the aperture ellipse, the superscript f denotes the selected filter, β denotes the amplitude correction, t is the travel time corresponding to z, and α is the interpolation weight required to map the computed floating-point travel time to discrete samples.

Figure 2 contains the Kirchhoff algorithm. The loop names in parentheses establish nomenclature for later use.

For all offsets
    Clear output volume
    For all input traces (input trace loop)
        Read input trace
        Filter input trace
        For all output traces within aperture (migration loop)
            For all output trace contributed samples (contribution loop)
                Compute travel time
                Compute amplitude correction
                Select input sample and filter
                Accumulate input sample contribution into output sample
            End For
        End For
    End For
    Dump output volume
End For

Figure 2: Kirchhoff algorithm

Depth migration computes travel times by adding pre-computed source-to-target and target-to-receiver travel times (requiring retrieval of large pre-computed matrices), while time migration computes travel times by a geometric approximation. Consequently, depth migration requires substantially more memory than time migration. In both cases, computing travel times requires knowledge of the propagation speeds.
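To make the inner-loop arithmetic concrete, the following is a minimal C sketch of one contribution-loop iteration for a single output sample, assuming the travel time t has already been computed; the names, data layout and range guard are illustrative and not the production Fortran code.

```c
/* Hedged sketch of the contribution-loop body for one output sample,
 * following the equation above. All names are illustrative placeholders. */
void contribute(float *out_trace,          /* output trace T(.,x,y,o)        */
                const float *in_trace_f,   /* filtered input trace S^f       */
                int n_in_samples,
                float t,                   /* computed travel time (seconds) */
                float dt,                  /* sampling interval (seconds)    */
                float beta,                /* amplitude correction           */
                int iz)                    /* output sample index (depth z)  */
{
    float ft    = t / dt;                      /* fractional sample position */
    int   it    = (int)ft;                     /* discrete sample index      */
    float alpha = 1.0f - (ft - (float)it);     /* interpolation weight       */

    if (it >= 0 && it + 1 < n_in_samples)
        out_trace[iz] += beta * (alpha * in_trace_f[it]
                               + (1.0f - alpha) * in_trace_f[it + 1]);
}
```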


3. Development History

Petrobras has a long history of internal development of seismic modules, as part of a strategic policy to increase competitiveness. Seismic migration has been a target since the 1970s. At that time, Petrobras machinery was based on mainframes with vector accelerators. International processing standards included reducing problem rank to fit the available processing power – survey data was stacked by collapsing samples from distinct offsets, achieving roughly hundred-fold reductions in data size and processing demands. Processing occurred on 2D slices of poststacked data. A development sample of the last days of that period is available at [9].

In the mid-1990s a version of the internally developed Kirchhoff time migration was deployed on newly acquired RISC-based machinery to process 2D slices of poststacked 3D surveys. Shared memory OpenMP parallelism and innovative algorithm developments allowed splitting the 3D Kirchhoff operator into cascaded 2D operators, providing 3D-similar output at 2D cost. Top production runs at that time typically consumed 32 proprietary RISC processors for 45 days.

By 1998 the use of Beowulf-based machinery [10] for seismic processing was a clear trend in the oil industry. In 1999 an internal research project built a 72-processor cluster (500 MHz Intel Pentium III CPUs) with open source system software and a second-generation Kirchhoff time migration algorithm with distributed memory MPI parallelism. In mid-2000 the project reached its term, demonstrating the superior cost/performance of the Beowulf cluster and indicating its production adequacy. The reliability of Beowulf clusters was tested from late 2000 to mid-2002 by splitting the production load with the stable, proprietary RISC-based machinery. Increasing confidence over the years built the base for the acquisition of larger Beowulf clusters.

Figure 3 shows the evolution of Beowulf CPU count over the years at the Petrobras main production facility, counting only seismic migration dedicated machinery. By the end of 2006, the CPU count reached 5166 x86 CPUs (a wide variety of Intel Xeon and AMD Opteron) arranged in five clusters with a variable number of single-core dual-processor boards, with clock frequencies spanning from 1.8 to 3.06 GHz, combined processing power surpassing 26 TFlops and combined memory of about 8 TBytes, overseeing a disk farm of 250 TBytes. Two of these clusters (not the largest ones) occupied positions 275 and 418 of the Top 500 list of Nov 2006 [11].

[Figure 3 plot: CPUs (core) and Power (KVA) per year, 2000-2006]
Figure 3: Beowulf CPU count and power consumption evolution at Petrobras

Processing power build-up allowed algorithm enhancement but required parallel scalability. A full 3D poststack time migration algorithm was developed and later replaced by a prestack version. Parallelism was scaled to use up to 1000 CPUs efficiently. A 3D prestack depth migration was developed and also scaled up to 1000 processors. Production spans a few thousand jobs per year, with computational requirements that vary with algorithm parameter selection, input survey size and output area size. Table 1 contains production samples: the average computational requirements of the 10, 100 and 1000 most demanding production jobs over one year. The data shows that high processor counts and long execution times are common practice in production.

Table 1: Average CPU count and execution time of the 10, 100 and 1000 most demanding production jobs over one year

Job count   Average CPU count   Average execution time (days)
10          787                 22
100         656                 10
1000        185                 2

4. Computational Characteristics

4.1. Code and instrumentation characteristics

The migration source code comprises about 32K lines of Fortran 95 and about 1K lines of C (to speed up IO). The code is Fortran 95 standard-conforming except for flush calls and C interoperability. There are two executables – one sequential and one parallel – that typically run on distinct machines. The sequential part feeds the parallel part with input data and stores results, while the parallel part executes the migration algorithm. The two parts communicate by file exchange. The communication protocol is a simple file renaming upon write completion. The protocol eventually fails on NFS file systems serving multiple machines, due to NFS weak cache consistency [16]: the reading machine may see the file renamed before the write buffers were fully flushed, even when explicit flush operations were issued, causing a file read failure. A fault-tolerant IO module circumvented this problem.

Built-in execution time instrumentation generates performance data for every run. The generated data is automatically stored in a database. Performance data is central to drive performance developments, to correct the user's choice of performance-sensitive execution parameters (e.g. CPU count) and to detect machinery performance failures. The database eases data retrieval and the computation of statistics over multiple jobs.

OProfile and PAPI [12, 13] performance tools drove sequential optimization efforts on early code versions. Optimization work was concentrated on the two-level nesting of migration and contribution loops (see Figure 2), which fully dominates the execution time. Since each instance of this doubly nested loop has plenty of independent floating-point operations, optimization concentrated on aggressive vectorization and L2 cache reuse.

We briefly describe the L2 cache optimization. Each instance of the inner loop (contribution loop) requires about 160 KB of memory, while one instance of the outer loop (migration loop) requires tens of MB. Consequently, the inner loop data fits most L2 cache sizes easily, while the outer loop data does not. Inner loop memory references are the filtered input trace (load), the output trace (load, modify, store), the velocity (load) and a scratch area. Each outer loop iteration brings another output trace and velocity to the inner loop. Since the filtered input trace fully dominates the inner loop memory requirements, the obvious cache reuse policy is to keep the filtered input trace in cache and tolerate cache misses for the output trace and velocity. This policy was implemented with careful coding, resulting in an L2 cache hit ratio in excess of 99%. Summarizing, the application has high computational intensity, defined as the number of floating-point operations per out-of-cache reference.

Built-in instrumentation outputs our own application-specific metric of computing speed: the number of contributions accumulated per second, defined as the execution rate of iterations of the contribution loop, which is used throughout this paper. An approximation of the usual Flops rate requires multiplication by 74 (the average number of floating-point operations per contribution).
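As an illustration of the write-then-rename handshake described above, here is a minimal C sketch; the file names and error handling are illustrative, and the production code additionally wraps IO in a fault-tolerant module to cope with NFS weak cache consistency.

```c
#include <stdio.h>    /* snprintf, rename */
#include <unistd.h>   /* write, fsync, close */
#include <fcntl.h>    /* open */

/* Write the payload to a temporary file, force it to stable storage, then
 * rename it: the reader only looks for final_name, so the rename signals
 * that the file is complete. */
int publish_file(const char *final_name, const void *buf, size_t len)
{
    char tmp_name[4096];
    snprintf(tmp_name, sizeof tmp_name, "%s.tmp", final_name);

    int fd = open(tmp_name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    fsync(fd);                 /* flush buffers before renaming */
    close(fd);

    return rename(tmp_name, final_name);
}
```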

4.2. Parallelism and fault tolerance

Parallel domain decomposition occurs over the output space, since the computation of each output trace is independent of any other output trace. The output (x,y) surface is partitioned into non-overlapping rectangular blocks of variable size. The block is the parallel unit of work (parallel grain), with blocks distributed to processes, parallelizing the migration loop. Block size selection does not modify the total work of the migration loops (over all blocks), but affects the total number of input traces read and filter operations, since a single input trace may contribute to multiple blocks, requiring independent reading and filtering for each block. For a fixed output area, larger block sizes limit parallelism but reduce total execution time by decreasing the re-reading and re-filtering of input traces. Block size also impacts memory requirements: the average production block requires about 100 MB of memory.

A single master process schedules blocks to slave processes on demand. Dynamic load balancing accommodates uneven processing speeds and block sizes. Blocks are ordered and scheduled by decreasing estimated block execution time. The application has binary reproducibility – the output file content is (binary) independent of the processor count.

Fault tolerance is central to long-running jobs on large Beowulf clusters. It is implemented with MPI intercommunicators, as suggested by [14]: each slave process communicates with the master process through a dedicated intercommunicator, allowing the master process to survive when a slave crashes. In such a case, the master process reschedules the failed block to the remaining slaves. Processes are statically assigned to processors by a one-to-one mapping.
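A minimal sketch of the master's on-demand block scheduling loop is shown below. It assumes one integer block id per message over MPI_COMM_WORLD; the tags, payload and termination scheme are illustrative, whereas the production code exchanges block descriptors over per-slave intercommunicators and adds the fault-handling path.

```c
#include <mpi.h>

#define TAG_REQUEST 1
#define TAG_BLOCK   2
#define TAG_DONE    3

void master(int n_blocks, int n_slaves)
{
    int next = 0, active = n_slaves;
    MPI_Status st;

    while (active > 0) {
        int dummy;
        /* wait for any slave to ask for work */
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &st);
        if (next < n_blocks) {
            /* blocks are pre-sorted by decreasing estimated execution time */
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_BLOCK,
                     MPI_COMM_WORLD);
            next++;
        } else {
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD);
            active--;
        }
    }
}
```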

Figure 4: Wall clock distribution across processors

Figure 4 contains the total execution time and the migration loop execution time (both wall clock) across the 500 processors of a typical production run. Of the total execution time of about 1,300,000 seconds, about 1,150,000 seconds (88.4%) are used by the migration loop. The remaining execution time is used to filter input traces (about 10.2%) and for IO. The plot jitter shows negligible load imbalance, and also reveals a faulty processor (numbered 453) decommissioned at about 900,000 seconds of run time.

5. Blue Gene Experiments

Power consumption and energy dissipation reductions have been the driving forces of recent processor and machine architecture proposals. Power requirements at Petrobras grew faster than processor count over the years (see Figure 3). Massive parallelism is another clear computer architecture trend. In this scenario, testing new architectural trends became of strategic importance.

In late 2005, Petrobras and IBM started a cooperative research project to test the adequacy of the IBM Blue Gene/L to the production Kirchhoff time migration application. A few available time slots on the four-rack, 8192-processor IBM Blue Gene/L at IBM Rochester were granted to the project. The migration code was cross-compiled at Petrobras and the object files were loaded at Rochester, proving code portability – only the non-standard flush operations were modified. Executions were faultless. The low block memory footprint contributed to the project success, avoiding costly algorithm and code modifications.

Timing experiments were performed at Rochester and repeated on a Beowulf blade cluster at Petrobras with 2 GHz Opteron 246 single-core, dual-processor boards. Table 2 summarizes the observed results. Speed per core was measured and is expressed in mega contributions per second per core. Blade power was measured and Blue Gene power was estimated (20 kW per rack). Speed per power is derived from the previous Table 2 data and is measured in mega contributions per second per watt. List prices available at the time of the experiment are reported. Speed per price is also derived and is measured in kilo contributions per second per US dollar. The experiment is somewhat unfair to the Blue Gene, since it executed vector-friendly, single precision, Beowulf-optimized code. Even so, this pioneering work shows the Blue Gene's strength in speed/watt and parallel scalability: the measured compound processing speed on a Blue Gene run with 8192 processors was 37.7 G contributions per second, against 29.3 G contributions per second on a Beowulf run with 1000 processors.

Table 2: Blue Gene and blade performance

System              Cores/board   Power/board (W)   Price/board (US$)   Speed/core   Speed/Power   Speed/Price
Single core blade   2             153.67            4000                29.30        0.38          14.65
Blue Gene/L         2             19.53             2000                4.61         0.47          4.61
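As an illustrative check of how the derived columns follow from the measured ones (using the single-core blade row):

\[
\frac{\text{Speed}}{\text{Power}} = \frac{29.30 \times 2}{153.67} \approx 0.38\ \frac{\text{Mcontrib/s}}{\text{W}},
\qquad
\frac{\text{Speed}}{\text{Price}} = \frac{29.30 \times 2 \times 10^{3}}{4000} \approx 14.65\ \frac{\text{kcontrib/s}}{\text{US\$}}.
\]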

Code parallel scalability was demonstrated on the BG/L by the negligible variation of speed per processor with processor count: 4.59 M and 4.61 M contributions per second per processor on 4096- and 8192-processor runs, respectively.

6. Multicore Experiments

During 2006 and early 2007, a set of multi-core boards was available to probe the application adequacy to this new architectural trend, including a prototype desktop board with two 2.66 GHz quad-core Intel Xeon. Meanwhile, rare time slots on production machines were used to collect application performance data on a dual-processor server board with two single-core 3.06 GHz Intel Xeon (a top sample of the high-frequency, high-power-consumption architectural trend), on a blade dual-processor board with two 2 GHz dual-core Opteron 270 (a pioneering dual-core) and on an old dual-processor server board with two single-core 1.8 GHz Opteron 244. The same code was submitted to all boards, using processor-specific compilation switches. A fixed problem size demanded a few days of computing per board. Table 3 summarizes the results, using the Table 2 nomenclature and units, measuring speed and power but estimating costs by list prices.

Table 3: Multi core performance

System             Cores/board   Power/board (W)   Price/board (US$)   Speed/core   Speed/Power   Speed/Price
Slow Single Core   2             209.61            2500                30.24        0.29          24.19
Fast Single Core   2             238.16            2500                36.67        0.31          29.34
Dual core blade    4             198.94            4000                33.98        0.68          33.98
Quad core          8             415.13            6300                49.01        0.94          62.23

The increase of speed per core with the number of cores per chip shows that the computational characteristics of the application suit multi-cores quite well. The high cache hit ratio avoids competition for memory. The low cache footprint and process independence avoid competition for cache lines among cores. The high computational intensity and high vectorization ratio dispatch simultaneous floating-point vector operations, exploiting recent multi-core microarchitecture enhancements. Speed per power and speed per price also increase with cores per chip, leveraged by the speed per core increase. To quantify the gains, it suffices to normalize the three ratios by the slow single core values. Figure 5 shows speed per power and speed per price gains in excess of the speed per core gains.

[Figure 5 bar chart – gain ratios to the slow single core: Fast Single Core 1.21 (Speed/Core), 1.07 (Speed/Power), 1.21 (Speed/Price); Dual core blade 1.12, 2.37, 1.40; Quad core 1.62, 3.27, 2.57]

Figure 5: Gain ratios to slow single core

7. Cell Simulator and Sony PlayStation 3

The Cell Broadband Engine Architecture [17] is a promising heterogeneous architecture that achieves an interesting combination of outstanding floating-point processing speed and low power requirements by relinquishing memory consistency. It contains eight synergistic processor elements (SPEs) and a PowerPC processor element (PPE). Each SPE controls its own memory – a critical 256 KB – while the PPE has a regular memory size and hierarchy. Data is moved among memories by explicit DMA requests over a high-bandwidth internal bus. Data movement and coherency among memories are the programmer's responsibility. The Cell is the central processor of the Sony PlayStation 3 (PS3) gaming console.

The early availability of the Cell SDK software development kit, Cell simulators and Cell dual-processor prototype boards [23] allowed research on the applicability of the Cell to HPC [18, 19] and on programming models and languages [20, 21], among others. But the use of the PS3 gaming console for HPC was only recently probed [22]. The PS3 attracts by its low cost (US$ 600) and mass production, even though it is powered by a stripped-down Cell processor (only 6 SPEs are available for general processing).

During 2006 and early 2007 Petrobras conducted a research project to test the adequacy of the architecture to the migration application. The first research phase aimed at porting and optimizing the application on the simulator. As promising results built up and the PS3 became commercially available (Nov 2006), the research moved to a second phase: build a modest Beowulf cluster with four PS3s and test the application performance on experimental runs.

7.1. Application port and optimization

The parallelism strategy within a single Cell was to assign to each SPE the computation of all contributions of a single input trace to a single output trace, and to assign to the PPE all remaining computations needed to migrate one output block. Using the nomenclature established in Figure 2, each SPE executes one instance of the entire contribution loop (some migration loop iterations), the set of SPEs executes the migration loop for an output block, and the PPE executes the remaining commands of the input trace loop for an output block. Each SPE issues DMA requests for output traces, input velocities and updated output traces. The PPE and the SPEs synchronize at a barrier whenever an input trace has been fully processed. Double buffers in PPE memory allow pipelining of input trace reading and filtering, minimizing barrier waiting time. Multiple Cells use the unchanged master-slave parallelism, with the master process on a dedicated Cell assigning output blocks to slave Cells on demand.

SPE memory space and DMA timing are critical issues for the success of this strategy. Each SPE uses about 160 KB of memory to execute one instance of the contribution loop (the required x86 L2 cache size, as in section 4.1). The remaining SPE memory is used for double buffers of output traces and velocities, to allow pipelining of the hardware-serialized DMA requests. The high flop count allows enough time for DMA completion for up to eight SPEs, even at the high SPE floating-point processing speed.

Software porting and optimization required about three man-months. The Cell had to be programmed in C (the only available compiler), so production code was converted back to Fortran 77 to feed the f2c tool. The produced C code executed flawlessly, but at low speed. SPE programming for high speed requires the use of a vector data type and functions as abstractions of the SPE vector registers and instructions. Hot spot routines were rewritten in C using this data type. The resulting code reached higher speeds and binary reproducibility

with a variable number of SPEs, but speeds were still lower than expected. Optimization work then focused on eliminating stalls by introducing vector unrolling of depth up to eight. The development of a simple tool for the automatic generation of unrolled code from annotated statements accelerated the optimization process. The final code executed an order of magnitude faster than the previous code version, still maintaining binary reproducibility. The Cell simulator was central to the optimization success, by presenting numerous hardware performance counters at clock-cycle level.
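A minimal sketch of the idea, assuming the Cell SDK SPU intrinsics (vector float, spu_splats, spu_madd); the kernel, names and unroll depth of four are illustrative, not the production code generated by the unrolling tool.

```c
#include <spu_intrinsics.h>

/* Accumulate alpha * in[] into out[], four floats per vector register,
 * unrolled by four vectors per iteration so that independent multiply-adds
 * can overlap and hide pipeline latencies (reducing stalls). */
void accumulate(vector float *out, const vector float *in,
                float alpha, int nvec)   /* nvec assumed a multiple of 4 */
{
    int i;
    vector float a = spu_splats(alpha);
    for (i = 0; i < nvec; i += 4) {
        out[i]     = spu_madd(a, in[i],     out[i]);
        out[i + 1] = spu_madd(a, in[i + 1], out[i + 1]);
        out[i + 2] = spu_madd(a, in[i + 2], out[i + 2]);
        out[i + 3] = spu_madd(a, in[i + 3], out[i + 3]);
    }
}
```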

7.2. Performance on the PS3 Beowulf cluster

The Beowulf cluster of PS3s uses open source Linux and MPICH2. The migration code was moved to the cluster and executed flawlessly, validating the Cell SDK as a software development environment. The first PS3 experiment investigated parallel scalability within a single Cell, varying the number of SPEs involved in the computation. The input data had a single block, extracted from the multi-core experiment data, demanding about four hours of execution in the single SPE case. Figure 6 contains the speed-up and parallel efficiency computed from wall clock times of the master process, which runs on a second, dedicated PS3.

[Figure 6 plot – speed-up and efficiency versus SPE count: 1 SPE 1.00 / 1.00; 2 SPEs 1.98 / 0.99; 3 SPEs 2.93 / 0.98; 4 SPEs 3.87 / 0.97; 5 SPEs 4.76 / 0.95; 6 SPEs 5.62 / 0.94]

Figure 6: Single PS3 parallel performance
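For reference, the speed-up and efficiency plotted in Figure 6 follow the usual definitions from wall clock times (an illustrative restatement, not an additional result): for n SPEs,

\[
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n},
\]

so that, for example, the six-SPE point gives E(6) = 5.62/6 ≈ 0.94.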

Execution times scale quite well with SPE count, showing the adequacy of the algorithm to the Cell.

The second PS3 experiment repeats the multi-core experiment on a single slave PS3, driven by a dedicated master PS3. Table 4 summarizes the results, using the same nomenclature and units as before and considering one Cell SPE as a core. The quad-core results of Table 3 are repeated to ease comparison. Table 4 shows that the PS3 excels in all metrics. A full PS3 is about 25% faster than a full quad-core. The PS3 is the most power efficient of all boards, and has at least an order of magnitude better price/performance than any other tested board.

Table 4: Single PS3 performance

System      Cores/board   Power/board (W)   Cost/board (US$)   Speed/core   Speed/Power   Speed/Price
Quad core   8             415.13            6300               49.01        0.94          62.23
PS3         6             380.73            600                81.38        1.28          813.80

The third PS3 experiment measures parallel scalability on the PS3 cluster, using the multi-core experiment input data and a dedicated master PS3.

Table 5: PS3 cluster parallel performance

Slave PS3s   Speed-up   Efficiency
2            1.99       0.99
3            2.92       0.97

The measurements reported in Table 5 show adequate parallel scalability of the time migration on the PS3 cluster, although limited by the modest cluster size.

8. Conclusions

This paper documents the development history and daily production use of the Kirchhoff time and depth seismic migration application, as well as the evolution of the dedicated production machinery at Petrobras. It describes the developments that were critical to the success of production runs using up to 1000 processors for 20 days: algorithm organization, sequential efficiency, parallel scalability and fault tolerance. It presents performance data on a variety of x86 Beowulf dual-processor boards, from old single-core to recent quad-core boards, as well as Blue Gene/L performance data. The successful port and optimization for a modest cluster of Sony PlayStation 3 consoles is described in detail.

The experimental data allows sensing the adequacy of processor technology trends to the application. The probed trends are multi-core and alternative architectures. Our observed data shows that the multi-core trend, at the current core per chip count, is a viable substitute for the higher-frequency single-core trend in any tested performance metric. It also shows that the PS3 has the best performance in any metric, including an order of magnitude price/performance gain over any mass production system tested. But the spectrum of HPC applications that might benefit from the Cell architecture

is further limited by the lack of the usual software development tools for the Cell. As in any HPC enterprise, results are heavily dependent on application characteristics and cannot be generalized.

Acknowledgments

The authors thank Petrobras for the long term research opportunity and for authorizing the public dissemination of research results. The authors gratefully acknowledge the continuous support of AMD, Intel and IBM, including the early availability of prototype boards for performance measurements. Blue Gene experiments would not have been possible without the work of Fabio Gandour and Marcelo L. Braunstein of IBM Brazil as well as José E. Moreira of IBM USA, among others. The work on the Cell simulator was heavily encouraged by Fabio Gandour and fed by information provided by Braunstein's team at IBM Brazil.

Bibliography

[1] C. Ellsworth and S. Vikas, "Relationships changing as NOD-IOC roles evolve", Oil and Gas Journal, Volume 105, number 21, Apr 2007.
[2] O. Yilmaz, Seismic Data Processing, Society of Exploration Geophysicists, Tulsa, OK, 1988.
[3] J. F. Claerbout, Imaging the Earth's Interior, Blackwell Scientific Publications, USA, 1985.
[4] C. W. Groetsch, Inverse Problems in the Mathematical Sciences, Vieweg, Braunschweig, 1995.
[5] ProMAX seismic processing family, available at http://www.halliburton.com/ps/Default.aspx?navid=221&pageid=862&prodid=MSE%3a%3a1055450737429153.
[6] Omega seismic processing system, available at http://www.westerngeco.com/content/services/dp/omega/index.asp?
[7] SeisUP seismic processing system, available at http://www.geocenter.com/seisup/seisup.html.
[8] Geocluster seismic processing system, available at http://www.cggveritas.com/default.aspx?cid=13.
[9] S. Pinheiro, J. Panetta and C. L. Amorim, "Investigação de Paralelismo na Migração Omega-x", SBAC-PAD 1994, SBC, 1994.
[10] T. Sterling, D. Savarese, D. J. Becker, J. E. Dorband, U. A. Ranawake and C. V. Parker, "BEOWULF: A Parallel Workstation for Scientific Computing", Proceedings of the 24th International Conference on Parallel Processing, IEEE Computer Society, 1995.
[11] November 2006 Top500 list, available at http://www.top500.org.
[12] OProfile home page, http://oprofile.sourceforge.net.
[13] S. Browne, J. J. Dongarra, N. Garner, G. Ho and P. Mucci, "A Portable Programming Interface for Performance Evaluation on Modern Processors", The International Journal of High Performance Computing Applications, Volume 14, number 3, pp. 189-204, Mar 2000.
[14] W. Gropp and E. Lusk, "Fault Tolerance in Message Passing Interface Programs", The International Journal of High Performance Computing Applications, Volume 18, number 8, pp. 363-372, Aug 2004.
[15] G. L. Chui, M. Gupta and A. K. Royyuru, Guest Editors, "Blue Gene", IBM Journal of Research and Development, Volume 49, number 2/3, pp. 189-500, May 2005.
[16] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel and D. Hitz, "NFS Version 3 Design and Implementation", Proceedings of the USENIX Summer 1994 Technical Conference, Jun 1994.
[17] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer and D. Shippy, "Introduction to the Cell Multiprocessor", IBM Journal of Research and Development, Volume 49, number 4, pp. 589-604, Jul 2005.
[18] S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil and K. Yelick, "The Potential of the Cell Processor for Scientific Computing", Proceedings of the ACM International Conference on Computing Frontiers, ACM, May 2006.
[19] J. Kurzak and J. Dongarra, "Implementation of the Mixed-Precision High Performance LINPACK on the Cell Processor", Technical Report UT-CS-06-580, Department of Computer Science, University of Tennessee, Sep 2006.
[20] K. Fatahalian, T. J. Knight, M. Houston, M. Erez, D. R. Horn, L. Leem, J. Y. Park, M. Ren, A. Aiken, W. J. Dally and P. Hanrahan, "Sequoia: Programming the Memory Hierarchy", Proceedings of Supercomputing 2006, IEEE Computer Society, 2006.
[21] P. Bellens, J. M. Perez, R. M. Badia and J. Labarta, "CellSs: a Programming Model for the Cell BE Architecture", Proceedings of Supercomputing 2006, IEEE Computer Society, 2006.
[22] A. Buttari, J. Kurzak and J. Dongarra, "Limitations of the PlayStation 3 for High Performance Cluster Computing", Technical Report UT-CS-07-597, Innovative Computing Laboratory, University of Tennessee, Apr 2007.
[23] The Cell Project at IBM Research, available at http://www.research.ibm.com/cell/.
