Available online at www.sciencedirect.com

ScienceDirect Procedia Engineering 154 (2016) 199 – 206

12th International Conference on Hydroinformatics, HIC 2016

On the Efficiency of Executing Hydro-environmental Models on Cloud

Fearghal O'Donncha a,*, Emanuele Ragnoli a, Srikumar Venugopal a, Scott C. James b, Kostas Katrinis a

a IBM Research – Ireland, Damastown Industrial Park, Dublin 15, Ireland
b Baylor University, Department of Geosciences and Mechanical Engineering, One Bear Place #97354, Waco, TX

Abstract

Optimizing high-performance computing applications requires an understanding of the application and its parallelization approach, the system software stack, and the target architecture. Traditionally, performance tuning of parallel applications involves consideration of the underlying machine architecture, including floating-point performance, memory hierarchies and bandwidth, interconnect architecture, and data placement, among others. The shift to the utility computing model through the cloud has created tempting economies of scale across IT domains, and HPC is no exception as a candidate beneficiary. Nevertheless, the infrastructure abstraction and multi-tenancy inherent to cloud offerings pose great challenges to HPC workloads, requiring a dedicated study of the applicability of cloud computing as a viable platform in terms of time-to-solution and efficiency. In this paper, we present the evaluation of a widely used hydro-environmental code, EFDC, on a cloud platform. Specifically, we evaluate the target parallel application on Linux containers managed by Docker. Unlike the virtualization-based solutions that have been widely used for HPC cloud explorations, containers are more fit for purpose, offering, among other benefits, native execution and lightweight resource consumption. Many-core capability is provided by the OpenMP library in a hybrid configuration with MPI for cross-node data movement, and we explore the combination of these in the target setup. For the MPI part, the workflow is implemented as a data-parallel execution model, with all processing elements performing the same computation on different subdomains, and with thread-level, fine-grain parallelism provided by OpenMP. Optimizing performance requires consideration of the overheads introduced by the OpenMP paradigm, such as thread initialization and synchronization. Several features of the application make it an ideal test case for deployment on modern cloud architectures: 1) it is legacy code written in Fortran 77; 2) it has an implicit solver requiring non-local communication that poses a challenge to traditional partitioning methods, communication optimization, and scaling; and 3) it is in widespread use across academia, research organizations, governmental agencies, and consulting firms. These technical and practical considerations make this study a representative assessment of migrating legacy codes from traditional HPC systems to the cloud. We finally discuss challenges that stem from the containerized nature of the platform; the latter forms another novel contribution of this paper.

* Corresponding author. Tel.: +353 01 826924. E-mail address: [email protected]

1877-7058 © 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the organizing committee of HIC 2016.
doi:10.1016/j.proeng.2016.07.447


Keywords: Cloud; HPC; Numerical modelling; Containers; Hydrodynamic model

1. Introduction

Resolving three-dimensional flows, transport, and biogeochemical processes in surface-water systems is a computationally intensive challenge. In particular, the drive to model more realistic and detailed scenarios increases the computational demands of numerical solutions, due primarily to finer grid resolution and the simulation of a greater number of passive and active tracers. Parallel computing allows faster execution and the ability to perform larger, more detailed simulations than is possible with serial code. Traditionally, this high-performance computing is done on cluster systems featuring a hierarchical hardware design: shared-memory nodes with several multi-core CPUs are connected through a network infrastructure. Parallel programming must therefore combine distributed-memory parallelization across the node interconnect with shared-memory parallelization inside each node.

The benefits of cloud computing (efficiency, flexibility, high utilization) are well known, but they have been difficult to achieve in technical and high-performance computing. The extraordinary performance demands that engineering, scientific, analytics, and research workloads place upon IT infrastructure have historically meant that these workloads were not suitable for cloud deployments, especially those that use virtualization technology. The practical benefits of cloud deployments centre on availability, scalability, the greater portability offered by virtualization, and a greater array of software choices. Traditional HPC users, however, have different requirements, including [1]:

• Close to the "metal" – many man-years have been invested in optimising HPC libraries and applications to work closely with the hardware, thus requiring specific OS drivers and hardware support.
• User-space communication – HPC applications often need to bypass the OS kernel and communicate directly with remote user processes.
• Tuned hardware – HPC hardware is often selected on the basis of communication, memory, and processor speed for a given application.

This paper investigates the performance of a widely used hydro-environmental code, the Environmental Fluid Dynamics Code (EFDC), in a cloud environment. Many-core capability is provided by the MPI and OpenMP libraries in a hybrid configuration, and we explore the combination of these in the target setup. For the MPI part, the workflow is implemented as a data-parallel execution model, with all processing elements performing the same computation but on different subdomains. The domain-decomposition strategy introduces additional considerations around load balancing of the application. Periodic synchronisation of each subdomain at the end of each timestep means that the time to solution is constrained by that of the slowest subdomain, and hence requires intelligent distribution of the subdomains to ensure an approximately equal load on each processor. Fine-grain parallelism using OpenMP requires a much more invasive programming approach: computationally intensive sections of the code are parallelized using compiler directives inserted at appropriate positions to distribute the workload across different threads. While explicit load balancing is not a factor for fine-grain parallelism, efficient performance requires consideration of the overheads introduced by OpenMP relative to the computational cost, to ensure that these overheads do not become punitive.
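As a rough illustration of the hybrid pattern just described, the following C sketch shows coarse-grain MPI ranks each owning a subdomain, with OpenMP directives distributing the loop over that subdomain's cells across threads. This is not taken from the EFDC source (which is Fortran 77); the array names, loop bounds, and update are hypothetical placeholders.

```c
/* Hypothetical sketch of the hybrid MPI+OpenMP pattern: coarse-grain
 * domain decomposition across MPI ranks, fine-grain loop-level
 * parallelism across OpenMP threads within each rank. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define NX 200   /* local subdomain size (illustrative only) */
#define NY 200

int main(int argc, char **argv)
{
    int provided;
    /* Request FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    double *eta = calloc((size_t)NX * NY, sizeof(double));
    double *rhs = calloc((size_t)NX * NY, sizeof(double));

    for (int step = 0; step < 100; ++step) {
        /* Fine-grain parallelism: threads share the rank's subdomain.
         * Thread wake-up and the implicit barrier at the end of the
         * loop are the OpenMP overheads discussed in the text. */
        #pragma omp parallel for collapse(2) schedule(static)
        for (int i = 0; i < NX; ++i)
            for (int j = 0; j < NY; ++j)
                eta[i * NY + j] += 0.1 * rhs[i * NY + j];

        /* Halo exchange and global reductions (MPI) would follow here,
         * outside the threaded region; see the sketches in Section 2. */
    }

    free(eta);
    free(rhs);
    MPI_Finalize();
    return 0;
}
```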
2. Methodology

EFDC is a public-domain, open-source modelling package for simulating three-dimensional flow, transport, and biogeochemical processes in surface-water systems. The model is specifically designed to simulate estuaries and subestuarine components (tributaries, marshes, and wet and dry littoral margins), and has been applied to a wide range of environmental studies, including surface-current processes [2], [3], suspended sediment transport [4], [5], water-quality investigations [6], [7], and canopy flow processes [8], [9]. It is currently used by universities, research organizations, governmental agencies, and consulting firms.


The EFDC hydrodynamic model is based on the continuity and Reynolds-averaged Navier–Stokes equations. Adopting the Boussinesq approximation for a variable-density fluid, which states that density differences are small enough to be neglected except where they appear in terms multiplied by gravity, g, the governing equations can be expressed as:

\frac{\partial \eta}{\partial t} + \frac{\partial (Hu)}{\partial x} + \frac{\partial (Hv)}{\partial y} + \frac{\partial w}{\partial z} = 0, \qquad (1)

\frac{\partial (Hu)}{\partial t} + \frac{\partial (Huu)}{\partial x} + \frac{\partial (Hvu)}{\partial y} + \frac{\partial (wu)}{\partial z} - fHv = -H \frac{\partial (g\eta + p)}{\partial x} - \left( \frac{\partial h}{\partial x} - z \frac{\partial H}{\partial x} \right) \frac{\partial p}{\partial z} + \frac{\partial}{\partial z}\!\left( \frac{A_v}{H} \frac{\partial u}{\partial z} \right), \qquad (2)

\frac{\partial (Hv)}{\partial t} + \frac{\partial (Huv)}{\partial x} + \frac{\partial (Hvv)}{\partial y} + \frac{\partial (wv)}{\partial z} + fHu = -H \frac{\partial (g\eta + p)}{\partial y} - \left( \frac{\partial h}{\partial y} - z \frac{\partial H}{\partial y} \right) \frac{\partial p}{\partial z} + \frac{\partial}{\partial z}\!\left( \frac{A_v}{H} \frac{\partial v}{\partial z} \right), \qquad (3)

where t is time; η is the water elevation above or below datum; u and v are the velocity components in the curvilinear orthogonal coordinates x and y, with w representing the vertical component; H is the total water depth (= h + η, where h is the water depth below datum); f is the Coriolis parameter; p is the excess water-column hydrostatic pressure; and A_v is the vertical turbulent viscosity.

The equations governing the dynamics of coastal circulation simulate the propagation of fast-moving external gravity waves and slow-moving internal gravity waves. It is desirable in terms of computational economy to separate the vertically integrated equations (external mode) from the vertical structure equations (internal mode) [10]. The external mode, associated with barotropic, depth-independent, horizontal, long-wave motion, is solved using a semi-implicit three-time-level scheme; the external mode computes the surface elevation and the depth-averaged velocities, u and v. The internal mode, associated with the baroclinic, fully three-dimensional velocity components, is solved using a fractional-step scheme combining an implicit step for the vertical shear terms with an explicit discretization of all other terms; the depth-averaged velocities computed in the external mode serve as boundary conditions for the computation of the layer-integrated velocities. This approach solves the two-dimensional, depth-averaged momentum equations implicitly in time, allowing the model's barotropic time step to equal the baroclinic time step. The primary limitation of this semi-implicit method is the introduction of an elliptic solver (preconditioned conjugate gradient) to solve implicitly for the free-surface elevation. This has traditionally posed a problem for the efficient projection of model codes onto parallel computers because of the inherently non-local nature of the solver [11]; this issue is addressed in further detail in the next section.

2.1. Parallelization

Parallelization of the code using both the MPI and OpenMP paradigms is presented elsewhere [12], [13] and is briefly discussed here with emphasis on the challenges faced. EFDC is a Fortran 77 code originally designed for deployment on vector computers as opposed to distributed systems. The code was configured to achieve a degree of parallelization on shared-memory processors through directives specific to vectorised architectures. For performance comparable to vector systems, scalable, cache-based processors achieve speedup through massively parallel partitioning of the problem among many processors working concurrently.

The standard domain-decomposition implementation stores both fluid and land cells in a full matrix, which is usually decomposed into a number of equal-sized, regular-shaped subdomains for parallel processing. By adding ghost layers of halo points to each of the subdomains, the communication step can be accomplished by sending values to the ghost layers of the neighboring processors. Each subdomain then proceeds independently, with synchronization of the solution achieved through the ghost layers at the end of each timestep. The mix of land and water regions with such different computational costs complicates the task of balancing the workload to achieve maximum parallel performance. To minimize added complexity to the code while achieving improved throughput, we adopt a rectilinear partitioning that allows both a range of tile sizes and the ability to mark some tiles as inactive. Two types of inter-machine communication keep the computations consistent with the sequential code: (1) point-to-point communication to update halo values among neighboring machines and (2) global communication at each iteration of the preconditioned conjugate gradient solver required for computing surface elevations. Naturally, the global communication was of primary concern as it presents a bottleneck for each core, one which is exacerbated with an increasing number of cores.
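The two communication patterns just described can be sketched as follows. This is a simplified, hypothetical C illustration of (1) a halo exchange with horizontal neighbours and (2) the global reduction required by each conjugate-gradient iteration; it is not the EFDC implementation itself (which is Fortran), and all names and sizes are illustrative.

```c
/* Hypothetical sketch of the two MPI communication patterns described
 * above: (1) point-to-point halo exchange with the west/east neighbours
 * of a rank and (2) the global reduction used in each iteration of a
 * preconditioned conjugate gradient solver. */
#include <mpi.h>

/* (1) Exchange one halo column with the west/east neighbours.
 * 'west'/'east' are neighbour ranks, or MPI_PROC_NULL at a boundary. */
void exchange_halo(double *send_w, double *recv_w,
                   double *send_e, double *recv_e,
                   int ncells, int west, int east, MPI_Comm comm)
{
    MPI_Sendrecv(send_w, ncells, MPI_DOUBLE, west, 0,
                 recv_e, ncells, MPI_DOUBLE, east, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_e, ncells, MPI_DOUBLE, east, 1,
                 recv_w, ncells, MPI_DOUBLE, west, 1,
                 comm, MPI_STATUS_IGNORE);
}

/* (2) Global dot product: every CG iteration needs at least one of
 * these, so every rank must synchronize - the bottleneck noted above. */
double global_dot(const double *a, const double *b, int n, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; ++i)
        local += a[i] * b[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```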


We compare the hybrid MPI+OpenMP implementation against the pure MPI approach to determine how the different paradigms perform on a core-per-core basis. The hybrid approach combines domain decomposition at the MPI level with loop-level parallelism implemented by OpenMP. The selection of the optimum number of MPI processes and OpenMP threads is one consideration of the study; leveraging thread-level parallelism to reduce the communication demands placed by the MPI implementation is another focus.

In this study, the scalability of the solution is a central concern. Amdahl's law provides insight into the expected speedup of a parallel program, which is a function of the algorithm, but scalability is also affected by machine characteristics. In HPC, speedup is contingent upon keeping resources busy and reducing the idle time during which processors wait for data from neighbors. Some programs have few dependencies on the neighboring solution, making them highly scalable, while many grid-based problems have tight dependencies that make scalability more difficult. For these problems, the speed of the connection between cores (the network interconnect) has a major impact on performance. Most traditional clouds use either Gigabit or 10-Gigabit Ethernet connections. HPC instances in these clouds will certainly work for embarrassingly parallel programs, but may struggle with those that require a better interconnect; that is, scalability and hence performance will suffer [14].

3. Results

Performance tests were conducted in our datacenter, specifically on a set of IBM iDataPlex M3 compute servers. Each node consists of two 2.93 GHz six-core Intel Xeon (Westmere) processors, giving twelve cores in total, forming a single NUMA (Non-Uniform Memory Architecture) domain with 128 GB of DDR3 RAM and a 1 GigE network interconnect. Experiments were conducted on an idealized test-case scenario to reduce application-specific performance issues that may not translate to other case studies (in particular, processor load imbalances that arise in case studies with irregular land/water mixes make it difficult to balance processors across different MPI process configurations). The test case consisted of a 400 × 400 grid cell rectangular configuration with 20 sigma layers in the vertical. Domain decomposition distributed equal-sized partitions to each processor as part of the MPI parallelization.
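As a concrete, hypothetical illustration of such a partitioning, the sketch below uses MPI's Cartesian-topology helpers to split a 400 × 400 horizontal grid into near-equal rectangular tiles, one per rank. The real EFDC rectilinear partitioning additionally supports variable tile sizes and inactive (land-only) tiles, which is not shown here.

```c
/* Hypothetical sketch: split a 400 x 400 horizontal grid into
 * near-equal rectangular tiles, one per MPI rank, using a
 * Cartesian process topology. */
#include <mpi.h>
#include <stdio.h>

#define NX_GLOBAL 400
#define NY_GLOBAL 400

int main(int argc, char **argv)
{
    int nprocs, rank, dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Choose a near-square process grid, e.g. 60 ranks -> 10 x 6. */
    MPI_Dims_create(nprocs, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);

    /* Block distribution: the first 'rem' ranks along each dimension
     * receive one extra row/column when the grid divides unevenly. */
    int base_x = NX_GLOBAL / dims[0], rem_x = NX_GLOBAL % dims[0];
    int base_y = NY_GLOBAL / dims[1], rem_y = NY_GLOBAL % dims[1];
    int nx_local = base_x + (coords[0] < rem_x ? 1 : 0);
    int ny_local = base_y + (coords[1] < rem_y ? 1 : 0);
    int i0 = coords[0] * base_x + (coords[0] < rem_x ? coords[0] : rem_x);
    int j0 = coords[1] * base_y + (coords[1] < rem_y ? coords[1] : rem_y);

    printf("rank %d owns rows [%d,%d) and columns [%d,%d)\n",
           rank, i0, i0 + nx_local, j0, j0 + ny_local);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```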
Primarily, benchmarking of model performance focused on strong scaling (i.e., the overall spatial domain size remained fixed regardless of how many processors were used) because it is the most relevant measure when turnaround time is the main consideration [15]. In practical hydro-environmental studies, the size of the problem is set by physical considerations, and the desire is to solve that problem as quickly as possible using as many processors as are available. Weak scaling is also of interest because it relates closely to the scalability of the underlying algorithm [16].

The first stage of the study targeted the evaluation of model performance on a bare-metal setup within our datacenter. This provided a baseline against which to understand and assess performance in a cloud environment. Simulations were conducted on up to 60 cores to provide an assessment of typical deployments of model configurations of this size. To quantitatively compare the different parallel configurations, both the speedup ratio, S, and the parallel efficiency, E, of the application were computed:

S = \frac{T_1}{T_P}, \qquad E = \frac{S}{N_P} \qquad (4)


where T_1 is the simulation time with the serial code, T_P is the simulation time on P processors, and N_P is the number of processors. Figure 1 shows the scaling for both the pure MPI and the hybrid MPI+OpenMP implementations. The results demonstrate that, for this deployment, the strong-scaling parallel efficiency, E, drops below 50% at 18 cores. Deployment on 60 cores provides a speedup factor of 16 at a parallel efficiency of 26%. The drop in parallel efficiency is an expected result of strong-scaling deployment: as the number of processors increases relative to the problem size, the ratio of communication to computation also increases, causing an increase in parallel overheads and associated runtime. Further, with a domain-decomposition approach, as the problem size assigned to each processor decreases, the fraction of each subdomain occupied by ghost-zone cells increases, which results in duplicated computations on each MPI process to maintain solution accuracy.

A notable feature is that the hybrid implementation does not produce performance gains beyond the pure MPI implementation. The expected advantages of the hybrid implementation over pure MPI include a reduction in the amount of data to be communicated (and in the total number of MPI calls), along with more flexible load balancing. For the idealized case presented here, load balancing is not an issue because each processor is assigned an equal rectangular section, so coarse-grain load balancing is trivial. Similar to MPI, thread-level parallelization as implemented via OpenMP introduces parallel overheads, associated mainly with thread creation/wake-up and frequent synchronization. These results suggest that, for this deployment, the hybrid parallel model does not provide performance gains over the pure MPI deployment.
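For reference, the following minimal sketch shows how the timings behind Eq. (4) are typically collected with MPI_Wtime and converted into S and E. It is hypothetical: the serial time T_1 is a placeholder that would come from a separate single-core run, and the time-stepping loop is elided.

```c
/* Hypothetical sketch: timing the main loop with MPI_Wtime and
 * computing the speedup S = T1/TP and efficiency E = S/NP of Eq. (4). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int np, rank;
    double t1_serial = 3600.0;   /* placeholder: serial run time in seconds */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);       /* start all ranks together */
    double t_start = MPI_Wtime();

    /* ... time-stepping loop would run here ... */

    MPI_Barrier(MPI_COMM_WORLD);       /* wait for the slowest rank */
    double tp = MPI_Wtime() - t_start;

    if (rank == 0) {
        double S = t1_serial / tp;     /* speedup, Eq. (4)    */
        double E = S / np;             /* parallel efficiency */
        printf("P = %d  TP = %.2f s  S = %.2f  E = %.1f%%\n",
               np, tp, S, 100.0 * E);
    }
    MPI_Finalize();
    return 0;
}
```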

Fig 1: Comparison of the speedup ratio and parallel efficiency when deploying the pure MPI and the hybrid MPI+OpenMP computing paradigms on a bare-metal cluster. Results show strong-scaling parallelism for a 400 × 400 grid cell rectangular configuration equally partitioned across domains. The hybrid implementation combines MPI domain decomposition with two OpenMP threads; hence, a 60-core deployment of the hybrid code corresponds to 30 MPI processes, each with two OpenMP threads.

To allow for a more direct comparison and to simplify deployments (and performance variables), the next stage considers only the pure MPI implementation and an assessment of performance in a cloud environment. Cloud computing initially emerged as a technology for delivering infrastructure as a service (IaaS). This model has since been extended to other elements such as application platforms, software, and security. The most popular mechanisms for provisioning and delivery of computing infrastructure use virtual machines (VMs) as the unit of delivery. Virtual machines are software encapsulations of a computing machine down to the operating system. Multiple VMs running different operating systems can run on a single physical host, with access to its resources mediated by a component called the hypervisor. Virtual machines incur heavy overheads because of this requirement for complete system virtualization. Past studies have demonstrated that running MPI applications over VM-based IaaS cloud services results in heavy performance degradation due to the virtualization overhead [17].


A new, low-overhead model of virtualization based on Linux containers has emerged in the recent past. In this model, the operating system is shared between different containers that remain isolated from each other. This technology has been used by several projects to provide an application-deployment mechanism, the most notable of them being Docker. Docker containers are now being offered by many IaaS providers as an alternative to virtual machines. Figure 2 presents model simulation time on cloud containers compared to running on the bare-metal setup. Results demonstrate broadly similar run-times across containers and bare metal up to 24 cores (with deployments in the container environment faster in many cases). Beyond this, simulation run-times on the cloud are noticeably slower than on bare metal, with deployments on 60 cores 28% slower on the cloud. However, for this particular model setup (400 × 400 grid cells), these core counts are beyond the limits of any reasonable parallel efficiency (as demonstrated in Figure 1) and are not particularly insightful for practical deployments.

Fig 2: Simulation time for a model grid configured for strong scaling benchmarks on both bare metal and cloud compute environments. Results show strong scaling parallelism for a 400 × 400 grid cell rectangular configuration equally partitioned across domains deployed on up to 60 cores.

To provide further insight, an equivalent problem was configured under a weak-scaling paradigm (i.e., as the number of processors increases, the problem size increases correspondingly), thereby maintaining a constant problem size per processor under the domain-decomposition implementation. Figure 3 presents simulation times for a case study deployed under the weak-scaling approach. In this configuration, a rectangular domain of 50 × 50 × 20 grid cells is assigned to each processor, so the computation per processor remains constant and only the communication varies as the processor count increases. Weak scaling provides insight into the complexity of translating an application to many-core parallelism. In an "embarrassingly parallel" application, where there is no communication between adjacent domains, the application could scale to any number of cores without performance degradation. In many practical applications, however, simulation time increases as communication overhead increases. This is particularly true of applications requiring global communication, such as this one. Many scientific computing applications require the solution of nonlinear implicit algorithms (e.g., meteorology [18], CFD [19], and financial modelling [20]). The best algorithms for these sparse linear problems, particularly at very large sizes, are often preconditioned iterative methods, making global-reduction operations a necessary step in many current, numerically efficient algorithms. Hence, the impact of such operations in both bare-metal and cloud computing environments is an important consideration across many scientific disciplines. Similar to the strong-scaling results presented above, these model simulations exhibit significant performance penalties when the number of cores exceeds 18. Up to 12 cores, execution time is lower for the containers than for the physical machines, with simulations on six cores 20% faster in containers. This implies that intra-container communication through the virtual network device is faster than communication between MPI processes using the shared-memory communication module provided by Open MPI. When more than 12 cores are used (and consequently more than one physical machine), execution time is shorter when running directly on the machine than inside containers.


This is particularly pronounced when running on 60 cores, where the execution time of the container-based application is 55% slower. A potential source of this slowdown is the large number of MPI messages sent by the application (both point-to-point and all-to-all), leading to an increase in network congestion and TCP timeouts [21]. In the case of a TCP timeout, the Linux kernel by default sets the TCP minimum retransmission timeout (RTO) to 0.2 seconds, meaning the application has to wait at least 0.2 seconds before continuing to receive messages [21]. Analysis of the time-per-timestep of the application running on 60 cores provides insight into this effect. The mean time-per-timestep running directly on the machine is 0.15 seconds, compared with 0.23 seconds inside the container. However, the standard deviation of this timing metric across the entire simulation (200 time steps in total) is 0.11 seconds in containers, as opposed to just 0.004 seconds for the bare-metal simulation, indicating large fluctuations in the container timings. Analysis demonstrates that, over the entire simulation, 69% of time steps execute in less than 0.2 seconds (mean time-per-timestep of 0.17 s), with the remaining 31% having a mean time-per-timestep of 0.38 s, indicating that TCP timeout delays impact a portion of the simulation and consequently the total execution time.
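The per-timestep analysis above can be reproduced with a simple post-processing step. The sketch below is hypothetical; it assumes the per-timestep wall-clock times have already been recorded into an array, and computes the mean, the standard deviation, and the fraction of timesteps exceeding the 0.2 s retransmission-timeout threshold.

```c
/* Hypothetical post-processing sketch: summary statistics for the
 * recorded per-timestep wall-clock times, including the fraction of
 * timesteps that exceed the 0.2 s TCP minimum RTO discussed above. */
#include <math.h>
#include <stdio.h>

void timestep_stats(const double *dt, int nsteps, double rto_threshold)
{
    double sum = 0.0, sumsq = 0.0;
    int n_slow = 0;

    for (int i = 0; i < nsteps; ++i) {
        sum   += dt[i];
        sumsq += dt[i] * dt[i];
        if (dt[i] > rto_threshold)
            ++n_slow;
    }

    double mean  = sum / nsteps;
    double stdev = sqrt(sumsq / nsteps - mean * mean);

    printf("mean = %.3f s  std = %.3f s  %.1f%% of steps > %.1f s\n",
           mean, stdev, 100.0 * n_slow / nsteps, rto_threshold);
}
```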

Fig 3: Weak-scaling simulation time for a 50 × 50 × 20 grid cell per-processor configuration as the number of cores increases. Weak scaling means that as the number of cores increases, the size of the problem increases correspondingly, so that in a perfectly scaling configuration the simulation time would remain constant.

4. Conclusions

This paper describes the migration of a three-dimensional hydrodynamic model with a semi-implicit solver to the cloud. Scaling metrics for running the application both directly on the machine and within containers are presented. Of particular interest is the impact of the high number of MPI calls (both point-to-point and all-to-all) on parallel scalability. The results presented here demonstrate that parallel performance at high core counts with a semi-implicit solver is a challenging undertaking. At higher numbers of cores, parallel efficiency diminishes as the communication overheads introduced by the semi-implicit algorithm become punitive. This is true whether running on bare metal or within containers. Running on a single node (up to 12 MPI processes, either on bare metal or in containers), application execution times are lower within containers than on bare metal, suggesting that intra-container communication is faster than the shared-memory communication provided by Open MPI. Multi-node deployments, however (from 12 to 60 processes in this case), produce lower run-times when run directly on the machine. This timing disparity grows as the number of processes increases, with the application run-time at 60 cores 55% slower than the bare-metal deployment.

These results, however, are somewhat skewed by OS timeout delays that add around 200 milliseconds to some timesteps. Such delays may be expected in a cloud environment that is not optimised for high-throughput scientific computing, since communication among containers invokes more OS/software involvement than in the bare-metal case. We are currently investigating whether solutions closer to the network card may reduce this overhead.


References

[1] D. Eadline, "Moving HPC to the Cloud," ADMIN Magazine. [Online]. Available: http://www.admin-magazine.com/HPC/Articles/Moving-HPC-to-the-Cloud. [Accessed: 29-Feb-2016].
[2] F. O'Donncha, E. Ragnoli, S. Zhuk, F. Suits, and M. Hartnett, "Surface flow dynamics within an exposed wind-driven bay: Combined HFR observations and model simulations," presented at the IEEE Oceans 2012, Hampton Roads, Virginia, 2012.
[3] F. O'Donncha, M. Hartnett, S. Nash, L. Ren, and E. Ragnoli, "Characterizing observed circulation patterns within a bay using HF radar and numerical model simulations," J. Mar. Syst., vol. 142, pp. 96–110, 2015.
[4] P. X. H. Thanh, M. D. Grace, and S. C. James, "Sandia National Laboratories Environmental Fluid Dynamics Code: Sediment transport user manual," Sandia Natl. Lab., Livermore, CA, SAND2008-5621, 2008.
[5] S. C. James, C. A. Jones, M. D. Grace, and J. D. Roberts, "Advances in sediment transport modelling," J. Hydraul. Res., vol. 48, no. 6, pp. 754–763, 2010.
[6] S. C. James and V. Boriah, "Modeling algae growth in an open-channel raceway," J. Comput. Biol., vol. 17, no. 7, pp. 895–906, 2010.
[7] S. C. James, V. Janardhanam, and D. T. Hanson, "Simulating pH effects in an algal-growth hydrodynamics model," J. Phycol., vol. 49, no. 3, pp. 608–615, 2013.
[8] F. O'Donncha, M. Hartnett, and D. R. Plew, "Parameterizing suspended canopy effects in a three-dimensional hydrodynamic model," J. Hydraul. Res., vol. 53, no. 6, pp. 714–727, 2015.
[9] F. O'Donncha, S. C. James, N. O'Brien, and E. Ragnoli, "Numerical modelling study of the effects of suspended aquaculture farms on tidal stream energy generation," in OCEANS 2015 - Genova, 2015, pp. 1–7.
[10] A. F. Blumberg and G. L. Mellor, "A description of a three-dimensional coastal ocean circulation model," Coast. Estuar. Sci., vol. 4, pp. 1–16, 1987.
[11] S. M. Griffies, C. Böning, F. O. Bryan, E. P. Chassignet, R. Gerdes, H. Hasumi, A. Hirst, A. M. Treguier, and D. Webb, "Developments in ocean climate modelling," Ocean Model., vol. 2, no. 3–4, pp. 123–192, 2000.
[12] F. O'Donncha, E. Ragnoli, and F. Suits, "Parallelisation study of a three-dimensional environmental flow model," Comput. Geosci., vol. 64, pp. 96–103, Mar. 2014.
[13] F. O'Donncha, S. C. James, N. O'Brien, and E. Ragnoli, "Parallelisation of Hydro-environmental Model for Simulating Marine Current Devices," presented at the IEEE/MTS Oceans, Washington, 2015.
[14] D. Eadline, "Will HPC Work In The Cloud?," Cluster Monkey, 2012. [Online]. Available: http://www.clustermonkey.net/Grid/will-hpc-work-in-the-cloud.html. [Accessed: 29-Feb-2016].
[15] A. Deane, G. Brenner, D. R. Emerson, J. McDonough, D. Tromeur-Dervout, N. Satofuka, A. Ecer, and J. Periaux, Parallel Computational Fluid Dynamics 2005: Theory and Applications. Elsevier, 2006.
[16] Z. J. Wang, Adaptive High-order Methods in Computational Fluid Dynamics, vol. 2. World Scientific, 2011.
[17] R. R. Expósito, G. L. Taboada, S. Ramos, J. Touriño, and R. Doallo, "Performance analysis of HPC applications in the cloud," Future Gener. Comput. Syst., vol. 29, no. 1, pp. 218–229, 2013.
[18] J. Steppeler, R. Hess, U. Schättler, and L. Bonaventura, "Review of numerical methods for nonhydrostatic weather prediction models," Meteorol. Atmospheric Phys., vol. 82, no. 1–4, pp. 287–301, Jan. 2003.
[19] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "High-performance parallel implicit CFD," Parallel Comput., vol. 27, no. 4, pp. 337–362, 2001.
[20] Q.-J. Meng, D. Ding, and Q. Sheng, "Preconditioned iterative methods for fractional diffusion models in finance," Numer. Methods Partial Differ. Equ., vol. 31, no. 5, pp. 1382–1395, Sep. 2015.
[21] C. Ruiz, E. Jeanvoine, and L. Nussbaum, "Performance evaluation of containers for HPC," in Euro-Par 2015: Parallel Processing Workshops, 2015, pp. 813–824.
