Earth and Space Science Informatics [IN]

IN21C
 MC:3018  Tuesday  0800h

Emerging Multicore Computing Technology in Earth and Space Sciences I


Presiding:  J Michalakes, NCAR; P Messmer, Tech-X Corporation; M Halem, University of Maryland, Baltimore County; S Zhou, NASA Goddard Space Flight Center

IN21C-01 INVITED

Harnessing Petaflop-Scale Multi-Core Supercomputing for Problems in Space Science

* Albright, B J balbright@lanl.gov, Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545, United States
Yin, L , Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545, United States
Bowers, K J, Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545, United States
Daughton, W , Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545, United States
Bergen, B , Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545, United States
Kwan, T J, Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545, United States

The particle-in-cell (PIC) kinetic plasma code VPIC has been migrated successfully to the world's fastest supercomputer, Roadrunner, a hybrid multi-core platform built by IBM for Los Alamos National Laboratory. How this was achieved will be described, and examples of state-of-the-art calculations in space science, in particular the study of magnetic reconnection, will be presented. With VPIC on Roadrunner, we have performed, for the first time, plasma PIC calculations with over one trillion particles, >100× larger than calculations considered "heroic" by community standards. This allows examination of physics at unprecedented scale and fidelity. Roadrunner is an example of an emerging paradigm in supercomputing: the trend toward multi-core systems with deep memory hierarchies, where memory bandwidth optimization is vital to achieving high performance. Getting VPIC to perform well on such systems is a formidable challenge: the core algorithm is memory-bandwidth limited, has a low compute-to-data ratio, and requires random access to memory in its inner loop. That we were able to get VPIC to perform and scale well, achieving >0.374 Pflop/s and linear weak scaling on real physics problems on up to the full 12240-core Roadrunner machine, bodes well for harnessing these machines for our community's needs in the future. Many of the design considerations we encountered carry over to other multi-core and accelerated (e.g., GPU-based) platforms, and we modified VPIC with this flexibility in mind. These considerations will be summarized, and strategies for how one might adapt a code for such platforms will be shared. Work performed under the auspices of the U.S. DOE by LANS, LLC, Los Alamos National Laboratory. Dr. Bowers is a LANL Guest Scientist; he is presently at D. E. Shaw Research LLC, 120 W 45th Street, 39th Floor, New York, NY 10036.
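
A minimal sketch (plain C, illustrative only; the names and the simplified push below are not VPIC's actual data structures or algorithm) of why a PIC inner loop is memory-bandwidth bound: each particle gathers field values through a cell index that is effectively random in memory order, so few floating-point operations are performed per byte moved.

    #include <stddef.h>

    /* One particle: position, momentum, and the index of the cell it occupies. */
    typedef struct { float x, y, z, ux, uy, uz; int cell; } particle_t;
    /* Field values stored per cell. */
    typedef struct { float ex, ey, ez; } field_t;

    void push_particles(particle_t *p, size_t np, const field_t *f, float qdt_m)
    {
        for (size_t i = 0; i < np; ++i) {
            const field_t *fc = &f[p[i].cell]; /* gather: indirect, cache-unfriendly */
            p[i].ux += qdt_m * fc->ex;         /* few flops per byte of particle and */
            p[i].uy += qdt_m * fc->ey;         /* field data streamed from memory    */
            p[i].uz += qdt_m * fc->ez;
            p[i].x  += p[i].ux;                /* position update (unit dt, unit dx) */
            p[i].y  += p[i].uy;
            p[i].z  += p[i].uz;
        }
    }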

IN21C-02

Accelerate Climate Models with the IBM Cell Processor

* Zhou, S shujia.zhou@nasa.gov, Northrop Grumman Corporation, 4801 Stonecroft Blvd., Westfields, VA 20151, United States
* Zhou, S shujia.zhou@nasa.gov, NASA Goddard Space Flight Center, Code 610, Greenbelt, MD 20771, United States
Duffy, D daniel.q.duffy@nasa.gov, Computer Sciences Corporation, 3170 Fairview Park Drive, Falls Church, VA 22042, United States
Duffy, D daniel.q.duffy@nasa.gov, NASA Goddard Space Flight Center, Code 610, Greenbelt, MD 20771, United States
Clune, T tom.clune@nasa.gov, NASA Goddard Space Flight Center, Code 610, Greenbelt, MD 20771, United States
Williams, S samw@EECS.Berkeley.EDU, University of California, Berkeley, Dept. of Electrical Eng. and Computer Science, Berkeley, CA 94720, United States
Suarez, M max.j.suarez@nasa.gov, NASA Goddard Space Flight Center, Code 610, Greenbelt, MD 20771, United States
Halem, M halem@umbc.edu, University of Maryland Baltimore County, Computer Science and Electrical Engineering Dept., Baltimore, MD 21250, United States

Ever-increasing model resolutions and physical processes in climate models demand continual increases in computing power. The IBM Cell processor's order-of-magnitude peak performance advantage over conventional processors makes it very attractive for meeting this requirement. However, the Cell's characteristics, notably the 256 KB local memory per SPE and a new low-level communication mechanism, make porting an application very challenging. We selected the solar radiation component of the NASA GEOS-5 climate model, which (1) is representative of column-physics components (~50% of total computation time), (2) has a high computational load relative to data traffic to/from main memory, and (3) performs independent calculations across multiple columns. We converted the baseline code (single-precision Fortran) to C and ported it to an IBM BladeCenter QS20, manually SIMDizing across 4 independent columns, and found that a Cell with 8 SPEs can process more than 3000 columns per second. Compared with the baseline results, the Cell is ~6.76x, ~8.91x, and ~9.85x faster than a single core of Intel's Xeon Woodcrest, Dempsey, and Itanium2 processors, respectively. Our analysis shows that the Cell could also speed up the dynamics component (~25% of total computation time). We believe this dramatic performance improvement makes the Cell processor very competitive, at least as an accelerator. We will report our experience in porting both the C and Fortran codes and will discuss our work on porting other climate model components.
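
The column-interleaved SIMD layout described above can be pictured with the sketch below (illustrative C only; the radiative-transfer arithmetic is a placeholder rather than the GEOS-5 solar radiation code, and names such as NCOL and solar_sweep are assumptions): four independent columns are processed in lockstep so that each statement in the vertical sweep maps onto one 4-wide single-precision vector operation on an SPE.

    #define NCOL 4  /* independent columns processed in lockstep (SIMD width) */

    /* One value per column, 16-byte aligned so it maps to one SIMD register. */
    typedef float vcol[NCOL] __attribute__((aligned(16)));

    /* Placeholder vertical sweep: the inner loop over c is what a 4-wide
     * single-precision SIMD unit (SPE, or SSE on the host) executes per op. */
    void solar_sweep(const vcol tau[], vcol flux[], int nlev, const float f0[NCOL])
    {
        for (int c = 0; c < NCOL; ++c)
            flux[0][c] = f0[c];                        /* top-of-atmosphere flux */
        for (int k = 0; k < nlev; ++k)                 /* serial in the vertical */
            for (int c = 0; c < NCOL; ++c)             /* vector across columns  */
                flux[k + 1][c] = flux[k][c] * (1.0f - tau[k][c]);
    }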

IN21C-03

Geospace simulations on the Cell BE processor

* Germaschewski, K kai.germaschewski@unh.edu, Space Science Center University of New Hampshire, 8 College Rd, Durham, NH 03824, United States
Raeder, J j.raeder@unh.edu, Space Science Center University of New Hampshire, 8 College Rd, Durham, NH 03824, United States
Larson, D douglas.larson@unh.edu, Space Science Center University of New Hampshire, 8 College Rd, Durham, NH 03824, United States

OpenGGCM (Open Geospace General Circulation Model) is an established numerical code that simulates the Earth's space environment. The most computing-intensive part is the MHD (magnetohydrodynamics) solver, which models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the Sun. Like other global magnetosphere codes, OpenGGCM's realism is limited by computational constraints on grid resolution. We investigate porting the MHD solver to the Cell BE architecture, a novel heterogeneous multicore architecture capable of up to 230 GFlops per processor. Realizing this performance on the Cell processor is, however, a programming challenge. We implemented the MHD solver using a multi-level parallel approach: on the coarsest level, the problem is distributed across processors using the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory-limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the vector/SIMD FPUs in each SPE. Memory management must be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We obtained excellent performance numbers, a speedup by a factor of 25 compared with using only the main processor, while still keeping the numerical implementation details of the code maintainable.
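
The slice-by-slice, double-buffered SPE processing described above follows the pattern sketched below (portable C; the memcpy() calls stand in for the asynchronous mfc_get()/mfc_put() DMA transfers an SPE would issue, and the buffer size and function names are illustrative, not OpenGGCM's).

    #include <string.h>

    #define SLICE_FLOATS 4096  /* illustrative size of one 2-D slice of a column */

    void process_column(const float *in_main, float *out_main, int nslices,
                        void (*mhd_update)(float *slice))
    {
        static float buf[2][SLICE_FLOATS];       /* "local store" double buffer */

        memcpy(buf[0], in_main, sizeof buf[0]);  /* fetch first slice           */
        for (int s = 0; s < nslices; ++s) {
            int cur = s & 1, nxt = cur ^ 1;
            if (s + 1 < nslices)                 /* prefetch next slice while    */
                memcpy(buf[nxt],                 /* the current one is computed  */
                       in_main + (size_t)(s + 1) * SLICE_FLOATS, sizeof buf[nxt]);
            mhd_update(buf[cur]);                /* SIMD MHD update on this slice */
            memcpy(out_main + (size_t)s * SLICE_FLOATS, buf[cur], sizeof buf[cur]);
        }
    }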

IN21C-04

Benchmarking NWP Kernels on Multi- and Many-core Processors

* Michalakes, J michalak@ucar.edu, National Center for Atmospheric Research, 3450 Mitchell Lane, Boulder, CO 80301, United States
Vachharajani, M manishv@colorado.edu, University of Colorado at Boulder, Engineering Center, ECOT 342 425 UCB, Boulder, CO 80309, United States

Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost-performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine-grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecasting (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc., (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare the effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
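
A kernel-benchmarking harness of the kind described can be as simple as the following sketch (POSIX C; the benchmark() interface and the hand-counted flop and byte figures per call are assumptions for illustration, and the WRF kernels themselves are not reproduced).

    #include <stdio.h>
    #include <time.h>

    /* Wall-clock timer (POSIX). */
    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    /* Time an extracted kernel and report achieved rate, bandwidth, and
     * computational intensity. */
    void benchmark(const char *name, void (*kernel)(void), int iters,
                   double flops_per_call, double bytes_per_call)
    {
        kernel();                                   /* warm-up call */
        double t0 = now_sec();
        for (int i = 0; i < iters; ++i)
            kernel();
        double dt = now_sec() - t0;
        printf("%s: %.3fs  %.2f GFLOP/s  %.2f GB/s  %.2f flop/byte\n",
               name, dt,
               1e-9 * flops_per_call * iters / dt,
               1e-9 * bytes_per_call * iters / dt,
               flops_per_call / bytes_per_call);
    }

Reporting flops per byte alongside achieved GFLOP/s and GB/s makes it immediately clear whether a given kernel is compute-limited or bandwidth-limited on a particular processor.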

http://www.mmm.ucar.edu/wrf/WG2/GPU

IN21C-05

Hybrid parallelism for the Weather Research and Forecasting Model on Intel platforms

Dubtsov, R roman.s.dubtsov@intel.com, Intel, Lavrentieva Av. 6/1, Novosibirsk, 630090, Russian Federation
Semenov, A alexander.l.semenov@intel.com, Intel, Lavrentieva Av. 6/1, Novosibirsk, 630090, Russian Federation
* Lubin, M mark.lubin@intel.com, Intel, 1900 Prairie City Rd., Folsom, CA 95630, United States

Multi-core and upcoming many-core CPUs have dramatically increased computing density in data centers and the parallelism available to HPC applications. Currently, large clusters are employed to carry out weather and other simulations of unprecedented size. The Weather Research and Forecasting (WRF) Model is widely used, and a new version has recently been released. The software runs successfully on a number of Intel® architectures, including new Intel® processors that provide new opportunities for leading performance and scalability for NWS applications. However, utilizing the available computing power efficiently is a challenging task, in part because the increased density puts additional stress on the cluster interconnect and memory interfaces. We present one approach to overcoming these obstacles by evaluating the hybrid MPI-plus-OpenMP parallel programming model used in WRF, presenting a detailed study of how different workloads perform and highlighting the benefits of the approach under consideration.
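
The hybrid model being evaluated follows the familiar MPI-plus-OpenMP pattern sketched below (generic C, not WRF source code: MPI ranks own horizontal patches of the domain, while OpenMP threads share the tiles within a patch).

    #include <mpi.h>

    /* One time step on this rank's patch: threads split the tiles. */
    void model_step(int ntiles, void (*tile_solver)(int tile))
    {
        /* halo exchange with neighbouring patches would go here (MPI calls) */
        #pragma omp parallel for schedule(dynamic)
        for (int t = 0; t < ntiles; ++t)
            tile_solver(t);
    }

    int main(int argc, char **argv)
    {
        int provided;
        /* FUNNELED: only the master thread makes MPI calls between steps. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        /* ... decompose the domain, loop over time steps calling model_step() ... */
        MPI_Finalize();
        return 0;
    }

Under these assumptions a typical launch would place one MPI rank per node or per socket and set OMP_NUM_THREADS to the remaining cores, which reduces the number of ranks stressing the interconnect while keeping all cores busy.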

IN21C-06

Accelerating the Computation of Theoretical Spectro-Polarimetric Signals: Comparative Analysis Using the Cell BE and NVIDIA GPU for Implementing the Voigt Function

* Garcia, J jgarcia@ucar.edu, National Center for Atmospheric Research, P.O. Box 3000, Boulder, CO 80307, United States
Kelly, R rory@ucar.edu, National Center for Atmospheric Research, P.O. Box 3000, Boulder, CO 80307, United States

Rapid calculation of the Voigt profile is critical for high performance in computational models for spectro-polarimetric analysis. This makes the Voigt function an ideal candidate for exploiting accelerator technologies. We have implemented the Curtis and Osborne rational polynomial approximation to the Voigt function on two architectures: the Cell Broadband Engine and a graphics processing unit (GPU). We present a comparative analysis in two areas of relevance: the programming model and the possible speed-up factor if these technologies were incorporated into a complete model.
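
For illustration, the sketch below evaluates the Voigt function with the lowest-order (one-pole) rational approximation, which is accurate only far from line center; it stands in for the higher-order Curtis and Osborne polynomial, whose coefficients are not reproduced here, but it shows why the evaluation maps so naturally onto SIMD lanes and GPU threads.

    #include <complex.h>

    /* Voigt function K(x,y) = Re[w(x + iy)], approximated far from line center
     * by w(z) ~ t / (sqrt(pi) * (t*t + 0.5)) with t = y - i*x.  This is NOT the
     * Curtis-Osborne approximation used by the authors. */
    static const double SQRT_PI = 1.7724538509055160;

    double voigt_wing(double x, double y)
    {
        double complex t = y - I * x;
        return creal(t / (SQRT_PI * (t * t + 0.5)));
    }

    /* Each (x, y) evaluation is independent, so batches of profile points map
     * directly onto SPE SIMD lanes or GPU threads. */
    void voigt_batch(const double *x, const double *y, double *k, int n)
    {
        for (int i = 0; i < n; ++i)
            k[i] = voigt_wing(x[i], y[i]);
    }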

IN21C-07

Chemical Transport Models on Accelerator Architectures

* Linford, J jlinford@vt.edu, Virginia Polytechnic Institute and State University, 2202 Kraft Drive, Blacksburg, VA 24060, United States
Sandu, A asandu7@vt.edu, Virginia Polytechnic Institute and State University, 2202 Kraft Drive, Blacksburg, VA 24060, United States

Heterogeneous multicore chipsets with many layers of polymorphic parallelism are becoming increasingly common in high-performance computing systems. Homogeneous co-processors with many streaming processors also offer unprecedented peak floating-point performance. Effective use of parallelism in these new chipsets is paramount. We present optimization techniques for 3D chemical transport models to take full advantage of emerging Cell Broadband Engine and graphics processing unit (GPU) technology. Our techniques achieve 2.15x the per-node performance of an IBM BlueGene/P on the Cell Broadband Engine, and a strongly scalable 1.75x the per-node performance of an IBM BlueGene/P on an NVIDIA GeForce 8600.

IN21C-08

Using GPUs to Meet Next Generation Weather Model Computational Requirements

Govett, M mark.w.govett@noaa.gov, NOAA/OAR Earth System Research Laboratory, 325 Broadway, Boulder, CO 80305, United States
Hart, L leslie.b.hart@noaa.gov, NOAA/OAR Earth System Research Laboratory, 325 Broadway, Boulder, CO 80305, United States
Henderson, T thomas.b.henderson@noaa.gov, Cooperative Institute for Research in the Atmosphere (CIRA), Colorado State University, 325 Broadway, Boulder, CO 80305, United States
Middlecoff, J jacques.middlecoff@noaa.gov, Cooperative Institute for Research in the Atmosphere (CIRA), Colorado State University, 325 Broadway, Boulder, CO 80305, United States
* Tierney, C craig.tierney@noaa.gov, Cooperative Institute for Research in Environmental Sciences (CIRES), University of Colorado, 325 Broadway, Boulder, CO 80305, United States

Weather prediction goals within the Earth System Research Laboratory at NOAA require significant increases in model resolution (~1 km) and forecast durations (~60 days) to support expected requirements in 5 years or less. However, meeting these goals will likely require at least 100k dedicated cores. Few systems will exist that could even run such a large problem, much less facilities that could provide the necessary power and cooling. To meet our goals we are exploring alternative technologies, including Graphics Processing Units (GPUs), that could provide significantly more computational performance and reduced power and cooling requirements, at a lower cost than traditional high-performance computing solutions. Our current global numerical weather prediction model, the Flow-following finite-volume Icosahedral Model (FIM, http://fim.noaa.gov), is still early in its development but is already demonstrating good fidelity and excellent scalability to thousands of cores. The icosahedral grid has several complexities not present in more traditional Cartesian grids, including polygons with different numbers of sides (five and six) and non-trivial computation of the locations of neighboring grid cells. FIM uses an indirect addressing scheme that yields very compact code despite these complexities. We have extracted computational kernels that encompass the functions likely to take the most time at higher resolutions, including all those with horizontal dependencies. The kernels implement equations for computing anti-diffusive flux-corrected transport across cell edges, calculating forcing terms and time-step differencing, and re-computing time-dependent vertical coordinates. We are extending these kernels to explore the performance of GPU-specific optimizations. We will present initial performance results from the computational kernels of the FIM model, as well as the challenges related to porting code with indirect memory references to NVIDIA GPUs. Results of this investigation should benefit the design of our next-generation icosahedral weather and climate models.
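
The indirect addressing referred to above can be pictured with the following sketch (illustrative C; the cell layout, names, and the placeholder "flux" expression are assumptions, not FIM code). The gathered neighbor reads in the inner loop are exactly the memory references that make the GPU port challenging.

    /* Each icosahedral cell has 5 or 6 edges; the neighbor across each edge is
     * found through a lookup table rather than an (i,j) offset. */
    typedef struct {
        int   nedges;        /* 5 for the twelve pentagons, 6 for hexagons */
        int   neighbor[6];   /* indirect addresses of adjacent cells       */
        float edge_len[6];   /* length of each shared edge                 */
    } cell_t;

    void horizontal_tendency(const cell_t *cells, int ncells,
                             const float *q, float *tend)
    {
        for (int c = 0; c < ncells; ++c) {       /* candidate GPU thread index */
            float acc = 0.0f;
            for (int e = 0; e < cells[c].nedges; ++e) {
                int nb = cells[c].neighbor[e];   /* gathered (indirect) access */
                acc += cells[c].edge_len[e] * (q[nb] - q[c]);  /* placeholder  */
            }
            tend[c] = acc;
        }
    }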

http://fim.noaa.gov