Harnessing Petaflop-Scale Multi-Core Supercomputing for Problems in Space Science
The particle-in-cell kinetic plasma code VPIC has been successfully migrated to the world's fastest supercomputer, Roadrunner, a hybrid multi-core platform built by IBM for Los Alamos National Laboratory. How this was achieved will be described, and examples of state-of-the-art calculations in space science, in particular the study of magnetic reconnection, will be presented. With VPIC on Roadrunner, we have performed, for the first time, plasma PIC calculations with over one trillion particles, more than 100× larger than calculations considered "heroic" by community standards. This allows examination of physics at unprecedented scale and fidelity. Roadrunner is an example of an emerging paradigm in supercomputing: the trend toward multi-core systems with deep memory hierarchies, in which optimizing for memory bandwidth is vital to achieving high performance. Getting VPIC to perform well on such systems is a formidable challenge: the core algorithm is memory-bandwidth limited, has a low compute-to-data ratio, and requires random access to memory in its inner loop. That we were able to make VPIC perform and scale well, achieving >0.374 Pflop/s and linear weak scaling on real physics problems on up to the full 12240-core Roadrunner machine, bodes well for harnessing these machines for our community's needs in the future. Many of the design considerations encountered carry over to other multi-core and accelerated (e.g., GPU-based) platforms, and we modified VPIC with this flexibility in mind. These considerations will be summarized, and strategies for adapting a code to such platforms will be shared. Work performed under the auspices of the U.S. DOE by LANS LLC at Los Alamos National Laboratory. Dr. Bowers is a LANL Guest Scientist; he is presently at D. E. Shaw Research LLC, 120 W 45th Street, 39th Floor, New York, NY 10036.
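To illustrate why the PIC inner loop is bandwidth-bound with random access, the following is a minimal 1D electrostatic particle push and deposit. It is an illustrative sketch only, not VPIC's kernel (VPIC uses a 3D relativistic push with charge-conserving current deposition); it shows the gather/scatter pattern in which each particle reads field values from its effectively random cell and scatters charge back to the grid.

```python
# Minimal 1D electrostatic PIC step (illustrative sketch, not VPIC's algorithm).
# The gather (read efield at the particle's cell) and scatter (deposit to the
# two nearest cells) are the random, low-arithmetic-intensity memory accesses
# that make the kernel memory-bandwidth limited.

def push_and_deposit(x, v, efield, dx, qm, dt):
    """Advance particles one step and deposit density with periodic boundaries.

    x, v   : lists of particle positions and velocities
    efield : E values at cell centers (len = number of cells)
    dx     : cell size; qm: charge-to-mass ratio; dt: time step
    Returns (x, v, rho) with linear (cloud-in-cell) deposition.
    """
    n = len(efield)
    L = n * dx
    rho = [0.0] * n
    for i in range(len(x)):
        # Gather: locate the particle's cell -> random access into efield
        c = int(x[i] / dx) % n
        v[i] += qm * efield[c] * dt          # nearest-grid-point field
        x[i] = (x[i] + v[i] * dt) % L        # periodic wrap
        # Scatter: cloud-in-cell deposit to the two nearest cells
        c = int(x[i] / dx) % n
        w = x[i] / dx - int(x[i] / dx)       # fractional offset within cell
        rho[c] += 1.0 - w
        rho[(c + 1) % n] += w
    return x, v, rho
```

With a uniform field, each particle's velocity changes by qm·E·dt per step, and the deposited density sums to the particle count, as expected for a conservative deposit.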
Accelerating Climate Models with the IBM Cell Processor
Ever-increasing model resolutions and physical processes in climate models demand continual increases in computing power. The IBM Cell processor's order-of-magnitude peak performance increase over conventional processors makes it very attractive for fulfilling this requirement. However, the Cell's characteristics (256 KB of local memory per SPE and a new low-level communication mechanism) make it very challenging to port an application. We selected the solar radiation component of the NASA GEOS-5 climate model, which (1) is representative of column physics components (~50% of total computation time), (2) has a high computational load relative to data traffic to/from main memory, and (3) performs independent calculations across multiple columns. We converted the baseline code (single-precision Fortran) to C and ported it to an IBM BladeCenter QS20, manually SIMDizing 4 independent columns, and found that a Cell with 8 SPEs can process more than 3000 columns per second. Compared with the baseline results, the Cell is ~6.76x, ~8.91x, and ~9.85x faster than a core on Intel's Xeon Woodcrest, Dempsey, and Itanium2, respectively. Our analysis shows that the Cell could also speed up the dynamics component (~25% of total computation time). We believe this dramatic performance improvement makes the Cell processor very competitive, at least as an accelerator. We will report our experience in porting both the C and Fortran codes and will discuss our work in porting other climate model components.
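The "SIMDize across independent columns" strategy can be sketched as follows: rather than vectorizing within one column, four columns are advanced in lockstep so that each arithmetic operation maps onto one 4-wide SIMD instruction on an SPE. Here the four lanes are modeled with 4-element lists, and the physics (a toy exponential layer attenuation) is a stand-in, not the GEOS-5 solar radiation kernel.

```python
import math

def attenuate_4cols(flux_top, optical_depth_layers):
    """Propagate downward flux through layers for 4 columns at once.

    flux_top             : [f0, f1, f2, f3] incoming flux per column
    optical_depth_layers : list of [t0, t1, t2, t3] optical depths per layer
    Returns the surface flux per column.
    """
    f = list(flux_top)
    for tau in optical_depth_layers:
        # One 4-wide operation per layer: f *= exp(-tau), lane by lane.
        # On the SPE each such line becomes a single SIMD multiply.
        f = [fi * math.exp(-ti) for fi, ti in zip(f, tau)]
    return f
```

Because the columns are independent, the lanes never exchange data, which is what makes this layout a natural fit for the SPE's vector FPU.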
Geospace simulations on the Cell BE processor
OpenGGCM (Open Geospace General Circulation Model) is an established numerical code that simulates the Earth's space environment. The most computing-intensive part is the MHD (magnetohydrodynamics) solver, which models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the Sun. Like other global magnetosphere codes, OpenGGCM's realism is limited by computational constraints on grid resolution. We investigate porting the MHD solver to the Cell BE architecture, a novel inhomogeneous multicore architecture capable of up to 230 GFlops per processor. Realizing this high performance on the Cell processor is, however, a programming challenge. We implemented the MHD solver using a multi-level parallel approach: on the coarsest level, the problem is distributed to processors using the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory-limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the vector/SIMD FPUs in each SPE. Memory management must be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We obtained excellent performance numbers, a speed-up of a factor of 25 compared to using only the main processor, while still keeping the numerical implementation details of the code maintainable.
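The slice-by-slice SPE work loop described above is typically double-buffered: while slice k is being computed in one local-store buffer, the next slice streams into the other. The following structural sketch models the DMA as a plain copy and the MHD update as a toy 3-point stencil; on the Cell the copies would be asynchronous mfc_get/mfc_put DMA transfers.

```python
# Double-buffered slice processing (structural sketch; "DMA" is a copy here).

def process_column(column, update):
    """column: list of slices (each a list of floats); update: per-slice fn."""
    buffers = [None, None]            # two local-store slots
    out = []
    buffers[0] = list(column[0])      # "DMA in" the first slice
    for k in range(len(column)):
        nxt = (k + 1) % 2
        if k + 1 < len(column):
            buffers[nxt] = list(column[k + 1])   # prefetch the next slice
        out.append(update(buffers[k % 2]))       # compute the current slice
    return out

def smooth(slice_):
    """Toy 3-point stencil standing in for the MHD update on one slice."""
    n = len(slice_)
    return [(slice_[max(i - 1, 0)] + slice_[i] + slice_[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]
```

With asynchronous DMA, the prefetch overlaps with the compute call, hiding main-memory latency behind useful work, which is the point of the two-slot layout.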
Benchmarking NWP Kernels on Multi- and Many-core Processors
Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost-performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass-produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine-grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc., (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare the effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
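Characterizing kernels by computational intensity and memory bandwidth pressure commonly takes the form of a roofline-style bound: a kernel with arithmetic intensity AI (flops per byte moved) on a machine with a given peak compute rate and memory bandwidth is limited by whichever ceiling it hits first. A minimal sketch, with illustrative numbers rather than measured WRF values:

```python
def roofline_gflops(ai, peak_gflops, bw_gbytes):
    """Attainable performance bound (Gflop/s) for arithmetic intensity
    `ai` in flop/byte on a node with the given peak rate and bandwidth."""
    return min(peak_gflops, ai * bw_gbytes)
```

For example, a stencil kernel with AI of 0.25 flop/byte on a hypothetical node with 100 Gflop/s peak and 20 GB/s of memory bandwidth is bounded at 5 Gflop/s, i.e., bandwidth-limited at 5% of peak, which is the typical regime for NWP dynamics kernels.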
Hybrid parallelism for Weather Research and Forecasting Model on Intel platforms
Multi-core and upcoming many-core CPUs have dramatically increased computing density in datacenters and the parallelism available to HPC applications. Currently, large clusters are employed to carry out weather and other simulations of unprecedented size. The Weather Research and Forecasting (WRF) Model is widely used, and a new version has recently been released. The software runs successfully on a number of Intel® architectures, including new Intel® processors that provide new opportunities for leading performance and scalability for NWP applications. However, utilizing the available computing power efficiently is a challenging task, in part because increased density puts additional stress on the cluster interconnect and memory interfaces. We present one approach to overcoming these obstacles by evaluating the hybrid MPI-plus-OpenMP parallel programming model used in WRF, showing a detailed study of how different workloads perform and highlighting the benefits of this approach.
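The two-level decomposition behind WRF's hybrid MPI-plus-OpenMP model can be sketched as follows: the domain is first split into per-rank patches (MPI, with halo exchange over the interconnect), and each patch is further split into tiles that OpenMP threads process from shared memory. This helper only computes contiguous 1D patch/tile bounds; WRF's actual decomposition is two-dimensional.

```python
def split(n, parts):
    """Split n points into `parts` contiguous (start, end) ranges, end exclusive."""
    base, rem = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        size = base + (1 if p < rem else 0)   # spread the remainder evenly
        bounds.append((start, start + size))
        start += size
    return bounds

def patches_and_tiles(n, nranks, nthreads):
    """Per-rank lists of per-thread tile bounds over a 1D domain of n points."""
    out = []
    for (s, e) in split(n, nranks):           # level 1: MPI patches
        out.append([(s + ts, s + te)          # level 2: OpenMP tiles
                    for (ts, te) in split(e - s, nthreads)])
    return out
```

Because tiles share a patch's memory, only the patch boundaries require message passing, which reduces interconnect pressure relative to a pure-MPI decomposition with one rank per core.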
Accelerating the Computation of Theoretical Spectro-Polarimetric Signals: A Comparative Analysis Using the Cell BE and NVIDIA GPU for Implementing the Voigt Function
Rapid calculation of the Voigt profile is critical for high performance in computational models for spectro-polarimetric analysis. This makes the Voigt function an ideal candidate for exploiting accelerator technologies. We have implemented the Curtis and Osborne rational polynomial approximation to the Voigt function on two architectures: the Cell Broadband Engine and a graphics processing unit (GPU). We present a comparative analysis in two areas of relevance: the programming model and the possible speed-up factor if these technologies were incorporated into a complete model.
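For context, the Voigt profile is the convolution of a Gaussian (Doppler) and a Lorentzian (pressure) line shape. The brute-force quadrature below shows why fast rational approximations such as the Curtis and Osborne form matter: the direct integral is far too slow for per-wavelength, per-line use. (This is an illustrative sketch only; the Curtis and Osborne coefficients are not reproduced here.)

```python
import math

def voigt_numeric(x, sigma, gamma, half_width=30.0, steps=4000):
    """Voigt profile at x: trapezoidal convolution of a unit-area Gaussian
    (std dev sigma) with a unit-area Lorentzian (half-width gamma)."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        t = -half_width + i * h
        g = math.exp(-t * t / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))
        l = (gamma / math.pi) / ((x - t) ** 2 + gamma * gamma)
        w = 0.5 if i in (0, steps) else 1.0   # trapezoid end weights
        total += w * g * l
    return total * h
```

The thousands of transcendental evaluations per profile point in this direct form are what a rational polynomial approximation collapses into a handful of multiplies and one divide, hence its suitability for SPE and GPU implementation.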
Chemical Transport Models on Accelerator Architectures
Heterogeneous multicore chipsets with many layers of polymorphic parallelism are becoming increasingly common in high-performance computing systems. Homogeneous co-processors with many streaming processors also offer unprecedented peak floating-point performance. Effective use of parallelism in these new chipsets is paramount. We present optimization techniques for 3D chemical transport models that take full advantage of emerging Cell Broadband Engine and graphics processing unit (GPU) technology. Our techniques achieve 2.15x the per-node performance of an IBM BlueGene/P on the Cell Broadband Engine, and a strongly scalable 1.75x the per-node performance of an IBM BlueGene/P on an NVIDIA GeForce 8600.
Using GPUs to Meet Next Generation Weather Model Computational Requirements
Weather prediction goals within the Earth Science Research Laboratory at NOAA require significant increases in model resolution (~1 km) and forecast durations (~60 days) to support expected requirements in 5 years or less. However, meeting these goals will likely require at least 100k dedicated cores. Few systems will exist that could even run such a large problem, and few facilities could provide the necessary power and cooling. To meet our goals we are exploring alternative technologies, including Graphics Processing Units (GPUs), that could provide significantly more computational performance and reduced power and cooling requirements at a lower cost than traditional high-performance computing systems.
Our current global numerical weather prediction model, the Flow-following finite-volume Icosahedral Model (FIM, http://fim.noaa.gov), is still early in its development but is already demonstrating good fidelity and excellent scalability to 1000s of cores. The icosahedral grid has several complexities not present in more traditional Cartesian grids, including polygons with different numbers of sides (five and six) and non-trivial computation of the locations of neighboring grid cells. FIM uses an indirect addressing scheme that yields very compact code despite these complexities. We have extracted computational kernels that encompass the functions likely to take the most time at higher resolutions, including all that have horizontal dependencies. The kernels implement equations for computing anti-diffusive flux-corrected transport across cell edges, calculating forcing terms and time-step differencing, and re-computing time-dependent vertical coordinates. We are extending these kernels to explore the performance of GPU-specific optimizations.
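The indirect addressing scheme can be sketched as follows: each cell stores a list of its neighbors (5 entries for the 12 pentagons, 6 otherwise), so one compact loop handles both polygon types. The exchange below is simple diffusive mixing rather than FIM's flux-corrected transport; the gathers through the neighbor array are the irregular memory references that are challenging to map onto GPUs.

```python
# Indirect-addressing sketch for an unstructured (icosahedral-like) grid.
# `nbr[c]` lists the neighbor indices of cell c; its length may be 5 or 6.

def diffuse(q, nbr, coeff):
    """One exchange step: q[c] gains coeff * (q[neighbor] - q[c]) per edge."""
    out = list(q)
    for c in range(len(q)):
        for d in nbr[c]:                  # 5 or 6 indirect neighbor loads
            out[c] += coeff * (q[d] - q[c])
    return out
```

Because each edge contributes equal and opposite updates when neighbor lists are symmetric, the global sum of q is conserved, a property edge-based transport schemes rely on.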
We will present initial performance results from the computational kernels of the FIM model, as well as the challenges related to porting code with indirect memory references to NVIDIA GPUs. Results of this investigation should benefit the design of our next-generation icosahedral weather and climate models.