Earth and Space Science Informatics [IN]

IN23C
 MC: Hall D  Tuesday  1340h

Emerging Multicore Computing Technology in Earth and Space Sciences II Posters


Presiding:  J Michalakes, NCAR; P Messmer, Tech-X Corporation; M Halem, University of Maryland, Baltimore County; S Zhou, NASA Goddard Space Flight Center

IN23C-1095

A Power Efficient Exaflop Computer Design for Global Cloud System Resolving Climate Models.

* Wehner, M F mfwehner@lbl.gov, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd MS50F, Berkeley, CA 94720, United States
Oliker, L loliker@lbl.gov, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd MS50F, Berkeley, CA 94720, United States
Shalf, J jashalf@lbl.gov, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd MS50F, Berkeley, CA 94720, United States

Exascale computers would allow routine ensemble modeling of the global climate system at the cloud-system-resolving scale. Power and cost requirements of traditional-architecture systems are likely to delay such capability for many years. We present an alternative route to the exascale using embedded processor technology to design a system optimized for ultra-high-resolution climate modeling. These power-efficient processors, used in consumer electronic devices such as mobile phones, portable music players, and cameras, can be tailored to the specific needs of scientific computing. We project that a system capable of integrating a kilometer-scale climate model a thousand times faster than real time could be designed and built on a five-year time scale for US$75M with a power consumption of 3 MW. This would be cheaper, more power efficient, and available sooner than a system based on any other existing technology.
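
Taking the titular exaflop together with the quoted 3 MW at face value, a back-of-envelope reading of the abstract's own numbers gives the implied sustained power efficiency:

```latex
\frac{10^{18}\ \mathrm{flop/s}}{3\times10^{6}\ \mathrm{W}}
  \approx 3.3\times10^{11}\ \mathrm{flop/s\ per\ W}
  \approx 330\ \mathrm{Gflop/s\ per\ W}
```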

IN23C-1096

GPU Particle Tracking and MHD Simulations with Greatly Enhanced Computational Speed

Ziemba, T ziemba@eagleharbortech.com, Eagle Harbor Technologies, Inc., 321 High School Rd NE STE D3 #179, Bainbridge Island, WA 98110, United States
O'Donnell, D dano@eagleharbortech.com, Eagle Harbor Technologies, Inc., 321 High School Rd NE STE D3 #179, Bainbridge Island, WA 98110, United States
Carscadden, J johnc@eagleharbortech.com, Eagle Harbor Technologies, Inc., 321 High School Rd NE STE D3 #179, Bainbridge Island, WA 98110, United States
* Cash, M mcash@u.washington.edu, University of Washington, Department of Earth and Space Sciences, Seattle, WA 98195-1310, United States
Winglee, R winglee@ess.washington.edu, University of Washington, Department of Earth and Space Sciences, Seattle, WA 98195-1310, United States
Harnett, E eharnett@ess.washington.edu, University of Washington, Department of Earth and Space Sciences, Seattle, WA 98195-1310, United States

GPUs are intrinsically highly parallel systems that provide more than an order of magnitude greater computing speed than CPU-based systems, for less than the cost of a high-end workstation. Recent advancements in GPU technologies allow for full IEEE floating-point compliance, with performance up to several hundred GFLOPs per GPU, and new software architectures have recently become available to ease the transition from graphics-based to scientific applications. This allows for a cheap alternative to standard supercomputing methods and should decrease the time to discovery. 3-D particle tracking and MHD codes have been developed using NVIDIA's CUDA and have demonstrated speedups of nearly a factor of 20 over equivalent CPU versions of the codes. Such speedups enable new applications, including real-time running of radiation belt simulations and real-time running of global magnetospheric simulations, both of which could provide important space weather prediction tools.
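
To illustrate why particle tracking maps so naturally onto a GPU, here is a minimal sketch (not the authors' code): one CUDA thread advances one particle with a standard Boris push. The uniform field values, charge-to-mass ratio, time step, and particle count are all illustrative assumptions.

```cuda
#include <cuda_runtime.h>

__device__ float3 cross3(float3 a, float3 b) {
    return make_float3(a.y*b.z - a.z*b.y,
                       a.z*b.x - a.x*b.z,
                       a.x*b.y - a.y*b.x);
}

// One thread per particle: half E-kick, B-rotation, half E-kick, drift.
__global__ void boris_push(float3 *pos, float3 *vel, int n,
                           float3 E, float3 B, float qm, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float h = 0.5f * qm * dt;
    float3 vm = make_float3(vel[i].x + h*E.x, vel[i].y + h*E.y, vel[i].z + h*E.z);
    float3 t  = make_float3(h*B.x, h*B.y, h*B.z);
    float t2  = t.x*t.x + t.y*t.y + t.z*t.z;
    float3 s  = make_float3(2.f*t.x/(1.f+t2), 2.f*t.y/(1.f+t2), 2.f*t.z/(1.f+t2));
    float3 vp = cross3(vm, t);
    vp = make_float3(vm.x+vp.x, vm.y+vp.y, vm.z+vp.z);
    float3 vc = cross3(vp, s);
    float3 v  = make_float3(vm.x+vc.x + h*E.x, vm.y+vc.y + h*E.y, vm.z+vc.z + h*E.z);

    vel[i] = v;
    pos[i] = make_float3(pos[i].x + v.x*dt, pos[i].y + v.y*dt, pos[i].z + v.z*dt);
}

int main() {
    const int n = 1 << 20;                  // one million test particles
    float3 *pos, *vel;
    cudaMalloc(&pos, n*sizeof(float3));
    cudaMalloc(&vel, n*sizeof(float3));
    cudaMemset(pos, 0, n*sizeof(float3));
    cudaMemset(vel, 0, n*sizeof(float3));
    float3 E = make_float3(0.f, 0.f, 1e3f); // assumed uniform fields
    float3 B = make_float3(0.f, 0.f, 1e-5f);
    for (int step = 0; step < 100; ++step)  // electron q/m, 1 us step (assumed)
        boris_push<<<(n+255)/256, 256>>>(pos, vel, n, E, B, -1.76e11f, 1e-6f);
    cudaDeviceSynchronize();
    cudaFree(pos); cudaFree(vel);
    return 0;
}
```

Because every particle is independent, the kernel keeps thousands of threads busy with no inter-thread communication, which is exactly the regime where a GPU outruns a CPU by the factors quoted above.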

IN23C-1097

Performance Evaluation of Emerging High Performance Computing Technologies using WRF

* Newby, G B newby@arsc.edu, Arctic Region Supercomputing Center, Univ. Alaska Fairbanks 909 Koyukuk Dr., Suite 105, Fairbanks, AK 99775-6020, United States
Morton, D morton@arsc.edu, Arctic Region Supercomputing Center, Univ. Alaska Fairbanks 909 Koyukuk Dr., Suite 105, Fairbanks, AK 99775-6020, United States

The Arctic Region Supercomputing Center (ARSC) has evaluated multicore processors and other emerging processor technologies for a variety of high performance computing applications in the earth and space sciences, especially climate and weather applications. A flagship effort has been to assess dual-core processor nodes on ARSC's Midnight supercomputer, in which two-socket systems were compared to eight-socket systems. Midnight is utilized for ARSC's twice-daily Weather Research and Forecasting (WRF) model runs, available at weather.arsc.edu. Among other results on Midnight, it was found that the HyperTransport system for interconnecting Opteron processors, memory, and other subsystems does not scale as well on eight-socket (sixteen-processor) systems as on two-socket (four-processor) systems. A fundamental limitation is the cache snooping operation performed whenever a computational thread accesses main memory, which increases memory latency as the number of processor sockets increases. This is particularly noticeable on applications such as WRF that are primarily CPU-bound, versus applications that are bound by input/output or communication. The new Cray XT5 supercomputer at ARSC features quad-core processors and will host a variety of scaling experiments for WRF, CCSM4, and other models. Early results will be presented, including a series of WRF runs for Alaska with grid resolutions under 2 km. ARSC will discuss a set of standardized test cases for the Alaska domain, similar to existing test cases for CONUS. These test cases will provide different configuration sizes and resolutions, suitable for runs from a single processor up to thousands of processors. Beyond multicore Opteron-based supercomputers, ARSC has examined WRF and other applications on additional emerging technologies. One such technology is the graphics processing unit, or GPU. The 9800-series NVIDIA GPU was evaluated with the cuBLAS software library. While in-socket GPUs might be forthcoming in the future, current generations of GPUs lack a sufficient balance of computational resources to replace the general-purpose microprocessor found in most traditional supercomputer architectures. ARSC has also worked with the Cell Broadband Engine in a small PlayStation 3 cluster, as well as a 24-processor system based on IBM's QS22 blades. The QS22 system, called Quasar, features the PowerXCell 8i processor found in the RoadRunner system, along with an InfiniBand network and high performance storage. Quasar overcomes the limitations of the small memory and relatively slow network of the PS3 systems. The presentation will include system-level benchmarks on Quasar, as well as evaluation of the WRF test cases. Another technology evaluation focused on Sun's UltraSPARC T2+ processor, which ARSC evaluated in a two-way system. Each T2+ provides eight processor cores, each of which provides eight threads, for a total of 128 threads in a single system. WRF scalability was good up to the number of cores, but multiple threads per core did not scale as well. Throughout the discussion, practical findings from ARSC will be summarized. While multicore general-purpose microprocessors are anticipated to remain important for large computers running earth and space science applications, the role of other potentially disruptive technologies is less certain. Limitations of current and future technologies will be discussed.

http://weather.arsc.edu
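
For context, a cuBLAS evaluation of the kind mentioned above typically times a dense single-precision matrix multiply. The sketch below shows that pattern; it is written against the current cublas_v2 interface rather than the 2008-era API, and the matrix size is an assumed benchmark parameter.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 2048;                       // assumed benchmark size
    std::vector<float> h(n * n, 1.0f);
    float *A, *B, *C;
    cudaMalloc(&A, n*n*sizeof(float));
    cudaMalloc(&B, n*n*sizeof(float));
    cudaMalloc(&C, n*n*sizeof(float));
    cudaMemcpy(A, h.data(), n*n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(B, h.data(), n*n*sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t t0, t1;                       // GPU-side timing
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    // Dense matrix multiply costs 2*n^3 floating-point operations.
    printf("SGEMM %d^3: %.1f GFLOP/s\n", n, 2.0*n*n*n / (ms*1e6));

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```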

IN23C-1098

Computing the Delta-Eddington Approximation for Solar Radiation With Hardware Accelerators: Performance and Programmability on GPUs, FPGAs, and Microprocessors.

* Kelly, R C rory@ucar.edu, NCAR, 1850 Table Mesa Dr, Boulder, CO 80305, United States
Garcia, J jgarcia@ucar.edu, NCAR, 1850 Table Mesa Dr, Boulder, CO 80305, United States

The raddedmx routine is a computationally expensive portion of the short-wave radiation calculations in the NCAR Community Climate System Model (CCSM). The routine calculates the Delta-Eddington approximation on columns of independent data and performs a large number of floating-point operations per byte of data accessed, making it a good candidate for hardware acceleration. We compare several implementation strategies for the raddedmx computation on two hardware acceleration platforms, Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), and analyze the computational speedups that can be realized, in addition to the programmability and software engineering effort required on the different platforms. We find implementations that realize computational speedups in excess of 400x, and overall speedups in excess of 30x including data transfer overhead, versus the microprocessor of the host system. We discuss limitations of the various platforms and implementations, additional features that could improve performance, and the possibility of extending the work to accelerate other portions of the CCSM.
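
A minimal sketch of the column-parallel strategy such a port relies on (not the actual raddedmx code, which is considerably more involved): one CUDA thread owns one atmospheric column and applies the standard delta-Eddington scaling of optical depth, single-scattering albedo, and asymmetry factor at every level. The array layout is an assumption chosen so that adjacent threads access adjacent columns.

```cuda
// Launch as e.g. delta_eddington_scale<<<(ncol+127)/128, 128>>>(...).
__global__ void delta_eddington_scale(const float *tau, const float *omega,
                                      const float *g, float *tau_s,
                                      float *omega_s, float *g_s,
                                      int ncol, int nlev)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= ncol) return;

    for (int k = 0; k < nlev; ++k) {
        int i = k * ncol + col;        // coalesced: neighboring threads read
        float f = g[i] * g[i];         // neighboring columns at the same level
        tau_s[i]   = (1.f - omega[i] * f) * tau[i];
        omega_s[i] = (1.f - f) * omega[i] / (1.f - omega[i] * f);
        g_s[i]     = g[i] / (1.f + g[i]);
    }
}
```

Since the columns are independent and each level requires several multiplies and a divide per value loaded, the kernel has exactly the high flop-per-byte ratio the abstract identifies as the reason the routine accelerates well.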

IN23C-1099

Acceleration of Data Analysis Applications using GPUs

* Fillmore, D fillmore@txcorp.com, Tech-X Corporation, 5621 Arapahoe Ave Suite A, Boulder, CO 80303, United States
Messmer, P messmer@txcorp.com, Tech-X Corporation, 5621 Arapahoe Ave Suite A, Boulder, CO 80303, United States
Mullowney, P paulm@txcorp.com, Tech-X Corporation, 5621 Arapahoe Ave Suite A, Boulder, CO 80303, United States
Amyx, K amyx@txcorp.com, Tech-X Corporation, 5621 Arapahoe Ave Suite A, Boulder, CO 80303, United States

The vast amount of data collected by present and future scientific instruments, sensors, and numerical models requires a significant increase in computing power for analysis. In many cases, processing time on a single workstation becomes impractical. While clusters of commodity processors can be utilized to accelerate some of these tasks, the relatively high software development cost, as well as acquisition and operational costs, make them less attractive for broad use. Over the past few years, another class of architectures has gained some popularity, namely heterogeneous architectures, which consist of general-purpose processors connected to specialized processors. One of the most prominent examples is the Graphics Processing Unit (GPU), which offers a tremendous amount of floating-point processing power thanks to the demand for high-quality graphics in the computer game market. However, in order to harness this processing power, software developers need a detailed understanding of the underlying hardware. This burden on the developer is often hard to justify considering the rapid evolution of the hardware. In this talk, we will introduce GPULib, an open source library that enables scientists to accelerate their data analysis tasks using the GPUs already installed in their systems from within high-level languages like IDL or MATLAB, and present examples and possible speedups from real-world data analysis applications. This work is funded through NASA Phase II SBIR Grant NNG06CA13C.
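
The general pattern behind such a library (sketched below; this is not GPULib's actual API, and the function name and signature are hypothetical) is a C-linkage wrapper around a CUDA kernel that a high-level language can reach through its external-library interface, such as IDL's CALL_EXTERNAL or a MATLAB MEX file.

```cuda
#include <cuda_runtime.h>

__global__ void scale_offset_kernel(float *x, int n, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i] + b;      // simple per-element transform
}

// C linkage so IDL/MATLAB can bind the symbol without C++ name mangling.
extern "C" int gpu_scale_offset(float *host_x, int n, float a, float b)
{
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d, host_x, n * sizeof(float), cudaMemcpyHostToDevice);
    scale_offset_kernel<<<(n + 255) / 256, 256>>>(d, n, a, b);
    cudaMemcpy(host_x, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return cudaGetLastError() == cudaSuccess ? 0 : -1;
}
```

The design point is that the analyst never touches the hardware details: the high-level language passes an array in, the wrapper handles device memory and kernel launch, and the result comes back in place.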

IN23C-1100

Experiences modeling ocean circulation problems on a 30-node commodity cluster with 3840 GPU processor cores.

* Hill, C cnh@mit.edu, M.I.T., 54-1515, 77 Mass. Ave, Cambridge, MA 02139, United States

Low-cost graphics cards today use many relatively simple compute cores to deliver memory bandwidth of more than 100 GB/s and theoretical floating-point performance of more than 500 GFlop/s. Right now this performance is, however, only accessible to highly parallel algorithm implementations that (i) can use a hundred or more concurrently executing 32-bit floating-point cores, (ii) can work with graphics memory that resides on the graphics card side of the graphics bus, and (iii) can be partially expressed in a language that can be compiled by a graphics programming tool. In this talk we describe our experiences implementing a complete, but relatively simple, time-dependent shallow-water equations simulation targeting a cluster of 30 computers, each hosting one graphics card. The implementation takes into account considerations (i), (ii), and (iii) listed previously. We code our algorithm as a series of numerical kernels. Each kernel is designed to be executed by multiple threads of a single process. Kernels are passed memory blocks to compute over, which can be persistent blocks of memory on the graphics card. Each kernel is individually implemented using the NVIDIA CUDA language but driven from a higher-level supervisory code that is almost identical to a standard model driver. The supervisory code controls the overall simulation timestepping, but is written to minimize data transfer between main memory and graphics memory (a massive performance bottleneck on current systems). Using the recipe outlined, we can boost the performance of our cluster by nearly an order of magnitude relative to the same algorithm executing only on the cluster CPUs. Achieving this performance boost requires that many threads are available to each graphics processor for execution within each numerical kernel, and that the simulation's working set of data fits into the graphics card memory. As we describe, this puts interesting upper and lower bounds on the problem sizes for which this technology is currently most useful. However, many interesting problems fit within this envelope. Looking forward, we extrapolate our experience to estimate full-scale ocean model performance and applicability. Finally, we describe preliminary hybrid mixed 32-bit and 64-bit experiments with graphics cards that support 64-bit arithmetic, albeit at lower performance.
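
A minimal sketch in the spirit of the kernels described above (not the authors' code): one timestep of the shallow-water continuity equation on a 2D grid that lives permanently in graphics memory, so only boundary halos ever cross the graphics bus. The centered-difference stencil, indexing, and grid metrics are illustrative assumptions.

```cuda
// h_t = -( (hu)_x + (hv)_y ) with centered differences on interior points.
__global__ void continuity_step(const float *h, const float *u, const float *v,
                                float *h_new, int nx, int ny,
                                float dt, float dx, float dy)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || j < 1 || i >= nx - 1 || j >= ny - 1) return;

    int c = j * nx + i;
    float dhu = (h[c+1]*u[c+1]   - h[c-1]*u[c-1])   / (2.f * dx);
    float dhv = (h[c+nx]*v[c+nx] - h[c-nx]*v[c-nx]) / (2.f * dy);
    h_new[c] = h[c] - dt * (dhu + dhv);
}
// A supervisory host driver would swap the h/h_new pointers each step and
// exchange only boundary halos with neighboring cluster nodes.
```

The kernel exposes one thread per grid cell, which is why the problem size has the lower bound the abstract mentions: too few cells and the GPU's cores sit idle; too many and the working set no longer fits in graphics memory.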

IN23C-1101

Computational Performance of the UAF Eulerian Parallel Polar Ionosphere Model (UAF EPPIM)

* Maurits, S maurits@arsc.edu, University of Alaska Fairbanks, ARSC, West Ridge Research Building, 909 Koyukuk Dr., Suite 105, PO Box 756020, Fairbanks, AK 99775, United States
Kulchitsky, A kulchits@arsc.edu, University of Alaska Fairbanks, ARSC, West Ridge Research Building, 909 Koyukuk Dr., Suite 105, PO Box 756020, Fairbanks, AK 99775, United States

Improvement of the computational resolution of geophysical applications addresses various practical needs. For ionospheric modeling, high resolution is a requirement of radio propagation tasks that demand gradient-resolving capabilities. In practical terms this means a model horizontal resolution of a few tens of kilometers or better in the polar ionospheric regions, where complicated ionospheric structures typically occur. Such resolution is achievable in parallel and/or multi-core computational environments, which impose scalability requirements on the model formulation, the numerical algorithm, and the code. To achieve this performance, the UAF Eulerian Parallel Polar Ionosphere Model (UAF EPPIM) was created and refined at the Arctic Region Supercomputing Center (ARSC) using an Eulerian frame, which fixes the computational mesh inside the domain. The Eulerian frame thus facilitates computational data locality and allows effective parallelization of computational tasks. Applying double domain decomposition, each processor in the parallel partition covers a vertical "column" and a horizontal "layer" of the domain. During one time step, this partition permits global addressing of all variables in the vertical direction (advection-diffusion, heat-transfer, and chemistry solvers) as well as in the horizontal plane (advection solver), minimizing the need for data exchange during computation. Advancing the entire domain in time requires exchanging the updated data just twice per time step: first after the advection solver is applied, and then after the vertical solver is applied. As a result, the model's numerical algorithm and the MPI-based Fortran code scale well. Even with large partitions of tens of processors, code performance is demonstrated to be at the level of 50% of theoretical peak or better. These performance gains are sustainable on a wide range of computational platforms, from workstation-class machines to MPP supercomputers, including current multi-core architectures. The UAF EPPIM is applicable to high-resolution case studies, as well as to a routine continuous run on a four-core workstation (http://spaceweather.arsc.edu) with a useful horizontal resolution of 30×30 km. The presentation summarizes more than a decade of experience running the EPPIM on different parallel platforms and the improvement gained by the transition to high-performance parallel and multi-core architectures. Specifics of the code implementation in the newest multi-core environments are also discussed.

http://spaceweather.arsc.edu
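
A schematic of the two-exchange timestep described above, written here in C++ with MPI rather than the model's Fortran; the function names and block-size bookkeeping are hypothetical. Each rank holds one horizontal "layer" block and one vertical "column" block, and all-to-all transposes move data between the two decompositions exactly twice per step.

```cpp
#include <mpi.h>

// block = number of values exchanged between each pair of ranks per transpose.
void eppim_timestep(double *layer, double *column, int block, MPI_Comm comm)
{
    // 1) Horizontal advection on this rank's layer: purely local work.
    // advect_horizontal(layer);   // placeholder for the real solver

    // First exchange: transpose layer decomposition -> column decomposition.
    MPI_Alltoall(layer, block, MPI_DOUBLE, column, block, MPI_DOUBLE, comm);

    // 2) Vertical advection-diffusion, heat transfer, chemistry: local work.
    // solve_vertical(column);     // placeholder for the real solver

    // Second exchange: transpose back to the horizontal decomposition.
    MPI_Alltoall(column, block, MPI_DOUBLE, layer, block, MPI_DOUBLE, comm);
}
```

Confining all communication to these two collective transposes is what lets the solvers address their data globally along one direction at a time while keeping the communication-to-computation ratio low enough to scale.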

IN23C-1102

Cheaper and faster: How to have your cake and eat it too with GPU implementations of Earth Science simulations.

* Walsh, S D sdcwalsh@umn.edu, University of Minnesota, 310 Pillsbury Dr. SE, Minneapolis, MN 55455, United States
Saar, M O saar@umn.edu, University of Minnesota, 310 Pillsbury Dr. SE, Minneapolis, MN 55455, United States
Bailey, P bail0253@umn.edu, University of Minnesota, 310 Pillsbury Dr. SE, Minneapolis, MN 55455, United States
Lilja, D J lilja@umn.edu, University of Minnesota, 310 Pillsbury Dr. SE, Minneapolis, MN 55455, United States

Many complex natural systems studied in the geosciences are characterized by simple local-scale interactions that result in complex emergent behavior. Simulations of these systems, often implemented in parallel using standard CPU clusters, may be better suited to parallel processing environments with large numbers of simple processors. Such an environment is found in Graphics Processing Units (GPUs) on graphics cards. This presentation discusses graphics card implementations of three example applications from volcanology, seismology, and rock magnetics. These candidate applications involve important modeling techniques, widely employed in physical system simulation: 1) a multiphase lattice-Boltzmann code for geofluidic flows; 2) a spectral-finite-element code for seismic wave propagation simulations; and 3) a least-squares minimization code for interpreting magnetic force microscopy data. Significant performance increases, between one and two orders of magnitude, are seen in all three cases, demonstrating the power of graphics card implementations for these types of simulations.
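
As an illustration of the local-interaction structure that makes these methods GPU-friendly, here is a minimal sketch of a lattice-Boltzmann style kernel (not the authors' multiphase code): the D2Q9 BGK collision step, one CUDA thread per lattice node. The relaxation time and structure-of-arrays layout are illustrative assumptions; streaming and multiphase forcing are omitted.

```cuda
__constant__ float w9[9]  = { 4.f/9,  1.f/9,  1.f/9,  1.f/9,  1.f/9,
                              1.f/36, 1.f/36, 1.f/36, 1.f/36 };
__constant__ int   cx9[9] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
__constant__ int   cy9[9] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

__global__ void bgk_collide(float *f, int nnode, float tau)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= nnode) return;

    // Moments: density and velocity at this node.
    float rho = 0.f, ux = 0.f, uy = 0.f;
    for (int q = 0; q < 9; ++q) {
        float fq = f[q * nnode + n];     // structure-of-arrays layout
        rho += fq;  ux += fq * cx9[q];  uy += fq * cy9[q];
    }
    ux /= rho;  uy /= rho;

    // Relax each distribution toward its local equilibrium.
    float usq = ux*ux + uy*uy;
    for (int q = 0; q < 9; ++q) {
        float cu  = cx9[q]*ux + cy9[q]*uy;
        float feq = w9[q] * rho * (1.f + 3.f*cu + 4.5f*cu*cu - 1.5f*usq);
        f[q * nnode + n] -= (f[q * nnode + n] - feq) / tau;
    }
}
```

Every node updates from purely local state, so the emergent fluid behavior comes from millions of identical, independent per-node updates, exactly the workload a graphics card with many simple processors handles well.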

IN23C-1103

Utilizing Multiple CPU Cores to Improve Dust Simulation Performance

* Xie, J jxie2@gmu.edu, Joint Center for Intelligent Spatial Computing (CISC), College of Science, George Mason University, 4400 Univ. Dr., Fairfax, VA 22030-4444, United States
Yang, C cyang3@gmu.edu, Joint Center for Intelligent Spatial Computing (CISC), College of Science, George Mason University, 4400 Univ. Dr., Fairfax, VA 22030-4444, United States
Pejanovic, G goran.pejanovic@sewa-weather.com, South Environment and Weather Agency, Beograd 11001, Serbia
Zhou, B senosy@gmail.com, Joint Center for Intelligent Spatial Computing (CISC), College of Science, George Mason University, 4400 Univ. Dr., Fairfax, VA 22030-4444, United States
Huang, Q qhuang1@gmu.edu, Joint Center for Intelligent Spatial Computing (CISC), College of Science, George Mason University, 4400 Univ. Dr., Fairfax, VA 22030-4444, United States

Dust simulation models are typical time-consuming applications in the Earth sciences. Previous research by the University of Arizona and others simulated dust events for environmental and atmospheric sciences with the Eta weather model at a lower resolution. To enable simulation of dust storms over the southwestern U.S. at a resolution improved to the ZIP-code level, the model must be enhanced and, consequently, more computing capacity is needed. To leverage computing clusters to enhance performance, a parallel version of the dust simulation model was developed within this research. The research migrates an Eta-based sequential dust model to a higher-resolution Nonhydrostatic Mesoscale Model (NMM) based model running on multiple CPU cores by 1) modifying the sequential Eta dust simulation model to fit into a High Performance Computing (HPC) environment, parallelizing the dust model on top of the NMM weather forecasting model using the Message Passing Interface (MPI); 2) testing the performance and initially validating the parallel dust simulation model using the southwestern United States as the experiment area for a dust event during January 7-8, 2008; and 3) installing the improved model on a Linux cluster with 28 computing nodes and over 200 CPU cores. It is demonstrated that the parallelized version of the dust simulation model achieves good speedup and efficiency when running on multiple CPU cores. The maximum speedup for the study case is about 9.5 when 64 CPU cores are used.
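
For reference, the reported maximum speedup corresponds to a parallel efficiency of:

```latex
E = \frac{S}{p} = \frac{9.5}{64} \approx 0.15
```

that is, roughly 15% on 64 cores for this case.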