A Power Efficient Exaflop Computer Design for Global Cloud System Resolving Climate Models.
Exascale computers would allow routine ensemble modeling of the global climate system at the cloud-system-resolving scale. The power and cost requirements of traditional-architecture systems are likely to delay such capability for many years. We present an alternative route to the exascale, using embedded processor technology to design a system optimized for ultra-high-resolution climate modeling. These power-efficient processors, used in consumer electronic devices such as mobile phones, portable music players, and cameras, can be tailored to the specific needs of scientific computing. We project that a system capable of integrating a kilometer-scale climate model a thousand times faster than real time could be designed and built within five years for US$75M, with a power consumption of 3 MW. This would be cheaper, more power efficient, and available sooner than systems based on any other existing technology.
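As a quick sanity check on these projections, the implied efficiency figures can be computed directly. Note that the 1 exaflop sustained rate is an assumption taken from the title, not a number stated in the abstract:

```python
# Back-of-envelope check of the projected system. Power and cost figures are
# from the abstract; the 1 exaflop sustained rate is an assumption.
PEAK_FLOPS = 1.0e18   # 1 exaflop, assumed sustained target
POWER_W    = 3.0e6    # 3 MW projected power draw
COST_USD   = 75.0e6   # US$75M projected cost

gflops_per_watt = PEAK_FLOPS / POWER_W / 1e9
mflops_per_dollar = PEAK_FLOPS / COST_USD / 1e6

print(f"{gflops_per_watt:.0f} GFLOP/s per watt")
print(f"{mflops_per_dollar:.0f} MFLOP/s per dollar")
```

At roughly 333 GFLOP/s per watt, the projection is orders of magnitude beyond the efficiency of contemporary general-purpose systems, which is the abstract's central claim.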
GPU Particle Tracking and MHD Simulations with Greatly Enhanced Computational Speed
GPUs are intrinsically highly parallelized systems that provide more than an order of magnitude greater computing speed than CPU-based systems, for less than the cost of a high-end workstation. Recent advancements in GPU technology allow full IEEE floating-point compliance with performance up to several hundred GFLOPs per GPU, and new software architectures have recently become available to ease the transition from graphics-based to scientific applications. This offers a cheap alternative to standard supercomputing methods and should decrease the time to discovery. 3-D particle tracking and MHD codes have been developed using NVIDIA's CUDA and have demonstrated speedups of nearly a factor of 20 over equivalent CPU versions of the codes. Such speedups enable new applications, including real-time running of radiation belt simulations and real-time running of global magnetospheric simulations, both of which could provide important space weather prediction tools.
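The reason particle tracking maps so naturally onto a GPU is that every test particle is advanced independently, so one GPU thread per particle is the obvious decomposition. A minimal sketch of that structure, vectorized over particles in NumPy rather than CUDA (field strength and step size are illustrative choices, not values from the authors' codes):

```python
import numpy as np

# Each particle is advanced independently -> one GPU thread per particle.
# Here the particle loop is vectorized with NumPy instead. The push uses the
# standard Boris rotation (magnetic field only), which conserves particle
# speed exactly; field and time step are illustrative.
q_over_m, dt = 1.0, 0.01
B = np.array([0.0, 0.0, 1.0])                  # uniform field along z

def boris_push(v):
    t = 0.5 * q_over_m * dt * B                # half-step rotation vector
    s = 2.0 * t / (1.0 + t @ t)
    v_prime = v + np.cross(v, t)
    return v + np.cross(v_prime, s)

rng = np.random.default_rng(0)
v = rng.normal(size=(1000, 3))                 # 1000 independent particles
speed0 = np.linalg.norm(v, axis=1)
for _ in range(200):
    v = boris_push(v)
print(np.allclose(np.linalg.norm(v, axis=1), speed0))   # speeds conserved
```

Because there are no inter-particle dependencies, the same loop parallelizes trivially across thousands of GPU cores, which is what makes the reported factor-of-20 speedups attainable.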
Performance Evaluation of Emerging High Performance Computing Technologies using WRF
The Arctic Region Supercomputing Center (ARSC) has evaluated multicore
processors and other emerging processor technologies for a variety of
high performance computing applications in the earth and space
sciences, especially climate and weather applications. A flagship
effort has been to assess dual core processor nodes on ARSC's Midnight
supercomputer, in which two-socket systems were compared to
eight-socket systems. Midnight is utilized for ARSC's twice-daily
weather research and forecasting (WRF) model runs, available at
Among other findings on Midnight, it was found that the HyperTransport
system for interconnecting Opteron processors, memory, and other
subsystems does not scale as well on eight-socket (sixteen-processor)
systems as on two-socket (four-processor) systems. A fundamental
limitation is the cache snooping operation performed whenever a
computational thread accesses main memory. This increases memory
latency as the number of processor sockets increases. This is
particularly noticeable on applications such as WRF that are primarily
CPU-bound, versus applications that are bound by input/output or other resources.
The new Cray XT5 supercomputer at ARSC features quad core processors,
and will host a variety of scaling experiments for WRF, CCSM4, and
other models. Early results will be presented, including a series of
WRF runs for Alaska with grid resolutions under 2km. ARSC will
discuss a set of standardized test cases for the Alaska domain,
similar to existing test cases for CONUS. These test cases will
provide different configuration sizes and resolutions, suitable for
runs on anything from a single processor up to thousands of processors.
Beyond multi-core Opteron-based supercomputers, ARSC has examined WRF
and other applications on additional emerging technologies. One such
technology is the graphics processing unit, or GPU. The 9800-series
NVIDIA GPU was evaluated with the cuBLAS software library. While
in-socket GPUs may be forthcoming, current generations of GPUs lack a
sufficient balance of computational resources to replace the
general-purpose microprocessor found in most traditional
supercomputers.
ARSC has also worked with the Cell Broadband Engine in a small
Playstation3 cluster, as well as a 24-processor system based on IBM's
QS22 blades. The QS22 system, called Quasar, features the PowerXCell
8i processor found in the Roadrunner system, along with an InfiniBand
network and high performance storage. Quasar overcomes the
limitations of the small memory and relatively slow network of the PS3
systems. The presentation will include system-level benchmarks on
Quasar, as well as evaluation of the WRF test cases.
Another technology evaluation focused on Sun's UltraSPARC T2+
processor, which ARSC evaluated in a two-way system. Each T2+
provides eight processor cores, each of which provides eight threads,
for a total of 128 threads in a single system. WRF scalability was
good up to the number of cores, but multiple threads per core did not
scale as well.
Throughout the discussion, practical findings from ARSC will be
summarized. While multicore general-purpose microprocessors are
anticipated to remain important for large computers running earth and
space science applications, the role of other potentially disruptive
technologies is less certain. Limitations of current and future
technologies will be discussed.
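A recurring measurement behind findings like those above is strong-scaling efficiency, which compares the observed speedup to the processor count. A minimal sketch, using hypothetical placeholder timings rather than measured ARSC results:

```python
# Strong-scaling efficiency from wall-clock timings, as used when comparing
# two-socket and eight-socket runs. The timings below are hypothetical
# placeholders, not measured ARSC results.
def efficiency(t1, tn, n):
    """Parallel efficiency: speedup (t1 / tn) divided by processor count n."""
    return (t1 / tn) / n

# Hypothetical: 1 processor takes 1600 s; 16 processors take 140 s.
eff = efficiency(1600.0, 140.0, 16)
print(f"speedup {1600.0 / 140.0:.1f}x, efficiency {eff:.0%}")
```

Sub-linear scaling of this kind is exactly what the memory-latency growth from cache snooping produces as socket counts rise.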
Computing the Delta-Eddington Approximation for Solar Radiation With Hardware Accelerators: Performance and Programmability on GPUs, FPGAs, and Microprocessors.
The raddedmx routine is a computationally expensive portion of the short-wave radiation calculations in the NCAR Community Climate System Model (CCSM). The routine calculates the Delta-Eddington Approximation on columns of independent data, and executes a high number of floating point operations per byte of data accessed, making it a good candidate for hardware acceleration. We compare several implementation strategies for the raddedmx computation on two hardware acceleration platforms, Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), and analyze the computational speedups that can be realized in addition to programmability and the software engineering effort required on the different platforms. Implementations are found that are able to realize computational speedups in excess of 400x, and overall speedups in excess of 30x including data transfer overhead, versus the microprocessor of the host system. We discuss limitations of the various platforms and implementations, additional features that could improve performance, and the possibility of extending the work to accelerate other portions of the CCSM.
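The column independence that makes raddedmx a good acceleration candidate can be sketched as follows. The per-column arithmetic here is a placeholder stand-in for the flop-heavy per-level recurrence, not the actual Delta-Eddington equations:

```python
import numpy as np

# Sketch of the data-parallel structure exploited above: each atmospheric
# column is independent, so columns map one-to-one onto GPU threads or FPGA
# pipelines. The per-column work is a placeholder, not Delta-Eddington.
def per_column_work(col):
    # stand-in for a flop-heavy recurrence over vertical levels
    acc = 0.0
    for tau in col:
        acc = 0.5 * acc + np.exp(-tau)
    return acc

def sweep(columns):
    # independent over axis 0 -> trivially parallel across columns
    return np.array([per_column_work(c) for c in columns])

cols = np.full((4, 8), 0.1)   # 4 columns, 8 levels, uniform optical depth
out = sweep(cols)
print(out)
```

Because each column's result depends only on that column's data, the outer loop in `sweep` can be distributed across accelerator threads with no synchronization, and the high flop-per-byte ratio keeps those threads busy once the data arrives.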
Acceleration of Data Analysis Applications using GPUs
The vast amount of data collected by present and future scientific instruments, sensors, and numerical models requires a significant increase in computing power for analysis. In many cases, processing time on a single workstation becomes impractical. While clusters of commodity processors can be utilized to accelerate some of these tasks, the relatively high software development cost, as well as acquisition and operational costs, make them less attractive for broad use. Over the past few years, another class of architectures has gained popularity: heterogeneous architectures, which consist of general-purpose processors connected to specialized processors. One of the most prominent examples is the Graphics Processing Unit (GPU), which offers a tremendous amount of floating-point processing power driven by demand for high-quality graphics in the computer game market. However, in order to harness this processing power, software developers must work with a detailed understanding of the underlying hardware, a burden that is often hard to justify given the rapid evolution of the hardware. In this talk, we will introduce GPULib, an open source library that enables scientists to accelerate their data analysis tasks, from within high-level languages like IDL or MATLAB, using the GPUs already installed in their systems, and present examples and achievable speedups from real-world data analysis applications. This work is funded through NASA Phase II SBIR Grant NNG06CA13C.
Experiences modeling ocean circulation problems on a 30 node commodity cluster with 3840 GPU processor cores.
Low-cost graphics cards today use many relatively simple compute cores to deliver memory bandwidth of more than 100 GB/s and theoretical floating-point performance of more than 500 GFlop/s. At present, however, this performance is only accessible to highly parallel algorithm implementations that (i) can use a hundred or more concurrently executing 32-bit floating-point cores, (ii) can work with graphics memory that resides on the graphics card side of the graphics bus, and (iii) can be at least partially expressed in a language that can be compiled by a graphics programming tool. In this talk we describe our experiences implementing a complete, but relatively simple, time-dependent shallow-water equations simulation targeting a cluster of 30 computers, each hosting one graphics card. The implementation takes into account considerations (i), (ii), and (iii) above. We code our algorithm as a series of numerical kernels. Each kernel is designed to be executed by multiple threads of a single process. Kernels are passed memory blocks to compute over, which can be persistent blocks of memory on a graphics card. Each kernel is individually implemented in the NVIDIA CUDA language but driven from higher-level supervisory code that is almost identical to a standard model driver. The supervisory code controls the overall simulation timestepping, but is written to minimize data transfer between main memory and graphics memory (a massive performance bottleneck on current systems). Using the recipe outlined, we can boost the performance of our cluster by nearly an order of magnitude relative to the same algorithm executing only on the cluster CPUs. Achieving this performance boost requires that many threads are available to each graphics processor within each numerical kernel and that the simulation's working set of data fits into the graphics card memory.
As we describe, this puts interesting upper and lower bounds on the problem sizes for which this technology is currently most useful. However, many interesting problems fit within this envelope. Looking forward, we extrapolate our experience to estimate full-scale ocean model performance and applicability. Finally we describe preliminary hybrid mixed 32-bit and 64-bit experiments with graphics cards that support 64-bit arithmetic, albeit at a lower performance.
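The kernel-plus-supervisory-driver structure described above can be sketched in NumPy rather than CUDA. The grid size, coefficients, and the 1-D linearized equations below are illustrative choices, not the authors' model:

```python
import numpy as np

# Sketch of the kernel/driver split: each "kernel" updates persistent state
# arrays in place (standing in for arrays resident in graphics memory), and
# the supervisory loop performs no per-step host transfers. 1-D linearized
# shallow water on a periodic domain; all parameters are illustrative.
g, H, dx, dt, n = 9.81, 100.0, 1000.0, 1.0, 64
h = np.exp(-((np.arange(n) - n / 2) ** 2) / 20.0)   # initial surface bump
u = np.zeros(n)
h0_mass = h.sum()

def kernel_update_u(u, h):
    # "kernel": velocity update in place, u_t = -g h_x (forward difference)
    u -= dt * g * (np.roll(h, -1) - h) / dx

def kernel_update_h(h, u):
    # "kernel": height update in place, h_t = -H u_x (backward difference)
    h -= dt * H * (u - np.roll(u, 1)) / dx

# supervisory loop: controls timestepping only; state stays "on device"
for _ in range(100):
    kernel_update_u(u, h)
    kernel_update_h(h, u)

print(f"mass drift after 100 steps: {abs(h.sum() - h0_mass):.2e}")
```

The design point is that the driver never copies `h` or `u` inside the loop; on a real GPU cluster those arrays would stay in graphics memory across every kernel launch, which is what avoids the transfer bottleneck the abstract identifies.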
Computational Performance of the UAF Eulerian Parallel Polar Ionosphere Model (UAF EPPIM)
Improving the computational resolution of geophysical applications addresses various practical needs.
For ionospheric modeling, high resolution is a requirement of radio propagation tasks that demand
gradient-resolving capabilities. In practical terms this means a horizontal modeling resolution of at least a
few tens of kilometers, or better, in the polar ionospheric regions, where complicated ionospheric
structures typically occur. Such resolution is achievable in parallel and/or multi-core computational
environments, which impose scalability requirements on model formulation, numerical algorithm, and the
code. To achieve this performance, the UAF Eulerian Parallel Polar Ionosphere Model (UAF EPPIM) was
created and refined at the Arctic Region Supercomputing Center (ARSC) using an Eulerian frame, which
fixes the computational mesh inside the domain and thus facilitates computational data locality and
effective parallelization of computational tasks. With a double domain decomposition, each processor in
the parallel partition covers both a vertical "column" and a horizontal "layer" of the domain. During one
time step, this partition permits global addressing of all variables for the
vertical direction (advection-diffusion, heat-transfer, and chemistry solver) as well as for the horizontal plane
(advection solver), minimizing the need for data exchange during computation. Advancing the entire domain in
time is performed by exchange of the updated data just twice per time step, first, after application of the
advection solver and then after the vertical solver is applied. As a result, the model's numerical algorithm
and MPI-based Fortran code scale well. Even with large partitions of tens of processors, code performance
is demonstrated to be at the level of 50% of the theoretical peak or better. These performance gains are
sustainable on a wide range of computational platforms from workstation class to MPP supercomputers,
including current multi-core architectures. The UAF EPPIM is applicable to high-resolution case studies,
as well as to routine continuous runs on a four-core workstation (http://spaceweather.arsc.edu) with a
useful horizontal resolution of 30×30 km. The presentation summarizes more than a decade of experience
running the EPPIM on different parallel platforms and the improvements gained by the transition to
high-performance parallel and multi-core architectures. Specifics of the code implementation on the
newest multi-core environments are also discussed.
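The two-exchange time step described above can be sketched schematically. The solvers below are placeholder no-ops; only the communication pattern (advection solver, exchange, vertical solver, exchange) is taken from the abstract:

```python
import numpy as np

# Schematic of the EPPIM time step: all vertical physics is local to a
# processor's portion of the domain, so updated data are exchanged only
# twice per step. The solvers are placeholder no-ops; exchange_halos is a
# stand-in for an MPI halo exchange.
exchanges = 0

def exchange_halos(field):
    global exchanges
    exchanges += 1                 # stand-in for an MPI data exchange

def horizontal_advection(field):
    field += 0.0                   # placeholder: layer-local work only

def vertical_solver(field):
    field += 0.0                   # placeholder: column-local work only
                                   # (advection-diffusion, heat, chemistry)

def time_step(field):
    horizontal_advection(field)
    exchange_halos(field)          # first exchange: after advection solver
    vertical_solver(field)
    exchange_halos(field)          # second exchange: after vertical solver

state = np.zeros((16, 16, 32))     # lon x lat x altitude, illustrative size
for _ in range(10):
    time_step(state)
print(exchanges)                   # 2 exchanges per step -> 20
```

Keeping the exchange count fixed at two per step, independent of how much physics runs in between, is what lets the communication cost stay small relative to computation as the partition grows.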
Cheaper and faster: How to have your cake and eat it too with GPU implementations of Earth Science simulations.
Many complex natural systems studied in the geosciences are characterized by simple local-scale interactions that result in complex emergent behavior. Simulations of these systems, often implemented in parallel using standard CPU clusters, may be better suited to parallel processing environments with large numbers of simple processors. Such an environment is found in Graphics Processing Units (GPUs) on graphics cards. This presentation discusses graphics card implementations of three example applications from volcanology, seismology, and rock magnetics. These candidate applications involve important modeling techniques, widely employed in physical system simulation: 1) a multiphase lattice-Boltzmann code for geofluidic flows; 2) a spectral-finite-element code for seismic wave propagation simulations; and 3) a least-squares minimization code for interpreting magnetic force microscopy data. Significant performance increases, between one and two orders of magnitude, are seen in all three cases, demonstrating the power of graphics card implementations for these types of simulations.
Utilizing multiple CPU cores to improve dust simulation performance
Dust simulation is a typical time-consuming application in the Earth sciences. Previous research by the University of Arizona and others simulated dust events with Eta, a weather model, at lower resolution for environmental and atmospheric sciences. To enable the simulation of dust storms for the southwestern U.S. at an improved, ZIP-code-level resolution, the model must be improved and, consequently, more computing capacity is needed. To leverage computing clusters to enhance performance, a parallel version of the dust simulation model was developed within this research. The reported research migrates an Eta-based sequential dust model to a higher-resolution model based on the Nonhydrostatic Mesoscale Model (NMM) running on multiple CPU cores through 1) modifying the sequential Eta dust simulation model to fit a High Performance Computing (HPC) environment by parallelizing the dust model on top of the NMM weather forecasting model using the Message Passing Interface (MPI); 2) testing the performance and initially validating the parallel dust simulation model using the southwestern United States as the experiment area for a dust event during January 7-8, 2008; and 3) installing the improved model on a Linux cluster with 28 computing nodes and over 200 CPU cores. It is demonstrated that the parallelized version of the dust simulation model achieves good speedup and efficiency when running on multiple CPU cores. The maximum speedup is about 9.5 for the study case when 64 CPU cores are used.
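From the reported figures, the parallel efficiency, and the serial fraction implied by Amdahl's law, follow directly. The serial-fraction estimate is an inference from the numbers in the abstract, not a figure reported by the authors:

```python
# Figures from the abstract: maximum speedup of about 9.5x on 64 CPU cores.
speedup = 9.5
cores = 64
efficiency = speedup / cores
print(f"parallel efficiency at {cores} cores: {efficiency:.1%}")

# Amdahl's law, S = 1 / (f + (1 - f)/n), solved for the serial fraction f:
# f = (n/S - 1) / (n - 1). This is an inference, not a reported figure.
serial_fraction = (cores / speedup - 1) / (cores - 1)
print(f"implied serial fraction: {serial_fraction:.1%}")
```

An implied serial fraction of roughly 9% is consistent with the abstract's observation that speedup saturates well below the core count, suggesting further gains would come from parallelizing the remaining sequential portions rather than adding cores.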