NG34A-01 INVITED 16:00h
Data Mining for Efficient and Accurate Large Scale Retrieval of Geophysical Parameters
Our effort is devoted to developing data mining technology that improves the efficiency and accuracy of geophysical parameter retrievals by learning a mapping from observation attributes to the corresponding parameters within the framework of classification and regression. We will describe a method for efficient learning of neural network-based classification and regression models from high-volume data streams. The proposed procedure automatically learns a series of neural networks of different complexities on smaller data stream chunks and then combines them into an ensemble predictor through averaging. Based on the idea of progressive sampling, the approach starts with a very simple network trained on a very small chunk and then gradually increases the model complexity and the chunk size until the learning performance no longer improves. Our empirical study on aerosol retrievals from data obtained with the MISR instrument aboard the Terra satellite suggests that the proposed method is successful in learning complex concepts from large data streams with near-optimal computational effort. We will also report on a method that complements deterministic retrievals by constructing accurate predictive algorithms and applying them to appropriately selected subsets of observed data. The method is based on developing more accurate predictors that capture both the global and the local properties of a region. The procedure starts by learning the global properties of data sampled over the entire space, and continues by constructing specialized models on selected localized regions. The global and local models are integrated through an automated procedure that determines the optimal trade-off between the two components with the objective of minimizing the overall mean squared error over a specific region. Our experimental results on MISR data showed that the combined model can increase the retrieval accuracy significantly.
The preliminary results on various large heterogeneous spatial-temporal datasets provide evidence that the benefits of the proposed methodology for efficient and accurate learning exist beyond the area of retrieval of geophysical parameters.
http://www.ist.temple.edu
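The progressive-sampling loop described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: a piecewise-constant regressor stands in for the neural networks, the stream is synthetic, and the names `fit_binned`, `progressive_ensemble`, `synthetic_stream`, the doubling schedule and the tolerance `tol` are all hypothetical choices, not the authors' procedure.

```python
import math
import random


def synthetic_stream(seed=0):
    """Hypothetical data stream: noisy samples of sin(2*pi*x) on [0, 1)."""
    rng = random.Random(seed)
    while True:
        x = rng.random()
        yield x, math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.1)


def fit_binned(xs, ys, n_bins):
    """Stand-in learner (NOT a neural network): mean of y in each of n_bins bins."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)  # fallback value for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def ensemble_predict(models, x):
    """Combine the models by simple averaging."""
    return sum(m(x) for m in models) / len(models)


def mse(models, xs, ys):
    """Mean squared error of the averaged ensemble on (xs, ys)."""
    return sum((ensemble_predict(models, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def progressive_ensemble(stream, val_x, val_y, chunk=32, n_bins=2, tol=1e-3, max_rounds=8):
    """Grow chunk size and model complexity until validation error stops improving."""
    models, best = [], float("inf")
    for _ in range(max_rounds):
        xs, ys = zip(*(next(stream) for _ in range(chunk)))
        candidate = models + [fit_binned(xs, ys, n_bins)]
        err = mse(candidate, val_x, val_y)
        if best - err <= tol:  # no meaningful improvement: stop
            break
        models, best = candidate, err
        chunk, n_bins = chunk * 2, n_bins * 2  # assumed doubling schedule
    return models, best
```

Each model sees a fresh chunk of the stream, so no chunk is read twice; the stopping test is applied to the averaged ensemble, not to the newest model alone.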
NG34A-02 INVITED 16:15h
Bayesian Approach to the Joint Inversion of Gravity and Magnetic Data, with Application to the Ismenius Area of Mars
Many inverse problems encountered in geophysics and planetary science are well known to be non-unique (e.g., the inversion of gravity data for the density structure of a body). In the hope of reducing the non-uniqueness of solutions, there has been interest in the joint analysis of data. An example is the joint inversion of gravity and magnetic data, with the assumption that the same physical anomalies generate both the observed magnetic and gravitational anomalies. In this talk, we formulate the joint analysis of different types of data in a Bayesian framework and apply the formalism to the inference of the density and remanent magnetization structure for a local region in the Ismenius area of Mars. The Bayesian approach allows prior information or constraints on the solutions to be incorporated in the inversion, with the "best" solutions being those whose forward predictions most closely match the data while remaining consistent with the assumed constraints. The application of this framework to the inversion of gravity and magnetic data on Mars reveals two typical challenges: the forward predictions of the data have a linear dependence on some of the quantities of interest and a non-linear dependence on others (termed the "linear" and "non-linear" variables, respectively). For observations with Gaussian noise, a Bayesian approach to inversion for the "linear" variables reduces to a linear filtering problem with an explicitly computable "error" matrix. However, for models whose forward predictions have non-linear dependencies, inference is no longer given by such a simple linear problem, and moreover the uncertainty in the solution is no longer completely specified by a computable "error" matrix. It is therefore important to develop methods for sampling from the full Bayesian posterior to provide a complete and statistically consistent picture of model uncertainty and of what has been learned from the observations.
We will discuss advanced numerical techniques, including Markov chain Monte Carlo methods, for a Bayesian approach to the joint inversion of gravity and magnetic data for the Ismenius area on Mars, and for geophysical problems in general.
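The contrast between the closed-form "linear" case and posterior sampling can be illustrated with a toy problem. The sketch below assumes a single scalar parameter m, a forward model d_i = g_i * m with Gaussian noise, and a Gaussian prior; these choices, and the names `linear_gaussian_posterior`, `log_posterior` and `metropolis`, are illustrative assumptions and not the formulation used in the talk. A random-walk Metropolis sampler is used as one simple Markov chain Monte Carlo method.

```python
import math
import random


def linear_gaussian_posterior(g, d, sigma, prior_mean, prior_sd):
    """Closed-form posterior N(mean, var) for d_i = g_i * m + Gaussian noise."""
    precision = 1.0 / prior_sd**2 + sum(gi * gi for gi in g) / sigma**2
    var = 1.0 / precision  # scalar analogue of the "error" matrix
    mean = var * (prior_mean / prior_sd**2
                  + sum(gi * di for gi, di in zip(g, d)) / sigma**2)
    return mean, var


def log_posterior(m, g, d, sigma, prior_mean, prior_sd):
    """Unnormalized log posterior; a non-linear forward model could replace gi * m."""
    misfit = sum((di - gi * m) ** 2 for gi, di in zip(g, d)) / sigma**2
    prior = (m - prior_mean) ** 2 / prior_sd**2
    return -0.5 * (misfit + prior)


def metropolis(log_p, start, step, n_samples, seed=0):
    """Random-walk Metropolis sampler for a scalar parameter."""
    rng = random.Random(seed)
    x, lp = start, log_p(start)
    samples = []
    for _ in range(n_samples):
        cand = x + rng.gauss(0.0, step)
        lp_cand = log_p(cand)
        # Accept with probability min(1, p(cand) / p(x)).
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            x, lp = cand, lp_cand
        samples.append(x)
    return samples
```

In this linear toy case the sampler only reproduces what the closed form already gives, which makes it a convenient check; its value lies in the non-linear case, where only `log_posterior` needs to change.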
NG34A-03 INVITED 16:30h
A Taxonomy of Model-Data Relationships
Contrasting observational data and geophysical models is a ubiquitous task in the earth sciences. The quality and quantity of available observational data sets vary across orders of magnitude, the complexity of models is similarly varied, and issues of accessibility confound analyses of both. Therefore, any examination of the relationship between models and data is of limited scope, losing rich subtleties in particular research programs. Nevertheless, there may be some value in noting similarities across the broad geosciences, with the aim of a tighter focus on and better appreciation of the goal (or goals) one hopes to obtain in any specific model-data comparison. One might suppose that these goals are obvious, but this naive view is quickly vanquished by a survey of scientists. In this paper, we propose an initial taxonomy of model-data relationships, classifying various research projects in terms of an investigator's overall goal(s) in comparing and confronting models with data. Examples are taken in the context of five loosely defined geosystems (the Earth's Atmosphere, the Solid Earth, the Biosphere, Celestial Mechanics, the Research Laboratory). Each system is thought of as the physical context within which the model is framed and the data is taken to describe or reflect. Questions posed within these systems can be classified in terms of the type of occurrence and/or temporal and spatial patterns within these systems (Recurrent, Repetitive, Rare, One-Off). Sometimes our understanding of the system itself limits which classes are appropriate to a given research project. We then illustrate, by example, a methodology for placing a specific research program within our classification. We use a four-tiered approach for writing down details about each specific research example: (a) the system considered, (b) the model constructed, (c) the data used, (d) the question asked.
Finally, we give several examples of our approach taken from the broad geosciences, considering end members of both very large and very small data sets. Clearly, any taxonomy is justified only by its function; our aim here is to recognize patterns and distinguish subtleties, as well as to generate some broad suggestions of "good practice." Ideally, this or some similar taxonomy will prove of use to researchers both as a vantage point from which to better focus their goals in model-data comparisons, and in easing communications with colleagues who may be thinking initially in terms corresponding to different entries of the taxonomy.
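The four-tiered description (system, model, data, question) together with the occurrence classes can be written down as a small record type. This is only one possible encoding; the class names `Occurrence` and `ResearchExample` and the validation rule are hypothetical, and the abstract does not prescribe any particular representation.

```python
from dataclasses import dataclass
from enum import Enum


class Occurrence(Enum):
    """Occurrence classes named in the abstract."""
    RECURRENT = "Recurrent"
    REPETITIVE = "Repetitive"
    RARE = "Rare"
    ONE_OFF = "One-Off"


# The five loosely defined geosystems named in the abstract.
SYSTEMS = (
    "Earth's Atmosphere",
    "Solid Earth",
    "Biosphere",
    "Celestial Mechanics",
    "Research Laboratory",
)


@dataclass(frozen=True)
class ResearchExample:
    """One entry of the taxonomy: (a) system, (b) model, (c) data, (d) question."""
    system: str
    model: str
    data: str
    question: str
    occurrence: Occurrence

    def __post_init__(self):
        # Reject systems outside the five listed geosystems (an assumed rule).
        if self.system not in SYSTEMS:
            raise ValueError(f"unknown system: {self.system!r}")
```

An entry would then be created as, for instance, `ResearchExample("Solid Earth", "<model>", "<data>", "<question>", Occurrence.RARE)`, where the placeholder strings are filled in for the research program at hand.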
NG34A-04 16:45h
HPC Infrastructure for Solid Earth Simulation on Parallel Computers
Recently, parallel computers with a wide variety of architectures and processing elements (PEs) have emerged, including PC clusters and the Earth Simulator. Moreover, users can easily access these computing resources over the network in a Grid environment. It is well known that programmers must tune their codes thoroughly to achieve excellent performance on each computer, and that the tuning method depends strongly on the type of PE and the architecture. Such optimization is very demanding, especially for application developers. Parallel programming with a message-passing library such as MPI is a further major burden for application programmers. In the GeoFEM project (http://gefeom.tokyo.rist.or.jp), the authors have developed a parallel FEM platform for solid earth simulation on the Earth Simulator, which supports parallel I/O, parallel linear solvers and parallel visualization. This platform efficiently hides the complicated procedures for parallel programming and optimization on vector processors from application programmers. Infrastructure of this type is very useful: source code developed on a single-processor PC is easily optimized for a massively parallel computer by linking it to the parallel platform installed on the target computer. This parallel platform, called HPC Infrastructure, will provide dramatic improvements in efficiency, portability and reliability in the development of scientific simulation codes. For example, the source code of an application is expected to be less than 10,000 lines, and porting legacy codes to a parallel computer takes 2 or 3 weeks. The original GeoFEM platform supports only I/O, linear solvers and visualization. In the present work, further development for adaptive mesh refinement (AMR) and dynamic load balancing (DLB) has been carried out. In this presentation, examples of large-scale solid earth simulation using the Earth Simulator will be demonstrated.
Moreover, recent results of a parallel computational steering tool using an MxN communication model will be shown. In an MxN communication model, the large-scale computation modules run on M PEs while high-performance parallel visualization modules run concurrently on N PEs. This allows computation and visualization each to use the parallel hardware environment that suits it. Meanwhile, real-time steering can be achieved during computation, so that users can check and adjust the computation process as it runs. Furthermore, using different numbers of PEs allows a better balance between computation and visualization in a Grid environment.
http://www-solid.eps.s.u-tokyo.ac.jp/~nakajima/index.html
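One ingredient of an MxN communication model is deciding which computation PE sends which part of the data to which visualization PE. The sketch below works this out for the simplest case, a one-dimensional array split into contiguous blocks on both sides. The block distribution and the function names `block_ranges` and `mxn_schedule` are assumptions made for illustration; the abstract does not describe how the steering tool partitions or transfers its data.

```python
def block_ranges(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(n_items, n_parts)
    ranges, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def mxn_schedule(n_items, m, n):
    """Return (src_pe, dst_pe, start, stop) transfers from M compute PEs to N visualization PEs.

    Each transfer covers the overlap between the index block owned by a
    compute PE and the index block wanted by a visualization PE.
    """
    schedule = []
    dst_blocks = block_ranges(n_items, n)
    for src, (s0, s1) in enumerate(block_ranges(n_items, m)):
        for dst, (d0, d1) in enumerate(dst_blocks):
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:  # non-empty overlap
                schedule.append((src, dst, lo, hi))
    return schedule
```

For example, `mxn_schedule(10, 4, 2)` yields five transfers, with compute PE 1 splitting its block between both visualization PEs. In an MPI setting each tuple would correspond to one send/receive pair.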
NG34A-05 INVITED 17:00h
A Visualization Approach to Understanding Minerals Properties
Over the past several years, huge amounts of data related to the structural, electronic and mechanical properties of minerals have been produced by numerous experiments and calculations. This rapid growth in mineral datasets will continue in the coming years. It is now possible to perform atomistic simulations of mineral systems containing several million to a few billion atoms using semi-empirical molecular dynamics approaches. Quantum mechanical simulations of systems involving several hundred atoms are also feasible. All these high-end computations produce massive datasets, which are multivariate and time-dependent. Gaining insight into such datasets is, however, a non-trivial task. Recently, we have begun adopting a visualization approach to facilitate the understanding of various data related to geophysically relevant minerals such as silicates and oxides. In this talk, I will present our current progress in this endeavor by considering two visualization case studies. In the first case, we are visualizing multivariate elastic moduli and anisotropic wave propagation as a function of composition, pressure and temperature. In the second case, we are visualizing defect-induced properties in minerals.
NG34A-06 17:15h
The Programming Language Python In Earth System Simulations
Mathematical models in the earth sciences are based on the solution of systems of coupled, non-linear, time-dependent partial differential equations (PDEs). The spatial and temporal scales vary from planetary scale and millions of years for convection problems to 100 km and 10 years for fault system simulations. Various techniques are in use to deal with the time dependency (e.g. Crank-Nicolson), with the non-linearity (e.g. Newton-Raphson) and with weakly coupled equations (e.g. non-linear Gauss-Seidel). Besides these high-level solution algorithms, discretization methods (e.g. the finite element method (FEM) and the boundary element method (BEM)) are used to deal with spatial derivatives. Typically, large-scale, three-dimensional meshes are required to resolve geometrical complexity (e.g. in the case of fault systems) or features in the solution (e.g. in mantle convection simulations). The modelling environment escript allows the rapid implementation of new physics as required for the development of simulation codes in the earth sciences. Its main objective is to provide a programming language in which the user can define new models and rapidly develop high-level solution algorithms. The current implementation is linked with the finite element package finley as a PDE solver. However, the design is open, and other discretization technologies such as finite differences and boundary element methods could be included. escript is implemented as an extension of the interactive programming environment Python (see www.python.org). The key concepts introduced are Data objects, which hold values on nodes or elements of the finite element mesh, and linearPDE objects, which define linear partial differential equations to be solved by the underlying discretization technology. In this paper we will present the basic concepts of escript and show how escript is used to implement a simulation code for interacting fault systems.
We will show some results of large-scale, parallel simulations on an SGI Altix system. Acknowledgements: Project work is supported by the Australian Commonwealth Government through the Australian Computational Earth Systems Simulator Major National Research Facility, the Queensland State Government Smart State Research Facility Fund, The University of Queensland and SGI.
http://www.esscc.uq.edu.au/Research/EscriptFinley
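As an illustration of one of the time-stepping techniques named in the abstract, the following sketch performs a Crank-Nicolson step for the one-dimensional diffusion equation u_t = alpha * u_xx with zero Dirichlet boundaries, using finite differences and a tridiagonal (Thomas) solve. It is plain Python and deliberately does not use escript or finley; the function names and the parameter `lam = alpha * dt / h**2` are choices made for this example only.

```python
import math


def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal linear system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def crank_nicolson_step(u, lam):
    """Advance the interior values u by one time step (boundary values are zero).

    Solves (I - lam/2 * L) u_new = (I + lam/2 * L) u, where L is the
    second-difference operator and lam = alpha * dt / h**2.
    """
    n = len(u)
    rhs = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        rhs.append(0.5 * lam * left + (1.0 - lam) * u[i] + 0.5 * lam * right)
    lower = [-0.5 * lam] * n
    diag = [1.0 + lam] * n
    upper = [-0.5 * lam] * n
    return thomas_solve(lower, diag, upper, rhs)
```

A sine profile is an eigenvector of the discrete operator, so one step scales it by a single known factor; this gives an exact check of the implementation.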
NG34A-07 17:30h
Implementing Geographic Information System Grid Services Using Distributed Messaging Systems
Geographic Information Systems (GIS) provide a number of standard services for managing data sources (such as faults, GPS, and earthquake event catalogs) and generating map displays. These may be integrated with remote services for running simulation codes to create a distributed computational environment for integrating data, simulation, and analysis/visualization tools. Such systems are usually referred to as "Grids". In this talk we present our work on building a GIS Grid around modern distributed computing concepts: Web Services and Service-Oriented Architectures. These distributed systems glue together database and analysis tools with more traditional approaches to high-performance computing. The main developments outlined in this talk are a) implementing GIS services as Web Services; b) integrating these services into a message-based Grid using the NaradaBrokering infrastructure; and c) extending GIS services to support earthquake simulation and visualization as part of the NASA SERVOGrid project.
NG34A-08 17:45h
Analysis, Mining and Visualization Service at NCSA
NCSA's goal is to create a balanced system that fully supports high-end computing as well as: 1) high-end data management and analysis; 2) visualization of massive, highly complex data collections; 3) large databases; 4) geographically distributed Grid computing; and 5) collaboratories, all based on a secure computational environment and driven by workflow-based services. To this end NCSA has defined a new technology path that includes the integration and provision of cyberservices in support of data analysis, mining, and visualization. NCSA has begun to develop and apply a data mining system, NCSA Data-to-Knowledge (D2K), in conjunction with both the application and research communities. NCSA D2K will enable the formation of model-based application workflows and visual programming interfaces for rapid data analysis. The Java-based D2K framework, which integrates analytical data mining methods with data management, data transformation, and information visualization tools, will be configurable from the cyberservices (web and grid services, tools, etc.) viewpoint to solve a wide range of important data mining problems. This effort will use new modules, such as new classification methods for the detection of high-risk geoscience events, together with existing D2K data management, machine learning, and information visualization modules. A D2K cyberservices interface will be developed to seamlessly connect client applications with remote back-end D2K servers, providing computational resources for data mining and integration with local or remote data stores. This work is being coordinated with SDSC's data and services efforts. The new NCSA Visualization embedded workflow environment (NVIEW) will be integrated with D2K functionality to tightly couple informatics and scientific visualization with the data analysis and management services.
Visualization services will access and filter disparate data sources, simplifying tasks such as fusing related data from distinct sources into a coherent visual representation. This approach enables collaboration among geographically dispersed researchers via portals and front-end clients, and the coupling with data management services enables recording associations among datasets and building annotation systems into visualization tools and portals, giving scientists a persistent, shareable, virtual lab notebook. To facilitate provision of these cyberservices to the national community, NCSA will provide a computational environment for large-scale data assimilation, analysis, mining, and visualization. This will initially be implemented on the new 512-processor shared-memory SGI systems recently purchased by NCSA. In addition to standard batch capabilities, NCSA will provide on-demand capabilities for projects that require rapid response for decision makers (e.g., development of severe weather, earthquake events). The environment will also be used for non-sequential interactive analysis of data sets where it is important to have access to large data volumes over space and time.
http://www.ncsa.uiuc.edu/