Earth and Space Science Informatics [IN]

IN32A
 MC:3022  Wednesday  1020h

Data Fusion: Issues, Barriers, and Approaches I


Presiding:  D Arctur, Open Geospatial Consortium; P Fox, HAO/ESSL/NCAR

IN32A-01 INVITED

Information Fusion: Moving from domain independent to domain literate approaches

* McGuinness, D , McGuinness Associates, 4 Shaker Bay Road, Latham, NY 12110, United States
* McGuinness, D , Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, United States

Information Fusion has been a focus of research within the field of computer science for a number of years. Numerous environments aimed at general schema evaluation, diagnosis, and evolution have emerged from that community, including, for example, the Chimaera Ontology Evolution Environment and the Prompt environment for schema mapping and alignment. General (domain-independent) efforts have produced useful research results and numerous tools; however, these results have predominantly been generated and used by computer scientists and have focused largely on information schema integration and diagnosis. More recently, semantically enabled, web-centric approaches have emerged that use domain knowledge to provide tools and services aimed at natural scientists' needs for data fusion. In this talk, we will introduce some foundations for information fusion and provide deployed examples of how these foundations and evolving tools have been, and are being, used today in natural science domains by domain scientists. Some examples will be provided from deployed virtual observatory settings.

IN32A-02

Fusion is possible only with interoperability agreements; the GEOSS experience

* Percivall, G gpercivall@opengeospatial.org, Open Geospatial Consortium, 1804 Stonegate Ave, Crofton, MD 21114, United States

Data fusion is defined for this session as the merging of disparate data sources for multidisciplinary study. Implicit in this definition is that the data consumer may not be intimately familiar with the data sources. In order to achieve fusion of the data, there must be generalized concepts that apply to both the data sources and the consumer, and those concepts must be implemented in our information systems. The success of GEOSS depends on data and information providers accepting and implementing a set of interoperability arrangements, including technical specifications for collecting, processing, storing, and disseminating shared data, metadata, and products. GEOSS interoperability is based on non-proprietary standards, with preference given to formal international standards. GEOSS requires a scientific basis for the collection, processing, and interpretation of the data. Use of standards is a hallmark of a sound scientific basis. In order to communicate effectively and achieve data fusion, interoperability arrangements must be based upon sound scientific principles that have been implemented in efficient and effective tools. Establishing such interoperability arrangements depends upon social processes and technology. Through the use of Interoperability Arrangements based upon standards, GEOSS achieves data fusion in order to answer humanity's critical questions. Decision making in support of societal benefit areas depends upon data fusion in multidisciplinary settings.

http://www.ogcnetwork.net/AIpilot

IN32A-03

The SURA Coastal Ocean Observing and Prediction (SCOOP) Program: Adapting Web 2.0 technologies to power next generation science

* Bogden, P bogden@gomoos.org, SURA, 1201 New York Ave, NW, Washington, DC 20005, United States
Partners, S , SURA, 1201 New York Ave, NW, Washington, DC 20005, United States

Web 2.0 has helped globalize the economy and change social interactions, but its full impact on coastal sciences has yet to be realized. The SCOOP program (www.OpenIOOS.org/about/sura.html), an initiative of the Coastal Research Committee of the Southeastern Universities Research Association (SURA), has been using Web 2.0 technologies to create infrastructure for a multi-disciplinary Distributed Coastal Laboratory (DCL). In the spirit of Web 2.0, SCOOP strives to provide an open-access virtual facility where "virtual visiting" scientists can log in, perform experiments (e.g., evaluate new wetting/drying algorithms in several different inundation models), potentially contribute to the assembly of resources (e.g., leave their algorithms for others), and then move on. The SCOOP prototype has focused on storm surge and waves (the initial science focus), and integrates a real-time data network to evaluate the predictions. The multi-purpose SCOOP components support a sensor-web initiative (www.OOSTethys.org) that is co-led by SURA. SCOOP also includes portals with real-time visualization, workflow configuration, and decision-tool prototypes (www.OpenIOOS.org), powered by distributed computing resources from multiple universities across the nation (www.sura.org/SURAgrid). Based on our experience, we propose three key ingredients for initiatives to have the biggest impact on coastal science: (1) standards, (2) working prototypes, and (3) communities of interest. We strongly endorse the Open Geospatial Consortium – a geospatial analog of the World Wide Web Consortium – and other international consensus-standards bodies that engage government, private-sector, and academic involvement. But these standards are often highly complex, which can be an impediment to their use. We have overcome such hurdles with the second key ingredient: a focused working prototype.
The prototype should include guides and resources that make it easy for others to apply, test, and revise the prototype, all without needing to understand the standards in their overwhelming complexity. In addition, the prototype should support direct involvement of the third key ingredient: communities of interest that assess functional relevance. We expect that any two of these ingredients alone, without the third, will severely limit the applicability and impact of any initiative.

http://www.OpenIOOS.org

IN32A-04

Data Integration in Support of a Real-Time Biosurveillance Network

* Cross, S L scott.cross@noaa.gov, NOAA Coastal Data Development Center, 1100 Balch Blvd. Suite 101, Stennis Space Center, MS 39522, United States
Scott, G I geoff.scott@noaa.gov, NOAA Center for Coastal Environmental Health and Biomolecular Research, 219 Ft. Johnson Rd., Charleston, SC 29412, United States
Miglarese, J V john.miglarese@fedsolve.com, FedSolve, LLC, 5 Town Gate Ct., Bethesda, MD 20817, United States

Recent emergency and security events from both human and natural causes have increased the urgency for multidisciplinary data integration. For example, understanding natural resource mortalities on any given day and time of year may result in the timely identification of an intentional biological or chemical act, as well as assist in the development of recovery and restoration plans. The South Carolina Environmental Surveillance Network (ESN) is a real-time surveillance network of coastal-zone wildlife mortality incidents (e.g., fish kills, bird kills, animal disease outbreaks, harmful algal blooms, and marine mammal strandings) that (1) notifies participating network science and regulatory experts of incidents; (2) allows for quick assessments of potential links between and among mortalities; and (3) provides a mechanism to alert the emergency management community of incidents that could impact commerce and public health. The ESN data management system relies on a resource-based (or RESTful) approach and includes a Web mapping application that provides access to both real-time and historical data, as well as a data flow that analyzes event co-occurrence and provides for email notification of a number of state and federal partners. Notably, it is not simply the occurrence, but the co-occurrence of these events that can signal emergency conditions; thus the real value of the ESN is in its integration of data streams across state and federal administrative lines that have historically provided barriers to data and information flow. In our experience, two recurring types of obstacles to data system integration are particularly challenging. One is the cultural tendency for an agency or agent to maintain tight control over data that they have collected. The second is the reluctance of Information Technology (IT) managers to allow remote access to data systems under their control, regardless of security measures taken.
The ESN development has thus far been successful due largely to efforts to: (1) gain high-level backing of the concept by agency managers; (2) use data-exchange protocols that do not force changes to agency data-handling practices; and (3) take care to listen to and understand the requirements of all participants. This process included multiple workshops that facilitated the dialogue necessary to design the system, as well as work in a formal team-based training setting. This planning approach may be critical to laying the groundwork required to take advantage of the many technical advances being made in support of data integration.
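
The co-occurrence analysis mentioned in this abstract can be illustrated with a minimal sketch. The event records, the three-day window, and the function name find_cooccurrences are assumptions made for illustration; the abstract does not describe how the ESN actually defines or detects co-occurrence.

```python
from datetime import datetime, timedelta


def find_cooccurrences(events, window=timedelta(days=3)):
    """Return ID pairs of events of different types that fall within `window`.

    `events` is a list of dicts with hypothetical keys "id", "type", "time".
    """
    ordered = sorted(events, key=lambda e: e["time"])
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["time"] - first["time"] > window:
                break  # events are sorted, so later ones are further away
            if second["type"] != first["type"]:
                pairs.append((first["id"], second["id"]))
    return pairs


# Hypothetical incident reports, not real ESN data.
events = [
    {"id": "A", "type": "fish kill", "time": datetime(2008, 6, 1)},
    {"id": "B", "type": "bird kill", "time": datetime(2008, 6, 2)},
    {"id": "C", "type": "fish kill", "time": datetime(2008, 6, 20)},
]
print(find_cooccurrences(events))
```

A production system would presumably also weigh spatial proximity and incident severity; this sketch considers time and event type only.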

IN32A-05

Merging Disparate Data Sources Into a Paleoanthropological Geodatabase for Research, Education, and Conservation in the Greater Hadar Region (Afar, Ethiopia)

Campisano, C J campisano@asu.edu, School of Human Evolution and Social Change, Arizona State University, PO Box 872402, Tempe, AZ 85287-2402, United States
* DiMaggio, E N erin.dimaggio@asu.edu, School of Earth and Space Exploration, Arizona State University, PO Box 871404, Tempe, AZ 85287-1404, United States
Arrowsmith, J R ramon.arrowsmith@asu.edu, School of Earth and Space Exploration, Arizona State University, PO Box 871404, Tempe, AZ 85287-1404, United States
Kimbel, W H wkimbel.iho@asu.edu, School of Human Evolution and Social Change, Arizona State University, PO Box 872402, Tempe, AZ 85287-2402, United States
Reed, K E kreed.iho@asu.edu, School of Human Evolution and Social Change, Arizona State University, PO Box 872402, Tempe, AZ 85287-2402, United States
Robinson, S E serobins@asu.edu, School of Earth and Space Exploration, Arizona State University, PO Box 871404, Tempe, AZ 85287-1404, United States
Schoville, B J bschovil@asu.edu, School of Human Evolution and Social Change, Arizona State University, PO Box 872402, Tempe, AZ 85287-2402, United States

Understanding the geographic, temporal, and environmental contexts of human evolution requires the ability to compare wide-ranging datasets collected from multiple research disciplines. Paleoanthropological field-research projects are notoriously independent administratively, even in regions of high transdisciplinary importance. As a result, valuable opportunities for the integration of new and archival datasets spanning diverse archaeological assemblages, paleontological localities, and stratigraphic sequences are often neglected, which limits the range of research questions that can be addressed. Using geoinformatic tools, we integrate spatially, temporally, and semantically disparate paleoanthropological and geological datasets from the Hadar sedimentary basin of the Afar Rift, Ethiopia. Applying newly integrated data to investigations of fossil-rich sediments will provide the geospatial framework critical for addressing fundamental questions concerning hominins and their paleoenvironmental context. We present a preliminary cyberinfrastructure for data management that will allow scientists, students, and interested citizens to interact with, integrate, and visualize data from the Afar region. Examples of our initial integration efforts include generating a regional high-resolution satellite imagery base layer for georeferencing, standardizing and compiling multiple project datasets, and digitizing paper maps. We also demonstrate how the robust datasets generated from our work are being incorporated into a new, digital module for Arizona State University's Hadar Paleoanthropology Field School – modernizing field data collection methods, on-the-fly data visualization and query, and subsequent analysis and interpretation.
Armed with a fully fused database tethered to high-resolution satellite imagery, we can more accurately reconstruct spatial and temporal paleoenvironmental conditions and efficiently address key scientific questions, such as those regarding the relative importance of internal and external ecological, climatological, and tectonic forcings on evolutionary change in the fossil record. In close association with colleagues working in neighboring project areas, this work advances multidisciplinary and collaborative research, training, and long-range antiquities conservation in the Hadar region.

IN32A-06

Utilizing Multi-Source Geophysical Data for Generating Statistically-Based Gridded Weather Forecast Guidance

* Sheets, K L kari.sheets@noaa.gov, Meteorological Development Laboratory National Weather Service, NOAA, SSCM-2 W/OST22 1325 East West Hwy, Silver Spring, MD 20910, United States
Wagner, G Geoff.Wagner@noaa.gov, Meteorological Development Laboratory National Weather Service, NOAA, SSCM-2 W/OST22 1325 East West Hwy, Silver Spring, MD 20910, United States

The Meteorological Development Laboratory (MDL) of NOAA's National Weather Service (NWS) is generating gridded Model Output Statistics (MOS) forecast guidance in support of the National Digital Forecast Database (NDFD). Currently, gridded MOS provides forecasters with statistically post-processed guidance on grids covering the contiguous United States and Alaska, at resolutions comparable to those used in the official NWS forecast process. Stations used in traditional MOS development are unevenly distributed, leaving developers searching for additional observational data sets as well as better predictor variables to capture terrain effects. Despite efforts to increase the resolution of the meteorological observation data set, the network of quality-controlled observed data falls short of the desired NDFD resolution. To supplement the meteorological data and tailor the MOS forecast guidance to terrain and coastlines, we used spatial analysis techniques to generate additional geophysical variables at the NDFD grid resolution. Meteorological and geophysical data from varying sources also reference multiple coordinate systems, which presents challenges before the data can be used in generating gridded MOS. To position and combine these data accurately, they must be transformed to a common geospatial reference system. In this talk, we discuss the challenges and barriers encountered, and the methods used, when combining data from multiple sources. We also discuss some of the decisions involved in selecting a common coordinate system and the Geographic Information System processes used to perform those transformations.
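
As an illustration of transforming observations to a common geospatial reference system, the sketch below projects latitude/longitude onto a Lambert conformal conic plane using the standard spherical-Earth formulas. The choice of projection, the parameter values, and the function name are assumptions made for illustration only; the abstract does not state which coordinate system or tools the authors selected.

```python
import math


def lambert_conformal(lat, lon, lat1, lat2, lat0, lon0, radius=6371200.0):
    """Project (lat, lon) in degrees to (x, y) in metres on a spherical Earth.

    lat1, lat2: standard parallels; lat0, lon0: projection origin.
    All parameter values used below are illustrative assumptions.
    """
    phi, lam = math.radians(lat), math.radians(lon)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    phi0, lam0 = math.radians(lat0), math.radians(lon0)

    def t(p):
        return math.tan(math.pi / 4 + p / 2)

    # Cone constant; the tangent case (one standard parallel) is handled separately.
    if math.isclose(phi1, phi2):
        n = math.sin(phi1)
    else:
        n = math.log(math.cos(phi1) / math.cos(phi2)) / math.log(t(phi2) / t(phi1))

    f = math.cos(phi1) * t(phi1) ** n / n
    rho = radius * f / t(phi) ** n     # distance from the cone apex to the point
    rho0 = radius * f / t(phi0) ** n   # distance from the cone apex to the origin
    theta = n * (lam - lam0)
    return rho * math.sin(theta), rho0 - rho * math.cos(theta)


# Hypothetical station at 38.9 N, 77.0 W on a cone tangent at 25 N, centred on 95 W.
x, y = lambert_conformal(38.9, -77.0, 25.0, 25.0, 25.0, -95.0)
print(round(x), round(y))
```

Once every source is expressed in the same projected coordinates, station observations and gridded geophysical fields can be compared cell by cell. Real workflows would normally rely on an established GIS or projection library and account for the datum of each source.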

IN32A-07

Assembling Large, Multi-Sensor Climate Datasets Using the SciFlo Grid Workflow System

* Wilson, B D Brian.Wilson@jpl.nasa.gov, Jet Propulsion Laboratory, 4800 Oak Grove Dr., Pasadena, CA 91109, United States
Manipon, G Gerald.Manipon@jpl.nasa.gov, Jet Propulsion Laboratory, 4800 Oak Grove Dr., Pasadena, CA 91109, United States
Xing, Z Zhangfan.Xing@jpl.nasa.gov, Jet Propulsion Laboratory, 4800 Oak Grove Dr., Pasadena, CA 91109, United States
Fetzer, E Eric.Fetzer@jpl.nasa.gov, Jet Propulsion Laboratory, 4800 Oak Grove Dr., Pasadena, CA 91109, United States

NASA's Earth Observing System (EOS) is the world's most ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the A-Train platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the cloud scenes from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover and access multiple datasets from remote sites, find the space/time matchups between instrument swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, and assemble merged datasets for further scientific and statistical analysis. To meet these large-scale challenges, we are utilizing a Grid computing and dataflow framework, named SciFlo, in which we are deploying a set of versatile and reusable operators for data query, access, subsetting, co-registration, mining, fusion, and advanced statistical analysis. SciFlo is a semantically-enabled ("smart") Grid Workflow system that ties together a peer-to-peer network of computers into an efficient engine for distributed computation. The SciFlo workflow engine enables scientists to do multi-instrument Earth Science by assembling remotely-invokable Web Services (SOAP or HTTP GET URLs), native executables, command-line scripts, and Python code into a distributed computing flow.
A scientist visually authors the graph of operations in the VizFlow GUI, or uses a text editor to modify the simple XML workflow documents. The SciFlo client and server engines optimize the execution of such distributed workflows and allow the user to transparently find and use datasets and operators without worrying about the actual location of the Grid resources. The engine transparently moves data to the operators, and moves operators to the data (on the dozen trusted SciFlo nodes). SciFlo also deploys a variety of Data Grid services to: query datasets in space and time, locate and retrieve on-line data granules, provide on-the-fly variable and spatial subsetting, and perform pairwise instrument matchups for A-Train datasets. These services are combined into efficient workflows to assemble the desired large-scale, merged climate datasets. SciFlo is currently being applied in several large climate studies: comparisons of aerosol optical depth between MODIS, MISR, the AERONET ground network, and U. Michigan's IMPACT aerosol transport model; characterization of long-term biases in microwave and infrared instruments (AIRS, MLS) by comparisons to GPS temperature retrievals accurate to 0.1 K; and construction of a decade-long, multi-sensor water vapor climatology stratified by classified cloud scene by bringing together datasets from AIRS/AMSU, AMSR-E, MLS, MODIS, and CloudSat (NASA MEASUREs grant, Fetzer PI). The presentation will discuss the SciFlo technologies, their application in these distributed workflows, and the many challenges encountered in assembling and analyzing these massive datasets.
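
The space/time matchup step described in this abstract can be sketched as a brute-force nearest-neighbor search. This is not SciFlo code: the record layout, the distance and time thresholds, and the function names are hypothetical, and a real matchup service would use spatial indexing rather than comparing every pair.

```python
import math


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance in kilometres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def match_observations(obs_a, obs_b, max_km=50.0, max_seconds=900.0):
    """Pair each record in obs_a with its nearest record in obs_b, if any.

    Records are dicts with hypothetical keys "id", "lat", "lon", "t" (seconds).
    A candidate must lie within both the time and the distance threshold.
    """
    matches = []
    for a in obs_a:
        best = None
        for b in obs_b:
            if abs(a["t"] - b["t"]) > max_seconds:
                continue
            d = great_circle_km(a["lat"], a["lon"], b["lat"], b["lon"])
            if d <= max_km and (best is None or d < best[1]):
                best = (b["id"], d)
        if best is not None:
            matches.append((a["id"], best[0]))
    return matches


# Hypothetical footprints from two instruments.
obs_a = [
    {"id": "a1", "lat": 10.0, "lon": 20.0, "t": 0.0},
    {"id": "a2", "lat": 40.0, "lon": -100.0, "t": 0.0},
]
obs_b = [
    {"id": "b1", "lat": 10.1, "lon": 20.0, "t": 60.0},
    {"id": "b2", "lat": 10.0, "lon": 20.0, "t": 5000.0},
]
print(match_observations(obs_a, obs_b))
```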

IN32A-08

A System of Services Delivering Multi-Sensor Gridded Data Record and Applied Scientific Analyses

* Most, N neal.f.most@nasa.gov, NASA GSFC, Code 610, Greenbelt, MD 20771, United States
Halem, M halem@umbc.edu, UMBC, 1000 Hilltop Circle, Baltimore, MD 21250, United States
Stewart, K kstewart@innovim.com, NASA GSFC, Code 610, Greenbelt, MD 20771, United States
Chapman, D dchapm2@umbc.edu, UMBC, 1000 Hilltop Circle, Baltimore, MD 21250, United States
Nguyen, P phuong3@umbc.edu, UMBC, 1000 Hilltop Circle, Baltimore, MD 21250, United States
Golpayani, N golpa1@umbc.edu, UMBC, 1000 Hilltop Circle, Baltimore, MD 21250, United States

The AIRS and MODIS instruments provide the Earth science community with data and data products to conduct research and applied science, such as modeling weather, conducting global trend analysis of ozone and aerosols, and predicting ecological patterns. Adequate mechanisms have been developed to collect, process, archive, maintain, and distribute these data to the science community. However, these delivery systems are constructed with dissimilar technologies and present different workflows, which burdens the users of their products with learning different interfaces and processes. Furthermore, knowledge may be gained from hosting these data in one repository, where new data products are delivered, improved data visualization routines are provided, and analysis algorithms are integrated, allowing discrete datasets to be compared and combined. The Service Oriented Atmospheric Radiances (SOAR) system has evolved into such a system. SOAR generates and stages gridded radiance datasets from the MODIS and AIRS Level 1B data and delivers these datasets to the science community in near real time, using a suite of hardware and software technologies. Our data processing tool set geolocates and grids AIRS and MODIS data with a unified processing routine. Data visualization and analysis routines have been developed, providing capabilities not found in the original data processing and distribution systems. All of this capability is delivered over the Web. Extensions to SOAR are in progress, including a data download module that will download the L1B MODIS and AIRS data in real time and grid the data on demand. The SOAR system is also currently integrating a cloud computing framework, which will improve data processing performance by spreading submitted requests over multiple processors and will expand the system's storage capacity.
This storage capacity is required to support the near-term plan to grid and stage an expanded data library that includes data from instruments such as HIRS, AVHRR and VTPR.
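
The idea of gridding observations from different instruments with one unified routine can be sketched as simple bin averaging onto a regular latitude/longitude grid. This is not the SOAR implementation: the function name, the one-degree cell size, and the sample values are assumptions for illustration, and the abstract does not say which gridding method SOAR uses.

```python
import math
from collections import defaultdict


def grid_mean(samples, cell_deg=1.0):
    """Average (lat, lon, value) samples into regular lat/lon cells.

    Returns a dict mapping (row, col) cell indices to the mean value, where
    row = floor(lat / cell_deg) and col = floor(lon / cell_deg).
    """
    sums = defaultdict(lambda: [0.0, 0])
    for lat, lon, value in samples:
        key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical brightness-temperature samples (K) from any instrument.
samples = [
    (10.2, 20.7, 280.0),
    (10.8, 20.1, 290.0),
    (-0.5, 0.5, 300.0),
]
print(grid_mean(samples))
```

Because the routine only needs geolocated samples, the same call could in principle serve footprints from different sensors, which is the appeal of a unified processing path.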