Earth and Space Science Informatics [IN]

IN22A
 MC:3018  Tuesday  1020h

Emerging Cyberinfrastructure for Geosciences I


Presiding:  P Fox, HAO/ESSL/NCAR; D McGuinness, Rensselaer Polytechnic Institute

IN22A-01 INVITED

Next Generation Virtual Observatories

* Fox, P pfox@ucar.edu, HAO/ESSL/NCAR, PO Box 3000, Boulder, CO 80307, United States
McGuinness, D L dlm@cs.rpi.edu, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, United States
McGuinness, D L dlm@cs.rpi.edu, McGuinness Associates, 4 Shaker Bay Road, Latham, NY 12110, United States

Virtual Observatories (VOs) are now being established in a variety of geoscience disciplines beyond their origins in Astronomy and Solar Physics. Implementations range from hydrology and environmental sciences to solid earth sciences. Among the goals of VOs are to provide search/query, access and use of distributed, heterogeneous data resources. With many of these goals being met and usage increasing, new demands and requirements are arising. Two of these are of immediate and pressing interest. The first is use of VOs by non-specialists, especially for information products that go beyond the usual data, or data products, that are sought for scientific research. The second is citation and attribution of artifacts that are being generated by VOs. In some sense VOs are re-publishing (re-packaging, or generating new synthetic) data and information products. At present only a few VOs address this need and it is clear that a comprehensive solution that includes publishers is required. Our work in VOs and related semantic data framework and integration areas has led to a view of the next generation of virtual observatories which addresses the two above-mentioned needs as well as others that are emerging. Both of the needs highlight a semantic gap, i.e. the gap in meaning and use for a user or users beyond the original design intention, which is very often difficult or impossible to bridge. For example, VOs created for experts with complex, arcane or jargon vocabularies are not accessible to the non-specialist and, further, information products the non-specialist may use are not created or considered for creation. In the second case, use of a (possibly virtual) data or information product (e.g. an image or map) as an intellectual artifact that can be accessed as part of the scientific publication and review procedure also introduces terminology gaps, as well as a need for services that VOs may have to provide. 
Our supposition is that formalized methods in semantics and semantic web technologies are ideal to meet and solve both of these semantic gaps. In this presentation we highlight both of the emerging needs, and current and emerging semantic web solutions that will enable the next generation of virtual observatories. Our work is funded under NSF/OCI and NASA/ACCESS/ESTO projects to the High Altitude Observatory at the National Center for Atmospheric Research (NCAR) and McGuinness Associates Consulting.

IN22A-02

Cyberinfrastructure for the NSF Ocean Observatories Initiative

* Orcutt, J A jorcutt@ucsd.edu, UCSD, California Institute for Telecommunications & Information Technology, 9500 Gilman Drive, La Jolla, CA 92093-0436, United States
* Orcutt, J A jorcutt@ucsd.edu, UCSD, Scripps Institution of Oceanography, 9500 Gilman Drive, La Jolla, CA 92093-0225, United States
Vernon, F L, UCSD, Scripps Institution of Oceanography, 9500 Gilman Drive, La Jolla, CA 92093-0225, United States
Arrott, M, UCSD, California Institute for Telecommunications & Information Technology, 9500 Gilman Drive, La Jolla, CA 92093-0436, United States
Chave, A, Woods Hole Oceanographic Institution, Deep Submergence Laboratory, Woods Hole, MA 02543-1531, United States
Schofield, O, Rutgers University, 381 Mercer St., Princeton, NJ 08540, United States
Peach, C, UCSD, Scripps Institution of Oceanography, 9500 Gilman Drive, La Jolla, CA 92093-0225, United States
Krueger, I, UCSD, California Institute for Telecommunications & Information Technology, 9500 Gilman Drive, La Jolla, CA 92093-0436, United States
Meisinger, M, UCSD, California Institute for Telecommunications & Information Technology, 9500 Gilman Drive, La Jolla, CA 92093-0436, United States

The Ocean Observatories Initiative (OOI) is an environmental observatory covering a diversity of oceanic environments, ranging from the coastal to the deep ocean. The physical infrastructure comprises a combination of seafloor cables, buoys and autonomous vehicles. It is currently in the final design phase, with construction planned to begin in mid-2010 and deployment phased over five years. The Consortium for Ocean Leadership manages this Major Research Equipment and Facilities Construction program with subcontracts to Scripps Institution of Oceanography, University of Washington and Woods Hole Oceanographic Institution. High-level requirements for the cyberinfrastructure (CI) include the delivery of near-real-time data with minimal latencies, open data, data analysis and data assimilation into models, and subsequent interactive modification of the network (including autonomous vehicles) by the cyberinfrastructure. Network connections include a heterogeneous combination of fiber optics, acoustic modems, and Iridium satellite telemetry. The cyberinfrastructure design loosely couples services that exist throughout the network and share common software and middleware as necessary. In this sense, the system appears to be identical at all scales, so it is self-similar or fractal by design. The system provides near-real-time access to data and developed knowledge for the OOI's Education and Public Engagement program, access to the physical infrastructure for the marine operators, and access for the larger community including scientists, the public, schools and decision makers. Social networking is employed to facilitate the virtual organization that builds, operates and maintains the OOI as well as to provide a variety of interfaces to the data and knowledge generated by the program. We are working closely with NOAA to exchange near-real-time data through interfaces to their Data Interchange Facility (DIF) program within the Integrated Ocean Observing System (IOOS). 
Efficiencies have been emphasized through the use of university and commercial computing clouds.

http://ooici.ucsd.edu/spaces

IN22A-03

Data Relationships: Towards a Conceptual Model of Scientific Data Catalogs

* Hourcle, J A joseph.a.hourcle@nasa.gov, NASA/GSFC (Wyle IS), Code 671.1 Goddard Space Flight Center, Greenbelt, MD 20771, United States

As the amount of data, types of processing and storage formats increase, the total number of record permutations increases dramatically. The result is an overwhelming number of records, which makes identifying the best data object to answer a user's needs more difficult. The issue is further complicated as each archive's data catalog may be designed around different concepts: anything from individual files to be served, to series of similarly generated and processed data, to something entirely different. Catalogs may not only be flat tables, but may be structured as multiple tables with each table being a different data series, or a normalized structure of the individual data files. Merging federated search results from archives with different catalog designs can create situations where the data object of interest is difficult to find due to an overwhelming number of seemingly similar or entirely unwanted records. We present a reference model for discussing data catalogs and the complex relationships between similar data objects. We show how the model can be used to improve scientists' ability to quickly identify the best data object for their purposes and discuss technical issues required to use this model in a federated system.

IN22A-04

Report From the Cryospheric Cyberinfrastructure: Discovery, Access, and Delivery of Data for IPY (DADDI)

Parsons, M parsonsm@nsidc.org, National Snow and Ice Data Center, CIRES, 449 UCB University of Colorado, Boulder, CO 80309-0449, United States
* Collins, J collinsj@nsidc.org, National Snow and Ice Data Center, CIRES, 449 UCB University of Colorado, Boulder, CO 80309-0449, United States

The Discovery, Access, and Delivery of Data for IPY (DADDI) project seeks to improve the availability of Arctic coastal data, and has the long-term goal of developing a system that can be extended to support access to the spectrum of International Polar Year (IPY) data. Previously, we reported on the process of defining user needs for DADDI, especially those requirements related to data discovery and access (1). Here we discuss the implementation of the DADDI system and the components that provide the means to contribute, preserve, discover and access data relevant to all disciplines within the cryospheric domain. Our previously reported use case development for the DADDI project described a set of criteria which were particularly salient for the users of systems supported by a geoscience cyberinfrastructure. These included the ability to easily control the boundaries of scientific parameter dimensions when searching for, manipulating and obtaining data; relevant, ranked, and filterable search and browse results; and access to data quality indicators and references, including access to human experts in the use of the selected data. Several of those user priorities have been successfully incorporated into the current DADDI environment, and in particular into the Mercury search system used to provide DADDI's metadata harvesting, indexing, query, and search results presentation functions. We will discuss and demonstrate the current system and its capabilities, including a review of the metadata and related standards used to support the existing features. We will also review the capabilities yet to be implemented, and the infrastructure changes or additions that will be necessary for DADDI to more fully participate in the cryospheric cyberinfrastructure.
(1) The Virtual Observatory in Action: Recurring Themes in Polar Science Use Cases. 2007 Virtual Observatories in Geosciences Conference (http://www.egy.org/VOiG/Home.html; http://www.hao.ucar.edu/projects/vsto/voig/index.php/Session_II:Recurring_Themes_in_Polar_Sciences).

http://www.nsidc.org/daddi/

IN22A-05

Knowledge Provenance in Semantic Wikis

* Ding, L dingl@cs.rpi.edu, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, United States
Bao, J baojie@cs.rpi.edu, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, United States
McGuinness, D L dlm@cs.rpi.edu, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, United States

Collaborative online environments with a technical Wiki infrastructure are becoming more widespread. One of the strengths of a Wiki environment is that it is relatively easy for numerous users to contribute original content and modify existing content (potentially originally generated by others). As more users begin to depend on informational content that is evolved by Wiki communities, it becomes more important to track the provenance of the information. Semantic Wikis expand upon traditional Wiki environments by adding computationally understandable encodings of some of the terms and relationships in Wikis. We have developed an environment that extends a semantic Wiki with provenance markup. Provenance of original contributions as well as modifications is encoded using the provenance markup component of the Proof Markup Language. The Wiki environment provides the provenance markup automatically, so users are not required to make specific encodings of author, contribution date, and modification trail. Further, our Wiki environment includes a search component that understands the provenance primitives and thus can be used to provide a provenance-aware search facility. We will describe the knowledge provenance infrastructure of our Semantic Wiki and show how it is being used as the foundation of our group web site as well as a number of project web sites.
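
As a rough illustration of the automatic provenance capture described above, the following Python sketch records author, contribution date and modification trail on every edit and offers a simple provenance-aware search. The class, field and function names (`Revision`, `WikiPage`, `edited_by`) are hypothetical; the sketch does not reproduce the Proof Markup Language vocabulary or the authors' implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Revision:
    """One entry in a page's modification trail (hypothetical structure)."""
    author: str
    timestamp: datetime
    text: str


@dataclass
class WikiPage:
    title: str
    trail: List[Revision] = field(default_factory=list)

    def edit(self, author: str, text: str) -> None:
        # Provenance is captured here, so the user never encodes it by hand.
        self.trail.append(Revision(author, datetime.now(timezone.utc), text))

    @property
    def current_text(self) -> str:
        return self.trail[-1].text if self.trail else ""

    @property
    def original_author(self) -> str:
        return self.trail[0].author if self.trail else ""


def edited_by(pages: List[WikiPage], author: str) -> List[str]:
    """Provenance-aware search: titles of pages the author created or modified."""
    return [p.title for p in pages if any(r.author == author for r in p.trail)]


page = WikiPage("Solar Irradiance")
page.edit("alice", "First draft.")
page.edit("bob", "First draft, with corrections.")
other = WikiPage("Sea Ice Extent")
other.edit("alice", "Stub.")

print(edited_by([page, other], "bob"))
print(page.original_author, len(page.trail))
```

Keeping the trail as an append-only list means the original contribution and every later modification remain queryable, which is what a provenance-aware search would need.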

IN22A-06

DataSpaces: Using Community Workspaces to Enable Rich Air Quality Metadata

* Robinson, E M emr1@wustl.edu, Washington University in Saint Louis, 1 Brookings Dr, Saint Louis, MO 63130, United States
Husar, R B rhusar@wustl.edu, Washington University in Saint Louis, 1 Brookings Dr, Saint Louis, MO 63130, United States

Currently, metadata for air quality datasets is variable, distributed and normally created by the provider for the user. However, a single dataset can be used for many applications that the provider may or may not anticipate, and the data may go through many value-adding processes before it reaches the "end user". Additional metadata can be created at any step along the usage chain, and at this time there is no mechanism for collecting this metadata. Consequently, users don't know how a dataset has been used or what additional processing has occurred beyond the originator. One method to harvest and share metadata from all members of the usage chain is through community workspaces, DataSpaces. DataSpaces are virtual spaces for contributing and archiving metadata, discussing the dataset and harvesting distributed resources in order to capture the critical community knowledge about the dataset. A DataSpace for a given dataset has two parts: structured, semantically rich metadata and flexible, community-contributed metadata. The structured dataset description includes standard dataset metadata (such as provider, parameters, platform and time period), data lineage, and data quality information. The additional value of the DataSpaces comes from the context provided by the dataset community: users, mediators and providers. This may be through links to other mediator or user-provided metadata, publications that reference the dataset, or web applications and tools using the dataset. DataSpaces also provides a place where a dataset community can connect through discussion and announcements about the dataset. As DataSpaces evolves and is used more by the community, additional functionality will emerge. Currently, there are still many issues with the implementation of DataSpaces, including how to link the DataSpace to the dataset as it moves along the usage chain and how material in DataSpaces can be reused in other metadata.

IN22A-07

The Model Interoperability Experiment in the Gulf of Maine: A Success Story Made Possible by NetCDF, CF-1.0, NcML, NetCDF-Java, THREDDS, OPeNDAP and MATLAB

* Signell, R P rsignell@usgs.gov, USGS, 384 Woods Hole Road, Woods Hole, MA 02543, United States

The Gulf of Maine Ocean Data Partnership Modeling Committee has been developing a Model Interoperability Experiment in the Gulf of Maine built around the Climate and Forecast (CF-1.0) metadata standard. The goal is to allow scientists to issue common MATLAB commands to retrieve geospatially referenced data, regardless of model type. Our starting point was output from six different models: the ROMS, ECOM, POM and FVCOM ocean circulation models, the WRF meteorological model and the WaveWatch III ocean wave model. Although the models all had different grid conventions and were served at different institutions, each group produced NetCDF files, used MATLAB for visualization and analysis, and had a standard HTTP 1.1 web server. Only one group used the CF conventions, however, and as a result each group had its own set of analysis and visualization routines to perform nearly identical tasks. The system was designed to achieve interoperability with a minimum of effort on the part of the data providers and data users. To supply data, participants need only place their existing NetCDF files on their own web sites. The data is accessed using the "byte range request" feature of HTTP, utilized in NetCDF-Java. The CF standardization is achieved using a layer of XML (NcML) which also provides virtual aggregation of data. The THREDDS Data Server allows for central cataloging of the dataset, access via the OPeNDAP web service, and for rectilinear grids, access via the OGC Web Coverage Service (WCS) and the NetCDF Subset Services as well. The OPeNDAP + CF standard data can be accessed with our NetCDF-Java based "CF Toolkit for MATLAB". This toolkit works on any MATLAB system without compiling, delivering geospatially referenced model output from all six models using common functions. 
To further expand the capabilities of CF clients such as the one we have developed, we need to extend the CF conventions to specify additional common features of model output, including staggered grids, masked regions, velocity component relationships and unstructured grid connectivity information. We also need to develop CF toolkits for other common languages such as Python and IDL.
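
The "byte range request" mentioned above is what lets a client read part of a NetCDF file from an ordinary web server without downloading the whole file. The Python sketch below only shows how such a request could be formed with the standard library; it is not how NetCDF-Java implements it, and the URL and helper names (`byte_range_header`, `build_range_request`) are assumptions made for illustration.

```python
import urllib.request


def byte_range_header(offset: int, length: int) -> str:
    """Return an HTTP Range header value for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Both ends of an HTTP byte range are inclusive.
    return f"bytes={offset}-{offset + length - 1}"


def build_range_request(url: str, offset: int, length: int) -> urllib.request.Request:
    """Build (but do not send) a GET request for part of a remote file."""
    return urllib.request.Request(url, headers={"Range": byte_range_header(offset, length)})


# Hypothetical URL; no network traffic happens until urlopen() is called.
req = build_range_request("http://example.org/model_output.nc", offset=0, length=4)
print(req.get_header("Range"))  # prints "bytes=0-3"
```

Passing `req` to `urllib.request.urlopen` would send it; a server that honours range requests replies with status 206 (Partial Content) and only the requested bytes.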

http://www.gomodp.org/modeling-committee

IN22A-08

First Applications of DoD Iridium RUDICS in the NSF Polar Programs

* Valentic, T todd.valentic@sri.com, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, United States
Stehle, R roy.stehle@sri.com, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, United States

We will present the first deployment and application of the new Iridium RUDICS service to remote instrumentation projects within the National Science Foundation's polar programs. The rise of automated observing networks has increased the demand for real-time connectivity to remote instruments, not only for immediate access to data, but also to interrogate health and status. Communicating with field sites in the polar regions is complicated by the remoteness from existing infrastructure, low temperatures and limited connection options. Sites located above 78° latitude are not able to see geostationary satellites, leaving the Iridium constellation as the only one that provides a direct connection. Some others, such as Orbcomm, only provide a store-and-forward service. Iridium is often used as a dial-up modem to establish a PPP connection to the Internet with data files transferred via FTP. On low-bandwidth, high-latency networks like Iridium (2400 bps with ping times of seconds), this approach is time consuming and inefficient. The dial-up time alone takes upwards of a minute, and standard TCP/IP and FTP protocols are hampered by the long latencies. Minimizing transmission time is important for reducing battery usage and connection costs. The new Iridium RUDICS service can be used for more efficient transfers. RUDICS is an acronym for "Router-based Unstructured Digital Inter-working Connectivity Solution" and provides a direct connection between an instrument in the field and a server on the Internet. After dialing into the Iridium gateway, a socket connection is opened to a registered port on a user's server. Bytes sent to or from the modem appear at the server's socket. The connection time is reduced to about 10 seconds because the modem training and PPP negotiation stages are eliminated. The remote device does not need to have a full TCP/IP stack, allowing smaller instruments such as data loggers to directly handle the data transmission. 
Alternative protocols can be deployed that better exploit the characteristics of the Iridium channel. In addition, the setup naturally scales to handle hundreds of remote devices, an important aspect for larger sensor networks. As part of the NSF's Arctic Research Support and Logistics Services, we have deployed RUDICS systems with three different research projects. These are the first NSF RUDICS deployments for projects using the Department of Defense Iridium gateway, which allows for unlimited connection time at a flat monthly rate for US government users. The first project is O-Buoy, an IPY-OASIS project for self-contained, autonomous observations of atmospheric chemical species in the polar marine boundary layer. The second project is a collection of low-power instrument towers on Alaska's North Slope at Imnavait Creek, part of the Arctic Observation Network (AON). Lastly, the autonomous instrument platform at Ivotuk, Alaska, uses RUDICS to provide telemetry about the renewable energy systems. A set of real-time web displays allows researchers for each project to monitor their remote sites and access real-time data.
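
To give a feel for why connection setup time matters on a 2400 bps link, the following Python sketch compares total time on the link for a dial-up/PPP session and a RUDICS session, using the setup figures quoted above (roughly 60 s and 10 s). The payload size is a hypothetical value, and the estimate deliberately ignores protocol overhead and round-trip latency, so it is a lower bound rather than a measurement.

```python
def transfer_seconds(payload_bytes: int, link_bps: float = 2400.0,
                     setup_seconds: float = 0.0) -> float:
    """Connection setup time plus raw serialization time (no overhead, no latency)."""
    return setup_seconds + payload_bytes * 8 / link_bps


PAYLOAD = 3000          # bytes; hypothetical size of one data file
DIALUP_SETUP = 60.0     # "upwards of a minute" for modem training and PPP
RUDICS_SETUP = 10.0     # "about 10 seconds" for a RUDICS connection

dialup = transfer_seconds(PAYLOAD, setup_seconds=DIALUP_SETUP)
rudics = transfer_seconds(PAYLOAD, setup_seconds=RUDICS_SETUP)
print(f"dial-up/PPP: {dialup:.0f} s, RUDICS: {rudics:.0f} s")
```

For a small payload like this one the setup time dominates, which is consistent with the abstract's point that eliminating modem training and PPP negotiation reduces battery usage and connection costs.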

http://transport.sri.com/rudics