Knowledge Provenance for Virtual Observatories
Virtual observatories represent a growing trend and a new paradigm supporting distributed, interdisciplinary scientific research. As virtual observatories become more commonplace and attract a wider range of users, additional requirements are being identified and some previous requirements are gaining importance. In this presentation, we describe knowledge provenance requirements that have emerged in our previous work on virtual observatories, as well as requirements identified by a number of scientific communities. We use knowledge provenance in a broad sense to include the origins of knowledge in any virtual system: sources of raw data, experiments used to generate data, processing applied to the data, and so on. We will describe a newly sponsored research effort entitled Semantic Provenance Capture in Data Ingest Systems (SPCDIS), which focuses on providing an extensible representation for provenance in data systems. In this NSF OCI/SDCI-funded project we will implement an extensible metadata provenance scheme for one existing virtual observatory, the Virtual Solar-Terrestrial Observatory.
The GFZ ISDC - Part of Earth Science Infrastructure
The GeoForschungsZentrum Potsdam (GFZ) Information System and Data Center (ISDC), part of the GFZ information technology infrastructure, manages almost 300 very different geoscientific product types comprising more than 15 million data sets and approximately 10 TByte of data, processed by various national and international scientific groups. Most of the data come from the German CHAMP satellite and the American-German GRACE satellite mission and associated projects. The scientific results cover geodesy, geophysics, and atmospheric research. More than 1500 registered users and user groups all over the world access the data through the project- and product-integrating ISDC portal (http://isdc.gfz-potsdam.de), which provides graphical user interfaces (GUIs) as well as non-GUI batch processing interfaces. This presentation describes the ideas, concepts, and realisation process for integrating the ISDC into a global Earth science infrastructure. All scientific data sets managed by the ISDC are described by corresponding metadata using the Directory Interchange Format (DIF) standard (http://gcmd.gsfc.nasa.gov/User/difguide/difman.html). Extending this standard, each product type is described by a substantial parent DIF document, whereas the scientific products consist of the actual data sets together with child DIF documents containing only the data-set-specific information. Recently, all metadata for new product types have been encoded using the DIF version 9 XML schema instead of the "ASCII text only" DIF standard. Currently, almost all metadata are still stored in project-specific relational database structures; in the future, at least the parent DIF XML metadata documents will be managed by XML database mechanisms. Existing XML metadata documents can easily be transformed from the ISDC/DIF XML schema to internationally standardized XML schemas using Extensible Stylesheet Language Transformation (XSLT)-based processes.
This approach could help integrate a standard web-based Catalogue Service (CSW) interface into the ISDC portal, realized with the "deegree" framework (http://deegree.sourceforge.net/index.html), which conforms to the Open Geospatial Consortium (OGC)/ISO catalogue standard (http://www.opengeospatial.org/standards/cat). Providing such a standard catalogue interface would open the gate to real interoperability and to networking with other geoscientific information systems, helping to construct the global Earth science cyberinfrastructure. http://isdc.gfz-potsdam.de
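The schema translation step described above can be sketched in a few lines. The snippet below is only illustrative: it uses Python's standard-library ElementTree rather than XSLT, and the element names are invented stand-ins, not the actual DIF version 9 or target ISO schema.

```python
# Illustrative sketch of mapping a simplified, hypothetical DIF XML metadata
# record into a different target schema. The real ISDC pipeline uses
# XSLT-based processes; element names here are assumptions for illustration.
import xml.etree.ElementTree as ET

DIF_SAMPLE = """
<DIF>
  <Entry_ID>CH-ME-3-MAG</Entry_ID>
  <Entry_Title>CHAMP Magnetometer Data</Entry_Title>
  <Summary>Processed magnetic field measurements.</Summary>
</DIF>
"""

def dif_to_target(dif_xml: str) -> ET.Element:
    """Translate a minimal DIF record into a generic target schema."""
    dif = ET.fromstring(dif_xml)
    record = ET.Element("MetadataRecord")
    # Field-by-field mapping, analogous to XSLT template rules.
    ET.SubElement(record, "identifier").text = dif.findtext("Entry_ID")
    ET.SubElement(record, "title").text = dif.findtext("Entry_Title")
    ET.SubElement(record, "abstract").text = dif.findtext("Summary")
    return record

record = dif_to_target(DIF_SAMPLE)
print(ET.tostring(record, encoding="unicode"))
```

In a production setting, an XSLT stylesheet plays the role of the mapping function, so the same transformation can be applied by any standards-compliant XSLT processor.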
A Web 2.0 Application for Executing Queries and Services on Climatic Data
For many years countries have collected data in order to understand climate, to study its effect on living species, and to predict future behavior. Nowadays, terabytes of data are collected by governmental agencies and academic institutions, and the current challenge is how to provide appropriate access to this vast amount of climatic data. Each country's situation with respect to the collection and use of these data is different. In Venezuela in particular, a few institutions have systematically gathered observational and hydrology data, but the data are mostly registered in non-digital media that have been lost or have deteriorated over the years, all of which restricts data availability. In 2006 a joint project between two major Venezuelan universities, Universidad Simón Bolívar (USB) and Universidad Central de Venezuela (UCV), was initiated. The goal of the project is to develop a digital repository of the country's climatic and hydrology data, and to build an application that provides querying and service execution capabilities over these data. The repository has been conceptually modeled as a database that integrates observational data and metadata. Among the metadata we have an inventory of all the stations where data have been collected, and the description of the measurements themselves, for instance, the instruments used for the collection, the time granularity of the measurements, and their units of measure. The resulting data model combines traditional entity-relationship concepts with star and snowflake schemas from data warehouses. The model allows the inclusion of historic or current data, and each kind of data requires a different loading process. Special emphasis has been given to representing the quality of the data stored in the repository. Quality attributes can be attached to each individual value or to sets of values; these attributes can represent the statistical or semantic quality of the data.
Values can be stored at any level of aggregation (hourly, daily, monthly), so that they can be provided to the user at the desired level. This means that additional caution has to be exercised in query answering, in order to distinguish between primary and derived data. On the other hand, a Web 2.0 application is being designed to provide a front-end to the repository. This design focuses on two important aspects: the use of metadata structures, and the definition of collaborative Web 2.0 features that can be integrated into a project of this nature. Metadata descriptors include, for a set of measurements, its quality, granularity, and other dimensional information. With these descriptors it is possible to establish relationships between different sets of measurements and provide scientists with efficient search mechanisms that determine the related sets of measurements that contribute to a query answer. Unlike traditional applications for climatic data, our approach satisfies not only the requirements of researchers specialized in this domain, but also those of anyone interested in the area; one of the objectives is to build an informal knowledge base that can be improved and consolidated with the usage of the system.
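The distinction between primary and derived data described above can be made concrete with a small sketch. All names and values below are invented for illustration and do not reflect the project's actual schema.

```python
# Minimal sketch: storing measurements at one granularity, deriving a
# coarser aggregate, and flagging the result as derived so that query
# answering can distinguish it from primary observations.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Measurement:
    station: str
    value: float
    granularity: str   # "hourly", "daily", "monthly"
    derived: bool      # True if computed from finer-grained data

def aggregate_daily(hourly: list) -> Measurement:
    """Derive a daily value from hourly primary observations."""
    return Measurement(
        station=hourly[0].station,
        value=mean(m.value for m in hourly),
        granularity="daily",
        derived=True,   # mark as derived, not a primary observation
    )

hourly = [Measurement("MARACAY-01", v, "hourly", False) for v in (20.0, 24.0, 28.0)]
daily = aggregate_daily(hourly)
print(daily.value, daily.derived)   # 24.0 True
```

In the repository itself, the `derived` flag would be one of the quality/provenance attributes attached to a value or a set of values.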
Ontology-based Information Management in QuakeSim
The QuakeSim interdisciplinary research team has developed a federated database system that records and provides portal-based access to a variety of geoscientific information important to the earthquake study and forecasting process. This includes fault, seismicity, and other information key to modeling earthquakes and tsunamis. Through the QuakeSim portal, scientists can discover relevant information and access, visualize, and import data into simulation programs and other codes. This is accomplished by utilizing an inter-connected (federated) set of ontologies to describe the semantics of the information and the inter-relationships among the data. Data are recorded in source form with error estimates included, are geotagged as appropriate to specify precisely where on the globe the data were obtained, and are converted as necessary for use by scientists. Data are delivered through a suite of Web Services tied to the semantic metadata (ontology) specifications. A primary goal of the QuakeTables federated database is to provide an integrated resource for simulation and modeling software, such as GeoFest and Virtual California. A key feature of QuakeTables is that it allows for multiple fault interpretations, which can be tested in the models and simulations. As such, QuakeTables does not define a standard set of faults, but allows users to select faults from standard sets, from research publications, or from user-defined attributes. At present, the system is being enhanced to include GPS and InSAR data.
Using Multiple Metadata Standards to Describe Climate Datasets in a Semantic Framework
The standards underlying the Semantic Web -- Resource Description Framework (RDF) and Web Ontology Language (OWL), among others -- show great promise in addressing some of the basic problems in earth science metadata. In particular they provide a single framework that allows us to describe datasets according to multiple standards, creating a more complete description than any single standard can support, and avoiding the difficult problem of creating a super-standard that can describe everything about everything. The Semantic Web standards provide a framework for explicitly describing the data models implicit in programs that display and manipulate data. They also provide a framework in which multiple metadata standards can be described. Most importantly, these data models and metadata standards can be interrelated, a key step in creating interoperability and an important step in creating a practical system. As an exercise in understanding how this framework might be used, we have created an RDF expression of the datasets and some of the metadata in the IRI/LDEO Climate Data Library. This includes concepts like datasets, units, dependent variables, and independent variables. These datasets have been provided under diverse frameworks that have varied levels of associated metadata, including netCDF, GRIB, GeoTIFF, and OPeNDAP; these frameworks have some associated concepts that are common, some that are similar, and some that are quite distinct. We have also created an RDF expression of a taxonomy that forms the basis of an Earth data search interface. These concepts include location, time, quantity, realm, author, and institution. A series of inference engines using currently evolving semantic web technologies is then used to infer the connections between the diverse data-oriented concepts of the data library as well as the distinctly different conceptual framework of the data search. http://iridl.ldeo.columbia.edu/ontologies/
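An RDF expression of dataset metadata like the one described above can be serialized very simply as N-Triples. The sketch below uses only the standard library, and every URI and property name in it is invented for illustration; the actual IRI/LDEO ontologies are published at http://iridl.ldeo.columbia.edu/ontologies/.

```python
# Sketch: expressing dataset/variable/unit metadata as RDF triples in
# N-Triples form. All URIs below are hypothetical examples.
EX = "http://example.org/climate#"

def triple(s: str, p: str, o: str, literal: bool = False) -> str:
    """Format one N-Triples statement; objects may be URIs or literals."""
    obj = f'"{o}"' if literal else f"<{o}>"
    return f"<{s}> <{p}> {obj} ."

triples = [
    triple(EX + "sst_dataset", EX + "hasDependentVariable", EX + "sst"),
    triple(EX + "sst_dataset", EX + "hasIndependentVariable", EX + "time"),
    triple(EX + "sst", EX + "hasUnits", "Celsius", literal=True),
]
for t in triples:
    print(t)
```

Because each statement is just a subject-predicate-object triple, descriptions drawn from different metadata standards (netCDF attributes, GRIB tables, etc.) can coexist in one graph and be interrelated by further triples.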
Developing packages and integrating ontologies for Volcanoes, Plate Tectonics and Atmospheric Science Data Integration
In support of a NASA-funded scientific application (SESDI: the Semantically Enabled Science Data Integration project) that needs to share volcano and climate data to investigate relationships between volcanism and global climate, we have generated volcano and plate tectonics ontologies and have leveraged and augmented the existing SWEET (Semantic Web for Earth and Environmental Terminology) ontology. Our goal is to create a package integrating the relevant ontologies (meant to be shared and reused by a broad community of users) to provide access to the key volcanology, plate tectonics, and atmosphere-related databases. We present how we have put ontologies to work in this science application setting, and the methodologies employed to create the ontologies, map them to the underlying data, and implement them for use by scientists. SESDI is a NASA/ESTO/ACCESS-funded project involving the High Altitude Observatory at the National Center for Atmospheric Research (NCAR), McGuinness Associates Consulting, NASA/JPL and Virginia Polytechnic University.
Prototyping a Knowledge Integration Framework to Solve Science Problems
Key information technology advances in recent years include the emergence of distributed computing architectures based on web services; knowledge engineering efforts, as evidenced by the development of science domain ontologies in the Semantic Web; and growing interest in scientific data mining as a means of automated knowledge extraction from the ever-increasing volumes of science observations and model data available. We present the results of our prototype study that brings together these key information technology components, as applied to the problem of feature extraction and morphology identification for multi-wavelength images of the Sun. We present the science application; the linked ontologies describing the data mining, manipulation, and analysis services as well as the science domain; and a web-based user interface based on an existing smart search tool (NOESIS) that allows a user to discover and explore available data and perform the desired analysis.
Advanced Semantic Concepts and Services in the Virtual Solar-Terrestrial Observatory
After almost three years' experience developing the Virtual Solar-Terrestrial Observatory (VSTO), we report on what we have learned about the level of knowledge representation required to satisfy science users' needs. To date, we have achieved a unified query workflow based on an abstraction of classes (instrument, parameter, date-time) and provided semantic web services to make these available across the internet. We have also moved to the next level and generation of classes that capture higher, science-level concepts such as the state of the atmosphere, domains (e.g. the neutral upper atmosphere), spatial locations, and parameterized representations of time (e.g. high geomagnetic activity). We also outline our plans for further representation, logic, and reasoning within VSTO. VSTO is an NSF/OCI-funded joint effort between the High Altitude Observatory and the Computing and Information System Lab at the National Center for Atmospheric Research (NCAR) and McGuinness Associates Consulting.
Current and future uses of OWL for Earth and Space science data frameworks: successes and limitations
Based on almost three years of experience in developing and deploying scientific data frameworks built using semantic technologies, we now have a production virtual observatory in operation, serving two broad communities: solar physics and terrestrial upper atmospheric physics. Within this application, a data framework provides online location, retrieval, and analysis services over a variety of heterogeneous scientific data sources distributed over the internet. We describe selected current and planned uses of our ontologies in OWL-DL, and the tools involved in development and deployment. We describe both the successes and the limitations we have found to date using OWL-based technologies, especially regarding tool support. We also indicate the important components we require from a robust technical infrastructure as we move forward with expanding the functionality of the frameworks. This expansion includes additional semantic representation and reasoning/query services, as well as broadening the scope of our scientific disciplines.
Development and Application of Ontologies in Support of Earth and Space Science Education
Through its work in supporting improved science education, the Science Education Resource Center (SERC) has developed and applied a set of Earth and Space Science vocabularies. These controlled vocabularies play a central role in supporting user exploration of our educational materials. The set of over 50 vocabularies runs the gamut from small vocabularies with a narrowly targeted use to broader vocabularies that span multiple disciplines and are applied across multiple projects and collections. Typical specialized vocabularies cover disciplinary themes such as tectonic setting (with terms such as mid-ocean ridge, passive margin, and craton) as well as interdisciplinary work such as geology and human health (with terms such as radionuclides and airborne transport processes). To support project-specific customization of vocabularies while retaining the benefits of cross-project reuse, our systems allow for dynamic mapping of terms among multiple vocabularies based on semantic equivalencies. The end result is a weaving of related vocabularies into an ontological network that is exposed as specific vocabularies employing the natural language of the collections and communities that use them. Our process for vocabulary development is community driven and reflects our experiences in aligning terminology with discipline-specific expectations. These experiences include rectifying language differences across disciplines in building a Geoscience Quantitative Skills vocabulary through work with both the mathematics and geoscience communities, as well as the iterative development of a vocabulary spanning Earth and Space science through the aggregation of smaller vocabularies, each developed by scientists for use within their own discipline. The vocabularies are exposed as key navigational features in over 100 faceted search interfaces within the web sites of a dozen Earth and Space Science Education projects.
Within these faceted search interfaces, the terms in the vocabularies act as guideposts and browsing links for the users. Only terms relevant to the current collection, or search return, are exposed to the users, giving them an immediate sense of the scope and focus of the collection. In using vocabularies to drive these sorts of discovery processes, it is critical that vocabularies not only have clear semantics so they can be applied consistently, but also have appropriate evocative meaning for the users of the search interface. It is this immediate evocative meaning, rather than the precisely defined semantics, that will end up driving user search behavior and, in the end, determining the efficacy of the vocabulary as an applied tool. We will outline our experiences in developing and applying these vocabularies within the context of geoscience education and explore how the broader themes that emerge can inform the development and use of ontologies throughout Earth and space science.
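The "only relevant terms are exposed" behavior described above can be sketched in a few lines. The resource records, terms, and vocabulary below are invented examples, not SERC's actual data or implementation.

```python
# Hedged sketch of faceted browsing: only vocabulary terms that actually
# occur in the current result set are shown as facets to the user.
resources = [
    {"title": "Mid-ocean ridge basalts lab", "terms": {"mid-ocean ridge"}},
    {"title": "Passive margin stratigraphy", "terms": {"passive margin"}},
    {"title": "Craton stability exercise", "terms": {"craton", "passive margin"}},
]

vocabulary = ["mid-ocean ridge", "passive margin", "craton", "radionuclides"]

def visible_facets(results, vocab):
    """Return only the vocabulary terms present in the current results."""
    used = set().union(*(r["terms"] for r in results))
    return [term for term in vocab if term in used]

# After filtering to resources tagged with "passive margin":
filtered = [r for r in resources if "passive margin" in r["terms"]]
print(visible_facets(filtered, vocabulary))   # ['passive margin', 'craton']
```

Terms with no matching resources (here, "radionuclides") never appear, so the facet list itself communicates the scope of the collection.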
The Rosetta Model: Can the Different Physical Science Data Models be Reconciled?
There are a variety of data models in the physical sciences, some of which cover overlapping domains. Each data model has been derived in a different way: some are based on formal ontologies, others on informal ontologies, and others on relational schemas. An additional complication is that different international agencies have divided the physical science domains into different sub-domains, leading to some confusion as to which data model to adopt. The most prevalent data models in use today are the Planetary Data System (PDS), Space Physics Archive Search and Extract (SPASE), the Virtual Solar-Terrestrial Observatory (VSTO), the International Virtual Observatory Alliance (IVOA), and the Global Change Master Directory (GCMD). We take a comparative look at the various data models and ask the questions: Can they be reconciled? Is it possible to have a Rosetta Model to translate between each of the models? What role can ontologies play in defining a Rosetta Model?
Relationship-Centric Ontology Integration
Informal ontologies can result from the successive extraction of RDF (Resource Description Framework) from XML and then OWL (Web Ontology Language) from RDF. This two-stage extraction affords separate and distinct opportunities for the development of an integrated ontology. Working directly with RDF, the resulting integrated informal ontologies are shaped heavily by relationships. The RDF-centric approach also allows inconsistencies and redundancies to be resolved. In contrast, the class/property/individual bias inherent in informal ontologies represented via OWL affords a very different, more traditional perspective on ontology integration. Using a semantic framework developed for the Global Geodynamics Project (GGP), we illustrate the RDF-centric approach to ontology integration. Although this approach is, on balance, effective and efficient, the incorporation of feature-based annotations illustrates how integrated ontologies may challenge the computational completeness and decidability of the resulting OWL representation.
A Prototype Ontology Tool and Interface for Coastal Atlas Interoperability
While significant capacity has been built in the field of web-based coastal mapping and informatics in the last decade, little has been done to take stock of the implications of these efforts or to identify best practice by taking lessons learned into consideration. This study reports on the second of two transatlantic workshops that bring together key experts from Europe, the United States, and Canada to examine state-of-the-art developments in coastal web atlases (CWA), based on web-enabled geographic information systems (GIS), along with future needs in mapping and informatics for the coastal practitioner community. While multiple benefits are derived from these tailor-made atlases (e.g. speedy access to multiple sources of coastal data and information; economic use of time by avoiding individual contact with different data holders), the potential exists to derive added value from the integration of disparate CWAs, to optimize decision-making at a variety of levels and across themes. The second workshop focused on the development of a strategy to make coastal web atlases interoperable by way of controlled vocabularies and ontologies. The strategy is based on a web service oriented architecture and an implementation of Open Geospatial Consortium (OGC) web services, such as Web Feature Service (WFS) and Web Map Service (WMS). Atlases publish Catalogue Services for the Web (CSW) using ISO 19115 metadata and controlled vocabularies encoded as Uniform Resource Identifiers (URIs). URIs allow the terminology of each atlas to be uniquely identified and facilitate the mapping of terminologies using semantic web technologies. A domain ontology was also created to formally represent coastal erosion terminology as a use case, with a test linkage of those terms between the Marine Irish Digital Atlas and the Oregon Coastal Atlas.
A web interface is being developed to discover coastal hazard themes in distributed coastal atlases as part of a broader International Coastal Atlas Network (ICAN). Lessons learned from this prototype will help build regional atlases and improve decision support systems. http://workshop1.science.oregonstate.edu
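The URI-based term mapping this abstract describes can be sketched very compactly. All URIs and term labels below are invented for illustration; real mappings would come from the atlases' published controlled vocabularies.

```python
# Hedged sketch: two atlases label concepts in their own terminology but
# ground them in shared URIs, so terms can be mapped across atlases by URI.
MIDA_TERMS = {"coastal erosion": "http://example.org/ican#CoastalErosion"}
OCA_TERMS = {"shoreline retreat": "http://example.org/ican#CoastalErosion"}

def equivalent_terms(label, source, target):
    """Find target-atlas labels that share a URI with a source-atlas term."""
    uri = source.get(label)
    return [t for t, u in target.items() if u == uri]

print(equivalent_terms("coastal erosion", MIDA_TERMS, OCA_TERMS))
# ['shoreline retreat']
```

Because each atlas keeps its own natural-language labels while agreeing only on URIs, a federated search interface can query all atlases with one concept and translate the results back into each community's vocabulary.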