IN52A-01
Coordinating Communities and Building Governance in the Development of Schematic and Semantic Standards: the Key to Solving Global Earth and Space Science Challenges in the 21st Century.
The Information Age in Science is being driven partly by the data deluge, as exponentially growing volumes of data are being generated by research. Such large volumes of data cannot be effectively processed by humans, and efficient and timely processing by computers requires the development of specific machine-readable formats. Further, as key challenges in earth and space sciences, such as climate change, hazard prediction and sustainable development of resources, require a cross-disciplinary approach, data from various domains will need to be integrated from globally distributed sources, also via machine-to-machine formats. However, it is becoming increasingly apparent that the existing standards can be very domain-specific and that most existing data transfer formats require human intervention. Where groups from different communities do try to combine data across domain or discipline boundaries, much time is spent reformatting and reorganizing the data; it is conservatively estimated that this can take 80% of a project's time and resources. Four different types of standards are required for machine-to-machine interaction: systems, syntactic, schematic and semantic. Standards at the systems level (WMS, WFS, etc.) and at the syntactic level (GML, Observation and Measurement, SensorML) are being developed through international standards bodies such as ISO, OGC, W3C and IEEE. In contrast, standards at the schematic level (e.g., GeoSciML, LandslidesML, WaterML, QuakeML) and at the semantic level (i.e., ontologies and vocabularies) are currently developing rapidly, in a very uncoordinated way and with little governance. As the size of the community that can machine-read each other's data depends on the size of the community that has developed the schematic or semantic standards, it is essential that, to achieve global integration of earth and space science data, the required standards be developed through international collaboration using accepted standard procedures.
Once developed, the standards also require some form of governance to maintain and then extend them as the science evolves to meet new challenges. A standard that does have some governance is GeoSciML, a data transfer standard for geoscience map data. GeoSciML is currently being developed by a consortium of 7 countries under the auspices of the Commission for the Management and Application of Geoscience Information (CGI), a commission of the International Union of Geological Sciences. Perhaps other 'ML' or ontology and vocabulary development 'teams' need to look to their international domain-specific specialty societies for endorsement and governance. But the issue goes beyond Earth and Space Sciences, as increasingly cross- and intra-disciplinary science requires machine-to-machine interaction with other science disciplines such as physics, chemistry and astronomy. For example, for geochemistry do we develop GeochemistryML or do we extend the existing Chemical Markup Language? Again, the question is who will coordinate the development of the schematic and semantic standards that underpin machine-to-machine global integration of science data. Is this a role for ICSU, for CODATA, or for some other body? In order to address this issue, Geoscience Australia and CSIRO established the Solid Earth and Environmental Grid Community website to enable communities to 'advertise' standards development and to provide a community TWIKI where standards can be developed in a globally 'open' environment. http://www.seegrid.csiro.au
IN52A-02
Building Community Around Hydrologic Data Models Within CUAHSI
The Consortium of Universities for the Advancement of Hydrologic Science, Inc (CUAHSI) has a Hydrologic Information Systems project which aims to provide better data access and capacity for data synthesis for the nation's water information, both that collected by academic investigators and that collected by water agencies. These data include observations of streamflow, water quality, groundwater levels, weather and climate and aquatic biology. Each water agency or research investigator has a unique method of formatting their data (syntactic heterogeneity) and describing their variables (semantic heterogeneity). The result is a large agglomeration of data in many formats and descriptions whose full content is hard to interpret and analyze. CUAHSI is helping to resolve syntactic heterogeneity through the development of WaterML, a standard XML markup language for communicating water observations data through web services, and a standard relational database structure for archiving data called the Observations Data Model. Variables in these data archiving and communicating systems are indexed against a controlled vocabulary of descriptive terms to provide the capacity to synthesize common data types from disparate data sources. http://www.cuahsi.org/his.html
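The indexing of variables against a controlled vocabulary, as described in the abstract above, can be illustrated with a minimal Python sketch. The vocabulary entries, source labels, and function names are hypothetical and are not taken from the CUAHSI vocabulary, WaterML, or the Observations Data Model; the sketch only shows how mapping source-specific labels to shared terms lets records from disparate sources be grouped.

```python
# Hypothetical controlled vocabulary: each entry maps a source-specific
# variable label to one shared descriptive term. None of these labels or
# terms are taken from the actual CUAHSI vocabulary.
CONTROLLED_VOCABULARY = {
    "discharge": "Streamflow",
    "q_cfs": "Streamflow",
    "stream flow": "Streamflow",
    "gw level": "Groundwater level",
    "depth to water": "Groundwater level",
}


def normalize(label):
    """Canonical form of a label: lowercase, with whitespace runs collapsed."""
    return " ".join(label.lower().split())


def to_controlled_term(label):
    """Return the shared term for a source label, or None if it is unmapped."""
    return CONTROLLED_VOCABULARY.get(normalize(label))


def group_by_term(records):
    """Group (source, label, value) records by their controlled term."""
    grouped = {}
    for source, label, value in records:
        term = to_controlled_term(label)
        if term is None:
            continue  # unmapped labels are skipped; a curator would review them
        grouped.setdefault(term, []).append((source, value))
    return grouped
```

In this sketch, records labelled "Discharge" by one source and "Q_cfs" by another end up under the same term, which is the kind of synthesis the controlled vocabulary is meant to enable.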
IN52A-03
Community-Driven Initiatives to Achieve Interoperability for Ecological and Environmental Data
Advances in ecology and environmental science increasingly depend on information from multiple disciplines to tackle broader and more complex questions about the natural world. Such advances, however, are hindered by data heterogeneity, which impedes the ability of researchers to discover, interpret, and integrate relevant data that have been collected by others. Here, we outline two community-building initiatives for improving data interoperability in the ecological and environmental sciences, one that is well-established (the Ecological Metadata Language [EML]), and another that is actively underway (a unified model for observations and measurements). EML is a metadata specification developed for the ecology discipline, and is based on prior work done by the Ecological Society of America and associated efforts to ensure a modular and extensible framework to document ecological data. Each EML "module" is designed to describe one logical part of the total metadata that should be included with any ecological dataset. EML was developed through a series of working meetings, ongoing discussion forums and email lists, with participation from a broad range of ecological and environmental scientists, as well as computer scientists and software developers. Where possible, EML adopted syntax from metadata standards for other disciplines (e.g., Dublin Core, Content Standard for Digital Geospatial Metadata, and more). Although EML has not yet been ratified through a standards body, it has become the de facto metadata standard for a large range of ecological data management projects, including those of the Long Term Ecological Research Network, the National Center for Ecological Analysis and Synthesis, and the Ecological Society of America. The second community-building initiative is based on work through the Scientific Environment for Ecological Knowledge (SEEK) as well as a recent workshop on multi-disciplinary data management.
This initiative aims at improving interoperability by describing the semantics of data at the level of observation and measurement (rather than the traditional focus at the level of the data set) and will define the necessary specifications and technologies to facilitate semantic interpretation and integration of observational data for the environmental sciences. As such, this initiative will focus on unifying the various existing approaches for representing and describing observation data (e.g., SEEK's Observation Ontology, CUAHSI's Observation Data Model, NatureServe's Observation Data Standard, to name a few). Products of this initiative will be compatible with existing standards and build upon recent advances in knowledge representation (e.g., W3C's recommended Web Ontology Language, OWL) that have demonstrated practical utility in enhancing scientific communication and data interoperability in other communities (e.g., the genomics community). A community-sanctioned, extensible, and unified model for observational data will support metadata standards such as EML while reducing the "babel" of scientific dialects that currently impede effective data integration, which will in turn provide a strong foundation for enabling cross-disciplinary synthetic research in the ecological and environmental sciences.
IN52A-04 INVITED
The International Polar Year Data and Information Service—Building a network of sharing, trust, and meaning
Arctic science is inherently interdisciplinary and there is a national and international imperative to understand the Arctic region as a system. This emphasis on interdisciplinary research requires scientists to extend their professional networks across disciplines. Researchers need to access, understand, and assess data and information outside their field where they may not have the relevant disciplinary expertise, including knowledge of core assumptions and metaphors. Correspondingly, data providers need to make their data understandable and usable by new users with different knowledge bases. While there are many technical barriers to cross-disciplinary data sharing, the fundamental issues lie in the human interactions and networks of scientific interaction. The International Polar Year 2007-2008 (IPY) is an intensive burst of coordinated scientific activity involving tens of thousands of investigators from 63 countries. It also seeks to create a sustained legacy of international and interdisciplinary cooperation, notably through enhanced polar observing systems. As such, IPY provides an opportunity for focused investigation of how interdisciplinary scientific and data sharing networks are created, extended, and modified. The International Polar Year Data and Information Service (IPYDIS), an international federation of data centers, archives, and networks working to ensure proper stewardship of IPY and related data, is one example of a formal network developing out of IPY. The IPYDIS has a semi-formal structure of governance, but IPY is limited in time. Methods for formal support and extension of the IPYDIS through other international collaborations will be discussed. Ultimately, however, the network is sustained through mechanisms that increase data sharing (technically and socially), establish and reinforce trust (between investigators and of data sources), and codify meanings both formally (ontologies, vocabularies, etc.) and informally (e.g., concepts of quality, incorporation of traditional knowledge). http://ipydis.org
IN52A-05 INVITED
Leverage and Delegation in Developing an Information Model for Geology
GeoSciML is an information model and XML encoding developed by a group of primarily geologic survey organizations under the auspices of the IUGS CGI. The scope of the core model broadly corresponds with information traditionally portrayed on a geologic map, viz. interpreted geology, some observations, the map legend and accompanying memoir. The development of GeoSciML has followed the methodology specified for an Application Schema defined by OGC and ISO 19100 series standards. This requires agreement within a community concerning their domain model, its formal representation using UML, documentation as a Feature Type Catalogue, and an XML Schema implementation generated from the model by applying a rule-based transformation. The framework and technology support a modular governance process. Standard datatypes and GI components (geometry, the feature and coverage metamodels, metadata) are imported from the ISO framework. The observation and sampling model (including boreholes) is imported from OGC. The scale used for most scalar literal values (terms, codes, measures) allows for localization where necessary. Wildcards and abstract base-classes provide explicit extensibility points. Link attributes appear in a regular way in the encodings, allowing reference to external resources using URIs. The encoding is compatible with generic GI data-service interfaces (WFS, WMS, SOS). For maximum interoperability within a community, the interfaces may be specialised through domain-specified constraints (e.g. feature-types, scale and vocabulary bindings, query-models). Formalization using UML and XML allows use of standard validation and processing tools. Use of upper-level elements defined for generic GI application reduces the development effort and governance responsibility, while maximising cross-domain interoperability.
On the other hand, enabling specialization to be delegated in a controlled manner is essential to adoption across a range of subdisciplines and jurisdictions. The GeoSciML design team is responsible only for the part of the model that is unique to geology but for which general agreement can be reached within the domain. This paper is presented on behalf of the Interoperability Working Group of the IUGS Commission for Geoscience Information (CGI) - follow web-link for details of the membership. http://www.seegrid.csiro.au/twiki/bin/view/CGIModel/GeoSciML
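The remark above that link attributes let an encoding refer to external resources by URI can be illustrated with a short Python sketch. The element names, the `urn:example:` namespace, and the vocabulary URIs below are invented for illustration and are not GeoSciML element names; only the XLink namespace URI is the standard W3C one. The sketch assumes the link attribute is `xlink:href`, as is common in GML-based encodings.

```python
import xml.etree.ElementTree as ET

# Standard W3C XLink namespace, in ElementTree's {namespace}name notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical document: element names and URIs are illustrative only.
SAMPLE = """<Unit xmlns="urn:example:geology"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <lithology xlink:href="urn:example:vocabulary:lithology:granite"/>
  <age xlink:href="urn:example:vocabulary:age:permian"/>
  <name>Example Granite</name>
</Unit>"""


def external_references(xml_text):
    """Return (element local name, URI) pairs for every xlink:href found."""
    root = ET.fromstring(xml_text)
    references = []
    for element in root.iter():  # document order, root included
        href = element.get(XLINK_HREF)
        if href is not None:
            local_name = element.tag.rsplit("}", 1)[-1]  # drop the namespace
            references.append((local_name, href))
    return references
```

Because the link attribute appears in the same way on every element that uses it, a generic tool can collect all external references without knowing the domain model, which is one practical benefit of the regular encoding.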
IN52A-06 INVITED
Building Community and Governance of Metadata and Ontologies Within the Marine Community
For three years the Marine Metadata Interoperability (MMI) Project has been explicitly building a metadata community for the marine sciences. Toward that goal, the organization identified technical resources, developed tools, provided guidance, held workshops, gave scores of presentations, led and participated in interoperability demonstrations, and contributed to standards development activities. All of this information has been presented on the organization's web site, and is used to increase awareness and participation of the community. As a successful community-building project with many accomplishments to date, MMI is keenly aware of the opportunities -- and the need -- for further progress. In this talk, we will frankly present the successes, challenges, and lessons of the project to date; consider MMI in the context of like-minded organizations; consider opportunities for MMI and similar organizations to achieve semantic interoperability objectives; and envision a more thoroughly collaborative and effective marine science community. Finally, with this background in mind, the presentation will discuss how best to manage the standards and ontologies needed for earth science data systems interoperability. http://marinemetadata.org/agucommunitybuilding
IN52A-07
The CF Conventions: Governance and Community Issues in Establishing Standards for Representing Climate, Forecast, and Observational Data
The Climate and Forecast (CF) conventions governing metadata have become important to earth system science communities as a standard way of capturing the meaning of multidimensional data and the intent of data providers. The CF Conventions have proved useful for comparing conforming data from different sources and for unambiguously determining the space-time location of data. Originally developed and maintained by a small group in the climate modeling community, the CF Conventions have recently transitioned to a community governance structure. We discuss the process of evolving the development, maintenance, and community governance of an international data standard, as well as successes, challenges, and issues in maintaining and scaling the standard to broader uses through an open process. http://www.cfconventions.org/
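As an illustration of how conventions of this kind make the time location of data unambiguous, the following Python sketch decodes a time axis whose units follow the CF pattern "<unit> since <reference time>". It is a simplified sketch, not a CF-compliant decoder: it assumes the standard Gregorian calendar, ignores time zones and the other calendars that CF permits (for example noleap or 360_day), and the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Length of each supported unit in seconds (a subset of what CF allows).
SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def decode_cf_time(offsets, units):
    """Convert numeric offsets to datetimes using a '<unit> since <ref>' string.

    Assumes a Gregorian calendar and an ISO-style reference time such as
    '2000-01-01' or '2000-01-01 00:00:00'.
    """
    unit, separator, reference = units.partition(" since ")
    if not separator or unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unsupported units string: {units!r}")
    origin = datetime.fromisoformat(reference.strip())
    scale = SECONDS_PER_UNIT[unit]
    return [origin + timedelta(seconds=offset * scale) for offset in offsets]
```

For example, `decode_cf_time([0, 1.5], "days since 2000-01-01 00:00:00")` yields midnight on 1 January 2000 and noon on 2 January 2000, so two datasets that use different reference times can still be placed on a common time axis.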
IN52A-08
NOAA's Approach to Community Building and Governance for Data Integration and Standards Within IOOS
This presentation will review NOAA's current approach to the Integrated Ocean Observing System (IOOS) at a national and regional level within the context of our United States Federal and Non-Federal partners. Further, it will discuss the context of integrating data and the necessary standards definition that must be done not only within the United States but in a larger global context. IOOS is the U.S. contribution to the Global Ocean Observing System (GOOS), which itself is the ocean contribution to the Global Earth Observation System of Systems (GEOSS). IOOS is a nationally important network of distributed systems that forms an infrastructure providing many different users with the diverse information they require to characterize, understand, predict, and monitor changes in dynamic coastal and open ocean environments. NOAA recently established an IOOS Program Office to provide a focal point for its ocean observation programs and assist with coordination of regional and national IOOS activities. One of the Program's initial priorities is the development of a data integration framework (DIF) proof-of-concept for IOOS data. The initial effort will focus on NOAA sources of data and be implemented incrementally over the course of three years. The first phase will focus on the integration of five core IOOS variables being collected, and disseminated, for independent purposes and goals by multiple NOAA observing sources. The goal is to ensure that data from different sources is interoperable to enable rapid and routine use by multiple NOAA decision-support tool developers and other end users. During the second phase we expect to ingest these integrated variables into four specific NOAA data products used for decision-support. Finally, we will systematically test and evaluate enhancements to these products, and verify, validate, and benchmark new performance specifications. 
The outcome will be an extensible product for operational use that allows for broader community applicability to include additional variables, applications, and non-NOAA sources of data. NOAA is working with Ocean.US to implement an interagency process for the submission, proposal, and recommendation of IOOS data standards. In order to achieve the broader goals of data interoperability of GEOSS, communication of this process and the identified standards needs to be coordinated at the international level. NOAA is participating in the development of a series of IODE workshops with the objective of achieving broad agreement on, and commitment to, ocean data management and exchange standards. The first of these meetings will use the five core variables identified by the NOAA DIF as a focus.
IN52A-09
Community-Based Development of Standards for Geochemical and Geochronological Data
The Geoinformatics for Geochemistry (GfG) Program (www.geoinfogeochem.org) and the EarthChem project (www.earthchem.org) aim to maximize the application of geochemical data in Geoscience research and education by building a new advanced data infrastructure for geochemistry that facilitates the compilation, communication, serving, and visualization of geochemical data and their integration with the broad Geoscience data set. Building this new data infrastructure poses substantial challenges that are primarily cultural in nature, and require broad community involvement in the development and implementation of standards for data reporting (e.g., metadata for analytical procedures, data quality, and analyzed samples), data publication, and data citation to achieve broad acceptance and use. Working closely with the science community, with professional societies, and with editors and publishers, recommendations for standards for the reporting of geochemical and geochronological data in publications and to data repositories have been established, which are now under consideration for adoption in journal and agency policies. The recommended standards are aligned with the GfG and EarthChem data models as well as the EarthChem XML schema for geochemical data. Through partnerships with other national and international data management efforts in geochemistry and in the broader marine and terrestrial geosciences, GfG and EarthChem seek to integrate their development of geochemical metadata standards, data format, and semantics with relevant existing and emerging standards and ensure compatibility and compliance. http://www.geoinfogeochem.org, http://www.earthchem.org
IN52A-10
QuakeML: XML for Seismological Data Exchange and Resource Metadata Description
QuakeML is an XML-based data exchange format for seismology that is under development. Current collaborators are from ETH, GFZ, USC, USGS, IRIS DMC, EMSC, ORFEUS, and ISTI. QuakeML development was motivated by the lack of a widely accepted and well-documented data format that is applicable to a broad range of fields in seismology. The development team brings together expertise from communities dealing with analysis and creation of earthquake catalogs, distribution of seismic bulletins, and real-time processing of seismic data. Efforts to merge QuakeML with existing XML dialects are under way. The first release of QuakeML will cover a basic description of seismic events including picks, arrivals, amplitudes, magnitudes, origins, focal mechanisms, and moment tensors. Further extensions are in progress or planned, e.g., for macroseismic information, location probability density functions, slip distributions, and ground motion information. The QuakeML language definition is supplemented by a concept to provide resource metadata and facilitate metadata exchange between distributed data providers. For that purpose, we introduce unique, location-independent identifiers of seismological resources. As an application of QuakeML, ETH Zurich is currently developing a Python-based seismicity analysis toolkit as a contribution to CSEP (Collaboratory for the Study of Earthquake Predictability). We follow a collaborative and transparent development approach along the lines of the procedures of the World Wide Web Consortium (W3C). QuakeML is currently in working draft status. The standard description will be subjected to a public Request for Comments (RFC) process and eventually reach the status of a recommendation. QuakeML can be found at http://www.quakeml.org. http://www.quakeml.org
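The idea of unique, location-independent identifiers mentioned above can be sketched in Python as a small registry that separates what a resource is called from where copies of it are held. The class, the identifier string, and the URLs are all hypothetical; this is not the QuakeML identifier syntax or any QuakeML software, only an illustration of the concept under those assumptions.

```python
class ResourceRegistry:
    """Maps location-independent identifiers to the places that hold a copy."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Record that `url` serves the resource named by `identifier`."""
        urls = self._locations.setdefault(identifier, [])
        if url not in urls:  # registering the same location twice is a no-op
            urls.append(url)

    def resolve(self, identifier):
        """Return all known locations for an identifier (empty if unknown)."""
        return list(self._locations.get(identifier, []))


# Hypothetical identifier and provider URLs, for illustration only.
registry = ResourceRegistry()
registry.register("example:agency/event/0001", "https://provider-a.example.org/ev/0001")
registry.register("example:agency/event/0001", "https://provider-b.example.org/events?id=0001")
```

Because the identifier does not encode a host or path, a data provider can move or mirror a resource without invalidating references to it held by other distributed providers.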