U13D-01 INVITED
The Research Library and the E-Science Challenge: New Roles Building on Expanding Responsibilities in Service of the Science Community
Research libraries provide a set of core services to the scholarly and educational communities. These include information acquisition, synthesis, navigation, discovery, dissemination, interpretation, presentation, understanding and archiving. Researchers across the science disciplines and increasingly in multidisciplinary projects are producing massive amounts of data, and they seek the infrastructure, the strategies and the partnerships that will enable rigorous and sustained tools for extraction, distribution, collaboration, application and permanent availability. This paper will address the role of the research library from three perspectives. First, the view of scientific datasets as information assets that would benefit from traditional library collection development practice will be explored. Second, the agenda on e-science developed by the Association of Research Libraries will be outlined with a focus on the need for policy and standards development, for resources assessment and allocation, for new approaches to the preparation of the library professional, and for library leadership in campus planning and innovative collaborations for research cyberinfrastructure. And third, the responses to the call for proposals from the National Science Foundation's DataNet program will be analyzed and the role of the research library in these project plans will be summarized as an indicator of the expanding responsibility of the library for research data stewardship.
U13D-02 INVITED
The Role of NOAA's National Data Centers in the Earth and Space Science Infrastructure
NOAA's National Data Centers (NNDC) provide access to long-term archives of environmental data from
NOAA and other sources. The NNDCs face significant challenges in the volume and complexity of modern
data sets. Data volume challenges are being addressed using more capable data archive systems such as
the Comprehensive Large Array-Data Stewardship System (CLASS). Assuring data quality
and stewardship is in many ways more difficult. In the past, scientists at the Data Centers could
provide reasonable stewardship of data sets in their area of expertise. As staff levels have decreased and
data complexity has increased, Data Centers depend on their data providers and user communities to
provide high-quality metadata, feedback on data problems and improvements. This relationship requires
strong partnerships between the NNDCs and academic, commercial, and international partners, as well as
advanced data management and access tools that conform to established international standards when
available. The NNDCs are looking to geospatial databases, interactive mapping, web services, and other
Application Programming Interface (API) approaches to help preserve NNDC data and information and to make them easily
available to the scientific community.
http://www.ngdc.noaa.gov
U13D-03 INVITED
Re-inventing Data Libraries: Ensuring Continuing Access To Curated (Value-added) Data
How many years of inexperience do we need in using, and in particular sharing, digital data generated by others? That history pre-dates, but must also gain leverage from, the emergence of the digital library. Much of this sharing was done within research groups but recent attention to spatial data infrastructure highlights the importance of achieving several 'right mixes': * between Internet standards, geo-specific referencing, and domain-specific vocabulary (cf. ontology); * between attention to user-focused services and machine-to-machine interoperability; * between the demands of current high-quality services, the practice of data curation, and the need for long-term preservation. This presentation will draw upon ideas and experience of data library services in research universities, a national (UK) academic data centre, and developments in digital curation. It will be argued that the 1980s term 'data library' has some polemic value in that we have yet to learn what it means to 'do library' for data: more than "a bit like inter-galactic library loan", perhaps. Illustration will be drawn from a multi-faceted database of digitized boundaries (UKBORDERS), through the first Internet map delivery of national mapping agency data (Digimap), to strategic positioning to help geo-enable academic and scientific data and so enhance research (in the UK, in Europe, and beyond).
U13D-04
Developing Archive Information Packages for Data Sets: Early Experiments with Digital Library Standards
The key to interoperability between systems is often metadata, yet metadata standards in the digital library and data center communities have evolved separately. In the data center world NASA's Directory Interchange Format (DIF), the Content Standard for Digital Geospatial Metadata (CSDGM), and most recently the international Geographic Information: Metadata (ISO 19115:2003) are used for descriptive metadata at the data set level to allow catalog interoperability; but use of anything other than repository-based metadata standards for the individual files that comprise a data set is rare, making true interoperability, at the data rather than data set level, across archives difficult. While the Open Archival Information System (OAIS) Reference Model with its call for creating Archive Information Packages (AIP) containing not just descriptive metadata but also preservation metadata is slowly being adopted in the community, the PREservation Metadata Implementation Strategies (PREMIS) standard, the only extant OAIS-compliant preservation metadata standard, has scarcely even been recognized as being applicable to the community. The digital library community in the meantime has converged upon the Metadata Encoding and Transmission Standard (METS) for interoperability between systems as evidenced by support for the standard by digital library systems such as Fedora and Greenstone. METS is designed to allow inclusion of other XML-based standards as descriptive and administrative metadata components. A recent Stanford study suggests that a combination of METS with included FGDC and PREMIS metadata could work well for individual granules of a data set. However, some of the lessons learned by the data center community over the last 30+ years of dealing with digital data are 1) that data sets as a whole need to be preserved and described and 2) that discovery and access mechanisms need to be hierarchical.
Only once a user has reviewed a data set description and determined that these data are useful for their purposes is it appropriate to search for granules that meet specific search criteria. The work described here is an initial attempt to bridge these two disparate communities' metadata standards in a manner supportive of this need for hierarchical discovery and access. One component of the work demonstrates the effort required to develop METS-compliant metadata from granule metadata held in NASA's Earth Observing System (EOS) Data and Information System (EOSDIS) Core System (ECS) for inclusion in complete granule-level AIPs for HDF5-formatted data. Another component demonstrates the feasibility of developing METS metadata for a data set as a whole.
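The hierarchical discovery pattern described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use METS, PREMIS, or ECS; the class names, fields, and catalog entries are all hypothetical and stand in for data set level and granule level metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Granule:
    """Hypothetical granule-level record (one file of a data set)."""
    granule_id: str
    year: int


@dataclass
class DataSet:
    """Hypothetical data-set-level record with its granules attached."""
    title: str
    description: str
    granules: List[Granule] = field(default_factory=list)


def find_data_sets(catalog: List[DataSet], keyword: str) -> List[DataSet]:
    """Step 1: search only the data-set-level descriptions."""
    needle = keyword.lower()
    return [ds for ds in catalog if needle in ds.description.lower()]


def find_granules(data_set: DataSet, year: int) -> List[Granule]:
    """Step 2: search granules inside one data set already judged relevant."""
    return [g for g in data_set.granules if g.year == year]


# Illustrative catalog; every entry is invented for this sketch.
catalog = [
    DataSet("Sea ice extent", "Daily sea ice concentration grids",
            [Granule("ice-2006-001", 2006), Granule("ice-2007-001", 2007)]),
    DataSet("Snow cover", "Weekly snow cover maps",
            [Granule("snow-2007-001", 2007)]),
]

matches = find_data_sets(catalog, "sea ice")   # data set level first
selected = find_granules(matches[0], 2007)     # then granule level
```

Keeping the two search steps as separate functions mirrors the point made above: granule search criteria are applied only after a data set description has been reviewed and accepted.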
U13D-05
Archiving Data to Facilitate its use in Education
The scientific data collected by research programs funded by the government belong to the public. As such,
it is the responsibility of the scientific and technical communities to make scientific data accessible and
usable by the educational community. However, many geoscience data are difficult for educators and
students to find and use. They are generally described by metadata that are narrowly focused, challenging
educators and researchers in other fields to determine if the dataset is relevant to their needs, and to
effectively access and use the data. Two strands of work directly address this issue. First, recommendations
have been developed to implement 1) educationally relevant review criteria for data-rich Web sites
(http://serc.carleton.edu/usingdata/site_criteria.html), and 2) educationally relevant metadata for datasets
called DataSheets (http://serc.carleton.edu/usingdata/browse_sheets.html). These recommendations
[Ledley et al., 2008] address data sites that are not themselves educational activities but are
intended to help educators and others easily find data sets, determine their relevance to their needs, and
learn how to access them. Second, a model for bridging the scientific and educational communities to develop
robust inquiry-based activities using scientific datasets in the form of Earth Exploration Toolbook (EET,
http://serc.carleton.edu/eet) chapters has been developed. This model involves working directly with small
teams made up of data providers from large scientific data archives, data analysis tool specialists, scientists,
curriculum developers, and educators (AccessData, http://serc.carleton.edu/usingdata/accessdata).
In this presentation we will 1) present the educationally relevant review criteria and metadata for data sets as
a form of curation of the data for broad use and 2) describe the AccessData workshops as a
model for facilitating collaboration between scientists and educators.
Ledley, T. S., A. Prakash, C. Manduca, and S. Fox (2008), Recommendations for Making Geoscience Data
Accessible and Usable in Education, Eos, 89(32), 291 (DOI: 10.1029/2008EO320003).
http://www.agu.org/journals/eo/eo0832/2008EO320003.pdf#anchor
U13D-06
Preserving the Context of Science Data
Preserving any type of digital information requires preserving both the "bits" comprising the information, and sufficient context (metadata) to support interpreting the bits in the future. Unfortunately, this context is often implicit or embedded in organizations (e.g., communities of practice) or artifacts (e.g., computing platforms) that are not as survivable as the information itself. Therefore, digital preservation must explicitly preserve context. Two necessary components of digital scientific information context are formats and provenance. Formats describe the syntax and low-level semantics of digital information objects (e.g., files). The library community has promulgated format registries (e.g., PRONOM, GDFR, digitalpreservation.gov) that allow archival objects to refer to format definitions using standardized persistent identifiers. Format registries maintain this context separately from the information that references it, but make no archival guarantees about the context's survival. Meanwhile, the scientific community has focused on capturing the provenance of scientific information, typically as a formal workflow specification of the processing steps that created the information. Unfortunately, there is as yet no standard for scientific workflows, nor any guarantee that a specification that can reproduce information is sufficient for understanding it. We describe new technologies that may prove a better fit for preserving scientific information context. The National Geospatial Digital Archive (NGDA) data model represents formats as archival objects containing specifications, software implementations, and other documentation. A format registry is simply an archive that happens to hold archival objects representing formats. Both format and provenance relationships are represented by typed references.
Any archival object may reference any other object for its interpretation: the referenced object may be a "file format" object or an object containing dataset documentation, and may reside in the same archive or in another. Cross-archive references capture whole-archive dependencies (summarized by whole-archive descriptors located at the root of each archive), allowing us to describe the familiar situation of an entire archive referencing a format registry, or a source data center. We describe as a case study the archiving of the Earth science data records (ESDRs) being produced by the UCSB NASA-funded Ocean MEaSUREs project. The data's context includes complex formats, scientific literature, and software (both commercial and locally-developed). The data's provenance includes dependencies on multiple versions, parameter settings, and satellite data sources. By addressing how much context is required to preserve these data, we hope to begin to answer the question: What does it mean for a library to assume responsibility for a science dataset?
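The idea of typed references between archival objects, including references that cross archive boundaries, can be sketched in a few lines of Python. This is an illustration only, not the NGDA data model itself: the "archive:name" identifier scheme, the reference type labels, and the example objects are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ArchivalObject:
    """Hypothetical archival object; ids use an assumed 'archive:name' form."""
    object_id: str
    # Each reference is (reference_type, target_object_id).
    references: List[Tuple[str, str]] = field(default_factory=list)

    @property
    def archive(self) -> str:
        return self.object_id.split(":", 1)[0]


def context_closure(start_id: str, objects: Dict[str, ArchivalObject]) -> Set[str]:
    """Collect every object reachable from start_id through typed references."""
    seen: Set[str] = set()
    stack = [start_id]
    while stack:
        current = stack.pop()
        for _ref_type, target in objects[current].references:
            if target != start_id and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def archive_dependencies(archive: str, objects: Dict[str, ArchivalObject]) -> Set[str]:
    """Summarize which other archives an archive depends on."""
    deps: Set[str] = set()
    for obj in objects.values():
        if obj.archive != archive:
            continue
        for _ref_type, target in obj.references:
            target_archive = target.split(":", 1)[0]
            if target_archive != archive:
                deps.add(target_archive)
    return deps


# Invented example: a data record, its documentation, a format, and a source.
objects = {o.object_id: o for o in [
    ArchivalObject("data:chl_2008", [("format", "registry:hdf"),
                                     ("provenance", "source:l2_swath"),
                                     ("documentation", "data:algorithm_doc")]),
    ArchivalObject("data:algorithm_doc"),
    ArchivalObject("source:l2_swath", [("format", "registry:hdf")]),
    ArchivalObject("registry:hdf"),
]}

needed = context_closure("data:chl_2008", objects)
depends_on = archive_dependencies("data", objects)
```

Here `needed` lists the objects that would have to survive for the data record to stay interpretable, and `depends_on` plays the role of a whole-archive descriptor naming the format registry and the source data center.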
U13D-07
Persistent Identifiers in the Publication and Citation of Scientific Data - Theory and Practice
In the last decade, data-driven research has become a third pillar of scientific work alongside theoretical
reasoning and experiment. Greatly increased computing power and storage, together with web services and
other electronic resources, have facilitated a quantum leap in new research based on the analysis of great
amounts of data. However, traditional scientific communication is only slowly moving to new media that go beyond
an emulation of paper. This leaves many data inaccessible and, in the long run, exposes valuable data to the
risk of loss.
To improve access to data and to create incentives for scientists to make their data accessible, a group of
German data centres initiated the project "Publication and Citation of Scientific Data" (STD-DOI) which was
funded by the German Science Foundation DFG for the periods 2003-2005 and 2006-2008. In this project
the German National Library of Science and Technology (TIB Hannover), together with the German
Research Centre for Geosciences (GFZ Potsdam), Alfred Wegener Institute for Polar and Marine Research
(AWI) Bremerhaven, University of Bremen, the Max Planck Institute for Meteorology in Hamburg, and the
DLR German Remote Sensing Data Center set up the first system for assigning DOIs to data sets and
publishing them.
A prerequisite for data to be made available is a proper citation. This means that all fields mandatory for a
bibliographic citation are included. In addition, a mechanism is needed that ensures that the location of the
referenced data on the internet can be resolved at any time. In the past, this was a problematic issue
because URLs are short-lived, many becoming invalid after only a few months. Data publication on the
internet therefore needs a system of reliable pointers to a web publication to make these publications
citeable. To achieve this persistence of identifiers for their conventional publications, many scientific
publishers use Digital Object Identifiers (DOI). The identifier is resolved through the Handle System to the
valid location (URL) where the dataset can be found. This approach meets one of the prerequisites for
citeability of scientific data published online. In addition, the valid bibliographic citation can be included in the
catalogues of the German National Library of Science and Technology (TIB).
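The two requirements above, a complete bibliographic citation and an identifier that resolves to the current location, can be sketched in Python as follows. This is a minimal illustration, not the STD-DOI system: the set of citation fields is an assumption, the DOI shown is a placeholder that is not registered to any dataset, and the sketch only builds the resolver URL without contacting the network. The public DOI proxy at https://doi.org/ is used as the resolver.

```python
from dataclasses import dataclass
from typing import List

# Public DOI proxy; it redirects a DOI to the URL currently registered for it.
DOI_RESOLVER = "https://doi.org/"


@dataclass
class DataCitation:
    """Hypothetical citation record; the field set is an assumption."""
    creators: str
    year: int
    title: str
    publisher: str
    doi: str

    def missing_fields(self) -> List[str]:
        """Names of fields that are empty and would make the citation incomplete."""
        return [name for name, value in vars(self).items() if not value]

    def resolver_url(self) -> str:
        """Persistent link: stays the same even if the dataset's URL changes."""
        return DOI_RESOLVER + self.doi

    def formatted(self) -> str:
        return (f"{self.creators} ({self.year}): {self.title}. "
                f"{self.publisher}. doi:{self.doi}")


# Placeholder values for illustration only.
citation = DataCitation(
    creators="Example, A.",
    year=2008,
    title="Example time series",
    publisher="Example Data Centre",
    doi="10.1234/example-dataset",
)
```

A citation would be accepted for publication in this sketch only when `citation.missing_fields()` is empty; the text cited in a paper is `citation.formatted()`, and the link a reader follows is `citation.resolver_url()`.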
The data publications themselves are held at discipline-specific data centres, for instance ICSU World Data
Centers. The data providers take on the role of publication agents and are responsible for the long-term
availability of the data. The discipline-specific publication agents are also responsible for the quality of the
published data. Syntactic and semantic quality checks are used to secure data quality. Data may come as
data supplements to scientific papers, or as time series from environmental monitoring systems, or as a novel
form of publication in a data journal. The latter requires a peer-review process, analogous to conventional
science publications.
http://www.std-doi.de