Earth and Space Science Informatics [IN]

IN11C
 MC:Hall D  Monday  0800h

Emerging Issues in Science: Collaboration, Provenance, and the Ethics of Data Posters


Presiding: D McGuinness, Rensselaer Polytechnic Institute; C Tilmes, NASA GSFC; R Duerr, National Snow and Ice Data Center/World Data Center for Glaciology

IN11C-1037

How Can International Standards Support Scientific Lineage Needs?

* Habermann, T ted.habermann@noaa.gov, NOAA National Geophysical Data Center, E/GC1 325 Broadway, Boulder, CO 80305-3328, United States

Recording the provenance of large-scale scientific datasets is an incredible challenge that will attract a significant number of creative solutions. Many of these solutions will likely be customized in order to address specific needs of the organizations that develop them. At the same time, standard representations of the provenance are required for preservation and ease of understanding by a wide variety of non-expert users. Emerging international metadata standards include mechanisms for describing provenance that are necessarily general and simplified. We will describe these mechanisms and some applications to begin to characterize situations in which these standards are useful.

IN11C-1038 INVITED

Advances in Provenance Tracking and Configuration Management for Earth Science Data

* Barkstrom, B R alicebarkstrom@verizon.net, National Climatic Data Center, 151 Patton Av, Asheville, NC 28801,

Production of Earth science data involves two different production paradigms: (1) large scale, batch production in which a network of programs and files creates a collection of data for access by users, and (2) small scale, fine-grained production in which users interact with data in files or databases. These two paradigms lead to two different models for configuration management of data. The first model recognizes that data production is similar to other large-scale manufacturing processes and inventories the collection of files, the processes that produce them, and the connections between the files and the processes. This model requires traversing the mathematical graph created by the network of processes and files. The second model is better described as a workflow that modifies the data by fine-grained transactions. In this case, provenance tracking is equivalent to tracking the history of transactions created by the workflow. In the large scale production paradigm, the rates of data ingest and the numbers of processes and files are sufficiently large that production needs to be automated. For example, in the upcoming NPOESS Preparatory Project (NPP), a typical time granularity for ingested data files is about 86 seconds. A single spectral channel from the VIIRS instrument on this satellite produces about 1000 files per day of calibrated data. Given the large number of channels on this instrument, production must be highly automated. An archive must be prepared to catalog millions of files per year. In this situation, data producers operate by creating source code that the production system compiles and links into executable objects. The executable code is often unchanged over many instances, so that production is very homogeneous over extended periods of time. This fact means that versioning of data products is also quite discrete and leads naturally to a hierarchical inventory structure.
In the small scale production paradigm, the number of files is much smaller. At the same time (and particularly for in situ data), human intervention and judgment are often required to create trustworthy data. This fine-scale editing and transformation process appears well-adapted to workflow engines. With such engines, data users can store previous sessions, edit them, and rapidly examine the differences between data produced in previous sessions and new ones. In this case, configuration management is intended to provide traceable records, including the ability to audit changes to the data and the processes that created it.
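
The graph traversal that the first model calls for can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's system: the file names, process names, the PRODUCED_BY and INPUTS tables, and the upstream function are all hypothetical.

```python
from collections import deque

# Hypothetical inventory (illustrative names only).
# Each file maps to the process run that produced it ...
PRODUCED_BY = {
    "L1B_ch05.h5": "calibrate_v2",
    "L2_sst.h5": "retrieve_sst_v1",
}
# ... and each process run maps to the files it read.
INPUTS = {
    "calibrate_v2": ["raw_ch05.bin", "cal_table_v2.dat"],
    "retrieve_sst_v1": ["L1B_ch05.h5", "ancillary_winds.grb"],
}

def upstream(file_name):
    """Return every process and file that contributed to file_name,
    in breadth-first order (nearest contributors first)."""
    lineage = []
    visited = {file_name}
    queue = deque([file_name])
    while queue:
        node = queue.popleft()
        # A file leads to its producing process; a process leads to its inputs.
        if node in PRODUCED_BY:
            parents = [PRODUCED_BY[node]]
        else:
            parents = INPUTS.get(node, [])
        for parent in parents:
            if parent not in visited:
                visited.add(parent)
                lineage.append(parent)
                queue.append(parent)
    return lineage

lineage = upstream("L2_sst.h5")
```

Because files and processes alternate along every path, the same walk serves both to list the inputs behind a product and to find which process version created it. As a consistency check on the figures quoted above, 86,400 seconds per day divided by 86 seconds per file is roughly 1,005 files per day, in line with the stated figure of about 1000.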

IN11C-1039 INVITED

Automatic run-time provenance capture for scientific dataset generation

* Frew, J frew@bren.ucsb.edu, Donald Bren School of Environmental Science and Management, University of California, Santa Barbara, CA 93106-5131, United States
Slaughter, P frew@bren.ucsb.edu, Donald Bren School of Environmental Science and Management, University of California, Santa Barbara, CA 93106-5131, United States

Provenance---the directed graph of a dataset's processing history---is difficult to capture effectively. Human-generated provenance, as narrative metadata, is labor-intensive and thus often incorrect, incomplete, or simply not recorded. Workflow systems capture some provenance implicitly in workflow specifications, but these systems are not ubiquitous or standardized, and a workflow specification may not capture all of the factors involved in a dataset's production. System audit trails capture potentially all processing activities, but not the relationships between them. We describe a system that transparently (i.e., without any modification to science codes) and automatically (i.e., without any human intervention) captures the low-level interactions (files read/written, parameters accessed, etc.) between scientific processes, and then synthesizes these relationships into a provenance graph. This system---the Earth System Science Server (ES3)---is sufficiently general that it can accommodate any combination of stand-alone programs, interpreted codes (e.g. IDL), and command-language scripts. Provenance in ES3 can be published in well-defined XML formats (including formats suitable for graphical visualization), and queried to determine the ancestors or descendants of any specific data file or process invocation. We demonstrate how ES3 can be used to capture the provenance of a large operational ocean color dataset.
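
The idea of synthesizing a provenance graph from low-level read and write events, and then querying it for ancestors or descendants, can be illustrated with a short Python sketch. It does not reproduce ES3's actual event format, storage, or query interface; the EVENTS list and the build_graph, invert, and reachable functions are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Hypothetical captured events: (process invocation, action, file).
EVENTS = [
    ("p1:calibrate", "read", "raw.dat"),
    ("p1:calibrate", "write", "cal.dat"),
    ("p2:bin", "read", "cal.dat"),
    ("p2:bin", "write", "binned.dat"),
    ("p3:plot", "read", "binned.dat"),
    ("p3:plot", "write", "map.png"),
]

def build_graph(events):
    """Turn read/write events into directed edges pointing downstream."""
    children = defaultdict(set)
    for process, action, path in events:
        if action == "read":
            children[path].add(process)   # file -> process that read it
        elif action == "write":
            children[process].add(path)   # process -> file it wrote
    return children

def invert(children):
    """Reverse every edge so the same walk can find ancestors."""
    parents = defaultdict(set)
    for source, targets in children.items():
        for target in targets:
            parents[target].add(source)
    return parents

def reachable(graph, start):
    """Depth-first walk returning all nodes reachable from start."""
    found, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found

children = build_graph(EVENTS)
descendants = reachable(children, "cal.dat")
ancestors = reachable(invert(children), "binned.dat")
```

Here descendants holds everything derived from cal.dat, and ancestors holds every process and file that binned.dat depends on. The events alone carry no relationships; the graph appears only once reads and writes are joined on shared file names.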

IN11C-1040 INVITED

Provenance: Promise and Practice

* Duerr, R E rduerr@nsidc.org, NSIDC/CIRES University of Colorado, University of Colorado 449 UCB, Boulder, CO 80309-0449, United States

Capturing provenance is one of the fundamental principles of archive theory. Provenance consists of information about the creation of an object, its ownership, and how this information has changed over time. The data management community has been discussing how to apply the concepts of provenance to science data. Considerable attention has been paid to developing mechanisms to record how data were created, since this is key to reproducing research results. Less attention has been paid to the other elements of provenance, even though data and the organizations that archive data are dynamic and ever changing. Some practice is coming into play, but there is a large gap between theory and practice. This talk will review the current state of the art, discuss the gap between theory and practice, and describe what could be done to close the gap.

IN11C-1041

Provenance in Data Interoperability for Multi-Sensor Intercomparison

Lynnes, C Chris.Lynnes@nasa.gov, NASA/GSFC, Code 610.2, Greenbelt, MD 20771, United States
* Leptoukh, G gregory.leptoukh@nasa.gov, NASA/GSFC, Code 610.2, Greenbelt, MD 20771, United States
Berrick, S Stephen.W.Berrick@nasa.gov, NASA/GSFC, Code 610.2, Greenbelt, MD 20771, United States
Shen, S Suhung.Shen@nasa.gov, NASA/GSFC, Code 610.2, Greenbelt, MD 20771, United States
Prados, A Ana.I.Prados@nasa.gov, JCET/GEST, UMBC, 5523 Research Park Dr. Suite 320, Baltimore, MD 21228, United States
Fox, P pfox@ucar.edu, NCAR/HAO, PO Box 3000, Boulder, CO 80307, United States
Yang, W wenli.yang@nasa.gov, George Mason University, CSISS 6301 Ivy Ln., Suite 620, Greenbelt, MD 20770, United States
Min, M mmin1@gmu.edu, George Mason University, CSISS 6301 Ivy Ln., Suite 620, Greenbelt, MD 20770, United States
Holloway, D d.holloway@opendap.org, OPeNDAP, 165 Dean Knauss Dr., Narragansett, RI 02882, United States
Enloe, Y yonsook@mindspring.com, SGT, Inc., 7701 Greenbelt Rd. Suite 400, Greenbelt, MD 20770, United States

As our inventory of Earth science data sets grows, the ability to compare, merge and fuse multiple datasets grows in importance. This implies a need for deeper data interoperability than we have now. Many efforts (e.g. OPeNDAP, Open Geospatial Consortium) have broken down format barriers to interoperability; the next challenge is the semantic aspects of the data. Consider the issues when satellite data are merged, cross-calibrated, validated, inter-compared and fused. We must determine how to match up data sets that are related, yet different in significant ways: the exact nature of the phenomenon being measured, measurement technique, exact location in space-time, or the quality of the measurements. If subtle distinctions between similar measurements are not clear to the user, the results can be meaningless or even lead to an incorrect interpretation of the data. Most of these distinctions trace back to how the data came to be: sensors, processing, and quality assessment. For example, monthly averages of satellite-based aerosol measurements often show significant discrepancies, which might be due to differences in spatio-temporal aggregation, sampling issues, sensor biases, algorithm differences and/or calibration issues. This provenance information must therefore be captured in a semantic framework that allows sophisticated data inter-use tools to incorporate it, and eventually aid in the interpretation of comparison or merged products. Semantic web technology allows us to encode our knowledge of measurement characteristics, phenomena measured, space-time representations, and data quality representation in a well-structured, machine-readable ontology and rulesets. An analysis tool can use this knowledge to show users the provenance-related distinctions between two variables, advising on options for further data processing and analysis. An additional problem for workflows distributed across heterogeneous systems is retrieval and transport of provenance.
Provenance information may be either embedded within the data payload, or transmitted from server to client in an out of band mechanism, through a dedicated provenance protocol, or as an addition to existing protocol standards. The out of band mechanism is more flexible in the richness of provenance information that can be accommodated, but it relies on a persistent framework. Also, if the user saves it locally for preservation purposes, a data management problem arises in keeping provenance information with the data. Therefore, we are prototyping the embedded model, incorporating the provenance within metadata objects in the data payload. Thus, it always remains with the data, no matter where the user moves them. The downside is a limit to the size of provenance metadata that we can include, an issue that will eventually need resolution to encompass the richness of provenance information required for data intercomparison and merging.
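
The trade-off in the embedded model, where provenance always travels with the data but its size is bounded, can be made concrete with a small Python sketch. It is not the authors' prototype: the dictionary payload, the "metadata" and "provenance" keys, the embed_provenance function, and the 512-byte cap are all assumptions made for illustration.

```python
import json

MAX_PROVENANCE_BYTES = 512  # hypothetical cap on embedded metadata size

def embed_provenance(payload, provenance, limit=MAX_PROVENANCE_BYTES):
    """Return a copy of payload that carries provenance in its metadata.

    Raises ValueError when the serialized provenance exceeds the limit,
    which is the downside of the embedded model.
    """
    encoded = json.dumps(provenance, sort_keys=True).encode("utf-8")
    if len(encoded) > limit:
        raise ValueError(
            f"provenance is {len(encoded)} bytes; limit is {limit}"
        )
    merged = dict(payload)  # leave the caller's payload untouched
    merged["metadata"] = dict(payload.get("metadata", {}), provenance=provenance)
    return merged

# Illustrative data payload and provenance record (values are made up).
granule = {"variable": "aerosol_optical_depth", "values": [0.12, 0.15, 0.11]}
record = {
    "sensor": "example_sensor",
    "aggregation": "monthly mean",
    "algorithm_version": "x.y",
}

packaged = embed_provenance(granule, record)
# Serializing and reloading stands in for the user moving the file around.
round_trip = json.loads(json.dumps(packaged))
```

After the round trip the provenance record is still attached to the data, with no separate service required. A richer record that exceeds the cap is rejected outright, which is the limitation the out-of-band mechanism avoids at the cost of needing a persistent framework.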

IN11C-1042

Provenance Tracking in Climate Science Data Processing Systems

* Tilmes, C Curt.Tilmes@nasa.gov, NASA Goddard Space Flight Center, Code 614.5, Greenbelt, MD 20771, United States

NASA, NOAA, ESA and other organizations involved with climate research have captured huge archives of Earth observations. Over time, the sensors, spacecraft, science algorithms for transforming and analyzing the data and the processing frameworks have all evolved. Tracking sufficient provenance information in concert with the science data used in research and, ultimately, policy decisions is a tremendously complicated problem. Data are stored in multiple archives across multiple agencies. Since the data volume is so large, previous generations of the data are often discarded in favor of newer versions. Systems often aren't capable of reproducing data that were once provided to the public. Tracing the provenance of a product is generally a very manual process, since it is stored in so many different ways (or not stored at all). It often involves reading science papers, or calling up the researchers. In next generation processing systems, data can be transformed by on-demand processing in new ways, resulting in transient data sets that are returned to a user or layered application but not archived at all. Our goal is to capture, archive, and present sufficient provenance information for complete scientific reproducibility of all data and for understanding of the data under consideration. I will briefly present the general area and challenges of provenance tracking for science data processing systems and the requirements for scientific reproducibility.

IN11C-1043

Geospatial Data Provenance in the Semantic Web Environment

* Di, L ldi@gmu.edu, Center for Spatial Information Science and Systems, George Mason University, 6301 Ivy Lane, Suite 620, Greenbelt, MD 20770, United States
Yue, P geopyue@gmail.com, Center for Spatial Information Science and Systems, George Mason University, 6301 Ivy Lane, Suite 620, Greenbelt, MD 20770, United States

Geospatial data will grow to multi-exabytes very soon. The major form of geospatial data is imagery collected by the Earth observing community through remote sensing methods. Those data, along with their derived products and model outputs, are archived in many data centers around the world. Geospatial data have to be converted to user-specific information and knowledge before they become useful. Such user-specific information and knowledge are normally derived from multi-source data through a set of geoprocessing steps. Recent technology advances in the unified representation of geospatial data, information, and knowledge, the geospatial semantic web, geospatial interoperability, and artificial intelligence have made possible the automatic derivation of user-specific information and knowledge from diverse data sources in the web service environment. A prototype system for proving such technologies has been constructed and successfully demonstrated. An operational system is being developed. With ontology support, the system automatically constructs an executable workflow based on users' descriptions of what they want, the available services, and the input data on the web, and executes the workflow to generate the user-specific product. In order for users to have the confidence to use such automatically generated products in real applications, complete and accurate provenance information must be provided to users, even before such user-specific products are generated. In this presentation, we will discuss the representation of geospatial data provenance, the automatic capturing of geospatial data provenance in the semantic web environment, and the management of geospatial data provenance. We will also discuss a prototype provenance management system that allows users to query and access provenance information.

IN11C-1044

Source Code, an Essential Part of Providing Complete Provenance

* Fleig, A J al.fleig@gmail.com, PITA Analytic Sciences, 8705 Burning Tree Road, Bethesda, MD 20817, United States

Providing thorough provenance information, sufficient to guarantee that it is possible to understand and reproduce a data set, requires many things. One is complete documentation of the algorithms that were used to create it. There are many forms of algorithm documentation, including Algorithm Theoretical Basis Documents (ATBDs), As Built documents, User's Guides, and peer reviewed algorithm descriptions and validation papers. However, the only way to know for sure how a data set was made is to provide the exact source code that was used to make it. All of the other forms are lacking in one or more respects. Most are not up to date. Some, such as ATBDs, were written before the data was even collected. Others, such as peer reviewed journal articles, lack detail because of page space limitations, and none come with an absolute guarantee that they include all of the changes that were introduced as the processing code was updated with time. There are two additional problems even when the source code is provided. First, it can be very difficult to understand exactly what is being done from reading source code, and additional effort to adequately comment the source code is usually required. Second, the source code actually used in the production of a data set may not completely define the process, in that it may incorporate tables or constants produced in prior processing steps, and it may be necessary to assure that documentation, including source code, is available for those steps also. Documentation requirements for source code sufficient to meet the needs of providing complete provenance will be discussed in this talk.
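
One way to picture the second problem, that source code alone does not define the process when it depends on tables from earlier steps, is a checksum manifest covering both code and tables. The abstract does not propose this technique; the sketch below is an assumption offered for illustration, and the artifact names, their contents, and the fingerprint and build_manifest functions are hypothetical.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """SHA-256 digest of the exact bytes of one artifact."""
    return hashlib.sha256(content).hexdigest()

def build_manifest(artifacts):
    """Map each artifact name to the digest of its content."""
    return {name: fingerprint(data) for name, data in sorted(artifacts.items())}

# Hypothetical artifacts behind one data set: the processing code plus a
# coefficient table that an earlier processing step produced.
artifacts = {
    "retrieve.f90": b"subroutine retrieve()\nend subroutine\n",
    "coefficients.tbl": b"0.98 1.02 1.00\n",
}
manifest = build_manifest(artifacts)

# Later, the table is regenerated with one changed coefficient while the
# source code stays byte-for-byte identical.
changed = {**artifacts, "coefficients.tbl": b"0.98 1.02 1.01\n"}
differences = [
    name
    for name, digest in build_manifest(changed).items()
    if manifest[name] != digest
]
```

In this example the source file's digest is unchanged, yet the manifest still flags the altered table. Recording only the code version would have missed a change that affects the output.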

IN11C-1045 INVITED

Current Climate Data Set Documentation Standards: Somewhere between Anagrams and Full Disclosure

* Fleig, A J PITA@fleig.us, PITA Analytic Sciences, 8705 Burning Tree Road, Bethesda, MD 20817, United States

In the 17th century scientists, concerned with establishing primacy for their discoveries while maintaining control of their intellectual property, often published their results as anagrams. Robert Hooke's initial publication in 1676 of his law of elasticity in the form ceiiinosssttuv, which he revealed two years later as "Ut tensio, sic vis" or "as the extension, so the force", is one of the better known examples, although Galileo, Newton, and many others used the same approach. Fortunately the idea of open publication in scientific journals subject to peer review as a cornerstone of the scientific method gradually became established and is now the norm. Unfortunately, though, even peer reviewed publication does not necessarily lead to full disclosure. One example of this occurs in the production, review and distribution of large scale data sets of climate variables. Validation papers describe how the data was made in concept but do not provide adequate documentation of the process. Complete provenance of the resulting data sets, including description of the exact input files, processing environment, and actual processing code, is not required as part of the production and archival effort. A user of the data may be assured by the publication and peer review that the data is considered to be good and usable for scientific investigation but will not know exactly how the data set was made. The problem with this lack of knowledge may be most apparent when considering questions of climate change. Future measurements of the same geophysical parameter will surely be derived from a different observational system than the one used in creating today's data sets. An obvious task in assessing change between the present and the future data set will be to determine how much of the change is because the parameter changed and how much is because the measurement system changed. This will be hard to do without complete knowledge of how the predecessor data set was made.
Automated techniques are being developed that will simplify the creation of much of the provenance information but there are both cultural and infrastructure problems that discourage provision of complete documentation. It is time to reconsider what the standards for production and documentation of data sets should be. There is only a short window before the loss of knowledge about current data sets associated with human mortality becomes irreversible.

IN11C-1046

Archive Issues Associated with NASA Earth Science Datasets

Behnke, J jeanne.behnke@nasa.gov, ESDIS Project, Goddard Space Flight Center, Greenbelt, MD 20771, United States
* Moses, J john.f.moses@nasa.gov, ESDIS Project, Goddard Space Flight Center, Greenbelt, MD 20771, United States
Byrnes, J james.b.byrnes@nasa.gov, Science Data Systems Branch, Goddard Space Flight Center, Greenbelt, MD 20771, United States

The Earth Science Data and Information System (ESDIS) Project at NASA Goddard Space Flight Center was established in the early 1990s to develop and maintain a core collection of NASA's critical earth science data. Part of its mission was to provide a home for legacy earth science data from early NASA missions. Examples of these datasets include data from such missions as NIMBUS (1960s) and the Heat Capacity Mapping Mission (HCMM) from the late 1970s at GSFC and the Earth Radiation Budget Experiment (ERBE) from the late 1970s at Langley Research Center. Much of this information has been kept on old media and in many cases is not readily accessible by the science community. This presentation will describe several science data issues we have experienced as part of our efforts to recover data from these missions. We will share problems encountered with data formats, data resolution, representation, and documentation. The presentation will also suggest best practices and identify key missing elements that would enable easier recovery if incorporated into future archives. The authors offer an opportunity to discuss plans for NASA's heritage assets and their disposition.

IN11C-1047

What Are We Tracking ... and Why?

Suarez-Sola, I igor@noao.edu, National Solar Observatory, 950 N. Cherry Avenue, Tucson, AZ 85719, United States
Davey, A ard@head.cfa.harvard.edu, Harvard-Smithsonian Center for Astrophysics, 60 Garden Street, Cambridge, MA 02138, United States
* Hourcle, J A joseph.a.hourcle@nasa.gov, NASA/GSFC (Wyle IS), Code 671.1 Goddard Space Flight Center, Greenbelt, MD 20771, United States

It is impossible to define what adequate provenance is without knowing who is asking the question. What determines sufficient provenance information is not a function of the data, but of the question being asked of it. Many of these questions are asked by people not affiliated with the mission and possibly from different disciplines. To plan for every conceivable question would place a significant burden on the data systems that are designed to answer the mission's science objectives. Provenance is further complicated as each system might have a different definition of 'data set'. Is it the raw instrument results? Is it the result of numerical processing? Does it include the associated metadata? Does it include packaging? Depending on how a system defines 'data set', it may not be able to track provenance with sufficient granularity to answer the desired question, or we may end up with a complex web of relationships that significantly increases the system complexity. System designers must also remember that data archives are not a closed system. We need mechanisms for tracking not only the provenance relationships between data objects and the systems that generate them, but also from journal articles back to the data that was used to support the research. Simply creating a mirror of the data used, as done in other scientific disciplines, is unrealistic for terabyte and petabyte scale data sets. We present work by the Virtual Solar Observatory on the assignment of identifiers that could be used for tracking provenance and compare it to other proposed standards in the scientific and library science communities. We use the Solar Dynamics Observatory, STEREO and Hinode missions as examples where the concept of 'data set' breaks many systems for citing data.

IN11C-1048

Automatic Provenance Recording for Scientific Data using Trident

* Simmhan, Y yoges@microsoft.com, Microsoft Research, 835 Market St. Suite 700, San Francisco, CA 94103, United States
Barga, R barga@microsoft.com, Microsoft Research, One Microsoft Way, Redmond, WA 98052, United States
van Ingen, C vaningen@windows.microsoft.com, Microsoft Research, 835 Market St. Suite 700, San Francisco, CA 94103, United States

Provenance is increasingly recognized as being critical to the understanding and reuse of scientific datasets. Given the rapid generation of scientific data from sensors and computational model results, it is not practical to manually record provenance for data, and automated techniques for provenance capture are essential. Scientific workflows provide a framework for representing computational models and complex transformations of scientific data, and present a means for tracking the operations performed to derive a dataset. The Trident Scientific Workbench is a workflow system that natively incorporates provenance capture of data derived as part of the workflow execution. The applications used as part of a Trident workflow can execute on a remote computational cluster, such as a supercomputing center or in the Cloud, or on the local desktop of the researcher, and provenance on data derived by the applications is seamlessly captured. Scientists also have the option to annotate the provenance metadata using domain-specific tags such as GCMD keywords. The provenance records thus captured can be exported in the Open Provenance Model* XML format that is emerging as a provenance standard in the eScience community or visualized as a graph of data and applications. The Trident workflow system and the provenance recorded by it have been successfully applied in the Neptune oceanography project and are presently being tested in the Pan-STARRS astronomy project. *http://twiki.ipaw.info/bin/view/Challenge/OPM

http://www.microsoft.com/mscorp/tc/trident.mspx

IN11C-1049

Information and informatics in a geological survey – the good, the bad and the ugly

* Jackson, I ij@bgs.ac.uk, British Geological Survey, Keyworth, Nottingham, NG12 5GG, United Kingdom

It is apparent that the most successful geological surveys (as measured by the only true Key Performance Indicator - their effectiveness in serving their societies) have recognised that, while their core business is making maps and models and doing scientific research to underpin that, the commodity they actually deal in is data and information and knowledge. They know that in a digital world the better they organise the data and information and knowledge, the more successful they will be. In our future world, where e-science will surely dominate, some are already sub-titling themselves as information or knowledge exchange organisations. There seems an unarguable correlation between surveys which organise their information well and those that run their projects well, their agility in responding to government agendas or national emergencies, and flexibility in delivering products their diverse users want. Look deeper and you can see the pivotal role of best practice information management and the tangible benefits a responsible approach to acquiring, storing and delivering information brings. But even in these (most successful) surveys the people leading information management will tell you that it was a gargantuan battle to get the resources to achieve this success and that, even with the downstream fruits of the investment in professional information management and informatics now obvious, it is a continuing struggle to maintain a decent level of funding for these tasks. It is not hard to see why; the struggle is innately one-sided; geoscientists are born and/or trained to be curious, to be independent and to innovate. If the choice is between more research and survey, or a professional approach to information/informatics and the adjudicators are geoscientists, it is not difficult to pick the winner. So what does lie behind a successful approach to information in a geological survey organisation? 
First, recognise that poor information management cannot just be cured by investing in hardware and software; it is the geoscience data content (its availability, quality and consistency) that is in greater need of investment. Second, to achieve the full synergies and benefits, information management and informatics must be planned into all domains of the Survey and all project phases - acquisition, processing, analysis, dissemination and storage. Adequate investment in front office applications and services to communicate and deliver geoscience to all our stakeholders (eg virtualisation and visualisation) is essential. Without it, back office work, however worthy, is of limited value. Finally, the widely accepted truth is that the real challenge in introducing professional information management and informatics is not technical or scientific, but cultural and managerial. Unless you can sensitively and positively change the work patterns and culture of Survey geoscientists, a sustainable outcome will remain beyond reach. Of course, to change the work pattern and culture of the geoscientists you must first ensure that the most senior management of the organisation embrace the change wholeheartedly; now there's a challenge! Using examples and experience from the evolution of information management and informatics in the British Geological Survey over the last decade, this presentation will explore the issues above.

IN11C-1050

Using blackmail, bribery, and guilt to address the tragedy of the virtual intellectual commons

* Griffith, P C peter.c.griffith@nasa.gov, Science Systems & Applications, Inc. and the Carbon Cycle and Ecosystems Office, NASA Goddard Space Flight Center, Mailstop 614.4, Greenbelt, MD 20771, United States
Cook, R B cookrb@ornl.gov, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831, United States
Wilson, B E wilsonbe@ornl.gov, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831, United States
Gentry, M J mgentry2@utk.edu, University of Tennessee and the LBA-ECO Project Office, NASA Goddard Space Flight Center, P.O. Box 164, Crawford, MS 39743, United States
Horta, L M luiz.horta@cptec.inpe.br, CPTEC / INPE, Rodovia Presidente Dutra, km39, Cachoeira Paulista, SP 12630-000, Brazil
McGroddy, M mmcgroddy@gmail.com, Science Systems & Applications, Inc. and the LBA-ECO Project Office, NASA Goddard Space Flight Center, Dept of Environmental Sciences, Clark Hall, Charlottesville, VA 22903, United States
Morrell, A L amy.l.morrell@nasa.gov, Science Systems & Applications, Inc. and the Carbon Cycle and Ecosystems Office, NASA Goddard Space Flight Center, Mailstop 614.4, Greenbelt, MD 20771, United States
Wilcox, L E lisa.e.wilcox@nasa.gov, Science Systems & Applications, Inc. and the Carbon Cycle and Ecosystems Office, NASA Goddard Space Flight Center, Mailstop 614.4, Greenbelt, MD 20771, United States

One goal of the NSF's vision for 21st Century Cyberinfrastructure is to create a virtual intellectual commons for the scientific community where advanced technologies perpetuate transformation of this community's productivity and capabilities. The metadata describing scientific observations, like the first paragraph of a news story, should answer the questions who? what? why? where? when? and how?, making them discoverable, comprehensible, contextualized, exchangeable, and machine-readable. Investigators who create good scientific metadata increase the scientific value of their observations within such a virtual intellectual commons. But the tragedy of this commons arises when investigators wish to receive without giving in return. The authors of this talk will describe how they have used combinations of blackmail, bribery, and guilt to motivate good behavior by investigators participating in two major scientific programs (NASA's component of the Large-scale Biosphere-Atmosphere Experiment in Amazonia; and the US Climate Change Science Program's North American Carbon Program).

IN11C-1051 INVITED

Sharing data resources benefits owners as well as miners.

* Smith, R W roger.smith@gi.alaska.edu, Geophysical Institute, University of Alaska Fairbanks, Fairbanks, AK 99775-7320, United States

The most fundamental part of any research activity is the data created. Data are most frequently the result of physical measurements but, increasingly, also result from the operation of a computer code. Given that the methods of creation are properly executed and recorded, data have an intrinsic value regardless of the ensuing study in which they are used. Data are part of the intellectual property associated with the work of a scientist. Like any other form of property, the value to the cognizant community depends upon access and available usage. Data that remain on some hidden storage medium are like a bank account storing funds with no interest accrual, an apparent waste of opportunity. Not sharing data with the cognizant community needs a justification like security risk or possible danger. The historically contentious issue associated with data as intellectual property is the protection of the owner's rights of first use. This paper contends that data sharing is the proper and most productive strategy for scientists to gain the most value from their work. The first example illustrating the point relates to the Alaska Climate Research Center (www.climate.gi.alaska.edu) operated by the Geophysical Institute (GI) where the data is shared on a website that gets 35,000 hits (2000 visits) per day. The data is a mixture of current weather and historical meteorological observations. The latter could be considered the property of the GI. Although most website hits are for the current weather, meteorological observations from across the state, some dating back to 1820, are also available on the web for all to use. This kind of sharing brings the most volume and greatest value from the stored data. The second relates to the personal observations of GI faculty members who share their measurements directly on the web as soon as they are available.
These data are the same as published in their personal work, and are also available for others to use based on some simple "rules of the road". This strategy broadens the applications of his work and results in more co-authorships along the way. Many federal granting agencies require a similar approach of rapid dissemination of data. The recent introduction of virtual observatories has strengthened this approach and also provides a formalism for the protection of data owners.

http://www.egy.org

IN11C-1052

Annotating and embedding provenance in science data repositories to enable next generation science applications

* McGuinness, D L dlm@cs.rpi.edu, McGuinness Associates, 4 Shaker Bay Road, Latham, NY 12110, United States
* McGuinness, D L dlm@cs.rpi.edu, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, United States
Fox, P pfox@ucar.edu, HAO/ESSL/NCAR, PO Box 3000, Boulder, CO 80307, United States
Pinheiro da Silva, P paulo@utep.edu, University of Texas at El Paso, Computer Science Building Room 234 500 West University Avenue, El Paso, TX 79968, United States
Zednik, S zednik@ucar.edu, HAO/ESSL/NCAR, PO Box 3000, Boulder, CO 80307, United States
Del Rio, N ndel2@miners.utep.edu, University of Texas at El Paso, Computer Science Building Room 234 500 West University Avenue, El Paso, TX 79968, United States
Ding, L dingl@cs.rpi.edu, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, United States
West, P pwest@ucar.edu, HAO/ESSL/NCAR, PO Box 3000, Boulder, CO 80307, United States
Chang, C csc@cs.rpi.edu, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, United States

Recognizing the increased need for knowledge provenance in interdisciplinary eScience efforts, we have begun an effort to enhance a real-world data production pipeline and the resulting data services with semantic provenance. Designing and implementing this work in an existing, fielded virtual observatory setting has allowed us to collect key provenance requirements for a broad variety of end users. We have documented several image data pipelines for solar physics instruments at the Mauna Loa Solar Observatory and have documented almost 20 use cases covering usage by instrument scientists, observers, data analysts and managers, and end-user scientists. These use cases have guided our work developing an initial infrastructure that can be searched, queried, or browsed by these users. We use a multi-stage approach to provenance as data and information artifacts progress along processing pipelines. One motivation is that both the qualitative and quantitative measures of uncertainty may be vastly improved when treated in an end-to-end manner. This also reduces the likelihood that critical information is left behind or obscurely represented, making the later use of the data and information difficult or impossible. Another motivation is that provenance captured consistently at ingest time supports transparency of sources and propagation of credit for data generation, thereby increasing the likelihood of contribution and reuse. We present the current stages of implementation of our provenance infrastructure and tools, and their impact on what users are able to learn from the annotated information streams. The Semantic Provenance Capture in Data Ingest Systems (SPCDIS) project is an NSF/OCI/SDCI funded effort involving the High Altitude Observatory at NCAR, McGuinness Associates and the University of Texas at El Paso.