Vol. 84, No. 36, 9 September 2003
Complete PostScript: An Archival and Exchange Format for the Sciences?
Paul Wessel, School of Ocean and Earth Science and Technology, University of Hawaii at Manoa, Honolulu
Copyright 2003 American Geophysical Union
New scientific knowledge is routinely disseminated through the channels of professional
journals and books, and the material is eagerly scrutinized by peers for relevance,
insight, and accuracy. While the first two of these qualities can be readily
assessed, a complete evaluation of the paper's accuracy may not be feasible.
Usually, for such an evaluation to be possible, potential evaluators would need
access to data sets and quite possibly specialized software, as well as a detailed
description of how the author analyzed the results presented in any of the illustrations
that are contained within the manuscript.
While some attempts have been made to address this problem (see, in particular, the splendid online textbooks in data analysis and exploration seismology by Jon Claerbout at Stanford University; http://sepwww.stanford.edu/sep/jon/), the vast majority of publications do not facilitate reproduction of the analysis. The increasing availability of electronic forms of publication has made user-friendly publication of results possible, but the promise of this medium remains largely unfulfilled because of the advanced technical skill required to present one's results in an interactive manner.
Rather than expect scientists to routinely present their results in a fashion that can be explored by online tutorials and demos, I propose that authors make their scientific illustrations available in Complete PostScript format (CPS). The CPS format is simply a regular Encapsulated PostScript file (i.e., an illustration in a scientific paper) that has been extended, via PostScript comments, to include not only any data required to reproduce the illustration, but also a detailed description of how to reproduce the results. These instructions can be DOS or UNIX shell scripts or computer programs, as well as explicit instructions on how to run a particular program and view relevant documentation. The CPS format gives the reader complete access to the results, as well as to the data analysis of a research project, and allows readers either to reproduce the results or experiment further, based on the published results and data. For instance, users could readily choose different processing parameters to see how sensitive the author's conclusions are to the chosen parameters.
The CPS format allows for a simplified way to accomplish several important tasks:
• It allows authors to archive their research projects in a logical manner
organized by illustrations. The illustrations now become self-contained entities
with all the information required to reproduce the results;
• It facilitates the exchange of ideas. Scientists can exchange CPS files
to illustrate specific points or preliminary ideas, and both sides can revise
the content of the CPS file to explore and experiment further;
• It provides a universally accepted standard (Encapsulated PostScript)
in which to publish one's results. Unlike the Portable Document Format (PDF)
files produced by most applications, PostScript files may readily contain comments,
and this makes possible the inclusion of associated material directly in the
PostScript file; and
• It simplifies the task of providing online access to data and programs,
as online publishing companies can simply provide a link to each figure's CPS
file.
Details of Implementation
The CPS software contains two POSIX-compliant C programs named cpsencode and cpsdecode, which are used to pack and unpack a CPS file, respectively. Both utilities rely on the external compression/decompression library libbz2 provided with the bzip2 package (bzip2 is a standard open source file compressor and decompressor that is available for all major operating systems; for more information, see http://sources.redhat.com/bzip2). The CPS software also uses open source algorithms derived from the UNIX uuencode/uudecode utilities, so that compressed binary files can be encoded in ASCII format. Because all of these components are portable, the CPS utilities can be installed under UNIX/Linux, Windows, and MacOS X (MacOS versions 9 or earlier would need a UNIX emulator due to the lack of a command line interface). For flexibility, two Bourne shell scripts duplicating the work of the C programs are also included. There is no Windows version of the scripts; however, Windows users are encouraged to install the free Cygwin environment and do script processing using the UNIX shell interface. The software, as well as precompiled executables for a variety of platforms, can be found at the CPS Web site (http://www.soest.hawaii.edu/pwessel/cps).
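The packing idea described above (compress with bzip2, convert to ASCII with a uuencode-style encoding, and hide the result in PostScript comments) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not the cpsencode source, and the marker string %CPS-SKETCH, the function names, and the line layout are hypothetical choices made for this example rather than the actual CPS file layout.

```python
import binascii
import bz2

MARKER = "%CPS-SKETCH"  # hypothetical tag; NOT the marker used by the real CPS tools


def encode_payload(name, data):
    """Compress bytes with bzip2, uuencode them, and wrap them in PostScript comments."""
    packed = bz2.compress(data, compresslevel=9)
    lines = [f"{MARKER} begin {name}"]
    # uuencoding handles at most 45 input bytes per output line
    for start in range(0, len(packed), 45):
        chunk = packed[start:start + 45]
        # backtick=True avoids trailing spaces that editors or mailers may strip
        text = binascii.b2a_uu(chunk, backtick=True).decode("ascii").rstrip("\n")
        # a leading "%" makes the line a comment to any PostScript interpreter
        lines.append("% " + text)
    lines.append(f"{MARKER} end {name}")
    return lines


def append_to_eps(eps_text, name, data):
    """Return the EPS text followed by the comment-wrapped payload."""
    if not eps_text.endswith("\n"):
        eps_text += "\n"
    return eps_text + "\n".join(encode_payload(name, data)) + "\n"


# A toy EPS file and a toy data file, both made up for the demonstration
eps = "%!PS-Adobe-3.0 EPSF-3.0\n%%BoundingBox: 0 0 100 100\nshowpage\n%%EOF\n"
cps_like = append_to_eps(eps, "samples.dat", b"1.0 2.0\n3.0 4.0\n")
print(cps_like)
```

Because every appended line begins with a percent sign, a PostScript viewer treats the payload as comments and renders the illustration exactly as before; the original EPS text is left untouched at the start of the file.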
CPS Usage
Because using the CPS utilities is relatively straightforward, a single example suffices to illustrate their use. To turn an EPS file into a CPS file, we simply append the output of cpsencode to the EPS file. The arguments to cpsencode are all the script files, program files, data files, and documentation that anyone interested in recreating the illustration would need. For instance, if the shell script Figure_3.sh creates the illustration Figure_3.eps from the data files my_topo.grd, drillholes.dat, and samples.dat, and README.drill explains some details about data processing, we would simply run:
cpsencode my_topo.grd drillholes.dat samples.dat README.drill >> Figure_3.eps
Alternatively, this command could be inserted at the end of the script Figure_3.sh. For transmission or browsing of CPS files, it is useful first to compress them with bzip2. I propose the file suffix .psz for bzip2-compressed CPS files. The main reason for giving these compressed files a separate suffix is to facilitate display by Web browsers. For instance, a browser can be configured to recognize *.psz files and decompress them with bzip2 before opening them with a PostScript viewer such as ghostview. To produce the final compressed file, we run:
bzip2 -c9 Figure_3.eps > Figure_3.psz
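For readers who want to see what the .psz step amounts to, the following Python sketch mirrors the bzip2 command with the standard bz2 module, which reads and writes the same stream format as the bzip2 program. The function names to_psz and from_psz and the toy file contents are assumptions made for illustration; they are not part of the CPS software.

```python
import bz2


def to_psz(cps_bytes):
    """Compress CPS bytes at the highest bzip2 level, in the spirit of `bzip2 -c9`."""
    return bz2.compress(cps_bytes, compresslevel=9)


def from_psz(psz_bytes):
    """Recover the CPS bytes, as a viewer helper would before display."""
    return bz2.decompress(psz_bytes)


# A made-up CPS-like file: a tiny EPS followed by many comment lines
cps_bytes = (
    "%!PS-Adobe-3.0 EPSF-3.0\nshowpage\n%%EOF\n" + "% payload line\n" * 200
).encode("ascii")

psz_bytes = to_psz(cps_bytes)
restored = from_psz(psz_bytes)

print(len(cps_bytes), "bytes before,", len(psz_bytes), "bytes after compression")
print("round trip intact:", restored == cps_bytes)
```

A browser helper configured for *.psz files would perform the second step (decompression) and hand the result to the PostScript viewer.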
Since all files are treated the same, it makes little difference if the data files are ASCII text, proprietary Excel files, or other data, as long as the embedded scripts and documentation explain what to do. Of course, data files in proprietary formats may be of limited use to the recipient, so care should be taken to convert such files to an open exchange format such as ASCII text.
Discussion
Reproducibility in science is greatly aided by clear instructions and the availability of relevant data. While online demonstrations and tutorials are extremely valuable tools, it is unrealistic to hope that the majority of scientists will be able to spend the time required to produce such sophisticated products. The Complete PostScript format seeks to facilitate this process without requiring highly specialized technical skills to share the information. With this approach, PostScript documents (for example, journal articles or technical reports) that include CPS files as in-line illustrations can contain everything needed to reproduce the entire paper or report, since cpsdecode will extract all embedded data sets and other files.
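The extraction step can be sketched in the same spirit. The Python example below scans a PostScript text for comment-wrapped payloads and recovers the embedded files. It is not the cpsdecode program: the %CPS-SKETCH marker, the "% " line prefix, and the function name extract_payloads are hypothetical conventions invented for this illustration, and the real CPS layout may differ.

```python
import binascii
import bz2

MARKER = "%CPS-SKETCH"  # hypothetical tag; NOT the marker used by the real CPS tools


def extract_payloads(text):
    """Return {file name: bytes} for every comment-wrapped payload in the text."""
    begin, end = MARKER + " begin ", MARKER + " end "
    payloads, name, chunks = {}, None, []
    for line in text.splitlines():
        if line.startswith(begin):
            name, chunks = line[len(begin):], []
        elif line.startswith(end) and name is not None:
            # join the decoded chunks and undo the bzip2 compression
            payloads[name] = bz2.decompress(b"".join(chunks))
            name = None
        elif name is not None and line.startswith("% "):
            # drop the comment prefix, then undo the uuencoding of this line
            chunks.append(binascii.a2b_uu(line[2:]))
    return payloads


# Build a small sample document for the demonstration (toy data, made up here)
raw = b"site depth\nA 120\nB 340\n"
packed = bz2.compress(raw)
body = [
    "% " + binascii.b2a_uu(packed[i:i + 45], backtick=True).decode("ascii").rstrip("\n")
    for i in range(0, len(packed), 45)
]
sample = "\n".join(
    ["%!PS-Adobe-3.0 EPSF-3.0", "showpage", "%%EOF",
     MARKER + " begin drillholes.dat", *body, MARKER + " end drillholes.dat"]
) + "\n"

files = extract_payloads(sample)
print(sorted(files))
print(files["drillholes.dat"] == raw)
```

Applied to a multi-figure document, a loop of this kind would recover the files of every embedded illustration in one pass, which is the property the paragraph above relies on.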
Not all illustrations lend themselves to the treatment outlined here. In many studies, photographs or images are used for illustrations and they are naturally in a raster format like JPEG or TIFF. Since they represent the raw data, no further enclosures are required. However, converting vector data to rasters to have an image representation of the data is discouraged, since it eliminates the possibility for others to scrutinize the work and extract useful information. Other images derive from global gridded data sets available from official data centers. In these cases, it may suffice to embed a README file that explains where the data may be obtained, rather than include very large global grids directly in the PostScript file. Of course, this becomes a judgment call based on file sizes and availability of the global grid; note that such grids can be dramatically compressed using grid-compression schemes [Wessel, 2003].
Scientists will often collect data and scripts and make zip or tar files, thus producing a single archive that contains everything needed to reproduce an illustration. However, because the archives themselves are not viewable, one would require additional files, including graphics, to represent the archive on the Web. One of the main advantages of the proposed scheme is the fact that the single, final archive itself is the printable and viewable PostScript illustration.
It is hoped that the low technical requirements for making CPS files will translate into broader acceptance of the practice, leading to wider use of CPS as a mechanism for both archiving and exchanging scientific ideas.
Ultimately, CPS may help scientists live up to AGU's motto of “unselfish cooperation in research.”
Reference
Wessel, P., Compression of large data grids for Internet transmission, Computers & Geosciences, 29, 665-671, 2003.