Digital Archival Repositories: Models, Protocols, and Tools

G. Philip Rogers
School of Information and Library Science
University of North Carolina at Chapel Hill
Email: gerogers@cisco.com
Key features: References; Figures 1, 2, 3, 4, 5; Tables 1, 2, 3

Contents

  • 1 Introduction
  • 2 A Brief History of Digital Preservation
  • 3 The Development of Frameworks, Models, Protocols, and Tools
  • 4 The Future of Digital Archives
  • References
  • Notes
  • Appendix A: Best Practices for Data Providers and Service Providers
  • Appendix B: Methods to Mitigate the Risk of Losing Digital Materials
  • Appendix C: Preservation Metadata Framework Recommended by the OCLC/RLG Preservation Metadata Working Group

    Abstract

    In recent years there has been increasing interest in digital preservation, an increase that is commensurate with rapid growth in the amount of digital content. After a brief discussion of the history and key challenges of digital preservation, this paper describes a number of the frameworks, models, standards, and tools that have been developed to support the digital preservation community. The paper concludes with an assessment of current and future trends that are likely to impact this community.

    Keywords: cultural heritage, digital archives, digital libraries, digital preservation, institutional repositories, preservation metadata

    Introduction

    Increases in the power and availability of computers and other information systems have made it possible to digitally render and widely disseminate artifacts such as manuscripts and works of art. Increasingly powerful computers and feature-rich software offerings also have resulted in the creation of a tremendous volume of content that is "born digital," such as digital audio, digital video, digital documents, databases, scientific data sets, and Web sites. However, the vast increase in the amount of digital content1 poses significant challenges for those who seek to preserve, protect, and provide access to digital assets.

    This paper will begin with a review of the archival literature to provide a context for understanding the origins of digital archival repositories, including the issues that such repositories were intended to address when they were first established. The paper will then describe many of the most significant advances in frameworks, models, protocols, and tools from which digital archival repositories have benefited over the past five to seven years. The paper will close with a discussion of ongoing and future research in areas that are likely to benefit digital archival repository practitioners and scholars.

    A Brief History of Digital Preservation

    This section provides a brief survey of the literature on the origins of digital preservation activities and the unique challenges that organizations face as they develop digital preservation policy, programs, and infrastructure.

    The Origins of Digital Preservation

    Digital preservation activities originated in the archival community, a community that includes practicing archivists, archivists working in an academic setting, manuscript curators, and policy makers. Members of the archival community share a commitment to acquire, preserve, and provide access to selected artifacts that constitute the documentary record, in whatever format those artifacts might exist (Gilliland-Swetland 2000). Arguably the most important function of the archivist is preservation2, an activity that Conway (1996) defines as "the acquisition, organization, and distribution of resources to prevent further deterioration or renew the usability of selected groups of materials."

    Opinions differ on what constitutes an archive3 and on the set of activities that support archival. Within the archival community, the term "archive" encompasses all of the following: 1) collections of documents (or "records") that are chosen for preservation based on particular selection criteria; 2) organizations that are responsible for the selection, care, and use of selected documents or records, and; 3) the physical location of the repository where archival activities take place (McKemmish 1993).4 By way of contrast, outside of the archival community, Information Technology (IT) practitioners tend to see archival in terms of moving data from an online environment to an offline storage medium. Significantly, in the IT community, which vastly outnumbers the archival community, the notion of archival implies little to no sense of obligation to preserve and make accessible digital artifacts, while in the archival community preservation and accessibility are seen as fundamental tenets of the archival profession (Dollar 2000; Garrett & Waters 1996).

    Because it is not possible to preserve all artifacts, archival is by necessity a selective process.5 One of the concepts that is most central to the traditional notion of an archive is the idea of permanence. Different schools of thought have developed in regard to what characteristic of an artifact should be considered permanent. According to some archivists, preservation activities must focus on the artifact itself, while other archivists maintain that preservation should instead focus on the information that the artifact contains (O'Toole 1989). The distinction between the artifact and the information in the artifact is particularly significant when it is applied to the world of digital preservation and digital archives.

    The Digital Preservation Challenge

    To understand the scope of the digital preservation challenge, it is necessary to come to grips with something that archivists generally take for granted: in the current environment, without major intervention, digital artifacts are at far greater risk than non-digital artifacts (Besser 2000; Rothenberg 1999). At first glance, the notion that digital artifacts are somehow more fragile than non-digital artifacts, such as manuscripts and works of art, seems counter-intuitive. Artifacts that are stored as bits and bytes would appear to be considerably less susceptible to degradation over time than paper artifacts, for example. The digital preservation literature tells a completely different story, however. Perhaps the two most difficult technical challenges to address are what Besser (2000) refers to as the "viewing problem" and the "scrambling problem."

    The Viewing Problem

    Anyone who creates digital content on a regular basis over the course of several years knows that it is relatively simple to create the content, but that it can be quite a different matter to try to retrieve the content in its original format after a substantial amount of time has passed. This is the challenge that the information life cycle presents, where digital artifacts move through a number of phases, ultimately reaching the utilization phase, where they are either discarded or retained (Gilliland-Swetland 2000).

    Even if the decision is made to retain (archive) a particular digital artifact, there are a number of factors that can prevent such archival from taking place. The viewing problem is primarily manifested in two ways. One manifestation of the viewing problem is caused by the physical degradation or obsolescence of the media (hardware) on which the digital artifact is stored. The most common examples of hardware degradation or obsolescence include the disappearance of storage media from the market, the discontinued production of devices capable of reading the media, or a lack of availability of device drivers for later-generation computers that would be able to decipher the encoding scheme that is used on the media (Rothenberg 1999; Linden et al. 2005).6

    Although the challenges presented by hardware degradation and obsolescence appear daunting, the other primary manifestation of the viewing problem, software incompatibility, can pose even greater obstacles to long-term digital preservation. The ability to properly read and render a particular digital artifact requires the alignment of a number of factors, including software type, software version, and operating system. The greater the amount of time that has passed since the creation of a digital artifact, the greater the likelihood that it will not be possible to accurately render the digital artifact (Besser 2000).

    The Scrambling Problem

    The scrambling problem is a direct result of file encryption and compression algorithms and practices, which have tended to be proprietary and have often not conformed with commonly accepted standards. In the case of compression, organizations began compressing files on a routine basis in response to technology constraints such as the high cost of storage and relatively narrow bandwidth availability. Even today, in an environment with dramatically lower storage costs and an abundance of network bandwidth, many organizations continue to compress files on a routine basis (Besser 2000).

    The degree of difficulty in reading a compressed file depends on the type of compression that was applied to that file. When a "lossless" compressed file is decompressed, the resulting file is identical, bit for bit, to the file as it existed before compression. However, when a "lossy" compressed file is decompressed, the resulting file is not identical to the original version, because some information has been eliminated during the compression process.
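    The difference between the two kinds of compression can be illustrated with a minimal sketch. It assumes Python's standard zlib module as the lossless codec; the quantize function is a toy stand-in for a lossy step (it is not a real image codec) and the sample bytes are invented for illustration.

```python
import zlib

# Sample content, invented for illustration.
original = b"digital preservation " * 50

# Lossless: decompressing returns exactly the bytes that were compressed.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)


def quantize(data: bytes) -> bytes:
    """Toy 'lossy' step (an assumption, not a real codec).

    Discards the low four bits of every byte; the discarded bits
    cannot be recovered afterwards.
    """
    return bytes(b & 0xF0 for b in data)


lossy = quantize(original)

print(restored == original)   # the lossless round trip preserves every bit
print(lossy == original)      # the lossy version no longer matches
```

    Comparing the restored bytes with the original is also the simplest check an archive can run before discarding an uncompressed copy.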

    One of the most common lossy image compression formats, Joint Photographic Experts Group (JPEG)7, reduces file size by removing information that is considered almost indistinguishable to the human eye. What is not completely clear is whether any of the information that is removed by lossy compression might be important to preserve for future applications that might employ advanced techniques such as comparing and overlaying images. Furthermore, both lossy and lossless compression add complexity to the encoding of a file, a fact that might become problematic in the future for humans or machines that require access to those files (Besser 2000). Therefore, as Frey and Reilly (1999) point out, the use of file compression calls for careful planning. Their recommendation is that for any given file, there should be at least one uncompressed copy.

    Risk Mitigation: Migration vs. Emulation

    Institutions have a number of risk mitigation options for digital preservation. The two strategies that are most frequently discussed in the literature are migration and emulation. Oltmans and Kol (2005) discuss the relative merits of the two approaches, including the relative costs associated with each approach. When an institution chooses to migrate content, it typically does so upon learning that a particular version of software that renders a particular file format will be made obsolete. All files that are in the older format that are to be migrated are saved (converted) to a more recent version of the software.

    One of the risks of migration is that some elements of the document's layout, or possibly other data, could be lost during the migration process. Therefore, emulation provides an alternative to migration. Rather than converting the files from one version to another, emulation software is provided so that files saved in an older format can continue to be viewed on newer platforms and operating systems. The advantage of emulation is that the original format of the file is preserved, but the major drawback is that it is difficult to maintain emulation tools over time.

    The Development of Frameworks, Models, Protocols, and Tools

    One indicator that a discipline is entering a more mature phase is the development of systematic ways for analyzing and solving the problems that are central to that discipline. For those interested in digital preservation and digital archives, one of the most significant milestones to date originated from what might seem an unlikely source: the Consultative Committee for Space Data Systems (CCSDS). At the request of the International Organization for Standardization (ISO), the CCSDS set to work on developing formal standards that could serve as the basis for a methodology for managing the long-term storage of the voluminous amounts of digital data that were being produced during space missions (Lavoie 2004).

    OAIS

    One of the first steps that the CCSDS took was to develop a framework that could facilitate collaboration in standards-building and identify potential disciplines and stakeholders most likely to benefit from the existence of such standards. Drafts of the reference model were widely circulated over a period of several years, and in January 2002 the model was formally approved as ISO Standard 14721. In part because the development of the model was a cross-disciplinary effort that engaged many in the archival community, the model's core concept has appropriately become the de facto name for the model itself—the Open Archival Information System (OAIS) (Lavoie 2004; Consultative Committee for Space Data Systems [CCSDS] 2002).

    The CCSDS defined an OAIS as "an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community" (CCSDS 2002). As Lavoie (2004) observes, this definition emphasizes the following two functions of an archival repository: 1) to preserve information, and; 2) to provide access to the information. Furthermore, the CCSDS emphasized the importance of long-term preservation, as well as the need to focus on digital information.9

    The conceptual framework of the OAIS reference model consists of three inter-related parts, each of which will be described in the following sub-sections: 1) OAIS environment; 2) OAIS information model, and; 3) OAIS functional model.

    OAIS Environment

    As shown in Figure 1, there are three entities that interact with the OAIS (archive): 1) Management; 2) Producer; and 3) Consumer. In the OAIS reference model, "Management" does not imply an individual or group that oversees the operations of the archive. Instead, Management refers to activities such as strategic planning, defining the scope of the collection, and in some cases, ensuring that funding is available to support the archive (Lavoie 2004).

    Producers in the OAIS reference model are the entities (people, organizations, systems) that transfer content, along with any metadata associated with it, to the archive for storage. Consumers are the entities that use the content in the archive, including a special sub-group that are referred to as the Designated Community—those who can understand the artifacts in the archive in the form in which they were preserved. Another way to understand the Designated Community is as a body that is influential in determining both the type of content stored in the archive and the form in which the content is stored (Lavoie 2004).8

    OAIS Information Model

    As shown in Figure 2, the most elemental forms of information in the OAIS information model are Data Objects (physical objects or digital objects). External to the archive is a knowledge base that contains information to assist individuals or classes of individuals with understanding and interpreting Data Objects. In addition, Representation Information makes it possible to render, understand, and interpret digital objects—that is, to "impart meaning to the bits" (OCLC/RLG Preservation Metadata Working Group [PMWG] 2002). The Data Object and the Representation Information together constitute an Information Object.

    At the next level of aggregation (complexity) in Figure 2 is the Information Package. The Information Package includes the following: 1) Content Information (CI), that is, the Information Object; 2) Preservation Description Information (PDI), including Reference Information (for example, ISBN, URN), Provenance Information (the history of the CI, such as its origins and chain of custody), Context Information (a description of the relationship between the CI and its environment), and Fixity Information (authentication mechanisms such as checksums and digital signatures to help ensure that the CI has not been altered other than in the way that is documented); 3) Packaging Information (PI) for binding the digital object and its associated metadata together, and; 4) Descriptive Information (DI) that is intended to facilitate search and retrieval by integrating with the archive's finding aids.
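    Fixity Information of the checksum kind can be sketched in a few lines. The example below assumes SHA-256 from Python's standard hashlib module; the OAIS model does not prescribe a particular algorithm, and the function names and sample content are hypothetical.

```python
import hashlib


def fixity_digest(content: bytes) -> str:
    """Return a SHA-256 hex digest to record as Fixity Information."""
    return hashlib.sha256(content).hexdigest()


def verify_fixity(content: bytes, recorded_digest: str) -> bool:
    """Check that content still matches the digest recorded earlier."""
    return fixity_digest(content) == recorded_digest


# Hypothetical Content Information and the digest stored alongside it.
stored = b"<transcript>oral history</transcript>"
recorded = fixity_digest(stored)

print(verify_fixity(stored, recorded))          # unchanged content passes
print(verify_fixity(stored + b" ", recorded))   # any alteration fails
```

    A checksum of this kind detects undocumented change; it does not by itself establish who made the change, which is the role of digital signatures and Provenance Information.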

    As shown in Figures 2 and 3, there are three types of Information Packages: 1) Submission Information Package (the Information Package that is sent by the information producer to the archive); 2) Archival Information Package (the Information Package that is stored in the archive), and; 3) Dissemination Information Package (the Information Package that moves from the archive to the user in response to an access request) (PMWG 2002).
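    One way to picture the relationship among the three package types is as a single record that changes role as it moves through the archive. The sketch below is an illustration only: the class, field, and function names are assumptions made here, and the OAIS reference model does not define any programming interface.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class InformationPackage:
    """Hypothetical container mirroring the four parts of a package."""
    kind: str                                   # "SIP", "AIP", or "DIP"
    content_information: bytes
    preservation_description: Dict[str, str] = field(default_factory=dict)
    packaging_information: str = ""
    descriptive_information: Dict[str, str] = field(default_factory=dict)


def ingest(sip: InformationPackage) -> InformationPackage:
    """Turn a submitted package into the package held by the archive."""
    if sip.kind != "SIP":
        raise ValueError("only a Submission Information Package can be ingested")
    pdi = dict(sip.preservation_description)
    pdi.setdefault("provenance", "received from producer")
    return replace(sip, kind="AIP", preservation_description=pdi)


def disseminate(aip: InformationPackage) -> InformationPackage:
    """Derive the package that is delivered in response to an access request."""
    if aip.kind != "AIP":
        raise ValueError("only an Archival Information Package can be disseminated")
    return replace(aip, kind="DIP")


sip = InformationPackage(kind="SIP", content_information=b"scanned manuscript")
aip = ingest(sip)
dip = disseminate(aip)
print(sip.kind, aip.kind, dip.kind)
```

    In practice an archive may combine several submissions into one stored package or derive several disseminations from it; the one-to-one mapping here is a simplification.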

    OAIS Functional Model

    Figure 3 shows five of the six functional components in the OAIS Functional Model. Table 1 provides a description of each of the six functional components.

                    Table 1. OAIS Functional Model Components

    Functional Component: Ingest
    Description: Provides an archive's external interface with Producers.
    Functions:
    1. Receives information from the Producer.
    2. Validates that information is complete and uncorrupted.
    3. Transforms information into a form that the archive can store and manage.
    4. Extracts or creates descriptive metadata.
    5. Transfers information and associated metadata to archival store.

    Functional Component: Archival Storage
    Description: Manages storage and maintenance of the content in an archive.
    Functions:
    1. Refreshes media or migrates format as appropriate for the content.
    2. Conducts error-checking procedures.
    3. Performs disaster recovery actions.
    4. Retrieves information from storage.

    Functional Component: Data Management
    Description: Maintains databases containing descriptive and administrative metadata.
    Functions:
    1. Maintains databases that contain descriptive metadata.
    2. Performs queries and generates reports.
    3. Updates databases (creates, modifies, or deletes data).

    Functional Component: Preservation Planning
    Description: Provides a roadmap that outlines an archive's preservation strategy.
    Functions:
    1. Monitors the external environment for changes that could impact the archive's ability to preserve its content, such as innovations in storage and access technologies.
    2. Develops recommendations for updating the preservation strategy.

    Functional Component: Access
    Description: Manages processes and services that enable Consumers to locate, request, and receive content in an archive.
    Functions:
    1. Forwards access requests to Data Management.
    2. Presents the result of an access request to a Consumer.
    3. Forwards content requests to Archival Storage.
    4. Receives content from Archival Storage.
    5. Performs transformations, such as changing the format or stripping unneeded metadata.
    6. Implements security and access control mechanisms.

    Functional Component: Administration
    Description: Manages archive operations.
    Functions:
    1. Interacts with Producers (for example, negotiates Submission Agreements).
    2. Interacts with Consumers (for example, provides customer support).
    3. Interacts with Management (for example, implements and maintains archive policies and standards).
    4. Oversees operation of archive and access systems.
    5. Monitors system performance.
    6. Communicates with the other five OAIS Functional Model Components.

    Source: Lavoie 2000

    Preservation Metadata

    The OAIS Reference Model has both benefited from and contributed to advances in other areas. In particular, the development of the OAIS Information Model made it clear how central a role metadata would need to play if organizations were to provide long-term preservation of digital objects. Because it is now universally recognized that metadata will be a critical component in any digital preservation scheme, the debate has shifted to what set of metadata can best meet the needs of the various communities that require such functionality (Day 2004).

    The debate began in earnest in 1997 with the formation of the Working Group on Preservation Issues of Metadata. This seven-member working group chose to focus on the development of a core set of preservation metadata elements. In 1998, the working group released its final report, which included a set of 16 elements that were seen as "crucial to the continued viability of a digital master file" (Working Group on Preservation Issues of Metadata 1998).

    As its name implies, preservation metadata is intended to enable the preservation of digital information. Less obvious, but no less important, is the need to ensure that digital information continues to be accessible. The National Library of Australia, which was among the first organizations that sought to develop a preservation metadata model (NLA 1999), identified the following broad categories of activities that preservation metadata could support: 1) store technical information that can support preservation activities; 2) provide a record of preservation actions; 3) provide a record of the effects of preservation actions; 4) provide a record of the authenticity of preservation records over time, and; 5) provide data that facilitates collection management and rights management.

    Day (2004) indicates that preservation metadata initiatives have tended to originate in three communities. Table 2 provides a summary of initiatives that led to advances in preservation metadata.

                    Table 2. Preservation Metadata Initiatives and Projects

    Community: National/Research Libraries
    Sponsoring Organization: Consortium of University Research Libraries (CURL)
    Initiative/Project Name: CURL Exemplars in Digital Archives (CEDARS)
    Initiative/Project Description: Develop an OAIS-compliant means of building a digital archive that could support the universities of Oxford, Cambridge, and Leeds. Key deliverables included rapid prototyping of a distributed digital archiving system, a basic preservation metadata element set, and documentation of the project results so that other members of the higher education community could benefit from and apply the project findings to their own digital archive initiatives.

    Community: National/Research Libraries
    Sponsoring Organization: National Library of Australia (NLA)
    Initiative/Project Name: Preservation Metadata for Digital Collections
    Initiative/Project Description: Develop a preservation metadata set that could support both born digital and digital surrogate content.

    Community: Archives and Records
    Sponsoring Organization: Conference of European National Libraries (CENL)
    Initiative/Project Name: Networked European Deposit Library (NEDLIB)
    Initiative/Project Description: Develop an OAIS-compliant preservation metadata set that focused on technological obsolescence issues.

    Community: Archives and Records
    Sponsoring Organization: University of Pittsburgh
    Initiative/Project Name: Functional Requirements for Evidence in Recordkeeping ("Pittsburgh Project")
    Initiative/Project Description: Define a metadata structure that contains a layer for basic discovery data, with additional layers for storing information on terms and conditions of use, data structures, provenance, content and the use of a record since its creation.

    Community: Archives and Records
    Sponsoring Organization: Monash University
    Initiative/Project Name: Australian Recordkeeping Metadata Schema (RKMS)
    Initiative/Project Description: Specify and standardize a wide range of recordkeeping metadata to enable the management of records in digital environments and to facilitate interoperability with other metadata standards and resource discovery schemas.

    Community: Archives and Records
    Sponsoring Organization: National Archives of Australia
    Initiative/Project Name: Australian Standard for Records Management
    Initiative/Project Description: Specify the metadata that should be captured by the recordkeeping systems of Australian government agencies.

    Community: Archives and Records
    Sponsoring Organization: Public Record Office Victoria (PROV)
    Initiative/Project Name: Victorian Electronic Records Strategy (VERS)
    Initiative/Project Description: Define a self-documenting exchange format (an XML-based VERS Encapsulated Object) to facilitate the transfer of record content and metadata.

    Community: Digitization Projects
    Sponsoring Organization: National Information Standards Organization (NISO)
    Initiative/Project Name: Metadata for Still Images (NISO Z39.87)
    Initiative/Project Description: Develop a data dictionary containing elements for recording information about images, such as formats, compression techniques, the image creation process, quality metrics, and change history.

    Community: Digitization Projects
    Sponsoring Organization: Library of Congress
    Initiative/Project Name: Metadata Encoding & Transmission Standard (METS)
    Initiative/Project Description: Provide an XML Schema for encoding metadata to facilitate the management and exchange of digital library objects.

    Sources: Day 2004; PMWG 2001

    Eprint Repositories and OAI

    Well before the CCSDS initiated the work that would culminate in the OAIS Reference Model, when there were no formal electronic peer review mechanisms such as those that are available today, scholars sought an efficient way to communicate the status of ongoing research. One of the earliest implementations of what is now known as an "eprint repository"9 occurred in 1991. arXiv, as it was eventually called, initially served the High Energy Physics scholarly community, and ultimately expanded its reach to faculty in related disciplines such as Physics, Mathematics, and Computer Science. Before long, other disciplines were implementing their own eprint repositories (Open Archives Forum [OAF] 2003).

    As beneficial as the early eprint repositories were, it soon became clear that broader collaboration could be beneficial to scholars and others in the academic community (Warner 2003). At meetings in 1999 and 2000 in Santa Fe, New Mexico, two of the key issues under discussion were related to the lack of interoperability among eprint repositories: 1) each eprint repository had its own search interface and finding aids, and; 2) there was no manageable way to share metadata. One of the outcomes of these sessions was the creation of the Open Archives Initiative (OAI)10, as well as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (OAF 2003).

    OAI-PMH

    The meetings in Santa Fe made it clear that the scholarly community needed a consensus-driven approach for sharing metadata. Specifically, to facilitate metadata harvesting, it would be necessary to agree on: 1) a transport protocol; 2) a metadata format; 3) a metadata element set, and; 4) a set of conventions on intellectual property and usage rights.11 The OAI-PMH provides a methodology that enables organizations such as digital archives (also referred to as "Data Providers") to expose metadata about digital objects in their collections so that aggregators (also referred to as "Service Providers") can collect the metadata that is exposed to them (Tennant 2004).
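    In practice, an OAI-PMH request is an ordinary HTTP request whose query string names one of the protocol's verbs and its arguments. The following sketch builds such a request with Python's standard library. The base URL and the set name are hypothetical; ListRecords, metadataPrefix, set, and oai_dc are names defined by the protocol.

```python
from urllib.parse import urlencode

# Hypothetical Data Provider endpoint, used for illustration only.
BASE_URL = "https://repository.example.org/oai"


def build_request(verb: str, **arguments: str) -> str:
    """Compose an OAI-PMH request URL from a verb and its arguments."""
    params = {"verb": verb}
    # Skip arguments that were left empty.
    params.update({key: value for key, value in arguments.items() if value})
    return f"{BASE_URL}?{urlencode(params)}"


# Ask for every record in a (hypothetical) set, in simple Dublin Core.
url = build_request("ListRecords", metadataPrefix="oai_dc", set="physics")
print(url)
# → https://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=physics
```

    A Service Provider would issue this request over HTTP and receive an XML response; no network call is made in the sketch.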

    OAI-PMH Data Provider Responsibilities

    Data Providers manage network-accessible servers that serve as repositories. To enable different repository configurations, the OAI-PMH distinguishes between the following three entities: 1) "resources" are what the metadata describes (the items in the repository; whether a resource is stored in the repository or is in another database is outside the scope of the OAI-PMH); 2) "items" are data elements in a repository from which metadata about a resource can be disseminated, and; 3) "records" are metadata in a specific format (OAI 2004).

    Typical Data Provider architectural components, as shown in Figure 4, are: a parser that can validate OAI requests; an error generator that creates responses in XML; a tool that can query the database and retrieve metadata from the repository in the required format; an XML generator that creates responses with encoded metadata, and; an optional flow control mechanism that looks for incomplete sequences when querying larger repositories (OAF 2003).
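    The first two components, the request parser and the error generator, can be sketched as follows. The verb names and the error codes badVerb and badArgument come from the OAI-PMH specification; the function names are hypothetical, and the response is deliberately stripped down (a conforming response also carries the protocol namespace, a response date, and an echo of the request).

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Arguments each OAI-PMH verb requires when no resumptionToken is sent.
REQUIRED_ARGUMENTS = {
    "Identify": set(),
    "ListMetadataFormats": set(),
    "ListSets": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListRecords": {"metadataPrefix"},
    "GetRecord": {"identifier", "metadataPrefix"},
}
# Verbs whose responses may be split into several partial lists.
LIST_VERBS = {"ListSets", "ListIdentifiers", "ListRecords"}


def validate_request(params: Dict[str, str]) -> Optional[str]:
    """Return an OAI-PMH error code, or None if the request looks valid."""
    verb = params.get("verb")
    if verb not in REQUIRED_ARGUMENTS:
        return "badVerb"
    if "resumptionToken" in params:
        # A resumptionToken must be the only argument besides the verb.
        exclusive = set(params) == {"verb", "resumptionToken"}
        return None if verb in LIST_VERBS and exclusive else "badArgument"
    missing = REQUIRED_ARGUMENTS[verb] - set(params)
    return "badArgument" if missing else None


def error_response(code: str, message: str) -> str:
    """Build a simplified XML error response (namespace and dates omitted)."""
    root = ET.Element("OAI-PMH")
    error = ET.SubElement(root, "error", code=code)
    error.text = message
    return ET.tostring(root, encoding="unicode")


request = {"verb": "ListRecords"}          # metadataPrefix is missing
code = validate_request(request)
if code:
    print(error_response(code, "metadataPrefix is required"))
```

    This sketch checks only for missing arguments; a fuller validator would also reject unknown or repeated arguments and unsupported metadata formats.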

    The OAI-PMH was designed to make participation relatively easy for Data Providers, to encourage those with content to share it. The only required metadata scheme for Data Providers, for example, is simple Dublin Core, although Data Providers have the option to use richer metadata schemes. "As service providers gain more experience in OAI harvesting ... pressure may build on data providers to expose richer sets of metadata. However, as the OAI directorate intended, service providers will likely continue to shoulder a greater burden and responsibility to perform normalization and transformation routines" on the various metadata schemes that Data Providers expose (Tennant 2003).
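    To show what a harvested simple Dublin Core record looks like to a Service Provider, the sketch below parses one with Python's standard XML library. The two namespace URIs are the ones used for the oai_dc format; the sample record and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

# Invented sample record in the oai_dc (simple Dublin Core) format.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample Oral History Transcript</dc:title>
  <dc:creator>Example, Author</dc:creator>
  <dc:date>2001-06-10</dc:date>
  <dc:subject>oral history</dc:subject>
  <dc:subject>labor unions</dc:subject>
</oai_dc:dc>"""


def dublin_core_fields(xml_text: str) -> Dict[str, List[str]]:
    """Collect Dublin Core elements into a dict of element name -> values."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NAMESPACE + "}"
    fields: Dict[str, List[str]] = {}
    for element in root:
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            # Every Dublin Core element is optional and repeatable.
            fields.setdefault(name, []).append(element.text or "")
    return fields


fields = dublin_core_fields(SAMPLE_RECORD)
print(fields["title"])
print(fields["subject"])
```

    Because every element is optional and repeatable, each value is kept as a list; this is one reason Service Providers end up doing the normalization work that the quotation above describes.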

    OAI-PMH Service Provider Responsibilities

    Service Providers issue OAI-PMH requests to Data Providers and use metadata harvesting tools to facilitate value-added services based on that metadata, such as metadata normalization, search, and citation linking. To date, the open source and academic communities have developed customized tools that best meet their individual needs, and OAI-PMH allows for selective harvesting. For example, the focus of a recent project at the University of Illinois was on harvesting metadata for "culturally significant" materials, and the types of metadata harvested included simple Dublin Core, Dublin Core variants, Machine-Readable Cataloging record (MARC), and Encoded Archival Description (EAD) (Cole et al. 2003).

    Typical Service Provider architectural components, as shown in Figure 5, include: a Scheduler module that manages retrieval of archives; a Flow Control mechanism that partitions requests and facilitates the retrieval of additional results; an Update Mechanism that watches for previously harvested metadata, merging old and new data; an XML Parser that analyses responses from the repositories, validates them against the XML schema, and transforms the metadata encoded in XML into the internal data structure; a Normaliser that transforms different metadata formats into a common structure and can optionally map between or translate languages; a Database that stores the output of the Normaliser; a Duplication Checker that merges identical records from different data providers, and; a Service Module that provides services to users (OAF 2003).
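    Three of these components (Flow Control, Normaliser, and Duplication Checker) can be sketched together. The sketch assumes that partial responses are chained by a resumption token, as in OAI-PMH, and replaces the network with a fake in-memory Data Provider. All function names, the sample records, and the title-based duplicate rule are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Record = Dict[str, str]
Page = Tuple[List[Record], Optional[str]]   # (records, next resumption token)


def harvest(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Record]:
    """Flow control: keep requesting pages until no token is returned."""
    token: Optional[str] = None
    while True:
        records, token = fetch_page(token)
        yield from records
        if not token:
            break


def normalise(record: Record) -> Record:
    """Normaliser: map a record onto one common structure."""
    return {
        "title": record.get("title", "").strip().lower(),
        "source": record.get("source", ""),
    }


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Duplication checker: keep the first record seen for each title."""
    seen = set()
    unique: List[Record] = []
    for record in records:
        if record["title"] not in seen:
            seen.add(record["title"])
            unique.append(record)
    return unique


# Fake Data Provider: two pages linked by the token "t1" (invented data).
PAGES: Dict[Optional[str], Page] = {
    None: ([{"title": "Paper A ", "source": "repo1"}], "t1"),
    "t1": ([{"title": "paper a", "source": "repo2"},
            {"title": "Paper B", "source": "repo2"}], None),
}


def fake_fetch(token: Optional[str]) -> Page:
    return PAGES[token]


merged = deduplicate(normalise(record) for record in harvest(fake_fetch))
print([record["title"] for record in merged])   # → ['paper a', 'paper b']
```

    Matching on a normalised title is a crude duplicate test chosen to keep the sketch short; a production harvester would more likely compare identifiers or several fields at once.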

    Note: As of April 28, 2005, there were 284 OAI-conforming repositories on the list of OAI Registered Data Providers.

    METS

    Although numerous standards have been developed that are intended to encode metadata for the content in digital libraries, it was not until the Library of Congress developed the Metadata Encoding and Transmission Standard (METS) that an overall framework for integrating the various metadata schemes became available. In the absence of an overarching framework and widely adopted standards, individual institutions have tended to implement preservation schemes that meet their particular needs. METS is intended to address the lack of standardized metadata practices that has been a constraint on the growth of digital collections (Gartner 2002).

    Depending on the intended purpose of a METS document, content creators can structure the document so that it can function as any of the top-level OAIS information objects (Submission Information Package (SIP), Archival Information Package (AIP), or Dissemination Information Package (DIP)). Each METS document consists of seven main sections. The METS document sections are described in Table 3.

                    Table 3. METS Document Sections

    Section
    Description
    Example
    METS Header Contains minimal descriptive metadata about the METS object, including its date of creation, the date of its last modification, and status. Optional metadata includes the names and roles of one or more agents who have been involved with the METS document. <metsHdr CREATEDATE="2003-07-04T15:00:00" RECORDSTATUS="Complete">
    <agent ROLE="CREATOR" TYPE="INDIVIDUAL">
    <name>Jerome McDonough</name>
    </agent>
    <agent ROLE="ARCHIVIST" TYPE="INDIVIDUAL">
    <name>Ann Butler</name>
    </agent>
    </metsHdr>
    Descriptive Metadata Contains one or more <dmdSec> (Descriptive Metadata Section) elements. Each <dmdSec> element may contain a pointer to external metadata (an <mdRef> element), internally embedded metadata (within an <mdWrap> element), or both. For example, a metadata reference for finding a particular digital library object could look something like this:

    <dmdSec ID="dmd001">
    <mdRef LOCTYPE="URN" MIMETYPE="application/xml" MDTYPE="EAD"
    LABEL="Berol Collection Finding Aid">urn:x-nyu:fales1735</mdRef>
    </dmdSec>

    Administrative Metadata

    <amdSec> elements contain the administrative metadata that relates to the files that make up a digital library object, as well as the original source material from which the digital object was created.

    Four main types of administrative metadata are found in this section:
    1. Technical Metadata (for example, file format);
    2. Intellectual Property Rights Metadata (copyright and license information);
    3. Source Metadata (describes the analog source of a digital library object);
    4. Digital Provenance Metadata (describes source/destination relationships among files).

    Here is a sample <techMD> element that includes technical metadata about how a file was prepared:

    <techMD ID="AMD001">
    <mdWrap MIMETYPE="text/xml" MDTYPE="NISOIMG" LABEL="NISO Img. Data">
    <xmlData>
    <niso:MIMEtype>image/tiff</niso:MIMEtype>
    <niso:Compression>LZW</niso:Compression>
    <niso:PhotometricInterpretation>8</niso:PhotometricInterpretation>
    <niso:Orientation>1</niso:Orientation>
    <niso:ScanningAgency>NYU Press</niso:ScanningAgency>
    </xmlData>
    </mdWrap>
    </techMD>

    File Section The file section (<fileSec>) contains one or more <fileGrp> elements that group together related files. A <fileGrp> lists the files that make up a single electronic version of the digital library object. For example, there might be separate <fileGrp> elements for thumbnails, master archival images, .pdf versions, and TEI-encoded text versions. The following portrays a file section from a digital library object for an oral history that exists as a TEI-encoded transcript, a master audio file (.wav), and derivative audio (.mp3):

    <fileSec>
    <fileGrp ID="VERS1">
    <file ID="FILE001" MIMETYPE="application/xml" SIZE="257537" CREATED="2001-06-10">
    <FLocat LOCTYPE="URL">http://dlib.nyu.edu/tamwag/beame.xml</FLocat>
    </file>
    </fileGrp>
    <fileGrp ID="VERS2">
    <file ID="FILE002" MIMETYPE="audio/wav" SIZE="64232836"
    CREATED="2001-05-17" GROUPID="AUDIO1">
    <FLocat LOCTYPE="URL">http://dlib.nyu.edu/tamwag/beame.wav</FLocat>
    </file>
    </fileGrp>
    <fileGrp ID="VERS3" VERSDATE="2001-05-18">
    <file ID="FILE003" MIMETYPE="audio/mpeg" SIZE="8238866"
    CREATED="2001-05-18" GROUPID="AUDIO1">
    <FLocat LOCTYPE="URL">http://dlib.nyu.edu/tamwag/beame.mp3</FLocat>
    </file>
    </fileGrp>
    </fileSec>

    Structural Map Defines a hierarchical structure to facilitate navigation by using the <structMap> element to encode a hierarchy as a nested series of <div> elements. Each <div> includes attributes that specify what kind of division it is, as well as optional METS pointer (<mptr>) and file pointer (<fptr>) elements that can identify content that corresponds with that <div>. The following excerpt is from a simple structural map for an oral history interview with Mayor Abraham Beame of New York City, which begins with an introduction by the interviewer:

    <structMap TYPE="logical">
    <div ID="div1" LABEL="Oral History: Mayor Abraham Beame"
    TYPE="oral history">
    <div ID="div1.1" LABEL="Interviewer Introduction"
    ORDER="1">
    <fptr FILEID="FILE001">
    <area FILEID="FILE001" BEGIN="INTVWBG" END="INTVWND"
    BETYPE="IDREF" />
    </fptr>
    </div>
    </div>
    </structMap>

    Structural Links Contains only an <smLink> element that can be repeated and is used to record the existence of hyperlinks between items within the structural map (typically <div> elements). The following is an example for a web page that contains an image that is hyperlinked to another page, such that the <structMap> element contains <div> elements for the two pages:

    <div ID="P1" TYPE="page" LABEL="Page 1">
    <fptr FILEID="HTMLF1"/>
    <div ID="IMG1" TYPE="image" LABEL="Image Hyperlink to Page 2">
    <fptr FILEID="JPGF1"/>
    </div>
    </div>

    <div ID="P2" TYPE="page" LABEL="Page 2">
    <fptr FILEID="HTMLF2"/>
    </div>

    Behavior Section Associates executable behaviors with content in a METS object and contains one or more <behavior> elements, each of which has an interface definition element that represents an abstract definition of the set of behaviors represented by a particular behavior section. A <behavior> element also has a <mechanism> element that points to a module of executable code that implements and runs the specified behavior. The following example from the Mellon Fedora project shows how digital object behaviors can be implemented as linkages to distributed web services:

    <METS:behavior ID="DISS1.1" STRUCTID="S1.1" BTYPE="uva-bdef:stdImage"
    CREATED="2002-05-25T08:32:00" LABEL="UVA Std Image Disseminator"
    GROUPID="DISS1" ADMID="AUDREC1">
    <METS:interfaceDef LABEL="UVA Standard Image Behavior Definition"
    LOCTYPE="URN" xlink:href="uva-bdef:stdImage"/>
    <METS:mechanism LABEL="A NEW AND IMPROVED Image Mechanism"
    LOCTYPE="URN" xlink:href="uva-bmech:BETTER-imageMech"/>
    </METS:behavior>


    Source: Library of Congress 2004
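
    To illustrate how the sections in Table 3 relate to one another, the following Python sketch resolves the <fptr> pointers in a structural map to the file locations recorded in the file section. It is a minimal sketch rather than a reference implementation: the sample document is a simplified, namespace-free fragment modeled on the Table 3 examples (the enclosing <mets> root, the second <fptr>, and the function names index_files and resolve_divisions are assumptions made for illustration). Production METS documents typically declare the METS namespace and may record locations differently, so the lookups would need to be adapted.

```python
import xml.etree.ElementTree as ET

# Simplified METS fragment modeled on the Table 3 examples (no namespaces).
METS_SAMPLE = """
<mets>
  <fileSec>
    <fileGrp ID="VERS1">
      <file ID="FILE001" MIMETYPE="application/xml">
        <FLocat LOCTYPE="URL">http://dlib.nyu.edu/tamwag/beame.xml</FLocat>
      </file>
    </fileGrp>
    <fileGrp ID="VERS2">
      <file ID="FILE002" MIMETYPE="audio/wav">
        <FLocat LOCTYPE="URL">http://dlib.nyu.edu/tamwag/beame.wav</FLocat>
      </file>
    </fileGrp>
  </fileSec>
  <structMap TYPE="logical">
    <div ID="div1" LABEL="Oral History: Mayor Abraham Beame" TYPE="oral history">
      <div ID="div1.1" LABEL="Interviewer Introduction" ORDER="1">
        <fptr FILEID="FILE001"/>
        <fptr FILEID="FILE002"/>
      </div>
    </div>
  </structMap>
</mets>
"""


def index_files(root):
    """Map each <file> ID to its MIME type and location."""
    files = {}
    for file_el in root.iter("file"):
        flocat = file_el.find("FLocat")
        has_text = flocat is not None and flocat.text
        files[file_el.get("ID")] = {
            "mimetype": file_el.get("MIMETYPE"),
            "location": flocat.text.strip() if has_text else None,
        }
    return files


def resolve_divisions(root):
    """Return (label, [locations]) for every <div> with direct <fptr> children."""
    files = index_files(root)
    resolved = []
    for div in root.iter("div"):
        locations = [files[fptr.get("FILEID")]["location"]
                     for fptr in div.findall("fptr")]
        if locations:
            resolved.append((div.get("LABEL"), locations))
    return resolved


root = ET.fromstring(METS_SAMPLE)
for label, locations in resolve_divisions(root):
    print(label, locations)
```

    The sketch keeps the file section and the structural map separate, as METS does: the structural map carries only identifiers, so a file can be relocated by editing its <FLocat> entry without touching the hierarchy.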

    The Future of Digital Archives

    This section describes research initiatives, projects, and technological developments that appear likely to benefit the digital preservation community.

    PREMIS and Preservation Metadata

    Recognizing that published specifications for preservation metadata were for the most part organization-specific or broadly theoretical, the Online Computer Library Center (OCLC) and the Research Libraries Group (RLG) established the Preservation Metadata: Implementation Strategies (PREMIS) Working Group in 2003. The following year, the PREMIS Working Group published the results of a survey that they conducted with representatives from libraries, archives, museums, and other institutions. As pointed out by Caplan (2004), the results of this survey have considerable significance for future decisions and research on preservation metadata. The following is a summary of the survey results:

    Metadata Harvesting Tools

    Greenberg et al. (2005) observe that there is an immediate need for improved tools that can assist both Information and Library Science researchers and the digital preservation community with the generation and harvesting of metadata. For example, the needs of the research community are not being met in part because there is insufficient communication between that community and those in the software development community. Perhaps the most urgent need of all is for the digital preservation community to define the requirements of metadata generation and harvesting tools in terms that can be clearly understood and implemented by the software development community.

    Tennant (2003) identifies many areas where there is a need for additional research and development. For example:

    Recent OAI-PMH Developments

    Recent postings on the OAI web site suggest that there continues to be considerable interest in the capabilities that OAI-PMH offers. For example, Google is now using OAI-PMH to harvest metadata from the National Library of Australia (NLA), and CiteSeer recently announced OAI-PMH support. Furthermore, the open source mod_oai module is intended to optimize web crawling by exposing content on Web servers running Apache via OAI-PMH.
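
    Because OAI-PMH requests are ordinary HTTP requests and responses are XML, with unqualified Dublin Core as the base metadata set, a harvester can be quite small. The following Python sketch shows the general shape of one under stated assumptions: the repository base URL, the sample response, and the function names build_list_records_url and extract_titles are hypothetical, and the sketch parses a canned response rather than contacting a live repository. It also ignores resumption tokens and error responses, which a working harvester would have to handle.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical response, trimmed to the elements this sketch reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample oral history transcript</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def extract_titles(response_xml):
    """Return (identifier, [titles]) for each record in a ListRecords response."""
    root = ET.fromstring(response_xml)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(
            f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        titles = [t.text for t in record.iter(f"{{{DC_NS}}}title")]
        records.append((identifier, titles))
    return records


print(build_list_records_url("http://repository.example.org/oai"))
print(extract_titles(SAMPLE_RESPONSE))
```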

    Institutional Repository Software

    Although it is still early in the development of institutional repositories and the tools to support them, it is possible to draw some conclusions and to anticipate likely future directions. One of the first issues requiring attention is the need for consensus on the intended purpose of an institutional repository. Until stakeholder institutions can agree on their requirements, there is little chance that any software will meet most institutional repository needs. For now, institutions tend to fall into two camps: those who favor easy access and dissemination, and those who favor long-term preservation. Not surprisingly, therefore, institutional repository software tends to cater mainly to one side or the other of the institutional repository community (Wheatley 2004).12

    Fortunately, there are many tools that are available for use by institutional repositories, many of which are available for use under an open source license. The following resources provide information about these tools:

    For organizations that wish to build their own tools, there are numerous packages and toolkits available. For example, Net::OAI::Harvester is a Perl-based toolkit that is intended to enable individuals or organizations to build their own metadata harvesting applications (Summers 2004). And for Service Providers interested in building their own applications, Tennant (n.d.) proposes a set of specifications that can serve as the basis for creating a search service for harvested metadata.

    Technical Registry Development: PRONOM

    In a recent article, Brown (2005) provides updated information about the PRONOM project at The National Archives (TNA). PRONOM is essentially a registry that contains detailed information on over 100 file formats. Brown reports that TNA's intention is to "develop a holistic risk assessment methodology for electronic records that will enable us to identify risk factors at an early stage, predict their impact, and plan appropriate mitigation strategies." Significantly, the most recent version of the data model has been reworked to facilitate interoperability with other initiatives, such as the Global Digital Format Registry, and TNA is working to cultivate relationships with major software developers.

    Emerging Hardware Technologies and Information Lifecycle Management

    The storage industry continues to introduce products and solutions that enhance the storage capabilities of archives and other organizations. By systematically applying Information Lifecycle Management (ILM), a combination of processes and technologies that facilitate better data management, administrators can determine where various types of data should reside, based on criteria such as the type, age, and value of the data to an organization. As with any data management strategy, organizations must continuously review ILM practices such as the usage patterns of storage assets to ensure that policies are achieving the intended goals (Dupplessie, Marrone, and Kenniston 2003).
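
    An ILM placement policy of the kind described above can be pictured as a small rule set over the type, age, and value of each asset. The Python sketch below is illustrative only: the DataAsset fields, the tier names, the 90-day and 365-day thresholds, and the rule that derivatives go to the cheapest tier are all assumptions chosen for the example, not recommendations drawn from the ILM literature.

```python
from dataclasses import dataclass
from datetime import date

RECENT_DAYS = 90    # assumed threshold for "recently used"
STALE_DAYS = 365    # assumed threshold for "rarely used"


@dataclass
class DataAsset:
    name: str
    kind: str            # e.g. "master image" or "derivative"
    last_accessed: date
    value: str           # assumed institutional rating: "high", "medium", "low"


def assign_tier(asset, today):
    """Pick a storage tier from the asset's type, age, and value."""
    age_days = (today - asset.last_accessed).days
    if asset.value == "high" and age_days <= RECENT_DAYS:
        return "online disk"
    # Derivatives can be regenerated from masters, so this sketch sends them
    # to the cheapest tier along with anything that has gone unused too long.
    if asset.kind == "derivative" or age_days > STALE_DAYS:
        return "tape archive"
    return "nearline disk"


today = date(2005, 6, 1)
assets = [
    DataAsset("beame.wav", "master audio", date(2005, 5, 1), "high"),
    DataAsset("beame.mp3", "derivative", date(2005, 5, 20), "low"),
    DataAsset("survey.db", "database", date(2004, 12, 1), "medium"),
]
for asset in assets:
    print(asset.name, "->", assign_tier(asset, today))
```

    Re-running such a policy periodically against actual access dates is one way to carry out the continuous review that the ILM approach calls for.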

    Significantly, the "storage management gap" is growing ever wider, as the rapidly growing amount of data increases the size of the void between the capabilities of management tools and the ability of an organization to manage its data. Meanwhile, projections indicate that data storage needs will outstrip both the availability of existing storage administrators and the ability of organizations to bring in more administrators to address the gaps. Fortunately, "the robust hierarchy of storage devices that exists today, in conjunction with improved storage management software and SANs with intelligent, autonomic data-movement capabilities, should ultimately and significantly reduce the storage management burden from its current painful levels" (Moore 2004).

    One of the hardware technologies that is worthy of investigation by the archival community is Massive Array of Inactive Disks (MAID). The idea behind a MAID is that all of the disks in the array stay powered down; only the disk controller is active. When an application asks for an object, the controller powers up the appropriate disk drives, transfers the data, and then powers the drives down again, an approach that offers potentially significant savings in power consumption. Progress is also being made in the realm of tape drives. Some manufacturers insist that by 2010 tape drives will be available that can store a terabyte (TB) of data, and the standards that are being developed lend additional credence to the idea that such drives will be available (Linden et al. 2005).
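
    The MAID access pattern can be expressed as a toy model in which a controller powers up only the disk that holds a requested object and powers it down again once the transfer completes. The Python sketch below is a simplification for illustration; the MaidController class, its placement table, and the object names are hypothetical, and real arrays also deal with spin-up latency, caching, and concurrent requests.

```python
class MaidController:
    """Toy model of a MAID controller: disks stay powered down between requests."""

    def __init__(self, placement):
        self.placement = dict(placement)  # object ID -> disk number (hypothetical layout)
        self.powered_on = set()           # disks currently spinning
        self.spin_ups = 0                 # number of times any disk was powered up

    def read(self, object_id):
        disk = self.placement[object_id]
        self.powered_on.add(disk)         # power up only the disk holding the object
        self.spin_ups += 1
        try:
            return f"data for {object_id} read from disk {disk}"
        finally:
            self.powered_on.discard(disk)  # power the disk down after the transfer


controller = MaidController({"beame.wav": 0, "beame.mp3": 1, "beame.xml": 1})
print(controller.read("beame.wav"))
print(controller.powered_on, controller.spin_ups)
```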

    Conclusion

    Organizations face numerous challenges in managing and preserving digital content, whether it is content that was "born digital" or artifacts such as manuscripts that were converted into a digital format. In addition to the technical challenges that were described in this paper, organizations must also be concerned with the integrity and authenticity of their digital content. For example, as software vendors introduce products that are intended to manage "unstructured data" (documents),13 perhaps the greatest challenge of all is determining which version of a digital document is the original version.

    Due in part to undertakings such as OAIS, Eprint repositories, OAI-PMH, METS, and PREMIS, the archival community stands to benefit in numerous ways. Furthermore, advances in hardware technologies and the application of best practices such as Information Lifecycle Management can help archives better manage voluminous amounts of data. Nonetheless, preserving digital content is proving to be one of the greatest and most important challenges of the early 21st century.

    Acknowledgements

    The author would like to thank Paul Conway (Director, Information Technology Services, Duke University Libraries) for providing both an introduction to the subject of digital preservation and tips and motivation that were helpful in pursuing this line of inquiry; Jane Greenberg (Associate Professor, School of Information and Library Science, University of North Carolina at Chapel Hill) for identifying fruitful lines of inquiry and providing insight on areas where additional contributions to research are needed; and Thomas Brown (Links Technology storage consultant) for adding his perspective on current and future storage trends.

    References

    Beebe, L. and Meyers, B. (1999 June). "The Unsettled State of Archiving." The Journal of Electronic Publishing, Vol. 4, No. 4. http://www.press.umich.edu/jep/04-04/beebe.html

    Besser, H. (2000). "Digital Longevity." In M. Sitts (Ed.). Handbook for Digital Projects: A Management Tool for Preservation and Access. Andover, MA: Northeast Document Conservation Center. http://www.nedcc.org/digital/ix.htm

    Brown, A. (2005 April 15). "Automating Preservation: New Developments in the PRONOM Service." RLG DigiNews, Vol. 9, No. 2. http://www.rlg.org/en/page.php?Page_ID=20571#article1

    Budapest Open Access Initiative. http://www.soros.org/openaccess/software/

    California Digital Library. http://www.cdlib.org

    Caplan, P. (2004 October 15). "PREMIS - Preservation Metadata - Implementation Strategies Update 1. Implementing Preservation Repositories for Digital Materials: Current Practice and Emerging Trends in the Cultural Heritage Community." RLG DigiNews, Vol. 8, No. 5. http://www.rlg.org/en/page.php?Page_ID=20462&Printable=1&Article_ID=1674

    CEDARS Project Team (2001 June). "The Cedars Project Report: April 1998 - March 2001." http://www.leeds.ac.uk/cedars/OurPublications/CedarsProjectReportToMar01.pdf

    Cole, T. et al. (2003 25 July). "Implementation of a Scholarly Information Portal Using the Open Archives Initiative Protocol for Metadata Harvesting." University of Illinois at Urbana-Champaign. http://oai.grainger.uiuc.edu/FinalReport/TableofContents.htm

    Consultative Committee for Space Data Systems (2002 January). "Reference Model for an Open Archival Information System (OAIS)." CCSDS 650.0-R-1 – Blue Book. http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html

    Conway, P. (1996 March). "Preservation in the Digital World." Commission on Preservation and Access. http://www.clir.org/pubs/reports/conway2/index.html

    Crow, R. (2004 August). "A Guide to Institutional Repository Software" (3rd ed). Open Society Institute. http://www.soros.org/openaccess/software/

    Day, M. (2003). "Integrating Metadata Schema Registries with Digital Preservation Systems to Support Interoperability: a Proposal." http://www.siderean.com/dc2003/101_paper38.pdf

    Day, M. (2004). "Preservation metadata initiatives: practicality, sustainability, and interoperability." http://www.ukoln.ac.uk/preservation/publications/erpanet-marburg/day-paper.pdf

    Digital Preservation and Archive Committee (2001 October 18). Final Report of the Digital Preservation and Archive Committee. California Digital Library. http://libraries.universityofcalifornia.edu/sopag/dpac/DPACFinalReport.pdf

    Dollar, C. (2000). Authentic Electronic Records: Strategies for Long-Term Access. Chicago: Cohasset Associates. Online table of contents and summary: http://www.mybestdocs.com/CSum1.html

    Dupplessie, S., Marrone, N. and Kenniston, S. (2003 March 31). "The new buzzwords: Information lifecycle management." Computerworld (Quick Link 37432). http://www.computerworld.com/printthis/2003/0,4814,79885,00.html

    Florida Center for Library Automation Digital Archive. http://www.fcla.edu/digital/Archive/index.html

    Fox, E. and Marchionini, G. (1998). "Toward a World Wide Digital Library." Communications of the ACM, Vol. 41, No. 4: 28-98. http://fox.cs.vt.edu/DL/URLs.htm

    Frey, F. and Reilly, J. (1999). "Digital Imaging for Photographic Collections: Foundations for Technical Standards." Image Permanence Institute, Rochester Institute of Technology. http://www.rit.edu/~661www1/sub_pages/digibook.pdf

    Garrett, J. and Waters, D. (1996 May 1). "Preserving Digital Information: Report of the Task Force on Archiving of Digital Information." Commission on Preservation and Access and the Research Libraries Group. http://www.rlg.org/ArchTF/

    Gartner, R. (2002 October). "METS: Metadata Encoding and Transmission Standard." Techwatch report TSW 02-05. http://www.jisc.ac.uk/index.cfm?name=techwatch_report_0205

    Gilliland-Swetland, A. (2000 February). "Enduring Paradigm, New Opportunities: The Value of the Archival Perspective in the Digital Environment." Council on Library and Information Resources. http://www.clir.org/pubs/abstract/pub89abst.html

    Greenberg, J. et al. (2005 February 17). "Final Report for the AMeGA (Automatic Metadata Generation Applications) Project." http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf

    Hedstrom, M. and Ross, S. (2003). "Invest to Save: Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation." http://delos-noe.iei.pi.cnr.it/activities/internationalforum/Joint-WGs/digitalarchiving/digitalarchiving_c.html

    Hirtle, P. (2001 April). "OAI and OAIS: What's in a Name?" D-Lib Magazine, Vol. 7, No. 4. http://www.dlib.org/dlib/april01/04editorial.html

    Lavoie, B. (2000 January). "Meeting the challenges of digital preservation: The OAIS reference model." OCLC Newsletter, No. 243: 26-30. http://www.oclc.org/research/publications/archive/2000/lavoie/

    Lavoie, B. (2004 January). "The Open Archival Information System Reference Model: Introductory Guide." DPC Technology Watch Series Report 04-01. Office of Research, Online Computer Library Center. http://www.oclc.org/research/announcements/2004-02-24.htm

    Lavoie, B. and Dempsey, L. (2004 July). "Thirteen Ways of Looking at ... Digital Preservation." D-Lib Magazine, Vol. 10, No. 7/8. http://www.dlib.org/dlib/july2004/lavoie/07lavoie.html

    Lawrence, G., Kehoe, W., Rieger, O., Walters, W., and Kenney, A. (2000 June). "Risk Management of Digital Information: A File Format Investigation." Council on Library and Information Resources. http://www.clir.org/pubs/reports/pub93/contents.html

    Library of Congress (2004 September 23). "METS: An Overview & Tutorial." http://www.loc.gov/standards/mets/METSOverview.v2.html

    Linden, J., Martin, S., Masters, R., & Parker, R. (2005 February). "The large-scale archival storage of digital objects." Digital Preservation Coalition (DPC) Technology Watch Series Report 04-03. http://www.dpconline.org/graphics/reports/

    McKemmish, S. (1993). "Introducing Archives." In J. Ellis (Ed.). Keeping Archives (2nd ed.). Victoria, Australia: Thorpe.

    Moore, F. (2004). "Storage Navigator." Horison Information Strategies. http://www.horison.com/horison/books/2005/index.shtml

    National Library of Australia (1999 October 15). "Preservation Metadata for Digital Collections." Exposure Draft. http://www.nla.gov.au/preserve/pmeta.html

    National Library of New Zealand (2003 June). Metadata Standards Framework: Preservation Metadata (Revised). http://www.natlib.govt.nz/files/4initiatives_metaschema_revised.pdf

    Networked European Deposit Library (2000). "Metadata for Long Term Preservation." http://www.kb.nl/coop/nedlib/results/preservationmetadata.pdf

    OCLC/RLG Preservation Metadata Working Group (2001 January 31). "Preservation Metadata for Digital Objects: A Review of the State of the Art." http://www.oclc.org/research/projects/pmwg/wg1.htm

    OCLC/RLG Preservation Metadata Working Group (2002 June). "Preservation Metadata and the OAIS Information Model: A Metadata Framework to Support the Preservation of Digital Objects." http://www.oclc.org/research/projects/pmwg/wg1.htm

    Oltman, E. and Kol, N. (2005 April 15). "A Comparison Between Migration and Emulation in Terms of Costs." RLG DigiNews, Vol. 9, No. 2. http://www.rlg.org/en/page.php?Page_ID=20571#article0

    Open Archives Forum (2003). "OAI for Beginners - The Open Archives Forum Online Tutorial." http://www.oaforum.org/tutorial/english/intro.htm

    Open Archives Initiative (2005). "Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting: Guidelines for Repository Implementers." Document Version 2005/01/19T19:27:00Z. http://www.openarchives.org/OAI/2.0/guidelines-repository.htm

    Open Archives Initiative (2004). "The Open Archives Initiative Protocol for Metadata Harvesting." Document Version 2004/10/12T15:31:00Z. http://www.openarchives.org/OAI/openarchivesprotocol.html

    Open Archives Initiative (2005). "Registered Service Providers." Last updated March 9, 2005. http://www.openarchives.org/service/listproviders.html

    O'Toole, J. (1989). "On the Idea of Permanence." American Archivist, Vol. 52, No. 1: 10-25.

    Payette, S. and Staples, T. (2002). "The Mellon Fedora Project: Digital Library Architecture Meets XML and Web Services." Research and Advanced Technology for Digital Libraries: 6th European Conference, ECDL 2002, Rome, Italy, September 16-18, 2002. http://www.fedora.info/documents/ecdl2002final.pdf

    Pelz-Sharpe, A. (2005 May). "The eternal document question." KMWorld, Vol. 14, No. 5. http://www.kmworld.com/publications/magazine/index.cfm?action=readarticle&Article_ID=2125&Publication_ID=134

    Preserving Access to Digital Information. http://www.nla.gov.au/padi/index.html

    PREMIS Working Group (2004 September). "Implementing Preservation Repositories for Digital Materials: Current Practice and Emerging Trends in the Cultural Heritage Community." A Report by the Preservation Metadata: Implementation Strategies (PREMIS) Working Group. http://www.oclc.org/research/projects/pmwg/surveyreport.pdf

    Rothenberg, J. (1999 January). "Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation." A Report to the Council on Library and Information Resources. http://www.clir.org/pubs/abstract/pub77.html

    Summers, E. (2004 January). "Building OAI-PMH Harvesters with Net::OAI::Harvester." Ariadne, Issue 38. http://www.ariadne.ac.uk/issue38/summers/

    Tennant, R. (2004 November 4). "Inside CDL: OAI Harvesting Infrastructure." California Digital Library. http://www.cdlib.org/inside/projects/harvesting/index.html

    Tennant, R. (2003). "Bitter Harvest: Problems & Suggested Solutions for OAI-PMH Data & Service Providers." http://www.cdlib.org/inside/projects/harvesting/bitter_harvest.html

    Tennant, R. (n.d.). "Specifications for Metadata Processing Tools." http://www.cdlib.org/inside/projects/harvesting/metadata_tools.htm

    Warner, S. (2003 July 4). "Eprints and the Open Archives Initiative." Library Hi Tech, Vol. 21, No. 2: 151-158.

    Wheatley, P. (2004 March). "Institutional Repositories in the context of Digital Preservation." Digital Preservation Coalition (DPC) Technology Watch Series Report 04-02. http://www.dpconline.org/graphics/reports/

    Working Group on Preservation Issues of Metadata (1998 May). "Final Report on Preservation Metadata for Digital Master Files." http://www.rlg.org/en/page.php?Page_ID=8341

    Notes

    1 As pointed out by Fox and Marchionini (1998), preservation is one of three activities that are central to digital libraries and archives, the other two of which are maintaining access to collections and collection management. Rothenberg (1999) observes that "without preservation, access becomes impossible, and collections decay and disintegrate."

    2 For the purposes of this paper, an "archive" and a "repository" are seen as interchangeable terms. For the remainder of the paper, the term "digital archive" will be used as shorthand for "digital archival repository."

    3 McKemmish (1993) provides significant insights on the nature and scope of archival activities. For instance, she observes that "organizations create records employing whatever technology is available to them. Therefore, such records can be in any media, e.g., paper, microform, film, magnetic tape or disk, optical disk, video or audiotapes. They also come in various shapes, sizes, and formats, including letters, paper files, diaries, registers, index cards, maps, plans, microfiche, aperture cards, photographs, videocassettes, computerized databases, and electronic mail." McKemmish also notes what is a well-known schism in the archival community, best characterized by the views of Jenkinson, whose writings "emphasized the evidentiary, transactional nature of archives and the part they play in the conduct of business, and therefore the importance of preserving their integrity and authenticity, powerful concepts which are highly relevant today." By way of contrast, Schellenberg was among the most prominent voices in the school of thought that downplayed the "evidentiary nature of archives by defining them primarily in terms of the acts of selection and transfer of custody." Although the notions of integrity and authenticity are particularly significant vis-a-vis digital artifacts, integrity and authenticity are well-documented in the archival literature and are not explicitly addressed in this paper.

    4 Although it is outside the scope of this paper, it is important to note that the decision on what to preserve still tends to come down to traditional criteria that archivists have long applied to preservation decisions, such as usage, quality, and substantiation (Beebe and Meyers 1999). Garrett and Waters (1996) also noted that archivists can apply traditional preservation criteria to digital artifacts, including the extent to which the subject matter of the artifact under appraisal aligns with the collection goals of the institution, as well as the quality, uniqueness, and current and future value of the artifact.

    5 Additional considerations that limit the long-term viability of hardware include significant discrepancies between the claims of vendors and the actual performance of media such as magnetic tape and CD, the fact that even a small imperfection in certain types of media (such as tape) may render an entire medium unusable, and changes to standards that can complicate backward-compatibility (Linden et al. 2005).

    6 JPEG stands for Joint Photographic Experts Group, the group responsible for developing the compression approach named after it. JPEG is one type of lossy compression with a number of user-selectable options. The advantages of JPEG compression are its user selectability to ensure visually lossless compression, high compression ratio, good computational efficiency, and good film grain suppression characteristics. Future developments proposed for the JPEG standard allow for tiling extensions, meaning that multiple-resolution versions of an image can be stored within the same file (similar to the concept behind the Photo CD files). The concern that repeated JPEG compression causes deterioration of image quality is valid. Consequently, all image processing should occur before the file is compressed, and the image should only be saved once using JPEG.

    7 The CCSDS (2002) defines "long-term" as follows: "The information being maintained has been deemed to need Long Term Preservation, even if the OAIS itself is not permanent. Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely."

    8 Lavoie provides a real-world example of the relationship between Management, Producers, and Consumers in the OAIS reference model: "The National Digital Archive of Datasets (NDAD) is a UK-based initiative aimed at preserving computer datasets produced by UK central government departments and agencies. In this scenario, the OAIS is the National Data Repository (NDR) service, a digital preservation and access system operated by the University of London Computer Centre (ULCC). The Management role, however, resides with the UK National Archives, which retains legal custody of the archived datasets and performs a number of high-level functions associated with the NDAD initiative, including the provision of funds and selection of datasets for long-term preservation. The Producers are, of course, the various UK government departments and agencies which, as part of their organizational mission, produce computer datasets. The archived datasets are freely available for use by anyone with Web access, so NDAD's Consumers appear to be defined in the broadest terms: the general public. A visit to the NDAD Web site suggests that the scope of the Designated Community extends to the general public as well: the Web site notes that apart from Web access, little else is required to use the NDAD database. Moreover, the archived datasets are accompanied by fairly detailed descriptive information, including finding aids that explain 'why, how, and when the datasets were created'. In short, no scientific expertise or domain-specific knowledge appears to be required to use the datasets in the NDAD collection; put another way, the datasets are, by and large, 'independently understandable' by the general public."

    9 Warner (2003) offers the following working definition of an eprint: "I use the term eprint to group together many forms of scholarly literature for which there is open access to the full-content via the internet. Eprints may include: journal articles, pre-prints, technical reports, books, theses and dissertations. Eprints may or may not be refereed."

    10 Given the similarity of the terms, there is considerable potential for confusion in regard to what the difference is between OAI and OAIS. The differences are significant, as noted by Hirtle (2001): "OAI seeks to develop and promote interoperability standards to facilitate the efficient dissemination of content. Whereas the OAIS initiative arose from a need to ensure that scientific data would still be accessible in the future, OAI grew from a desire to enhance access to e-print archives as a means of increasing the availability of scholarly communication. OAI is one of the most exciting developments in the area of information dissemination, and holds out the promise of radically changing how we access and use scholarly information."

    11 As of OAI-PMH v 2.0, the protocol works as follows: requests are handled via HTTP GET / POST, responses are in XML, transport is via HTTP, and the base metadata set is unqualified Dublin Core (OAF 2003).

    12 ePrints and Fedora were developed primarily to provide open access to scholarly content, while one of the key objectives of DSpace was to provide long-term digital preservation, along with enabling access to institutional content (Wheatley 2004).

    13 An example of a recently introduced software offering that is intended to manage documents is Network Appliance's LockVault. Because unstructured data such as documents tend to heavily tax storage systems, copy-based storage systems such as LockVault are becoming increasingly common, raising concerns about whether it will be possible to determine the original version of a document with a high degree of confidence (Pelz-Sharpe 2005).

    Author Details

    Gershom Rogers is an I.T. Analyst for Cisco Systems and a part-time Ph.D. student in the School of Information and Library Science at the University of North Carolina, Chapel Hill. Gershom's research interests include digital preservation, knowledge representation, and metadata interoperability.

    Appendix A: Best Practices for Data Providers and Service Providers

    Tennant (2003) suggests the following best practices for Data Providers and Service Providers.

    Best Practices for Data Providers

    Best Practices for Service Providers

    Appendix B: Methods to Mitigate the Risk of Losing Digital Materials

    The following list of best practices is from Appendix A of "Preserving Digital Materials," by the California Digital Library's Digital Preservation and Archive Committee (2001).

    "All organizations that wish to mitigate the risk of losing digital materials should:

    1. Place the preservation repository in an institution that has preservation as a 'core value' and has a stable long-term future.
    2. Ensure the preservation repository service has a secure funding model.
    3. Specify and collect the preservation metadata that will allow for future data migrations. For example, information on how the original object was created, stored and rendered on access systems.
    4. Set standards, guidelines and best practices for the repository. For example, the CDL Digital Object Standard provides a single metadata/content encoding that can be used with any hierarchical object (book, journal, diary, correspondence, photograph, etc). Therefore, if someday TIFF is replaced, it will be easy to find all the TIFF files in each object.
    5. Limit 'linking' outside the repository. Try to have a stored object fully contained inside the repository, as the repository service cannot control the preservation of materials on the outside. However, there may be times it is not appropriate, important or possible to do this (e.g., copyright issues over harvesting linked objects).
    6. Use standard file formats for digital content. There is a greater chance that popular file formats (TIFF, PDF, XML encoding, etc.) will be able to be migrated to new technologies.
    7. Store multiple content file formats where possible and economical. For example, it should be possible for a born-digital XML document (e.g. EAD) to be “printed” to disk as a PDF document. Or, Word documents can be printed as PDF using Acrobat. Having more than one format increases the chance that the document can be migrated forward.
    8. Always preserve the original deposited material. While it may not be technically possible or economical to migrate a document upon obsolescence of the format, it may at a future time.
    9. Describe the materials in enough detail so one can check authenticity. One should be able to check a narrative metadata description against the content to help validate the material as being the correct item.
    10. Implement quality control measures. For example, use digital signature techniques to ensure an object hasn’t changed since its last back-up. Have people spot-check stored materials on a random basis.
    11. Replicate the preservation repository in multiple locations. Storing materials in different geographic locations minimizes the risk of losing materials in the primary repository to fire, flood, etc. It may also be desirable to store the same materials in multiple repositories using different technologies. This would protect against a catastrophic software failure that damages materials in the primary repository and its replications.
    12. User Education. Last but not least, it is critical that producers and consumers of preservation repository services be educated to its mission, policy and procedures, as well as their own responsibilities (e.g., developing and following standards, guidelines and best practices, negotiating submission and dissemination agreements with the repository, etc.)."
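
    The change-detection idea in item 10 of the list above can be sketched in Python. This is an illustrative assumption rather than the CDL committee's method: it uses a plain SHA-256 digest, which detects alteration but, unlike a true digital signature, does not bind the digest to a signer's key. The function names and the file name object.tif are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_changed(paths, recorded_digests):
    """Return the paths whose current digest differs from the recorded one."""
    return [p for p in paths if sha256_digest(p) != recorded_digests.get(p)]


# Demonstration with a throwaway file standing in for a stored object.
with tempfile.TemporaryDirectory() as workdir:
    stored = os.path.join(workdir, "object.tif")
    with open(stored, "wb") as handle:
        handle.write(b"original bytes")

    # Digest recorded at back-up time.
    recorded = {stored: sha256_digest(stored)}
    unchanged_report = find_changed([stored], recorded)

    # Simulate silent corruption by appending one byte.
    with open(stored, "ab") as handle:
        handle.write(b"!")
    changed_report = find_changed([stored], recorded)
```

    In this sketch unchanged_report is empty and changed_report lists the altered file. A repository would keep the recorded digests apart from the objects themselves, so that damage to one does not silently invalidate the other.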

    Appendix C: Preservation Metadata Framework Recommended by the OCLC/RLG Preservation Metadata Working Group

    This appendix summarizes the preservation metadata framework recommended by the OCLC/RLG Preservation Metadata Working Group in "Preservation Metadata and the OAIS Information Model: A Metadata Framework to Support the Preservation of Digital Objects."

    This appendix follows the conventions that are used in the OCLC/RLG document:

    CONTENT INFORMATION
      CONTENT DATA OBJECT
      REPRESENTATION INFORMATION
        CONTENT DATA OBJECT DESCRIPTION
          Underlying abstract form description
          Structural type
          Technical infrastructure of complex object
          File description
          Installation requirements
          Size
          Access inhibitors
          Access facilitators
          Significant properties
          Functionality
          Description of rendered content
          Quirks
          Documentation
        ENVIRONMENT DESCRIPTION
          SOFTWARE ENVIRONMENT
            RENDERING PROGRAMS
              Transformation process
                Transformer engine
                  Parameters
                  Input format
                  Output format
                  Location
                  Documentation
              Display/access application
                Input format
                Output format
                Location
                Documentation
            OPERATING SYSTEM
              OS name
              OS version
              Location
              Documentation
          HARDWARE ENVIRONMENT
            Location
            COMPUTATIONAL RESOURCES
              Microprocessor requirements
              Memory requirements
              Documentation
            STORAGE
              Storage information
              Documentation
            PERIPHERALS
              Peripheral requirements
              Documentation

    PRESERVATION DESCRIPTION INFORMATION
      REFERENCE INFORMATION
        Archival system identification
          Value
          Construction method
          Responsible agency
        Global identification
          Value
          Construction method
          Responsible agency
        Resource description
          Existing metadata
            Existing records
      CONTEXT INFORMATION
        Reason for creation
        Relationships
          Manifestation
            Relationship type
            Identification
          Intellectual content
            Relationship type
            Identification
      PROVENANCE INFORMATION
        Origin
          Event
            Designation
            Procedure
            Date
            Responsible agency
            Outcome
            Note
            Next occurrence
        Pre-ingest
          Event
            Designation
            Procedure
            Date
            Responsible agency
            Outcome
            Note
            Next occurrence
        Ingest
          Event
            Designation
            Procedure
            Date
            Responsible agency
            Outcome
            Note
            Next occurrence
        Archival retention
          Event
            Designation
            Procedure
            Date
            Responsible agency
            Outcome
            Note
            Next occurrence
        Rights management
          Event
            Designation
            Procedure
            Date
            Responsible agency
            Outcome
            Note
            Next occurrence
      FIXITY INFORMATION
        Object Authentication
          Authentication type
          Authentication procedure
          Authentication date
          Authentication result
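
    To show how a few of the elements listed in this appendix might be carried in software, the following Python sketch models the repeated Event group and the Object Authentication group as simple record types. It is an illustration under stated assumptions, not a schema published by the Working Group: the class names, the choice of which fields are optional, and all sample values are hypothetical; only the element names come from the framework.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Event:
    """One provenance event; fields mirror the Event sub-elements."""
    designation: str
    procedure: str
    date: str
    responsible_agency: str
    outcome: str
    note: Optional[str] = None             # optionality is an assumption
    next_occurrence: Optional[str] = None  # optionality is an assumption


@dataclass
class ObjectAuthentication:
    """Fixity Information: how and when the object was authenticated."""
    authentication_type: str
    authentication_procedure: str
    authentication_date: str
    authentication_result: str


@dataclass
class PreservationDescription:
    """A partial Preservation Description Information record."""
    ingest_events: List[Event] = field(default_factory=list)
    fixity: List[ObjectAuthentication] = field(default_factory=list)


# Illustrative values only; none come from the OCLC/RLG report.
record = PreservationDescription()
record.ingest_events.append(Event(
    designation="Format validation",
    procedure="Checked file against its declared format",
    date="2005-01-15",
    responsible_agency="Example Archive",
    outcome="Valid",
))
record.fixity.append(ObjectAuthentication(
    authentication_type="Checksum",
    authentication_procedure="Recomputed digest and compared with stored value",
    authentication_date="2005-01-15",
    authentication_result="Match",
))

# Convert to plain dictionaries, e.g. for serialization to XML or JSON.
as_dict = asdict(record)
```

    The same Event type could be reused for the Origin, Pre-ingest, Archival retention, and Rights management groups, since the framework lists the same sub-elements under each.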