Digital Object Specifications
The HathiTrust repository was created according to the framework for Open Archival Information Systems (OAIS). Definitions from this framework are used in the discussion of specifications for Digital Objects (Archival Information Packages) below.
Definitions
- Archival Information Package (AIP) – The Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within HathiTrust.
- Submission Information Package (SIP) – The Information Package that is delivered to HathiTrust for use in the construction of one or more AIPs.
- Content Information – The set of information that is the original target of preservation. It is an Information Object composed of its Content Data Object and its Representation Information.
- Content Data Object – The data object that, together with Representation Information, is the original target of preservation (in HathiTrust, currently page image files and associated OCR files and metadata).
- Representation Information – The information that maps a Data Object into more meaningful concepts (at a very granular level, this includes standards such as Unicode and TIFF).
- Preservation Description Information – The information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, and Context information.
- Provenance Information – Documents the history of the Content Information, including its creation, any alterations to its content or format over time, its chain of custody, any actions (such as media refreshment or migration) taken to preserve the Content Information, and the outcome of these actions.
- Reference Information – Uniquely identifies the Content Information within HathiTrust (e.g., repository identifier), as well as in relation to entities and systems external to HathiTrust (e.g., OCLC number or ISBN).
- Fixity Information – Validates the authenticity or integrity of the Content Information, for example by means of a checksum, a digital signature, or a digital watermark.
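As an illustration of Fixity Information, the following is a minimal sketch of computing and verifying a checksum for a content file. The function names are hypothetical, and the choice of MD5 as the default algorithm is an assumption made for the example; the definitions above do not say which algorithm HathiTrust records.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "md5", chunk_size: int = 65536) -> str:
    """Return the hex digest of a file, reading it in fixed-size chunks.

    The default algorithm (MD5) is an assumption for this sketch only.
    """
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read in chunks so large page images are never loaded fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, recorded: str, algorithm: str = "md5") -> bool:
    """Compare a freshly computed checksum with a previously recorded value."""
    return file_checksum(path, algorithm) == recorded.strip().lower()
```

A fixity check then amounts to calling `verify_fixity` with the checksum that was recorded when the object was created or ingested; a `False` result would indicate that the file has changed since that value was recorded.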
Specifications
Provenance, Reference, and Fixity Information for Content Information in HathiTrust are stored in one or more files conforming to the Metadata Encoding and Transmission Standard (METS). Digital objects or Archival Information Packages from all digitization sources include a “HathiTrust” METS file. AIPs from the Internet Archive and Google include an additional “Source” METS file. These two files are constituted as follows:
- A “Source” METS file is assembled from metadata provided to HathiTrust in the Submission Information Package, and contains information about the Content Information from the time of its creation to the time it enters the repository;
- A “HathiTrust” METS file is created on ingest and includes a subset of the Source METS file data, but is primarily a record of the object from the time it enters the repository forward. The Source METS is kept for preservation purposes only. The HathiTrust METS is used for both preservation and access purposes (i.e., in both the archival and dissemination information packages).
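To make the role of a METS file more concrete, here is a minimal sketch of reading file entries and their recorded checksums from a METS file section with Python's standard library. The embedded sample document is hypothetical and heavily abbreviated; its file names, identifiers, checksum values, and `USE` labels are invented for illustration and are not taken from an actual HathiTrust object. The `CHECKSUM` and `CHECKSUMTYPE` attributes and the namespaces are those defined by the METS and XLink schemas.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical, abbreviated METS fragment; all values are placeholders.
SAMPLE_METS = """<METS:mets xmlns:METS="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <METS:fileSec>
    <METS:fileGrp USE="image">
      <METS:file ID="IMG00000001" MIMETYPE="image/tiff"
                 CHECKSUM="0123456789abcdef0123456789abcdef" CHECKSUMTYPE="MD5">
        <METS:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM" xlink:href="00000001.tif"/>
      </METS:file>
    </METS:fileGrp>
    <METS:fileGrp USE="ocr">
      <METS:file ID="OCR00000001" MIMETYPE="text/plain"
                 CHECKSUM="fedcba9876543210fedcba9876543210" CHECKSUMTYPE="MD5">
        <METS:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM" xlink:href="00000001.txt"/>
      </METS:file>
    </METS:fileGrp>
  </METS:fileSec>
</METS:mets>"""


def list_files(mets_xml: str) -> List[Dict[str, Optional[str]]]:
    """Return one record per METS file element, in document order."""
    root = ET.fromstring(mets_xml)
    records = []
    for group in root.iter(f"{{{METS_NS}}}fileGrp"):
        for entry in group.findall(f"{{{METS_NS}}}file"):
            locator = entry.find(f"{{{METS_NS}}}FLocat")
            href = locator.get(f"{{{XLINK_NS}}}href") if locator is not None else None
            records.append({
                "use": group.get("USE"),
                "href": href,
                "checksum": entry.get("CHECKSUM"),
                "checksum_type": entry.get("CHECKSUMTYPE"),
            })
    return records


for record in list_files(SAMPLE_METS):
    print(record["use"], record["href"], record["checksum_type"], record["checksum"])
```

The recorded checksum for each file is the value a fixity check would be compared against.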
Preservation information included in the METS file is recorded using Preservation Metadata Implementation Strategies (PREMIS).
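The following sketch suggests how PREMIS events embedded in a METS file might be listed in chronological order. It rests on several assumptions: the sample document is hypothetical and abbreviated (a complete PREMIS event carries further elements, such as an event identifier), the event types and dates are invented, and the namespace shown is the PREMIS 2.x namespace. Files using another PREMIS version would need a different namespace passed as `premis_ns`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# PREMIS 2.x namespace; an assumption for this sketch, adjust for other versions.
PREMIS_V2_NS = "info:lc/xmlns/premis-v2"

# Hypothetical, abbreviated METS fragment with two embedded PREMIS events.
SAMPLE_METS_WITH_PREMIS = """<METS:mets xmlns:METS="http://www.loc.gov/METS/"
           xmlns:PREMIS="info:lc/xmlns/premis-v2">
  <METS:amdSec ID="AMD1">
    <METS:digiprovMD ID="DP1">
      <METS:mdWrap MDTYPE="PREMIS">
        <METS:xmlData>
          <PREMIS:event>
            <PREMIS:eventType>fixity check</PREMIS:eventType>
            <PREMIS:eventDateTime>2010-06-01T12:00:00</PREMIS:eventDateTime>
          </PREMIS:event>
          <PREMIS:event>
            <PREMIS:eventType>ingestion</PREMIS:eventType>
            <PREMIS:eventDateTime>2010-05-01T09:30:00</PREMIS:eventDateTime>
          </PREMIS:event>
        </METS:xmlData>
      </METS:mdWrap>
    </METS:digiprovMD>
  </METS:amdSec>
</METS:mets>"""


def list_events(mets_xml: str, premis_ns: str = PREMIS_V2_NS) -> List[Dict[str, str]]:
    """Return PREMIS events found anywhere in the document, oldest first."""
    root = ET.fromstring(mets_xml)
    events = []
    for event in root.iter(f"{{{premis_ns}}}event"):
        events.append({
            "type": event.findtext(f"{{{premis_ns}}}eventType", default=""),
            "date": event.findtext(f"{{{premis_ns}}}eventDateTime", default=""),
        })
    # Timestamps in the same ISO 8601 layout sort correctly as plain strings.
    return sorted(events, key=lambda item: item["date"])


for item in list_events(SAMPLE_METS_WITH_PREMIS):
    print(item["date"], item["type"])
```

Read in this order, the events form a simple provenance trail for the object.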
HathiTrust has defined a METS profile for the Google-digitized content archived in the repository, and has defined a generalized policy and specification framework for book and journal content (including image header metadata, resolution, identifiers, etc.). This framework is available at Contribute Content. The METS profile for Google-digitized content is given below, along with a summary of its structure. Examples of Google and Internet Archive “Source” and “HathiTrust” METS follow.
The PREMIS implementation used for most volumes in the repository is PREMIS 1.0, though the implementation used for the more recently added Internet Archive-digitized volumes is PREMIS 2.0. A description of HathiTrust’s PREMIS 1.0 usage is provided; a description of PREMIS 2.0 is forthcoming. HathiTrust plans to migrate preservation information for all content to PREMIS 2.0 in the near future.