Submission Package Requirements for Digitized Content Submitted to HathiTrust

A Guide for HathiTrust Members

Version 1.2

May 24, 2021

Updated to reflect that image_compression_date, image_compression_agent, and image_compression_tool are preferred (not required) elements.

April 29, 2022

Updated to correct section 2.2.2.2 Scanning Order and Reading Order (our system requires dashes, not underscores, in those fields)

INTRODUCTION

The HathiTrust Digital Library ensures long-term access to the content held in its repository by requiring conformance with accepted community standards for digital preservation. Specifically, the HathiTrust repository was designed according to the Open Archival Information System (OAIS) reference model. In this model, the package of data required for submission of an individual volume is referred to as a Submission Information Package, or SIP. In the HathiTrust implementation, these objects may also be referred to as Digital Object Packages, Submission Packages or Content Packages.

 

1.0 What is a Submission Package?

The submission package contains the set of files that comprise the complete digital object (usually image files and OCR text representing a physical volume), as well as metadata and other files that facilitate processing into the repository and preservation of the volume over the long term.

2.0 Submission Package Guidelines

Each individual volume MUST be submitted as a separate submission package containing all the files necessary to represent, process and manage the item.  These include:

  • Digital content (page images and OCR)
  • A metadata file
  • Fixity information in the form of a .md5 file

2.1 Digital Content

The following items are required in order to fully represent the source volume and support services (search, display, etc.) in the HathiTrust Digital Library:

2.1.1 Page images

The content package MUST include a single, well-formed TIFF or JPEG 2000 image file for each page. For more information on acceptable image formats, please see our digitization requirements for page images submitted to HathiTrust.

2.1.2 Optical Character Recognition (OCR)

Plain text OCR MUST be present for every page, unless the item is a handwritten manuscript or in a language that cannot be OCRed.

  • OCR MUST be provided as one page of plain text UTF-8 per image.
  • Raw OCR output is acceptable; there is no minimum requirement for conformance with source text.
  • Filenames MUST match the corresponding page image, plus the .txt extension
    • Example: The OCR file for 00000001.jp2 MUST be named 00000001.txt
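
The naming rule above can be checked before a package is zipped. The following is a minimal sketch, not a HathiTrust tool: the function name missing_ocr_files and the IMAGE_EXTENSIONS set are assumptions made for illustration, and the check does not know about the exemption for handwritten or non-OCRable items.

```python
from pathlib import PurePosixPath

# Assumption: page images use the two extensions shown in this guide.
IMAGE_EXTENSIONS = {".tif", ".jp2"}


def missing_ocr_files(filenames):
    """Return the .txt names expected for the page images but not present."""
    names = set(filenames)
    missing = []
    for name in sorted(names):
        path = PurePosixPath(name)
        if path.suffix.lower() in IMAGE_EXTENSIONS:
            # 00000001.jp2 -> 00000001.txt
            expected = path.with_suffix(".txt").name
            if expected not in names:
                missing.append(expected)
    return missing
```

An empty result means every page image in the listing has a matching plain text OCR file.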

In addition to plain text OCR, hOCR or other forms of coordinate OCR MAY be provided.

  • hOCR or coordinate OCR MUST be valid UTF-8
  • hOCR or coordinate OCR SHOULD be well-formed XML/XHTML.
  • There is no schema or format requirement for coordinate OCR. At this time, we are not able to correctly display ABBYY FineReader coordinate OCR.  Examples in our repository include hOCR, ALTO XML, and DjVuXML.
  • Filenames MUST match the corresponding page image, plus the appropriate extension
    • Example: The hOCR or coordinate OCR for 00000001.jp2 MUST be named 00000001.html or 00000001.xml.

The only control characters that MAY appear in OCR are tabs, carriage returns, and line feeds.  All other control characters, including form feed characters (ctrl-L, ASCII 12 / 0x0C, etc.) MUST NOT appear in the OCR.
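
As an illustration of the encoding and control character rules, the sketch below checks one page of OCR. It is an assumption-laden example rather than the validator HathiTrust runs: the name check_ocr_bytes is hypothetical, and "control characters" is interpreted here as the ASCII range 0x00-0x1F plus DEL (0x7F).

```python
import re

# Every ASCII control character except tab (0x09), line feed (0x0A)
# and carriage return (0x0D). Assumption: DEL (0x7F) is also disallowed.
_DISALLOWED = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def check_ocr_bytes(data: bytes):
    """Return a list of problems in one OCR page; an empty list means it passed."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc.reason} at byte {exc.start}"]
    return [
        f"disallowed control character U+{ord(m.group()):04X} at offset {m.start()}"
        for m in _DISALLOWED.finditer(text)
    ]
```

Reading each .txt file in binary mode and passing its bytes to this function reports form feeds and similar characters before submission.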

2.2 Metadata

The meta.yml file provides additional metadata used for ingesting material into HathiTrust. This file MUST be a well-formed YAML file.  See the YAML specification for more information.

Information supplied in the .yml file should be formatted as a series of element/value pairs, with a colon and one character space separating the two components:

[Element]: [Value]

Each element/value pair should be listed on a separate line.
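
For members who generate meta.yml programmatically, the element/value layout can be sketched with the Python standard library alone. The function name render_meta_yml is hypothetical, and the quoting rule is a conservative heuristic (it only handles a few cases, such as values containing a colon followed by a space), not a full YAML emitter; a YAML library is the safer choice for anything more complex.

```python
import json


def render_meta_yml(elements: dict) -> str:
    """Render a flat dict as 'element: value' lines, one pair per line."""
    lines = []
    for name, value in elements.items():
        text = str(value)
        # Heuristic: quote strings that would otherwise be ambiguous in YAML.
        needs_quotes = isinstance(value, str) and (
            ": " in text or " #" in text or text != text.strip() or text == ""
        )
        if needs_quotes:
            # A JSON string is also a valid YAML double-quoted scalar.
            text = json.dumps(text, ensure_ascii=False)
        lines.append(f"{name}: {text}")
    return "\n".join(lines) + "\n"
```

The returned string can be written to meta.yml with UTF-8 encoding.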

Note that in this document, element names are boldfaced for visibility.

YAML files can be created or edited with any text editor. We suggest TextWrangler on the Mac, vi or emacs on Linux, or Notepad++ on Windows. Ruth Tillman (formerly of University of Notre Dame) has also created a Python script that will generate YAML files to our requirements from a previously prepared spreadsheet containing the necessary data.

2.2.1 Required Elements

2.2.1.1 Capture Date

The date and approximate time the volume was scanned. This date will be used to populate the ModifyDate and the XMP tiff:DateTime image header elements if they are not provided in the image files.

The capture date MUST be in the ISO 8601 combined date and time format, including a time zone offset.

Example:

capture_date: 2013-11-01T12:31:00-05:00

Note: the -05:00 is a representation of a time zone offset from UTC, not a representation of a time range.
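
A value in this format can be produced from a time-zone-aware datetime. The sketch below is one possible approach using only the Python standard library; the helper name format_capture_date is an assumption for illustration.

```python
from datetime import datetime, timedelta, timezone


def format_capture_date(moment: datetime) -> str:
    """Format an aware datetime as ISO 8601 with a UTC offset, to the second."""
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("capture_date needs a time zone offset")
    # Drop microseconds so the value matches the shape of the example above.
    return moment.replace(microsecond=0).isoformat()


# Example: 12:31 on 1 November 2013, five hours behind UTC.
eastern = timezone(timedelta(hours=-5))
example = format_capture_date(datetime(2013, 11, 1, 12, 31, tzinfo=eastern))
```

Rejecting naive datetimes up front avoids writing a capture date with no offset.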

2.2.1.2 Scanner User

This value should reflect “who pushed the button” to actually scan the item. This could be a person, an organizational unit or the name of an outside vendor.

Example:

scanner_user: "University of Michigan: Digital Conversion Unit"

2.2.1.3 Image Resolution

Image resolution MUST be supplied here if it is not present in the image files. Resolution supplied in the meta.yml file will overwrite the tiff:XResolution, tiff:YResolution, and tiff:ResolutionUnit values encoded in the image header, if present. This element can therefore be used to supply the correct resolution when the value in the image files is wrong.

Example:

bitonal_resolution_dpi: 600

or

contone_resolution_dpi: 300

2.2.1.4 Compression

The following information regarding image compression is preferred, and should be included only if the images were compressed or converted to JPEG2000 before creation of the submission package. If no compression or other post-processing occurred, this information should not be included.  If you include this information, all three elements must be provided.

Examples:

image_compression_date: 2013-11-01T12:15:00-05:00

image_compression_agent: umich

image_compression_tool: ImageMagick 6.7.8

Notes:

  • image_compression_date MUST be in ISO 8601 combined date and time format.
  • image_compression_agent MUST be a HathiTrust institution identifier.
  • image_compression_tool (free-text) should include both software name and version. Multiple values are permitted and should be comma-separated.
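
The all-or-none rule for the three compression elements is easy to get wrong when metadata is assembled from several sources. The following sketch checks it on an already-parsed dictionary; the function name compression_problems is hypothetical, and empty values are treated as absent, which is an assumption.

```python
COMPRESSION_ELEMENTS = (
    "image_compression_date",
    "image_compression_agent",
    "image_compression_tool",
)


def compression_problems(meta: dict):
    """Return the missing compression elements, or [] if none or all are present."""
    present = [name for name in COMPRESSION_ELEMENTS if meta.get(name)]
    if not present or len(present) == len(COMPRESSION_ELEMENTS):
        return []
    return [f"missing {name}" for name in COMPRESSION_ELEMENTS if name not in present]
```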

2.2.2 Optional Elements

2.2.2.1 Scanner Make and Model

Example:

scanner_make: CopiBook

scanner_model: HD

2.2.2.2 Scanning Order and Reading Order

Scanning order and reading order designations are used to ensure the correct reading experience when viewing items in the HathiTrust Digital Library.  If the volume was scanned right-to-left and/or should read right-to-left, put “right-to-left” for the scanning or reading order here. If this information is not provided, volumes are assumed to be scanned left-to-right and read left-to-right.

Possible combinations are:

  • Book reads left-to-right and 00000001.tif is the FRONT cover of the book:

scanning_order: left-to-right

reading_order: left-to-right

  • Book reads left-to-right but 00000001.tif is the BACK cover of the book:

scanning_order: right-to-left

reading_order: left-to-right

  • Book reads right-to-left and 00000001.tif is the FRONT cover of the book:

scanning_order: right-to-left

reading_order: right-to-left

  • Book reads right-to-left but 00000001.tif is the BACK cover of the book:

scanning_order: left-to-right

reading_order: right-to-left
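
The four combinations above follow one rule: the scanning order equals the reading order when the first image is the front cover, and is the opposite when the first image is the back cover. A minimal sketch of that rule, with a hypothetical function name and parameters:

```python
def page_orders(reads_right_to_left: bool, first_image_is_front_cover: bool) -> dict:
    """Derive scanning_order and reading_order for the simple (single-direction) cases."""
    reading = "right-to-left" if reads_right_to_left else "left-to-right"
    opposite = {"left-to-right": "right-to-left", "right-to-left": "left-to-right"}
    # Scanning from the back cover reverses the order relative to reading.
    scanning = reading if first_image_is_front_cover else opposite[reading]
    return {"scanning_order": scanning, "reading_order": reading}
```

Note that the values use dashes, not underscores.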

For more complicated cases (e.g., books that are half in English (read left-to-right) and half in Hebrew (read right-to-left), or books that are in two left-to-right languages and one language is printed upside-down from the other), indicate the correct scanning order and either one of the correct reading orders. Users wishing to view the pages containing the other language can use the HathiTrust Digital Library interface to adjust the view appropriately.

2.2.2.3 Page Data

Optionally, page numbers and page tags can be provided in the meta.yml file. When supplied, page tags support navigation within a digitized text and are an important accessibility aid for end users who rely on screen readers. They are optional, but strongly encouraged.

The orderlabel attribute holds the source page number and the label attribute holds the page tag. Multiple page tags should be comma-separated.

Allowable page tags include:

  • BACK_COVER – Image of the back cover.
  • BLANK – An intentionally blank page.
  • CHAPTER_PAGE – A sort of half title page for a chapter or grouping of chapters — that is, a page that gives the name of the chapter or section that begins on the next page.
  • CHAPTER_START – Subsequent chapters with regular page numbering after the first. Also use this for the beginning of each appendix.
  • COPYRIGHT – Title page verso (the back of the real title page).
  • FIRST_CONTENT_CHAPTER_START – First page of the first chapter with regular page numbering. If the first chapter with regular numbering is called the introduction, that’s okay.
  • FOLDOUT – A page that folded out of the print original.
  • FRONT_COVER – Image of the front cover (if the cover of the book was scanned).
  • IMAGE_ON_PAGE – Use for plates (pages with only images, which often do not contain the regular page numbering).
  • INDEX – The first page in a sequence containing an index.
  • MULTIWORK_BOUNDARY – For items with multiple volumes bound together.
  • PREFACE – First page of each section that appears between the title page verso and the first regularly numbered page. For example, a one-page dedication on page xvi would get this tag, and then the first page of a three-page preface starting on page xviii would also get this.  However, if the introduction of the text starts on page 1 (or on an unnumbered page followed by page 2), do not use this tag (use CHAPTER_START instead). May be used for frontmatter components occurring both before and after the table of contents.
  • REFERENCES – The first page in a sequence containing endnotes or a bibliography.
  • TABLE_OF_CONTENTS – First page of the table of contents.
  • TITLE – Title page recto (the front of the real title page).
  • TITLE_PARTS – Half title page (a sort of preliminary title page before the real one).

We are aware that there are other page tagging schemes in use at various institutions.  Please contact HathiTrust staff for additional guidance in mapping your existing page tags to HathiTrust conventions.

Example:

pagedata:

  00000001.jp2: { label: "FRONT_COVER" }

  00000007.jp2: { label: "TITLE" }

  00000008.jp2: { label: "COPYRIGHT" }

  00000009.jp2: { orderlabel: "i", label: "TABLE_OF_CONTENTS" }

  00000010.jp2: { orderlabel: "ii", label: "PREFACE" }

  00000011.jp2: { orderlabel: "iii" }

  00000012.jp2: { orderlabel: "iv" }

  00000013.jp2: { orderlabel: "v" }

  00000014.jp2: { orderlabel: "vi" }

  00000015.jp2: { orderlabel: "1", label: "FIRST_CONTENT_CHAPTER_START" }

  00000016.jp2: { orderlabel: "2" }

  00000017.jp2: { orderlabel: "3" }

  00000018.jp2: { orderlabel: "4", label: "IMAGE_ON_PAGE" }

Note: the indentation above MUST use character spaces only, never tabs: see http://www.yaml.org/spec/1.2/spec.html#id2777534.
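
A pagedata block like the one above can be generated from a list of pages. The sketch below is illustrative only: render_pagedata and its input shape are hypothetical, and joining multiple tags with a comma and a space inside one label value is an assumption about how "comma-separated" should be written.

```python
import json

ALLOWED_TAGS = {
    "BACK_COVER", "BLANK", "CHAPTER_PAGE", "CHAPTER_START", "COPYRIGHT",
    "FIRST_CONTENT_CHAPTER_START", "FOLDOUT", "FRONT_COVER", "IMAGE_ON_PAGE",
    "INDEX", "MULTIWORK_BOUNDARY", "PREFACE", "REFERENCES",
    "TABLE_OF_CONTENTS", "TITLE", "TITLE_PARTS",
}


def render_pagedata(pages):
    """pages: iterable of (image filename, orderlabel or None, list of page tags)."""
    lines = ["pagedata:"]
    for filename, orderlabel, tags in pages:
        unknown = [tag for tag in tags if tag not in ALLOWED_TAGS]
        if unknown:
            raise ValueError(f"{filename}: unknown page tag(s) {unknown}")
        fields = []
        if orderlabel:
            fields.append(f"orderlabel: {json.dumps(orderlabel, ensure_ascii=False)}")
        if tags:
            fields.append(f"label: {json.dumps(', '.join(tags))}")
        if fields:
            # Two character spaces of indentation, never tabs.
            lines.append(f"  {filename}: {{ {', '.join(fields)} }}")
    return "\n".join(lines) + "\n"
```

Pages with neither a page number nor a tag are simply omitted from the output.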

3.0 Fixity

The Submission Package MUST also contain a file named checksum.md5 which contains checksums for all other files contained in the package.

The .md5 file can be generated with md5sum on Linux or md5 -r on Mac OS X. On Windows use md5sum from CoreUtils for Windows: http://gnuwin32.sourceforge.net/packages/coreutils.htm.

Example:

8c1a363eb3682542a16edf7dba036fe1  00000001.tif

8df14295ce4b6194bbb6ae66fc41d03b  00000001.txt

f30cc4a3d27f54329b3d9aaa5b2d7bda  00000002.tif

6a621fe605578f95cc66cc27b7ca77b5  00000002.txt

97c664aa9fb998dde78ce2aecbf59d73  00000003.tif

01cb4b01a9de2aa1660da009989f5f13  00000003.txt

e67cad94ae85bf6ae439583f4ab88227  meta.yml

The checksum.md5 file MUST NOT contain a checksum for checksum.md5 — it is not generally possible for a file to contain its own checksum. (Assume we compute the checksum, then add it to the file; the checksum will no longer be valid because by adding it to the file, the file’s checksum will have changed. That is to say, in order to compute the correct final checksum, it would have to already be in the file. This is not normally possible!)
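
Where md5sum is not available, the same file can be produced with the Python standard library. This is a minimal sketch under stated assumptions: write_checksum_file is a hypothetical helper, the package files sit in a single flat directory, and each file is read fully into memory, which may need to be replaced with chunked reads for very large images.

```python
import hashlib
from pathlib import Path


def write_checksum_file(package_dir):
    """Write checksum.md5 covering every other file in package_dir; return its lines."""
    package_dir = Path(package_dir)
    lines = []
    for path in sorted(package_dir.iterdir()):
        # checksum.md5 must not list itself.
        if path.is_file() and path.name != "checksum.md5":
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            # Same layout as md5sum output: digest, two spaces, filename.
            lines.append(f"{digest}  {path.name}")
    (package_dir / "checksum.md5").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Running the function again after changing any file regenerates checksum.md5 without including the old checksum file in the listing.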

4.0 Package Structure

Each item MUST be encapsulated in a .zip file, which MUST be named according to the object ID (barcode or ARK ID).

  • Example: 39015012345678.zip or ark+=28722=h2000017z.zip
    • Note that, for ARK IDs, + should be used instead of : and = should be used instead of /
  • The zip file SHOULD NOT contain any directories or internal hierarchy.
  • If the zip filename contains any alphabetic characters, these should be lowercase.
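
The naming rules above amount to two character substitutions followed by lowercasing. A short sketch, with package_filename as a hypothetical helper name:

```python
def package_filename(object_id: str) -> str:
    """Turn a barcode or ARK ID into the zip filename described above."""
    # For ARK IDs, ':' becomes '+' and '/' becomes '='.
    safe = object_id.replace(":", "+").replace("/", "=")
    return safe.lower() + ".zip"
```

For example, the ARK ID ark:/28722/h2000017z maps to ark+=28722=h2000017z.zip, and a barcode such as 39015012345678 is used unchanged.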

Sample Package:

This package contains:

  1. An image file for each page (either .tif or .jp2)
  2. A plain text OCR file for each page (.txt)
  3. Coordinate OCR for each page (.html)
  4. A checksum file for the volume (.md5)
  5. A metadata file (.yml)
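
A package with these contents and no internal directories can be assembled with Python's zipfile module. The sketch below is one possible approach, not a required tool: build_flat_zip is a hypothetical name, the files are assumed to sit directly in one source directory, and the zip should be written outside that directory so it does not include itself. The guide does not specify a compression method, so the module default (stored, uncompressed) is used here as an assumption.

```python
import zipfile
from pathlib import Path


def build_flat_zip(package_dir, zip_path):
    """Zip every file in package_dir at the top level of the archive."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w") as archive:
        for path in sorted(package_dir.iterdir()):
            if path.is_file():
                # arcname without any directory part keeps the archive flat.
                archive.write(path, arcname=path.name)
    return zip_path
```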

4.1 Sample Package

A simple digital object package is available for download here.

5.0 Package Size

Packages larger than 15 GB should be split to avoid difficulties in transfer. One option is to split the larger packages into multiple zipped files using a tool like 7-zip and the method described at https://www.webhostinghub.com/help/learn/website/managing-files/split-file.

Questions About Digitization?

Contact our member-led user support team to get started!
