Technical Requirements for Digitized Page Images Submitted to HathiTrust

A Guide for HathiTrust Members

Version 1.1.1

August 17, 2022

Grateful acknowledgement is made to the Digital Conversion Unit, University Library, at the University of Michigan. This document is adapted from their excellent instructions.

Introduction

This document provides detailed instructions for members wishing to undertake new digitization projects for deposit with HathiTrust. Conformance with these guidelines ensures a consistent user experience across the HathiTrust Digital Library and facilitates ongoing collection management at scale.

Digitization is the first step in preparing content for deposit. Please see our website for information on other steps, such as creating submission packages (including structural and other metadata), preparing and submitting bibliographic records, and various administrative requirements.

Note that this document provides guidelines for new digitization from print source materials only. For guidance regarding non-print (e.g., microform) or born-digital materials, please refer to the page Getting Content Into HathiTrust.

For information on how to evaluate the degree to which already digitized content meets our requirements, please see our Guidelines for Digital Object Deposit.

The following requirements are intended to support the creation of high quality image files for deposit with HathiTrust. Because HathiTrust serves both as a preservation repository and as an access node for digitized content, the images in our repository must match a set of technical characteristics in order to ensure format integrity and support services offered through the HathiTrust Digital Library. Items digitized to these standards meet established community best practices for both preservation and access. For more information on the HathiTrust commitment to long-term preservation of the materials in its care, please see our Preservation Philosophy.

Please contact HathiTrust at support@hathitrust.org with any questions.

1.0 Image Requirements

1.1 Image Capture

Each physical volume should be scanned in its entirety, meaning:

  • Every page should be captured, from the outside front cover to the outside back cover
  • The front cover image should include the spine if possible
  • All endpapers and covers should be scanned
  • All page images should be kept and named in sequential order

We require one and only one image for every source page within a physical volume. Each page should be captured as a bitonal or continuous tone (contone) image as detailed later in this document.

1.2 Image Quality

HathiTrust recommends the following best practices for image capture, in order to ensure a minimum quality standard and consistent user experience. Note that these are recommendations only; HathiTrust will not reject content if it deviates from these guidelines. Please contact support@hathitrust.org with any questions.

1.2.1 Image Clarity

Images should be clearly legible, in proper focus, and provide sharp representation of the original. Individual characters should be clearly identifiable.

Letters with closed loops or other bounded areas should not be filled in from underexposure, and letters with thin parts should not appear faded or broken from overexposure.

All bitonal page images should be filtered or processed to eliminate or reduce general noise or speckle effects that appear in the digital image.

Pay special attention to ensure that the removal of noise or speckling does not adversely affect the quality of individual characters.

Pay special attention to pages with illustrations, halftones, etchings, or other graphics and eliminate any moiré patterns and false coloration which may appear in the scanned image.

1.2.2 Image Skew

Deskew digital images when feasible, as an aid to Optical Character Recognition (OCR) processing and for a better viewing experience.

Please be mindful of page edges, gutters, and page content when deskewing, to prevent content from being cropped away inadvertently.

1.2.3 Image Crop and Framing

All digital images should fill the frame of the image to the largest extent possible.

Text pages should be cropped to just within the page border (after deskewing).

Any dark borders where the image capture extends beyond the physical page edge should be eliminated from the final digital product.

1.2.4 Foldouts, Centerfold Images, or Two-Page Spreads

These types of pages present special challenges for image capture. Foldouts should be captured as a single image file without adjusting the resolution. If a foldout exceeds the maximum dimensions of the available digitization equipment, we ask that the images be captured as multiple scans and stitched together, if possible. If this cannot be done, the foldout should be scanned as-is (folded up) and named according to the file sequence.

The final file should comply with filename, image capture, and file format requirements mentioned later in this document.

Centerfolds (including two-page spreads and uncut plates) should be treated as foldouts for scanning purposes. An extra blank page should be inserted after the centerfold image to maintain the correct recto/verso sequence, as follows:

Correct position of blank pages around a centerfold: first the blank back of the left side of the centerfold; then the centerfold (two pages) captured as one image, if it is smaller than eleven by seventeen inches; then the blank back of the right side of the centerfold; and finally a blank page inserted to maintain the recto/verso sequence.

2.0 Technical Considerations

All scanned images MUST meet the following requirements:

  • Images MUST have non-zero width and height.
  • Images MUST be correctly displayed in natural (pixel) order; the TIFF Orientation header is ignored.

2.1 Image Color

Images may be scanned as bitonal, grayscale, or color, according to the guidelines below.

2.1.1 Bitonal Images

Any page containing text only, or which consists of line art against the background paper color, may be captured as a bitonal image. Use reasonable judgement to determine whether a bitonal image is the most appropriate capture method.  For example, if capturing a chart on lined graph paper as a bitonal image creates a lot of background noise, it is acceptable to capture the page as grayscale (or color) instead.

2.1.2 Grayscale Images

Any page containing halftone or continuous tone photographs, variously shaded gray graphs or diagrams, or variously shaded gray lines to distinguish among multiple chart or illustrative elements should be captured as a grayscale image. All grayscale images MUST be captured with 8 bits per sample and one sample per pixel.

Use reasonable judgement to determine whether a grayscale image is the most appropriate capture method; in cases where it is not definitely clear, it is acceptable to capture the page as a color image.

2.1.3 Color Images

Any page containing faded handwriting, color photographs, colored bar graphs or diagrams, or colored lines to distinguish among multiple chart or illustrative elements should be captured as a color image.

All color images MUST be captured in a 24-bit, sRGB colorspace.

2.2 Image Resolution

For all images, X resolution and Y resolution (pixels per inch along the X-axis and Y-axis) MUST have identical values; that is, pixels MUST have the same height and width.

2.2.1 Bitonal Images

All bitonal page images MUST have a resolution of 600 pixels per inch (ppi).

This resolution should be an uninterpolated optical resolution wherever possible; if the scanning equipment can only achieve this resolution via interpolation, the resolution MUST be interpolated by the scanning camera and its software as part of the original page scan, not as part of a separate post-processing image adjustment.

It is acceptable to produce initial scans of at least 300 ppi 8-bit grayscale or 24-bit color optically, and then convert to 600 ppi bitonal in post-processing.

2.2.2 Continuous Tone Images

All continuous tone page images MUST have a resolution of at least 300 ppi.

It is acceptable to scan a contone image at a higher resolution (e.g., 600 ppi) and then downsample the final image file to 300 ppi, preferably using bicubic interpolation in post processing.  Upsampling of color or grayscale images to 300 ppi using interpolation is not acceptable.
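The resolution rules in sections 2.2.1 and 2.2.2 can be expressed as a small pre-submission check. The following is a minimal sketch, not HathiTrust's own validator. The function name `resolution_problems` and the `kind` labels are illustrative assumptions, and the resolution values are assumed to have been read from the image header already.

```python
def resolution_problems(kind, x_ppi, y_ppi):
    """Return a list of problems for one image; an empty list means none found.

    kind is "bitonal" or "contone" (hypothetical labels used only in this sketch).
    """
    if kind not in ("bitonal", "contone"):
        raise ValueError(f"unknown image kind: {kind!r}")
    problems = []
    # Section 2.2: X and Y resolution must be identical (square pixels).
    if x_ppi != y_ppi:
        problems.append("X and Y resolution differ")
    # Section 2.2.1: bitonal images must be exactly 600 ppi.
    if kind == "bitonal" and (x_ppi, y_ppi) != (600, 600):
        problems.append("bitonal images must be 600 ppi")
    # Section 2.2.2: contone images must be at least 300 ppi.
    if kind == "contone" and min(x_ppi, y_ppi) < 300:
        problems.append("contone images must be at least 300 ppi")
    return problems
```

A check like this cannot tell whether a reported resolution was reached by upsampling, so the interpolation rules above still have to be enforced in the scanning workflow itself.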

2.3 Image Format

Image files MUST be submitted in one of two formats: either TIFF or JPEG2000.

  • Bitonal image files MUST be delivered in TIFF format with CCITT Group 4 compression.
  • Contone images may be delivered in TIFF format or as JPEG2000.

Images MUST be minimally well-formed against either the TIFF or JPEG2000 standard.  Compliance with the appropriate format will be verified at HathiTrust using JHOVE object validation software.  Please see the sections below for additional information.

It is acceptable, and in some cases desirable, to mix image formats within a single volume.  For example, a volume containing black-and-white text with a few color images may be scanned primarily as bitonal TIFFs with only a few color JPEG2000 files.

2.3.1 TIFF Format Requirements

Bitonal TIFFs MUST have one bit per sample and one sample per pixel.

Contone TIFFs are acceptable for submission, but they will be compressed to JPEG2000 at HathiTrust prior to ingest.  Therefore, continuous tone TIFFs MUST be readable by the grk_compress utility from the Grok JPEG2000 toolkit, after being uncompressed with ImageMagick (if necessary) and after ICC profiles are stripped (if present).  The following command line statement is included as a best practice for the current (9.7.x) version of grk_compress:

 grk_compress -i in_file.tif -o out_file.jp2 -p RLCP -n 5 -SOP -EPH -M 62 -I -q 32

In addition, continuous tone TIFFs MUST be:

  • 8-bit (grayscale) or 24-bit (color); that is, they MUST have eight bits per sample and either one (for grayscale) or three (for color) samples per pixel.
  • In the sRGB colorspace (if color).
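The bit-depth requirements listed above can be checked by reading two tags from the TIFF header. The following is a minimal sketch that uses only the Python standard library. The function names `read_first_ifd` and `contone_tiff_ok` are illustrative assumptions. The sketch handles only classic (non-BigTIFF) files, looks only at the first image directory, and does not check the colorspace.

```python
import struct

# Baseline TIFF tag numbers for the two fields this sketch inspects.
WANTED = {258: "BitsPerSample", 277: "SamplesPerPixel"}

def read_first_ifd(data):
    """Return the wanted SHORT/LONG tags from the first IFD of a classic TIFF."""
    if data[:2] == b"II":
        order = "<"  # little-endian
    elif data[:2] == b"MM":
        order = ">"  # big-endian
    else:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (magic number is not 42)")
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, count = struct.unpack(order + "HHI", data[start:start + 8])
        if tag not in WANTED or field_type not in (3, 4):
            continue
        code, size = ("H", 2) if field_type == 3 else ("I", 4)
        if size * count <= 4:
            # Values that fit in four bytes are stored inline in the entry.
            raw = data[start + 8:start + 8 + size * count]
        else:
            # Otherwise the entry holds an offset to the values.
            (offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
            raw = data[offset:offset + size * count]
        tags[WANTED[tag]] = struct.unpack(order + code * count, raw)
    return tags

def contone_tiff_ok(tags):
    """True for 8 bits per sample with one (grayscale) or three (color) samples."""
    bits = tags.get("BitsPerSample", (1,))          # TIFF default is 1
    samples = tags.get("SamplesPerPixel", (1,))[0]  # TIFF default is 1
    return samples in (1, 3) and len(bits) == samples and all(b == 8 for b in bits)
```

A file that passes this check may still fail validation for other reasons, so it is best treated as an early filter ahead of JHOVE rather than a replacement for it.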

2.3.2 JPEG2000 Format Requirements

In order to maintain consistency among the JPEG2000 files contributed by our members, HathiTrust has developed the following parameters for all JPEG2000 files, based on the JPEG2000 standard.

The following parameters are required:

  • All image files supplied in the JPEG2000 file format MUST comply with Part 1 of the JPEG2000 core coding system as specified in ISO/IEC 15444-1:2000.  Images with Part 2 components will not validate.
  • Files MUST be supplied with the “.jp2” extension.
  • The JPEG2000 file’s height and width MUST be the same as the master image file after transcoding.
  • The JPEG2000 file MUST be in the sRGB or grayscale color space.
  • The JPEG2000 file MUST use only lossy JPEG2000 compression.
  • The JPEG2000 file MUST consist of a single page image.  Multi-page JPEG2000 files are not supported.

The following JPEG2000 parameters are preferred:

  • The JPEG2000 file should be prepared after any image processing or clean-up (deskewing, despeckling, etc.) of the original source image is performed.
  • The JPEG2000 file’s image X origin, image Y origin, tile X origin, and tile Y origin should be 0.
  • The JPEG2000’s progression order should be RLCP (resolution, layer, component, position) or RLPC (resolution, layer, position, component).
  • The JPEG2000 file should have 8 quality layers.
  • The JPEG2000 transcoding process should use the 9-7 irreversible filter.
  • The JPEG2000 file should have between 5 and 32 decomposition levels, depending on the pixel dimensions of the image. See below for details.

2.3.2.1 Decomposition

When grayscale and color image files are compressed using JPEG2000, the number of resolution decomposition levels should be based on the image maximum dimension. The following chart provides a quick reference for determining the decomposition level for most image files, based on largest dimension (length or width).

Technically, smaller items would have decomposition levels ranging from 1 to 4; however, our hard lower limit requires all items with a length or width less than 17 inches/6788 pixels to be set to 5.

Reference for Determining the Image Decomposition Level

Min. dimension (pixels)   Max. dimension (pixels)   Min. dimension (inches)   Max. dimension (inches)   Resolution (ppi)   Decomposition level
400                       6788                      1                         16.97                     400                5
6789                      13579                     17                        33.94                     400                6
13577                     26976                     34                        67.88                     400                7
26977                     54305                     67.9                      135.76                    400                8
54306                     ...                       135.77                    ...                       400                9
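The lookup in the reference table above can be written as a short function. This is a minimal sketch; the name `decomposition_level` is an illustrative assumption. The upper bounds are taken from the table's maximum-dimension column, and images smaller than the table's first row are assigned level 5, following the hard lower limit described above.

```python
# (upper pixel bound, decomposition level) pairs copied from the reference table.
LEVEL_BOUNDS = [(6788, 5), (13579, 6), (26976, 7), (54305, 8)]

def decomposition_level(width_px, height_px):
    """Return the decomposition level for an image, based on its longest side."""
    longest = max(width_px, height_px)
    if longest <= 0:
        raise ValueError("image dimensions must be positive")
    for upper_bound, level in LEVEL_BOUNDS:
        if longest <= upper_bound:
            return level
    return 9  # anything larger than the last bounded row


# A 6 x 9 inch page scanned at 400 ppi is 2400 x 3600 pixels.
print(decomposition_level(2400, 3600))  # prints 5
```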

2.3.2.2 Transcoding

Many scanning hardware and software packages do not support native image capture to the JPEG2000 format. Transcoding an image file to the JPEG2000 file format is considered acceptable and normative. In such cases, the format of the original image file should be a non-distorting format, such as uncompressed TIFF, and not a format that inherently downgrades the quality of the image through lossy compression (e.g., JPEG).

Software packages for the transcoding of image files into the JPEG2000 format vary widely. HathiTrust recommends the open-source Grok JPEG2000 codec for this purpose. To convert a TIFF file to a JPEG2000 file that will meet HathiTrust specifications, use the command:

 grk_compress -i in_file.tif -o out_file.jp2 -p RLCP -n 5 -SOP -EPH -M 62 -I -q 32

We can provide support for creating JPEG2000 images with Grok. Please contact support@hathitrust.org for assistance.
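For batch work, the grk_compress command shown above can be assembled and run from a script. The following is a minimal sketch that assumes grk_compress is installed and on the PATH. The function names `grk_compress_args` and `transcode` are illustrative assumptions; the flags are copied unchanged from the command above.

```python
import subprocess

def grk_compress_args(in_file, out_file):
    """Build the argument list for the documented grk_compress command."""
    return [
        "grk_compress",
        "-i", str(in_file),
        "-o", str(out_file),
        "-p", "RLCP",
        "-n", "5",
        "-SOP", "-EPH",
        "-M", "62",
        "-I",
        "-q", "32",
    ]

def transcode(in_file, out_file):
    """Run grk_compress; raises CalledProcessError if it exits with an error."""
    subprocess.run(grk_compress_args(in_file, out_file), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.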

3.0 File Naming and Directory Structure

Each page image should be given an eight-character filename followed by a three-letter file extension (.tif or .jp2).  File naming for each volume begins with 00000001 and increments sequentially for each subsequent image, following the sequence of the original source material.
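The naming rule can be illustrated with a small helper. This is a minimal sketch; the function name `page_filename` is an illustrative assumption.

```python
def page_filename(sequence, extension):
    """Return the zero-padded, eight-digit filename for one page image."""
    if extension not in ("tif", "jp2"):
        raise ValueError("extension must be 'tif' or 'jp2'")
    if not 1 <= sequence <= 99999999:
        raise ValueError("sequence must fit in eight digits and start at 1")
    return f"{sequence:08d}.{extension}"


print(page_filename(1, "tif"))  # prints 00000001.tif
print(page_filename(5, "jp2"))  # prints 00000005.jp2
```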

Each physical volume should be organized into a separate zip archive named for the volume barcode or ARK ID. All page images, from outside front cover to outside back cover, should be stored inside the zip archive.

The following listing shows how files should be named and organized for each volume:

 

<Barcode>         (Folder) Barcode of 1st volume

00000001.tif       TIFF page image file; filename is 8 characters long

00000002.tif       TIFF page image file; filename is 8 characters long

00000003.tif

00000004.tif

00000005.jp2     JPEG2000 color or grayscale page image; filename is 8 characters long

00000006.tif

00000007.tif

00000008.tif

 

<ARK ID>         (Folder) ARK ID of 2nd volume

00000001.tif     TIFF page image file; filename is 8 characters long

00000002.tif     TIFF page image file; filename is 8 characters long

00000003.tif

00000004.tif

….

00000032.tif

00000033.tif

00000034.jp2   JPEG2000 color or grayscale page image; filename is 8 characters long

00000035.jp2   JPEG2000 color or grayscale page image; filename is 8 characters long

00000036.tif

00000037.tif

4.0 Quality Control

HathiTrust will perform the following validations prior to ingest.  The information in this section is offered to assist contributors in establishing and implementing their own quality control procedures prior to submission.

4.1 Validation of File Structure

HathiTrust will verify that the digital zip archive structure meets the guidelines described in this document. This includes:

  • Verifying that the zip archive is named with the proper identifier
  • Verifying that no unexpected files are present in the zip archive
  • Verifying that no subdirectories are present in the zip archive
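Contributors can run a similar structural check before submission. The following is a minimal sketch, not HathiTrust's own validation code. The function name `archive_structure_problems` and the `allowed_extra` parameter are illustrative assumptions. Which non-image files are expected depends on the submission package requirements described elsewhere, so they must be passed in explicitly. The archive name is not checked here.

```python
import re
import zipfile

# Eight ASCII digits followed by one of the two permitted extensions.
PAGE_NAME = re.compile(r"[0-9]{8}\.(tif|jp2)")

def archive_structure_problems(zip_path, allowed_extra=()):
    """List subdirectories and unexpected files found in one volume archive.

    allowed_extra names non-image files that the caller expects to be present
    (an assumption of this sketch; the expected set is defined elsewhere).
    """
    problems = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if "/" in name:
                # Zip member names use "/" as the path separator.
                problems.append(f"subdirectory or nested path: {name}")
            elif name in allowed_extra or PAGE_NAME.fullmatch(name):
                continue
            else:
                problems.append(f"unexpected file: {name}")
    return problems
```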

4.2 Validation of File Naming Conventions

HathiTrust will verify that the digital filenames meet the guidelines described in this document. This includes:

  • Verifying the number of characters in each filename
  • Verifying that all filenames contain numeric characters only
  • Verifying that all files in a given zip archive are named in sequential numeric order, with no skipped or missing files
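The three naming checks above can be approximated on the contributor side. The following is a minimal sketch; the function name `naming_problems` is an illustrative assumption, and the input is assumed to be the list of page image filenames for a single volume.

```python
import re

def naming_problems(filenames):
    """Check length, digits-only stems, and gap-free numbering for one volume."""
    problems = []
    numbers = []
    for name in filenames:
        stem = name.rpartition(".")[0]  # text before the last dot
        if re.fullmatch(r"[0-9]{8}", stem):
            numbers.append(int(stem))
        else:
            problems.append(f"{name}: name before the extension is not exactly 8 digits")
    # Valid names must number 1..N with nothing skipped or repeated.
    if sorted(numbers) != list(range(1, len(numbers) + 1)):
        problems.append("sequence does not run from 00000001 without gaps or duplicates")
    return problems
```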

4.3 Validation of File Format

HathiTrust will verify that the digital image files are compliant with the image format that they purport to be, as indicated by the filenaming extension (.tif or .jp2).
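A quick way to catch mislabeled files before submission is to compare the file extension with the leading signature bytes. The following is a minimal sketch with illustrative function names. It checks only the first few bytes of each file, which is a much weaker test than the well-formedness validation JHOVE performs.

```python
# Classic TIFF files start with a byte-order mark followed by the number 42.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")
# JP2 files start with the 12-byte JPEG2000 signature box.
JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"

def sniff_format(head):
    """Return "tif", "jp2", or None based on the first bytes of a file."""
    if head[:4] in TIFF_MAGIC:
        return "tif"
    if head[:12] == JP2_SIGNATURE:
        return "jp2"
    return None

def extension_matches(filename, head):
    """True if the filename extension agrees with the sniffed format."""
    extension = filename.rpartition(".")[2]
    return sniff_format(head) == extension
```

In practice `head` would be the first 12 bytes read from the file opened in binary mode.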

4.4 Troubleshooting and Remediation

4.4.1 Contributor-side remediation

A number of requirements can be validated and/or corrected on the contributor side, prior to submission to HathiTrust.  Refer to the following checklist:

  • Is there one and only one image file for each page of each volume?
  • Are all image files named with proper 8-digit filenames?
  • Do all image files have a proper file extension (either .tif or .jp2)?
  • Are all the files for each volume contained within a properly named zip archive?
  • Does the sequence of image files for a volume start with 00000001 and increment sequentially for each file thereafter?
  • Are all bitonal images in TIFF format?
  • Do all images meet the quality and technical parameters outlined in this document?

We recommend the JHOVE object validation environment developed by Harvard University and freely available to the public. HathiTrust currently uses JHOVE 1.16; contributors are strongly discouraged from using version 2.x. Any questions regarding acceptable file validation output from JHOVE should be directed to support@hathitrust.org.

Additionally, the following information may be helpful in identifying and remediating JPEG2000 errors:

  • JPEG2000 images that use features from Part 2 can be identified by running JHOVE on them. If the “Brand” value includes “jpx” or “jpf”, the image uses features from Part 2. If the “Brand” value is only “jp2”, then the image uses only features from Part 1.
  • JPEG2000 images created with Photoshop will definitely NOT validate (they always use JPEG2000 Part 2 extensions).
  • It is possible to create compliant JPEG2000 images with other JPEG2000 codecs besides grok, including the commercial Kakadu codec and open source OpenJPEG and JasPer codecs (or tools that use these software development kits). However, we may not be able to troubleshoot problems with images created with these tools.
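When JHOVE is not at hand, the brand can also be read directly from the start of the file. The following is a minimal sketch; the function name `jp2_brand` is an illustrative assumption. It assumes the File Type box immediately follows the signature box and reads only the brand field, so it is a rough screen and not a substitute for JHOVE validation.

```python
import struct

# The 12-byte JPEG2000 signature box that opens every JP2-family file.
JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"

def jp2_brand(data):
    """Return the brand from the File Type box (e.g. "jp2" or "jpx"), or None."""
    if data[:12] != JP2_SIGNATURE or len(data) < 24:
        return None
    # Box header: 4-byte big-endian length, then a 4-byte box type.
    _length, box_type = struct.unpack(">I4s", data[12:20])
    if box_type != b"ftyp":
        return None
    # The brand is the first 4 bytes of the box contents, padded with a space.
    return data[20:24].decode("ascii", "replace").strip()
```

A result of "jp2" suggests a Part 1 file, while "jpx" points to Part 2 features, matching the JHOVE "Brand" values described above.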

4.4.2 HathiTrust-side remediation

Many images that do not meet the requirements outlined above can be normalized at HathiTrust prior to ingest.

With regard to TIFF files, we can address the following:

  • Images with no compression or compression other than Group 4/CCITT
  • Bitonal TIFFs with the following errors reported by JHOVE:
      • ‘IFD offset not word-aligned’
      • ‘Value offset not word-aligned’
      • ‘Tag 269 out of sequence’
      • ‘Invalid DateTime separator’
      • ‘Invalid DateTime digit’ (if correctly formatted capture date is provided in meta.yml)
      • ‘Count mismatch for tag 306’
      • ‘PhotometricInterpretation not defined’

We do not support (and cannot address):

  • TIFFs with 16 bits per sample
  • TIFFs in any colorspace other than grayscale or sRGB (e.g. CMYK, Adobe RGB, etc.)
  • TIFFs with alpha channels

With regard to JPEG2000 files, we can normalize the following:

  • Images with lossless compression can be re-saved with lossy compression

Questions about Digitization?

Contact our member-led user support team to get started!
