Technical Requirements for Digitized Page Images Submitted to HathiTrust
A Guide for HathiTrust Members
Version 1.1.1
August 17, 2022
Grateful acknowledgement is made to the Digital Conversion Unit, University Library, at the University of Michigan. This document is adapted from their excellent instructions.
Introduction
This document provides detailed instructions for members wishing to undertake new digitization projects for deposit with HathiTrust. Conformance with these guidelines ensures a consistent user experience across the HathiTrust Digital Library and facilitates ongoing collection management at scale.
Digitization is the first step in preparing content for deposit. Please see our website for information on other steps, such as creating submission packages (including structural and other metadata), preparing and submitting bibliographic records, and various administrative requirements.
Note that this document provides guidelines for new digitization from print source materials only. For guidance regarding non-print (e.g., microform) or born-digital materials, please refer to the page Getting Content Into HathiTrust.
For information on how to evaluate the degree to which already digitized content meets our requirements, please see our Guidelines for Digital Object Deposit.
The following requirements are intended to support the creation of high quality image files for deposit with HathiTrust. Because HathiTrust serves both as a preservation repository and as an access node for digitized content, the images in our repository must match a set of technical characteristics in order to ensure format integrity and support services offered through the HathiTrust Digital Library. Items digitized to these standards meet established community best practices for both preservation and access. For more information on the HathiTrust commitment to long-term preservation of the materials in its care, please see our Preservation Philosophy.
Please contact HathiTrust at support@hathitrust.org with any questions.
1.0 Image Requirements
1.1 Image Capture
Each physical volume should be scanned in its entirety, meaning:
- Every page should be captured, from the outside front cover to the outside back cover
- The front cover image should include the spine if possible
- All endpapers and covers should be scanned
- All page images should be kept and named in sequential order
We require one and only one image for every source page within a physical volume. Each page should be captured as a bitonal or continuous tone (contone) image as detailed later in this document.
1.2 Image Quality
HathiTrust recommends the following best practices for image capture, in order to ensure a minimum quality standard and consistent user experience. Note that these are recommendations only; HathiTrust will not reject content if it deviates from these guidelines. Please contact support@hathitrust.org with any questions.
1.2.1 Image Clarity
Images should be clearly legible, in proper focus, and provide sharp representation of the original. Individual characters should be clearly identifiable.
Letters with closed loops or other bounded areas should not be filled in from underexposure, and letters with thin parts should not appear faded or broken from overexposure.
All bitonal page images should be filtered or processed to eliminate or reduce general noise or speckle effects that appear in the digital image.
Pay special attention to ensure that the removal of noise or speckling does not adversely affect the quality of individual characters.
Pay special attention to pages with illustrations, halftones, etchings, or other graphics and eliminate any moiré patterns and false coloration which may appear in the scanned image.
1.2.2 Image Skew
Deskew digital images when feasible, as an aid to Optical Character Recognition (OCR) processing and for a better viewing experience.
Please be mindful of page edges, gutters, and page content when deskewing, to prevent content from being cropped away inadvertently.
1.2.3 Image Crop and Framing
All digital images should fill the frame of the image to the largest extent possible.
Text pages should be cropped to just within the page border (after deskewing).
Any dark borders where the image capture extends beyond the physical page edge should be eliminated from the final digital product.
1.2.4 Foldouts, Centerfold Images, or Two-Page Spreads
These types of pages present special challenges for image capture. Foldouts should be captured as a single image file without adjusting the resolution. If a foldout exceeds the maximum dimensions of the available digitization equipment, we ask that the images be captured as multiple scans and stitched together, if possible. If this cannot be done, the foldout should be scanned as-is (folded up) and named according to the file sequence.
The final file should comply with filename, image capture, and file format requirements mentioned later in this document.
Centerfolds (including two-page spreads and uncut plates) should be treated as foldouts for scanning purposes. An extra blank page should be inserted after the centerfold image to maintain the correct recto/verso sequence.
2.0 Technical Considerations
All scanned images MUST meet the following requirements:
- Images MUST have non-zero width and height.
- Images MUST be correctly displayed in natural (pixel) order; the TIFF Orientation header is ignored.
2.1 Image Color
Images may be scanned as bitonal, grayscale, or color, according to the guidelines below.
2.1.1 Bitonal Images
Any page containing text only, or which consists of line art against the background paper color, may be captured as a bitonal image. Use reasonable judgement to determine whether a bitonal image is the most appropriate capture method. For example, if capturing a chart on lined graph paper as a bitonal image creates a lot of background noise, it is acceptable to capture the page as grayscale (or color) instead.
2.1.2 Grayscale Images
Any page containing halftone or continuous tone photographs, variously shaded gray graphs or diagrams, or variously shaded gray lines to distinguish among multiple chart or illustrative elements should be captured as a grayscale image. All grayscale images MUST be captured with 8 bits per sample and one sample per pixel.
Use reasonable judgement to determine whether a grayscale image is the most appropriate capture method; where the choice is unclear, it is acceptable to capture the page as a color image.
2.1.3 Color Images
Any page containing faded handwriting, color photographs, colored bar graphs or diagrams, or colored lines to distinguish among multiple chart or illustrative elements should be captured as a color image.
All color images MUST be captured in a 24-bit, sRGB colorspace.
2.2 Image Resolution
For all images, X resolution and Y resolution (pixels per inch along the X-axis and Y-axis) MUST have identical values; that is, pixels MUST have the same height and width.
2.2.1 Bitonal Images
All bitonal page images MUST have a resolution of 600 pixels per inch (ppi).
This resolution should be an uninterpolated optical resolution wherever possible; if the scanning equipment can only achieve this resolution via interpolation, the resolution MUST be interpolated by the scanning camera and its software as part of the original page scan, not as part of a separate post-processing image adjustment.
It is acceptable to produce initial scans of at least 300 ppi 8-bit grayscale or 24-bit color optically, and then convert to 600 ppi bitonal in post-processing.
2.2.2 Continuous Tone Images
All continuous tone page images MUST have a resolution of at least 300 ppi.
It is acceptable to scan a contone image at a higher resolution (e.g., 600 ppi) and then downsample the final image file to 300 ppi, preferably using bicubic interpolation in post processing. Upsampling of color or grayscale images to 300 ppi using interpolation is not acceptable.
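The resolution rules in this section can be expressed as a small pre-submission check. The following is a minimal sketch, assuming the resolution values and image type have already been read from the file header by some other tool; the function name `check_resolution` and the mode labels are hypothetical and not part of any HathiTrust tooling.

```python
def check_resolution(mode, x_ppi, y_ppi):
    """Return a list of problems found for one page image.

    mode  -- "bitonal" or "contone" (hypothetical labels for this sketch)
    x_ppi -- horizontal resolution in pixels per inch
    y_ppi -- vertical resolution in pixels per inch
    """
    problems = []
    # X and Y resolution MUST be identical (square pixels).
    if x_ppi != y_ppi:
        problems.append("X and Y resolution differ")
    if mode == "bitonal":
        # Bitonal images MUST be exactly 600 ppi.
        if x_ppi != 600 or y_ppi != 600:
            problems.append("bitonal image is not 600 ppi")
    elif mode == "contone":
        # Continuous tone images MUST be at least 300 ppi.
        if min(x_ppi, y_ppi) < 300:
            problems.append("contone image is below 300 ppi")
    else:
        problems.append("unknown mode: " + str(mode))
    return problems
```

A check like this only inspects the declared resolution. It cannot tell whether a contone image was upsampled to reach 300 ppi, so that rule still has to be enforced in the scanning workflow itself.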
2.3 Image Format
Image files MUST be submitted in one of two formats: either TIFF or JPEG2000.
- Bitonal image files MUST be delivered in TIFF format with CCITT Group 4 compression.
- Contone images may be delivered in TIFF format or as JPEG2000.
Images MUST be minimally well-formed against either the TIFF or JPEG2000 standard. Compliance with the appropriate format will be verified at HathiTrust using JHOVE object validation software. Please see the sections below for additional information.
It is acceptable, and in some cases desirable, to mix image formats within a single volume. For example, a volume containing black-and-white text with a few color images may be scanned primarily as bitonal TIFFs with only a few color JPEG2000 files.
2.3.1 TIFF Format Requirements
Bitonal TIFFs MUST have one bit per sample and one sample per pixel.
Contone TIFFs are acceptable for submission, but they will be compressed to JPEG2000 at HathiTrust prior to ingest. Therefore, continuous tone TIFFs MUST be readable by the grk_compress utility from the Grok JPEG2000 toolkit, after being uncompressed with ImageMagick (if necessary) and after ICC profiles are stripped (if present). The following command line statement is included as a best practice for the current (9.7.x) version of grk_compress:
grk_compress -i in_file.tif -o out_file.jp2 -p RLCP -n 5 -SOP -EPH -M 62 -I -q 32
In addition, continuous tone TIFFs MUST be:
- 8-bit (grayscale) or 24-bit (color); that is, they MUST have eight bits per sample and either one (for grayscale) or three (for color) samples per pixel.
- In the sRGB colorspace (if color).
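The bit-depth rules for TIFFs reduce to three supported combinations of bits per sample and samples per pixel. The sketch below illustrates that mapping; it assumes the two values have already been read from the TIFF tags by another tool, and the function name `classify_tiff` is hypothetical.

```python
def classify_tiff(bits_per_sample, samples_per_pixel):
    """Map a TIFF sample layout to the image type described in this document.

    Returns "bitonal", "grayscale", or "color", or None when the layout is
    not one of the three supported combinations.
    """
    supported = {
        (1, 1): "bitonal",    # one bit per sample, one sample per pixel
        (8, 1): "grayscale",  # eight bits per sample, one sample per pixel
        (8, 3): "color",      # eight bits per sample, three samples (24-bit)
    }
    return supported.get((bits_per_sample, samples_per_pixel))
```

For example, a 16-bit grayscale scan (`classify_tiff(16, 1)`) or an RGB image with an extra alpha sample (`classify_tiff(8, 4)`) yields `None`. The sketch does not look at colorspace or compression, which need to be checked separately.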
2.3.2 JPEG2000 Format Requirements
In order to maintain consistency among the JPEG2000 files contributed by our members, HathiTrust has developed the following parameters for all JPEG2000 files, based on the JPEG2000 standard.
The following parameters are required:
- All image files supplied in the JPEG2000 file format MUST comply with Part 1 of the JPEG2000 core coding system as specified in ISO/IEC 15444-1:2000. Images with Part 2 components will not validate.
- Files MUST be supplied with the “.jp2” extension.
- The JPEG2000 file’s height and width MUST be the same as the master image file after transcoding.
- The JPEG2000 file MUST be in the sRGB or grayscale color space.
- The JPEG2000 file MUST use only lossy JPEG2000 compression.
- The JPEG2000 file MUST consist of a single page image. Multi-page JPEG2000 files are not supported.
The following JPEG2000 parameters are preferred:
- The JPEG2000 file should be prepared after any image processing or clean-up (deskewing, despeckling, etc.) of the original source image is performed.
- The JPEG2000 file’s image X origin, image Y origin, tile X origin, and tile Y origin should be 0.
- The JPEG2000 file’s progression order should be RLCP (resolution, layer, component, position) or RLPC (resolution, layer, position, component).
- The JPEG2000 file should have 8 quality layers.
- The JPEG2000 transcoding process should use the 9-7 irreversible filter.
- The JPEG2000 file should have between 5 and 32 decomposition levels, depending on the pixel dimensions of the image. See below for details.
2.3.2.1 Decomposition
When grayscale and color image files are compressed using JPEG2000, the number of resolution decomposition levels should be based on the image maximum dimension. The following chart provides a quick reference for determining the decomposition level for most image files, based on largest dimension (length or width).
Technically, smaller items would have decomposition levels ranging from 1 to 4; however, our hard lower limit requires the decomposition level to be set to 5 for all items with a length or width less than 17 inches/6788 pixels.
Minimum Dimensions (pixels) | Maximum Dimensions (pixels) | Minimum Dimensions (inches) | Maximum Dimensions (inches) | Resolution (ppi) | Decomposition Level |
---|---|---|---|---|---|
400 | 6788 | 1 | 16.97 | 400 | 5 |
6789 | 13579 | 17 | 33.94 | 400 | 6 |
13577 | 26976 | 34 | 67.88 | 400 | 7 |
26977 | 54305 | 67.9 | 135.76 | 400 | 8 |
54306 | ... | 135.77 | ... | 400 | 9 |
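The chart lookup can be sketched as a short function. This is an illustration only: it assumes the pixel bounds in the chart (given for 400 ppi) are applied to the largest pixel dimension of the image, it copies the upper bounds exactly as printed above, and the names `LEVEL_UPPER_BOUNDS` and `decomposition_level` are hypothetical.

```python
# Upper pixel bound for each decomposition level, copied from the chart above.
# The last row of the chart is open-ended, so anything larger gets level 9.
LEVEL_UPPER_BOUNDS = [
    (6788, 5),
    (13579, 6),
    (26976, 7),
    (54305, 8),
]


def decomposition_level(width_px, height_px):
    """Return the decomposition level for an image of the given pixel size."""
    largest = max(width_px, height_px)
    for upper_bound, level in LEVEL_UPPER_BOUNDS:
        if largest <= upper_bound:
            return level
    return 9
```

For example, a 6 x 9 inch page scanned at 400 ppi is 2400 x 3600 pixels, so `decomposition_level(2400, 3600)` returns 5, the hard lower limit that applies to most book pages.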
2.3.2.2 Transcoding
Many scanning hardware and software packages do not support native image capture to the JPEG2000 format. Transcoding an image file to the JPEG2000 file format is considered acceptable and normative. In such cases, the format of the original image file should be a non-distorting format, such as uncompressed TIFF, and not a format that inherently downgrades the quality of the image through lossy compression (e.g., JPEG).
Software packages for the transcoding of image files into the JPEG2000 format vary widely. HathiTrust recommends the open-source Grok JPEG2000 codec for this purpose. To convert a TIFF file to a JPEG2000 file that will meet HathiTrust specifications, use the command:
grk_compress -i in_file.tif -o out_file.jp2 -p RLCP -n 5 -SOP -EPH -M 62 -I -q 32
We can provide support for creating JPEG2000 images with Grok. Please contact support@hathitrust.org for assistance.
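When many TIFFs need to be transcoded, the command above can be assembled programmatically. The following is a minimal sketch that only builds the argument list, using exactly the flags shown in the command above; the function name `grk_compress_command` is hypothetical, and the sketch assumes `grk_compress` is installed and on the PATH when the command is eventually run.

```python
def grk_compress_command(in_file, out_file):
    """Return the argument list for the grk_compress call shown above."""
    return [
        "grk_compress",
        "-i", in_file,
        "-o", out_file,
        "-p", "RLCP",
        "-n", "5",
        "-SOP",
        "-EPH",
        "-M", "62",
        "-I",
        "-q", "32",
    ]
```

The resulting list can be passed to Python's `subprocess.run(command, check=True)`, which avoids shell quoting problems with unusual file paths.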
3.0 File Naming and Directory Structure
Each page image should be given an eight-character filename followed by a three-letter file extension (.tif or .jp2). File naming for each volume begins with 00000001 and increments sequentially for each subsequent image, following the sequence of the original source material.
Each physical volume should be organized into a separate zip archive named for the volume barcode or ARK ID. All page images, from outside front cover to outside back cover, should be stored inside the zip archive.
The following sample shows how files should be named and organized for each volume:
<Barcode> (Folder) Barcode of 1st volume
00000001.tif TIFF page image file; filename is 8 characters long
00000002.tif TIFF page image file; filename is 8 characters long
00000003.tif
00000004.tif
00000005.jp2 JPEG2000 color or grayscale page image; filename is 8 characters long
00000006.tif
00000007.tif
00000008.tif
<ARK ID> (Folder) ARK ID of 2nd volume
00000001.tif TIFF page image file; filename is 8 characters long
00000002.tif TIFF page image file; filename is 8 characters long
00000003.tif
00000004.tif
….
00000032.tif
00000033.tif
00000034.jp2 JPEG2000 color or grayscale page image; filename is 8 characters long
00000035.jp2 JPEG2000 color or grayscale page image; filename is 8 characters long
00000036.tif
00000037.tif
4.0 Quality Control
HathiTrust will perform the following validations prior to ingest. The information in this section is offered to assist contributors in establishing and implementing their own quality control procedures prior to submission.
4.1 Validation of File Structure
HathiTrust will verify that the digital zip archive structure meets the guidelines described in this document. This includes:
- Verifying that the zip archive is named with the proper identifier
- Verifying that no unexpected files are present in the zip archive
- Verifying that no subdirectories are present in the zip archive
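Contributors can run a similar structural check locally before submission. The sketch below is an assumption-laden illustration, not HathiTrust's validator: it treats any archive member that is not an 8-digit `.tif` or `.jp2` page image as unexpected unless it is listed in the hypothetical `allowed_extra` parameter (submission packages may contain additional files described elsewhere), and it omits the archive-name check because that depends on the identifier scheme in use.

```python
import re
import zipfile

# Page image names as described in this document: 8 digits plus .tif or .jp2.
PAGE_NAME = re.compile(r"[0-9]{8}\.(tif|jp2)")


def check_zip_members(zip_source, allowed_extra=()):
    """Return a list of structural problems found in one volume archive.

    zip_source    -- path or binary file-like object for the zip archive
    allowed_extra -- other member names to accept (hypothetical parameter)
    """
    problems = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            # Zip member names always use "/" as the path separator.
            if "/" in name:
                problems.append("subdirectory entry: " + name)
            elif not PAGE_NAME.fullmatch(name) and name not in allowed_extra:
                problems.append("unexpected file: " + name)
    return problems
```

An empty list means that no subdirectories or unexpected files were found in the archive.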
4.2 Validation of File Naming Conventions
HathiTrust will verify that the digital filenames meet the guidelines described in this document. This includes:
- Verifying the number of characters in each filename
- Verifying that all filenames contain numeric characters only
- Verifying that all files in a given zip archive are named in sequential numeric order, with no skipped or missing files
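The three naming checks above can be sketched as follows. This is a minimal illustration for local quality control, assuming the input is the list of page image filenames for a single volume; the function name `check_page_filenames` is hypothetical.

```python
import re


def check_page_filenames(filenames):
    """Return a list of naming problems for one volume's page images."""
    problems = []
    numbers = []
    for name in filenames:
        # Exactly eight ASCII digits, then a .tif or .jp2 extension.
        match = re.fullmatch(r"([0-9]{8})\.(tif|jp2)", name)
        if match is None:
            problems.append("bad filename: " + name)
        else:
            numbers.append(int(match.group(1)))
    # The sequence must run 1, 2, 3, ... with no gaps or duplicates.
    expected = list(range(1, len(numbers) + 1))
    if sorted(numbers) != expected:
        problems.append(
            "sequence is not 00000001..%08d without gaps" % len(numbers)
        )
    return problems
```

Because duplicates break the sequence test, a volume containing both `00000001.tif` and `00000001.jp2` is also reported, which matches the rule of one and only one image per page.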
4.3 Validation of File Format
HathiTrust will verify that the digital image files are compliant with the image format that they purport to be, as indicated by the filename extension (.tif or .jp2).
4.4 Troubleshooting and Remediation
4.4.1 Contributor-side remediation
A number of requirements can be validated and/or corrected on the contributor side, prior to submission to HathiTrust. Refer to the following checklist:
- Is there one and only one image file for each page of each volume?
- Are all image files named with proper 8-digit filenames?
- Do all image files have a proper file extension (either .tif or .jp2)?
- Are all the files for each volume contained within a properly named zip archive?
- Does the sequence of image files for a volume start with 00000001 and increment sequentially for each file thereafter?
- Are all bitonal images in TIFF format?
- Do all images meet the quality and technical parameters outlined in this document?
We recommend the JHOVE object validation environment developed by Harvard University and freely available to the public. HathiTrust currently uses JHOVE 1.16; contributors are strongly discouraged from using version 2.x. Any questions regarding acceptable file validation output from JHOVE should be directed to support@hathitrust.org.
Additionally, the following information may be helpful in identifying and remediating JPEG2000 errors:
- JPEG2000 images that use features from Part 2 can be identified by running JHOVE on them. If the “Brand” value includes “jpx” or “jpf”, the image uses features from Part 2. If the “Brand” value is only “jp2”, then the image uses only features from Part 1.
- JPEG2000 images created with Photoshop will definitely NOT validate (they always use JPEG2000 Part 2 extensions).
- It is possible to create compliant JPEG2000 images with other JPEG2000 codecs besides grok, including the commercial Kakadu codec and open source OpenJPEG and JasPer codecs (or tools that use these software development kits). However, we may not be able to troubleshoot problems with images created with these tools.
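For quick triage before running JHOVE, the brand can also be read directly from the start of the file. The sketch below assumes the standard JP2-family header layout (a 12-byte signature box followed by a File Type box whose first field is the 4-byte brand); the names `JP2_SIGNATURE` and `read_jp2_brand` are hypothetical, and the result is not a substitute for JHOVE validation.

```python
# The fixed 12-byte JPEG2000 signature box that opens every JP2-family file.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")


def read_jp2_brand(data):
    """Return the brand from a JP2-family file header, or None.

    data -- the first 24 or more bytes of the file
    """
    # After the signature box comes the File Type box: a 4-byte length,
    # the box type "ftyp", and then the 4-byte brand.
    if len(data) < 24 or data[:12] != JP2_SIGNATURE or data[16:20] != b"ftyp":
        return None
    return data[20:24].decode("ascii", errors="replace").strip()
```

To use it, open the file in binary mode and pass the first 24 bytes, for example `read_jp2_brand(handle.read(24))`. A result of `jp2` is consistent with a Part 1 file, `jpx` points to Part 2, and `None` means the data does not start like a JP2-family file at all.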
4.4.2 HathiTrust-side remediation
Many images that do not meet the requirements outlined above can be normalized at HathiTrust prior to ingest.
With regard to TIFF files, we can address the following:
- Images with no compression or compression other than Group 4/CCITT
- Bitonal TIFFs with the following errors reported by JHOVE:
  - ‘IFD offset not word-aligned’
  - ‘Value offset not word-aligned’
  - ‘Tag 269 out of sequence’
  - ‘Invalid DateTime separator’
  - ‘Invalid DateTime digit’ (if correctly formatted capture date is provided in meta.yml)
  - ‘Count mismatch for tag 306’
  - ‘PhotometricInterpretation not defined’
We do not support (and cannot address):
- TIFFs with 16 bits per sample
- TIFFs in any colorspace other than grayscale or sRGB (e.g. CMYK, Adobe RGB, etc.)
- TIFFs with alpha channels
With regard to JPEG2000 files, we can normalize the following:
- Images with lossless compression can be re-saved with lossy compression