Quality in HathiTrust
May 13, 2015
by Jeremy York (HathiTrust) and Kat Hagedorn (University of Michigan Library)
As reported in our monthly updates, we receive well over a hundred inquiries every month about quality problems with page images or OCR text of volumes in HathiTrust. That’s the bad news. The good news is that in most of these cases, there is something we can do about it. This blog post is intended to shed some light on our thinking and practices about quality in HathiTrust. We hope it will also encourage you to report any problems you find, so that we have the opportunity to fix them and deliver the highest-quality collections we can for educational and research needs.
HathiTrust and Quality
We go to great lengths to ensure we have the highest possible quality volumes in HathiTrust. Our approach to quality at a broad level is outlined in our commitment to quality. On a day-to-day level, we strive to offer one of the best user support teams around, responding to reported issues and providing updates as we make progress on addressing them. Someone might reasonably wonder, however, why there are quality problems in HathiTrust at all. Shouldn’t libraries, or HathiTrust, have better quality control? Aren’t librarians primarily concerned about information quality?
Of course we are. As libraries – as HathiTrust – we strive for the highest quality of digitization and digital production. This striving can be tempered, however, by very practical concerns about fulfilling the needs of users, and fitting the investment of time and resources in digital conversion to the purpose of providing greater, enduring access to the materials in our collections.
Collaboration at Scale: A Digitization Cornucopia
One factor contributing to the presence of errors in HathiTrust is the sheer scale at which we operate. It would be truly remarkable if there were no mistakes in the more than 13 million digital items contributed by partnering institutions and others to the digital library. Errors at this scale are par for the course. Why is this so?
There are a variety of approaches that libraries take toward digitization. The choice of a particular approach may be influenced by the amount of materials to be digitized, the time and resources available for digitization, and, significantly, the intended purpose of digitization. For instance, it could be important to preserve, as much as possible for users, the artifactual value of the print original – the texture and color of the pages, the wear and tear on the book, etc. On the other hand it may be important, or sufficient for the purpose (or at least, initial purpose), to target the intellectual content in the book for preservation, rather than the intellectual content and artifact together. In the first instance, manual review and manipulation of each digitized image may be needed to achieve the desired end. The second instance, however, may lend itself well to a larger scale digitization project, where a higher production rate could be achieved by focusing (for instance) on capture of the printed text in the book with a high level of accuracy (not necessarily to the exclusion of physical attributes of the book), rather than fuller characteristics of the artifact, which may be more time-consuming.
The variety of approaches libraries have taken to digitization is reflected in the materials available in HathiTrust. Some, such as the collection of Islamic Manuscripts from the University of Michigan or incunabula from the Universidad Complutense de Madrid, are the result of specialized digitization of rare or valuable materials. Others are volumes digitized in large-scale initiatives by Google or the Internet Archive, or smaller scale initiatives carried out by libraries or third party vendors. Still others, such as volumes from Utah State University Press or Knowledge Unlatched, are the original digital files that were used for print or digital publication.
For the digital content we ingest, HathiTrust has established specifications related to image formats, resolution, color space, and other characteristics. Rigorous validation ensures that these specifications are met. The methods of production or processing of digitized items may leave fingerprints of some sort, however. These may be benign, such as the presence of digitization color targets, added coversheets, book cradles, or a characteristic coloration of pages, which do not generally interfere with the display or understanding of the original object and its content. They may also be more serious, including mis-colorations of pages, human fingers in the images, systemic cropping, warping, or bolded or light text—problems that do interfere with legibility or clarity of the image.
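As a rough illustration of what automated validation of this kind involves, here is a minimal sketch in Python using the Pillow imaging library. The accepted formats, resolution threshold, and color modes shown are assumptions chosen for the example, not our actual ingest specifications.

```python
# A minimal sketch of an automated image-specification check.
# The accepted formats, minimum resolution, and color modes are
# illustrative assumptions, not HathiTrust's actual specifications.
from PIL import Image

ACCEPTED_FORMATS = {"TIFF", "JPEG2000"}  # hypothetical accepted formats
ACCEPTED_MODES = {"1", "L", "RGB"}       # bitonal, grayscale, or RGB color
MIN_DPI = 400                            # hypothetical minimum resolution

def validate_page_image(path):
    """Return a list of human-readable problems found in one page image."""
    problems = []
    with Image.open(path) as img:
        if img.format not in ACCEPTED_FORMATS:
            problems.append(f"unexpected format: {img.format}")
        dpi = tuple(float(d) for d in img.info.get("dpi", (0, 0)))
        if min(dpi) < MIN_DPI:
            problems.append(f"resolution below {MIN_DPI} dpi: {dpi}")
        if img.mode not in ACCEPTED_MODES:
            problems.append(f"unexpected color space: {img.mode}")
    return problems
```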
Not surprisingly, materials produced through large-scale digitization are the most likely to have quality problems. In such projects, it is simply not feasible to review the quality of each individual digitized image, so libraries are more likely to rely on sampling pages for quality review, or on automated metrics (Google scanning is a good example), to understand the quality of digitization outputs. Though errors do occur in large-scale digitization, the benefits of being able to search and use millions of volumes from library collections in digital form are so significant, and have had such an impact on the types and scope of research performed using library collections, that large-scale digitization is something libraries worldwide continue to pursue.
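As a simple illustration of the sampling approach, a review workflow might pull a small random subset of pages from each volume for human inspection rather than examining every image. The sketch below assumes an arbitrary 5% sample rate and minimum sample size; it is not a description of any particular library's or vendor's workflow.

```python
# A rough sketch of sampling pages for manual quality review,
# rather than inspecting every image in a large-scale project.
# The 5% rate and minimum of 10 pages are arbitrary assumptions.
import random

def sample_pages_for_review(page_ids, rate=0.05, minimum=10):
    """Select a random subset of page identifiers for human inspection."""
    k = max(minimum, int(len(page_ids) * rate))
    k = min(k, len(page_ids))  # never request more pages than exist
    return sorted(random.sample(page_ids, k))

# For example, a 400-page volume yields 20 pages to spot-check.
pages_to_check = sample_pages_for_review([f"{n:08d}" for n in range(1, 401)])
```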
New Approach, New Responsibility
Some might say that libraries that engage in large-scale digitization are trading quality for quantity, and this would not be completely wrong (or necessarily bad). When faced with a choice of digitizing books at a rate of thousands per year for hundreds of years, as opposed to millions of books in a decade, many libraries chose the latter. Another way to look at it, however, is that libraries are reconceiving the way they offer services to their constituencies and tailoring their strategies and investments to meet user needs in the most appropriate way. In our preservation and access activities in HathiTrust, for example, we strive for both quantity and quality, working to provide materials that are fit for the purposes they will be used for, and matching the investment of resources to the effort required to produce that fit result.
It is part and parcel of this approach (and critical to our mission) that when we receive reports of problems with materials in HathiTrust’s collections, we do our utmost to address them. Our users are in the best position to recognize problems and let us know when the quality of volumes is not sufficient to meet their needs, and we take this very seriously. We recognize that the trustworthiness of our repository lies not only in preserving and providing access to content over the long term, but also in responding to problems in the short term when possible, and in tracking them and developing strategies to address them in the future when not.
Corrections: The Nitty Gritty
As might be expected, corrections are a manual process, one both labor-intensive and time-consuming. Not all problems can be addressed in an ideal time frame. Some problems (see the table below) may be systemic across volumes digitized using a particular approach. Foldout pages that are scanned while folded, obscuring their content, are a good example. Moire patterns in scanned images are another. In the future, large-scale remediation efforts may be the most effective strategy for addressing these problems. In the meantime, we are committed to working with vendors, our partner institutions, and any entities that deposit materials in HathiTrust to prioritize and address quality issues and achieve, over time, a collection that is fit for the purposes our researchers want to make of it.
Here’s what we’ve done so far:
From the time HathiTrust was launched to the present, 6,499 volumes have been reported to have some kind of quality issue. As of May 4, 2015, we have managed to fix the problems in 2,310 of these. Of the 1,141 problem reports on full-view volumes known to come from end users (these reports are prioritized; many others come from staff at partner institutions engaged in copyright review of limited-view materials), we were able to fix 913, a success rate of more than 80%!
The correction methods for the 2,310 volumes break down as follows:
- 468 were corrected through the contributing institution rescanning and replacing individual problem pages with corrected ones;
- 128 were corrected by the contributing institution completely re-scanning and re-depositing the volume;
- in the remaining 1,714 cases, the vendor made corrections and the volumes were re-ingested into the digital library.
Types of Problems and Outlook for Correction
Some of the most common types of problems we receive reports about are listed below. Along with each problem we have given some information about what we are able, or in some cases not able, to do to address it. We hope this will provide something of a guide to how we view and prioritize problems.
| Problem | Description | Outlook |
| --- | --- | --- |
| Warp | Either the words on a page are strangely stretched, or the page itself is. | This error can be introduced during image processing after a digital image has been captured. It can frequently be corrected through re-processing of the image by the vendor. See appendix for example. |
| Skew or crop | Pages are sometimes not aligned correctly to the vertical or horizontal axis of display, or the pages are cropped inappropriately (such as when the inner gutter is cropped and it is not possible to see the first few letters of words on each line). | Similar to warp, this problem can often be fixed through re-processing by the vendor. In extreme cases, the page may need to be re-scanned and inserted. Sometimes cropping occurs because a tight gutter in the original book prevented content from being captured without damaging the book; in these cases as well, re-scanning may be the only option. See appendix for example. |
| Upside-down pages | Self-explanatory. | Upside-down pages are generally fairly easy for vendors to address when we let them know. They re-package and deliver the content and we re-ingest it. |
| OCR text quality | OCR may be unintelligible or absent. | We require deposited volumes to have OCR text when it can reasonably be obtained (we do not require OCR, for example, for hand-written manuscripts). If the OCR quality is poor because of the nature of the material (e.g., tables and charts, small text, or text in columns can be difficult for machines to OCR), the outlook for correction is not good. If there are obvious errors, e.g., a volume is in English but the OCR is Russian, the outlook is good (see the sketch after this table for the kind of automated check that can catch such mismatches). And there are shades in between. In general, vendor OCR capabilities (certainly Google’s) have improved over time, so simple re-processing could go a long way. However, OCR corrections are exclusively machine generated; we do not currently correct OCR manually, nor do we have a process in place for accepting manual corrections to OCR. See appendix for example. |
| Folded foldouts (maps, plates, tables) | Foldouts may be scanned while folded, leaving much or all of the content inaccessible. This is a common practice in large-scale digitization because of the emphasis on high throughput. | In these cases, we ask the contributing institution to rescan the foldout in its unfolded state and insert it into the volume. See more about this process below. See appendix for example. |
| Missing pages | Self-explanatory. | Scanned volumes are sometimes missing pages because the original volume was missing pages, and sometimes because of problems in the way the scanned images were processed and assembled by the vendor. We have had a relatively high rate of success in obtaining missing pages (through vendor or contributing institution correction) when they result from a vendor problem. |
| Moire | Moire is a checkered or lined pattern superimposed on illustrations, introduced at the time of scanning. | Moire can only be fixed by re-scanning the affected pages. Because illustrations with moire are generally legible to some degree, moire is often not considered as high a priority for correction as issues where content is clearly missing, obscured, or illegible. See appendix for example. |
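To illustrate the kind of “obvious error” mentioned in the OCR row above (an English volume with Russian OCR), here is a minimal sketch of an automated mismatch check using only the Python standard library. The 30% Cyrillic threshold is an arbitrary assumption for the example, not part of our actual review process.

```python
# A minimal sketch of flagging an obvious OCR/language mismatch,
# e.g. a volume catalogued as English whose OCR text is mostly Cyrillic.
# The 30% threshold is an arbitrary assumption for illustration.
import unicodedata

def mostly_cyrillic(ocr_text, threshold=0.30):
    """Return True if a large share of the letters in the OCR are Cyrillic."""
    letters = [c for c in ocr_text if c.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for c in letters if "CYRILLIC" in unicodedata.name(c, ""))
    return cyrillic / len(letters) >= threshold

def flag_ocr_mismatch(catalog_language, ocr_text):
    """Flag a volume catalogued as English whose OCR looks Cyrillic."""
    return catalog_language.lower() == "english" and mostly_cyrillic(ocr_text)
```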
The following institutions currently have processes in place to rescan and replace individual pages of digitized books. Keep in mind that these processes are run, and volumes corrected, only as resources allow.
- Cornell University
- Northwestern University
- Penn State
- University of Michigan
As mentioned earlier, your inquiries are the primary way that we identify problems with HathiTrust volumes. To report a quality issue in a volume, use the “Feedback” link at the top of the page where you see the error. This feedback goes to members of the HathiTrust User Support Working Group (view full membership and charge), who will reply promptly and keep you informed of progress. Sometimes fixes take days or weeks. They can also at times take upwards of 6 months because of the time it takes to rescan pages or volumes. Please don’t get discouraged. We may not be able to fix all problems immediately, but we will certainly try (an 80% success rate is pretty good!) or save them for a future time. Help us create a corpus that will meet your needs to the fullest, and the needs of future generations as well.
Appendix: Error Examples
Warp Example
Crop example
Skew Example
OCR Example
Unfolded Foldout Example
Moire Example