Digitization and Document Searches

It has become apparent that the way a document is digitized has a significant impact on how historians use it. Digitization is far more than simply scanning the source and placing it online for all to read. A librarian, archivist, digital historian, or OCR program has often placed another layer of text behind the visual image to make it searchable. As our readings have made abundantly clear, this shapes how we find information, especially in large digitized archival collections or early modern books. Somebody either supplied indexed keywords for the text, re-keyed it, or ran an OCR program to represent it in a machine-readable format. All of these methods have error rates, especially OCR. A good analogy for this transformation from analog text into digitally searchable text is the process of translation. When a Latin or French document is translated into English, the translator plays a pivotal role in how the text is read and understood. The difference between an author's use of “langue” and “langage” (both French words for language) can be lost when the translator renders both as “language” in English. In such a case the French reader and the English reader understand the text differently. Similar issues come up in the digitization process, except that what gets distorted is not our understanding but our findings. When digitized manuscripts are indexed, what one finds through a search is subject to the judgment of the indexer. The same goes for OCRed documents. The error rate may render certain searchable text unfindable, which is particularly apparent with sixteenth-century typesetting, spelling, and character usage. This may lead a historian not to use a particular source, or to use it only sparingly.
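
To make the point about unfindable text concrete, here is a minimal sketch in Python. It assumes a hypothetical OCR output in which the early modern long s has been misread as an f ("bleffed" for "blessed"); the sample pages, the function names, and the 0.7 similarity threshold are all my own illustrative choices, not taken from any real archive or search tool. An exact keyword search misses the misread page, while a looser, similarity-based search finds it.

```python
from difflib import SequenceMatcher


def exact_hits(keyword, pages):
    """Return indices of pages that contain the keyword as an exact substring."""
    return [i for i, text in enumerate(pages) if keyword.lower() in text.lower()]


def fuzzy_hits(keyword, pages, threshold=0.7):
    """Return indices of pages where at least one word resembles the keyword.

    The threshold is an assumed value chosen for this example only.
    """
    hits = []
    for i, text in enumerate(pages):
        for word in text.lower().split():
            word = word.strip(".,;:!?")
            if SequenceMatcher(None, keyword.lower(), word).ratio() >= threshold:
                hits.append(i)
                break  # one matching word is enough to count the page
    return hits


# Hypothetical transcriptions of three pages.
pages = [
    "The blessed congregation met",  # clean transcription
    "The bleffed congregation met",  # long s misread as f by the OCR
    "The harvest was poor",          # unrelated page
]

print(exact_hits("blessed", pages))  # only the clean page is found
print(fuzzy_hits("blessed", pages))  # the misread page is found as well
```

The looser search is not a cure: lowering the threshold far enough to catch every misreading would also pull in unrelated words, so the historian trades one kind of error for another.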

Nevertheless, although this has a discernible effect on how we find the sources we use, it does not necessarily cause more problems than we previously faced as historians, just different ones. Previously, the sources we found and used were shaped by the research of the historians who preceded us (and their sometimes less than judicious note taking and footnoting), by the knowledge archivists had of their holdings, and sometimes by simple happenstance. Now we have those biases as well as the errors introduced by OCR programs and the choice of which keywords to index. These issues should be understood. But should our methods of finding sources really be included in our writing alongside our methods of analyzing them? Had keyword searches not been available, would a document that went unused, or was used sparingly, have been used at all? I think the difference lies in how one is using the keyword search; in other words, in how important OCR error rates are to one’s analysis. If one simply uses keywords as a means of finding documents to read, such error rates matter much less, and so do the methods of finding sources. But if keyword searches determine which parts of a source to read, then the search function is less a method of finding sources and becomes in many ways a method of analyzing them. In such cases, OCR errors affect how a historian understands a source and play a much more important role in one’s research, thus meriting a place in one’s writing.
