Digitization and OCR

The OCR feature in Google Drive provides a very instructive view of the limitations one may face when searching digitized documents. I converted document P715003 to text, and the results were less than desirable. They improved as I cropped out the borders of the paper to leave only the text, but this only went so far. The conversion did not always preserve the line breaks or word breaks in the document, and various markings made by hand found their way into the text as random letters. At one point, “that you talked with in connection” was converted to “that you ta].de 11th in mmnmtiozfs1.” One of the major problems is that the document is slightly slanted. I do not know how to manipulate the document to correct for this, but one may hypothesize that a straightened document would render superior results. That being said, I cannot find any distinct markings in that passage of the document that would account for the garbled conversion quoted above.
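
For what it is worth, a slant like this is commonly estimated with a "projection profile": try a range of small angles and keep the one that packs the dark pixels into the fewest, fullest rows. The sketch below is only an illustration in Python on a synthetic page, not something run on P715003. The function names, the 5-degree search range, and the half-degree step are all my own assumptions, and it approximates rotation with a simple shear.

```python
import math

def estimate_skew_degrees(dark_pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle (in degrees) that best aligns dark pixels into rows.

    dark_pixels is a list of (x, y) coordinates of dark pixels.
    Assumption: small slants only, so a shear stands in for a true rotation.
    """
    steps = int(round(max_angle / step))
    best_angle, best_score = 0.0, -1
    for i in range(-steps, steps + 1):
        angle = i * step
        slope = math.tan(math.radians(angle))
        rows = {}
        for x, y in dark_pixels:
            # Undo the candidate slant, then count how many pixels land in each row.
            row = round(y - x * slope)
            rows[row] = rows.get(row, 0) + 1
        # Sharp, full rows give a large sum of squares; smeared rows give a small one.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def synthetic_page(angle_degrees, lines=5, width=200, spacing=20):
    """Hypothetical test data: thin 'text lines' drawn at a known slant."""
    slant = math.tan(math.radians(angle_degrees))
    return [(x, round(line * spacing + x * slant))
            for line in range(lines) for x in range(width)]

if __name__ == "__main__":
    print(estimate_skew_degrees(synthetic_page(3.0)))
```

With a real scan one would first have to extract the dark pixels from the image and then rotate the image by the estimated angle before running OCR again; whether that would actually rescue the passage quoted above remains a hypothesis.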

The Chronicling America project at the Library of Congress also demonstrates some of the limitations of OCR output that has not been cleaned up by someone. I looked at the Anderson Daily Intelligence, 9 Sept. 1914; Omaha Daily Bee, 9 Sept. 1914; and The Spanish America, 9 Sept. 1914. Many of the problems with the OCR on this project stem from the condition of the paper. The OCR has problems with text that is faded or smudged on particular lines. In such cases one sometimes finds odd characters, a dollar sign for instance. It also has problems with hyphenation. For example, “Von Hinden-burg’s” is converted to “Von Hinden [next line] ‘ v hug’s” in the Omaha Daily Bee (“German Center Gaining Slowly against Allies,” pg. 1). This is especially problematic for someone searching for stories that reference von Hindenburg, since an exact-match search would not return this story. It is possible that a fuzzy search may still find the story if it takes into account the problems OCR has with hyphenation.
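
To make the idea of a fuzzy search concrete, here is a minimal sketch in Python using the standard library's difflib. It is not how Chronicling America searches its pages. The function names and the 0.75 threshold are my own assumptions, and the sample string is the Omaha Daily Bee conversion quoted above with "[next line]" written as a line break. The sketch throws away everything but the letters, so that hyphens and line breaks no longer split a word, and then looks for the stretch of text most similar to the search term.

```python
import re
from difflib import SequenceMatcher

def letters_only(text):
    """Lowercase the text and keep only a-z, dropping spaces, hyphens, and line breaks.

    Assumption: plain ASCII is enough here; accented French letters would need more care.
    """
    return re.sub(r"[^a-z]", "", text.lower())

def best_fuzzy_score(ocr_text, query):
    """Return the highest similarity (0.0 to 1.0) between the query and any
    query-length stretch of the letters-only OCR text."""
    stream = letters_only(ocr_text)
    target = letters_only(query)
    if not target:
        return 0.0
    if len(stream) <= len(target):
        return SequenceMatcher(None, stream, target).ratio()
    best = 0.0
    for start in range(len(stream) - len(target) + 1):
        window = stream[start:start + len(target)]
        best = max(best, SequenceMatcher(None, window, target).ratio())
    return best

def fuzzy_contains(ocr_text, query, threshold=0.75):
    """True if some stretch of the OCR text is at least `threshold` similar to the query.

    The threshold is a hypothetical starting point, not a tested recommendation.
    """
    return best_fuzzy_score(ocr_text, query) >= threshold

if __name__ == "__main__":
    ocr_line = "Von Hinden\n' v hug's"
    print("hindenburg" in ocr_line.lower())        # the exact search
    print(fuzzy_contains(ocr_line, "Hindenburg"))  # the fuzzy search
```

The trade-off is that ignoring spaces lets a match straddle two unrelated words, and a looser threshold returns more false hits, so a search like this would still need a human eye on the results.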

My own research poses more problems, most likely because 16th- and 17th-century French poses extra complications for OCR programs. The word most likely to direct one to documents for my research, “Turc” (and all its variant spellings used at the time), tends to be converted accurately. Searching for documents that reference particular Ottoman sultans is more complicated because of the variant spellings of their names, but this tends not to be too problematic once one finds the particular spelling in use. Looking particularly at L’Inventaire Generale de l’Histoire des Turcs by Michel Baudier, one finds that the accuracy is problematic in many areas. In the searchable words that I would use, there are fewer mistakes, unless “Constantinople” ends up hyphenated or a smudge falls on “Soliman,” which does not happen too often. In sum, to find documents that reference a given topic, one must be aware of the limitations of OCR and account for them in one’s searches.

