OCR Resources and Services¶
Optical Character Recognition (OCR) is the process of converting images of text (e.g., scanned documents, historic print, handwriting samples) into machine-readable formats. At the moment, campus OCR support is primarily self-service.
Some considerations before engaging with OCR projects:
- Has the content been OCR’d already and made available elsewhere? If you are not sure or have not searched, reaching out to librarydataservices@berkeley.edu will connect you with subject librarians who can assist in this process.
- How much content is involved? For projects that involve “hundreds” of documents/materials, software such as ABBYY FineReader is a good initial option to explore. If a project involves “millions” of documents/materials, then Tesseract (or other computationally oriented software) might be the best option to investigate.
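To get a rough sense of how much content a project involves, the files in a folder can be counted by type. The following is a minimal sketch, not a campus-provided tool: the function name `count_documents` and the list of extensions are assumptions chosen for illustration, and should be adjusted to match your own materials.

```python
from collections import Counter
from pathlib import Path

# Assumed set of file types to count; edit this to match your collection.
SCAN_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def count_documents(root):
    """Recursively count files under `root`, grouped by (lowercased) extension.

    Only files whose extension is in SCAN_EXTENSIONS are counted.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        extension = path.suffix.lower()
        if path.is_file() and extension in SCAN_EXTENSIONS:
            counts[extension] += 1
    return counts
```

Calling `count_documents("path/to/scans")` returns a `Counter` mapping each extension to its file count, and `sum(counts.values())` gives the total. Note that a single PDF may contain many pages, so a file count is only a lower bound on the number of pages to process.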
Common solutions used on campus¶
ABBYY FineReader¶
ABBYY FineReader is a PDF software program that is often recommended for its accuracy, support for multiple languages, and ability to handle technical or complicated formats.
A campus license is not available. A 7-day trial is available, and it is recommended for a test run to decide whether an ABBYY license is worth the investment. In the past, this software was available on computers in the D-Lab and through Research IT’s virtual machine service; unfortunately, neither service is offered any longer.
Tesseract¶
Tesseract is open-source software that runs from a command-line interface and can process large-scale OCR projects. More information about the Tesseract project can be found here: https://github.com/tesseract-ocr/tesseract
In the past, Tesseract has been used on Savio (the campus high-performance computing cluster); documentation on how to do so via a Singularity container is here: https://github.com/ucb-rit/singularity-tesseract. If you would like to explore the use of Tesseract on Savio, follow these instructions to gain access to the cluster.
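Because Tesseract is driven from the command line, batch jobs are usually scripted. The following is a minimal sketch of one way to do that, assuming Tesseract is already installed locally and available on your `PATH`; it does not cover the Savio or Singularity setup. The function names `build_command` and `ocr_directory`, the `*.png` pattern, and the `eng` language default are illustrative assumptions. The only Tesseract behavior relied on is its basic invocation form, `tesseract IMAGE OUTPUTBASE -l LANG`, which writes the recognized text to `OUTPUTBASE.txt`.

```python
import subprocess
from pathlib import Path


def build_command(image_path, output_dir, lang="eng"):
    """Build the argument list for a single Tesseract run.

    Tesseract appends ".txt" to the output base itself, so the base is
    the image's file name without its extension.
    """
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(input_dir, output_dir, pattern="*.png", lang="eng"):
    """Run Tesseract on each matching image and return the images that failed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for image_path in sorted(Path(input_dir).glob(pattern)):
        result = subprocess.run(
            build_command(image_path, output_dir, lang),
            capture_output=True,
            text=True,
        )
        # A non-zero exit code means Tesseract reported an error for this image.
        if result.returncode != 0:
            failed.append(image_path)
    return failed
```

For example, `ocr_directory("scans", "ocr_text")` would process every PNG in a `scans` folder and write one text file per image into `ocr_text`. Collecting failures rather than stopping at the first error lets a long batch finish, so the problem images can be reviewed afterwards.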
Scanning¶
If you have physical materials that need to be scanned before OCR can be performed, scanners are available in the Library.
Related Resources¶
- SimpleOCR (freeware) https://www.simpleocr.com/
- Doxie.ai (paid) http://doxie.ai/
- Resources and support for text mining & computational text analysis