OCR Resources and Services

Optical Character Recognition (OCR) is the process of converting images of text (e.g., scanned documents, historic print, handwriting samples) into machine-readable formats. At the moment, campus OCR support is primarily self-service.

Some considerations before engaging with OCR projects:

  • Has the content already been OCR’d and made available elsewhere? If you are not sure or have not yet searched, email librarydataservices@berkeley.edu to be connected with subject librarians who can assist with this process.
  • How much content is involved? For projects that involve “hundreds” of documents or materials, software such as ABBYY FineReader is a good first option to explore. For projects that involve “millions” of documents or materials, Tesseract (or other computationally oriented software) may be the better option to investigate.

Common solutions used on campus

ABBYY FineReader

A frequently recommended PDF software program, valued for its accuracy, support for multiple languages, and ability to handle technical or complicated formats.

A campus license is not available. A 7-day trial is available, and it is recommended for a test run to decide whether an ABBYY license is worth the investment. In the past, this software was available on computers in the D-Lab and through Research IT’s virtual machine service; unfortunately, neither service is still offered.

Tesseract

Open-source software that runs through a command-line interface and can process large-scale OCR projects. More information about the Tesseract project can be found here: https://github.com/tesseract-ocr/tesseract

In the past, Tesseract has been used on Savio (the campus high-performance computing cluster), and documentation on how to do so (via a Singularity container) is here: https://github.com/ucb-rit/singularity-tesseract. If you would like to explore the use of Tesseract on Savio, follow these instructions to gain access to the cluster.
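
To give a sense of what a command-line Tesseract workflow can look like, here is a minimal Python sketch that runs Tesseract over a folder of page images. It assumes Tesseract is already installed and on your PATH, and that an English language model is available. The function names, the list of image extensions, and the folder layout are illustrative assumptions, not part of Tesseract or of any campus service. This is a local sketch only and does not cover the Savio/Singularity setup.

```python
import shutil
import subprocess
from pathlib import Path

# Image types to process (an assumption; adjust for your own scans).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_tesseract_command(image_path, output_dir, lang="eng"):
    """Return the argument list for one Tesseract run.

    The Tesseract CLI is invoked as: tesseract <image> <output base> -l <language>
    and writes the recognized text to <output base>.txt.
    """
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(input_dir, output_dir, lang="eng"):
    """Run Tesseract on each image in input_dir and return the text files written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(Path(input_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not a recognized image type
        command = build_tesseract_command(image_path, output_dir, lang)
        subprocess.run(command, check=True)  # raises if Tesseract reports an error
        written.append(output_dir / f"{image_path.stem}.txt")
    return written
```

With a hypothetical folder named scans, calling ocr_directory("scans", "ocr_output") would write one .txt file per image into ocr_output. Processing images one at a time keeps the sketch simple; for collections in the "millions" range, the same per-image command would typically be spread across many workers or cluster jobs.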

Scanning

If you have physical materials that need to be scanned before you can perform OCR on them, scanners are available in the Library.