Joaquim Rocha

OCRFeeder hacking, caring about Tesseract

Authors
  • Joaquim Rocha, Principal Software Engineering Manager at Microsoft

Today I was hacking a bit on OCRFeeder. Ever since I released it, I have wanted to improve its Tesseract support, since Tesseract is one of the best open source OCR engines out there. The problem was that the image clips, which I simply converted to TIF, wouldn’t work when Tesseract tried to read them. After a bit of investigation I found that I needed to convert my Image object to grayscale before saving it to TIF. After doing this, Tesseract worked like a charm for me!
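For illustration, the grayscale step could look roughly like the sketch below. It assumes the Image object comes from Pillow (PIL); the helper name `save_clip_for_tesseract` and the blank stand-in clip are made up for this example and are not OCRFeeder’s actual code.

```python
import os
import tempfile

from PIL import Image  # Pillow, assumed to be installed


def save_clip_for_tesseract(clip, path):
    """Convert an image clip to 8-bit grayscale and save it as TIFF.

    Hypothetical helper for illustration; it is not taken from OCRFeeder.
    """
    grayscale = clip.convert("L")  # "L" is Pillow's 8-bit grayscale mode
    grayscale.save(path, format="TIFF")
    return path


# Stand-in for a clip cut from a scanned page (a blank white RGB image).
clip = Image.new("RGB", (120, 40), color=(255, 255, 255))
tif_path = os.path.join(tempfile.mkdtemp(), "clip.tif")
save_clip_for_tesseract(clip, tif_path)

# Reopen the file to check which mode and format were written.
with Image.open(tif_path) as saved:
    print(saved.mode, saved.format)
```

The only change from a plain “save as TIF” is the `convert("L")` call before `save`.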

I also found a few bugs in the OCR engines’ management dialog and fixed them, so you don’t have to kill yourself editing the engines’ XML files by hand.

Note that unlike engines such as GOCR or OCRAD, Tesseract won’t print its results to stdout. Instead, it receives a file name as an argument and produces a .txt file with the results. This has been supported in OCRFeeder since its first release, so you only have to use the $FILE special keyword when configuring the Tesseract engine and print the contents of that file to stdout using, for example, “cat”. Here’s a screenshot of a correct Tesseract configuration:

[Screenshot: Tesseract engine configuration on OCRFeeder]

And the produced XML file:

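<engine>
    <name>Tesseract</name>
    <image_format>TIF</image_format>
    <engine_path>/usr/local/bin/tesseract</engine_path>
    <arguments>$IMAGE $FILE; cat $FILE.txt</arguments>
</engine>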
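To make the $IMAGE and $FILE keywords more concrete, here is a minimal sketch of how such an arguments string could be expanded into a shell command. It is not OCRFeeder’s actual implementation: the helper names `expand_engine_command` and `run_engine`, and the example paths under /tmp, are assumptions made for illustration.

```python
import shlex
import subprocess
from string import Template


def expand_engine_command(engine_path, arguments, image_path, output_base):
    """Fill in the $IMAGE and $FILE keywords and return a shell command line.

    Hypothetical helper for illustration; it is not taken from OCRFeeder.
    """
    filled = Template(arguments).substitute(
        IMAGE=shlex.quote(image_path),  # the image clip handed to the engine
        FILE=shlex.quote(output_base),  # base name; Tesseract appends ".txt"
    )
    return f"{shlex.quote(engine_path)} {filled}"


def run_engine(command):
    """Run the expanded command in a shell and return whatever it printed."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout


command = expand_engine_command(
    "/usr/local/bin/tesseract",
    "$IMAGE $FILE; cat $FILE.txt",
    "/tmp/clip.tif",
    "/tmp/clip_result",
)
print(command)
# → /usr/local/bin/tesseract /tmp/clip.tif /tmp/clip_result; cat /tmp/clip_result.txt
```

Calling `run_engine(command)` on a machine with Tesseract installed at that path would return the recognized text, because the trailing “cat” echoes the generated .txt file to stdout.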

I’ll soon post a list of to-dos that I want to get done. So far, the “better Tesseract support” and “fit image size” items are already implemented.

Enjoy!