Today I was hacking a bit on OCRFeeder.
Ever since I released it, I have wanted to give better support for Tesseract, since it’s one of the best open source OCR engines out there.
The problem was that the image clips, which I simply converted to TIF, wouldn’t work when I tried to read them with Tesseract.
After a bit of investigation I found that I needed to first convert my Image object to grayscale before saving it to TIF… After doing this, Tesseract worked like a charm for me!
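As a rough illustration of that fix, here is a minimal sketch in modern Python. It assumes the Image object behaves like a Pillow (PIL) image, with `convert()` and `save()` methods; the helper name `save_grayscale_tiff` is made up for this example and is not OCRFeeder's actual code.

```python
def save_grayscale_tiff(image, path):
    """Convert a Pillow-style image to 8-bit grayscale and save it as TIFF.

    `image` is assumed to behave like PIL.Image.Image: convert(mode)
    returns a new image and save(path, format=...) writes it to disk.
    """
    # "L" is Pillow's mode name for 8-bit grayscale.
    gray = image.convert("L")
    gray.save(path, format="TIFF")
    return gray
```

With Pillow installed, a call could look like `save_grayscale_tiff(Image.open("clip.png"), "clip.tif")`, where both file names are placeholders.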
I also found a few bugs in the OCR engines’ management dialog and fixed them, so you don’t have to kill yourself editing the engines’ XML files.
Note that unlike engines such as GOCR or OCRAD, Tesseract won’t print its results to stdout. Instead, it receives a file name as an argument and produces a .txt file with the results.
This has been supported in OCRFeeder since its first release, so you only have to use the $FILE special keyword when configuring the Tesseract engine and print the contents of that file to stdout using, for example, “cat”.
Here’s a screenshot of a correct Tesseract configuration:
And the produced XML file:
<?xml version="1.0" encoding="utf-8"?>
<arguments>$IMAGE $FILE; cat $FILE.txt</arguments>
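To make the role of the two keywords concrete, here is a small sketch of how such an arguments string could be expanded into a shell command. It is only an illustration using Python's `string.Template`; the function name `expand_engine_arguments` and the paths are assumptions, and OCRFeeder's own substitution code may work differently.

```python
from string import Template


def expand_engine_arguments(arguments, image_path, output_base):
    """Fill the $IMAGE and $FILE keywords of an engine's arguments string."""
    # string.Template ends a placeholder name at the first character that
    # cannot be part of an identifier, so "$FILE.txt" becomes "<base>.txt".
    return Template(arguments).substitute(IMAGE=image_path, FILE=output_base)


# Hypothetical paths, chosen only for the example.
command = "tesseract " + expand_engine_arguments(
    "$IMAGE $FILE; cat $FILE.txt", "/tmp/clip.tif", "/tmp/result"
)
print(command)  # tesseract /tmp/clip.tif /tmp/result; cat /tmp/result.txt
```

Tesseract appends “.txt” to the output name it is given, which is why the “cat” part reads $FILE.txt rather than $FILE.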
I’ll soon post a list of ToDos here that I want to get done. So far, the “better Tesseract support” and the “fit image size” ones are already implemented.
15 thoughts on “OCRFeeder hacking, caring about Tesseract”
With OCRfeeder you’ve done a great job!
Unfortunately I’ve not been able to add Tesseract.
First I tried it within OCRfeeder with the settings given above, but then OCRfeeder told me “Error setting the new engine, please check your settings.”, although tesseract is in /usr/local/bin/tesseract as well as in /usr/bin/tesseract.
So I tried it with /usr/bin/tesseract, with no success, and afterwards by putting a “correct” XML file into the engines folder of OCRfeeder, which doesn’t get recognized by OCRfeeder.
Elsewhere I found a blog entry in German where somebody posted that he wrote a working ocube script, which would make OCRfeeder work together with tesseract, but unfortunately he didn’t post the script itself.
Here you can find the blog and a Google-translation of it:
Is there something else I could do to get OCRfeeder working with tesseract?
P.S.: It would be nice to have OCRfeeder in different languages. If you need help with a German translation, tell me!
The direct link to the single post is:
Thank you for checking out OCRFeeder.
About tesseract, are you sure you’re filling all the necessary entries?
Please check out this post where I talk about Tesseract:
Also, please try the GIT version of OCRFeeder which is more recent. I hope I have time to do a tarball release soon.
A German version would be nice, please look at the .po files in the locale directory if you want to translate them.
Let me know if you finally get it to use Tesseract.
I just checked out the newest version of OCRfeeder from the git repo, made an engine config file for tesseract as you mentioned in the above post, and it works like a charm.
I just had to move the icons folder 😉 …but tesseract as an engine works fine.
Good work! Thx
Thanks for your fast answer!
I’ll try to get tesseract working tomorrow.
Today I translated the po-file from the new GIT version to German. And here it is:
How do I tell OCRfeeder to use it?
And how do I get the new GIT version working?
This evening I installed ocrfeeder via GIT, which created a “ocrfeeder” directory in my home folder, where I already placed the “ocrfeeder-0.1-beta” folder before.
Although the contents of both folders are, of course, similar, the GIT version can’t be executed, while the old beta version can be started normally.
Apparently the new version tries to open a missing icon, as you can see here:
Traceback (most recent call last):
  File "./ocrfeeder", line 26, in <module>
    studio = Studio()
  File "/home/rennie/ocrfeeder/studio/studioBuilder.py", line 59, in __init__
    self.main_window = widgetPresenter.MainWindow()
  File "/home/rennie/ocrfeeder/studio/widgetPresenter.py", line 100, in __init__
glib.GError: Datei »/usr/share/ocrfeeder/icons/window_icon.png« konnte nicht geöffnet werden: No such file or directory
Thank you for your comment.
Why did you have to move the icons folder? It should work after you install it with:
$ sudo python setup.py install
Give it a try.
I also received a German translation from Kai. I’ll take a look at it and try to integrate it later today.
About getting the GIT version working, try to install it using the Python standard way:
$ sudo python setup.py install
Then you should be able to run ocrfeeder from the command line (I’ll also include the .desktop file so we have a nice menu soon).
Thank you for your work. I appreciate it.
“$ sudo python setup.py install” worked after the installation of Python Setup-Tools, which can be found here:
[You’ll know that, but others maybe not ;)]
Now OCRfeeder 0.2 can be started normally, too, and it is possible to add tesseract within OCRfeeder, but when I try to use tesseract “nothing” happens.
Or let’s say: OCRfeeder doesn’t tell me what’s going on.
…but the terminal does:
rennie@Riese:~$ cd /home/rennie/ocrfeeder/
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
cat: /tmp/tmpXcnlq7.txt: No such file or directory
Do you know what this could mean?
For your information:
I’m working with:
And now for something completely different:
Would it be possible to try OCRfeeder with cuneiform 0.8?
Thanks for adding this comment Renard.
It seems the problem you have is with Tesseract (the cat problem just means the file wasn’t created and so, cat can’t read it).
Is it the same image you were trying before?
I’ll try to update the README file and the instructions for using setup.py.
My fault! I didn’t know that tesseract requires separate language files and /usr/local/share/tessdata/ contains only place holders for these files.
But the positive side effect of this is that the line “Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset” showed me that tesseract obviously uses the English language files by default when it is combined with OCRfeeder.
And this may reduce the quality of the OCR result, since my files are mainly in German, right?
Is it possible to tell OCRfeeder/tesseract that I’d like to recognize a German text?
If I used tesseract directly, it would be possible/necessary.
I’ve done a quick search and found this page:
It says you can choose the language to use with Tesseract by doing the following:
tesseract inputimage outputbase -l langcode
So, in OCRFeeder’s terms, you must configure the Engine’s Arguments in the OCR Engines manager dialog and set this:
$IMAGE $FILE -l deu; cat $FILE.txt
Let me know if it works for you.
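If you switch languages often, the arguments string can also be generated instead of typed by hand. The helper below is a hypothetical sketch, not part of OCRFeeder; it only builds the string to paste into the Engine’s Arguments field and assumes the matching language files are installed for Tesseract.

```python
def tesseract_arguments(lang=None):
    """Build an OCRFeeder-style arguments string for Tesseract.

    `lang` is a Tesseract language code such as "deu"; None leaves the
    engine's default language untouched.
    """
    parts = ["$IMAGE", "$FILE"]
    if lang:
        # -l selects the language, as in: tesseract inputimage outputbase -l langcode
        parts += ["-l", lang]
    # Tesseract writes to <outputbase>.txt, so print that file to stdout.
    return " ".join(parts) + "; cat $FILE.txt"


print(tesseract_arguments("deu"))  # $IMAGE $FILE -l deu; cat $FILE.txt
```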
And there is a big difference between the results!
But as you can see, there is another problem now: the German characters ä, ü, ö, ß are not displayed correctly.
If I use tesseract directly with a TIF version of the image instead, it works: tesseract creates a file with äs, ös and üs.
Gute Nacht Renard,
The problem with displaying the German characters seems to be an encoding issue.
I’ll need some time to take a look at it. Maybe next week.
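I have not confirmed the cause yet, but one common way to get this symptom is reading UTF-8 bytes with the wrong codec. The snippet below is only a sketch of that assumption in modern Python, not a diagnosis of OCRFeeder’s code.

```python
text = "ä ö ü ß"

# Bytes as a program writing UTF-8 would produce them.
raw = text.encode("utf-8")

# Decoding with the wrong codec turns every two-byte character into
# two unrelated characters (for example, "ä" shows up as "Ã¤").
garbled = raw.decode("latin-1")

# Decoding with the codec that was used for writing restores the text.
correct = raw.decode("utf-8")
```

If this is what happens, decoding the engine’s output explicitly as UTF-8 before displaying it should bring the umlauts back.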
I am now about to apply the German translation 🙂
If you’d like to have an Italian translation, too, let me know. I lived in Italy for several years and it would be relatively easy for me to translate OCRfeeder to Italian.
About the problem you have with Tesseract and OCRFeeder: in my case, I installed “leptonica-progs”
“sudo apt-get install leptonica-progs”
and Tesseract now works.