OCRFeeder 0.7.7 released

After more than 4 months, I am finally releasing OCRFeeder’s new version (its last release was in August, just before the DesktopSummit).
The reason for the delay, apart from some vacation in Berlin and Portugal and being busy at Igalia, was that this release brings deep internal changes.

The big issue

The problem with developing such an application from scratch in just a few months, while also worrying about writing a thesis, is that you don’t care much about design and performance. So from 2008 until now, OCRFeeder has suffered from a big memory consumption problem: it would create a reviewer (this is what I call the place where you do stuff on the images) per loaded image and keep all of them in memory, so depending on the number of images and their size, it would eventually crash.
I assumed that nobody had complained about this for so long because people used the application in simpler ways and not for full books. Now it seems that some institutions are interested in OCRFeeder, and there have been complaints and bugs filed (gb#637599 and db#646605).

This was fixed by keeping at most 5 reviewer instances. When a new image is selected, the oldest reviewer is dropped and the new one is added to the cache. Selecting a new image gets a bit slower, but the trade-off is worth it IMHO. In future changes I’ll probably make the number of reviewers configurable in some way.
Each of the content areas now also shares an editor instance instead of each one having a dedicated one.
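
To illustrate the idea, here is a minimal sketch of such a bounded cache. It is not OCRFeeder’s actual code: the ReviewerCache class, the create_reviewer callback and the use of image paths as keys are all hypothetical names chosen for the example. It also assumes that “oldest” means the reviewer that was selected least recently, which may differ from what the application really does.

```python
from collections import OrderedDict


class ReviewerCache:
    """Hypothetical bounded cache: keeps at most max_size reviewers alive."""

    def __init__(self, create_reviewer, max_size=5):
        # create_reviewer is an assumed callback that builds a reviewer
        # for a given image path; it is not part of OCRFeeder's API.
        self._create_reviewer = create_reviewer
        self._max_size = max_size
        self._reviewers = OrderedDict()

    def get(self, image_path):
        """Return the reviewer for image_path, creating it if needed."""
        if image_path in self._reviewers:
            # Already cached: mark it as the most recently selected one.
            self._reviewers.move_to_end(image_path)
            return self._reviewers[image_path]
        # Not cached: this is the slower path mentioned above.
        reviewer = self._create_reviewer(image_path)
        self._reviewers[image_path] = reviewer
        if len(self._reviewers) > self._max_size:
            # Drop the oldest entry so memory use stays bounded.
            self._reviewers.popitem(last=False)
        return reviewer

    def __contains__(self, image_path):
        return image_path in self._reviewers

    def __len__(self):
        return len(self._reviewers)
```

With a cache like this, memory use depends on max_size rather than on the number of loaded images, at the cost of rebuilding a reviewer whenever an image that was dropped is selected again.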

I was able to load more than 500 images of ~4.5 MB each and it was still usable, so hopefully this will improve the experience for users who had these problems.

Other changes

Another change is that OCRFeeder now stores all its temporary files in a dedicated folder under the system’s temporary folder (usually /tmp). Deleting this folder when the application quits guarantees that no temporary files are left behind (as sometimes happened). Related to these changes, I’ve also decided to remove the option of choosing the temporary folder. Supposedly Python already knows what the system’s temporary folder is, and having such an option would make it look like Windows software from 1998.
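
As a rough sketch of this approach (again, not the actual OCRFeeder code), Python’s standard tempfile and atexit modules are enough to create a dedicated folder and remove it on exit. The function name create_session_temp_dir and the "ocrfeeder_" prefix are assumptions made for the example.

```python
import atexit
import shutil
import tempfile


def create_session_temp_dir(prefix="ocrfeeder_"):
    """Create a dedicated folder under the system's temporary folder.

    The folder and everything inside it is removed when the
    interpreter exits normally.
    """
    # mkdtemp picks the system temporary folder (usually /tmp) itself,
    # so no user-facing setting is needed.
    temp_dir = tempfile.mkdtemp(prefix=prefix)
    # ignore_errors avoids a traceback at exit if the folder is already gone.
    atexit.register(shutil.rmtree, temp_dir, ignore_errors=True)
    return temp_dir
```

Every temporary file the application needs would then be created inside the returned folder, for example with tempfile.mkstemp(dir=temp_dir). Note that atexit handlers do not run if the process is killed by a signal, so a cleanup like this covers normal quits only.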

As usual, some code cleaning and bug fixing was done and I would like to thank the awesome GNOME i18n team and everyone who sent their contributions.
Thanks to my friend Berto you can also expect an OCRFeeder Debian package on a repository next to you soon.

For a more detailed list of changes, check out the NEWS file.

Source Tarball
Git
Bugzilla

14 thoughts on “OCRFeeder 0.7.7 released”

  1. Nice, I was one of the bunch of people waiting for this bug fix, thank you mr Joaquim!.

    Like always, I’ll report any new bug in your wonderful free software ;).

    P.S.: users can combine Xsane, scantailor and OCRFeeder to make digital versions of their favorite books :).

  2. In “acerca de” (About) it is still showing the old version number xD, or was it my bad because I modified an AUR PKGBUILD file? Anyway, check it :p

  3. Pablo,
    I’m building RPMs for myself. I can provide you a SPEC file (incl. some cheats for Fedora 11). Perhaps Joachim should package one. 🙂 I’m too lazy to become a Fedora packager, the process to become one seemed too tedious to me.

  4. Joaquim (not Joachim, sorry, those were my German fingers),
    if you keep your freshmeat.net ehm freecode.net entry up-to-date by submitting release announcements, your software will get more attention. 🙂

  5. Hallo Moritz,

    I already have a .spec file that a friend wrote for me, I’ll be doing the packages before the end of the week and I’ll write about where to get them.

    I will also update OCRFeeder’s entries on those websites this week if I can. Updating stuff like that is very boring so I usually forget.

  6. Hello Jürgen,

    I am using OCRFeeder from the repositories on Ubuntu 11.10, and it works quite well with Cuneiform as the OCR engine. However, when using the latest stable version of tesseract-ocr 3.01 (which I have found to be more accurate at OCR), character recognition does not work properly in OCRFeeder. Perhaps it is related to this issue:

    http://code.google.com/p/tesseract-ocr/issues/detail?id=580

    Also, I have been experimenting with using the OCRFeeder cli to feed scanned print into a speech engine as an aid for someone with a visual impairment. The gui does a pretty good job at recognising text boxes, performing ocr and exporting as plaintext. I have found that tesseract alone reads print quite well but gets confused by artifacts such as staples, edges and line graphics. OCRFeeder is much better at dealing with this issue, and passing the correct image parts to tesseract to be recognised. I notice that the cli only exports as html and odt, and it would be very helpful if any future version had the ability to output in plaintext, which could then be passed straight to a speech engine.
