One more step in OCR with OCRFeeder 0.7

I have been hacking on some cool new features for OCRFeeder for a while, and now it is time to show them to the world in a new release.

These features fall mainly into two areas: improving the accessibility (a11y) of the UI and improving the recognition of documents.

A11y Improvements

The a11y work includes the typical UI changes, such as adding mnemonics and missing labels and relations, but also changes that have more to do with UX, like using a progress dialog to inform users that time-consuming operations are being carried out. This means that PDF import and OCR no longer block the UI.
Other changes in this category were navigation through the content boxes (before, these could only be selected by clicking on them), the selection of all boxes and the deletion of selected boxes.

The following screenshot shows the box editor area of OCRFeeder with its mnemonics highlighted:

Box editing area

Recognition Improvements

Sometimes, text columns are so close to each other that they end up being recognized as a single paragraph, so I added a post-detection method to solve this issue. This feature is optional and can be toggled from the Preferences dialog.

Here’s an example of the difference it makes:

Before the column detection improvements
After the column detection improvements
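
To give an idea of how such a post-detection step might work, here is a minimal sketch. It is not OCRFeeder's actual code: it assumes that a box can be summarized as a list with the number of dark pixels in each pixel column, and the names find_column_gap and min_gap are made up for this example. It looks for the widest blank vertical strip inside the box, which is a likely place to split a paragraph into two columns.

```python
def find_column_gap(column_ink, min_gap=3):
    """Return the (start, end) x-range of the widest interior blank run, or None.

    column_ink[x] is the number of dark pixels in pixel column x of a box.
    The returned end index is exclusive. Hypothetical helper for illustration.
    """
    best = None
    start = None
    for x, ink in enumerate(column_ink):
        if ink == 0:
            if start is None:
                start = x  # a blank run begins here
        elif start is not None:
            # A blank run just ended at x - 1. Runs touching the left edge
            # are margins, not gaps between columns, so they are ignored.
            if start > 0 and x - start >= min_gap:
                if best is None or x - start > best[1] - best[0]:
                    best = (start, x)
            start = None
    # A run that is still open here touches the right edge: also a margin.
    return best
```

A box would then be split at the middle of the returned range; when the function returns None, the box is left alone.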

Scanned document images are usually skewed and this makes it more difficult for the contents to be successfully detected and “OCRed”. I decided to implement an algorithm to deskew these images. The algorithm uses the Hough transform to try to find lines in the image and their angles and, while it is a bit slow, it works well:

Skewed image
Deskewed image

This action can be applied to a loaded image but can also be configured to run automatically before images are added. The Unpaper tool can now also be set to clean images before they are added.
This makes it much easier to successfully recognize images obtained from a scanner.
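
For the curious, the idea behind Hough-based skew detection can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions, not the code shipped in OCRFeeder: the function name estimate_skew and the parameters max_angle and step are hypothetical, and the input is assumed to be a list of (x, y) coordinates of dark pixels. For each candidate angle, every pixel votes for the line it would lie on, and the angle whose votes are most concentrated wins. Trying every angle against every pixel is also why this kind of approach is not fast.

```python
import math


def estimate_skew(points, max_angle=15.0, step=0.5):
    """Estimate the skew angle, in degrees, of the lines formed by dark pixels.

    points: iterable of (x, y) dark pixel coordinates.
    Hypothetical helper for illustration only.
    """
    points = list(points)
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        # Hough accumulator for this angle: each point votes for the line,
        # identified by its distance rho from the origin, that passes
        # through it with this inclination.
        votes = {}
        for x, y in points:
            rho = round(y * cos_t - x * sin_t)
            votes[rho] = votes.get(rho, 0) + 1
        # Summing squared counts rewards angles where many points share
        # the same few lines, i.e. where the text lines are aligned.
        score = sum(v * v for v in votes.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

The image would then be rotated by the opposite of the returned angle. Whether a positive angle means clockwise or counterclockwise depends on the direction of the y axis in the image library being used.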

Some fine-tuning of the content boxes’ bounds was done by trying to shorten their margins, that is, reducing the distance between the boxes and their actual contents.
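
As a rough illustration of what shortening the margins means, here is a small sketch. It is an assumption of mine about one way to do it, not the code used in OCRFeeder: tighten_box is a hypothetical name, boxes are assumed to be (left, top, right, bottom) tuples and the content is assumed to be a list of dark pixel coordinates.

```python
def tighten_box(box, dark_pixels):
    """Shrink box (left, top, right, bottom) to the dark pixels inside it.

    Hypothetical helper for illustration; bounds are inclusive.
    """
    left, top, right, bottom = box
    inside = [(x, y) for x, y in dark_pixels
              if left <= x <= right and top <= y <= bottom]
    if not inside:
        return box  # nothing to wrap around, keep the box as it is
    xs = [x for x, _ in inside]
    ys = [y for _, y in inside]
    return (min(xs), min(ys), max(xs), max(ys))
```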

The font size recognition was also tweaked to solve the problem of paragraphs with initials (you know, the huge starting characters), which were influencing the whole paragraph’s font size.
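
One simple way to keep a huge initial from skewing the result, shown here only as a sketch and not necessarily what OCRFeeder does, is to use the median of the measured text heights instead of their average. The name estimate_font_size and the idea of having one height per line are assumptions made for this example.

```python
from statistics import median


def estimate_font_size(line_heights):
    """Return a representative text height for a paragraph.

    line_heights: measured height, in pixels, of each text line.
    Hypothetical helper for illustration only.
    """
    if not line_heights:
        raise ValueError("at least one line height is required")
    # The median ignores a single oversized value such as a drop cap,
    # whereas the mean would be pulled towards it.
    return median(line_heights)
```

With heights of 48, 12, 12, 13 and 12 pixels, the average would be 19.4, while the median stays at 12.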

To finish the recognition improvements, I have added an optional action to find and fix the text’s line breaks. Usually, OCR engines don’t consider “semantic line breaks”, that is, they always insert a newline at the end of each line, even in the middle of a sentence.
Using some regular expressions, I try to find these “fake” line breaks and recover the original flow of the text. Like some of the features mentioned above, this one can also be turned on or off from the Preferences dialog.
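
To illustrate the idea, here is a minimal sketch of such a regular expression. It is not the exact set of expressions used in OCRFeeder, and the name fix_line_breaks is made up for this example. It assumes that a line break is “fake” when the line does not end with sentence-ending punctuation and is not next to an empty line.

```python
import re

# A newline is considered "fake" when it is not preceded by sentence-ending
# punctuation or another newline, and is not followed by another newline.
# This heuristic is an assumption made for this example.
_FAKE_BREAK = re.compile(r"(?<![.!?:\n])\n(?!\n)")


def fix_line_breaks(text):
    """Replace line breaks that fall in the middle of a sentence with spaces."""
    return _FAKE_BREAK.sub(" ", text)
```

For example, "This is a line that\nwraps around.\nNext one." becomes "This is a line that wraps around.\nNext one.". A heuristic like this will still get some cases wrong, such as headings without final punctuation, which is one reason to keep the feature optional.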

Here’s what the Preferences dialog looks like now:



To finish, images can now be dragged and dropped onto the pages’ area, and the mouse wheel can be used to scroll horizontally when combined with the Shift key, thanks to Stefan Löffler. And of course, several bugs were fixed and the code was improved.

As you can see, this is a “rich” new version of OCRFeeder, which remains the easiest way to use OCR on a desktop. You are welcome to file bugs in Bugzilla, send patches and feature requests to the mailing list, or approach me if you’re at GUADEC.

Download: OCRFeeder 0.7 tarball on GNOME FTP

21 thoughts on “One more step in OCR with OCRFeeder 0.7”

  1. You say that this was your masters thesis – but you didn’t actually write the OCR part!!!! How is that a masters thesis to write a GUI for someone else’s program? How is that original research?

  2. Can’t you just do a PCA of the (X, Y) coordinates of the dark pixels in the image instead of the Hough-stuff? A quick test of this approach seems to indicate that it works almost as well as your approach and it’s practically instant in terms of computations.

  3. Hi Bob Bobson,

    I didn’t write an OCR engine because there are quite a few, out there, and good ones.

    What I did write was the whole content detection algorithm, you know, detecting the paragraphs and graphics on document images. Only after this process, which constitutes the core of my master’s thesis, do the OCR engines and GUI stuff come in.

    So I suggest you check your facts before going this aggressive with my work or anyone’s.
    I think what I did is a good thing and it was original research, so, maybe you didn’t know these facts but still, going anonymous and leaving flame comments is not really the way to present your concerns.

    I hope I have clarified doubts and that you are able to use my program with ease.

  4. This looks truly promising!

    Practical OCR is still a weak point and this looks like an important step in the right direction.

  5. Hi Tobias,

    Yes, Ocropus already performs that layout analysis but when I started the thesis and some previous research for it, Ocropus was not what it is today.

    Anyway, if someone wants to add Ocropus as an alternative layout analysis method, I’m totally okay with it, it only brings value to the app and its users.


  6. Joaquim, I think that even if you had not done original research (which you have done as you write), the fact that something _useful_ came out of your thesis is awesome.

  7. Thank you Tomáš.
    It’s a shame some people try to bring down one’s work (even more so when it’s based on false assumptions) instead of realizing that it’s a useful tool and it’s Free Software.

    Anyway, I try to focus on positive, constructive criticism 😉

  8. Hey Mr. Rocha –
    It’s been a while since I last tried OCRFeeder – maybe just under a year. You may have already included it now, but I can’t remember an option to fully manually zone/box the areas of the scanned image to OCR, bypassing any automatic layout recognition. Also, for the output (.odt), are there options for each scanned page’s OCR output to be marked off by page breaks? If these options are absent, do you have any interest in providing them, or could you email me and tell me who I could pester for that? My coding capacity is very, very limited.

    I remember it being a solid performer for standard academic layouts, but it couldn’t handle the intensely goofy (like TV-in-a-book) undergrad textbooks that I convert. Manual zoning is necessary for such oddly formatted pages…

    Excellent work, looking forward to trying the new version, and keep on with this splendid project! I thank you on behalf of many for not going proprietary on us!

  9. Hi John,

    Fully manual outlining of content has been possible since the first release of OCRFeeder.

    Can you elaborate on the page breaks? Each image you import into OCRFeeder represents a page in the generated document.


  10. Joaquim,
    Do you have a PPA set up? The current release of ocrfeeder on my Ubuntu distro (10.04 LTS) is 0.6.6, and I could use the upgraded capability to distinguish columns from each other.

  11. Hi Joaquim,

    Wonderful program! I have been using it for a couple of days now, and importing images, detecting and cleaning pages all work fine and are straightforward to use.

    There is one thing I can’t work out. I gave OCRFeeder 20 scanned and corrected pages to export to ODT. Great job: within seconds I had my .ODT file with all 20 pages, correct layout with columns and pictures.
    Only one problem: the page size is defined as Custom and all pages have a slightly different size, but much smaller than the original A4 size. When I resize the pages to A4 (in LibreOffice) the different text and graphics frames stay the same size and need to be resized manually. With an average of 5 frames per page and 180 pages to go, that is a sizeable job!

    Any suggestions on how I can predefine the page size before exporting to .ODT?

    Thanks in advance,

  12. Hi Herman,

    You can try setting the pages’ size from Edit > Edit Page and see if it works for you (I cannot remember if it fixes your case 🙂 ).

    Let me know if it works for you and/or if you have more suggestions.

  13. Joaquim,
    A year ago, I thought I had tried everything dealing with scanning under Linux (including using no-cost Windows programs under Wine), until I saw ocrfeeder just last month.
    I am enjoying learning your system, and the improvements, even in a short jump from .7.5 to .7.9! (The difference between the Mint 12 and 13 repos.)

    Keep up the good work!

