OCRFeeder goes public!

Finally, the initial commit of my new project, OCRFeeder, has landed in a public SVN repository.

OCRFeeder is an Optical Character Recognition and Document Analysis and Recognition program for GNU/Linux.
It features a complete graphical user interface in GTK but can also be used from the command line for automation purposes.

It is written in Python and was developed as the project for my Master’s Thesis in Computer Science Engineering.

So go on and check out the project’s source:

  svn checkout http://ocrfeeder.googlecode.com/svn/trunk/ ocrfeeder-read-only

Note that this is only an SVN release for now, so that I can get some feedback and the traditional first bug reports.
You can also be part of this project as a developer or a translator; just drop me an email.

I hope this is a good step in the evolution of OCR technologies on GNU/Linux systems.

Soon I’ll be adding a list of features here, as well as a screencast.


My Master’s Thesis

It’s been almost a month since my last post… so let’s catch up a little.

On February 19th I drove down from Galicia to Portugal; it was quite a boring trip of more than 7 hours. Luckily I had my girlfriend right by my side, and the iPod’s battery lived up to its reputation and provided the soundtrack for the whole trip.

I went to Portugal because the next day, February 20th, I finally presented my Master’s Thesis in Computer Science Engineering!
Yeah! A little more than a year after I went to Seville, and about 8 months after I returned to Portugal, I finally presented it and completed my Master of Science degree.

The thesis was about developing an OCR suite for GNU/Linux, based on some ideas I had before. I started developing it when I returned from Seville and finished it in October (I was lucky that the deadlines got extended, so I didn’t have to deliver it before September). Then it took me until mid-December to finish writing the thesis (final tests of the program included), and I delivered it on the 15th of December. Thanks to the bureaucratic services at my University, the soonest the thesis presentation could be arranged was the aforementioned February 20th… But hey! Now it is done!

As for the OCR program, it is written in Python, features a GUI powered by PyGTK, and can use several open source OCR engines to perform the recognition. It lets the user correct and edit the results, among other things, and generates ODT or HTML files. You can also use it from the CLI in case you want to automate some tasks or link it with other apps.

I will be releasing the program under the GPL soon, so stay tuned.

I’d really like to thank all the people who have supported me all along and keep supporting me:
Mom, Dad, Bro, Girlfriend, Professor Luís Arriaga, and friends such as Luís Rodrigues and Pedro Salgueiro.

PS: My absence from the www world outside of work is due to the fact that I’ve been internetless since I came to A Coruña. *Hopefully* next week the ISP I chose will flip the switch of information in my flat and I’ll be connected to the world once again. Then I’ll post about what’s been happening in my world of GTK, Igalia and Django.

SAPO Summerbits

I was selected to be part of the first SAPO Summerbits, an initiative inspired by Google Summer of Code but open only to people studying in Portugal.
I really wanted to be part of Google Summer of Code this year but didn’t apply since I was supposed to be working in Spain. So, when I left Spain, I was of course disappointed not to be part of Google SoC.

That said, I am really happy to participate in Summerbits and be part of this great initiative by SAPO.
This first edition of Summerbits consists of only 10 projects, and two of them are from my University: mine and Paulo’s.

My work will build on top of DSpace: I will develop a system that performs OCR on a scanned document, retrieves the printed words from it, and sets them as the document’s tags. This will hopefully automate the process and spare some people a lot of work.
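
To give a rough idea of the tagging step, here is a minimal sketch in Python. It assumes the OCR text is already available as a plain string and simply picks the most frequent words as candidate tags. That heuristic is only one possibility, and the function name `suggest_tags`, the small stop-word list and the parameters are all hypothetical; none of this is DSpace’s API or the project’s actual code.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would
# need a fuller, language-specific one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def suggest_tags(ocr_text, max_tags=5, min_length=3):
    """Return the most frequent meaningful words in OCR output as candidate tags."""
    # Runs of letters only (accented letters included); digits and punctuation are dropped.
    words = re.findall(r"[^\W\d_]+", ocr_text.lower())
    counts = Counter(
        w for w in words if len(w) >= min_length and w not in STOP_WORDS
    )
    # Most frequent first; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_tags]]


if __name__ == "__main__":
    sample = "Scanned invoice: the invoice total and the invoice date. Total due in March."
    print(suggest_tags(sample, max_tags=3))  # ['invoice', 'total', 'date']
```

In a real setup, the returned list would then be stored as the document’s tags through whatever mechanism the repository offers.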

For those who don’t know:
SAPO is the major Portuguese ADSL provider and is considered by some to be the Portuguese Yahoo (the company, not the adjective 🙂 ), with a search/media/shopping/information/etc. homepage.