Dedicated service for PDF-to-text extraction (or extraction from TIFF or other scanned-image formats). (Originally: http://trac.okfn.org/ticket/182) |
Last weekend, I created an OCR pipeline with OCRopus, Tesseract & Celery/RabbitMQ. I need to do a little bit of work to make it available as a web service. OCRopus does layout analysis, splitting the image into lines/words. These split files are then sent to Tesseract for OCR, and the results are reassembled to create hOCR output. Celery is used for ad-hoc clustering, making it trivial to add more processing capacity. |
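For anyone who wants to see the shape of such a pipeline, here is a minimal local sketch of the recognise-then-reassemble steps described above. It is not Tim's code: it assumes layout analysis has already produced one cropped image per line plus its bounding box, it swaps Celery/RabbitMQ for a standard-library thread pool as a single-machine stand-in, and the function names (`tesseract_line`, `ocr_lines`, `assemble_hocr`) are made up for illustration. The `--psm 7` flag assumes Tesseract 4 or later (older releases spell it `-psm`).

```python
from concurrent.futures import ThreadPoolExecutor
from html import escape
import subprocess


def tesseract_line(image_path):
    """OCR one pre-cropped line image with the tesseract CLI (must be installed)."""
    # "stdout" as the output base prints the recognised text;
    # --psm 7 tells Tesseract to treat the image as a single text line.
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "7"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def ocr_lines(lines, recognize=tesseract_line, workers=4):
    """Recognise every line image in parallel.

    lines: list of (image_path, (x0, y0, x1, y1)) pairs from layout analysis.
    Returns [(bbox, text), ...] in the same order as the input.
    """
    # Thread pool used here as a local stand-in for a Celery worker cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(recognize, [path for path, _ in lines]))
    return [(bbox, text) for (_, bbox), text in zip(lines, texts)]


def assemble_hocr(results):
    """Reassemble per-line results into a simple hOCR fragment."""
    spans = []
    for i, ((x0, y0, x1, y1), text) in enumerate(results, start=1):
        spans.append(
            f"<span class='ocr_line' id='line_{i}' "
            f"title='bbox {x0} {y0} {x1} {y1}'>{escape(text)}</span>"
        )
    body = "\n".join(spans)
    return f"<div class='ocr_page'>\n{body}\n</div>"
```

Passing a different `recognize` callable makes it easy to try another OCR engine or to test the plumbing without Tesseract installed. In a distributed setup, the per-line call would become a queued task instead of a thread-pool job.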
Hi Tim, could you share that OCR pipeline with us? I've toyed with OCRopus & Tesseract, but have never used or heard of Celery/RabbitMQ before. I'm not so interested in using this as a web service yet tbh; a local desktop pipeline would be preferred if possible, so I can tailor it to my own use case a bit. |
Also very interested in your setup, Tim. Otherwise, "pdftotext whatever.pdf" works pretty well! It ships with Poppler: http://en.wikipedia.org/wiki/Poppler_(software) |
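For scripting the pdftotext route, a small wrapper might look like the sketch below. It assumes the `pdftotext` binary is on the PATH; the helper names (`pdftotext_command`, `extract_text`) are made up for illustration. Note that pdftotext only reads a PDF's embedded text layer, so scanned pages without one still need OCR.

```python
import subprocess


def pdftotext_command(pdf_path, layout=False, first_page=None, last_page=None):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")             # keep the original physical layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]    # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]     # last page to convert
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, `extract_text("whatever.pdf", layout=True, first_page=1, last_page=3)` would return the text of the first three pages with column layout preserved.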
There are already lots of web services that do this for free. http://www.newocr.com/ http://www.free-ocr.com/ http://www.onlineocr.net/ just a few examples. |
Google App Engine in theory now supports this (though it still seems you need to request access): http://developers.google.com/appengine/docs/python/conversion/overview |