A dedicated service for extracting text from PDFs (or from TIFF and other scanned-image formats). (Originally: http://trac.okfn.org/ticket/182) submitted 26 Apr '11, 18:47 rgrp ♦♦ |
We could set up an instance of the Data Science Toolkit... solved 27 Apr '11, 01:05 tim mcnamara |
Last weekend, I created an OCR pipeline with OCRopus, Tesseract & Celery/RabbitMQ. I need to do a little bit of work to make it available as a web service. OCRopus does layout analysis, splitting the image into lines/words. The split files are then sent to Tesseract for OCR, and the results are reassembled into hOCR output. Celery is used for ad-hoc clustering, making it trivial to add more processing capacity. solved 19 Sep '11, 23:34 tim mcnamara |
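For anyone wanting to try the split / OCR / reassemble idea locally, here is a minimal sketch. It is not Tim's pipeline: it assumes the line images already exist (e.g. produced by a layout-analysis step), uses a standard-library thread pool as a stand-in for Celery/RabbitMQ, and emits a simplified hOCR-like fragment without the bounding boxes real hOCR carries. The names `ocr_with_tesseract` and `ocr_page` are hypothetical, and the `tesseract` command-line tool is assumed to be on the PATH.

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from html import escape


def ocr_with_tesseract(image_path):
    """OCR one line image by shelling out to the tesseract CLI (assumed on PATH)."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "out")
        # `tesseract <image> <outputbase>` writes the recognised text to <outputbase>.txt
        subprocess.run(["tesseract", image_path, base], check=True, capture_output=True)
        with open(base + ".txt", encoding="utf-8") as fh:
            return fh.read().strip()


def ocr_page(line_images, ocr_line=ocr_with_tesseract, workers=4):
    """Fan line images out to workers, then reassemble the results in reading order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so lines stay in reading order
        texts = list(pool.map(ocr_line, line_images))
    spans = [
        "<span class='ocr_line' id='line_%d'>%s</span>" % (i + 1, escape(text))
        for i, text in enumerate(texts)
    ]
    # Simplified hOCR-like wrapper; real hOCR also records bounding boxes
    return "<div class='ocr_page'>\n" + "\n".join(spans) + "\n</div>"
```

Because `ocr_line` is just a callable, the thread pool could later be swapped for a real task queue without touching the reassembly step, and the function can be exercised with a fake recogniser when Tesseract is not installed.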
Hi Tim, could you share that OCR pipeline with us? I've toyed with OCRopus & Tesseract, but have never used or heard of Celery/RabbitMQ before. I'm not so interested in using this as a web service yet tbh; a local desktop pipeline would be preferable if possible, so I can tailor it to my own use case a bit. solved 31 Mar, 12:53 Ross Mounce |
Also very interested in your setup, Tim. Otherwise, "pdftotext whatever.pdf" (part of Poppler) works pretty well! http://en.wikipedia.org/wiki/Poppler_(software) solved 21 Apr, 01:31 dcht00 |
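If you want to call pdftotext from a script rather than by hand, something like the following should work. It is only a sketch: the helper names `pdftotext_command` and `extract_text` are made up for illustration, and it assumes the `pdftotext` binary from Poppler is installed and on the PATH. Note that pdftotext reads the text layer embedded in a PDF, so it will not recover text from pages that are purely scanned images; those still need OCR.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path=None, layout=False):
    """Build the argument list for a pdftotext call."""
    cmd = ["pdftotext"]
    if layout:
        # -layout tries to preserve the physical layout of the page
        cmd.append("-layout")
    cmd.append(pdf_path)
    if txt_path is not None:
        # Without an output name, pdftotext writes next to the PDF with a .txt extension
        cmd.append(txt_path)
    return cmd


def extract_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text as a string."""
    # "-" as the output file makes pdftotext write to stdout
    cmd = pdftotext_command(pdf_path, "-", layout=layout)
    result = subprocess.run(cmd, check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("whatever.pdf")` is then the scripted equivalent of the one-liner above, with the text returned in memory instead of written to whatever.txt.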
There are already lots of web services that do this for free. A few examples: http://www.newocr.com/ http://www.free-ocr.com/ http://www.onlineocr.net/ solved 18 May, 20:58 AlexGetty |
Google App Engine now supports this in theory (though it still seems you need to request access): http://developers.google.com/appengine/docs/python/conversion/overview solved 21 May, 16:58 rgrp ♦♦ |