Scan to Text Conversion Service

A dedicated service for extracting text from PDFs (or from TIFF or other scanned-image formats). (Originally: http://trac.okfn.org/ticket/182)

Tags: pdf, service, text

submitted 26 Apr '11, 18:47

rgrp

edited 27 Feb, 14:32

6 Responses

We could set up an instance of the Data Science Toolkit...

solved 27 Apr '11, 01:05

tim mcnamara

Last weekend, I created an OCR pipeline with OCRopus, Tesseract & Celery/RabbitMQ. I need to do a little bit of work to make it available as a web service.

OCRopus does the layout analysis, splitting the image into lines/words. The split files are then sent to Tesseract for OCR and reassembled to create hOCR output. Celery is used for ad-hoc clustering, making it trivial to add more processing capacity.

solved 19 Sep '11, 23:34

tim mcnamara
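
For anyone who wants to try the split / OCR / reassemble shape described above on a single machine, here is a rough sketch. It is not the pipeline from the post: it assumes the page has already been cut into line images (the OCRopus step), it swaps Celery/RabbitMQ for a local thread pool, and it returns plain text rather than hOCR. The function names are made up for illustration; the only external dependency is the `tesseract` command on your PATH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def tesseract_command(image_path: str) -> List[str]:
    """Build a Tesseract CLI call that prints recognised text to stdout.

    "--psm 7" asks Tesseract to treat the image as a single text line
    (Tesseract 3.x spelled this flag "-psm").
    """
    return ["tesseract", image_path, "stdout", "--psm", "7"]


def ocr_line(image_path: str) -> str:
    """OCR one pre-segmented line image by shelling out to Tesseract."""
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def ocr_page(
    line_images: Sequence[str],
    recognise: Callable[[str], str] = ocr_line,
    workers: int = 4,
) -> str:
    """Fan line images out to a worker pool and rejoin the text in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the lines stay in
        # reading order even if the workers finish out of order.
        lines = list(pool.map(recognise, line_images))
    return "\n".join(lines)
```

Calling `ocr_page(["line_001.png", "line_002.png"])` (hypothetical file names) would run Tesseract on each line image and join the results. The `recognise` parameter is there so the OCR step can be replaced, for example with a stub while testing, or with a function that submits the job to a task queue if you later want the distributed version.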

Hi Tim,

Could you share that OCR pipeline with us? I've toyed with OCRopus & Tesseract, but have never used or heard of Celery/RabbitMQ before.

To be honest, I'm not so interested in using this as a web service yet. A local desktop pipeline would be preferable if possible, so I can tailor it to my own use case a bit.

solved 31 Mar, 12:53

Ross Mounce

Also very interested in your setup, Tim. Otherwise, "pdftotext whatever.pdf" works pretty well! http://en.wikipedia.org/wiki/Poppler_(software)

solved 21 Apr, 01:31

dcht00

edited 21 Apr, 01:34
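
If you want to call pdftotext from a script rather than by hand, a small wrapper along these lines should do. This is only a sketch: the function names are invented for illustration, and it assumes Poppler's `pdftotext` is installed and on your PATH. Note that pdftotext reads the text layer already embedded in the PDF, so a PDF that contains only scanned page images will still need OCR.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, keep_layout: bool = False) -> List[str]:
    """Build the pdftotext call; "-" as the output file means stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    return cmd + [pdf_path, "-"]


def pdf_to_text(pdf_path: str, keep_layout: bool = False) -> str:
    """Return the embedded text of a PDF by shelling out to pdftotext."""
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

For example, `pdf_to_text("whatever.pdf")` returns the extracted text as a string instead of writing `whatever.txt` next to the input file.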

There are already lots of web services that do this for free. Just a few examples:

http://www.newocr.com/
http://www.free-ocr.com/
http://www.onlineocr.net/

solved 18 May, 20:58

AlexGetty

Google App Engine now supports this, in theory (though it seems you still need to request access): http://developers.google.com/appengine/docs/python/conversion/overview

solved 21 May, 16:58

rgrp

Asked: 26 Apr '11, 18:47

Seen: 1,469 times

Last updated: 21 May, 16:58