A dedicated service for extracting text from PDFs (or from TIFF and other scanned-image formats).

(Originally: http://trac.okfn.org/ticket/182)

submitted 26 Apr '11, 18:47

rgrp ♦♦

edited 27 Feb, 14:32


We could set up an instance of the Data Science Toolkit...

solved 27 Apr '11, 01:05

tim mcnamara

Last weekend, I created an OCR pipeline with OCRopus, Tesseract & Celery/RabbitMQ. I need to do a little bit of work to make it available as a web service.

OCRopus does the layout analysis, splitting the image into lines/words. These split files are then sent to Tesseract for OCR and reassembled to create hOCR output. Celery is used for ad-hoc clustering, which makes it trivial to add more processing capacity.
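For readers unfamiliar with this kind of setup, the flow described above can be sketched roughly as follows. This is not Tim's code: the `segment` and `recognize` callables are hypothetical placeholders for the OCRopus layout step and the Tesseract recognition step, and a local thread pool stands in for the Celery/RabbitMQ workers. Only the `ocr_page` / `ocr_line` class names and the `bbox` property are taken from the hOCR format.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from html import escape
from typing import Callable, Iterable, List, Tuple

BBox = Tuple[int, int, int, int]  # x0, y0, x1, y1 in pixels


@dataclass
class Line:
    """One text line found by layout analysis (hypothetical container)."""
    bbox: BBox
    image: object  # whatever the segmenter yields for the cropped line


def ocr_page(
    page: object,
    segment: Callable[[object], Iterable[Line]],
    recognize: Callable[[object], str],
    workers: int = 4,
) -> str:
    """Split a page into lines, OCR each line in parallel, return hOCR markup."""
    lines: List[Line] = list(segment(page))
    # Stand-in for distributed workers: a local thread pool.
    # pool.map() returns results in input order, so lines stay in reading order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(lambda ln: recognize(ln.image), lines))
    spans = [
        "<span class='ocr_line' title='bbox %d %d %d %d'>%s</span>"
        % (*ln.bbox, escape(text))
        for ln, text in zip(lines, texts)
    ]
    return "<div class='ocr_page'>\n" + "\n".join(spans) + "\n</div>"
```

Because the two steps are passed in as plain functions, the same skeleton works for a local desktop run or with a task queue swapped in for the thread pool.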

solved 19 Sep '11, 23:34

tim mcnamara

Hi Tim,

Could you share that OCR pipeline with us? I've toyed with OCRopus & Tesseract, but have never used or heard of Celery/RabbitMQ before.

To be honest, I'm not so interested in using this as a web service yet. A local desktop pipeline would be preferable if possible, so I can tailor it to my own use case a bit.

solved 31 Mar, 12:53

Ross Mounce

Also very interested in your setup, Tim.

Otherwise, "pdftotext whatever.pdf" works pretty well! http://en.wikipedia.org/wiki/Poppler_(software)
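If you want to call pdftotext from a script rather than by hand, something like the sketch below should do. It assumes the Poppler command-line tools are installed and on the PATH; the function names are made up for illustration. The `-layout` flag and `-` (write to stdout) are standard pdftotext options. Note that pdftotext only pulls out text already embedded in the PDF, so scanned pages still need an OCR step.

```python
import shutil
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, keep_layout: bool = False) -> List[str]:
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str, keep_layout: bool = False) -> str:
    """Run pdftotext on one file and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install the Poppler utilities")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # collect stdout/stderr instead of printing them
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be along the lines of `text = extract_text("whatever.pdf")`.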

solved 21 Apr, 01:31

dcht00

edited 21 Apr, 01:34

There are already lots of web services that do this for free. Just a few examples:

http://www.newocr.com/
http://www.free-ocr.com/
http://www.onlineocr.net/

solved 18 May, 20:58

AlexGetty

Google App Engine now supports this, in theory (though it seems you still need to request access): http://developers.google.com/appengine/docs/python/conversion/overview

solved 21 May, 16:58

rgrp ♦♦
