Status report on OCR

Just a quick update on our progress on the OCR side. While ABBYY was unable to reproduce the bug that affects the conversion of a multi-page PDF into separate images, we have now written a workaround that does this conversion before feeding the data into the OCR engine. This also has other advantages, e.g. the ability to run OCR on very large documents without running into memory problems. We are now writing the new thread handling that we need for this workaround and hope to begin internal testing tonight or tomorrow morning.
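
A minimal sketch of the page-by-page idea described above, assuming the pages have already been split out of the PDF. The names `ocr_document` and `recognize_page` are hypothetical stand-ins for illustration, not the plugin's code or the ABBYY API. A bounded queue caps how many pages wait in memory at once while worker threads process them; error handling is left out for brevity.

```python
import queue
import threading


def ocr_document(pages, recognize_page, workers=2, max_pending=4):
    """Apply recognize_page to each page and return the results in page order.

    pages          -- any iterable of page images (hypothetical input)
    recognize_page -- callable standing in for the OCR engine (assumption)
    """
    # Bounded queue: the producer blocks once max_pending pages are waiting,
    # so a very large document is never held in memory all at once.
    pending = queue.Queue(maxsize=max_pending)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more pages for this worker
                break
            index, page = item
            text = recognize_page(page)
            with lock:
                results[index] = text

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    # Feed pages one at a time; put() waits while the queue is full.
    for index, page in enumerate(pages):
        pending.put((index, page))

    # One sentinel per worker so every thread shuts down.
    for _ in threads:
        pending.put(None)
    for t in threads:
        t.join()

    return [results[i] for i in range(len(results))]
```

Calling `ocr_document(page_images, engine_call)` with a real engine wrapper in place of `engine_call` would return one text result per page, in the original page order.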

Update: Our workaround is working and we have just begun to test it on a number of different machines in our labs. Thanks to the new architecture, the plugin is chewing through a 40-page document here at the moment with moderate memory requirements, and queueing seems to work as expected. OCR results on both German- and English-language documents seem to be much better than with IRIS and contain fewer garbled characters, even though some tests show that the ABBYY engine seems to drop unrecognized words completely. We are curious to read your reports once we have released the new OCR plugin.

13 Responses to “Status report on OCR”

  1. RMF says:

    Bravo. Keep it coming.

  2. JamesR says:

    Thanks for the update! We appreciate the communication.

  3. PaddyS. says:

    Really appreciate the news. Good luck!

  4. Chad Black says:

    Excellent news. OCR is the one thing holding back my database conversions. I’m also glad to hear this will ease the memory requirements for large files: full-text books downloaded from Google Books don’t have their text layer, and are often very large.

  5. Danny Zacharias says:

    Good to hear. I had not realized how much I rely on OCR. You don’t know what you got till it’s gone!

  6. michaelnau says:

    Looking forward to it–and streamlining Acrobat Pro out of my workflow!

  7. Friar says:

    ¡Excelente, Gracias!
    Amazing how you have taken this problem and are turning it into something better! Sounds like it will be worth the wait.

  8. Nigel Spier M.D. says:

    Thanks for the update. Encouraging news indeed. Meanwhile, I have figured out a workaround using folder actions, a bit of scripting and the “watched folder” feature in ReadIris Pro 11.6. I would be happy to share it if anyone is interested. Of course, you have to own ReadIris, which thankfully I do.

  9. kalisphoenix says:

    Great! Thank you guys 🙂

  10. Mirko says:

    I really like your transparent way of communicating. What about the internal tests, have they been successful?

  11. allan says:

    Still waiting for OCR!

  12. Michael Abramson says:

    Why does it take five to ten times longer to scan 3-5 pages of text with the update?

    Stuff that would take under a minute is still processing five minutes later.

    Please advise.

  13. Eric says:

    @Michael: The next beta will have an option slider that lets you choose between accuracy and speed yourself; maybe this helps.