Some assistance/advice for OCRing?

hedge@beehaw.org · 1 year ago

ZickZack@kbin.social · 1 year ago

Have a look at Kraken which has many state-of-the-art models for both HTR and OCR

MasterBuilder@lemmy.one · 1 year ago

I use OCRmyPDF, after being a bit frustrated with gscan2pdf. There is a simple UI available, but I just created a tiny script that does the OCR, deskew, etc. in one operation with wildcard file selection.

I also installed a JBIG2 compressor that really shrinks images. My processed docs are generally 40% to 80% smaller, and it seems to get better Tesseract output than gscan2pdf does.
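For anyone who wants to try the same approach, here is a minimal sketch of such a batch script in Python. It is not my exact script: the function names, the `--optimize` level, and the language default are assumptions for illustration. It only relies on the `ocrmypdf` command being on your PATH and on its `--deskew`, `--optimize`, and `-l` options. As far as I know, OCRmyPDF picks up the JBIG2 encoder on its own when it is installed and optimization is enabled, so no extra flag is passed for that here.

```python
import glob
import subprocess
from pathlib import Path


def build_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Return the ocrmypdf command line for a single file."""
    return [
        "ocrmypdf",
        "--deskew",          # straighten crooked scans before OCR
        "--optimize", "2",   # image optimization level (assumed value)
        "-l", language,      # Tesseract language code
        str(src),
        str(dst),
    ]


def batch_ocr(pattern: str, out_dir: str, dry_run: bool = False) -> list[list[str]]:
    """Run ocrmypdf on every file matching a wildcard pattern.

    With dry_run=True the commands are only built and returned, not run.
    """
    out = Path(out_dir)
    commands = []
    for name in sorted(glob.glob(pattern)):
        src = Path(name)
        cmd = build_command(src, out / src.name)
        commands.append(cmd)
        if not dry_run:
            out.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)  # raises if ocrmypdf fails
    return commands
```

Calling `batch_ocr("scans/*.pdf", "ocr_out")` would process every PDF in a (hypothetical) `scans` folder and write the results to `ocr_out`. Passing `dry_run=True` first lets you check the commands before anything is run.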

donio@beehaw.org · edit-2 1 year ago

OCRmyPDF is what I use as well. I've had good luck with it on board game rulebooks, which sometimes come with missing or partial embedded text. Combined with Recoll and the Emacs pdf-tools mode, I have it all indexed and at my fingertips.

hedge@beehaw.org · 1 year ago

deleted by creator

MasterBuilder@lemmy.one · 1 year ago

I don’t know, but there might be PDF viewers that permit editing layers. Try LibreOffice Draw or gscan2pdf. Maybe GIMP can do it.

brie@venera.social · 1 year ago

@hedge If the font is “strange”, you could try Ocular OCR. It is intended for historical books and can learn new fonts.

hedge@beehaw.org · 1 year ago

Thanks to @ZickZack@kbin.social, @brie@venera.social, & @bownage@beehaw.org for their responses. I forgot that Tesseract is mainly used from the command line, which, despite being a Linux person, I’m not super proficient with. It looks like the OCR in gscan2pdf and in Master PDF produced different results despite, I think, both using the same version of Tesseract.

aname@lemmy.one · 1 year ago

despite, I think, both using the same version of Tesseract.

So the difference must be in the settings. You can reproduce either result by using Tesseract directly.
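A rough sketch of what I mean, in Python, assuming the `tesseract` binary is on your PATH. The helper names and the file name `page.png` are made up for illustration, and I don't know which settings gscan2pdf or Master PDF actually pass, so the values below are only examples. `--psm` (page segmentation mode), `--oem` (engine mode), and `-l` (language) are the settings most likely to change the output.

```python
import subprocess


def tesseract_command(image: str, psm: int = 3, oem: int = 3, lang: str = "eng") -> list[str]:
    """Build a tesseract command that prints the recognized text to stdout."""
    return [
        "tesseract", image, "stdout",  # "stdout" as output base prints the text
        "-l", lang,                    # language model to use
        "--oem", str(oem),             # OCR engine mode (3 = default)
        "--psm", str(psm),             # page segmentation mode (3 = automatic)
    ]


def ocr_text(image: str, **settings) -> str:
    """Run tesseract with the given settings and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image, **settings),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Comparing `ocr_text("page.png", psm=3)` with `ocr_text("page.png", psm=6)` on the same scan should show how much the segmentation mode alone changes the result.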

🐝bownage [they/he]@beehaw.org · 1 year ago

I’ve read good things about Donut, although I haven’t used it myself yet.