Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

tesseract - OCR of PDF files with images

I’ve got Tika working with Tesseract on PDF files, but if I give it a PDF that has both searchable text and images, the text seems to be OCRed twice. Is there a way to avoid this, even if it has to make two passes: one for the plain text and another for just the images?



1 Answer

Tika uses two important flags when extracting text from PDFs:

  1. X-Tika-PDFextractInlineImages (true/false). When false, all images are ignored, which works fine for native PDFs because the text is extracted directly from the PDF. When true, the images are also used for text extraction.
  2. X-Tika-PDFocrStrategy (see https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.OCR_STRATEGY.html):
     - NO_OCR: extracts the text without OCR; works for native PDFs.
     - OCR_ONLY: only OCR is used, so the text of a native PDF is also sent to OCR.
     - OCR_AND_TEXT_EXTRACTION: runs both NO_OCR and OCR_ONLY.

So for a fully native PDF, the combination X-Tika-PDFextractInlineImages: false, X-Tika-PDFocrStrategy: NO_OCR seems to be the best.

For fully scanned PDFs you can use X-Tika-PDFextractInlineImages: true, X-Tika-PDFocrStrategy: OCR_ONLY.

But your document is probably a hybrid: it contains native parts (where you only need to extract the text) and images (which need OCR). In my opinion, there is no way to handle hybrid PDFs in Tika.
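
As a rough illustration of the two header combinations for native and scanned PDFs, here is a minimal Python sketch. It assumes a tika-server instance listening on localhost:9998 and its PUT /tika endpoint; the host, port, helper names (build_headers, extract_text) and the "native"/"scanned" labels are made up for this example. The header names and values are copied from the answer as written. It uses http.client so that the header names are sent with the casing shown.

```python
import http.client

# Header combinations quoted from the answer. The keys "native" and
# "scanned" are labels invented for this sketch, not Tika terms.
TIKA_PDF_HEADERS = {
    "native": {
        "X-Tika-PDFextractInlineImages": "false",
        "X-Tika-PDFocrStrategy": "NO_OCR",
    },
    "scanned": {
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFocrStrategy": "OCR_ONLY",
    },
}


def build_headers(kind):
    """Return the request headers for a 'native' or 'scanned' PDF."""
    headers = dict(TIKA_PDF_HEADERS[kind])  # copy, so the table stays unchanged
    headers["Content-Type"] = "application/pdf"
    headers["Accept"] = "text/plain"
    return headers


def extract_text(pdf_path, kind, host="localhost", port=9998):
    """Send the PDF to an (assumed) tika-server and return the extracted text."""
    with open(pdf_path, "rb") as f:
        body = f.read()
    conn = http.client.HTTPConnection(host, port, timeout=300)
    try:
        conn.request("PUT", "/tika", body=body, headers=build_headers(kind))
        response = conn.getresponse()
        data = response.read()
        if response.status != 200:
            raise RuntimeError(f"tika-server returned HTTP {response.status}")
        return data.decode("utf-8")
    finally:
        conn.close()
```

With a server running, a call such as extract_text("report.pdf", "native") would request plain text extraction without OCR (report.pdf is a placeholder file name).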

