Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
374 views
in Technique[技术] by (71.8m points)

java - Apache Tika extract scanned PDF files

i'm having some troubles using Apache TIKA (version 1.10). I got some PDF files which are just scanned pieces of paper. That means each page is just an image. My goal is to extract the text of the PDF files anyway.

My tesseract is set up correctly and extracting JPG and PNG files works like a charm. The code i'm using looks like that (don't mind the missing excetion handling):

public String extractText(InputStream stream) {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    parser.parse(stream, handler, metadata, context);
    String text = handler.toString();
    return text;
}

I searched a lot but i didn't find any solutions that work for me. I already tried the setExtractInlineImages method of the PDFParserConfig class but this didn't change a thing. Extracting embedded documents using a custom ParsingEmbeddedDocumentExtractor did extract embedded resources of a doc file but not for my PDF files.

It would be awesome if anyone of you could provide some help :)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Tim Allison brought the solution:

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);

ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser); //need to add this to make sure recursive parsing happens!

parser.parse(stream, handler, new Metadata(), parseContext);

This works for me :)

EDIT: Here is the complete solution:

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.FileInputStream;
import java.io.IOException;

/**
 * @since 8/26/16
 */
public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        //need to add this to make sure recursive parsing happens!
        parseContext.set(Parser.class, parser);

        FileInputStream stream = new FileInputStream("samplepdf.pdf");
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata, parseContext);
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}

Maven Dependencies:

<dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.13</version>
    </dependency>
    <dependency>
      <groupId>com.levigo.jbig2</groupId>
      <artifactId>levigo-jbig2-imageio</artifactId>
      <version>1.6.5</version>
    </dependency>
  </dependencies>

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...