Description
As described in this Stack Overflow post, I'm having trouble extracting text from scanned PDF files. By scanned PDF files I mean PDF files that consist only of images. Because each page is an image, I can't extract the pages using a custom ParsingEmbeddedDocumentExtractor. I also tried the setExtractInlineImages method of PDFParserConfig, but that didn't work either.
There is already a ticket regarding OCR support, and it includes the PDF file I'm using for my tests.
Here is a JUnit test that reproduces my issue:
PDFOCRTest.java
@Test
public void testPDFOCRExtraction() throws IOException, SAXException, TikaException {
    // filePath points to the scanned (image-only) PDF used for the test.
    File file = new File(filePath);
    InputStream stream = new FileInputStream(file);
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();

    // Ask the PDF parser to extract inline images.
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);

    ParseContext context = new ParseContext();
    context.set(PDFParserConfig.class, config);

    PDFParser pdfParser = new PDFParser();
    pdfParser.setPDFParserConfig(config);
    pdfParser.parse(stream, handler, metadata, context);

    // This assertion fails for the scanned PDF: no text is extracted.
    String text = handler.toString().trim();
    assertFalse(text.isEmpty());
}