Tika / TIKA-2844

OCR_STRATEGY.OCR_ONLY does not extract any text

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.20
    • Fix Version/s: None
    • Component/s: ocr
    • Labels: None
    • Environment: Win7, 64-bit, Tesseract 4.1.0 and ImageMagick 7.0.8 installed

      Description

      I have some PDFs that were scanned and OCRed by other software, but the quality of the recognized text is quite poor. I would therefore like to ignore the text already embedded in the PDF and run a fresh OCR pass with Tesseract.

      So I use OCR_STRATEGY.OCR_ONLY. Unfortunately this does not extract any text from the PDF.

      When I use OCR_AND_TEXT_EXTRACTION I get the poor text from the original PDF.

      When I call the tesseract binary directly from the console, the expected text is extracted.
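
The console check above can also be repeated from inside the JVM, which helps rule out a path problem. The following is only a minimal sketch and not part of Tika: the class name `TesseractProbe`, the executable name `tesseract`, and the 30-second timeout are assumptions made for illustration. It runs `tesseract --version` from the same directory that is passed to `setTesseractPath`.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not a Tika class): checks whether the Tesseract
// binary can be started from Java using a given installation directory.
public class TesseractProbe {

    // Builds "<dir>/tesseract --version". The executable name is an assumption;
    // on Windows the ".exe" suffix is normally resolved by the OS.
    static List<String> versionCommand(Path tesseractDir) {
        return Arrays.asList(tesseractDir.resolve("tesseract").toString(), "--version");
    }

    // Returns true only if the process starts and exits with code 0 in time.
    static boolean canRun(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.INHERIT) // show output on the console
                    .start();
            if (!process.waitFor(30, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Typically means the executable was not found at that path.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Default mirrors the directory used in this report; pass another as args[0].
        Path dir = Paths.get(args.length > 0 ? args[0] : "c:/Progra~1/TESSER~1");
        List<String> command = versionCommand(dir);
        System.out.println(command + " runnable: " + canRun(command));
    }
}
```

If this reports `false` for the configured directory while the console call works, the path given to Tika probably differs from the one the console resolves.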

      After trying several tutorials and examples, this is my code:

      final InputStream pdf = Files.newInputStream(Paths.get("e:/path/to/my.pdf"));
      final ByteArrayOutputStream out = new ByteArrayOutputStream();
      
      final TikaConfig config = TikaConfig.getDefaultConfig();
      final String version = (new Tika(config)).toString();
      LOG.info("Tika version " + version + " / " + config.getParser().getClass().getName());
      
      final BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
      
      final PDFParserConfig pdfConfig = new PDFParserConfig();
      pdfConfig.setExtractInlineImages(true);
      pdfConfig.setOcrStrategy(OCR_STRATEGY.OCR_ONLY);
      
      final TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
      tesserConfig.setTesseractPath("c:/Progra~1/TESSER~1");
      tesserConfig.setImageMagickPath("C:/Progra~1/IMAGEM~1.8-Q");
      tesserConfig.setEnableImageProcessing(1);
      
      final Parser parser = new AutoDetectParser();
      final Metadata meta = new Metadata();
      final ParseContext parsecontext = new ParseContext();
      
      parsecontext.set(Parser.class, parser);
      parsecontext.set(PDFParserConfig.class, pdfConfig);
      parsecontext.set(TesseractOCRConfig.class, tesserConfig);
      
      parser.parse(pdf, handler, meta, parsecontext);
      System.out.println("OCR Result: " + handler.toString());
      
      

      My maven dependencies:

      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.20</version> <!-- 1.20 -->
      </dependency>

      <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
      </dependency>

      <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.1</version> <!-- 1.4.0 -->
      </dependency>

      <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-jpeg2000</artifactId>
        <version>1.3.0</version>
      </dependency>

      <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>jbig2-imageio</artifactId>
        <version>3.0.0</version>
      </dependency>
      
      

       

      Since there is no error message or stack trace at all, I don't understand why I get no result. Even if this is not a bug, Tika should at least output some hint about what is going wrong.
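
Until such a hint exists, a small preflight check before parsing can at least surface a missing executable. This is a sketch under stated assumptions, not Tika behavior: the class `OcrPreflight` and its methods are hypothetical, and the tool names `tesseract` and `convert` are guesses at what the OCR step invokes (ImageMagick 7 installs `magick` as its main binary, so `convert` may be absent unless the legacy utilities were installed).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical preflight check (not a Tika class): reports which external
// tools cannot be found in the directories that will be configured for OCR.
public class OcrPreflight {

    // True if dir contains a regular file named baseName or baseName.exe.
    static boolean hasExecutable(Path dir, String baseName) {
        for (String suffix : new String[] {"", ".exe"}) {
            if (Files.isRegularFile(dir.resolve(baseName + suffix))) {
                return true;
            }
        }
        return false;
    }

    // Returns one message per missing tool; an empty list means both were found.
    // The tool names "tesseract" and "convert" are assumptions.
    static List<String> missingTools(Path tesseractDir, Path imageMagickDir) {
        List<String> problems = new ArrayList<>();
        if (!hasExecutable(tesseractDir, "tesseract")) {
            problems.add("tesseract not found in " + tesseractDir);
        }
        if (!hasExecutable(imageMagickDir, "convert")) {
            problems.add("convert not found in " + imageMagickDir);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Directories taken from the code in this report.
        List<String> problems = missingTools(
                Paths.get("c:/Progra~1/TESSER~1"),
                Paths.get("C:/Progra~1/IMAGEM~1.8-Q"));
        if (problems.isEmpty()) {
            System.out.println("Both tools found.");
        } else {
            problems.forEach(System.out::println);
        }
    }
}
```

An empty result does not prove that OCR will work; it only rules out the simplest cause of a silent, empty extraction.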

            People

            • Assignee: Unassigned
            • Reporter: Horst Krause (hkrause)