Tika / TIKA-3002

Possible bug with OCR strategy AUTO

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.22
    • Fix Version/s: None
    • Component/s: ocr, parser
    • Labels:
      None

      Description

      For performance reasons, I would like to activate the OCR scanning only when necessary. I therefore tried to set the OCR strategy to "AUTO".

      However, I see that OCR is also performed for "normal" PDF files (where no OCR should be required). This not only slows down the application but, more importantly, duplicates the extracted text.

      Trying to understand how this works, I think I may have found a possible error in the class AbstractPDF2XHTML. There, when the OCR strategy is AUTO, line 404 checks the total number of characters found on the page: if this is lower than 10, OCR is performed.

      } else if (config.getOcrStrategy().equals(PDFParserConfig.OCR_STRATEGY.AUTO)) {
          //TODO add more sophistication
          if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
              doOCROnCurrentPage();
          }
      }
      

      The logic itself is correct, but at the beginning of the method (lines 361 and 362) the two variables checked here are reset to 0, so totalCharsPerPage < 10 always holds and the condition is always true.

      I suggest moving the reset of the two variables into a finally block at the end of the method.
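To illustrate the ordering problem and the suggested fix, here is a minimal standalone sketch. It is not the Tika source: the class OcrAutoSketch and its methods are hypothetical stand-ins, and only the two counter names and the threshold check are taken from the snippet above.

```java
// Hypothetical stand-in for the per-page bookkeeping described in this report.
// Only the counter names and the AUTO condition come from the quoted snippet.
class OcrAutoSketch {
    private int totalCharsPerPage;
    private int unmappedUnicodeCharsPerPage;

    // Simulates text extraction adding to the per-page counters.
    void recordChars(int total, int unmapped) {
        totalCharsPerPage += total;
        unmappedUnicodeCharsPerPage += unmapped;
    }

    // The AUTO condition as quoted in the report.
    private boolean shouldOcr() {
        return totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10;
    }

    // Order described in the report: counters are reset before the check,
    // so totalCharsPerPage is 0 and the condition is always true.
    boolean endPageResetFirst() {
        totalCharsPerPage = 0;
        unmappedUnicodeCharsPerPage = 0;
        return shouldOcr();
    }

    // Suggested order: check first, reset in a finally block.
    // The return value is computed before the finally block runs.
    boolean endPageResetInFinally() {
        try {
            return shouldOcr();
        } finally {
            totalCharsPerPage = 0;
            unmappedUnicodeCharsPerPage = 0;
        }
    }

    public static void main(String[] args) {
        OcrAutoSketch buggy = new OcrAutoSketch();
        buggy.recordChars(500, 0); // a "normal" text page
        System.out.println("reset first, 500 chars: OCR = " + buggy.endPageResetFirst());

        OcrAutoSketch fixed = new OcrAutoSketch();
        fixed.recordChars(500, 0);
        System.out.println("reset in finally, 500 chars: OCR = " + fixed.endPageResetInFinally());

        fixed.recordChars(3, 0); // next page: nearly empty, e.g. a scanned image
        System.out.println("reset in finally, 3 chars: OCR = " + fixed.endPageResetInFinally());
    }
}
```

With the reset placed in the finally block, the counters still start from 0 for each new page, but the check sees the values accumulated for the page that has just finished.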

            People

            • Assignee:
              Unassigned
              Reporter:
              patrickherber Patrick Herber
            • Votes:
              1
              Watchers:
              4
