Tika / TIKA-3361

Improve intelligence of OCRStrategy=AUTO

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: None
    • Labels: None

    Description

      Didn’t get a whole lot of feedback on the mailing list, so here’s my attempt at improving OCRStrategy=AUTO.

      Currently, this strategy performs the following test:

      if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
          doOCROnCurrentPage(AUTO);
      }
      

      I added a way to change the two numbers involved: the threshold for the total characters per page (below which we OCR the page) and the threshold for unmapped characters per page (above which we OCR the page).

      My main concern is with the unmapped characters. OCR adds a lot of overhead, which might not be worth it for just a few unmapped characters.
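To make the trade-off concrete, here is a minimal sketch of the same test with both thresholds passed in rather than hard-coded. The class and method names (`AutoOcrCheck`, `shouldOcr`) are made up for illustration and are not Tika's actual code.

```java
// Hypothetical sketch, not Tika's actual code: the AUTO check with both
// thresholds supplied by the caller instead of hard-coded as 10 and 10.
final class AutoOcrCheck {

    private AutoOcrCheck() {
    }

    /**
     * Returns true if the page should be sent to OCR.
     *
     * @param totalChars          total characters extracted from the page
     * @param unmappedChars       characters on the page with no Unicode mapping
     * @param totalCharsThreshold OCR when totalChars is below this value
     * @param unmappedThreshold   OCR when unmappedChars is above this value
     */
    static boolean shouldOcr(int totalChars, int unmappedChars,
                             int totalCharsThreshold, int unmappedThreshold) {
        return totalChars < totalCharsThreshold || unmappedChars > unmappedThreshold;
    }

    public static void main(String[] args) {
        // A text-heavy page with 15 unmapped characters.
        System.out.println(shouldOcr(2000, 15, 10, 10)); // true: 15 > 10
        System.out.println(shouldOcr(2000, 15, 10, 20)); // false: 15 <= 20
        // A nearly empty page is OCR'd regardless of unmapped characters.
        System.out.println(shouldOcr(3, 0, 10, 20));     // true: 3 < 10
    }
}
```

With the unmapped threshold raised from 10 to 20, the 2000-character page above is no longer OCR'd, which is the kind of overhead this change is meant to avoid.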

      I added a new config, ocrStrategyAuto, which is only used if OCRStrategy=AUTO. Its format is:

      ocrStrategyAuto = best|fast|m[%], n
      

      ‘best’ and ‘fast’ are shortcuts, described further down.

      m is the threshold for the number of unmapped characters per page. It can also be specified as a percentage. So, m=20 means that a page with more than 20 unmapped characters will be OCR'd, and m=20% means that a page will be OCR'd if its unmapped characters are more than 20% of its total characters.

      n is the threshold for the total number of characters on the page. n does not need to be specified and defaults to 10.

      <param name="ocrStrategyAuto" type="string">20</param>
      

      is equivalent to

      <param name="ocrStrategyAuto" type="string">20, 10</param>
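As a rough illustration of the `m[%], n` format, the sketch below parses a value into its parts and applies the default of 10 when n is omitted. The class `OcrStrategyAutoValue` and its fields are hypothetical names chosen for this example, not the parser Tika ships, and the `best` and `fast` shortcuts are deliberately not handled here.

```java
// Hypothetical parser for the "m[%], n" form of ocrStrategyAuto.
// All names here are illustrative; this is not Tika's implementation.
final class OcrStrategyAutoValue {

    static final int DEFAULT_TOTAL_CHARS_THRESHOLD = 10;

    final double unmappedThreshold;  // a count, or a percentage if unmappedIsPercent
    final boolean unmappedIsPercent;
    final int totalCharsThreshold;

    OcrStrategyAutoValue(double unmappedThreshold, boolean unmappedIsPercent,
                         int totalCharsThreshold) {
        this.unmappedThreshold = unmappedThreshold;
        this.unmappedIsPercent = unmappedIsPercent;
        this.totalCharsThreshold = totalCharsThreshold;
    }

    /** Parses values such as "20", "20, 10" or "10%, 10". */
    static OcrStrategyAutoValue parse(String value) {
        // Split on the comma, swallowing any whitespace around it.
        String[] parts = value.trim().split("\\s*,\\s*");
        if (parts.length > 2 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("Expected m[%][, n] but got: " + value);
        }
        String m = parts[0];
        boolean percent = m.endsWith("%");
        if (percent) {
            m = m.substring(0, m.length() - 1).trim();
        }
        // NumberFormatException (an IllegalArgumentException) signals bad numbers.
        double unmapped = Double.parseDouble(m);
        int total = parts.length == 2
                ? Integer.parseInt(parts[1])
                : DEFAULT_TOTAL_CHARS_THRESHOLD;
        return new OcrStrategyAutoValue(unmapped, percent, total);
    }
}
```

Under these assumptions, `parse("20")` and `parse("20, 10")` produce the same thresholds, matching the equivalence shown above.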
      

      best is shorthand for 20, 10:

      <param name="ocrStrategyAuto" type="string">best</param>
      

      is equivalent to

      <param name="ocrStrategyAuto" type="string">20, 10</param>
      

      best is the default and is equivalent to the current behavior.

      fast is a shortcut for 10%, 10, which will avoid OCR unless the unmapped characters are more than 10% of the total characters on the page:

      <param name="ocrStrategyAuto" type="string">fast</param>
      

      is equivalent to

      <param name="ocrStrategyAuto" type="string">10%, 10</param>
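The shortcuts and the percentage form can be sketched together as follows. This is only an illustration: `OcrStrategyAutoShortcuts`, `expand` and `shouldOcr` are hypothetical names, the shortcut values are the ones given in this description, and the sketch assumes the percentage is taken over the page's total character count.

```java
import java.util.Locale;

// Hypothetical sketch of shortcut expansion and the per-page decision.
// Not Tika's implementation; the names and the percentage base are assumptions.
final class OcrStrategyAutoShortcuts {

    private OcrStrategyAutoShortcuts() {
    }

    /** Expands "best" and "fast" to the explicit "m[%], n" form from the description. */
    static String expand(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        switch (v) {
            case "best":
                return "20, 10";
            case "fast":
                return "10%, 10";
            default:
                return value.trim();
        }
    }

    /** Decides whether to OCR a page; the unmapped threshold is a count or a percentage. */
    static boolean shouldOcr(int totalChars, int unmappedChars,
                             double unmappedThreshold, boolean unmappedIsPercent,
                             int totalCharsThreshold) {
        if (totalChars < totalCharsThreshold) {
            return true; // too little extracted text on the page
        }
        if (unmappedIsPercent) {
            // Guard against division by zero when the total threshold is 0.
            return totalChars > 0
                    && 100.0 * unmappedChars / totalChars > unmappedThreshold;
        }
        return unmappedChars > unmappedThreshold;
    }
}
```

For a 1000-character page with 15 unmapped characters, a count threshold of 10 triggers OCR, while a 10% threshold does not, because only 1.5% of the characters are unmapped.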
      

          People

            Assignee: Unassigned
            Reporter: Peter Kronenberg (peterkronenberg)
            Votes: 0
            Watchers: 5
