Tika / TIKA-2106

"hocr" case on Linux fails, but works on OSX. Related to TIKA-2093


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ocr
    • Labels:
      None
    • Environment:

      Bug in Linux, but fine in OSX.

    • Flags:
      Important

      Description

      We pass an output type, either TXT or HOCR, to the Tesseract command line. When we call the command line, we lowercase it to "txt" or "hocr". However, when we read the output back in, we don't lowercase it. On OSX the constructed file path "output.HOCR" is still found, but on Linux it is not. This patch lowercases HOCR to hocr and TXT to txt in the constructed file path.

      I didn't write a unit test because I don't have a good Linux environment to test it in, but I was able to put a patched version of the Tika parser JAR into my Docker build to confirm that it works.
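
      The fix described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the class name OcrOutputPath, the OutputType enum, and the helper methods are hypothetical stand-ins for whatever the parser really uses. The likely reason for the platform difference is that the default OSX filesystem is case-insensitive, while Linux filesystems are typically case-sensitive, so the extension used to read the file back has to match the lowercased one that was passed to Tesseract.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical sketch; names are illustrative and not taken from the Tika source.
public class OcrOutputPath {

    // Assumed stand-in for the parser's configured output type.
    enum OutputType { TXT, HOCR }

    // Lowercase the type name the same way for the command line and the file path.
    // Locale.ROOT keeps the result independent of the JVM's default locale.
    static String extension(OutputType type) {
        return type.name().toLowerCase(Locale.ROOT);
    }

    // Build the path of the file that is read back after the OCR run,
    // e.g. "<base>.hocr" rather than "<base>.HOCR".
    static File outputFile(File outputBase, OutputType type) {
        return new File(outputBase.getAbsolutePath() + "." + extension(type));
    }

    public static void main(String[] args) {
        File base = new File("output");
        System.out.println(outputFile(base, OutputType.HOCR).getName()); // prints "output.hocr"
        System.out.println(outputFile(base, OutputType.TXT).getName());  // prints "output.txt"
    }
}
```

      Deriving both the command-line argument and the file extension from one helper keeps the two from drifting apart again, which is the kind of mismatch this issue describes.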

              People

              • Assignee: Tim Allison (tallison)
              • Reporter: David Eric Pugh (epugh)
              • Votes: 0
              • Watchers: 4
