Tika / TIKA-3517

Text extraction doesn't work for Pages and Numbers when Tesseract is disabled


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • None
    • None
    • None
    • Environment: Tested on RHEL 7. I got the same results with Tesseract 3 and Tesseract 4, though the Tesseract version should not matter, because the problems occur when Tesseract is disabled.

    Description

      When I run Tika to extract text from Apple Pages and Numbers files, text extraction does not work if Tesseract is disabled. I'm attaching sample files, including the config file I use to disable Tesseract. I get the same results whether I run the server version (tika-server-standard-2.0.0.jar) or the command-line app (tika-app-2.0.0.jar).

      The following commands extract the text, along with what appears to be a list of the .iwa and .jpg files contained inside the Pages and Numbers files:

      java -jar ~/tika-app-2.0.0.jar -t ~/SSN.pages

      java -jar ~/tika-app-2.0.0.jar -t ~/SSN.numbers

      However, when I run the same commands with the configuration file that disables Tesseract, only the list of .iwa and .jpg files is extracted, and none of the actual text:

      java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.pages

      java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.numbers
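
      To make the difference between the two runs easier to see, the commands above could be wrapped in a small script like the sketch below. It is only an illustration: the contents of the attached no_ocr.xml are not shown in this report, so the config written here (a DefaultParser entry with a parser-exclude for TesseractOCRParser) is an assumption about what such a file might contain. The function names, the TIKA_JAR variable, and the no_ocr_sketch.xml file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare Tika text output with and without a Tesseract-excluding config.
# The jar name, flags, and sample paths follow the report; the rest is assumed.

TIKA_JAR="${TIKA_JAR:-$HOME/tika-app-2.0.0.jar}"

# Write a hypothetical config that excludes TesseractOCRParser from the
# default parser. The attached no_ocr.xml may differ from this.
write_no_ocr_config() {
  cat > "$1" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
EOF
}

# Extract text from one file twice (default config, then the given config)
# and report whether the two outputs differ.
compare_extraction() {
  local input="$1" config="$2" out_dir
  out_dir="$(mktemp -d)"
  java -jar "$TIKA_JAR" -t "$input" > "$out_dir/default.txt"
  java -jar "$TIKA_JAR" --config="$config" -t "$input" > "$out_dir/no_ocr.txt"
  # diff exits non-zero when the outputs differ, which is the reported symptom.
  diff "$out_dir/default.txt" "$out_dir/no_ocr.txt" || echo "Outputs differ for $input"
}

# Only run the comparison when the Tika jar is actually present.
if [ -f "$TIKA_JAR" ]; then
  write_no_ocr_config no_ocr_sketch.xml
  for f in "$HOME/SSN.pages" "$HOME/SSN.numbers"; do
    echo "== $f =="
    compare_extraction "$f" no_ocr_sketch.xml
  done
else
  echo "Tika jar not found at $TIKA_JAR; set TIKA_JAR to run the comparison." >&2
fi
```

      If the behavior described here is reproduced, the diff should show the document text present only in the default run, while the .iwa and .jpg names appear in both.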

       

      I haven't seen similar problems with the other file types I've tested, including .docx, .pptx, .xlsx, .odt, .ods, .odp, and .pdf. Those work fine whether or not Tesseract is disabled.

       

      On a somewhat separate issue, I have been unable to extract any text at all from my test Keynote file, whether Tesseract is enabled or not. I'm having difficulty uploading that file, so I'll see if I can add it later.

       

       

      Attachments

        1. SSN.pages
          814 kB
          Chris Bryant
        2. SSN.numbers
          654 kB
          Chris Bryant
        3. no_ocr.xml
          0.2 kB
          Chris Bryant
        4. Document.iwa
          9 kB
          Tim Allison
        5. Document
          16 kB
          Tim Allison

          People

            Assignee: Unassigned
            Reporter: Chris Bryant (heyyoucb)
