[TIKA-3517] Text extraction doesn't work for Pages and Numbers when Tesseract is disabled - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.0
Fix Version/s: None
Component/s: None
Labels: None
Environment:

I tested this on RHEL7. I got the same results whether I was using Tesseract 3 or Tesseract 4, but that doesn't really matter because the problems I'm having are when Tesseract is disabled.

Description

When I run Tika to extract text from Mac Pages and Numbers files, the text extraction does not work if Tesseract is disabled. I'm attaching sample files, along with the config file I use to disable Tesseract. I get the same results whether I run the server version (tika-server-standard-2.0.0.jar) or the command-line app (tika-app-2.0.0.jar).
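For reference, a Tika config that turns off Tesseract usually excludes TesseractOCRParser from the DefaultParser. The sketch below writes such a file. It is an assumption about what the attached no_ocr.xml looks like, since that file's contents are not reproduced here, and the file name no_ocr_example.xml is hypothetical.

```shell
#!/bin/sh
# Sketch: write a Tika config that excludes the Tesseract OCR parser.
# Assumption: the attached no_ocr.xml is similar; its real contents are not shown in this report.
# The output file name no_ocr_example.xml is hypothetical.
cat > no_ocr_example.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
EOF
```

A file like this would be passed to tika-app with --config=no_ocr_example.xml.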

The following commands extract the text, along with what appears to be a list of the .iwa and .jpg files inside the Pages and Numbers files:

java -jar ~/tika-app-2.0.0.jar -t ~/SSN.pages

java -jar ~/tika-app-2.0.0.jar -t ~/SSN.numbers

However, when I run the following commands with the configuration file that disables Tesseract, only the list of .iwa and .jpg files is extracted, and none of the actual text:

java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.pages

java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.numbers
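To make the difference between the two runs easy to see, the outputs can be diffed. The helper below is a minimal sketch: the function name compare_extraction is hypothetical, and it uses only the -t and --config options shown in the commands above.

```shell
#!/bin/sh
# Sketch: diff Tika's text output with and without a config file.
# Usage: compare_extraction JAR CONFIG FILE
# Returns diff's exit status: 0 if the outputs match, 1 if they differ.
compare_extraction() {
    jar=$1
    config=$2
    file=$3
    default_out=$(mktemp) || return 2
    config_out=$(mktemp) || return 2

    # Run once with the default parsers, once with the supplied config.
    java -jar "$jar" -t "$file" > "$default_out"
    java -jar "$jar" --config="$config" -t "$file" > "$config_out"

    diff "$default_out" "$config_out"
    status=$?
    rm -f "$default_out" "$config_out"
    return "$status"
}
```

For example, `compare_extraction ~/tika-app-2.0.0.jar no_ocr.xml ~/SSN.pages` would print the lines that appear in only one of the two outputs.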

I haven't seen similar problems with the other file types I've tested, including .docx, .pptx, .xlsx, .odt, .ods, .odp, and .pdf. Those work fine whether or not Tesseract is disabled.

On a somewhat separate issue, I have been unable to get any text extracted from my test Keynote file at all, whether Tesseract is enabled or not. I'm having difficulty uploading that file, so I'll see if I can add that later.

Attachments

Document, 09/Aug/21 20:16, 16 kB, Tim Allison
Document.iwa, 09/Aug/21 20:16, 9 kB, Tim Allison
no_ocr.xml, 06/Aug/21 21:11, 0.2 kB, Chris Bryant
SSN.numbers, 06/Aug/21 21:09, 654 kB, Chris Bryant
SSN.pages, 06/Aug/21 21:09, 814 kB, Chris Bryant

Issue Links

is related to TIKA-1358 Add support for newer iWork file formats (Open)

People

Assignee: Unassigned
Reporter: Chris Bryant
Votes: 0
Watchers: 3

Dates

Created: 06/Aug/21 21:28
Updated: 09/Aug/21 20:27