[TIKA-3657] Microsoft documents are not text parsed when running under Docker - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.2.0, 2.2.1
Fix Version/s: None
Component/s: config, core, dependency
Labels: None

Description

We use EmbeddedDocumentExtractor, with this code:

NalyticsEmbeddedDocumentExtractor nalyticsEmbeddedDocumentExtractor = new NalyticsEmbeddedDocumentExtractor(this);

this.context.set(EmbeddedDocumentExtractor.class, nalyticsEmbeddedDocumentExtractor);

This all works fine for us and has been used in production for a few years. It also works under Tika 2.2.0 when running in development environments (Eclipse, Apache Tomcat). However, when running under Docker, the text within Microsoft documents (Word etc.) is not parsed. Under Tika 2.1.0, under Docker, the Microsoft documents are fully parsed, so this problem was introduced in 2.2.0.

Interestingly, I found that if anything at all is added to the context via context.set, the same problem occurs. The same problem also occurs if the standard Tika Embedded Document Extractor is used. Our Docker image contains our application's code, which uses Tika, as well as Apache DS. The problem occurs when running Docker on Ubuntu, macOS and Windows.
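For readers unfamiliar with the context.set call mentioned above, here is a minimal, self-contained sketch of a type-keyed context with a default fallback. It is an illustrative model only, not Tika's source code: the names ContextSketch, Context, Extractor and resolve are hypothetical, and the real ParseContext and EmbeddedDocumentExtractor may behave differently.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of a type-keyed parse context.
// This is NOT Tika's code; all names here are assumptions for illustration.
public class ContextSketch {

    /** Hypothetical stand-in for an embedded-document extractor. */
    interface Extractor {
        String name();
    }

    /** Stores one object per registered type, keyed by the type's name. */
    static class Context {
        private final Map<String, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            if (value != null) {
                entries.put(key.getName(), value);
            } else {
                entries.remove(key.getName());
            }
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key.getName());
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** A consumer looks up its extractor and falls back to a default one. */
    static String resolve(Context context) {
        Extractor fallback = () -> "default";
        return context.get(Extractor.class, fallback).name();
    }

    public static void main(String[] args) {
        Context context = new Context();
        System.out.println(resolve(context)); // prints "default"

        // An entry of an unrelated type does not affect the extractor lookup.
        context.set(String.class, "unrelated");
        System.out.println(resolve(context)); // prints "default"

        // Registering an extractor replaces the fallback.
        context.set(Extractor.class, () -> "custom");
        System.out.println(resolve(context)); // prints "custom"
    }
}
```

In this simplified model an unrelated entry leaves the extractor lookup unchanged, so the lookup mechanism alone would not account for the observation that adding anything to the context triggers the problem.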

Attachments

- POIFSContainerDetector.java, 16 kB, 02/Feb/22 10:39, Tim Barrett
- scenario traces.txt, 9 kB, 28/Jan/22 11:39, Tim Barrett
- tika-config.xml, 0.8 kB, 21/Jan/22 08:21, Tim Barrett

Issue Links

is related to: TIKA-3672 Bump markLimit in POIFSDetector (Resolved)

People

Assignee: Unassigned

Reporter: Tim Barrett

Votes: 0

Watchers: 2

Dates

Created: 21/Jan/22 08:21

Updated: 04/Feb/22 11:14