[TIKA-1109] Metadata not extracted before the content in OOXML (pptx) - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5
Component/s: parser
Labels:
- patch

Description

It seems that when processing OOXML documents, the metadata is only read after the text. This makes it impossible to use the metadata while processing the text. I think it would be more useful to have the metadata populated first.
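To illustrate why the ordering matters, here is a minimal sketch using only the JDK's SAX classes. It is not Tika's code: `MetadataOrderDemo`, `TitleAwareHandler`, and the plain `Map` standing in for the metadata object are hypothetical names chosen for this example. The handler looks up the title at the moment it receives text, so it can only see values that were populated before the content events were fired.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

class MetadataOrderDemo {

    /** Hypothetical handler: prefixes each text chunk with the title visible at that moment. */
    static class TitleAwareHandler extends DefaultHandler {
        private final Map<String, String> metadata;
        final StringBuilder out = new StringBuilder();

        TitleAwareHandler(Map<String, String> metadata) {
            this.metadata = metadata;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The lookup happens while the text is being processed.
            String title = metadata.getOrDefault("dc:title", "<unknown>");
            out.append('[').append(title).append("] ").append(ch, start, length);
        }
    }

    /** Simulates a parser that fills the metadata either before or after emitting the text. */
    static String parse(boolean metadataFirst) {
        Map<String, String> metadata = new LinkedHashMap<>();
        TitleAwareHandler handler = new TitleAwareHandler(metadata);
        char[] text = "slide text".toCharArray();

        if (metadataFirst) {
            metadata.put("dc:title", "Attachment Test");
        }
        handler.characters(text, 0, text.length);
        if (!metadataFirst) {
            // Too late: the handler has already consumed the text.
            metadata.put("dc:title", "Attachment Test");
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(parse(false)); // metadata populated after the text
        System.out.println(parse(true));  // metadata populated before the text
    }
}
```

In the first call the handler cannot see the title, even though the same value ends up in the map once parsing has finished.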

As a symptom:

java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx

outputs only as metadata:

while there is more metadata in the file (e.g. <dc:title>Attachment Test</dc:title>).
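For readers who want to check where such a value lives: in an OOXML package the Dublin Core properties are stored in their own part, conventionally `docProps/core.xml`, separate from the slide content. The sketch below reads `dc:title` from that part using only the JDK. It is not the attached patch and does not use Tika; `CorePropertiesPeek`, `readTitle`, and `sampleContainer` are hypothetical names, the part path is assumed rather than resolved through the package relationships, and the sample package is built in memory (with the title quoted above) instead of reading testPPT.pptx.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class CorePropertiesPeek {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    // Assumption: the conventional location of the core properties part.
    static final String CORE_PART = "docProps/core.xml";

    /** Returns dc:title from the core properties part, or null if it is absent. */
    static String readTitle(byte[] ooxml) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(ooxml))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!CORE_PART.equals(e.getName())) {
                    continue; // skip slides and other parts
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(zip.readAllBytes()));
                NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
                return titles.getLength() > 0 ? titles.item(0).getTextContent() : null;
            }
            return null;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not read core properties", ex);
        }
    }

    /** Builds a tiny stand-in package: one dummy slide part plus the core properties part. */
    static byte[] sampleContainer() {
        String coreXml = "<cp:coreProperties"
                + " xmlns:cp=\"http://schemas.openxmlformats.org/package/2006/metadata/core-properties\""
                + " xmlns:dc=\"" + DC_NS + "\">"
                + "<dc:title>Attachment Test</dc:title>"
                + "</cp:coreProperties>";
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("ppt/slides/slide1.xml"));
            zip.write("<slide/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry(CORE_PART));
            zip.write(coreXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(readTitle(sampleContainer()));
    }
}
```

Because the properties sit in a separate part, they can be read without touching the slide text, which is what makes populating the metadata first feasible.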

Attachments

TIKA-1109.patch
27/Jun/13 12:26
6 kB
Daniel Bonniot de Ruisselet

Issue Links

is related to

STANBOL-1171 Update to Tika 1.5

Resolved

People

Assignee: Unassigned
Reporter: Daniel Bonniot de Ruisselet
Votes: 0
Watchers: 4

Dates

Created: 18/Apr/13 11:06
Updated: 23/Apr/14 06:25
Resolved: 27/Jun/13 12:41