[TIKA-1454] Extracting as HTML loses links in xlsx, ppt, and pptx files - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
Fix Version/s: 1.17, 2.0.0-BETA, 2.1.0
Component/s: None
Labels: None
Environment: RedHat EL5, EL6, EL7

Description

I am converting documents to HTML and then scanning the HTML for anchor tags to find links to external URLs. This works for several document types: PDF, OpenDocument formats, Microsoft Word (.doc and .docx), and the older Microsoft Excel .xls format. It does not work for either Microsoft PowerPoint format (.ppt or .pptx), nor for the newer Excel .xlsx format. For .ppt, .pptx, and .xlsx files, the text is extracted correctly and formatted as HTML, but links are not converted to anchor tags.

I am running Tika in --server --html mode.

I have attached sample .xlsx, .ppt, and .pptx files whose links are not extracted, along with sample .ods and .odp files whose links are extracted correctly.
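
For anyone trying to reproduce this, the link-scanning step described above could be sketched roughly as follows. This is only an illustration using the Python standard library, not the code actually used for this report; the names `AnchorCollector` and `external_links` and the two sample HTML strings are hypothetical stand-ins for real Tika output.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags that point to http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and urlparse(value).scheme in ("http", "https"):
                self.links.append(value)


def external_links(html):
    """Return the external link targets found in anchor tags of an HTML string."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical samples (not real Tika output): one where the link became an
# anchor tag, and one where the URL survived only as plain text.
with_anchor = '<html><body><p><a href="http://example.com/">example</a></p></body></html>'
plain_text = "<html><body><p>http://example.com/</p></body></html>"

print(external_links(with_anchor))
print(external_links(plain_text))
```

With a scan like this, a document whose link comes through only as text yields an empty list, which is the symptom seen for the .xlsx, .ppt, and .pptx samples.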

Attachments

- testurl.xlsx, 22/Oct/14 19:02, 28 kB, Chris Bryant
- testurl.ods, 22/Oct/14 19:02, 9 kB, Chris Bryant
- urltest.pptx, 22/Oct/14 19:02, 37 kB, Chris Bryant
- urltest.ppt, 22/Oct/14 19:02, 46 kB, Chris Bryant
- urltest.odp, 22/Oct/14 19:02, 12 kB, Chris Bryant

People

Assignee: Tim Allison

Reporter: Chris Bryant

Votes: 0

Watchers: 3

Dates

Created: 22/Oct/14 19:01

Updated: 03/Oct/23 19:55

Resolved: 03/Oct/23 19:55