Tika / TIKA-1454

Extracting as HTML loses links in xlsx, ppt, and pptx files

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
    • Fix Version/s: 1.17, 2.0.0-BETA, 2.1.0
    • Component/s: None
    • Labels: None
    • Environment: RedHat EL5, EL6, EL7

    Description

      I am converting documents to HTML and then scanning the HTML for anchor tags to find links to external URLs. This works for several document types, including PDF, the Open Document formats, Microsoft Word (.doc and .docx), and the older Microsoft Excel .xls format. It does not work for either Microsoft PowerPoint format (.ppt or .pptx) or for the newer Excel .xlsx format. For .ppt, .pptx, and .xlsx files, the text is extracted properly and formatted as HTML, but the links are not converted to anchor tags.

      I am running Tika with the --server and --html options.

      I have attached sample .xlsx, .ppt, and .pptx files whose links are not extracted, along with sample .ods and .odp files whose links are extracted properly.
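
The link check described above can be sketched roughly as follows. This is only an illustration, assuming the HTML returned by Tika is already available as a string; the names AnchorCollector and external_links are hypothetical and not part of Tika, and only Python's standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper, not a Tika API)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def external_links(html_text):
    """Return the http(s) URLs found in anchor tags of the given HTML string."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return [h for h in collector.hrefs if urlparse(h).scheme in ("http", "https")]
```

With output from an affected .xlsx, .ppt, or .pptx file, a check like this would be expected to return an empty list even though the URL text itself appears in the extracted HTML, which is the symptom reported here.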

      Attachments

        1. testurl.xlsx
          28 kB
          Chris Bryant
        2. testurl.ods
          9 kB
          Chris Bryant
        3. urltest.pptx
          37 kB
          Chris Bryant
        4. urltest.ppt
          46 kB
          Chris Bryant
        5. urltest.odp
          12 kB
          Chris Bryant

        Activity

          People

            tallison Tim Allison
            cbryant Chris Bryant