Tika / TIKA-1454

Extracting as HTML loses links in xlsx, ppt, and pptx files

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
    • Fix Version/s: 1.16
    • Component/s: None
    • Labels: None
    • Environment: RedHat EL5, EL6, EL7

      Description

      I am converting documents to HTML and then searching the HTML for anchor tags to find links to external URLs. This works for some document types, including PDFs, Open Document formats, the Microsoft Word formats .doc and .docx, and the older Microsoft Excel .xls format. It does not work for either Microsoft PowerPoint format (.ppt or .pptx) or for the newer Excel .xlsx format. For .ppt, .pptx, and .xlsx, the text is extracted properly and formatted into HTML, but the link is not converted to an anchor tag.

      I am running Tika in --server --html mode.

      I included samples of .xlsx, .ppt, and .pptx files that do not properly extract links, and also included samples of .ods and .odp files that do extract links properly.
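
      To make the check above concrete, here is a minimal sketch of the kind of scan described: it collects http/https targets from anchor tags in an HTML string. The class name AnchorScanner and its method are hypothetical and are not part of Tika. A regex is used only to keep the sketch short; a real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Tika): collects external link targets from HTML. */
final class AnchorScanner {

    // Matches the href value of an <a> tag, quoted with either " or '.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns every href that starts with http:// or https://, in document order. */
    static List<String> externalLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String target = m.group(2);
            if (target.startsWith("http://") || target.startsWith("https://")) {
                links.add(target);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Illustrative input only; not actual Tika output for the attached files.
        String html = "<html><body><p><a href=\"http://tika.apache.org/\">Tika</a></p>"
                + "<p>http://tika.apache.org/ as plain text, no anchor</p></body></html>";
        System.out.println(externalLinks(html));
    }
}
```

      With the behaviour reported here, a URL that appears only as plain text (the second paragraph in the sample string) is not picked up by such a scan, which is why the missing anchor tags matter.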

      1. testurl.ods
        9 kB
        Chris Bryant
      2. testurl.xlsx
        28 kB
        Chris Bryant
      3. urltest.odp
        12 kB
        Chris Bryant
      4. urltest.ppt
        46 kB
        Chris Bryant
      5. urltest.pptx
        37 kB
        Chris Bryant

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Thank you for opening this issue and supplying test docs. For ppt and pptx, I have a reasonable patch. We'll need to add some things into POI to make the extraction cleaner, but this should be good to go soonish.

        For xlsx, it looks like we'll have to dump the hyperlinks at the bottom of each sheet. Putting them in the proper cells would require a double pass: one to cache the hyperlinks and one to insert them. Not great, but at least we should be able to get the hyperlinks for your purposes.

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #989 (See https://builds.apache.org/job/tika-trunk-jdk1.7/989/)
        TIKA-1454 – added initial hyperlink extraction for ppt, pptx, xlsx. (tallison: rev 69852e4cb55d34e6513e0b66af7d75cb1b1408ba)

        • tika-parsers/src/test/resources/test-documents/testEXCEL_hyperlinks.xlsx
        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
        • tika-parsers/src/test/resources/test-documents/testEXCEL_hyperlinks.xls
        tallison@mitre.org Tim Allison added a comment -

        I added preliminary extraction from xlsx, ppt, pptx.

        For ppt and pptx, it would be helpful if we could distinguish external links (actual hyperlinks) from internal ones (references to a footnote). That distinction will have to be made at the POI level. For now, there's a bit of a hack to make the distinction and only href-ify external links.
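
        The hack itself is not shown in this thread. Purely as an illustration of one way such a distinction could be drawn, the sketch below treats a target as external only if it parses as a URI with a well-known scheme. The class LinkKind, the method isExternal, and the scheme list are all assumptions and do not come from the Tika or POI source.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical heuristic (not the actual Tika code): is a link target external? */
final class LinkKind {

    // Assumed list of schemes that count as "actual hyperlinks".
    private static final Set<String> EXTERNAL_SCHEMES =
            new HashSet<>(Arrays.asList("http", "https", "ftp", "mailto"));

    static boolean isExternal(String target) {
        if (target == null) {
            return false;
        }
        try {
            String scheme = new URI(target.trim()).getScheme();
            return scheme != null
                    && EXTERNAL_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Targets that are not valid URIs (e.g. a slide title) are treated as internal.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isExternal("http://tika.apache.org/")); // external
        System.out.println(isExternal("#slide3"));                 // internal: no scheme
    }
}
```

        A heuristic like this only inspects the target string, so it can misclassify unusual targets; a flag exposed by POI would be the cleaner fix, as noted above.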

        For xlsx, for now, we are dumping the hyperlinks at the bottom of each sheet. If we ran the sheet reader twice, we'd be able to cache the hyperlinks and put them in the cells in which they belong. I'm not sure we want to add that double parsing unless there is demand.
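
        As a rough sketch of the trade-off described above, the following compares the two strategies on plain maps rather than on POI objects. It assumes the hyperlinks for a sheet are available as a map from cell reference to URL; the class SheetLinkSketch and its methods are hypothetical, and HTML escaping is omitted for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration (not Tika code) of the two xlsx hyperlink strategies. */
final class SheetLinkSketch {

    /** Single pass: emit cells as they stream by, then append all links at the end. */
    static String linksAtBottom(Map<String, String> cells, Map<String, String> links) {
        StringBuilder out = new StringBuilder();
        for (String text : cells.values()) {
            out.append("<td>").append(text).append("</td>\n");
        }
        for (String url : links.values()) {
            out.append("<div><a href=\"").append(url).append("\">")
               .append(url).append("</a></div>\n");
        }
        return out.toString();
    }

    /** Double pass: links were cached by cell reference first, so each cell can be wrapped. */
    static String linksInCells(Map<String, String> cells, Map<String, String> links) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> cell : cells.entrySet()) {
            String url = links.get(cell.getKey()); // look up the cached link for this cell
            out.append("<td>");
            if (url != null) {
                out.append("<a href=\"").append(url).append("\">")
                   .append(cell.getValue()).append("</a>");
            } else {
                out.append(cell.getValue());
            }
            out.append("</td>\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("A1", "Tika");
        cells.put("B1", "plain");
        Map<String, String> links = new LinkedHashMap<>();
        links.put("A1", "http://tika.apache.org/");

        System.out.print(linksAtBottom(cells, links));
        System.out.print(linksInCells(cells, links));
    }
}
```

        The single-pass variant keeps every URL but loses its association with a cell; the double-pass variant preserves that association at the cost of reading the sheet twice.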

        For xls, I found no way to extract a hyperlink associated with a text box. I have no doubt that there is a way; I just couldn't find it.

        We could add more tests for ppt and pptx.

        I would close this issue now, but we also have to add extraction for ods and odp.

        hudson Hudson added a comment -

        UNSTABLE: Integrated in tika-2.x #91 (See https://builds.apache.org/job/tika-2.x/91/)
        TIKA-1454: extract hyperlinks from ppt, pptx and xlsx (tallison: rev 229329d6ea58d5ef90aef7887bdf444463aed127)

        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
        • CHANGES.txt
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
        • tika-parser-modules/tika-parser-package-module/src/test/java/org/apache/tika/parser/pkg/ArParserTest.java
        TIKA-1454: extract hyperlinks from ppt, pptx and xlsx – undo ignoring (tallison: rev 6f5e7f94e6f4f01b4d2a7c453d025f0d1750817a)
        • tika-parser-modules/tika-parser-package-module/src/test/java/org/apache/tika/parser/pkg/ArParserTest.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #990 (See https://builds.apache.org/job/tika-trunk-jdk1.7/990/)
        TIKA-1454 – clean up and add entry to CHANGES.txt (tallison: rev bb78082b0028b57a5bf1ae30858dda6aebeacf63)

        • CHANGES.txt
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #92 (See https://builds.apache.org/job/tika-2.x/92/)
        TIKA-1454: extract hyperlinks from ppt, pptx and xlsx – actually add (tallison: rev c53d5385aac0bad7fec8e86fa0c455790006b437)

        • tika-test-resources/src/test/resources/test-documents/testEXCEL_hyperlinks.xlsx
        • tika-test-resources/src/test/resources/test-documents/testEXCEL_hyperlinks.xls

          People

          • Assignee: Tim Allison (tallison@mitre.org)
          • Reporter: Chris Bryant (cbryant)
          • Votes: 0
          • Watchers: 3
