TIKA-1255: WordExtractor - bold hyperlink not closed properly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3, 1.4, 1.5
    • Fix Version/s: 2.0, 1.14
    • Component/s: parser
    • Labels: None
    • Environment: Any

      Description

      If a Word document contains a bold hyperlink, the resulting xhtml is:

      <a href="http://www.testdomain.com/support/workcentre-7232-7242/file-download/enus.html?operatingSystem=macosx108&amp;fileLanguage=en&amp;contentId=126220&amp;from=downloads&amp;viewArchived=false"><b>Test link</a></b>

      The closing </b> and </a> tags are transposed, so the elements are improperly nested and the output is not well-formed XHTML.
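
      The nesting problem described above can be illustrated with a small sketch. This is not the patch attached to this issue or the fix that was later committed; NestedTagWriter and its methods are hypothetical names invented for this example, and the URL is a shortened form of the one in the report. The idea is to track open elements on a stack so that closing an outer element first closes whatever was opened inside it, and to use the JDK's SAX parser to check whether a fragment is well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not part of Tika): closes elements in reverse order of opening. */
class NestedTagWriter {
    private final StringBuilder out = new StringBuilder();
    private final Deque<String> open = new ArrayDeque<>();

    /** Opens an element; attrs is raw, already-escaped attribute text, or "". */
    NestedTagWriter start(String name, String attrs) {
        out.append('<').append(name);
        if (!attrs.isEmpty()) {
            out.append(' ').append(attrs);
        }
        out.append('>');
        open.push(name);
        return this;
    }

    /** Appends character data; the caller is assumed to have escaped it. */
    NestedTagWriter text(String escaped) {
        out.append(escaped);
        return this;
    }

    /** Closes the named element, first closing anything opened inside it. */
    NestedTagWriter end(String name) {
        if (!open.contains(name)) {
            return this; // already closed or never opened: ignore the request
        }
        String top;
        do {
            top = open.pop();
            out.append("</").append(top).append('>');
        } while (!top.equals(name));
        return this;
    }

    @Override
    public String toString() {
        return out.toString();
    }

    /** Returns true if the fragment parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // e.g. mismatched end tags
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

      With this sketch, opening "a" and then "b", writing the text, and calling end("a") emits </b> before </a>, and a later end("b") is ignored. Passing the transposed markup from the report to isWellFormed returns false, while the writer's output passes the check.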

      Attachments

      1. example.doc (56 kB, Alan Hunter)
      2. testWORD_bold_hyperlink.doc (26 kB, Alan Hunter)
      3. testWORD_italic_hyperlink.doc (26 kB, Alan Hunter)
      4. testWORD_strikethrough_hyperlink.doc (26 kB, Alan Hunter)
      5. WordExtractor.java (26 kB, Alan Hunter)
      6. WordParserTest.java (17 kB, Alan Hunter)

        Activity

        alanhunter Alan Hunter added a comment -

        An example with a bold hyperlink

        alanhunter Alan Hunter added a comment -

        Suggested patch, test and test documents

        alanhunter Alan Hunter added a comment -

        I have attached a suggested fix, test and test documents to improve the resilience of the Word parser when handling styled hyperlinks

        tpalsulich Tyler Palsulich added a comment -

        Hi Alan Hunter. Sorry no one ever got back to you on this! Can you please attach your changes as a patch (see the Submitting Enhancements and Fixes section of the contributing page)? Thank you!

        shabirbhat Shabir Bhat added a comment -

        Hi,

        This issue still exists. Is there a workaround?

        Thanks

        tallison@mitre.org Tim Allison added a comment -

        Thank you for the ping. Let me know if my fixes didn't fix your problem.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1097 (See https://builds.apache.org/job/Tika-trunk/1097/)
        TIKA-1255 – fix hyperlinks in doc/docx if there is formatting TIKA-2078 (tallison: rev 80efc84b675c8defa5e86b01b85e1dabc84d32f5)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
        • (add) tika-parsers/src/test/resources/test-documents/testWORD_boldHyperlink.docx
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • (add) tika-parsers/src/test/resources/test-documents/testWORD_boldHyperlink.doc
        • (edit) CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build tika-2.x-windows #44 (See https://builds.apache.org/job/tika-2.x-windows/44/)
        TIKA-1255 and TIKA-2078 – fix hyperlinks that include formatting and (tallison: rev 4636f95b2a122a7d52f3b12956c5eeb1f34a8b0e)

        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
        • (add) tika-test-resources/src/test/resources/test-documents/testWORD_boldHyperlink.docx
        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
        • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
        • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • (add) tika-test-resources/src/test/resources/test-documents/testWORD_boldHyperlink.doc
        hudson Hudson added a comment -

        ABORTED: Integrated in Jenkins build tika-2.x #140 (See https://builds.apache.org/job/tika-2.x/140/)
        TIKA-1255 and TIKA-2078 – fix hyperlinks that include formatting and (tallison: rev 4636f95b2a122a7d52f3b12956c5eeb1f34a8b0e)

        • (add) tika-test-resources/src/test/resources/test-documents/testWORD_boldHyperlink.docx
        • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
        • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • (add) tika-test-resources/src/test/resources/test-documents/testWORD_boldHyperlink.doc

          People

          • Assignee: Tim Allison (tallison@mitre.org)
          • Reporter: Alan Hunter (alanhunter)
          • Votes: 1
          • Watchers: 10
