Tika / TIKA-2242

opendocument parsing produces malformed xml

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13, 1.14
    • Fix Version/s: 2.0, 1.15
    • Component/s: handler, parser
    • Labels:
      None

      Description

      For some ODT documents, parsing produces malformed XML.

      1. 2017-01-02-16B833-16B833VANCAUTEREN.odt
        17 kB
        Jan Van Raemdonck
      2. 2017-02-01-15B96Ghijsens-17B96GHIJSENS.odt
        17 kB
        Jan Van Raemdonck

          Activity

          tallison@mitre.org Tim Allison added a comment -

          Bad nesting of markup or something else?

          <p><b><u>name</b>, advocaat ... persoon.</u></p>
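
A quick way to confirm that a snippet like the one above is not well-formed is to run it through a SAX parser. The following is only a minimal sketch using the JDK's built-in XML parser; the class name `WellFormedCheck` and its method are hypothetical and are not part of Tika.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): reports whether a string is well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler rethrows fatal errors, so mismatched tags surface as SAXException.
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The badly nested output quoted in this ticket.
        System.out.println(isWellFormed("<p><b><u>name</b>, advocaat ... persoon.</u></p>"));
        // One possible well-nested equivalent.
        System.out.println(isWellFormed("<p><b><u>name</u></b><u>, advocaat ... persoon.</u></p>"));
    }
}
```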
          
          jvanraemdonck Jan Van Raemdonck added a comment -

          Yes, the bad nesting is the issue I'm having.

          tallison@mitre.org Tim Allison added a comment -

          Thank you. The attached test file also shows that we need to handle paragraph level default styles "P1", etc...

          Will work on both.

          tallison@mitre.org Tim Allison added a comment -

          And we have to handle style:text-underline-style="none".
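
One way to think about both points (nesting and `style:text-underline-style="none"`) is to close every open inline tag before each styled text run and reopen only the tags the new run needs. The sketch below illustrates that idea only; it is not the code in `OpenDocumentContentParser`, and the class `InlineStyleWriter` and its methods are hypothetical names. It assumes each run arrives with a bold flag and the raw underline-style value (or `null` when the attribute is absent).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch (not Tika code): emits well-nested <b>/<u> tags for styled text runs.
final class InlineStyleWriter {
    private final StringBuilder out = new StringBuilder();
    private final Deque<String> openTags = new ArrayDeque<>();

    // An absent attribute or the value "none" both mean "not underlined".
    static boolean isUnderlined(String underlineStyle) {
        return underlineStyle != null && !"none".equals(underlineStyle);
    }

    // Writes one text run, closing the previous run's tags first so tags never overlap.
    void text(String text, boolean bold, String underlineStyle) {
        closeAll();
        if (bold) {
            open("b");
        }
        if (isUnderlined(underlineStyle)) {
            open("u");
        }
        out.append(text.replace("&", "&amp;").replace("<", "&lt;"));
    }

    String finish() {
        closeAll();
        return out.toString();
    }

    private void open(String tag) {
        out.append('<').append(tag).append('>');
        openTags.push(tag);
    }

    // Closes tags in reverse order of opening (last opened, first closed).
    private void closeAll() {
        while (!openTags.isEmpty()) {
            out.append("</").append(openTags.pop()).append('>');
        }
    }

    public static void main(String[] args) {
        InlineStyleWriter writer = new InlineStyleWriter();
        writer.text("name", true, "solid");
        writer.text(", advocaat", false, "solid");
        writer.text(" persoon.", false, "none");
        System.out.println(writer.finish());
    }
}
```

Closing and reopening on every run produces more tags than strictly necessary, but it keeps the output well-formed without tracking which styles carry over between runs.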

          tallison@mitre.org Tim Allison added a comment -

          Thank you for opening this and submitting a test document!

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x #200 (See https://builds.apache.org/job/tika-2.x/200/)
          TIKA-2242 fix style markup in ODT (tallison: rev 4374bcecf4600664998cd08bd1c6b044b9acded7)

          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
          • (edit) CHANGES.txt
          • (add) tika-test-resources/src/test/resources/test-documents/testODTStyles2.odt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1181 (See https://builds.apache.org/job/Tika-trunk/1181/)
          TIKA-2242 – fix style markup in ODT (tallison: rev 8a04f207e8c4e2a87e61930e0e28be48ce99d20c)

          • (edit) CHANGES.txt
          • (add) tika-parsers/src/test/resources/test-documents/testODTStyles2.odt
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java
          jvanraemdonck Jan Van Raemdonck added a comment -

          Found another instance of this issue in one of the odt documents. This time it is a div tag that is causing the issue:
          <p><b>WOUTERS Rolf<div><p />Beschermde persoon is overleden</p>
          </div>
          I will attach the new document to the ticket.

          tallison@mitre.org Tim Allison added a comment -

          Thank you for re-opening this. The second document shows we need some non-trivial fixes. We're mapping note, annotation and notes to <div>. However, as your document shows, these can occur within a <p/> and contain their own <p/>. We'll want to avoid putting a <p/> inside a <div/> and a <p/> inside a <p/>.

          For example:

          <p>text<div><p>this is an annotation</p></div></p>.
          

          Should we map note, annotation and notes to <span/> instead of <div/>? Or, should we close the <p> when we hit an annotation and friends?

          jvanraemdonck Jan Van Raemdonck added a comment -

          Hi Tim, sorry for the delay in replying!
          For us it would be best if the annotation and friends would be mapped to <span> tags.
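
The span option could look roughly like the sketch below. It is an illustration under stated assumptions, not the fix that was committed: the class `NoteAwareWriter`, its method names, and the `class="annotation"` attribute are all hypothetical. A note that starts inside an open paragraph becomes a `<span>`, and paragraphs nested inside it are not emitted, so no `<p>` ends up inside another `<p>`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch (not Tika code): maps notes/annotations to <span> when inside a paragraph.
final class NoteAwareWriter {
    private final StringBuilder out = new StringBuilder();
    private final Deque<String> noteTags = new ArrayDeque<>();
    private int paragraphDepth = 0;

    void startParagraph() {
        // Only the outermost paragraph is written; nested ones are flattened.
        if (paragraphDepth == 0) {
            out.append("<p>");
        }
        paragraphDepth++;
    }

    void endParagraph() {
        paragraphDepth--;
        if (paragraphDepth == 0) {
            out.append("</p>");
        }
    }

    void startNote() {
        // Inline element inside a paragraph, block element otherwise.
        String tag = paragraphDepth > 0 ? "span" : "div";
        noteTags.push(tag);
        out.append('<').append(tag).append(" class=\"annotation\">");
    }

    void endNote() {
        out.append("</").append(noteTags.pop()).append('>');
    }

    void text(String text) {
        out.append(text.replace("&", "&amp;").replace("<", "&lt;"));
    }

    @Override
    public String toString() {
        return out.toString();
    }

    public static void main(String[] args) {
        NoteAwareWriter writer = new NoteAwareWriter();
        writer.startParagraph();
        writer.text("text");
        writer.startNote();
        writer.startParagraph();
        writer.text("this is an annotation");
        writer.endParagraph();
        writer.endNote();
        writer.endParagraph();
        System.out.println(writer);
    }
}
```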

          tallison@mitre.org Tim Allison added a comment -

          Ha. Sorry, now, for my delay. I'll work on this over the next few days in coordination with TIKA-2264.

          jvanraemdonck Jan Van Raemdonck added a comment -

          Hi Tim, did you have a chance to work on this issue yet?

          tallison@mitre.org Tim Allison added a comment -

          No. Sorry. Working on other tasks. I'm hoping to get to this on Wed.

          tallison@mitre.org Tim Allison added a comment -

          The issue in the second triggering file is now fixed.

          And, yes, it turns out that Wed in open-source time looks like Friday, and "over the next few days" = "next week, maybe". Sorry.

          Please let us know what else you find. Cheers!


            People

            • Assignee:
              tallison@mitre.org Tim Allison
              Reporter:
              jvanraemdonck Jan Van Raemdonck
            • Votes:
              0
              Watchers:
              3
