Tika / TIKA-3972

Parsing RTF sample with hyperlink and ToXMLContentHandler returns malformed XHTML from toString method call


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.0
    • Fix Version/s: None
    • Component/s: parser
    • Environment: Tested with Java 8 (Eclipse Temurin) and Tika 2.7.0 on Windows 11.

    Description

      I am exploring Tika for RTF to X(HT)ML parsing and have run into a problem with an RTF document that contains a hyperlink. When I parse it with a ContentHandler created as a ToXMLContentHandler and then call the toString() method on the handler, the result is a malformed X(HT)ML document in which the opening `<a>` tag is never closed.

      I have attached the relevant RTF sample document. The output I get is

      ```

      <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
      <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.rtf.RTFParser" />
      <meta name="Content-Type" content="application/rtf" />
      <title></title>
      </head>
      <body><p />
      <p />
      <p>    10”Flour Tortilla</p>
      <p>    Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips
      Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>
      <p><b />    Ripped Romaine</p>
      <p>    Blackened Salmon julienne</p>
      <p>    Shaved Red Onion</p>
      <p>    Julienne Tomato</p>
      <p>    Grated Parmesan</p>
      <p>    Blackening spice: <a href="..\\..\\SPICE
      Blackening Spice.doc">Blackening Spice.doc</a></p>
      <p />
      <p>Method</p>
      <p>Procedure Text </p>
      <p />
      <p />
      </body></html>

      ```

      where the part `<p>    Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips
      Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>` is malformed: the `<a href>` element is still open when the enclosing `</b>` end tag appears, and it is never closed.
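
      One way to confirm the problem independently of Tika is to pass the string returned by toString() to the JDK's own XML parser. The following is only a minimal sketch under that assumption and is not part of the original report; the class name `WellFormedCheck` and the method `isWellFormed` are hypothetical, and only standard JDK classes are used.

      ```java
      import java.io.IOException;
      import java.io.StringReader;
      import javax.xml.parsers.DocumentBuilder;
      import javax.xml.parsers.DocumentBuilderFactory;
      import javax.xml.parsers.ParserConfigurationException;
      import org.xml.sax.InputSource;
      import org.xml.sax.SAXException;
      import org.xml.sax.helpers.DefaultHandler;

      // Hypothetical helper (not a Tika API): reports whether a string is well-formed XML.
      final class WellFormedCheck {

          static boolean isWellFormed(String xml) {
              try {
                  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                  factory.setNamespaceAware(true);
                  DocumentBuilder builder = factory.newDocumentBuilder();
                  // DefaultHandler rethrows fatal errors instead of printing them to stderr.
                  builder.setErrorHandler(new DefaultHandler());
                  builder.parse(new InputSource(new StringReader(xml)));
                  return true;
              } catch (SAXException e) {
                  // Raised for mismatched or unclosed tags, among other well-formedness errors.
                  return false;
              } catch (ParserConfigurationException | IOException e) {
                  throw new IllegalStateException(e);
              }
          }
      }
      ```

      Calling `WellFormedCheck.isWellFormed(handler.toString())`, where `handler` is assumed to be the ToXMLContentHandler used for the parse, would be expected to return false for output like the one shown above, because the `</b>` end tag does not match the open `<a>` element.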

       


          People

            Assignee: Unassigned
            Reporter: Martin Honnen (martin.honnen)
