Tika / TIKA-748

RTF parser fails to extract the body

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels: None

      Description

      Using tika-app I'm getting the following result of parsing the attached document:

      <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      <meta name="subject" content="tests"/>
      <meta name="Content-Length" content="2235"/>
      <meta name="comment" content="StarWriter"/>
      <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
      <meta name="X-Parsed-By" content="org.apache.tika.parser.rtf.RTFParser"/>
      <meta name="Content-Type" content="application/rtf"/>
      <meta name="resourceName" content="test.rtf"/>
      <title>test rft document</title>
      </head>
      <body/></html>
      

      The expected result would be a non-empty body containing the text "The quick brown fox jumps over the lazy dog".

      1. TIKA-748.patch
        5 kB
        Michael McCandless
      2. test.rtf
        2 kB
        Andrzej Bialecki

        Activity

        Andrzej Bialecki added a comment -

        Thanks Michael!

        Michael McCandless added a comment -

        Thanks Andrzej!

        Michael McCandless added a comment -

        Patch.

        Michael McCandless added a comment -

        Hmm, I think this doc is slightly malformed: it contains \* (followed by \cs7) in the middle of a group, but \* is supposed to always come right after a group start {.

        This is causing Tika to ignore all text in the group.

        But I think we can be robust here and only ignore the group's text when we see \* right after {; otherwise, we just ignore the stray \* itself.
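
The idea in the comment above can be sketched as a toy RTF text extractor. This is not the code from TIKA-748.patch or from Tika's RTFParser; the class name LenientRtfSketch, the method extractText, and the lenient flag are all hypothetical, and the sketch handles only groups, control words, and a few escaped characters. It only illustrates the rule: treat \* as "ignore this group" when it directly follows {, and skip a stray \* otherwise.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not Tika's actual RTF parser.
class LenientRtfSketch {

    /**
     * Extracts plain text from a tiny subset of RTF.
     * lenient = true:  \* hides a group only when it directly follows '{';
     *                  a stray \* elsewhere is skipped.
     * lenient = false: any \* hides the rest of its group.
     */
    public static String extractText(String rtf, boolean lenient) {
        StringBuilder out = new StringBuilder();
        Deque<Boolean> saved = new ArrayDeque<>(); // ignore state of enclosing groups
        boolean ignore = false;      // is text in the current group being dropped?
        boolean justOpened = false;  // was the previous token a '{'?
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i++);
            if (c == '{') {
                saved.push(ignore);
                justOpened = true;
            } else if (c == '}') {
                if (!saved.isEmpty()) {
                    ignore = saved.pop(); // restore the enclosing group's state
                }
                justOpened = false;
            } else if (c == '\\' && i < rtf.length()) {
                char n = rtf.charAt(i++);
                if (n == '*') {
                    // The rule under discussion: honor \* only right after '{'.
                    if (justOpened || !lenient) {
                        ignore = true;
                    }
                } else if (Character.isLetter(n)) {
                    // Skip a control word, its optional numeric parameter,
                    // and the single space that may delimit it.
                    while (i < rtf.length() && Character.isLetter(rtf.charAt(i))) i++;
                    if (i + 1 < rtf.length() && rtf.charAt(i) == '-'
                            && Character.isDigit(rtf.charAt(i + 1))) i++;
                    while (i < rtf.length() && Character.isDigit(rtf.charAt(i))) i++;
                    if (i < rtf.length() && rtf.charAt(i) == ' ') i++;
                } else if ((n == '\\' || n == '{' || n == '}') && !ignore) {
                    out.append(n); // escaped literal character
                }
                justOpened = false;
            } else if (c != '\r' && c != '\n') {
                if (!ignore) {
                    out.append(c);
                }
                justOpened = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up input: one well-formed {\* ...} group, one group with a stray \*.
        String rtf = "{\\rtf1 {\\*\\generator Foo;}{\\pard \\*\\cs7 The quick brown fox}}";
        System.out.println("lenient: [" + extractText(rtf, true) + "]");
        System.out.println("strict:  [" + extractText(rtf, false) + "]");
    }
}
```

With the made-up input in main, the strict variant drops the text after the stray \*, which mirrors the empty body in the report, while the lenient variant keeps it and still drops the well-formed {\*\generator ...} group.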


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Andrzej Bialecki
          • Votes:
            0
            Watchers:
            0
