Tika / TIKA-733

[PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:

      Description

      Parsing some RTF documents attempts to call removeLast() on the groupStates list when the list is empty, which throws a NoSuchElementException. The patch adds a check that skips the restore-group-state logic when the list is empty, so the group state is simply not restored in that case. With the check in place, text extraction completes without further downstream errors.
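      The guard described above can be sketched roughly as follows. This is not the attached patch or Tika's actual TextExtractor: the class GroupStateStack, the GroupState placeholder, and the unmatchedCloses counter are hypothetical names used only to illustrate skipping the restore when nothing has been saved.

```java
import java.util.LinkedList;

// Hypothetical stand-in for the parser's group bookkeeping; not Tika's real class.
final class GroupStateStack {
    // Minimal placeholder for per-group formatting state (assumed shape).
    static final class GroupState {
        boolean bold;
        GroupState() {}
        GroupState(GroupState other) { this.bold = other.bold; }
    }

    private final LinkedList<GroupState> groupStates = new LinkedList<>();
    private GroupState current = new GroupState();
    private int unmatchedCloses = 0; // assumption: counted so a caller could log a warning

    // Called on '{': save the current state and continue with a copy of it.
    void processGroupStart() {
        groupStates.add(current);
        current = new GroupState(current);
    }

    // Called on '}': restore the saved state, or skip the restore if none was saved.
    void processGroupEnd() {
        if (groupStates.isEmpty()) {
            unmatchedCloses++; // extra '}' in a malformed document
            return;
        }
        current = groupStates.removeLast();
    }

    int depth() { return groupStates.size(); }

    int unmatchedCloses() { return unmatchedCloses; }
}
```

      For well-formed input the guard never triggers, so behaviour is unchanged; for input with an extra closing brace the extra '}' is counted and otherwise ignored.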

      Unable to include sample file due to sensitive nature of file contents.

      StackTrace (TIKA-0.9)

      Caused by: java.util.NoSuchElementException
      at java.util.LinkedList.remove(LinkedList.java:788)
      at java.util.LinkedList.removeLast(LinkedList.java:144)
      at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1010)
      at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:352)
      at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:53)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      ... 45 more
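
      The two topmost frames are plain JDK behaviour and can be reproduced without Tika. A minimal sketch (the class and method names here are illustrative only): removeLast() on an empty LinkedList throws, while pollLast() returns null instead.

```java
import java.util.LinkedList;
import java.util.NoSuchElementException;

// Illustrative demo class; unrelated to Tika's code.
final class EmptyRemoveLastDemo {
    // Returns true if removeLast() on an empty LinkedList throws NoSuchElementException.
    static boolean removeLastThrowsOnEmpty() {
        LinkedList<String> states = new LinkedList<>();
        try {
            states.removeLast();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    // pollLast() is the non-throwing counterpart: it returns null on an empty list.
    static String pollLastOnEmpty() {
        return new LinkedList<String>().pollLast();
    }
}
```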

        Activity

        Michael McCandless added a comment -

        Thank you Jeremy! Keep the patches coming

        Jeremy Anderson added a comment -

        Cool beans!!

        Thanks for your attention to it. Yeah, I confirmed that 18 of the other files experiencing this error all have corruption issues similar to the first one, although the amount of info contained in the final block varies widely, from a few chars to none.

        But the patch I already submitted does appear to work for getting the text out of each of these corrupted documents.

        Thanks again for adding it to the trunk.

        Michael McCandless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Michael McCandless added a comment -

        Thanks Jeremy!

        Michael McCandless added a comment -

        Actually, I think we should just commit your patch: it's harmless for non-corrupt RTF docs, and for corrupt ones (with this particular corruption) it will make a best effort to extract what text it can.

        I only wanted to confirm that you were hitting this because of document corruption and not a bug in how the new RTF parser tokenizes open/close groups. Thanks!

        Jeremy Anderson added a comment - - edited

        (Sorry, I can't seem to get the post to maintain my newline characters )

        The problem is also present in the older 0.9 release.

        Looking at the document as you suggested, the document is corrupt/malformed in the sense that it contains more closing brackets '}' than opening brackets '{'. That said, the text contained within the document still appears to be extractable using the patch I submitted, which ignores the group state once it is empty.

        My knowledge of RTF formats is rather limited, but is there perhaps a better compromise that will allow the parser to return the text it is able to get and maybe log a warning when a malformed RTF is encountered? I have about 20 or so files in my load set that have encountered this failure. I haven't had the time to investigate all of them yet to see whether they all fail for the same mismatch problem and, when they are corrupt, to determine how much of the extractable text is impacted by the fix I submitted.

        To be noted, both WordPad and MS Word are able to open these files without issue... though that's to be expected. I expect that they may also just ignore the final block in these cases. Actually, after opening the failed document and resaving it in WordPad, the final partial block does indeed just get truncated from the file.

        Looking closer at the file in a text editor, the culprit final extra block appears to be a partial replication of the final valid ending block in the file. Perhaps an appropriate fix for auto-handling these partially corrupted RTFs is to:

        * detect whether the file has more ending blocks than starting blocks, and when it does
        * check whether the final one is a partial replication of the prior one
        * and if so, just ignore the final one.

        Last lines of the corrupted file:

        \pard\li360____VALID RTF FILE TEXT _____\line\par
        \pard\par
        \pard\fi-1800\li1800\tx1800\cf1\f0\fs20\par
        \pard\cf0\b\f3\par
        \pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
        \pard\cf0\b\f3\par
        \par
        \pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
        \pard\cf0\b\f3\par
        \pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
        \pard\cf0\b\f3\par
        \par
        \pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
        \pard\cf0\b\f3\par
        \pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
        \pard\cf0\b\f3\par
        \par
        \pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
        }

        0\li1800\tx1800\cf2\f2\fs20\par
        \pard\cf0\b\f3\par
        \pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
        \pard\cf0\b\f3\par
        \par
        \pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
        \pard\cf0\b\f3\par
        \pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
        \pard\cf0\b\f3\par
        \par
        \pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
        \pard\cf0\b\f3\par
        \pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
        \pard\cf0\b\f3\par
        \par
        }
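
        The first step of the heuristic proposed in this comment (detecting more ending blocks than starting ones) could be sketched as below. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, the whole document is assumed to fit in a String, and only the escapes \{, \} and \\ are skipped (binary \bin data is not handled). It does not implement the "partial replication" check.

```java
// Hypothetical helper, not part of Tika: counts closing braces with no matching opener.
final class RtfBraceBalance {
    // Returns the number of '}' characters that close a group which was never opened.
    static int countUnmatchedCloses(String rtf) {
        int depth = 0;
        int unmatched = 0;
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\' && i + 1 < rtf.length()) {
                char next = rtf.charAt(i + 1);
                // Escaped brace or backslash: skip the escaped character too.
                if (next == '{' || next == '}' || next == '\\') {
                    i++;
                }
                continue;
            }
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                if (depth == 0) {
                    unmatched++; // extra closing brace, as in the file tail above
                } else {
                    depth--;
                }
            }
        }
        return unmatched;
    }
}
```

        A non-zero result would indicate the kind of malformed document described here, and a caller could use it to decide whether to log a warning.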

        Michael McCandless added a comment -

        Hmm, it makes me a little nervous just blindly not popping the group
        state once it's empty since this could be masking a more serious bug.

        I.e., it's possible we are not correctly tokenizing the open / close
        group tokens.

        The other explanation is that the RTF doc is corrupt (has too many
        closing } vs open {).

        Can you look at the doc and figure out if it's corrupt?

        Does this RTF document work with older versions of Tika (before
        TIKA-683 was committed)?

        Michael McCandless made changes -
        Assignee Michael McCandless [ mikemccand ]
        Jeremy Anderson made changes -
        Field Original Value New Value
        Attachment TIKA-733-rtf_TextExtractor_processGroupEnd-NoSuchElementException.patch [ 12496831 ]
        Jeremy Anderson added a comment -

        Patch file

        Jeremy Anderson created issue -

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Jeremy Anderson
          • Votes:
            0
            Watchers:
            0
