Tika / TIKA-666

Unable to extract content from RTF files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8, 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
    • Environment:

      Windows 32 bit OS, JDK 1.6.19

      Description

      Hi,

      I have tried to extract the text content from a variety of RTF documents, using many different techniques, and every attempt has failed. I have attached the RTF document here.

        Activity

        samraj added a comment -

        Unable to extract the content from this document.

        Jukka Zitting added a comment -

        The exception I get when parsing this document is:

        Exception in thread "main" org.apache.tika.exception.TikaException: Error parsing an RTF document
        at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:135)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:125)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:339)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:96)
        Caused by: java.lang.NullPointerException
        at java.util.Hashtable.put(Hashtable.java:394)
        at javax.swing.text.rtf.RTFReader$AttributeTrackingDestination.handleKeyword(RTFReader.java:1279)
        at javax.swing.text.rtf.RTFReader.handleKeyword(RTFReader.java:470)
        at javax.swing.text.rtf.RTFParser.write(RTFParser.java:232)
        at javax.swing.text.rtf.RTFParser.write(RTFParser.java:117)
        at javax.swing.text.rtf.AbstractFilter.write(AbstractFilter.java:155)
        at javax.swing.text.rtf.AbstractFilter.readFromStream(AbstractFilter.java:88)
        at javax.swing.text.rtf.RTFEditorKit.read(RTFEditorKit.java:65)
        at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:117)
        ... 6 more

        The error seems to be pretty deep inside the RTF parser in javax.swing, so there isn't much we can do about this in Tika.
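
The reasoning above (a TikaException wrapping a NullPointerException raised inside javax.swing) can be checked mechanically by walking the cause chain. The following is only a minimal sketch: the class RootCauseProbe and its methods are hypothetical names, not part of Tika, and the failure is simulated with a plain RuntimeException and two frames copied from the stack trace above rather than produced by a real parse.

```java
public class RootCauseProbe {

    /** Follows getCause() until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /** True if any frame of the root cause belongs to the given package prefix. */
    static boolean originatesIn(Throwable t, String packagePrefix) {
        for (StackTraceElement frame : rootCause(t).getStackTrace()) {
            if (frame.getClassName().startsWith(packagePrefix)) {
                return true;
            }
        }
        return false;
    }

    /** Builds a stand-in for the reported failure (assumption: simulated, not a real parse). */
    static Throwable simulatedFailure() {
        NullPointerException npe = new NullPointerException();
        npe.setStackTrace(new StackTraceElement[] {
            new StackTraceElement("java.util.Hashtable", "put", "Hashtable.java", 394),
            new StackTraceElement(
                "javax.swing.text.rtf.RTFReader$AttributeTrackingDestination",
                "handleKeyword", "RTFReader.java", 1279)
        });
        // A RuntimeException stands in for org.apache.tika.exception.TikaException here.
        return new RuntimeException("Error parsing an RTF document", npe);
    }

    public static void main(String[] args) {
        Throwable failure = simulatedFailure();
        System.out.println(rootCause(failure).getClass().getName());
        System.out.println(originatesIn(failure, "javax.swing.text.rtf"));
    }
}
```

In real use, the Throwable caught around the parse call would be passed in place of simulatedFailure().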

        Cristian Vat added a comment -

        I checked the error in more detail, mostly to confirm that it's not a regression from something in TIKA-422.

        It isn't. The error is also thrown when using only RTFEditorKit from Java.

        In case it is useful to anyone: character style "0" is referenced but was not defined in the RTF stylesheet, and the Java RTFReader fails because of this.
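
A standalone check of this kind, using only the JDK's RTFEditorKit and no Tika classes, might look like the sketch below. The class name RtfKitProbe and the tiny inline sample are assumptions for illustration; the sample is well-formed RTF and is not the attached document, so it does not itself reference an undefined character style. To test a suspect file, its bytes would be fed to the same read call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

public class RtfKitProbe {

    /** Parses RTF markup with the JDK's RTFEditorKit and returns the plain text. */
    static String extractText(String rtf) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        byte[] bytes = rtf.getBytes(StandardCharsets.ISO_8859_1);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            kit.read(in, doc, 0);
        }
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical minimal sample; replace with the contents of the failing file.
        String sample = "{\\rtf1\\ansi Hello}";
        try {
            System.out.println(extractText(sample));
        } catch (RuntimeException e) {
            // A failure here, with no Tika code on the stack, points at the JDK reader.
            System.out.println("RTFEditorKit failed on its own: " + e);
        }
    }
}
```

If the same exception appears when a problem document is run through this probe, the fault lies in the JDK reader rather than in Tika's wrapper.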

        Michael McCandless added a comment -

        It looks like TIKA-683 fixes this issue, or at least I'm able to extract text for Redline.rtf.

        Chris A. Mattmann added a comment -

        Closed per Mike's comment and the fix for TIKA-683.

          People

          • Assignee:
            Unassigned
            Reporter:
            samraj
          • Votes:
            0
            Watchers:
            1


              Time Tracking

              Estimated: 48h
              Remaining: 48h
              Logged: Not Specified
