Tika / TIKA-3238

RTFParser fails to generate full content of an RTF file that has been generated in LibreOffice

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.24.1
    • Fix Version/s: None
    • Component/s: parser
    • Important

    Description

      Some RTF files created in LibreOffice Writer do not appear to be parsed correctly: the RTFParser extracts only a portion of the text (e.g., the title).

      However, if the same file is opened in Microsoft Word on Windows and saved again as RTF, the parser extracts the full text.

      An example file is attached to the ticket.

      The following snippet shows how the parser is invoked:

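      // Declared alongside the parser setup; not referenced elsewhere in this snippet.
      private static final Set<MediaType> EXCLUDES = Collections.singleton(MediaType.application("x-tika-ooxml"));

      // Restrict the AutoDetectParser to the RTF parser only.
      private static final Parser[] PARSERS = new Parser[] {
              new RTFParser()
      };

      private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);

      private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);

      // parseToString declares the checked exceptions IOException and TikaException.
      public String parse(InputStream content) throws IOException, TikaException {
          return TIKA_INSTANCE.parseToString(content);
      }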
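
      To help confirm that the missing text is actually present in the LibreOffice-generated file, the same bytes could be read with an independent RTF reader and the result compared with the Tika output. The following is a minimal sketch, assuming the JDK's javax.swing.text.rtf.RTFEditorKit is an acceptable baseline. The class name RtfBaseline and the method extractText are hypothetical and not part of Tika, and RTFEditorKit only supports a limited subset of RTF, so it may itself drop content on some files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

/** Hypothetical helper (not part of Tika): plain-text extraction with the JDK's own RTF reader. */
final class RtfBaseline {

    private RtfBaseline() {
        // Static utility; not meant to be instantiated.
    }

    /** Returns the plain text that RTFEditorKit reads from the given RTF bytes. */
    static String extractText(byte[] rtf) {
        DefaultStyledDocument document = new DefaultStyledDocument();
        try (ByteArrayInputStream in = new ByteArrayInputStream(rtf)) {
            // Parse the RTF stream into the styled document, starting at offset 0.
            new RTFEditorKit().read(in, document, 0);
            // Return the document's full text content, without any styling.
            return document.getText(0, document.getLength());
        } catch (IOException | BadLocationException e) {
            // Wrap checked exceptions so callers can use this in quick diagnostics.
            throw new IllegalStateException("Baseline RTF extraction failed", e);
        }
    }
}
```

      One possible use is to call RtfBaseline.extractText on the bytes of both the LibreOffice file and the Word re-save, then compare the lengths with what parse(...) returns for each. When comparing lengths, keep in mind that Tika.parseToString caps its output at 100,000 characters by default unless setMaxStringLength is changed.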

      Attachments

        1. file-sample_1MB (1).rtf
          1009 kB
          Bruno

          People

            Assignee: Unassigned
            Reporter: Bruno (bruno.vilhena)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated: