Tika / TIKA-1309

RTF TextExtractor ignores consecutive linebreaks

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5, 1.6
    • Fix Version/s: 1.8
    • Component/s: parser
    • Labels: None

    Description

      RTF files (such as those produced by WordPad) often encode consecutive line breaks as consecutive \par commands. However, org.apache.tika.parser.rtf.TextExtractor ignores the second \par, so the extra line break is dropped from the extracted text. The fix is simple; see the attached patch.
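
      To make the expected behavior concrete, here is a minimal, self-contained sketch of a toy RTF-to-text converter in which every \par produces its own line break. It is an illustration only: it is not Tika's TextExtractor and not the code in the attached patch, and the names RtfParDemo and toPlainText are hypothetical.

```java
// Toy illustration only: NOT Tika's TextExtractor and NOT the attached patch.
// RtfParDemo and toPlainText are hypothetical names chosen for this sketch.
class RtfParDemo {

    /** Converts a tiny RTF fragment to plain text, emitting one '\n' per \par. */
    static String toPlainText(String rtf) {
        StringBuilder out = new StringBuilder();
        int len = rtf.length();
        int i = 0;
        while (i < len) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                // Read the control word: the letters that follow the backslash.
                int j = i + 1;
                while (j < len && Character.isLetter(rtf.charAt(j))) {
                    j++;
                }
                String word = rtf.substring(i + 1, j);
                if (word.isEmpty()) {
                    // Control symbol such as \\ or \{: keep the escaped character.
                    // (Hex escapes like \'hh are not handled in this sketch.)
                    if (j < len) {
                        out.append(rtf.charAt(j));
                        j++;
                    }
                } else {
                    // Skip an optional numeric parameter, e.g. the 1 in \rtf1.
                    if (j + 1 < len && rtf.charAt(j) == '-'
                            && Character.isDigit(rtf.charAt(j + 1))) {
                        j++;
                    }
                    while (j < len && Character.isDigit(rtf.charAt(j))) {
                        j++;
                    }
                    // A single space after a control word is a delimiter, not text.
                    if (j < len && rtf.charAt(j) == ' ') {
                        j++;
                    }
                    // Every \par counts, including one that directly follows another.
                    if (word.equals("par")) {
                        out.append('\n');
                    }
                }
                i = j;
            } else if (c == '{' || c == '}' || c == '\r' || c == '\n') {
                // Group braces and raw line breaks in the RTF source are not text.
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

      With this sketch, the fragment {\rtf1 first\par\par second} converts to "first", an empty line, then "second", which is the output the report expects from consecutive \par commands.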

      Attachments

        1. 0001-fix-RTF-ignores-consecutive-newlines.patch
          0.9 kB
          Aleksandr Dubinsky
        2. test.rtf
          0.3 kB
          Aleksandr Dubinsky

            People

              Assignee: Unassigned
              Reporter: Aleksandr Dubinsky (almson)
              Votes: 0
              Watchers: 2