Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-777

RTF parser incorrectly applies fonts to complete group

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.0
    • 1.1
    • parser
    • None

    Description

      Tika's RTF parser processes the following rtf document incorrectly, applying the wrong character encoding to the parsed characters:

      {\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0 {\fonttbl {\f0\fswiss\fcharset0 Arial;} {\f1\fswiss\fcharset204 Arial;}

      }

      {\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 \'ea\'eb\'e8\'e5\'ed\'f2!\f0}

      \par
      }

      This document contains russian characters (\f1), but tika decodes these as latin due to the \f0 directive at the end of the group. The RTF parser should probably flush its pendingBytes buffer before processing directives such as these.

      Attachments

        Activity

          People

            mikemccand Michael McCandless
            arjohn Arjohn Kampman
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: