TIKA-777

RTF parser incorrectly applies fonts to complete group

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.1
    • Component/s: parser
    • Labels:
      None

      Description

      Tika's RTF parser processes the following RTF document incorrectly, applying the wrong character encoding to the parsed characters:

      {\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0
      {\fonttbl
      {\f0\fswiss\fcharset0 Arial;}
      {\f1\fswiss\fcharset204 Arial;}
      }
      {\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 \'ea\'eb\'e8\'e5\'ed\'f2!\f0}\par
      }
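
      The \'xx escapes in the sample are raw bytes, so the text they produce depends on which charset is used to decode them. The snippet below is a minimal illustration, not Tika code. It assumes \fcharset204 corresponds to Windows-1251 and \fcharset0 to Windows-1252, and decodes the same bytes both ways:

```python
# Raw bytes taken from the \'xx escapes in the sample group
# (the space and "!" are plain ASCII text in the RTF).
raw = (
    bytes.fromhex("d3e2e0e6e0e5ecfbe9")
    + b" "
    + bytes.fromhex("eaebe8e5edf2")
    + b"!"
)

# Assumed charset mapping: \fcharset204 -> cp1251, \fcharset0 -> cp1252.
as_cyrillic = raw.decode("cp1251")  # what the \f1 font implies
as_latin = raw.decode("cp1252")     # what a Latin decode produces

print(as_cyrillic)
print(as_latin)
```

      Only the first decoding yields readable Russian text; the second is the kind of Latin output the report describes.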

      This document contains Russian characters in font \f1 (declared with \fcharset204), but Tika decodes them as Latin because of the \f0 directive at the end of the group. The RTF parser should probably flush its pendingBytes buffer before processing directives such as this one.
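
      The suggested flush can be sketched with a toy decoder for a single group. This is not Tika's RTFParser and not necessarily how the fix was made. The names decode_group, FONT_CODECS and flush_on_font_change are hypothetical, and the font-to-codec mapping is an assumption based on the sample font table. Buffered bytes are decoded with the font that was active when they were read only if the buffer is flushed before \fN takes effect:

```python
import re

# Assumed mapping from font number to Python codec, mirroring the sample
# font table (\f0 = \fcharset0 -> cp1252, \f1 = \fcharset204 -> cp1251).
FONT_CODECS = {0: "cp1252", 1: "cp1251"}

# Tokens: \'xx hex byte | \fN font switch | any other control word | text char.
TOKEN = re.compile(r"\\'([0-9a-fA-F]{2})|\\f(\d+) ?|\\[a-z]+-?\d* ?|([^\\{}])")


def decode_group(body, flush_on_font_change, default_font=0):
    """Decode the contents of one RTF group (toy model, ASCII text only)."""
    out = []
    pending = bytearray()  # plays the role of the pendingBytes buffer
    font = default_font

    def flush():
        # Decode whatever is buffered with the font that is active right now.
        out.append(pending.decode(FONT_CODECS[font]))
        pending.clear()

    for m in TOKEN.finditer(body):
        hex_byte, font_num, char = m.group(1), m.group(2), m.group(3)
        if hex_byte is not None:
            pending.append(int(hex_byte, 16))
        elif font_num is not None:
            if flush_on_font_change:
                flush()  # decode buffered bytes before the font changes
            font = int(font_num)
        elif char is not None:
            pending.extend(char.encode("ascii"))
        # Other control words (such as \fs20) are ignored in this sketch.
    flush()
    return "".join(out)


SAMPLE = r"\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 \'ea\'eb\'e8\'e5\'ed\'f2!\f0"

print(decode_group(SAMPLE, flush_on_font_change=False))  # trailing \f0 wins
print(decode_group(SAMPLE, flush_on_font_change=True))   # bytes keep \f1
```

      Without the flush, the trailing \f0 is in effect by the time the buffer is decoded, which reproduces the reported symptom. With the flush, the bytes are decoded under \f1.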

        Activity

        Arjohn Kampman created issue -
        Arjohn Kampman made changes -
        Field: Description
        Original Value: "Tika's RTF parser processes the following rtf fragment incorrectly, ..." (sample document and closing paragraph identical to the Description above)
        New Value: the same text with "rtf fragment" changed to "rtf document"
        Michael McCandless made changes -
        Assignee Michael McCandless [ mikemccand ]
        Michael McCandless added a comment - Thanks Arjohn!
        Michael McCandless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 1.1 [ 12318849 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Arjohn Kampman
          • Votes:
            0
            Watchers:
            0
