TIKA-422

Wrong charset conversion in some RTF documents.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels:
      None

      Description

      The RTF parser uses javax.swing.text.rtf, which handles character sets poorly.

      It doesn't support the '\ansicpg' keyword (quoting the RTF file format specification:
      "This keyword represents the default ANSI code page used to perform the Unicode to ANSI conversion when writing RTF text").

      Unfortunately, Windows WordPad saves non-ASCII characters encoded in the \ansicpg code page instead of as the Unicode characters that javax.swing.text.rtf supports.
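
      To illustrate what honoring \ansicpg involves, here is a minimal sketch in Java. It is not the code from the attached patches. The class and method names are hypothetical, and it makes three assumptions: non-ASCII bytes appear as \'hh hex escapes, the code page number N maps to the Java charset "windows-N" (true for the 125x family only), and windows-1252 is used when no \ansicpg keyword is present.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Tika or the JDK.
final class AnsiCpgSketch {
    // Matches the \ansicpgN keyword and captures the code page number N.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");

    // Returns the charset declared by \ansicpgN.
    // Assumption: code page N maps to "windows-N"; falls back to windows-1252.
    static Charset detectCharset(String rtf) {
        Matcher m = ANSICPG.matcher(rtf);
        String name = m.find() ? "windows-" + m.group(1) : "windows-1252";
        return Charset.forName(name);
    }

    // Replaces runs of \'hh escapes in a text fragment with the characters
    // those bytes represent in the given charset. Other input is copied as-is.
    static String decodeHexEscapes(String fragment, Charset cs) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < fragment.length()) {
            if (isHexEscape(fragment, i)) {
                // Collect consecutive bytes so multi-byte sequences decode together.
                pending.write(Integer.parseInt(fragment.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, out, cs);
                out.append(fragment.charAt(i));
                i++;
            }
        }
        flush(pending, out, cs);
        return out.toString();
    }

    // True if the fragment has a backslash, an apostrophe, and two hex digits at pos.
    private static boolean isHexEscape(String s, int pos) {
        return pos + 4 <= s.length()
                && s.charAt(pos) == '\\'
                && s.charAt(pos + 1) == '\''
                && Character.digit(s.charAt(pos + 2), 16) >= 0
                && Character.digit(s.charAt(pos + 3), 16) >= 0;
    }

    // Decodes any buffered bytes with the charset and appends them to the output.
    private static void flush(ByteArrayOutputStream pending, StringBuilder out, Charset cs) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), cs));
            pending.reset();
        }
    }
}
```

      With a header such as {\rtf1\ansi\ansicpg1250 ...}, detectCharset returns windows-1250, and decodeHexEscapes turns the fragment \'b3\'f3d\'9f into "łódź". A real parser would also need to handle control words, groups, and \u escapes, which this sketch ignores.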

      1. TIKA-422.patch
        10 kB
        Michael McCandless
      2. test-windows-1250.rtf
        0.3 kB
        Piotr Bartosiewicz
      3. test_with_curly_brackets.rtf
        9 kB
        Alex Skochin
      4. RTFParser.patch
        21 kB
        Shinsuke Sugaya
      5. RTFParser.patch
        21 kB
        Cristian Vat
      6. RTFParser.patch
        60 kB
        Cristian Vat
      7. RTFParser.patch
        63 kB
        Cristian Vat
      8. RTFParser.patch
        65 kB
        Cristian Vat

          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Piotr Bartosiewicz
          • Votes:
            4
            Watchers:
            4