TIKA-422

Wrong charset conversion in some RTF documents.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels:
      None

      Description

      The RTF parser uses javax.swing.text.rtf, which does not handle all character encodings correctly.

      It does not support the '\ansicpg' control word. The RTF file format specification describes it as follows:
      "This keyword represents the default ANSI code page used to perform the Unicode to ANSI conversion when writing RTF text."

      Unfortunately, Windows WordPad saves non-ASCII characters encoded in the \ansicpg code page instead of as the Unicode characters that javax.swing.text.rtf does support.
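
To illustrate the problem, here is a minimal sketch of what honoring '\ansicpg' could look like. It is not the code from any of the attached patches, and the class and method names (AnsiCpgSketch, detectCharset, decodeHexEscapes) are hypothetical. It assumes the non-ASCII bytes appear as \'hh hex escapes and that the code page number maps to a Java charset named "windows-<number>", which only holds for the 125x code pages. A real parser would need a full code page table and proper RTF tokenizing.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Tika or Swing.
final class AnsiCpgSketch {
    // Matches the \ansicpgN control word in the RTF header.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");
    // Matches a single \'hh hex escape (one byte in the ANSI code page).
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\'([0-9a-fA-F]{2})");

    // Assumption: code page N maps to the Java charset "windows-N".
    // Falls back to windows-1252 when the tag is missing or unsupported.
    static Charset detectCharset(String rtf) {
        Matcher m = ANSICPG.matcher(rtf);
        String name = m.find() ? "windows-" + m.group(1) : "windows-1252";
        return Charset.isSupported(name) ? Charset.forName(name) : Charset.forName("windows-1252");
    }

    // Replaces runs of \'hh escapes with the characters they encode in cs.
    // Consecutive escapes are buffered and decoded together so that
    // multi-byte code pages would also decode correctly.
    static String decodeHexEscapes(String fragment, Charset cs) {
        Matcher m = HEX_ESCAPE.matcher(fragment);
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                // Literal text interrupts the byte run: decode what we have first.
                flush(pending, cs, out);
                out.append(fragment, last, m.start());
            }
            pending.write(Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        flush(pending, cs, out);
        out.append(fragment.substring(last));
        return out.toString();
    }

    private static void flush(ByteArrayOutputStream pending, Charset cs, StringBuilder out) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), cs));
            pending.reset();
        }
    }
}
```

With a header such as {\rtf1\ansi\ansicpg1250 ...}, detectCharset would select windows-1250, and the escapes \'bf\'f3\'b3\'e6 would then decode to the Polish letters "żółć" rather than to the Latin-1 characters a parser that ignores the code page would produce.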

        Attachments

        1. TIKA-422.patch
          10 kB
          Michael McCandless
        2. RTFParser.patch
          65 kB
          Cristian Vat
        3. test_with_curly_brackets.rtf
          9 kB
          Alex Skochin
        4. RTFParser.patch
          63 kB
          Cristian Vat
        5. RTFParser.patch
          60 kB
          Cristian Vat
        6. RTFParser.patch
          21 kB
          Cristian Vat
        7. RTFParser.patch
          21 kB
          Shinsuke Sugaya
        8. test-windows-1250.rtf
          0.3 kB
          Piotr Bartosiewicz

            People

            • Assignee: Jukka Zitting (jukkaz)
            • Reporter: Piotr Bartosiewicz (bartex)
            • Votes: 4
            • Watchers: 4
