Tika / TIKA-422

Wrong charset conversion in some RTF documents.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels: None

    Description

      The RTF parser uses javax.swing.text.rtf, which handles character sets poorly.

      It does not support the '\ansicpg' control word. The RTF file format specification describes it as follows:
      "This keyword represents the default ANSI code page used to perform the Unicode to ANSI conversion when writing RTF text".

      Unfortunately, Windows WordPad encodes non-ASCII characters in the \ansicpg code page instead of writing them as the Unicode characters that javax.swing.text.rtf supports.
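
To illustrate what honoring \ansicpg involves, here is a minimal Java sketch. It is not the attached patch or Tika's actual parser. The class and method names (AnsiCpgSketch, codePage, decodeHexEscapes) are hypothetical, and it assumes that the code page number maps to a Java charset named "windows-N" and that non-ASCII bytes appear as \'hh hex escapes. That mapping holds for code pages such as 1250 and 1252 but not for every code page.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnsiCpgSketch {
    // Matches the header control word, e.g. \ansicpg1250
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");
    // Matches one or more consecutive \'hh hex escapes
    private static final Pattern HEX_RUN = Pattern.compile("(?:\\\\'[0-9a-fA-F]{2})+");

    // Assumption: code page N corresponds to the Java charset "windows-N".
    // Assumption: fall back to windows-1252 when \ansicpg is absent.
    public static Charset codePage(String rtf) {
        Matcher m = ANSICPG.matcher(rtf);
        String name = m.find() ? "windows-" + m.group(1) : "windows-1252";
        return Charset.forName(name); // throws if the JVM lacks this charset
    }

    // Replaces each run of \'hh escapes with the text it encodes in the
    // document's code page. Everything else is copied through unchanged.
    public static String decodeHexEscapes(String rtf) {
        Charset cs = codePage(rtf);
        Matcher m = HEX_RUN.matcher(rtf);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(rtf, last, m.start());
            String run = m.group();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Each escape is four characters long: backslash, apostrophe, two hex digits
            for (int i = 0; i < run.length(); i += 4) {
                bytes.write(Integer.parseInt(run.substring(i + 2, i + 4), 16));
            }
            // Decode the whole run at once so multi-byte sequences stay together
            out.append(new String(bytes.toByteArray(), cs));
            last = m.end();
        }
        out.append(rtf, last, rtf.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // A WordPad-style fragment: Polish text saved under code page 1250
        String rtf = "{\\rtf1\\ansi\\ansicpg1250 \\'b3\\'f3d\\'9f}";
        System.out.println(decodeHexEscapes(rtf));
    }
}
```

A parser that ignores \ansicpg would decode the same bytes with some other default charset, which is one way the wrong characters described above can arise.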

      Attachments

        1. RTFParser.patch
          65 kB
          Cristian Vat
        2. RTFParser.patch
          63 kB
          Cristian Vat
        3. RTFParser.patch
          60 kB
          Cristian Vat
        4. RTFParser.patch
          21 kB
          Cristian Vat
        5. RTFParser.patch
          21 kB
          Shinsuke Sugaya
        6. test_with_curly_brackets.rtf
          9 kB
          Alex Skochin
        7. test-windows-1250.rtf
          0.3 kB
          Piotr Bartosiewicz
        8. TIKA-422.patch
          10 kB
          Michael McCandless

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Piotr Bartosiewicz (bartex)
            Votes: 4
            Watchers: 4