Tika
  1. Tika
  2. TIKA-422

Wrong charset conversion in some RTF documents.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels:
      None

      Description

      RTF parser uses javax.swing.text.rtf, but it sucks.

      It doesn't support '\ansicpg' tag (cite from RTF file format specification:
      "This keyword represents the default ANSI code page used to perform the Unicode to ANSI conversion when writing RTF text").

      Unfortunately Windows WordPad saves nonascii characters using \ansicpg instead of supported by javax.swing.text.rtf unicode characters.

      1. RTFParser.patch
        65 kB
        Cristian Vat
      2. RTFParser.patch
        63 kB
        Cristian Vat
      3. RTFParser.patch
        60 kB
        Cristian Vat
      4. RTFParser.patch
        21 kB
        Cristian Vat
      5. RTFParser.patch
        21 kB
        Shinsuke Sugaya
      6. test_with_curly_brackets.rtf
        9 kB
        Alex Skochin
      7. test-windows-1250.rtf
        0.3 kB
        Piotr Bartosiewicz
      8. TIKA-422.patch
        10 kB
        Michael McCandless

        Activity

        Hide
        Piotr Bartosiewicz added a comment -

        Example of problematic file attached. Tika should extract
        "zażółć gęślą jaźń ZAŻÓŁĆ GĘŚLĄ JAŹŃ" but I got
        "za ¿ ó ³ æ g ê ś l ¹ ja ź ñ ZA ¯ Ó £ Æ G Ê Ś L ¥ JA Ź Ñ"

        Show
        Piotr Bartosiewicz added a comment - Example of problematic file attached. Tika should extract "zażółć gęślą jaźń ZAŻÓŁĆ GĘŚLĄ JAŹŃ" but I got "za ¿ ó ³ æ g ê ś l ¹ ja ź ñ ZA ¯ Ó £ Æ G Ê Ś L ¥ JA Ź Ñ"
        Hide
        Jukka Zitting added a comment -

        Does anyone know an alternative RTF parser in Java with a friendly license [1]? It looks like there's little we can do about this as long as we're stuck with the Swing RTF parser.

        [1] http://www.apache.org/legal/resolved.html

        Show
        Jukka Zitting added a comment - Does anyone know an alternative RTF parser in Java with a friendly license [1] ? It looks like there's little we can do about this as long as we're stuck with the Swing RTF parser. [1] http://www.apache.org/legal/resolved.html
        Hide
        Leszek Piotrowicz added a comment -

        Look RTFParser.jj at http://www.koders.com/noncode/fid69741D7970C9D0BA22243871D7E5682434A2A19E.aspx - seems to be Apache licensed. Also RTFPlainTextExtractor.java is Apache licensed but RTFParserDelegate.java is LGPL (a bit strange)??? However it is rather a simple interface which may be rewritten.

        After some hacking I was able to try this parser. It correctly extracts Polish diactric characters. It also inserts extra spaces into text in table cells (see https://issues.apache.org/jira/browse/TIKA-392)
        But it failed on some big rtf file claiming that it is syntactically incorrect. Anyway it seems much better for non-english texts.

        Show
        Leszek Piotrowicz added a comment - Look RTFParser.jj at http://www.koders.com/noncode/fid69741D7970C9D0BA22243871D7E5682434A2A19E.aspx - seems to be Apache licensed. Also RTFPlainTextExtractor.java is Apache licensed but RTFParserDelegate.java is LGPL (a bit strange)??? However it is rather a simple interface which may be rewritten. After some hacking I was able to try this parser. It correctly extracts Polish diactric characters. It also inserts extra spaces into text in table cells (see https://issues.apache.org/jira/browse/TIKA-392 ) But it failed on some big rtf file claiming that it is syntactically incorrect. Anyway it seems much better for non-english texts.
        Hide
        Shinsuke Sugaya added a comment -

        Attached a patch to convert \'XX into \uXXXX before processing it by RTFEditorKit.

        RTFEditorKit supports unicode character only.
        So, I could not get a correct text from Japanese RTF file.
        It works by this fix.

        Show
        Shinsuke Sugaya added a comment - Attached a patch to convert \'XX into \uXXXX before processing it by RTFEditorKit. RTFEditorKit supports unicode character only. So, I could not get a correct text from Japanese RTF file. It works by this fix.
        Hide
        Cristian Vat added a comment -

        Attached updated version of the patch.

        Changed isUnicode to include characters not from 255 but from 127 upwards. RTF is 7-bit.

        Show
        Cristian Vat added a comment - Attached updated version of the patch. Changed isUnicode to include characters not from 255 but from 127 upwards. RTF is 7-bit.
        Hide
        Cristian Vat added a comment -

        Clarification:

        The previous patch added some extra spaces after/between special (encoded) characters such that "Übersicht" got transformed into "Ü bersicht"
        This is because some characters are encoded using the hex notation "\'hh" so they should not have spaces after the text run, but they are lower than 255.
        As stated in the specification, RTF is 7-bit, so actually everything above 127 will be encoded as \'hh in the file.

        Also, regarding the charset conversion:

        I tried with some documents in Czech with which I had problems before and now they are correctly extracted if I test using a RTF file created using WordPad.
        However, with RTF files output from Microsoft Word there are still some issues. Hope to investigate more tomorrow.

        Show
        Cristian Vat added a comment - Clarification: The previous patch added some extra spaces after/between special (encoded) characters such that "Übersicht" got transformed into "Ü bersicht" This is because some characters are encoded using the hex notation "\'hh" so they should not have spaces after the text run, but they are lower than 255. As stated in the specification, RTF is 7-bit, so actually everything above 127 will be encoded as \'hh in the file. Also, regarding the charset conversion: I tried with some documents in Czech with which I had problems before and now they are correctly extracted if I test using a RTF file created using WordPad. However, with RTF files output from Microsoft Word there are still some issues. Hope to investigate more tomorrow.
        Hide
        Cristian Vat added a comment -

        Attached new patch.

        Added test for checking space after umlaut/encoded character issue.
        Added test for checking extraction of Czech characters from an RTF file created from WordPad. (text sample from Czech wikipedia)

        Also added a test with same conditions and text, but with RTF file exported from Word2010.

        Extraction changes:

        Added a special condition when parsing the font table, to not stop at spaces if we detect we are in a font definition. That's basically done by checking if we have a font-family in the previous string.
        From what I saw this fixes some problems I had with files exported from Word2007 until Word2010, where the font table definitions were expanded/complicated.

        Show
        Cristian Vat added a comment - Attached new patch. Added test for checking space after umlaut/encoded character issue. Added test for checking extraction of Czech characters from an RTF file created from WordPad. (text sample from Czech wikipedia) Also added a test with same conditions and text, but with RTF file exported from Word2010. Extraction changes: Added a special condition when parsing the font table, to not stop at spaces if we detect we are in a font definition. That's basically done by checking if we have a font-family in the previous string. From what I saw this fixes some problems I had with files exported from Word2007 until Word2010, where the font table definitions were expanded/complicated.
        Hide
        Cristian Vat added a comment -

        Attached new patch.

        Previous patch broke TIKA-392 when previous cell ended in an encoded character.
        Fixed by adding extra spaces at the end of each cell. This may result in extra spaces, but I found no other way to do it. And at least for some use-cases the spaces are not important.

        Also added a test to know if it breaks again.
        Test checks if all cells in a table are in correct order and have spaces between them. There are also cells with encoded characters.

        Show
        Cristian Vat added a comment - Attached new patch. Previous patch broke TIKA-392 when previous cell ended in an encoded character. Fixed by adding extra spaces at the end of each cell. This may result in extra spaces, but I found no other way to do it. And at least for some use-cases the spaces are not important. Also added a test to know if it breaks again. Test checks if all cells in a table are in correct order and have spaces between them. There are also cells with encoded characters.
        Hide
        Cristian Vat added a comment -

        Anyone mind looking over the patch so far?
        It seems to solve most problems I encountered with encodings in RTF files.

        Show
        Cristian Vat added a comment - Anyone mind looking over the patch so far? It seems to solve most problems I encountered with encodings in RTF files.
        Hide
        Alex Skochin added a comment - - edited

        Thanks a lot for the patch. It works fine for the most of documents.
        But it seems to fail parsing files that contain curly brackets inside (see attached test_with_curly_brackes.rtf for example).
        Not patched tika read this file without errors.

        Show
        Alex Skochin added a comment - - edited Thanks a lot for the patch. It works fine for the most of documents. But it seems to fail parsing files that contain curly brackets inside (see attached test_with_curly_brackes.rtf for example). Not patched tika read this file without errors.
        Hide
        Piotr Bartosiewicz added a comment -

        Patch looks OK except one thing. Patched rtf woks much slower than the original because for every document the temporary file is created. The stream should be stored in memory for a small documents - see how it is done in tika.utils.RereadableInputStream.

        Show
        Piotr Bartosiewicz added a comment - Patch looks OK except one thing. Patched rtf woks much slower than the original because for every document the temporary file is created. The stream should be stored in memory for a small documents - see how it is done in tika.utils.RereadableInputStream.
        Hide
        Cristian Vat added a comment -

        New patch.

        • Fixed curly brackets exception.
        • Added handling also for "\afN" font setting instead of only "\fN"
        • Reset last font when entering colortbl, fixed a bug with some documents
        • Handled \deffN default font definition

        All tests pass and also some special documents I had look ok.
        Using memory for small files would be good, but not needed in my use-case.

        Show
        Cristian Vat added a comment - New patch. Fixed curly brackets exception. Added handling also for "\afN" font setting instead of only "\fN" Reset last font when entering colortbl, fixed a bug with some documents Handled \deffN default font definition All tests pass and also some special documents I had look ok. Using memory for small files would be good, but not needed in my use-case.
        Hide
        Jukka Zitting added a comment -

        Nice work, thanks! And sorry for the delay in responding. I committed the patch in revision 1066081.

        Show
        Jukka Zitting added a comment - Nice work, thanks! And sorry for the delay in responding. I committed the patch in revision 1066081.
        Hide
        Michael McCandless added a comment -

        Reading through this, I noticed that we didn't add a test case from the test_with_curly_brackets.rtf... attached patch adds it. The only oddity is that the original assert I added (commented out) fails, because we apparently add an extra space around the braces.

        Show
        Michael McCandless added a comment - Reading through this, I noticed that we didn't add a test case from the test_with_curly_brackets.rtf... attached patch adds it. The only oddity is that the original assert I added (commented out) fails, because we apparently add an extra space around the braces.
        Hide
        Chris A. Mattmann added a comment -

        Committed Mike's patch in r1158779. Thanks, Mike!

        Show
        Chris A. Mattmann added a comment - Committed Mike's patch in r1158779. Thanks, Mike!
        Hide
        Michael McCandless added a comment -

        Thanks Chris!

        It was Alex Skochin that created this test originally (test_with_curly_brackets.rtf, above); I just noticed that his example was never committed so I made a new patch for it...

        Show
        Michael McCandless added a comment - Thanks Chris! It was Alex Skochin that created this test originally (test_with_curly_brackets.rtf, above); I just noticed that his example was never committed so I made a new patch for it...

          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Piotr Bartosiewicz
          • Votes:
            4 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development