[TIKA-422] Wrong charset conversion in some RTF documents. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.7
Fix Version/s: 0.9
Component/s: parser
Labels: None

Description

The RTF parser uses javax.swing.text.rtf, but it sucks.

It doesn't support the '\ansicpg' keyword. Quoting the RTF file format specification:
"This keyword represents the default ANSI code page used to perform the Unicode to ANSI conversion when writing RTF text".

Unfortunately, Windows WordPad saves non-ASCII characters encoded in the code page declared by \ansicpg, instead of as the Unicode characters that javax.swing.text.rtf supports.
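
To illustrate what honoring \ansicpg involves, here is a minimal sketch, not the code from the attached patches. It assumes non-ASCII text arrives as RTF \'hh hex escapes and that the code page number maps onto a Java charset named "windows-<number>", which holds for 1250 through 1258 but not for every code page. The class and method names are hypothetical, and the fallback to windows-1252 is an assumption of this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Tika or the attached patches.
class AnsiCpgSketch {
    // Matches the \ansicpgN control word and captures the code page number.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");
    // Assumed fallback when \ansicpg is missing or names an unsupported charset.
    private static final Charset FALLBACK = Charset.forName("windows-1252");

    // Picks a charset from the first \ansicpgN found in the RTF source.
    static Charset detectCharset(String rtf) {
        Matcher m = ANSICPG.matcher(rtf);
        if (!m.find()) {
            return FALLBACK;
        }
        // Assumption: code page N corresponds to the Java charset "windows-N".
        String name = "windows-" + m.group(1);
        return Charset.isSupported(name) ? Charset.forName(name) : FALLBACK;
    }

    // Replaces runs of \'hh escapes with the characters they encode in the
    // given charset. Other control words and groups are left untouched, and
    // malformed hex digits are not handled.
    static String decodeHexEscapes(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("\\'", i) && i + 4 <= text.length()) {
                // Collect consecutive bytes so multi-byte charsets decode together.
                pending.write(Integer.parseInt(text.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, out, charset);
                out.append(text.charAt(i));
                i++;
            }
        }
        flush(pending, out, charset);
        return out.toString();
    }

    // Decodes any buffered bytes and appends the result to the output.
    private static void flush(ByteArrayOutputStream pending, StringBuilder out, Charset charset) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), charset));
            pending.reset();
        }
    }
}
```

For a document that starts with {\rtf1\ansi\ansicpg1250, detectCharset would select windows-1250, so escapes such as \'b3 decode to Polish letters rather than to the Latin-1 characters a parser gets when it ignores the code page.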

Attachments

- RTFParser.patch, 20/Dec/10 10:58, 65 kB, Cristian Vat
- RTFParser.patch, 14/Oct/10 18:53, 63 kB, Cristian Vat
- RTFParser.patch, 13/Oct/10 00:52, 60 kB, Cristian Vat
- RTFParser.patch, 12/Oct/10 21:41, 21 kB, Cristian Vat
- RTFParser.patch, 10/Oct/10 01:11, 21 kB, Shinsuke Sugaya
- test_with_curly_brackets.rtf, 21/Oct/10 14:36, 9 kB, Alex Skochin
- test-windows-1250.rtf, 10/May/10 08:59, 0.3 kB, Piotr Bartosiewicz
- TIKA-422.patch, 15/Aug/11 10:13, 10 kB, Michael McCandless

People

Assignee: Jukka Zitting

Reporter: Piotr Bartosiewicz

Votes: 4

Watchers: 4

Dates

Created: 10/May/10 08:52

Updated: 02/Aug/12 09:33

Resolved: 01/Feb/11 16:16