Details
- Type: Bug
- Status: Closed
- Priority: Minor
- Resolution: Later
- 1.6
- None
- None
Description
When parsing a 7-bit encoded Outlook post (.msg without headers), Tika tries to detect the encoding. For a handful of files, the CharsetDetector returns "IBM424_rtl" with a confidence above the threshold. Tika then passes this encoding to MAPIMessage.set7BitEncoding(). When POI tries to decode the string chunks with it, the JVM reports the charset as unsupported and an exception is thrown.
Full stacktrace:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@72ccd846
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookEncoding(OutlookParserTest.java:264)
    ...irrelevant test framework junk...
Caused by: java.lang.RuntimeException: Encoding not found - IBM424_rtl
    at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:149)
    at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
    at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
    at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
    at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:95)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:223)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 26 more
Caused by: java.io.UnsupportedEncodingException: IBM424_rtl
    at java.lang.StringCoding.decode(Unknown Source)
    at java.lang.String.<init>(Unknown Source)
    at java.lang.String.<init>(Unknown Source)
    at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:147)
    ... 33 more
Unfortunately, I can't share the problematic documents, and I can't create a synthetic document that triggers this issue.
Two questions:
1) Should CharsetDetector return an encoding that is not supported?
2) If so, should we add a simple check before calling set7BitEncoding()?
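For question 2, a minimal sketch of what such a check could look like, using only the JDK's java.nio.charset.Charset API. The class name CharsetGuard, the helper names, and the idea of a fallback encoding are assumptions made for illustration; none of this is existing Tika or POI code.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Tika or POI.
public class CharsetGuard {

    /** Returns true only if the running JVM can actually decode the named charset. */
    public static boolean isUsable(String charsetName) {
        if (charsetName == null) {
            return false; // Charset.isSupported(null) would throw IllegalArgumentException
        }
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            // The name is syntactically illegal (for example, an empty string).
            return false;
        }
    }

    /** Returns the detected name if it is usable, otherwise the caller-supplied fallback. */
    public static String chooseEncoding(String detected, String fallback) {
        return isUsable(detected) ? detected : fallback;
    }
}
```

With a guard like this, the caller could skip MAPIMessage.set7BitEncoding() when isUsable() returns false for the detected name, or pass a fallback of its choosing, so an unsupported name such as "IBM424_rtl" never reaches the string decoding step.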
Attachments
Issue Links
- relates to: TIKA-3516 Unexpected charset IBM424_rtl detected for utf_8 file by CharsetDetector (Resolved)