Tika / TIKA-1236

CharsetDetector returning unsupported encoding for some 7-bit Outlook/MSG files

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Later
    • Affects Version/s: 1.6
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      When parsing a 7-bit encoded Outlook post (.msg without headers), Tika tries to detect the encoding. For a handful of files, the CharsetDetector returns "IBM424_rtl" with a confidence above the threshold. Tika then passes this encoding name to MAPIMessage.set7BitEncoding(). When POI tries to decode the string chunks with it, the JVM does not recognize the name as a supported encoding, and an exception is thrown.
      Full stack trace:

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@72ccd846
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      	at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookEncoding(OutlookParserTest.java:264)
      ...irrelevant test framework junk...
      Caused by: java.lang.RuntimeException: Encoding not found - IBM424_rtl
      	at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:149)
      	at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
      	at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
      	at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
      	at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:95)
      	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:223)
      	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      	... 26 more
      Caused by: java.io.UnsupportedEncodingException: IBM424_rtl
      	at java.lang.StringCoding.decode(Unknown Source)
      	at java.lang.String.<init>(Unknown Source)
      	at java.lang.String.<init>(Unknown Source)
      	at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:147)
      	... 33 more
      

      Unfortunately, I can't share the problematic documents, and I haven't been able to create a synthetic document that triggers this issue.

      Two questions:
      1) Should CharsetDetector return an encoding that is not supported?
      2) If so, should we add a simple check before calling set7BitEncoding()?
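
      For question 2, a minimal sketch of what such a check could look like, using only java.nio.charset. The class and method names (CharsetGuard, isUsableCharset) are hypothetical and are not part of Tika or POI; the idea is simply to ask the JVM whether it knows the detected name before handing it to set7BitEncoding().

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Tika or POI.
public class CharsetGuard {

    /** Returns true only if the running JVM can decode with the named charset. */
    static boolean isUsableCharset(String name) {
        if (name == null) {
            // Charset.isSupported(null) would throw IllegalArgumentException
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name
            return false;
        }
    }

    public static void main(String[] args) {
        // "IBM424_rtl" is the name reported in the stack trace above
        String[] candidates = {"UTF-8", "IBM424_rtl", "not a charset!"};
        for (String name : candidates) {
            System.out.println(name + " -> " + isUsableCharset(name));
        }
    }
}
```

      In the extractor, the call to set7BitEncoding() would then presumably be made only when the check passes, leaving the message on its default encoding otherwise. Whether skipping is preferable to falling back to the detector's next-best match is a separate decision.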

      Attachments

        1. TIKA-1236.patch (2 kB, Tim Allison)

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)
              Votes: 0
              Watchers: 2
