Tika / TIKA-1238

Update OutlookExtractor to handle codepage identification more rigorously

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: parser
    • Labels:
      None

      Description

      Since OutlookExtractor's codepage detection chunk was written, POI's HSMF has added more robust capabilities for identifying codepages in Outlook .msg files. As a first step to integrating those improvements, I'll copy and paste some of POI's code into OutlookExtractor. As a second step, I'll expose more of HSMF's capabilities within POI and then factor out the duplicate code in Tika.

        Activity

        chrismattmann Chris A. Mattmann added a comment -
        • push to 1.8
        rangma Magesh Tarala added a comment -

        I have the following issue with some outlook emails (.msg files). Any quick resolution to this?

        org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@fbf0926
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:258)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at com.bell.solr.CRMDataToSolr.parseAttachments(CRMDataToSolr.java:1713)
        at com.bell.solr.CRMDataToSolr.processSO(CRMDataToSolr.java:2138)
        at com.bell.solr.CRMDataToSolr.processFolders(CRMDataToSolr.java:1958)
        at com.bell.solr.CRMDataToSolr.main(CRMDataToSolr.java:2365)
        Caused by: java.lang.RuntimeException: Encoding not found - cp4020
        at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:155)
        at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
        at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
        at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
        at org.apache.poi.hsmf.MAPIMessage.guess7BitEncoding(MAPIMessage.java:389)
        at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:81)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:225)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
        ... 6 more
        Caused by: java.io.UnsupportedEncodingException: cp4020
        at java.lang.StringCoding.decode(StringCoding.java:190)
        at java.lang.String.<init>(String.java:416)
        at java.lang.String.<init>(String.java:481)
        at org.apach

        tallison@mitre.org Tim Allison added a comment -

        Magesh Tarala, Any chance you could share a test file? Do you know what the actual encoding of the msg file is?

        rangma Magesh Tarala added a comment -

        Hi Tim,
        The files have personal information and I'd rather not attach them to the JIRA ticket. However, I could share with you directly. Can I email them to you? or store in some place accessible to you? Please let me know.

        How do I find the actual encoding of these files?

        Thanks,
        Magesh.

        tallison@mitre.org Tim Allison added a comment - - edited

        The stacktrace is related to my original problem, but it actually shows an inconsistency in POI's handling of UnsupportedEncodingException. POI has a try-catch block for that exception only on the first choice for guessing the 7-bit encoding. The second and third choices take whatever value could be pulled out of the header or the html meta-equiv and call set7BitEncoding(charset) without the try-catch block.

        It turns out there is another problem: Charset.forName() can throw an UnsupportedCharsetException (not an UnsupportedEncodingException), so that isn't even checked for in POI's code. And, while we're defending against trying to create a charset from whatever value we find in msg/html headers or codepoint values, we should also add IllegalCharsetNameException to the catch block...or just catch IllegalArgumentException and be done with it.
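
        To illustrate the point about Charset.forName(), here is a minimal sketch of a defensive lookup. The class and method names (SafeCharsets, lookup) are hypothetical and are not taken from Tika or POI. It relies only on the fact that UnsupportedCharsetException and IllegalCharsetNameException both extend IllegalArgumentException.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not the actual Tika or POI code.
final class SafeCharsets {
    private SafeCharsets() {}

    /**
     * Returns the charset for the given name, or null if the name is
     * missing, syntactically illegal, or not supported by this JVM.
     */
    static Charset lookup(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Catches both UnsupportedCharsetException (e.g. "cp4020")
            // and IllegalCharsetNameException (e.g. a name with spaces).
            return null;
        }
    }
}
```

        A caller would check for null and fall back to some default rather than letting a RuntimeException escape from the parser.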

        As an immediate fix at the Tika level, we can duplicate POI's guess7BitEncoding but add the try-catch blocks. I'll open an issue in POI's bugtracker, though, to fix this at the POI level too.
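
        A rough sketch of what "add the try-catch blocks" could look like when several candidate encodings are tried in order. This is not the code committed to OutlookExtractor; the class name, the method name, and the idea of passing the candidates in as a list (say, a codepage-derived name, then a header value, then an html meta value) are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical sketch; the real guess7BitEncoding logic in POI/Tika differs.
final class EncodingGuesser {
    private EncodingGuesser() {}

    /**
     * Walks the candidate charset names in priority order and returns the
     * first one the JVM can resolve. A bad or unsupported name is skipped
     * instead of aborting the parse; if none resolves, fallback is returned.
     */
    static Charset firstUsable(List<String> candidates, Charset fallback) {
        for (String name : candidates) {
            if (name == null) {
                continue; // this source had no encoding information
            }
            try {
                return Charset.forName(name.trim());
            } catch (IllegalArgumentException e) {
                // Unsupported or illegal name (e.g. "cp4020"): try the next candidate.
            }
        }
        return fallback;
    }
}
```

        With candidates such as "cp4020" followed by "windows-1252", the first name is skipped and the second is used, so every choice gets the same protection and not only the first.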

        Test files will be very helpful. If you can share, please do.

        tallison@mitre.org Tim Allison added a comment -

        Got it. For now, let's see if I can find some triggering files in a fresh pull of .msg files from CommonCrawl via Dominik Stadler's very handy CommonCrawl downloader.

        tallison@mitre.org Tim Allison added a comment -

        Probably not the best way to transfer a file...

        I made the necessary changes at the Tika level to protect against an MSG file containing corrupt encoding info in r1691962.

        If you can build from trunk, give that a shot. Or, wait for the next Jenkins build.

        tallison@mitre.org Tim Allison added a comment -

        That should do it.

        rangma Magesh Tarala added a comment -

        Thanks for the super quick response Tim!! Appreciate it very much.

        I'm not building Tika right now, so I'll wait for the next Jenkins build.

        tallison@mitre.org Tim Allison added a comment -

        Doh. Reopening until we get the mods to POI and then the updated Tika code after the next POI release.

        rangma Magesh Tarala added a comment -

        Tim - This fix will be in 1.10, right? When do you think I can download 1.10 from here: https://tika.apache.org/download.html ?

        tallison@mitre.org Tim Allison added a comment -

        That's up to the community, but I think we have another issue that will force a release within the next few weeks. I'm sorry I don't have a better idea.

        Chris A. Mattmann et al, what's your take on the importance of a new release after we fix TIKA-1690?

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #796 (See https://builds.apache.org/job/tika-trunk-jdk1.7/796/)
        TIKA-1238: Update OutlookExtractor's codepoint detection algorithm (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1691962)

        • /tika/trunk/CHANGES.txt
        • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
        • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
        davemeikle Dave Meikle added a comment -

        Fix committed in r1691962.


          People

          • Assignee:
            tallison@mitre.org Tim Allison
            Reporter:
            tallison@mitre.org Tim Allison