  1. Tika
  2. TIKA-2530

OutlookExtractor "buffer underrun" when parsing .msg with embedded .msg

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16, 1.17
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Reproduced with both Tika 1.16 and Tika 1.17 on Windows, but the problem likely affects all platforms.

      Description

      When parsing certain .msg files containing certain attachments (e.g. other .msg files), I get this error:

      ...
      Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
              at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:662)
              at org.apache.poi.hmef.CompressedRTF.decompress(CompressedRTF.java:73)
              at org.apache.poi.util.LZWDecompresser.decompress(LZWDecompresser.java:81)
              at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:42)
              at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:270)
      ...
      

      I think the issue is that MAPIRtfAttribute fails when OutlookExtractor passes it an empty byte array. I was able to eliminate the error by changing the code at around line 269 of OutlookExtractor in Tika 1.16 (around line 322 in Tika 1.17) to the following:

                  //--- START FIX ---
                  ByteChunk chunk = (ByteChunk) rtfChunk;
                  if (chunk != null && chunk.getValue() != null 
                          && chunk.getValue().length > 0 && !doneBody) {
                      //ByteChunk chunk = (ByteChunk) rtfChunk;
                  //--- END FIX ---
      

      I am not sure whether that is a real fix, or whether more should be done than just getting rid of the error, to make sure everything is extracted properly from all files.
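
To make the guard's effect concrete, here is a minimal, self-contained sketch in plain Java. It does not use Tika or POI: `RtfChunkGuard`, `hasPayload`, `readIntLE`, and `headerOrNull` are hypothetical names introduced only for illustration, and `readIntLE` is merely a stand-in for a little-endian 32-bit header read like the one that appears in the stack trace, not POI's actual code.

```java
/** Hypothetical helper for illustration only; not part of Tika or POI. */
final class RtfChunkGuard {

    private RtfChunkGuard() {}

    /** Mirrors the proposed check: true only when there are bytes to decompress. */
    static boolean hasPayload(byte[] value) {
        return value != null && value.length > 0;
    }

    /**
     * Stand-in for a little-endian 32-bit read at the start of a buffer.
     * Fails when fewer than four bytes are available.
     */
    static int readIntLE(byte[] data) {
        if (data.length < 4) {
            throw new IllegalStateException("buffer underrun");
        }
        return (data[0] & 0xFF)
                | (data[1] & 0xFF) << 8
                | (data[2] & 0xFF) << 16
                | (data[3] & 0xFF) << 24;
    }

    /** Returns the header value, or null when the chunk is skipped by the guard. */
    static Integer headerOrNull(byte[] value) {
        if (!hasPayload(value)) {
            return null; // skip empty chunks instead of trying to decompress nothing
        }
        return readIntLE(value);
    }
}
```

In this sketch, an empty or null array is skipped, but an array of one to three bytes still passes the guard and then fails in the header read. If the real decompressor behaves similarly, a non-empty check alone would not cover truncated RTF chunks, which is one reason the fix above may be incomplete.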

      I cannot share my sample file because it was given to me as sensitive content, and I have not been able to recreate a faulty .msg file.

      Thanks

            People

            • Assignee:
              tallison@apache.org Tim Allison
              Reporter:
              pascal.essiembre Pascal Essiembre