Tika / TIKA-2530

OutlookExtractor "buffer underrun" when parsing .msg with embedded .msg


    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16, 1.17
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
    • Environment:

      Reproduced with both Tika 1.16 and Tika 1.17 on Windows, but the problem likely affects all platforms.


      When parsing certain .msg files containing certain attachments (e.g. other .msg files), I get this error:

      Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
              at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:662)
              at org.apache.poi.hmef.CompressedRTF.decompress(CompressedRTF.java:73)
              at org.apache.poi.util.LZWDecompresser.decompress(LZWDecompresser.java:81)
              at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:42)
              at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:270)

      I think the cause is that MAPIRtfAttribute cannot handle an empty byte array passed from OutlookExtractor. I was able to eliminate the error by changing the code at around line 269 of OutlookExtractor in Tika 1.16 (around line 322 in Tika 1.17) as follows:

                  //--- START FIX ---
                  ByteChunk chunk = (ByteChunk) rtfChunk;
                  if (chunk != null && chunk.getValue() != null 
                          && chunk.getValue().length > 0 && !doneBody) {
                      //ByteChunk chunk = (ByteChunk) rtfChunk;
                  //--- END FIX ---
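To illustrate why an empty byte array would trigger the underrun, here is a minimal, self-contained sketch that does not use the Tika or POI classes. It assumes the decompressor starts by reading a 4-byte little-endian integer from the RTF payload, which is what the stack trace suggests. The names `RtfGuardSketch`, `readLittleEndianInt`, `hasRtfPayload`, and `readHeaderIfPresent` are hypothetical stand-ins, not the actual library API.

```java
import java.util.Optional;

final class RtfGuardSketch {

    // Hypothetical stand-in for a little-endian int read:
    // throws when fewer than 4 bytes remain, like the "buffer underrun" above.
    static int readLittleEndianInt(byte[] data, int offset) {
        if (data.length - offset < 4) {
            throw new IllegalStateException("buffer underrun");
        }
        return (data[offset] & 0xFF)
                | (data[offset + 1] & 0xFF) << 8
                | (data[offset + 2] & 0xFF) << 16
                | (data[offset + 3] & 0xFF) << 24;
    }

    // Same condition as the proposed guard: non-null and non-empty.
    static boolean hasRtfPayload(byte[] value) {
        return value != null && value.length > 0;
    }

    // Only attempts the header read when the guard passes.
    static Optional<Integer> readHeaderIfPresent(byte[] value) {
        if (!hasRtfPayload(value)) {
            return Optional.empty();
        }
        return Optional.of(readLittleEndianInt(value, 0));
    }

    public static void main(String[] args) {
        System.out.println(readHeaderIfPresent(new byte[0]));               // prints "Optional.empty"
        System.out.println(readHeaderIfPresent(new byte[] {0x2A, 0, 0, 0})); // prints "Optional[42]"
    }
}
```

Note that, in this sketch, a payload of one to three bytes still passes the guard and still throws, so the non-empty check only covers the empty-array case described in this report.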

      I am not sure whether this is a real fix, or whether more is needed beyond suppressing the error to make sure everything is extracted properly from all files.

      I cannot share the sample file I used for testing because it was given to me as sensitive content, and I have not been able to recreate a faulty .msg file.





            • Assignee:
              tallison@apache.org Tim Allison
              pascal.essiembre Pascal Essiembre
            • Votes:
              0
            • Watchers:
              4


              • Created: