  Tika / TIKA-2530

OutlookExtractor "buffer underrun" when parsing .msg with embedded .msg

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16, 1.17
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: Reproduced with both Tika 1.16 and Tika 1.17 on Windows, but the problem likely occurs on all platforms.

    Description

      When parsing certain .msg files containing certain attachments (e.g. other .msg files), I get this error:

      ...
      Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
              at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:662)
              at org.apache.poi.hmef.CompressedRTF.decompress(CompressedRTF.java:73)
              at org.apache.poi.util.LZWDecompresser.decompress(LZWDecompresser.java:81)
              at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:42)
              at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:270)
      ...
      

      I think the issue is that MAPIRtfAttribute fails when it receives an empty byte array from OutlookExtractor. I was able to eliminate the error by changing the code at around line 269 of OutlookExtractor in Tika 1.16 (around line 322 in Tika 1.17) to the following:

                  //--- START FIX ---
                  ByteChunk chunk = (ByteChunk) rtfChunk;
                  if (chunk != null && chunk.getValue() != null 
                          && chunk.getValue().length > 0 && !doneBody) {
                      //ByteChunk chunk = (ByteChunk) rtfChunk;
                  //--- END FIX ---
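
To illustrate why an empty byte array would produce this kind of failure, here is a minimal, self-contained sketch. It does not use the Tika or POI classes: `RtfGuardSketch`, `hasRtfBytes`, `readIntLE` and `readFirstHeaderField` are hypothetical names, and `readIntLE` is only a stand-in for a little-endian int read that throws when the stream runs out. The assumption, based on the stack trace above, is that decompression begins by reading an int from the start of the data.

```java
import java.io.ByteArrayInputStream;
import java.util.OptionalInt;

// Hypothetical stand-alone sketch; none of these names come from Tika or POI.
class RtfGuardSketch {

    // Mirrors the null / zero-length check from the proposed fix.
    static boolean hasRtfBytes(byte[] value) {
        return value != null && value.length > 0;
    }

    // Stand-in for a little-endian int read that fails on a short stream.
    static int readIntLE(ByteArrayInputStream in) {
        int b0 = in.read();
        int b1 = in.read();
        int b2 = in.read();
        int b3 = in.read();
        // read() returns -1 at end of stream, which makes the OR negative.
        if ((b0 | b1 | b2 | b3) < 0) {
            throw new IllegalStateException("buffer underrun");
        }
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    // Reads the first 4-byte field only when there is something to read.
    static OptionalInt readFirstHeaderField(byte[] value) {
        if (!hasRtfBytes(value)) {
            return OptionalInt.empty(); // skip instead of failing
        }
        return OptionalInt.of(readIntLE(new ByteArrayInputStream(value)));
    }

    public static void main(String[] args) {
        // Empty input is skipped by the guard.
        System.out.println(readFirstHeaderField(new byte[0]).isPresent());
        // A 4-byte little-endian value is read normally.
        System.out.println(readFirstHeaderField(new byte[] {0x10, 0, 0, 0}).getAsInt());
    }
}
```

Note that in this sketch a non-empty array shorter than 4 bytes would still hit the underrun, so a length check like the one above only covers the empty case.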
      

      I am not sure whether that is a real fix, or whether more should be done than just suppressing the error to make sure everything is extracted properly from all files.

      I cannot share the sample file I tested with, since it was given to me as sensitive content, and I have not been able to recreate a faulty .msg file.

      Thanks

      Attachments

        1. test_file.txt
          0.0 kB
          Tomasz L

        Activity

          People

            tallison Tim Allison
            pascal.essiembre Pascal Essiembre
            Votes: 0
            Watchers: 4
