Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.16, 1.17
Environment: Reproduced with both Tika 1.16 and Tika 1.17 on Windows, but the problem likely affects all platforms.
Description
When parsing certain .msg files containing certain attachments (e.g. other .msg files), I get this error:
...
Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
    at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:662)
    at org.apache.poi.hmef.CompressedRTF.decompress(CompressedRTF.java:73)
    at org.apache.poi.util.LZWDecompresser.decompress(LZWDecompresser.java:81)
    at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:42)
    at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:270)
...
I think the issue is that MAPIRtfAttribute does not handle an empty byte array passed to it by OutlookExtractor. I was able to eliminate the error by changing the code at around line 269 of OutlookExtractor in Tika 1.16 (around line 322 in Tika 1.17) to the following:
//--- START FIX ---
ByteChunk chunk = (ByteChunk) rtfChunk;
if (chunk != null && chunk.getValue() != null && chunk.getValue().length > 0 && !doneBody) {
    //ByteChunk chunk = (ByteChunk) rtfChunk;
//--- END FIX ---
I am not sure whether this is a real fix, or whether more needs to be done than suppressing the error to make sure everything is extracted properly from all files.
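To illustrate the guard without needing the sensitive sample file, here is a minimal self-contained sketch. It does not use Tika or POI: `RtfGuardSketch`, `hasUsableRtf`, `readIntLE` and `describe` are hypothetical names, and `readIntLE` is only a stand-in for a 4-byte little-endian header read, assuming that is roughly the kind of read that underruns in the stack trace above.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class RtfGuardSketch {

    // Hypothetical helper mirroring the condition in the proposed fix:
    // only attempt RTF decompression when there are bytes to read and
    // the body has not already been handled.
    static boolean hasUsableRtf(byte[] value, boolean doneBody) {
        return value != null && value.length > 0 && !doneBody;
    }

    // Stand-in for a little-endian 4-byte read (NOT the POI implementation).
    // Throws when fewer than 4 bytes are available.
    static int readIntLE(InputStream in) throws IOException {
        byte[] buf = new byte[4];
        int off = 0;
        while (off < 4) {
            int n = in.read(buf, off, 4 - off);
            if (n < 0) {
                throw new EOFException("buffer underrun");
            }
            off += n;
        }
        return (buf[0] & 0xFF)
                | ((buf[1] & 0xFF) << 8)
                | ((buf[2] & 0xFF) << 16)
                | ((buf[3] & 0xFF) << 24);
    }

    // Applies the guard first, then attempts the header read.
    static String describe(byte[] value, boolean doneBody) {
        if (!hasUsableRtf(value, doneBody)) {
            return "skipped";
        }
        try {
            return "header=" + readIntLE(new ByteArrayInputStream(value));
        } catch (IOException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new byte[0], false));             // skipped
        System.out.println(describe(null, false));                    // skipped
        System.out.println(describe(new byte[] {1, 0, 0, 0}, false)); // header=1
        System.out.println(describe(new byte[] {1, 2}, false));       // failed: buffer underrun
    }
}
```

In this sketch the guard skips null and empty arrays, but a non-empty array shorter than the header still fails, which is one reason the length check alone may not be a complete fix.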
I cannot share the sample file I tested with, since it was given to me as sensitive content, and I have not been able to recreate a faulty .msg file.
Thanks