Description
I have a Word (.doc) document that hits an exception when I run:
java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc
Here's the exception:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
    at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
It happens when we try to parse an OLE10 embedded object. The code
that does this parsing catches and ignores Ole10NativeException and
skips the entry, so I'm wondering if we should also catch
ArrayIndexOutOfBoundsException and skip the entry? I.e., maybe this
entry really is not OLE10, and the Ole10Native code is failing to
throw Ole10NativeException for it?
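
To make the proposal concrete, here is a rough, self-contained sketch of the "catch and skip" idea. It is not the actual Tika or POI code: the class name, parseEntry, parseAll, and the local Ole10NativeException are all hypothetical stand-ins, and parseEntry only imitates an unchecked little-endian read that runs off the end of a short buffer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names are the real Tika/POI classes.
final class SkipMalformedOle10Sketch {

    /** Hypothetical stand-in for POI's checked Ole10NativeException. */
    static class Ole10NativeException extends Exception {
        Ole10NativeException(String message) {
            super(message);
        }
    }

    /**
     * Hypothetical stand-in for the OLE10 parsing step. It reads a
     * little-endian 16-bit value at a fixed offset without checking the
     * buffer length, so a short buffer raises ArrayIndexOutOfBoundsException
     * instead of Ole10NativeException.
     */
    static int parseEntry(byte[] data) throws Ole10NativeException {
        if (data.length == 0) {
            throw new Ole10NativeException("empty entry");
        }
        return (data[4] & 0xFF) | ((data[5] & 0xFF) << 8);
    }

    /** Parses every entry, skipping those that fail in either way. */
    static List<Integer> parseAll(List<byte[]> entries) {
        List<Integer> parsed = new ArrayList<>();
        for (byte[] entry : entries) {
            try {
                parsed.add(parseEntry(entry));
            } catch (Ole10NativeException e) {
                // Existing behaviour: not a valid OLE10 entry, skip it.
            } catch (ArrayIndexOutOfBoundsException e) {
                // Proposed addition: treat an out-of-bounds read the same way.
            }
        }
        return parsed;
    }
}
```

With this shape, a truncated entry is skipped just like one that fails with Ole10NativeException, and the rest of the document still gets parsed. The alternative would be to fix the parser itself so that it throws Ole10NativeException when the data is too short.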