Created attachment 27617 [details]
Patch for issue

Found a handful of Word files that cause an ArrayIndexOutOfBoundsException. (I am unable to attach a sample due to the sensitive nature of the files.) A patch is included. Essentially, the pictureBytesStartOffset used for a System.arraycopy() call is sometimes set to a negative value. The fix keeps the existing less-than check but also makes sure the value is greater than zero before using it in place of the default PICTF1BlockOffset.

Stack trace (POI-3.8-beta4):

Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.usermodel.Picture.fillRawImageContent(Picture.java:363)
    at org.apache.poi.hwpf.usermodel.Picture.getRawContent(Picture.java:203)
    at org.apache.poi.hwpf.usermodel.Picture.fillImageContent(Picture.java:372)
    at org.apache.poi.hwpf.usermodel.Picture.getContent(Picture.java:191)
    at org.apache.poi.hwpf.usermodel.Picture.suggestPictureType(Picture.java:330)
    at org.apache.poi.hwpf.usermodel.Picture.suggestFileExtension(Picture.java:315)
    at org.apache.poi.hwpf.usermodel.Picture.suggestFullFileName(Picture.java:150)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:504)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:488)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:81)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:196)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 45 more
Created attachment 27619 [details]
Replaces initial patch.. improved

Changes the check to greater than or equal to 0.
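For readers without access to the attachments, here is a minimal sketch of the kind of guard the patch describes. It is not the actual POI code: the class name OffsetGuard, the method names, and the choice of the buffer length as the upper bound for the existing less-than check are all assumptions made for illustration. The idea is only that a candidate start offset is used when it is both non-negative and below the limit, and the default offset is used otherwise.

```java
import java.util.Arrays;

// Hypothetical illustration only; names and the upper bound are assumptions,
// not taken from org.apache.poi.hwpf.usermodel.Picture.
public class OffsetGuard {

    // Returns the candidate offset only if it is >= 0 and below the limit;
    // otherwise falls back to the default offset.
    public static int chooseStartOffset(int candidate, int defaultOffset, int limit) {
        if (candidate >= 0 && candidate < limit) {
            return candidate;
        }
        return defaultOffset;
    }

    // Copies data[start..end) into a new array using System.arraycopy.
    // A negative start here is what raises ArrayIndexOutOfBoundsException.
    public static byte[] copyRange(byte[] data, int start, int end) {
        byte[] out = new byte[end - start];
        System.arraycopy(data, start, out, 0, out.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30, 40, 50};
        int defaultOffset = 1;

        // A negative candidate is rejected, so the copy starts at the default.
        int start = chooseStartOffset(-3, defaultOffset, data.length);
        System.out.println(Arrays.toString(copyRange(data, start, data.length)));
    }
}
```

Accepting 0 as a valid offset matches the second attachment's change from "greater than zero" to "greater than or equal to 0".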
Please provide an example doc. It is possible we will be able to handle the image correctly. You can send it to my email privately.
*** Bug 51890 has been marked as a duplicate of this bug. ***
Image loading is completely rewritten. Please, check r1177710 or later.
Works like a charm. (In reply to comment #4) > Image loading is completely rewritten. Please, check r1177710 or later.
Agreed, it definitely works for the files I was previously having an issue with. Thanks very much for your attention to this matter.
Sergey,

Thanks for this rework of the Picture handling logic for MS Word documents. It seems to have fixed many of the random bugs that would pop up across a large and varying data set. I did, however, uncover one bug that was introduced by these fixes, and have supplied a patch. When you get a chance, could you please take a look at Bug 51974 (https://issues.apache.org/bugzilla/show_bug.cgi?id=51974)? It's essentially a null pointer exception, not present before the rewrite, that is encountered when parsing text via Tika.

Thanks in advance,
Jeremy

(In reply to comment #4)
> Image loading is completely rewritten. Please, check r1177710 or later.