When parsing the attached MS Word 8.0 file via Tika, I get the following exception:

$ java -jar tika-app-1.0-SNAPSHOT.jar http://www.arb.ca.gov/msprog/smogcheck/july00/iiif.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@44aea710
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 610125
        at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:45)
        at org.apache.poi.ddf.EscherRecord$EscherRecordHeader.readHeader(EscherRecord.java:250)
        at org.apache.poi.ddf.DefaultEscherRecordFactory.createRecord(DefaultEscherRecordFactory.java:56)
        at org.apache.poi.hwpf.model.PicturesTable.searchForPictures(PicturesTable.java:169)
        at org.apache.poi.hwpf.model.PicturesTable.searchForPictures(PicturesTable.java:180)
        at org.apache.poi.hwpf.model.PicturesTable.searchForPictures(PicturesTable.java:180)
        at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:207)
        at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:430)
        at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:420)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        ... 5 more
Created attachment 26777 [details] word file which causes error, as downloaded from http://www.arb.ca.gov/msprog/smogcheck/july00/iiif.doc
The problem is not reproducible with the latest build from trunk. I added a unit test and included the attached document in our collection of test documents.

Yegor
Issue reopened, tested with r1175705 from trunk (through tika):

java.lang.ArrayIndexOutOfBoundsException: 70185
        at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:45)
        at org.apache.poi.ddf.DefaultEscherRecordFactory.createRecord(DefaultEscherRecordFactory.java:60)
        at org.apache.poi.hwpf.model.PicturesTable.searchForPictures(PicturesTable.java:182)
        at org.apache.poi.hwpf.model.PicturesTable.searchForPictures(PicturesTable.java:193)
        at org.apache.poi.hwpf.model.PicturesTable.searchForPictures(PicturesTable.java:193)
        at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:220)
        at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:498)
        at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:488)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:81)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
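
For illustration only: both traces end in LittleEndian.getShort being asked to read at an offset past the end of the byte array while an Escher record header is being created. The standalone sketch below shows the kind of bounds check that would turn that into a "no record here" result instead of an exception. It does not use POI and is not the fix applied in POI; the class EscherHeaderGuard and its methods are hypothetical names made up for this example. The only format detail assumed is that an Escher record header is 8 bytes (2 bytes options, 2 bytes record id, 4 bytes length), all little-endian.

```java
public class EscherHeaderGuard {
    // Assumed header layout: options (2 bytes), record id (2 bytes), length (4 bytes).
    static final int HEADER_SIZE = 8;

    /** Returns true only if a complete 8-byte header fits in data starting at offset. */
    static boolean hasFullHeader(byte[] data, int offset) {
        return offset >= 0 && offset <= data.length - HEADER_SIZE;
    }

    /** Reads an unsigned little-endian 16-bit value; the caller must check bounds first. */
    static int readUShortLE(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /** Returns the record id at offset, or -1 if the header would run past the array. */
    static int recordIdOrMinusOne(byte[] data, int offset) {
        if (!hasFullHeader(data, offset)) {
            return -1; // hypothetical "skip this candidate" signal
        }
        return readUShortLE(data, offset + 2);
    }

    public static void main(String[] args) {
        // A single 8-byte header: options 0x000F, record id 0xF004, length 0.
        byte[] stream = {0x0F, 0x00, 0x04, (byte) 0xF0, 0x00, 0x00, 0x00, 0x00};
        System.out.println(Integer.toHexString(recordIdOrMinusOne(stream, 0))); // prints f004
        System.out.println(recordIdOrMinusOne(stream, 70185));                  // prints -1
    }
}
```

With a check like this, a caller scanning for pictures could skip an offset that points outside the stream rather than abort the whole parse. Whether skipping is the right behaviour for a corrupt or unusual file is a separate decision.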
Created attachment 27594 [details] Throwing ArrayIndexOutOfBoundsException
Created attachment 27595 [details] Another one Throwing ArrayIndexOutOfBoundsException
(In reply to comment #5)
> Created attachment 27595 [details]
> Another one Throwing ArrayIndexOutOfBoundsException

This one is failing validation:

<BFFValidation path="Bug50936_3.doc" datetime="10/30/11 03:16:10" result="FAILED">
  <ParseStack>
  <!-- skipped -->
  <Type docName="MS-DOC" sectionTitle="Section Properties" msdnLink="http://msdn.microsoft.com/en-us/library/46c3ec54-53ff-4c0a-b0d6-07ad15d2546e" streamName="WordDocument" streamOffset="166413" hexStreamOffset="0x28a0d"/>
  <LastData><![CDATA[ 3A -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- : ]]></LastData>
</BFFValidation>
*** This bug has been marked as a duplicate of bug 47958 ***