Created attachment 27727 (patch)

A coworker and I ran into some ArrayIndexOutOfBoundsExceptions while processing ppt files using POI (by way of Tika), and tracked the problem down to some of the code in VariantSupport. I can't attach the actual ppt files that provoked the problem, but I'll see if I can create some new ones.
1093
  [Thrown class java.lang.ArrayIndexOutOfBoundsException]

Restarts:
  0: [QUIT] Quit to the SLIME top level
  1: [ABORT] ABORT to SLIME level 0

Backtrace:
  0: org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:262)
  1: org.apache.poi.hpsf.Property.<init>(Property.java:164)
  2: org.apache.poi.hpsf.Section.<init>(Section.java:277)
  3: org.apache.poi.hpsf.PropertySet.init(PropertySet.java:452)
  4: org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:247)
  5: org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:67)
  6: org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:58)
  7: org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
  8: org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  9: org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  10: org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)

----

Sadly, I cannot recreate the issue in a new ppt -- a new one created in LibreOffice doesn't display the issue, and if I make changes to the one that I have to anonymize it, then the saved version doesn't display the problem either. :(
Which versions of Tika and POI are you using? Can you obfuscate the problematic ppt file? The problem is in reading the document properties, and those are all we need to track it down. Open the problematic ppt file in PowerPoint, delete all content from the slides, and save. Is the resulting file readable by Tika?

Yegor
No, that won't work. If I open, edit, and save the file, then it no longer displays the problem. I will try to track down the file again, and if I can find it, I would be willing to set up an appointment for a developer to ssh into my machine and work with the file here.

In the meantime, though, I'm curious whether the project has any test files for which the aforementioned code is actually correct. It seems to me that it's just written incorrectly and wouldn't work for any test file that exercises the code (but, of course, these being office formats, I wouldn't be surprised if I'm completely wrong).
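
For illustration only, here is a minimal sketch of the kind of bounds check that avoids this class of exception when reading a length-prefixed value from a byte array. It is not the attached patch and it is not POI's code: the class BoundedRead and both method names are hypothetical, and it assumes the failure comes from a declared length that exceeds the bytes actually present in the buffer, which this thread does not confirm.

```java
import java.util.Arrays;

// Hypothetical helper, not part of POI or Tika.
final class BoundedRead {

    // Reads a 4-byte little-endian unsigned integer, checking bounds first.
    static long readUInt32LE(byte[] src, int offset) {
        if (offset < 0 || offset > src.length - 4) {
            throw new IllegalArgumentException(
                "need 4 bytes at offset " + offset + ", buffer has " + src.length);
        }
        return (src[offset] & 0xFFL)
             | (src[offset + 1] & 0xFFL) << 8
             | (src[offset + 2] & 0xFFL) << 16
             | (src[offset + 3] & 0xFFL) << 24;
    }

    // Reads a value stored as a 4-byte length followed by that many bytes.
    // Assumption: if the declared length overruns the buffer, return only
    // the bytes that are really there instead of indexing past the end.
    static byte[] readLengthPrefixed(byte[] src, int offset) {
        long declared = readUInt32LE(src, offset);
        int start = offset + 4;
        long available = src.length - start;
        int n = (int) Math.min(declared, available);
        return Arrays.copyOfRange(src, start, start + n);
    }

    public static void main(String[] args) {
        // The length field claims 10 bytes, but only 3 follow it.
        byte[] truncated = {10, 0, 0, 0, 1, 2, 3};
        System.out.println(readLengthPrefixed(truncated, 0).length);
    }
}
```

Clamping to the available bytes is one possible design choice; throwing a descriptive checked exception on a malformed length would be an equally reasonable alternative.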
The patch can't be applied anymore - please test if the latest version of POI is still affected and ideally add a sample file.