Bug 51992

Summary: VariantSupport confuses bytes and chars for LPWSTR
Product: POI
Reporter: Joe Gallo <jsg8pitt>
Component: HPSF
Assignee: POI Developers List <dev>
Status: RESOLVED LATER    
Severity: normal    
Priority: P2    
Version: 3.8-dev   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: patch

Description Joe Gallo 2011-10-07 14:53:09 UTC
Created attachment 27727
patch

A coworker and I ran into some ArrayIndexOutOfBoundsExceptions while 
processing ppt files using POI (by way of Tika), and tracked the problem 
down to some of the code in VariantSupport.
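
To illustrate the kind of mix-up named in the summary, here is a minimal sketch. It is not the POI source and not the attached patch. It assumes that the 4-byte length prefix of a VT_LPWSTR value counts 16-bit characters (including the terminating NUL), so the payload occupies twice that many bytes. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the POI API.
final class LpwstrSketch {

    /**
     * Reads a length-prefixed UTF-16LE string starting at offset.
     * Assumption: the prefix is a little-endian int giving the number of
     * 16-bit characters, not the number of bytes.
     */
    static String readLpwstr(byte[] src, int offset) {
        if (offset < 0 || offset > src.length - 4) {
            throw new IllegalArgumentException("no room for length prefix");
        }
        int charCount = ByteBuffer.wrap(src, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        int start = offset + 4;

        // Validate against the buffer in characters, so the check matches
        // the unit of the prefix and the multiplication below cannot overflow.
        if (charCount < 0 || charCount > (src.length - start) / 2) {
            throw new IllegalArgumentException("declared length exceeds buffer");
        }

        // Each character is two bytes; using charCount directly as a byte
        // count (or a byte count as a character count) is the unit confusion.
        int byteCount = charCount * 2;
        String s = new String(src, start, byteCount, StandardCharsets.UTF_16LE);

        // Drop the terminating NUL and anything after it.
        int nul = s.indexOf('\0');
        return nul >= 0 ? s.substring(0, nul) : s;
    }
}
```

Checking the declared length before indexing turns a malformed or misread property into an explicit error rather than an ArrayIndexOutOfBoundsException deep inside the parser.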

I can't attach the actual ppt files that provoked the problem, but I'll see if I can create some new ones.
Comment 1 Joe Gallo 2011-10-07 15:19:15 UTC
1093 [Thrown class java.lang.ArrayIndexOutOfBoundsException] 
 
Restarts: 
 0: [QUIT] Quit to the SLIME top level  
 1: [ABORT] ABORT to SLIME level 0   
 
Backtrace:   
  0: org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:262) 
  1: org.apache.poi.hpsf.Property.<init>(Property.java:164)  
  2: org.apache.poi.hpsf.Section.<init>(Section.java:277) 
  3: org.apache.poi.hpsf.PropertySet.init(PropertySet.java:452) 
  4: org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:247)  
  5: org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:67)   
  6: org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:58)  
  7: org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)   
  8: org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
  9: org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
 10: org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129) 

----

Sadly, I cannot recreate the issue in a new ppt -- a new one created in LibreOffice doesn't exhibit the issue, and if I change the one I have in order to anonymize it, the saved version doesn't exhibit the problem either. :(
Comment 2 Yegor Kozlov 2012-02-06 07:06:38 UTC
Which version of Tika and POI? 

Can you obfuscate the problematic ppt file? The problem is with reading the document properties, and that is all we need to track it down.
Open the problematic ppt file in PowerPoint, delete all content from the slides, and save. Is the file readable by Tika?

Yegor
Comment 3 Joe Gallo 2012-02-06 14:49:06 UTC
No, that won't work. If I open, edit, and save the file, it no longer displays the problem.

I will try to track down the file again, and if I can find it, I would be willing to set up an appointment for a developer to ssh into my machine and work with the file here.

In the meantime, though, I'm curious if there are any test files for the project for which the aforementioned code is actually correct -- it seems to me that it's just written incorrectly and wouldn't work for any test file that exercises the code (but, of course, these being office formats, I wouldn't be surprised if I'm completely wrong).
Comment 4 Andreas Beeker 2016-05-14 21:25:50 UTC
The patch can't be applied anymore - please test if the latest version of POI 
is still affected and ideally add a sample file.