Bug 51992 - VariantSupport confuses bytes and chars for LPWSTR
Status: NEW
Product: POI
Classification: Unclassified
Component: HSLF
Version: 3.8-dev
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assigned To: POI Developers List
Depends on:
Blocks:
Reported: 2011-10-07 14:53 UTC by Joe Gallo
Modified: 2012-02-06 14:49 UTC (History)

Attachments
patch (875 bytes, application/octet-stream)
2011-10-07 14:53 UTC, Joe Gallo

Description Joe Gallo 2011-10-07 14:53:09 UTC
Created attachment 27727: patch

A coworker and I ran into some ArrayIndexOutOfBoundsExceptions while 
processing ppt files using POI (by way of Tika), and tracked the problem 
down to some of the code in VariantSupport.

I can't attach the actual ppt files that provoked the problem, but I'll see if I can create some new ones.
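
To illustrate the kind of byte/char mix-up the title refers to, here is a minimal sketch. It is not POI's actual code: the class LpwstrSketch and both of its methods are hypothetical. It assumes the LPWSTR layout is a 4-byte little-endian length that counts 16-bit characters (including the null terminator), followed by UTF-16LE data. Under that assumption the number of bytes to read is twice the stored length, and treating one unit as the other can index past the end of the array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of POI.
public class LpwstrSketch {

    // Encodes a string as: 4-byte little-endian char count
    // (including the null terminator), then UTF-16LE data.
    public static byte[] encode(String s) {
        byte[] chars = (s + '\0').getBytes(StandardCharsets.UTF_16LE);
        ByteBuffer buf = ByteBuffer.allocate(4 + chars.length)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(chars.length / 2); // count of 16-bit chars, NOT bytes
        buf.put(chars);
        return buf.array();
    }

    // Reads a string written by encode(), starting at the given offset.
    public static String readLpwstr(byte[] src, int offset) {
        int charCount = ByteBuffer.wrap(src, offset, 4)
                                  .order(ByteOrder.LITTLE_ENDIAN)
                                  .getInt();
        // Convert chars to bytes explicitly; use long to avoid overflow.
        long end = (long) offset + 4 + 2L * charCount;
        if (charCount < 0 || end > src.length) {
            throw new IllegalArgumentException("LPWSTR length exceeds buffer");
        }
        String s = new String(src, offset + 4, charCount * 2,
                              StandardCharsets.UTF_16LE);
        // Drop the null terminator and anything after it.
        int nul = s.indexOf('\0');
        return nul >= 0 ? s.substring(0, nul) : s;
    }

    public static void main(String[] args) {
        byte[] buf = encode("Tika");
        System.out.println(buf.length);         // 4 + 5 * 2 = 14
        System.out.println(readLpwstr(buf, 0)); // Tika
    }
}
```

The explicit bounds check turns a bad length into a descriptive exception rather than an ArrayIndexOutOfBoundsException deep inside the reader.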
Comment 1 Joe Gallo 2011-10-07 15:19:15 UTC
1093 [Thrown class java.lang.ArrayIndexOutOfBoundsException] 
 
Restarts: 
 0: [QUIT] Quit to the SLIME top level  
 1: [ABORT] ABORT to SLIME level 0   
 
Backtrace:   
  0: org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:262) 
  1: org.apache.poi.hpsf.Property.<init>(Property.java:164)  
  2: org.apache.poi.hpsf.Section.<init>(Section.java:277) 
  3: org.apache.poi.hpsf.PropertySet.init(PropertySet.java:452) 
  4: org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:247)  
  5: org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:67)   
  6: org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:58)  
  7: org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)   
  8: org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
  9: org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) 
 10: org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129) 

----

Sadly, I cannot recreate the issue in a new ppt. A new one created in LibreOffice doesn't display the issue, and if I change the file I have in order to anonymize it, the saved version doesn't display the problem either. :(
Comment 2 Yegor Kozlov 2012-02-06 07:06:38 UTC
Which versions of Tika and POI are you using?

Can you obfuscate the problematic ppt file? The problem is with reading the document properties, and that is all we need to track it down.
Open the problematic ppt file in PowerPoint, delete all content from the slides, and save. Is the file readable by Tika?

Yegor
Comment 3 Joe Gallo 2012-02-06 14:49:06 UTC
No, that won't work. If I open, edit, and save the file, then it no longer displays the problem.

I will try to track down the file again, and if I can find it, I would be willing to set up an appointment for a developer to ssh into my machine and work with the file here.

In the meantime, though, I'm curious if there are any test files for the project for which the aforementioned code is actually correct -- it seems to me that it's just written incorrectly and wouldn't work for any test file that exercises the code (but, of course, these being office formats, I wouldn't be surprised if I'm completely wrong).