Description
We see that Tika throws a long list of errors when extracting text from PPT files. We tested with the standalone Tika application (1.13) and could not reproduce the issue there.
We took a look at the POI source code and, in the class "HSLFSlideShow", observed the deprecated method below:
/**
- * Get the lookup from slide numbers to their offsets inside
- * _ptrData, used when adding or moving slides.
- *
- * @deprecated since POI 3.11, not supported anymore
- */
- @Deprecated
- public Hashtable<Integer,Integer> getSlideOffsetDataLocationsLookup() {
-     throw new UnsupportedOperationException("PersistPtrHolder.getSlideOffsetDataLocationsLookup() is not supported since 3.12-Beta1");
- }
We suspect that the Tika library is still calling this deprecated method, which would cause the following runtime exception:
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@204c3b78
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.searchtechnologies.aspire.docprocessing.extracttext.ExtractTextStage.process(ExtractTextStage.java:140)
... 14 more
Caused by: java.lang.UnsupportedOperationException
at java.util.AbstractMap$SimpleImmutableEntry.setValue(Unknown Source)
at org.apache.poi.hslf.HSLFSlideShow.read(HSLFSlideShow.java:293)
at org.apache.poi.hslf.HSLFSlideShow.buildRecords(HSLFSlideShow.java:273)
at org.apache.poi.hslf.HSLFSlideShow.<init>(HSLFSlideShow.java:188)
at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
... 17 more
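
For reference, the innermost frame of the trace above is `java.util.AbstractMap$SimpleImmutableEntry.setValue`, reached from `HSLFSlideShow.read`. The JDK documents that method as always throwing `UnsupportedOperationException`. The following is a minimal sketch that reproduces only that JDK behaviour in isolation; it does not use POI or Tika, and the class name `ImmutableEntryDemo` and the sample values are hypothetical, chosen for illustration.

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical demo class; not part of POI or Tika.
public class ImmutableEntryDemo {

    // Returns true if calling setValue on a SimpleImmutableEntry
    // throws UnsupportedOperationException, as seen in the trace.
    static boolean setValueIsUnsupported() {
        Map.Entry<Integer, Integer> entry =
                new AbstractMap.SimpleImmutableEntry<>(1, 100);
        try {
            entry.setValue(200); // immutable entry: the JDK rejects this call
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("setValue unsupported: " + setValueIsUnsupported());
    }
}
```

If this matches what happens inside `HSLFSlideShow.read`, the exception would originate from modifying an immutable map entry rather than from the deprecated method itself; this is only an observation from the trace and has not been verified against the POI source.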