Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
Description
I have an example PPT that embeds a chart, but Tika misidentifies the
embedded object as an XLS document.
The progID (oleShape.getProgID() in
HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8, yet
we seem to detect it as Excel (application/vnd.ms-excel), and the
ExcelExtractor then hits this exception:
org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
    at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
    at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
    at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
    at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
    at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
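A minimal sketch of one possible guard, assuming the decision can be made from the progID alone. The class ProgIdSketch, the method guessMediaType, and the application/octet-stream fallback are hypothetical and are not Tika or POI API. Only the progID MSGraph.Chart.8 and the media type application/vnd.ms-excel come from this report. Excel.Sheet is the usual progID prefix for embedded Excel worksheets.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Tika or POI.
class ProgIdSketch {
    // Media type named in this report for Excel documents.
    static final String EXCEL = "application/vnd.ms-excel";
    // Placeholder fallback chosen for this sketch (an assumption).
    static final String UNKNOWN = "application/octet-stream";

    // Map an OLE progID to a media type without treating charts as workbooks.
    static String guessMediaType(String progId) {
        if (progId == null) {
            return UNKNOWN;
        }
        String p = progId.toLowerCase(Locale.ROOT);
        if (p.startsWith("excel.sheet")) {
            return EXCEL;
        }
        // MSGraph.Chart.* and anything else is not routed to the Excel extractor.
        return UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(guessMediaType("MSGraph.Chart.8"));
        System.out.println(guessMediaType("Excel.Sheet.8"));
    }
}
```

With a check like this, the MSGraph.Chart.8 object from the example file would never reach ExcelExtractor, so the RecordFormatException above would not be triggered.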
Since DelegatingParser silently suppresses all exceptions, when you
run TikaCLI you will see neither an exception nor any extracted text.
If you run with -z, however, it saves the embedded object as 1.xls, and
parsing that file with TikaCLI hits the above exception.
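The effect of swallowing exceptions can be sketched in plain Java. This is not Tika's DelegatingParser code. EmbeddedParser, parseSilently, and parseRecording are hypothetical names used only to show why the failure is invisible, and how recording the exception would surface it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names exist in Tika.
class SuppressionSketch {
    // Stand-in for a parser of an embedded document.
    interface EmbeddedParser {
        String parse(byte[] data) throws Exception;
    }

    // Swallows every failure: the caller sees empty text and no exception.
    static String parseSilently(EmbeddedParser parser, byte[] data) {
        try {
            return parser.parse(data);
        } catch (Exception e) {
            return "";
        }
    }

    // Same fallback, but the failure is kept so it can be reported later.
    static String parseRecording(EmbeddedParser parser, byte[] data, List<Exception> failures) {
        try {
            return parser.parse(data);
        } catch (Exception e) {
            failures.add(e);
            return "";
        }
    }

    public static void main(String[] args) {
        EmbeddedParser failing = data -> {
            throw new IllegalStateException("Unable to construct record instance");
        };
        List<Exception> failures = new ArrayList<>();
        System.out.println("silent: [" + parseSilently(failing, new byte[0]) + "]");
        System.out.println("recorded: [" + parseRecording(failing, new byte[0], failures) + "]");
        System.out.println("failures seen: " + failures.size());
    }
}
```

In both variants the caller gets no text, but only the recording variant leaves a trace of what went wrong.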
Attachments
Issue Links
- is duplicated by: TIKA-1651 Add mime detection (and parsing?) for Microsoft Chart object (Resolved)