Tika / TIKA-1033

Tika doesn't parse embedded OLE Chart/Graph objects


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      I have an example PPT file that embeds a chart, but Tika misidentifies
      the embedded chart as an XLS document.

      The progID (oleShape.getProgID() in
      HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8, yet we
      seem to detect the object as Excel (application/vnd.ms-excel), and the
      ExcelExtractor then hits this exception:

      org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
      	at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
      	at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
      	at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
      	at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
      	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
      	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
      	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
      	at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
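
The mismatch described above (progID MSGraph.Chart.8, detected type application/vnd.ms-excel) suggests that the progID could be consulted before the detected type is trusted. The following is a minimal, self-contained Java sketch of that idea and is not Tika's actual code. The class ProgIdSketch, the method resolveType, the lookup table, and the chart media type string are all hypothetical names introduced for illustration; only the MSGraph.Chart.8 progID and the Excel media type come from the report.

```java
import java.util.Map;

final class ProgIdSketch {
    // Placeholder media type for MS Graph charts (an assumption, not a registered type).
    static final String CHART_TYPE = "application/x-msgraph-chart-placeholder";
    // Media type reported in the issue for the misdetected object.
    static final String EXCEL_TYPE = "application/vnd.ms-excel";

    // Hypothetical progID table; entries other than MSGraph.Chart.8 are illustrative.
    static final Map<String, String> BY_PROG_ID = Map.of(
            "MSGraph.Chart.8", CHART_TYPE,
            "Excel.Sheet.8", EXCEL_TYPE);

    /** Prefers the type hinted by the progID; falls back to the detected type. */
    static String resolveType(String progId, String detectedType) {
        if (progId == null) {
            return detectedType;
        }
        return BY_PROG_ID.getOrDefault(progId, detectedType);
    }

    public static void main(String[] args) {
        // The chart progID overrides the (wrong) Excel detection.
        System.out.println(resolveType("MSGraph.Chart.8", EXCEL_TYPE));
        // An unknown progID leaves the detected type untouched.
        System.out.println(resolveType("Unknown.Thing.1", EXCEL_TYPE));
    }
}
```

With a hint like this, a chart object would no longer be routed to the Excel parser, though whether a chart-specific parser exists to handle it is a separate question that this sketch does not address.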
      

      Because DelegatingParser silently suppresses all exceptions, running
      TikaCLI on the file shows neither an exception nor any extracted text.
      Running it with -z, however, saves the embedded object as 1.xls, and
      parsing that file with TikaCLI hits the exception above.
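
The effect of suppressing exceptions during delegation can be illustrated with a small stand-alone Java sketch. It does not use Tika's classes: EmbeddedParser, parseSilently, and parseRecording are hypothetical names, and the failing parser is a stub that only mimics the reported error message. The point is the contrast between a wrapper that swallows the failure and one that records it for later reporting.

```java
import java.util.ArrayList;
import java.util.List;

final class SuppressionSketch {
    /** Hypothetical stand-in for a parser that handles one embedded document. */
    interface EmbeddedParser {
        String parse(String name) throws Exception;
    }

    /** Swallows every failure: the caller gets no text and no error. */
    static String parseSilently(EmbeddedParser parser, String name) {
        try {
            return parser.parse(name);
        } catch (Exception e) {
            return "";
        }
    }

    /** Same delegation, but each failure is recorded so it can be surfaced. */
    static String parseRecording(EmbeddedParser parser, String name, List<String> failures) {
        try {
            return parser.parse(name);
        } catch (Exception e) {
            failures.add(name + ": " + e.getMessage());
            return "";
        }
    }

    public static void main(String[] args) {
        // Stub that always fails, mimicking the message from the report.
        EmbeddedParser failing = name -> {
            throw new IllegalStateException("Unable to construct record instance");
        };
        List<String> failures = new ArrayList<>();
        System.out.println("silent: [" + parseSilently(failing, "1.xls") + "]");
        System.out.println("recorded: [" + parseRecording(failing, "1.xls", failures) + "]");
        System.out.println("failures: " + failures);
    }
}
```

Both wrappers return empty text for the failing document, but only the second leaves a trace that a tool could print or log.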

      Attachments

        1. emb.ppt
          88 kB
          Michael McCandless
        2. testMSChart-govdocs-428996.pptx
          55 kB
          Tim Allison

            People

              Assignee: Unassigned
              Reporter: Michael McCandless (mikemccand)
              Votes: 0
              Watchers: 6
