TIKA-1033

Tika doesn't parse embedded OLE Chart/Graph objects

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

      Description

      I have an example PPT that embeds a chart, but Tika misidentifies the
      embedded chart as an XLS document.

      The progID (oleShape.getProgID() in
      HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8, yet we
      seem to detect the object as Excel (application/vnd.ms-excel), and the
      ExcelExtractor then hits this exception:

      org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
      	at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
      	at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
      	at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
      	at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
      	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
      	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
      	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
      	at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
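
The progID mentioned above already distinguishes the chart from a real workbook, so one way to avoid routing it to the ExcelExtractor would be to check the progID first. The following is only a minimal sketch of that idea: ProgIdSketch, EmbeddedKind, classify and shouldUseExcelExtractor are hypothetical names, not Tika or POI APIs, and the "Excel.Sheet" prefix is an assumption about how workbook progIDs look.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Tika or POI. */
final class ProgIdSketch {

    /** Coarse categories for an embedded OLE object (names are assumptions). */
    enum EmbeddedKind { EXCEL_WORKBOOK, MS_GRAPH_CHART, UNKNOWN }

    /** Classifies an embedded OLE object from its progID string. */
    static EmbeddedKind classify(String progId) {
        if (progId == null) {
            return EmbeddedKind.UNKNOWN;
        }
        String id = progId.trim().toLowerCase(Locale.ROOT);
        if (id.startsWith("msgraph.chart")) {
            // Covers MSGraph.Chart.8, the progID quoted in this report.
            return EmbeddedKind.MS_GRAPH_CHART;
        }
        if (id.startsWith("excel.sheet")) {
            // Assumed progID family for real Excel workbooks.
            return EmbeddedKind.EXCEL_WORKBOOK;
        }
        return EmbeddedKind.UNKNOWN;
    }

    /** Hands the stream to a spreadsheet extractor only for real workbooks. */
    static boolean shouldUseExcelExtractor(String progId) {
        return classify(progId) == EmbeddedKind.EXCEL_WORKBOOK;
    }

    public static void main(String[] args) {
        System.out.println(classify("MSGraph.Chart.8"));
        System.out.println(shouldUseExcelExtractor("MSGraph.Chart.8"));
    }
}
```

A check like this would only stop the misrouting; extracting text from the chart itself would still need a dedicated handler.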
      

      Since DelegatingParser silently suppresses all exceptions, running
      TikaCLI shows neither an exception nor any extracted text. If you run
      it with -z, however, it saves the embedded object as 1.xls, and parsing
      that file with TikaCLI hits the exception above.
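
To illustrate why nothing shows up in the TikaCLI output, here is a minimal sketch of the swallowing behavior described above, next to a variant that records the failure. All names (SilentDelegationSketch, EmbeddedParser, parseSilently, parseRecording) are hypothetical stand-ins and do not reflect how DelegatingParser is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for illustration; these are not Tika's real classes. */
final class SilentDelegationSketch {

    /** Minimal parser contract used only by this sketch. */
    interface EmbeddedParser {
        String parse(byte[] data) throws Exception;
    }

    /** Mimics an extractor that fails on a chart stream labeled as Excel. */
    static final EmbeddedParser FAILING_EXCEL = data -> {
        throw new IllegalStateException("Unable to construct record instance");
    };

    /** Swallows every failure: the caller sees no text and no error. */
    static String parseSilently(EmbeddedParser delegate, byte[] data) {
        try {
            return delegate.parse(data);
        } catch (Exception e) {
            return "";
        }
    }

    /** Same fallback, but keeps a note of the failure for later reporting. */
    static String parseRecording(EmbeddedParser delegate, byte[] data, List<String> errors) {
        try {
            return delegate.parse(data);
        } catch (Exception e) {
            errors.add(e.getClass().getSimpleName() + ": " + e.getMessage());
            return "";
        }
    }

    public static void main(String[] args) {
        byte[] embedded = new byte[0];
        System.out.println("silent: [" + parseSilently(FAILING_EXCEL, embedded) + "]");

        List<String> errors = new ArrayList<>();
        parseRecording(FAILING_EXCEL, embedded, errors);
        System.out.println("recorded: " + errors);
    }
}
```

With the silent variant an empty result is indistinguishable from an embedded object that simply has no text, which is why saving the object with -z and parsing it directly was needed to see the failure.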

        Attachments

        1. testMSChart-govdocs-428996.pptx
          55 kB
          Tim Allison
        2. emb.ppt
          88 kB
          Michael McCandless

              People

              • Assignee: Unassigned
              • Reporter: Michael McCandless (mikemccand)
              • Votes: 0
              • Watchers: 6
