TIKA-1033

Tika doesn't parse embedded OLE Chart/Graph objects

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      I have an example ppt that embeds a chart, but Tika mis-identifies it
      as an XLS document.

      The progID (oleShape.getProgID() in
      HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and
      we seem to detect it as Excel (application/vnd.ms-excel) but then the
      ExcelExtractor hits this exception:

      org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
      	at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
      	at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
      	at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
      	at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
      	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
      	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
      	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
      	at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
      

      Since DelegatingParser silently suppresses all exceptions, running
      TikaCLI shows neither an exception nor any extracted text. If you
      run with -z, however, it saves 1.xls, and parsing that file with
      TikaCLI hits the above exception.
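
To illustrate why the failure is invisible from the command line, here is a minimal sketch in plain Java. It is not Tika code: `SuppressionSketch`, `EmbeddedHandler`, `parseSilently` and `parseRecording` are hypothetical names invented for this example. It only shows the general pattern of a delegating wrapper that swallows exceptions, next to a variant that records them.

```java
import java.util.ArrayList;
import java.util.List;

final class SuppressionSketch {
    /** Hypothetical stand-in for a parser of one embedded resource. */
    interface EmbeddedHandler {
        String extract(String resourceName) throws Exception;
    }

    /** Swallows every failure: the caller sees empty text and no error. */
    static String parseSilently(EmbeddedHandler handler, String name) {
        try {
            return handler.extract(name);
        } catch (Exception e) {
            return "";
        }
    }

    /** Same delegation, but each failure is recorded for the caller. */
    static String parseRecording(EmbeddedHandler handler, String name, List<String> errors) {
        try {
            return handler.extract(name);
        } catch (Exception e) {
            errors.add(name + ": " + e.getMessage());
            return "";
        }
    }

    public static void main(String[] args) {
        // Simulated embedded parser that always fails, like the chart case here.
        EmbeddedHandler failing = name -> {
            throw new IllegalStateException("Unable to construct record instance");
        };
        List<String> errors = new ArrayList<>();
        System.out.println("[" + parseSilently(failing, "1.xls") + "]");
        parseRecording(failing, "1.xls", errors);
        System.out.println(errors);
    }
}
```

With the silent variant, a broken embedded object is indistinguishable from an empty one, which matches what is observed above.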

      1. emb.ppt
        88 kB
        Michael McCandless
      2. testMSChart-govdocs-428996.pptx
        55 kB
        Tim Allison

          Activity

          gagravarr Nick Burch added a comment -

          Are you able to get the full stacktrace? It'd be interesting to see what the cause is of the RecordFormatException, so we can work out if it's a corrupted file or a bug in POI

          mikemccand Michael McCandless added a comment -

          Here's the full stack trace when I parse the .xls file that TikaCLI extracts:

          Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4eaf6cb1
          	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
          	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
          	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
          	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
          	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
          	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
          Caused by: org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
          	at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
          	at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
          	at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
          	at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
          	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
          	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
          	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:292)
          	at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:144)
          	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:194)
          	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
          	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
          	... 5 more
          Caused by: org.apache.poi.hssf.record.RecordFormatException: Not enough data (0) to read requested (2) bytes
          	at org.apache.poi.hssf.record.RecordInputStream.checkRecordPosition(RecordInputStream.java:216)
          	at org.apache.poi.hssf.record.RecordInputStream.readShort(RecordInputStream.java:233)
          	at org.apache.poi.hssf.record.WindowOneRecord.<init>(WindowOneRecord.java:71)
          	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
          	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
          	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
          	at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
          	at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:57)
          	... 15 more
          
          gagravarr Nick Burch added a comment -

          Looks like the WindowOneRecord isn't the size that POI expects it to be. Do you know the origin of the file, was it produced by Office or something else? And can you try running the Microsoft Binary File Format Validator tool against it to see if it's actually a valid .xls file or not?

          Assuming it's a valid file produced by Office, you'll then want to report a POI bug. If it's not a valid file and comes from elsewhere, you'll need to report a bug in the program used to generate the file...
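
The failure mode described here, a fixed-layout record constructor reading past a payload that is shorter than expected, can be sketched in plain Java. This is not POI code: `RecordSizeSketch` and its methods are hypothetical, the nine 16-bit fields are an assumption based on the BIFF8 WINDOW1 layout, and the 8-byte payload is purely illustrative because the actual record length in the chart file is not stated in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class RecordSizeSketch {
    /** Field count assumed for a BIFF8 WINDOW1 record (nine 16-bit values, 18 bytes). */
    static final int WINDOW1_FIELDS = 9;

    /** Bounds-checked read; a hypothetical stand-in, not POI's RecordInputStream. */
    static short readShort(ByteBuffer in) {
        if (in.remaining() < 2) {
            throw new IllegalStateException("Not enough data (" + in.remaining()
                    + ") to read requested (2) bytes");
        }
        return in.getShort();
    }

    /** Reads a fixed number of 16-bit fields, as a fixed-layout record constructor would. */
    static short[] readFields(byte[] payload, int fieldCount) {
        ByteBuffer in = ByteBuffer.wrap(payload).order(ByteOrder.LITTLE_ENDIAN);
        short[] fields = new short[fieldCount];
        for (int i = 0; i < fieldCount; i++) {
            fields[i] = readShort(in);
        }
        return fields;
    }

    public static void main(String[] args) {
        // A payload of the expected size parses cleanly.
        System.out.println(readFields(new byte[18], WINDOW1_FIELDS).length);
        try {
            // Illustrative shorter payload: the reader runs out of data partway through.
            readFields(new byte[8], WINDOW1_FIELDS);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is only that a reader hard-wired to one record size fails partway through when a different producer writes a shorter record with the same id.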

          mikemccand Michael McCandless added a comment -

          I think emb.ppt was explicitly created as a test case, but not by me ... I'll see if I can get the details.

          OK I just ran the attached emb.ppt through the Microsoft Binary File Format Validator tool and it passed, but when I run it on 1.xls (which TikaCLI -z had saved, from the embedded Chart), it fails with this message:

          BFFValidator: "x:\tmp\1.xls" NOT RECOGNIZED (The Microsoft Office Binary File
          Format Validator encountered an error reading the file you specified, OR The
          Microsoft Office Binary File Format Validator supports Word, Excel, and
          PowerPoint binary file formats only. The file you specified is an unsupported
          file type.) at 11/27/12 07:23:58
          

          It sounds like the tool doesn't expect to get a "raw" chart object? (Tika is mis-identifying this embedded chart object as XLS and saving 1.xls). Either that or somehow Tika saved the wrong bits when it extracted the embedded chart object?

          gagravarr Nick Burch added a comment -

          The "raw chart object" looks to actually be an excel file, running org.apache.poi.poifs.dev.POIFSLister against it gives:

          Root Entry -
          CompObj <(0x01)CompObj>
          Workbook
          Ole <(0x01)Ole>

          So there's an excel workbook in there. POIFSViewer shows the only bit with any real data in it is the Workbook entry, and bits of text from the chart are there, so whatever the chart data is it's in the excel file part. That's why Tika is saying it's an excel file!

          Note that embedded objects in office files are actually stored as the raw object (used for editing), and a rendered version of the file (so that viewing the parent document is quick, normally an EMF)

          mikemccand Michael McCandless added a comment -

          I asked the person who created this test file; here's his answer:

          I created the file with my PowerPoint (PowerPoint 2003). 
          
          To embed the chart:
          
          1. Select from the menu Insert
          2. Select chart (I selected the default chart)
          3. Place the chart
          
          mikemccand Michael McCandless added a comment -

          The "raw chart object" looks to actually be an excel file,

          Hmm, so now I'm very confused. Did something go wrong when Tika pulled out the bits from emb.ppt to create 1.xls? When I try to open 1.xls in Excel it's unhappy ("Cannot open Microsoft Graph chart gallery files.").

          Note that embedded objects in office files are actually stored as the raw object (used for editing), and a rendered version of the file (so that viewing the parent document is quick, normally an EMF)

          Yeah I see separately the *.emf files being extracted by TikaCLI.

          gagravarr Nick Burch added a comment -

          It looks like it's a special kind of excel file generated for holding the chart. If I open the ppt file in openoffice and double click on the chart it opens OOCalc, so that too thinks it's a kind of excel file. If you double click in your copy of powerpoint, does it launch excel or something else to let you modify it?

          For this bug, I'd suggest you raise a new issue in the POI bugzilla, upload the .ppt and extracted .xls, include the key details and link back to this jira.

          mikemccand Michael McCandless added a comment -

          Interesting: with PowerPoint 2007, when I double-click the embedded chart, it pops up a dialogue box saying "To edit this chart using the new features available in the 2007 Microsoft Office system, you must first convert it to the 2007 Office system format. Do you want to convert this chart to the new format? [Convert] [Convert All] [Edit Existing]". If I click [Edit Existing] it lets me edit the chart data in what looks like Excel, in "Compatibility Mode".

          OK I'll open a POI bug and reference back to this issue...

          Thanks Nick.

          mikemccand Michael McCandless added a comment -

          OK I opened https://issues.apache.org/bugzilla/show_bug.cgi?id=54213

          tpalsulich Tyler Palsulich added a comment -

          I'm able to reproduce this issue with Tika 1.8-SNAPSHOT. Didn't investigate beyond that.

          tallison@mitre.org Tim Allison added a comment - - edited

          I re-discovered this roughly a year ago on TIKA-1651. The # of parse exceptions on embedded xls files was crazily higher than un-embedded xls...and then I discovered that they really weren't xls.

          On POI 54213, Yegor Kozlov (who actually knows what he's talking about) confirmed my suspicions in looking into the file format with POI. This is a very different type of file from XLS. I did some investigatory hackery to modify the read lengths on POI and I could see some data, but it looks like it'll take a fair amount of effort to add parsing for this without breaking XLS parsing.

          As a first step, we could follow Yegor's recommendation and add detection at least via inspection of the container. What mime type do we want to use? application/ms-chart?

          gagravarr Nick Burch added a comment - - edited

          I think it should be an x- or vnd.ms- prefix under application. Given what Excel .xls uses, application/vnd.ms-chart is probably a good fit.

          tallison@mitre.org Tim Allison added a comment -

          When I saved the .ppt file that I submitted over on TIKA-1651 as a pptx, PowerPoint saved the same embedded vnd.ms-chart object, as is, in "embeddings".

          tallison@mitre.org Tim Allison added a comment -

          I added detection by looking into the CompObj bytes in POIFSContainerDetector. My first attempt extracted the progID and then passed that information through the Metadata, as Yegor recommended. However, the equivalent in pptx, xlsx and xls was not immediately clear. It was simpler to add the detection for all four container files in one place in POIFSContainerDetector.
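
The idea of inspecting the CompObj bytes can be sketched in plain Java as below. This is a rough illustration, not the code committed to POIFSContainerDetector: `CompObjSketch`, `contains` and `detect` are hypothetical names, the sketch assumes the progID appears as plain ASCII inside the CompObj stream, and `application/vnd.ms-chart` is simply the name proposed earlier in this thread. The `hasWorkbookEntry` flag stands in for a check of the container's top-level entries.

```java
import java.nio.charset.StandardCharsets;

final class CompObjSketch {
    /** ASCII prefix of the progID reported in this issue (MSGraph.Chart.8). */
    static final byte[] GRAPH_MARKER = "MSGraph.Chart".getBytes(StandardCharsets.US_ASCII);

    /** Naive byte-substring search; adequate for a stream of a few hundred bytes. */
    static boolean contains(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    /** Checks the CompObj marker first, so a chart is not reported as Excel. */
    static String detect(byte[] compObj, boolean hasWorkbookEntry) {
        if (compObj != null && contains(compObj, GRAPH_MARKER)) {
            return "application/vnd.ms-chart"; // name proposed in this thread
        }
        return hasWorkbookEntry
                ? "application/vnd.ms-excel"
                : "application/x-tika-msoffice"; // generic OLE2 fallback
    }
}
```

The ordering is the design choice that matters: because the chart object also carries a Workbook entry, the CompObj check has to run before the Workbook check.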

          When I tried to copy and paste that chart/graph into Word, it was saved as a non-visible xls or xlsx in doc and docx respectively. I think I remember from last time I looked into this, that these objects don't exist in doc/docx.

          It would be great if we could add extraction, but I don't think I'll be able to work on that any time soon. Any takers?

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-2.x #75 (See https://builds.apache.org/job/tika-2.x/75/)
          TIKA-1033 – add identification for embedded MSChart.Graph files. (tallison: rev 862234289514dede8362c04f64305a47b0580ec8)

          • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/POIContainerExtractionTest.java
          • tika-core/src/test/java/org/apache/tika/TikaTest.java
          • tika-test-resources/src/test/resources/test-documents/testMSChart-govdocs-428996.xls
          • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
          • tika-test-resources/src/test/resources/test-documents/testMSChart-govdocs-428996.xlsx
          • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/AbstractPOIContainerExtractionTest.java
          • tika-test-resources/src/test/resources/test-documents/testMSChart-govdocs-428996.ppt
          • tika-test-resources/src/test/resources/test-documents/testMSChart-govdocs-428996.pptx
          • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
          • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java
          • CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #943 (See https://builds.apache.org/job/tika-trunk-jdk1.7/943/)
          TIKA-1033 – add detection for embedded MSGraph.Chart objects. Also add (tallison: rev e9206475e683c35b09810a857ddd7bdbfa8f60fb)

          • tika-parsers/src/test/resources/test-documents/testMSChart-govdocs-428996.xlsx
          • tika-parsers/src/test/resources/test-documents/testMSChart-govdocs-428996.pptx
          • tika-core/src/test/java/org/apache/tika/TikaTest.java
          • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java
          • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/POIContainerExtractionTest.java
          • tika-parsers/src/test/resources/test-documents/testMSChart-govdocs-428996.xls
          • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLContainerExtractionTest.java
          • CHANGES.txt
          • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
          • tika-parsers/src/test/resources/test-documents/testMSChart-govdocs-428996.ppt
          • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/AbstractPOIContainerExtractionTest.java
          tallison@mitre.org Tim Allison added a comment -

          Just finished run against our TIKA-1302 corpus. We have ~33k of these chart-graphs.


            People

            • Assignee:
              Unassigned
              Reporter:
              mikemccand Michael McCandless
            • Votes:
              0
              Watchers:
              6
