Tika / TIKA-1945

Powerpoint parser doesn't extract text from diagrams

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12
    • Fix Version/s: 1.16
    • Component/s: parser
    • Labels:
      None

      Description

      Attached is an example org chart from which Tika does not extract any text.

      1. Diagram.pptx
        48 kB
        Nick C
      2. TIKA-1945.docx
        49 kB
        Oytun Tez
      3. TIKA-1945.pptx
        39 kB
        Oytun Tez

          Activity

          nicholasc Nick C added a comment -

          Also, while looking into the code, I noticed that AbstractOOXMLExtractor.getXHTML passes the content handler to handleEmbeddedParts and handleThumbnail instead of the XHTMLContentHandler that is passed to buildXHTML. If that's a bug, I can create another Jira ticket.

          oytun Oytun Tez added a comment - - edited

          This is a confirmed issue for 1.15 as well. `./ppt/diagrams/*.xml` files are not processed. If there is a quick workaround for this, we would like to use it. This is currently a production issue for us.
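
          As a stopgap on an unpatched version, the diagram parts can be read directly from the package. The following is a minimal sketch using only the JDK, not Tika's or POI's API: the class name `DiagramTextSketch` is hypothetical, and it assumes that SmartArt text is stored in DrawingML `a:t` elements inside `ppt/diagrams/data*.xml`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects text runs from the diagram data parts of a .pptx. */
public class DiagramTextSketch {
    // DrawingML main namespace; <a:t> elements in this namespace hold text runs.
    private static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";
    // Assumption: only the data parts are read, so cached drawing parts are skipped.
    private static final Pattern DIAGRAM_DATA = Pattern.compile("ppt/diagrams/data\\d*\\.xml");

    public static List<String> extractDiagramText(InputStream pptx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        List<String> runs = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!DIAGRAM_DATA.matcher(entry.getName()).matches()) {
                    continue;
                }
                // Buffer the entry so the parser cannot close the zip stream.
                byte[] xml = readEntry(zip);
                SAXParser parser = factory.newSAXParser();
                parser.parse(new ByteArrayInputStream(xml), new TextRunHandler(runs));
            }
        }
        return runs;
    }

    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Appends the character content of every <a:t> element to the given list. */
    private static final class TextRunHandler extends DefaultHandler {
        private final List<String> runs;
        private StringBuilder current;

        TextRunHandler(List<String> runs) {
            this.runs = runs;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (DRAWINGML_NS.equals(uri) && "t".equals(localName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null && DRAWINGML_NS.equals(uri) && "t".equals(localName)) {
                runs.add(current.toString());
                current = null;
            }
        }
    }
}
```

          Calling `DiagramTextSketch.extractDiagramText(new FileInputStream("Diagram.pptx"))` would return the diagram text runs in document order. The sketch does not associate the text with the slide that displays the diagram.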

          oytun Oytun Tez added a comment -

          I believe this may be due to XSLFRelation from Apache POI not providing a relation for the `./ppt/diagrams` directory.

          As we are not familiar with the code base, it would be fantastic if one of the Tika or POI developers could give us clues on the fastest way to solve this.

          There is a similar issue in POI bug database: https://bz.apache.org/bugzilla/show_bug.cgi?id=57596
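
          To illustrate what "providing a relation" would involve: in an OOXML package, a slide points to its diagram data through an entry in its relationships part (for example `ppt/slides/_rels/slide1.xml.rels`). The sketch below uses only the JDK and is not POI's XSLFRelation API; the class name `DiagramRelsSketch` is hypothetical. It lists the targets of relationships whose type is `diagramData`.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: finds diagram data targets in a .rels part. */
public class DiagramRelsSketch {
    // Namespace of <Relationship> elements in OPC relationship parts.
    static final String RELS_NS =
            "http://schemas.openxmlformats.org/package/2006/relationships";
    // Relationship type that links a slide (or document, or sheet drawing) to diagram data.
    static final String DIAGRAM_DATA_TYPE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/diagramData";

    public static List<String> diagramDataTargets(InputStream relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(relsXml);

        NodeList rels = doc.getElementsByTagNameNS(RELS_NS, "Relationship");
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            if (DIAGRAM_DATA_TYPE.equals(rel.getAttribute("Type"))) {
                // Targets are relative to the source part, e.g. "../diagrams/data1.xml".
                targets.add(rel.getAttribute("Target"));
            }
        }
        return targets;
    }
}
```

          Following relationships, rather than scanning the directory, keeps each diagram tied to the slide that references it.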

          gagravarr Nick Burch added a comment -

          We need a small sample file for unit testing, one per affected format. Oytun Tez, if you could create some, that'll help us help you!

          oytun Oytun Tez added a comment -

          Doing this right away, Nick Burch, thank you for looking into this.

          oytun Oytun Tez added a comment -

          Sample files for unit tests for issue TIKA-1945.

          oytun Oytun Tez added a comment -

          Attached 2 sample files for .docx and .pptx. If you need anything else, let me know.

          tallison@mitre.org Tim Allison added a comment -

          Working version of patch handles .docx/pptx/xlsx/xlsb. It does not handle .doc/ppt/xls. Will commit tonight or tomorrow.

          oytun Oytun Tez added a comment -

          Sounds great, Tim Allison, thank you for taking this so quickly. Looking forward to the commit.

          Are you working on dev, or will this patch be available for the 1.15 release?

          gagravarr Nick Burch added a comment -

          I don't know exactly what Tim'll do, but assuming it's similar to what I'd try... It'll almost certainly need some changes to both Apache POI and Apache Tika, and would therefore need to wait for a POI release and then a Tika 1.16 release. The patches will be open source, so you'd be most welcome to do a custom local build until then, but it wouldn't be in an official release for a few months.

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Tika-trunk #1288 (See https://builds.apache.org/job/Tika-trunk/1288/)
          TIKA-1945 – extract text from diagrams in ooxml files. (tallison: https://github.com/apache/tika/commit/7842600560e02a4fd213d175301b4397bbe030a3)

          • (add) tika-parsers/src/test/resources/test-documents/testEXCEL_diagramData.xlsb
          • (add) tika-parsers/src/test/resources/test-documents/testPPT_diagramData.pptx
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXSLFExtractorTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java
          • (add) tika-parsers/src/test/resources/test-documents/testEXCEL_diagramData.xlsx
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
          • (add) tika-parsers/src/test/resources/test-documents/testWORD_diagramData.docx
          • (edit) CHANGES.txt
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java
          tallison@mitre.org Tim Allison added a comment -

          I used the SAX handler I wrote for pptx/docx body parts. Nick Burch, let me know what you think.

          There are some things we should put in POI, like the relationships...anything else?

          oytun Oytun Tez added a comment -

          Tim Allison, Nick Burch, I confirm that the commit below has fixed the SmartArt text issue. We will continue using `master` rather than the latest release until v1.16 is released.

          https://github.com/apache/tika/commit/7842600560e02a4fd213d175301b4397bbe030a3

          tallison@mitre.org Tim Allison added a comment -

          Great. Thank you for the feedback!


            People

            • Assignee:
              tallison@mitre.org Tim Allison
              Reporter:
              nicholasc Nick C