TIKA-2210: Add experimental SAX/Streaming XSLF/pptx extractor

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 2.0, 1.15
    • Component/s: parser
    • Labels: None

      Description

      On TIKA-2201, Seva Alekseyev shared a reasonably sized pptx file that caused an out-of-memory error (OOM). While the SAX docx parser is still fresh in my mind, let's add a SAX parser for pptx as well.
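To illustrate why a SAX approach helps with memory, here is a minimal sketch of streaming text out of a single slide part. It is not the extractor added by this issue: the class name SlideTextSketch and the method extractText are hypothetical, and it only assumes that slide text lives in DrawingML a:t elements grouped into a:p paragraphs. Because the handler sees one event at a time, no in-memory tree of the slide is ever built.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of Tika.
final class SlideTextSketch {
    // DrawingML namespace that holds text runs (a:t) and paragraphs (a:p).
    static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Streams the slide XML once and returns its text, one line per a:p.
    static String extractText(InputStream slideXml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (DRAWINGML_NS.equals(uri) && "t".equals(localName)) {
                    inText = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Only keep characters that sit inside an a:t element.
                if (inText) {
                    out.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!DRAWINGML_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName)) {
                    inText = false;
                } else if ("p".equals(localName)) {
                    out.append('\n'); // paragraph boundary
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on namespace URI
            factory.newSAXParser().parse(slideXml, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse slide XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml =
            "<p:txBody xmlns:p=\"http://schemas.openxmlformats.org/presentationml/2006/main\""
            + " xmlns:a=\"http://schemas.openxmlformats.org/drawingml/2006/main\">"
            + "<a:p><a:r><a:t>Hello </a:t></a:r><a:r><a:t>World</a:t></a:r></a:p>"
            + "<a:p><a:r><a:t>Second</a:t></a:r></a:p>"
            + "</p:txBody>";
        System.out.print(extractText(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

In a real pptx, the slide parts are entries such as ppt/slides/slide1.xml inside the zip container, so the same handler could be fed from a zip entry stream. A production extractor would also need to handle notes, tables, hyperlinks and embedded parts, which this sketch ignores.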

          Activity

          Hudson added a comment:

          SUCCESS: Integrated in Jenkins build tika-2.x #194 (See https://builds.apache.org/job/tika-2.x/194/)
          TIKA-2210 – add experimental SAX parser for pptx and update (also (tallison: rev 68161573140cb584f8af136c57045fbca833fec5)

          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractor.java
          • (add) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXSLFExtractorTest.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/ParagraphProperties.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLDocHandler.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_template.dotx
          • (edit) tika-app/src/test/java/org/apache/tika/parser/TestParsers.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_template.docx
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
          • (add) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/RunProperties.java
          • (add) tika-test-resources/src/test/resources/test-documents/testPPTX_overlappingRelations.pptx
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java
          • (edit) CHANGES.txt
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLTikaBodyPartHandler.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/SXWPFExtractorTest.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_embedded_pics.docx
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFNumberingShim.java
          • (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
          • (add) tika-test-resources/src/test/resources/test-documents/testPPT_various2.pptx
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFStylesShim.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/WordAndPowerPointTextPartHandler.java
          Tim Allison (tallison@mitre.org) added a comment:

          Waited until this was in 2.0 before resolving.

          Hudson added a comment:

          SUCCESS: Integrated in Jenkins build Tika-trunk #1160 (See https://builds.apache.org/job/Tika-trunk/1160/)
          TIKA-2210 – add experimental SAX parser for pptx – this is a first (tallison: rev 90cdf1f6a844e0d0541167bc0364bb3963f93b2d)

          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFRunProperties.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFTikaBodyPartHandler.java
          • (add) tika-parsers/src/test/resources/test-documents/testPPTX_overlappingRelations.pptx
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java
          • (edit) CHANGES.txt
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractDocumentXMLBodyHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFDocumentXMLBodyHandler.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractOfficeParser.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFStylesShim.java
          • (add) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXSLFExtractorTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java
          • (add) tika-parsers/src/test/resources/test-documents/testPPT_various2.pptx
          • (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/RunProperties.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/ParagraphProperties.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFParagraphProperties.java

            People

            • Assignee: Tim Allison (tallison@mitre.org)
            • Reporter: Tim Allison (tallison@mitre.org)
            • Votes: 0
            • Watchers: 2
