TIKA-2187: Align default behavior of experimental docx parser with that of doc parser in handling delText

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: None
    • Labels: None

      Description

      Now that we can ignore delText via the experimental alternate SAXParser for .docx files, let's make that the default behavior to align with the expected behavior for our .doc parser (ignore deleted text).

      Let's also add the ability to include deleted text from .doc files.
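
      For readers unfamiliar with delText: in WordprocessingML, visible run text is stored in w:t elements, and text removed under track changes is stored in w:delText. The sketch below illustrates the "ignore deleted text by default, include it on request" behavior with a plain JDK SAX handler. It is not Tika's implementation; the class name DelTextSketch, the handler, the includeDeleted flag, and the sample fragment are all made up for illustration. In Tika itself, the setting lives in OfficeParserConfig, which this issue edits.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the Tika code changed by this issue.
class DelTextSketch {

    // WordprocessingML main namespace (the usual "w:" prefix).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Made-up fragment: "cruel " was deleted under track changes.
    static final String SAMPLE =
            "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t>Hello </w:t></w:r>"
            + "<w:del><w:r><w:delText>cruel </w:delText></w:r></w:del>"
            + "<w:r><w:t>world</w:t></w:r>"
            + "</w:p>";

    // Collects run text; deleted text is kept only when includeDeleted is true.
    static class RunTextHandler extends DefaultHandler {
        private final boolean includeDeleted;
        private final StringBuilder out = new StringBuilder();
        private boolean capture = false;

        RunTextHandler(boolean includeDeleted) {
            this.includeDeleted = includeDeleted;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(localName)) {
                capture = true;            // ordinary text is always captured
            } else if ("delText".equals(localName)) {
                capture = includeDeleted;  // deleted text only on request
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (W_NS.equals(uri) && ("t".equals(localName) || "delText".equals(localName))) {
                capture = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (capture) {
                out.append(ch, start, length);
            }
        }

        String text() {
            return out.toString();
        }
    }

    // Parses the fragment and returns the extracted text.
    static String extract(String xml, boolean includeDeleted) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on namespace + local name
            SAXParser parser = factory.newSAXParser();
            RunTextHandler handler = new RunTextHandler(includeDeleted);
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.text();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE, false)); // deleted text skipped
        System.out.println(extract(SAMPLE, true));  // deleted text included
    }
}
```

      With the flag off, the sample yields "Hello world"; with it on, "Hello cruel world". Defaulting the flag to false mirrors the default proposed here for the experimental .docx parser.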

          Activity

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1147 (See https://builds.apache.org/job/Tika-trunk/1147/)
          TIKA-2187 – change default behavior in experimental .docx parser to (tallison: rev fe20ecd83ea43e5ec6ad0e9fded9d803cb011251)

          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
          • (edit) CHANGES.txt
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/SXWPFExtractorTest.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParserTest.java
          • (add) tika-parsers/src/test/resources/test-documents/testWORD_2006ml.doc
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
          lfcnassif Luis Filipe Nassif added a comment -

          Thank you Tim Allison for making it configurable!!!

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1148 (See https://builds.apache.org/job/Tika-trunk/1148/)
          TIKA-2187 – fixed test (tallison: rev 09931fe4227478516bf067bbb08056f49a506dfa)

          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
          tallison@mitre.org Tim Allison added a comment -

          Note that I also added extraction of deleted text back to .doc, also via configuration.

          Thanks to Thamme Gowda and Chris A. Mattmann for making the configurability so easy!

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #81 (See https://builds.apache.org/job/tika-2.x-windows/81/)
          TIKA-2187 – make "ignore deleted" as the default in the experimental (tallison: rev 3d08da79febc75d1ca0fd3293a5f383983057b00)

          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParserTest.java
          • (edit) CHANGES.txt
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/SXWPFExtractorTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_2006ml.doc
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #180 (See https://builds.apache.org/job/tika-2.x/180/)
          TIKA-2187 – make "ignore deleted" as the default in the experimental (tallison: rev 3d08da79febc75d1ca0fd3293a5f383983057b00)

          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/SXWPFExtractorTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_2006ml.doc
          • (edit) CHANGES.txt
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java

            People

            • Assignee: Unassigned
            • Reporter: tallison@mitre.org Tim Allison
            • Votes: 0
            • Watchers: 3
