Tika / TIKA-2096

Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: None
    • Labels: None

      Description

      Currently, if users don't specify a Parser.class or an EmbeddedDocumentExtractor in the ParseContext, embedded documents are not parsed. I propose that we add an AutoDetectParser automatically if neither a Parser nor an EmbeddedDocumentExtractor is included in the ParseContext.

      If a user doesn't want to parse embedded objects, they could pass in an EmptyParser for the Parser.class.

      In short, let's make the default be "parse everything", and the user has to figure out how to parse only the container document if that's the desired behavior.
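
The proposed default can be sketched roughly as follows. This is a minimal, self-contained model and not Tika's code: the nested `Parser` and `Context` types and the `AUTO_DETECT` and `EMPTY` constants are hypothetical stand-ins for Tika's Parser, ParseContext, AutoDetectParser and EmptyParser, and the EmbeddedDocumentExtractor case is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;

public class EmbeddedParserFallbackSketch {

    /** Hypothetical stand-in for a parser; returns the text it would extract. */
    interface Parser {
        String parse(String embeddedName);
    }

    /** Stand-in for a parser that detects the type and extracts content. */
    static final Parser AUTO_DETECT = name -> "parsed:" + name;

    /** Stand-in for a parser that extracts nothing (the proposed opt-out). */
    static final Parser EMPTY = name -> "";

    /** Hypothetical stand-in for a context keyed by class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast(null) returns null, so a missing entry yields null.
            return key.cast(entries.get(key));
        }
    }

    /** Uses the caller's parser if one was set, otherwise the auto-detecting default. */
    static Parser resolveEmbeddedParser(Context context) {
        Parser configured = context.get(Parser.class);
        return configured != null ? configured : AUTO_DETECT;
    }

    public static void main(String[] args) {
        // Nothing configured: embedded documents are parsed by default.
        Context defaults = new Context();
        System.out.println(resolveEmbeddedParser(defaults).parse("attachment.pdf"));

        // Explicit opt-out: the caller asks for container-only parsing.
        Context optOut = new Context();
        optOut.set(Parser.class, EMPTY);
        System.out.println("[" + resolveEmbeddedParser(optOut).parse("attachment.pdf") + "]");
    }
}
```

The point of the sketch is that forgetting to configure anything yields "parse everything", while skipping embedded documents requires a deliberate choice by the caller.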

      This is a breaking change. I propose adding it to 2.0 only.

      We were bitten by this on tika-server (TIKA-1584). Solr (SOLR-7189) has been bitten by this. Kite is still suffering from this.

          Activity

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1146 (See https://builds.apache.org/job/Tika-trunk/1146/)
          TIKA-2096 – fix example of not including embedded docs (tallison: rev 7b4f6fa6c76430dbc0eeb4e6654b59e3afc38185)

          • (edit) tika-example/src/main/java/org/apache/tika/example/ParsingExample.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #178 (See https://builds.apache.org/job/tika-2.x/178/)
          TIKA-2096 change default to extract embedded documents even if the user (tallison: rev e5e4d4d9193daa001821cdf7637c023d0abe072e)

          • (edit) tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java
          • (edit) tika-parser-modules/tika-parser-database-module/src/test/java/org/apache/tika/parser/jdbc/SQLite3ParserTest.java
          • (edit) tika-app/src/test/java/org/apache/tika/parser/fork/ForkParserIntegrationTest.java
          • (edit) CHANGES.txt
          • (add) tika-app/src/test/java/org/apache/tika/extractor/EmbeddedDocumentUtilTest.java

          TIKA-2096 – fix example, sorry... (tallison: rev de103c81fe225f08cdcbadf09437907bc3e4669b)

          • (edit) tika-example/src/main/java/org/apache/tika/example/ParsingExample.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #79 (See https://builds.apache.org/job/tika-2.x-windows/79/)
          TIKA-2096 change default to extract embedded documents even if the user (tallison: rev e5e4d4d9193daa001821cdf7637c023d0abe072e)

          • (add) tika-app/src/test/java/org/apache/tika/extractor/EmbeddedDocumentUtilTest.java
          • (edit) tika-app/src/test/java/org/apache/tika/parser/fork/ForkParserIntegrationTest.java
          • (edit) CHANGES.txt
          • (edit) tika-parser-modules/tika-parser-database-module/src/test/java/org/apache/tika/parser/jdbc/SQLite3ParserTest.java
          • (edit) tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java

          TIKA-2096 – fix example, sorry... (tallison: rev de103c81fe225f08cdcbadf09437907bc3e4669b)

          • (edit) tika-example/src/main/java/org/apache/tika/example/ParsingExample.java
          hudson Hudson added a comment -

          UNSTABLE: Integrated in Jenkins build Tika-trunk #1145 (See https://builds.apache.org/job/Tika-trunk/1145/)
          TIKA-2096 – automatically add AutoDetectParser for embedded documents (tallison: rev 361ffa40a5cee9f37d01f40c2074a18b04c4a6fb)

          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/fork/ForkParserIntegrationTest.java
          • (edit) tika-core/src/main/java/org/apache/tika/extractor/EmbeddedDocumentUtil.java
          • (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/jdbc/SQLite3ParserTest.java
          • (add) tika-parsers/src/test/java/org/apache/tika/extractor/EmbeddedDocumentUtilTest.java

          TIKA-2096 – update CHANGES.txt (tallison: rev 1cfd250f8b337876464edf4b57f8ee62c361380b)

          • (edit) CHANGES.txt
          tallison@mitre.org Tim Allison added a comment -

          Sounds good. The change should be fairly straightforward with the new EmbeddedDocumentUtil. If there are no objections, I'll commit this next week... unless, Luis Filipe Nassif, you want to try out your new commit privileges before then.

          lfcnassif Luis Filipe Nassif added a comment -

          Hi Tim,

          I am OK with putting it into 1.15. I think it is more an improvement than a breaking change: Tika will not lose any previously extracted data. And I think most users want that behaviour.

          tallison@mitre.org Tim Allison added a comment -

          We may want to accelerate this and put it into Tika 1.15. I just found that the MailContentHandler was supplying an AutoDetectParser, but the other parsers aren't. On TIKA-2159, I removed this from the MailContentHandler. Any objections if we add this to all parsers now?


            People

            • Assignee: Unassigned
            • Reporter: tallison@mitre.org Tim Allison