TIKA-2096

Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.15, 2.0.0
    • Component/s: None
    • Labels: None

    Description

      Currently, if users don't set a Parser (keyed by Parser.class) or an EmbeddedDocumentExtractor in the ParseContext, embedded documents are not parsed. I propose that we supply an AutoDetectParser automatically when neither a Parser nor an EmbeddedDocumentExtractor is included in the ParseContext.

      A user who doesn't want to parse embedded documents could pass in an EmptyParser for Parser.class.

      In short, let's make the default "parse everything"; a user who wants to parse only the container document has to opt out explicitly.
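
      The proposed default can be sketched in plain Java. This is only an illustration of the lookup-with-fallback idea, not Tika's code: the Context class, the Parser interface, and the AUTO_DETECT and EMPTY constants are hypothetical stand-ins for ParseContext, Parser, AutoDetectParser and EmptyParser, and the sketch ignores the EmbeddedDocumentExtractor case entirely.

```java
import java.util.HashMap;
import java.util.Map;

public class EmbeddedParserDefault {

    /** Hypothetical stand-in for a parser; not Tika's Parser interface. */
    interface Parser {
        String name();
    }

    /** Stand-in for an auto-detecting parser that handles any embedded document. */
    static final Parser AUTO_DETECT = () -> "auto-detect";

    /** Stand-in for a no-op parser, used to opt out of parsing embedded documents. */
    static final Parser EMPTY = () -> "empty";

    /** Minimal type-keyed context, loosely modeled on a ParseContext. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Proposed behavior: fall back to auto-detection when no Parser was registered. */
    static Parser embeddedParserFor(Context context) {
        return context.get(Parser.class, AUTO_DETECT);
    }

    public static void main(String[] args) {
        // Nothing registered: embedded documents would be parsed by default.
        Context defaults = new Context();
        System.out.println(embeddedParserFor(defaults).name());

        // Explicit opt-out: only the container document would be parsed.
        Context containerOnly = new Context();
        containerOnly.set(Parser.class, EMPTY);
        System.out.println(embeddedParserFor(containerOnly).name());
    }
}
```

      With the real API, the opt-out described above would presumably be a call along the lines of context.set(Parser.class, new EmptyParser()) on the ParseContext before parsing.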

      This is a breaking change. I propose adding it to 2.0 only.

      We were bitten by this on tika-server (TIKA-1584). Solr (SOLR-7189) has been bitten by this. Kite is still suffering from this.

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)
              Votes: 0
              Watchers: 3
