  1. Tika
  2. TIKA-2096

Supply AutoDetectParser for embedded documents if user forgets to pass it in via ParseContext

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: None
    • Labels: None

      Description

      Currently, if users don't set either a Parser (under the Parser.class key) or an EmbeddedDocumentExtractor in the ParseContext, embedded documents are not parsed. I propose that we supply an AutoDetectParser automatically whenever neither a Parser nor an EmbeddedDocumentExtractor is present in the ParseContext.

      If a user doesn't want to parse embedded documents, they could pass in an EmptyParser for Parser.class.

      In short, let's make the default "parse everything"; a user who wants only the container document parsed has to opt out explicitly.
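      The lookup order proposed above can be sketched roughly as follows. This is a toy model only, not Tika code: MiniContext, MiniParser, MiniExtractor, AutoDetectStandIn, EmptyStandIn and handleEmbedded are hypothetical stand-ins invented for illustration, and the assumption that an explicitly supplied extractor wins over a Parser is mine, not something stated in this issue.

```java
import java.util.HashMap;
import java.util.Map;

public class EmbeddedDefaultSketch {

    /** Hypothetical stand-in for a parser (NOT org.apache.tika.parser.Parser). */
    interface MiniParser {
        String parseEmbedded(String name);
    }

    /** Hypothetical stand-in for an embedded-document extractor. */
    interface MiniExtractor {
        String extract(String name);
    }

    /** Stand-in for an auto-detecting parser: processes every embedded document. */
    static final class AutoDetectStandIn implements MiniParser {
        public String parseEmbedded(String name) {
            return "parsed:" + name;
        }
    }

    /** Stand-in for an empty parser: the explicit opt-out. */
    static final class EmptyStandIn implements MiniParser {
        public String parseEmbedded(String name) {
            return "skipped:" + name;
        }
    }

    /** Minimal class-keyed context, a simplified model of a parse context. */
    static final class MiniContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast(null) returns null, so a missing key yields null.
            return key.cast(entries.get(key));
        }
    }

    /** Decides how one embedded document is handled under the proposed default. */
    static String handleEmbedded(MiniContext context, String name) {
        // Assumption: an explicitly supplied extractor takes precedence.
        MiniExtractor extractor = context.get(MiniExtractor.class);
        if (extractor != null) {
            return extractor.extract(name);
        }
        MiniParser parser = context.get(MiniParser.class);
        if (parser == null) {
            // Proposed behavior: nothing was supplied, so parse everything.
            parser = new AutoDetectStandIn();
        }
        return parser.parseEmbedded(name);
    }

    public static void main(String[] args) {
        MiniContext nothingSet = new MiniContext();
        System.out.println(handleEmbedded(nothingSet, "attachment.pdf"));

        MiniContext optOut = new MiniContext();
        optOut.set(MiniParser.class, new EmptyStandIn());
        System.out.println(handleEmbedded(optOut, "attachment.pdf"));
    }
}
```

      In this model an empty context parses the embedded file, and registering the empty stand-in is the only way to skip it. In Tika itself, the equivalent registration would presumably go through ParseContext.set(Parser.class, ...), as described above.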

      This is a breaking change. I propose adding it to 2.0 only.

      We were bitten by this on tika-server (TIKA-1584). Solr (SOLR-7189) has been bitten by this. Kite is still suffering from this.


              People

              • Assignee: Unassigned
              • Reporter: Tim Allison (tallison@mitre.org)
              • Votes: 0
              • Watchers: 3
