Description
Currently, if a user sets neither a Parser (under the Parser.class key) nor an EmbeddedDocumentExtractor in the ParseContext, embedded documents are not parsed. I propose that we add an AutoDetectParser automatically when neither a Parser nor an EmbeddedDocumentExtractor is included in the ParseContext.
A user who doesn't want embedded objects parsed could pass in an EmptyParser for Parser.class.
In short, let's make the default "parse everything"; a user who wants only the container document parsed has to opt out explicitly.
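To make the proposed default concrete, here is a minimal, self-contained sketch of the fallback logic. It does not use Tika's classes: SketchParser, SketchContext, and EmbeddedDefaultSketch are hypothetical stand-ins for Parser, ParseContext, and whatever code resolves the embedded parser, and the EmbeddedDocumentExtractor lookup is left out for brevity. Where the real change would live in Tika is not specified here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Tika's Parser interface, reduced to a name.
interface SketchParser {
    String name();
}

// Hypothetical stand-in for ParseContext: a map keyed by type.
final class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        // Class.cast passes null through, so a missing entry yields null.
        return key.cast(entries.get(key));
    }
}

final class EmbeddedDefaultSketch {
    // Stand-ins for AutoDetectParser and EmptyParser (assumed names).
    static final SketchParser AUTO_DETECT = () -> "auto-detect";
    static final SketchParser EMPTY = () -> "empty";

    // Current behaviour: nothing configured means no embedded parser (null),
    // so embedded documents are skipped.
    static SketchParser resolveCurrent(SketchContext context) {
        return context.get(SketchParser.class);
    }

    // Proposed behaviour: fall back to auto-detection when nothing is configured.
    static SketchParser resolveProposed(SketchContext context) {
        SketchParser configured = context.get(SketchParser.class);
        return configured != null ? configured : AUTO_DETECT;
    }

    public static void main(String[] args) {
        SketchContext defaults = new SketchContext();
        System.out.println(resolveProposed(defaults).name());

        // Opting out: explicitly register the empty parser.
        SketchContext containerOnly = new SketchContext();
        containerOnly.set(SketchParser.class, EMPTY);
        System.out.println(resolveProposed(containerOnly).name());
    }
}
```

The point of the sketch is that an explicitly configured parser always wins, so existing callers who set one see no change; only callers who set nothing get the new "parse everything" behaviour.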
This is a breaking change. I propose adding it to 2.0 only.
We were bitten by this on tika-server (TIKA-1584). Solr (SOLR-7189) has been bitten by this. Kite is still suffering from this.
Issue Links
- relates to TIKA-2275: EmbeddedDocumentUtil should check parseContext for a TikaConfig (Resolved)