Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Lucene Fields: New
Description
The single-token case for post-analysis DocValues is covered by Analyzer.normalize(...) (and formerly by MultiTermAwareComponent), but there are cases where post-analysis DocValues based on multi-token fields would be desirable.
The main use cases that I can think of are variants of faceting/terms aggregation. I understand that this could be viewed as "trappy" for the naive "Moby Dick word cloud" case; but:
- I think this can be supported fairly cleanly in Lucene
- Explicit user configuration of this option would help prevent people shooting themselves in the foot
- The current situation is arguably "trappy" as well; it just offloads the trappiness onto Lucene-external workarounds for systems/users that want to support this kind of behavior
- Integrating this functionality directly in Lucene would afford consistency guarantees that present opportunities for future optimizations (e.g., shared Terms dictionary between indexed terms and DocValues).
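To make the faceting use case concrete, here is a minimal toy model in plain Java. It does not use Lucene at all: `toyAnalyze` is a hypothetical stand-in for an Analyzer chain, and the per-document `TreeSet` only imitates SORTED_SET-style semantics (distinct, sorted values per document). It is a sketch of the idea, not of how IndexingChain would implement it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: none of these names exist in Lucene.
final class PostAnalysisDocValuesSketch {

    // Hypothetical stand-in for an Analyzer: split on non-alphanumerics, lowercase.
    static List<String> toyAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Per-document values: the distinct post-analysis tokens, in sorted order.
    static SortedSet<String> tokenValues(String text) {
        return new TreeSet<>(toyAnalyze(text));
    }

    // Facet counts: for each token, the number of documents that contain it.
    static SortedMap<String, Integer> facetCounts(List<String> docs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {
            for (String token : tokenValues(doc)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling `PostAnalysisDocValuesSketch.facetCounts(List.of("Call me Ishmael", "Call the whale, the WHALE!"))` counts each token once per document, so a word repeated within one document does not inflate its facet count. This is also where the "word cloud" trap shows up: on a long text field the set of distinct tokens per document can become very large.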
This issue proposes adding support for multi-token post-analysis DocValues directly to IndexingChain. The initial proposal extends the API with IndexableFieldType.tokenDocValuesType(), alongside the existing IndexableFieldType.docValuesType().
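One possible shape for that API extension is sketched below in self-contained Java. The types `DocValuesTypeSketch` and `FieldTypeSketch` are hypothetical stand-ins that only mirror the Lucene names. The default of NONE for `tokenDocValuesType()` and the helper `wouldCollectTokenDocValues` are assumptions made for illustration, not part of the proposal text.

```java
// Hypothetical stand-in for Lucene's DocValuesType enum.
enum DocValuesTypeSketch { NONE, NUMERIC, BINARY, SORTED, SORTED_NUMERIC, SORTED_SET }

// Hypothetical stand-in for IndexableFieldType, reduced to the two relevant methods.
interface FieldTypeSketch {
    // Existing notion: DocValues built from the field's pre-analysis value.
    DocValuesTypeSketch docValuesType();

    // Proposed addition: DocValues built from post-analysis tokens.
    // Assumption: defaulting to NONE keeps existing field types unchanged.
    default DocValuesTypeSketch tokenDocValuesType() {
        return DocValuesTypeSketch.NONE;
    }
}

final class FieldTypeSketchDemo {
    // A field type written before the new method existed; it inherits the default.
    static FieldTypeSketch legacy() {
        return () -> DocValuesTypeSketch.NONE;
    }

    // A field type that explicitly opts in to multi-token post-analysis DocValues.
    static FieldTypeSketch facetable() {
        return new FieldTypeSketch() {
            @Override
            public DocValuesTypeSketch docValuesType() {
                return DocValuesTypeSketch.NONE;
            }

            @Override
            public DocValuesTypeSketch tokenDocValuesType() {
                return DocValuesTypeSketch.SORTED_SET;
            }
        };
    }

    // Hypothetical check an indexing chain might perform per field.
    static boolean wouldCollectTokenDocValues(FieldTypeSketch type) {
        return type.tokenDocValuesType() != DocValuesTypeSketch.NONE;
    }
}
```

Keeping the new setting separate from `docValuesType()` means the behavior has to be requested explicitly, which matches the point above about explicit user configuration.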