[LUCENE-10023] Multi-token post-analysis DocValues - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: core/index
Labels: None
Lucene Fields: New

Description

The single-token case for post-analysis DocValues is covered by Analyzer.normalize(...) (and formerly MultiTermAwareComponent), but there are cases where it would be desirable to have post-analysis DocValues based on multi-token fields.

The main use cases that I can think of are variants of faceting/terms aggregation. I understand that this could be viewed as "trappy" for the naive "Moby Dick word cloud" case; but:

- I think this can be supported fairly cleanly in Lucene.
- Explicit user configuration of this option would help prevent people from shooting themselves in the foot.
- The current situation is arguably "trappy" as well; it just offloads the trappiness onto Lucene-external workarounds for systems/users that want to support this kind of behavior.
- Integrating this functionality directly in Lucene would afford consistency guarantees that present opportunities for future optimizations (e.g., a shared Terms dictionary between indexed terms and DocValues).
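
To make the faceting use case concrete, here is a minimal sketch in plain Java of what counting over post-analysis tokens looks like. It does not use Lucene at all: the class name TokenFacetSketch, the analyze method (a crude stand-in for an analysis chain that splits on non-alphanumerics and lowercases), and the sample documents are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names are Lucene APIs.
final class TokenFacetSketch {

  // Stand-in for an analysis chain: split on non-letters/digits, then lowercase.
  // Returns the distinct tokens of one document in sorted order.
  static SortedSet<String> analyze(String text) {
    SortedSet<String> tokens = new TreeSet<>();
    for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
      if (!raw.isEmpty()) {
        tokens.add(raw.toLowerCase(Locale.ROOT));
      }
    }
    return tokens;
  }

  // Facet counts: the number of documents that contain each post-analysis token.
  static Map<String, Integer> facetCounts(List<String> docs) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String doc : docs) {
      for (String token : analyze(doc)) {
        counts.merge(token, 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> docs = List.of("Call me Ishmael", "Call the whale", "The White Whale");
    // prints {call=2, ishmael=1, me=1, the=2, whale=2, white=1}
    System.out.println(facetCounts(docs));
  }
}
```

Because the counts are taken over analyzed tokens, "The" and "the" land in the same bucket; an external workaround has to reproduce the index-time analysis itself to get the same result.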

This issue proposes adding support for multi-token post-analysis DocValues directly to IndexingChain. The initial proposal involves extending the API to include IndexableFieldType.tokenDocValuesType() (in addition to existing IndexableFieldType.docValuesType()).
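
The shape of the proposed extension might look roughly like the following sketch. It is not the code from the linked pull request and does not depend on Lucene: FieldTypeSketch, DvType, and tokenDocValues are hypothetical stand-ins, and the opt-in default of NONE is an assumption based on the "explicit user configuration" point above. Only the accessor names docValuesType() and tokenDocValuesType() come from this issue.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical stand-ins that mirror the shape of the proposal, not Lucene's real classes.
final class TokenDocValuesSketch {

  // Simplified stand-in for a DocValues type enum.
  enum DvType { NONE, SORTED_SET }

  // Stand-in for IndexableFieldType with the proposed extra accessor.
  interface FieldTypeSketch {
    // Existing accessor named in the issue.
    DvType docValuesType();

    // Proposed accessor; assumed here to default to NONE so the feature is opt-in.
    default DvType tokenDocValuesType() {
      return DvType.NONE;
    }
  }

  // Stand-in for the per-field step an indexing chain could take: when the field
  // opts in, the distinct post-analysis tokens become the document's values.
  static SortedSet<String> tokenDocValues(
      FieldTypeSketch type, String value, Function<String, List<String>> analyzer) {
    if (type.tokenDocValuesType() == DvType.NONE) {
      return Collections.emptySortedSet();
    }
    return new TreeSet<>(analyzer.apply(value));
  }

  public static void main(String[] args) {
    // Crude analyzer stand-in: lowercase, then split on whitespace.
    Function<String, List<String>> analyzer =
        s -> Arrays.asList(s.toLowerCase(Locale.ROOT).split("\\s+"));

    FieldTypeSketch plain = () -> DvType.NONE; // does not opt in
    FieldTypeSketch optIn = new FieldTypeSketch() {
      @Override public DvType docValuesType() { return DvType.NONE; }
      @Override public DvType tokenDocValuesType() { return DvType.SORTED_SET; }
    };

    System.out.println(tokenDocValues(plain, "Moby Dick moby", analyzer));
    System.out.println(tokenDocValues(optIn, "Moby Dick moby", analyzer));
  }
}
```

Keeping the new accessor separate from docValuesType() means existing field types behave as before unless they explicitly override it.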

Issue Links

links to GitHub Pull Request #208

People

Assignee: Unassigned

Reporter: Michael Gibney

Votes: 0

Watchers: 4

Dates

Created: 08/Jul/21 13:57

Updated: 28/Aug/22 16:23

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1.5h