  1. Lucene - Core
  2. LUCENE-10023

Multi-token post-analysis DocValues

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

    Description

      The single-token case for post-analysis DocValues is covered by Analyzer.normalize(...) (and formerly by MultiTermAwareComponent), but there are cases where post-analysis DocValues based on multi-token fields would be desirable.

      The main use cases that I can think of are variants of faceting/terms aggregation. I understand that this could be viewed as "trappy" for the naive "Moby Dick word cloud" case; but:

      1. I think this can be supported fairly cleanly in Lucene.
      2. Requiring explicit user configuration of this option would help prevent people from shooting themselves in the foot.
      3. The current situation is arguably "trappy" as well; it just offloads the trappiness onto Lucene-external workarounds for systems/users that want to support this kind of behavior.
      4. Integrating this functionality directly in Lucene would afford consistency guarantees that present opportunities for future optimizations (e.g., a shared Terms dictionary between indexed terms and DocValues).

      This issue proposes adding support for multi-token post-analysis DocValues directly to IndexingChain. The initial proposal involves extending the API to include IndexableFieldType.tokenDocValuesType(), in addition to the existing IndexableFieldType.docValuesType().
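
To make the proposal concrete, here is a minimal, self-contained sketch of how a field type carrying the proposed tokenDocValuesType() might drive per-token DocValues. It does not use Lucene classes: FieldTypeSketch, the reduced DocValuesType enum, and the toy lowercase/split analysis are all hypothetical stand-ins, and the actual IndexingChain integration may look quite different.

```java
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

public class TokenDocValuesSketch {

  /** Reduced stand-in for Lucene's DocValuesType enum (the real enum has more constants). */
  public enum DocValuesType { NONE, SORTED_SET }

  /** Hypothetical stand-in for IndexableFieldType, with the accessor proposed in this issue. */
  public interface FieldTypeSketch {
    DocValuesType docValuesType();      // existing accessor: DocValues from the raw field value
    DocValuesType tokenDocValuesType(); // proposed accessor: DocValues from post-analysis tokens
  }

  /** Builds a field type with fixed settings (helper invented for this sketch). */
  public static FieldTypeSketch fieldType(final DocValuesType dv, final DocValuesType tokenDv) {
    return new FieldTypeSketch() {
      @Override public DocValuesType docValuesType() { return dv; }
      @Override public DocValuesType tokenDocValuesType() { return tokenDv; }
    };
  }

  /**
   * Returns the de-duplicated, sorted tokens that a SORTED_SET-style token DocValues
   * field would hold for one document, or an empty set if the option is not enabled.
   * The "analysis" here is a toy: lowercase, then split on non-alphanumeric runs.
   */
  public static SortedSet<String> tokenDocValues(FieldTypeSketch type, String text) {
    SortedSet<String> values = new TreeSet<>();
    if (type.tokenDocValuesType() != DocValuesType.SORTED_SET) {
      return values; // explicit opt-in: nothing is recorded unless configured
    }
    for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (!token.isEmpty()) {
        values.add(token);
      }
    }
    return values;
  }

  public static void main(String[] args) {
    FieldTypeSketch facetable = fieldType(DocValuesType.NONE, DocValuesType.SORTED_SET);
    System.out.println(tokenDocValues(facetable, "Call me Ishmael. Call me."));
    // prints [call, ishmael, me]
  }
}
```

The explicit opt-in check mirrors point 2 above: a field only gets per-token DocValues when its type asks for them, so the "word cloud" cost is never incurred by default.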


            People

              Assignee: Unassigned
              Reporter: Michael Gibney (magibney)
              Votes: 0
              Watchers: 4


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1.5h