  1. Lucene - Core
  2. LUCENE-8403

Support 'filtered' term vectors - don't require all terms to be present

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix

    Description

      The genesis of this was a conversation and idea from David Smiley several years ago.

      To reduce term vector storage, we may not actually need every token to be present in the term vectors. If so, ideally our codec could simply opt not to store the unneeded ones.

      I attempted to fork the standard codec and override the TermVectorsFormat and TermVectorsWriter to skip storing certain terms within a field. This worked; however, when term vectors (TVs) are enabled, CheckIndex checks that the terms present in the standard postings are also present in the TVs. The resulting index is therefore not 'valid' according to CheckIndex.
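
      The mismatch described above can be illustrated with a toy model. This is only a sketch using plain Java collections, not Lucene code: the class FilteredTermVectorSketch and all of its methods are hypothetical names invented for illustration, and the consistency check is modeled on the description's wording ("terms present in the postings are also present in the TVs") rather than on CheckIndex's actual implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical toy model; none of these names exist in Lucene.
final class FilteredTermVectorSketch {

    // Postings keep every distinct term of the field.
    static Set<String> postingsTerms(List<String> tokens) {
        return new TreeSet<>(tokens);
    }

    // A "filtered" term vector: term -> frequency, but only for terms the filter keeps.
    static Map<String, Integer> filteredTermVector(List<String> tokens, Predicate<String> keep) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String token : tokens) {
            if (keep.test(token)) {
                vector.merge(token, 1, Integer::sum);
            }
        }
        return vector;
    }

    // Models the consistency check as the description states it:
    // returns the postings terms that have no entry in the term vector.
    static Set<String> missingFromTermVector(Set<String> postings, Map<String, Integer> vector) {
        Set<String> missing = new TreeSet<>(postings);
        missing.removeAll(vector.keySet());
        return missing;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "the", "fox");
        Set<String> dropped = new HashSet<>(Arrays.asList("the"));

        Set<String> postings = postingsTerms(tokens);
        Map<String, Integer> vector = filteredTermVector(tokens, t -> !dropped.contains(t));

        // A non-empty result is what such a check would report as an invalid index.
        System.out.println("missing from term vector: " + missingFromTermVector(postings, vector));
        // prints: missing from term vector: [the]
    }
}
```

      In this model the filtered vector is always a subset of the postings, so any check that requires the two term sets to match will flag every filtered term, which is the conflict the questions below are about.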

      Can the TermVectorsFormat be designed to support configuring which tokens should not be stored (benefits: less storage, faster retrieval per doc)? Is this valuable to the wider community? Is there a way to design this so that it does not break CheckIndex's contract while still reducing storage for unneeded tokens?

      Attachments

        1. LUCENE-8403.patch
          18 kB
          Atri Sharma

          People

            Assignee: Unassigned
            Reporter: Michael Braun (mbraun688)
            Votes: 0
            Watchers: 5
