Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None

      Description

      For applications with many indexed fields, the norms cause memory problems both during indexing and querying.
      This patch makes norms optional on a per-field basis, in the same way that term vectors are optional per-field.

      Overview of changes:

      • Field.omitNorms that defaults to false
      • backward-compatible Lucene file format change: FieldInfos.FieldBits has a bit for omitNorms
      • IndexReader.hasNorms() method
      • During merging, if any segment includes norms, then norms are included.
      • methods to get norms return the equivalent 1.0f array for backward compatibility
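The file format change and the merge rule listed above can be illustrated with a small standalone sketch. It does not use any Lucene classes: the bit values, the class name `OmitNormsBitsSketch`, and the helper methods are assumptions made for illustration and are not taken from the patch.

```java
// Standalone sketch; none of these names or values come from the patch itself.
final class OmitNormsBitsSketch {
    // Hypothetical bit positions inside a per-field flags byte.
    static final byte IS_INDEXED_BIT = 0x01;
    static final byte OMIT_NORMS_BIT = 0x10;

    // Pack the per-field flags into one byte, as a FieldBits-style format might.
    static byte encode(boolean indexed, boolean omitNorms) {
        byte bits = 0;
        if (indexed) bits |= IS_INDEXED_BIT;
        if (omitNorms) bits |= OMIT_NORMS_BIT;
        return bits;
    }

    // An index written before the change never sets the new bit,
    // so decoding it yields omitNorms == false (the default).
    static boolean omitNorms(byte bits) {
        return (bits & OMIT_NORMS_BIT) != 0;
    }

    // Merge rule from the description: if any segment includes norms for
    // a field, the merged segment includes them. Equivalently, the merged
    // field omits norms only when every input segment omits them.
    static boolean mergedOmitNorms(boolean... segmentOmitsNorms) {
        for (boolean omits : segmentOmitsNorms) {
            if (!omits) return false;
        }
        return true;
    }
}
```

Under these assumptions, a flags byte from an old index decodes to omitNorms=false, and merging one segment that stores norms with one that omits them produces a segment that stores norms.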

      The patch was designed for backward compatibility:

      • all current unit tests pass w/o any modifications required
      • compatible with old indexes since the default is omitNorms=false
      • compatible with older/custom subclasses of IndexReader since a default hasNorms() is provided
      • compatible with older/custom users of IndexReader such as Weight/Scorer/explain since a norm array is produced on demand, even if norms were not stored
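The on-demand norm array mentioned in the last point could work roughly as sketched below. This is a simplified stand-in, not the patch's code: the class `FakeNormsSketch`, its methods, and the `ENCODED_ONE` constant (a placeholder for whatever byte the norm encoding assigns to 1.0f) are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a reader; all names here are hypothetical.
final class FakeNormsSketch {
    // Placeholder for the byte that encodes a norm of 1.0f.
    static final byte ENCODED_ONE = 124;

    private final int maxDoc;
    private final Map<String, byte[]> storedNorms = new HashMap<>();
    private byte[] fakeNorms; // created lazily, shared by all norm-less fields

    FakeNormsSketch(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    void putStoredNorms(String field, byte[] norms) {
        storedNorms.put(field, norms);
    }

    // True only if norms were actually stored for the field.
    boolean hasNorms(String field) {
        return storedNorms.containsKey(field);
    }

    // Always returns an array of length maxDoc, so callers that predate
    // omitNorms keep working. Callers must not modify the returned array.
    byte[] norms(String field) {
        byte[] stored = storedNorms.get(field);
        if (stored != null) return stored;
        if (fakeNorms == null) {
            fakeNorms = new byte[maxDoc];
            Arrays.fill(fakeNorms, ENCODED_ONE);
        }
        return fakeNorms;
    }
}
```

Sharing one lazily built array keeps the cost of the compatibility path to a single allocation per reader, at the price of requiring callers to treat the result as read-only.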

      If this patch is accepted (or if the direction is acceptable), performance for scoring could be improved by assuming 1.0f when hasNorms(field)==false.
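As a rough sketch of that optimization, a scorer could skip the norm lookup entirely when a field has no norms. The class `NormlessScoreSketch`, the `score` method, and its parameters are hypothetical; a null `norms` argument stands in for hasNorms(field)==false.

```java
// Hypothetical scoring helper; not part of the patch.
final class NormlessScoreSketch {
    // rawScore:    the score before the norm is applied
    // norms:       one encoded norm byte per document, or null if the field has none
    // decodeTable: 256-entry table mapping encoded bytes to float norms
    static float score(float rawScore, byte[] norms, float[] decodeTable, int doc) {
        if (norms == null) {
            // No stored norms: the norm is implicitly 1.0f, so skip the
            // array read, the table lookup, and the multiplication.
            return rawScore;
        }
        return rawScore * decodeTable[norms[doc] & 0xFF];
    }
}
```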

      1. omitNorms.txt
        18 kB
        Yonik Seeley

        Activity

        Doug Cutting added a comment -

        +1

        This can greatly reduce the amount of memory used by indexes with lots of fields.

        It might be nice to add something like a Field.Index.NO_NORMS, that assumes un-tokenized...

        Yonik Seeley added a comment -

        > It might be nice to add something like a Field.Index.NO_NORMS, that assumes un-tokenized...

        Good idea... un-tokenized fields don't need a lengthNorm anyway.

        Minor Q: Should fakeNorms() exist on IndexReader (as is now), or simply be private to both SegmentReader and MultiReader (the only two that need to generate fake norm arrays)?

        Very minor Q: Should the getter/setter currently named isOmitNorms()/setOmitNorms() be renamed... I followed the example of isStoreOffsetWithTermVector(), but omitNorms()/omitNorms(boolean) reads nicer in code.

        Doug Cutting added a comment -

        Un-tokenized fields don't need a lengthNorm, but they can be boosted. So it should be well documented that disabling norms disables boosting.

        I'd hide fakeNorms(). If user code shouldn't call it, then it shouldn't appear in the javadoc. You could make it package-private. Or, can you not make MultiReader.norms() rely on SegmentReader.norms() to create fake norms as needed?

        As for naming setter/getters: I don't feel strongly about this. I sometimes use get/set, even when I might prefer omitting them, simply because it is the fashion and the style police hassle me when I don't.


          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Yonik Seeley
          • Votes:
            4
            Watchers:
            0
