Lucene - Core
  1. Lucene - Core
  2. LUCENE-6006

Replace FieldInfo.normsType with FieldInfo.hasNorms boolean

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I came across this precursor while working on LUCENE-6005:

      I think FieldInfo.normsType can only be null (field did not index
      norms) or DocValuesType.NUMERIC (it did). I'd like to simplify to
      just boolean hasNorms.

      This is a strange boolean, though: in theory it should be derived from
      indexed && omitNorms == false, but we have it for the exceptions
      case where every document in a segment hit an exception and never
      added norms. I think this is the only reason it exists? (In theory,
      such cases should result in 100% deleted segments, which IW should
      then drop ... but seems dangerous to "rely" on that).

      So I changed the indexing chain to just fill in the default (0) norms
      for all documents in such exceptional cases; this way going forward
      (starting with 5.0 indices) we really don't need this hasNorms. But
      we still need it for pre-5.0 indices...

      1. LUCENE-6006.patch
        75 kB
        Michael McCandless
      2. LUCENE-6006.patch
        61 kB
        Michael McCandless
      3. LUCENE-6006.patch
        36 kB
        Michael McCandless
      4. LUCENE-6006.patch
        43 kB
        Michael McCandless
      5. LUCENE-6006.patch
        29 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch, I think it's ready. I think it means that once I commit to 5.x, then for trunk I can remove this hasNorms since 5.x indices will never have hasNorms=false when indexed=true and omitNorms=false.

        Show
        Michael McCandless added a comment - Patch, I think it's ready. I think it means that once I commit to 5.x, then for trunk I can remove this hasNorms since 5.x indices will never have hasNorms=false when indexed=true and omitNorms=false.
        Hide
        Robert Muir added a comment -

        but we have it for the exceptions
        case where every document in a segment hit an exception and never
        added norms. I think this is the only reason it exists?

        Don't think so. the main trouble is sparse fields actually. Add Doc with Field A, then go delete that doc...

        Show
        Robert Muir added a comment - but we have it for the exceptions case where every document in a segment hit an exception and never added norms. I think this is the only reason it exists? Don't think so. the main trouble is sparse fields actually. Add Doc with Field A, then go delete that doc...
        Hide
        Michael McCandless added a comment -

        Don't think so. the main trouble is sparse fields actually. Add Doc with Field A, then go delete that doc...

        Hmm, and then go and merge that segment (with other segments just like it), right? And then you have a segment that has not omitted norms, yet no documents in the segment actually have a norm.

        But do we really need to track that separate state? Why not just let all docs have the default (0) norm for such segments? Default codec does sparse encoding now for norms...

        Show
        Michael McCandless added a comment - Don't think so. the main trouble is sparse fields actually. Add Doc with Field A, then go delete that doc... Hmm, and then go and merge that segment (with other segments just like it), right? And then you have a segment that has not omitted norms, yet no documents in the segment actually have a norm. But do we really need to track that separate state? Why not just let all docs have the default (0) norm for such segments? Default codec does sparse encoding now for norms...
        Hide
        Robert Muir added a comment -

        Right, thats a very legit decision now. But it wasnt the case before. Relying on the codec to do sparse compression in such a case is a little awkward. A .. much more difficult... alternative would be for merging to not set indexed=true unless it actually sees a posting.

        Show
        Robert Muir added a comment - Right, thats a very legit decision now. But it wasnt the case before. Relying on the codec to do sparse compression in such a case is a little awkward. A .. much more difficult... alternative would be for merging to not set indexed=true unless it actually sees a posting.
        Hide
        Michael McCandless added a comment -

        I think that because 1) default codec now handles sparse norms well, and 2) this is an "exotic" case, we should in fact just drop the "hasNorms" in trunk. We must keep it in 5.x because 4.x indices have such segments.

        Relying on the codec to do sparse compression in such a case is a little awkward.

        Well, we can't keep compromising Lucene's internal design for the "least common denominator" of codecs out there. If you are one of the apps hitting this exotic use case, you'll need to use a codec that can sparse-encode your norms...

        A .. much more difficult... alternative would be for merging to not set indexed=true unless it actually sees a posting.

        Hmm that's true ... but I think such an optimization is not worth the added code complexity for the exotic case where you indexed only some documents and then later deleted them all.

        Show
        Michael McCandless added a comment - I think that because 1) default codec now handles sparse norms well, and 2) this is an "exotic" case, we should in fact just drop the "hasNorms" in trunk. We must keep it in 5.x because 4.x indices have such segments. Relying on the codec to do sparse compression in such a case is a little awkward. Well, we can't keep compromising Lucene's internal design for the "least common denominator" of codecs out there. If you are one of the apps hitting this exotic use case, you'll need to use a codec that can sparse-encode your norms... A .. much more difficult... alternative would be for merging to not set indexed=true unless it actually sees a posting. Hmm that's true ... but I think such an optimization is not worth the added code complexity for the exotic case where you indexed only some documents and then later deleted them all.
        Hide
        Robert Muir added a comment -

        Well, we can't keep compromising Lucene's internal design for the "least common denominator" of codecs out there. If you are one of the apps hitting this exotic use case, you'll need to use a codec that can sparse-encode your norms..

        I don't look at it as a design compromise, instead pushing the burden of all the "crazy things users do" into the codec: it makes one more difficult to write, because if it doesn't incorporate this optimization, users with 100,000 fields will complain.

        Even so, the patch might be the right way to go. I just am sad it does not actually go and clean this stuff up, instead waiting for it to "age away" in 6.x. I would instead nuke this boolean in 5.x. Just means 4.x fieldinfosreader needs to set a codec attribute for the "old boolean". 4.x norms readers need to look for such an attribute and if they see it, return DocValues.emptyNumeric() for the field.

        To be able to safely do things like this, we must fix LUCENE-5990 first, and ensure that "fieldinfos in" == "fieldinfos out" for everything in the codec API. Otherwise they cannot rely upon things like attributes and we can't do such cleanups.

        Show
        Robert Muir added a comment - Well, we can't keep compromising Lucene's internal design for the "least common denominator" of codecs out there. If you are one of the apps hitting this exotic use case, you'll need to use a codec that can sparse-encode your norms.. I don't look at it as a design compromise, instead pushing the burden of all the "crazy things users do" into the codec: it makes one more difficult to write, because if it doesn't incorporate this optimization, users with 100,000 fields will complain. Even so, the patch might be the right way to go. I just am sad it does not actually go and clean this stuff up, instead waiting for it to "age away" in 6.x. I would instead nuke this boolean in 5.x. Just means 4.x fieldinfosreader needs to set a codec attribute for the "old boolean". 4.x norms readers need to look for such an attribute and if they see it, return DocValues.emptyNumeric() for the field. To be able to safely do things like this, we must fix LUCENE-5990 first, and ensure that "fieldinfos in" == "fieldinfos out" for everything in the codec API. Otherwise they cannot rely upon things like attributes and we can't do such cleanups.
        Hide
        Michael McCandless added a comment -

        I just am sad it does not actually go and clean this stuff up, instead waiting for it to "age away" in 6.x. I would instead nuke this boolean in 5.x. Just means 4.x fieldinfosreader needs to set a codec attribute for the "old boolean". 4.x norms readers need to look for such an attribute and if they see it, return DocValues.emptyNumeric() for the field.

        Ooh I like that idea, let me try it. Then we can nuke this boolean on 5.x as well, and make these spooky "undead norms" codec private.

        Show
        Michael McCandless added a comment - I just am sad it does not actually go and clean this stuff up, instead waiting for it to "age away" in 6.x. I would instead nuke this boolean in 5.x. Just means 4.x fieldinfosreader needs to set a codec attribute for the "old boolean". 4.x norms readers need to look for such an attribute and if they see it, return DocValues.emptyNumeric() for the field. Ooh I like that idea, let me try it. Then we can nuke this boolean on 5.x as well, and make these spooky "undead norms" codec private.
        Hide
        Michael McCandless added a comment -

        New patch, removing the hasNorms boolean (I still left FieldInfo/s.hasNorms, which return indexed && omitNorms == false) and making the "undead norms" case codec private, but I'm nervous about those changes ... seems a bit evil having the FieldInfos format from one codec (Lucene46) communicating to the Norms format from another (Lucene42). And I have nocommits to somehow test that part ...

        Show
        Michael McCandless added a comment - New patch, removing the hasNorms boolean (I still left FieldInfo/s.hasNorms, which return indexed && omitNorms == false) and making the "undead norms" case codec private, but I'm nervous about those changes ... seems a bit evil having the FieldInfos format from one codec (Lucene46) communicating to the Norms format from another (Lucene42). And I have nocommits to somehow test that part ...
        Hide
        Michael McCandless added a comment -

        New patch, with cleaner private undead norms emulation in the legacy codecs ... I still need to work out a way to test this ...

        Show
        Michael McCandless added a comment - New patch, with cleaner private undead norms emulation in the legacy codecs ... I still need to work out a way to test this ...
        Hide
        Robert Muir added a comment -

        I like the latest patch. The back compat is contained, and the more I think about it, the more horrible I think this boolean is.

        Show
        Robert Muir added a comment - I like the latest patch. The back compat is contained, and the more I think about it, the more horrible I think this boolean is.
        Hide
        Michael McCandless added a comment -

        Another patch, improving javadocs and fixing the nocommits.

        To test this, I just did "the obvious thing" and created undead
        indices for testing.

        This then uncovered a corner case bug (now fixed!) in the previous
        patch when the entire index (not just one field) is undead.

        I think it's ready...

        Show
        Michael McCandless added a comment - Another patch, improving javadocs and fixing the nocommits. To test this, I just did "the obvious thing" and created undead indices for testing. This then uncovered a corner case bug (now fixed!) in the previous patch when the entire index (not just one field) is undead. I think it's ready...
        Hide
        Robert Muir added a comment -

        In this "undead army" case, where every field is undead, instead of having lots of null checks in the NormsProducers, could we just make an all-empty producer, used by these backwards codecs? It will die with them (can sit in bakcwards-codecs package). I think it might be cleaner?

        Show
        Robert Muir added a comment - In this "undead army" case, where every field is undead, instead of having lots of null checks in the NormsProducers, could we just make an all-empty producer, used by these backwards codecs? It will die with them (can sit in bakcwards-codecs package). I think it might be cleaner?
        Hide
        Michael McCandless added a comment -

        Good idea Rob ... I factored out the undead handling into a new UndeadNormsProducer, and added static methods that per-version formats invoke.

        I also added support & test for Lucene49NormsFormat, which can also be confronted by undead norms.

        I think it's ready. This is the world's most difficult-to-eliminate "useless" boolean!

        Show
        Michael McCandless added a comment - Good idea Rob ... I factored out the undead handling into a new UndeadNormsProducer, and added static methods that per-version formats invoke. I also added support & test for Lucene49NormsFormat, which can also be confronted by undead norms. I think it's ready. This is the world's most difficult-to-eliminate "useless" boolean!
        Hide
        Robert Muir added a comment -

        +1, thank you!

        Show
        Robert Muir added a comment - +1, thank you!
        Hide
        ASF subversion and git services added a comment -

        Commit 1632120 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1632120 ]

        LUCENE-6006: remove unnecessary FieldInfo.normType

        Show
        ASF subversion and git services added a comment - Commit 1632120 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1632120 ] LUCENE-6006 : remove unnecessary FieldInfo.normType
        Hide
        ASF subversion and git services added a comment -

        Commit 1632133 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1632133 ]

        LUCENE-6006: remove unnecessary FieldInfo.normType

        Show
        ASF subversion and git services added a comment - Commit 1632133 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1632133 ] LUCENE-6006 : remove unnecessary FieldInfo.normType
        Hide
        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Show
        Anshum Gupta added a comment - Bulk close after 5.0 release.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development