Lucene - Core
  1. Lucene - Core
  2. LUCENE-3226

rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 3.2
    • Fix Version/s: 3.3, 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      A 3.2 user recently asked if something was wrong because CheckIndex was reporting his (newly built) index version as...

      Segments file=segments_or numSegments=1 version=FORMAT_3_1 [Lucene 3.1]
      

      It seems like there are two very confusing pieces of information here...

      1) the variable name of SegmentInfos.FORMAT_3_1 seems like poor choice. All other FORMAT_* constants in SegmentInfos are descriptive of the actual change made, and not specific to the version when they were introduced.

      2) whatever the name of the FORMAT_* variable, CheckIndex is labeling it "Lucene 3.1", which is missleading since that format is alwasy used in 3.2 (and probably 3.3, etc...).

      I suggest:
      a) rename FORMAT_3_1 to something like "FORMAT_SEGMENT_RECORDS_VERSION"
      b) change CheckIndex so that the label for the "newest" format always ends with " and later" (ie: "Lucene 3.1 and later") so when we release versions w/o a format change we don't have to remember to manual list them in CheckIndex. when we do make format changes and update CheckIndex " and later" can be replaced with " to X.Y" and the new format can be added

      1. LUCENE-3226.patch
        4 kB
        Shai Erera
      2. LUCENE-3226.patch
        5 kB
        Shai Erera
      3. LUCENE-3226.patch
        8 kB
        Robert Muir

        Activity

        Hoss Man created issue -
        Robert Muir made changes -
        Field Original Value New Value
        Attachment LUCENE-3226.patch [ 12483371 ]
        Hide
        Robert Muir added a comment -

        I'd like to commit this if nobody objects, its trivial and will reduce confusion.

        Show
        Robert Muir added a comment - I'd like to commit this if nobody objects, its trivial and will reduce confusion.
        Robert Muir made changes -
        Assignee Robert Muir [ rcmuir ]
        Fix Version/s 4.0 [ 12314025 ]
        Hide
        Hoss Man added a comment -

        looks good to me.

        strictly speaking when backporting we probably need to leave FORMAT_3_1 as a deprecated alias for FORMAT_SEGMENT_RECORDS_VERSION since it's currently public (but that doesn't need to affect trunk)

        Show
        Hoss Man added a comment - looks good to me. strictly speaking when backporting we probably need to leave FORMAT_3_1 as a deprecated alias for FORMAT_SEGMENT_RECORDS_VERSION since it's currently public (but that doesn't need to affect trunk)
        Hide
        Robert Muir added a comment -

        strictly speaking when backporting we probably need to leave FORMAT_3_1 as a deprecated alias for FORMAT_SEGMENT_RECORDS_VERSION since it's currently public (but that doesn't need to affect trunk)

        Well, strictly strictly the class is @lucene.experimental... but I'll add the deprecated pointer just because it won't hurt.

        Show
        Robert Muir added a comment - strictly speaking when backporting we probably need to leave FORMAT_3_1 as a deprecated alias for FORMAT_SEGMENT_RECORDS_VERSION since it's currently public (but that doesn't need to affect trunk) Well, strictly strictly the class is @lucene.experimental... but I'll add the deprecated pointer just because it won't hurt.
        Hide
        Robert Muir added a comment -

        Thanks hossman

        Show
        Robert Muir added a comment - Thanks hossman
        Robert Muir made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Shai Erera added a comment -

        I'd like to commit this if nobody objects, its trivial and will reduce confusion.

        I would have (objected), only it all happened so fast while I was asleep .

        I called the constant like that because I didn't like the other constant names (and don't like the new one either). The problem is that when it comes to remove support for a certain format, it is very hard to understand from the constant name what index format it represents.

        Instead, I chose a meaningful constant name for the developer, with a documentation that explains what has changed. When we're on 5.0 and need to remove support for 3.x, it will be very easy to delete all the places in the code which reference FORMAT_3_1, rather than go ask Mike what version FORMAT_SEGMENT_RECORDS_VERSION relates to .

        Also, in LUCENE-2921 I plan to get rid of all those ridiculous constant names and track the index version at the segment level only. It will be easier, IMO, to have an easy to understand constant name when it comes to supporting an older index (or remove support for). Perhaps it's only me, but when I read those format constant names, I only did that when removing support for older indexes. Other than that, they are not very interesting ...

        What Hoss reported about CheckIndex is the real problem we should handle here. SegmentInfo prints in its toString the code version which created it, which is better than seeing -9 IMO, and that should be "3.1" or "3.2". If it's a 3.2.0 newly created index, you shouldn't see "3.1" reported from SegmentInfos.toString. Perhaps CheckIndex needs to be fixed to refer to Constants.LUCENE_MAIN_VERSION instead?

        Robert, shall we reopen the issue to discuss?

        Show
        Shai Erera added a comment - I'd like to commit this if nobody objects, its trivial and will reduce confusion. I would have (objected), only it all happened so fast while I was asleep . I called the constant like that because I didn't like the other constant names (and don't like the new one either). The problem is that when it comes to remove support for a certain format, it is very hard to understand from the constant name what index format it represents. Instead, I chose a meaningful constant name for the developer, with a documentation that explains what has changed. When we're on 5.0 and need to remove support for 3.x, it will be very easy to delete all the places in the code which reference FORMAT_3_1, rather than go ask Mike what version FORMAT_SEGMENT_RECORDS_VERSION relates to . Also, in LUCENE-2921 I plan to get rid of all those ridiculous constant names and track the index version at the segment level only. It will be easier, IMO, to have an easy to understand constant name when it comes to supporting an older index (or remove support for). Perhaps it's only me, but when I read those format constant names, I only did that when removing support for older indexes. Other than that, they are not very interesting ... What Hoss reported about CheckIndex is the real problem we should handle here. SegmentInfo prints in its toString the code version which created it, which is better than seeing -9 IMO, and that should be "3.1" or "3.2". If it's a 3.2.0 newly created index, you shouldn't see "3.1" reported from SegmentInfos.toString. Perhaps CheckIndex needs to be fixed to refer to Constants.LUCENE_MAIN_VERSION instead? Robert, shall we reopen the issue to discuss?
        Hide
        Robert Muir added a comment -

        Also, in LUCENE-2921 I plan to get rid of all those ridiculous constant names and track the index version at the segment level only. It will be easier, IMO, to have an easy to understand constant name when it comes to supporting an older index (or remove support for). Perhaps it's only me, but when I read those format constant names, I only did that when removing support for older indexes. Other than that, they are not very interesting ...

        What Hoss reported about CheckIndex is the real problem we should handle here. SegmentInfo prints in its toString the code version which created it, which is better than seeing -9 IMO, and that should be "3.1" or "3.2". If it's a 3.2.0 newly created index, you shouldn't see "3.1" reported from SegmentInfos.toString. Perhaps CheckIndex needs to be fixed to refer to Constants.LUCENE_MAIN_VERSION instead?

        Robert, shall we reopen the issue to discuss?

        We can reopen... but the issue will always exist here, LUCENE-2921 can't solve this particular case since its the segments file...

        Show
        Robert Muir added a comment - Also, in LUCENE-2921 I plan to get rid of all those ridiculous constant names and track the index version at the segment level only. It will be easier, IMO, to have an easy to understand constant name when it comes to supporting an older index (or remove support for). Perhaps it's only me, but when I read those format constant names, I only did that when removing support for older indexes. Other than that, they are not very interesting ... What Hoss reported about CheckIndex is the real problem we should handle here. SegmentInfo prints in its toString the code version which created it, which is better than seeing -9 IMO, and that should be "3.1" or "3.2". If it's a 3.2.0 newly created index, you shouldn't see "3.1" reported from SegmentInfos.toString. Perhaps CheckIndex needs to be fixed to refer to Constants.LUCENE_MAIN_VERSION instead? Robert, shall we reopen the issue to discuss? We can reopen... but the issue will always exist here, LUCENE-2921 can't solve this particular case since its the segments file...
        Robert Muir made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Assignee Robert Muir [ rcmuir ]
        Hide
        Shai Erera added a comment -

        How does renaming a constant solve the CheckIndex issue? I commented on the constant name, and I think it should reflect the code version it applies to, not the feature. Because if e.g. in the same version you add two features, incrementally, you wouldn't change the format number twice right? And then the constant name becomes meaningless again, or too complicated. It happened to me a while ago (can't remember the exact feature though, perhaps it was in TermInfos).

        I mentioned LUCENE-2921 only because I intended to name the constants exactly that (X_Y).

        I see you've already reverted the changes you made. I think that the changes to CheckIndex could remain though, adding the "3.1+" to the string?

        Show
        Shai Erera added a comment - How does renaming a constant solve the CheckIndex issue? I commented on the constant name, and I think it should reflect the code version it applies to, not the feature. Because if e.g. in the same version you add two features, incrementally, you wouldn't change the format number twice right? And then the constant name becomes meaningless again, or too complicated. It happened to me a while ago (can't remember the exact feature though, perhaps it was in TermInfos). I mentioned LUCENE-2921 only because I intended to name the constants exactly that (X_Y). I see you've already reverted the changes you made. I think that the changes to CheckIndex could remain though, adding the "3.1+" to the string?
        Hide
        Robert Muir added a comment -

        I see you've already reverted the changes you made. I think that the changes to CheckIndex could remain though, adding the "3.1+" to the string?

        this sounds good, but is it enough to reduce the confusion in checkindex?

        Maybe it has to do with how the System.out.println is formatted, currently its:

            msg("Segments file=" + segmentsFileName + " numSegments=" + numSegments + " version=" + sFormat + userDataString);
        

        but maybe instead of 'version=' it should say something more clear, like format=, or even segmentsFormat=

        Show
        Robert Muir added a comment - I see you've already reverted the changes you made. I think that the changes to CheckIndex could remain though, adding the "3.1+" to the string? this sounds good, but is it enough to reduce the confusion in checkindex? Maybe it has to do with how the System.out.println is formatted, currently its: msg("Segments file=" + segmentsFileName + " numSegments=" + numSegments + " version=" + sFormat + userDataString); but maybe instead of 'version=' it should say something more clear, like format=, or even segmentsFormat=
        Hide
        Shai Erera added a comment -

        how about printing the oldest and newest segment version?

        Show
        Shai Erera added a comment - how about printing the oldest and newest segment version?
        Hide
        Robert Muir added a comment -

        This would be good (as we can compute it from the segments file), but, we just have to think about how to display the case where this is null: we know its <= 3.0 in this case... but we don't know any more than that?

        Still we should do it, especially in 4.x when most indexes being checkIndexed will have this filled out (except 3.0 indexes)

        Show
        Robert Muir added a comment - This would be good (as we can compute it from the segments file), but, we just have to think about how to display the case where this is null: we know its <= 3.0 in this case... but we don't know any more than that? Still we should do it, especially in 4.x when most indexes being checkIndexed will have this filled out (except 3.0 indexes)
        Hide
        Shai Erera added a comment -

        We can print "pre-3.1".

        But, if somebody opened a 3.0 / 2.x index w/ 3.1+ and all segments were 'touched' by the 3.1+ code, then their version would be "3.0" or "2.x" (i.e., not null). So it could be that someone opens two indexes, and CheckIndex reports "oldVersion=pre-3.1" for one and "oldVersion=2.x" for the other. I think it's acceptable though.

        Show
        Shai Erera added a comment - We can print "pre-3.1". But, if somebody opened a 3.0 / 2.x index w/ 3.1+ and all segments were 'touched' by the 3.1+ code, then their version would be "3.0" or "2.x" (i.e., not null). So it could be that someone opens two indexes, and CheckIndex reports "oldVersion=pre-3.1" for one and "oldVersion=2.x" for the other. I think it's acceptable though.
        Hide
        Michael McCandless added a comment -

        Can we have the constant name be descriptive (reflect what actually changed) and then add a comment expressing the version when that change was made to Lucene?

        I think we should name them primarily for the benefit of developers working with the source code going forward, and only secondarily for the future developer who will some day need to remove them...

        Show
        Michael McCandless added a comment - Can we have the constant name be descriptive (reflect what actually changed) and then add a comment expressing the version when that change was made to Lucene? I think we should name them primarily for the benefit of developers working with the source code going forward, and only secondarily for the future developer who will some day need to remove them...
        Hide
        Shai Erera added a comment -

        What is the benefit of naming the constant according to what has changed? And what if two changes occur in the same release? These constants, IMO, are used only to detect code that is needed to support a certain version, and nothing more.

        And since the purpose of LUCENE-2921 is to move all index format tracking to be at the 'code'-level and not 'feature'-level, I'd assume the constants would be named accordingly.

        Show
        Shai Erera added a comment - What is the benefit of naming the constant according to what has changed? And what if two changes occur in the same release? These constants, IMO, are used only to detect code that is needed to support a certain version, and nothing more. And since the purpose of LUCENE-2921 is to move all index format tracking to be at the 'code'-level and not 'feature'-level, I'd assume the constants would be named accordingly.
        Hide
        Michael McCandless added a comment -

        What is the benefit of naming the constant according to what has changed?

        Because then devs trying to work w/ the code have some sense of what the change was? EG for debugging maybe it's helpful, eg if something has gone wrong, later, in how SegmentInfos is handling that version or what not.

        And what if two changes occur in the same release?

        Well, we can handle that case by case? I agree it's messy... maybe pick a name describing/subsuming both? Or favor one name (maybe the "bigger change") and use comments to explain the other change?
        But if there is a comment/comments above the constant containing this same information that's just as good...

        These constants, IMO, are used only to detect code that is needed to support a certain version, and nothing more.

        Right, but for the devs that need to revisit such code, it's helpful to know what "real" change occurred within that version... else, during debugging they'd have to eg go do some svn archaeology to understand the change.

        And since the purpose of LUCENE-2921 is to move all index format tracking to be at the 'code'-level and not 'feature'-level, I'd assume the constants would be named accordingly.

        True... so maybe we take this up under that issue? I would be OK with just having comments that describe what changed in each version...

        So for this issue maybe re-commit just the CheckIndex fix, and leave the constant naming fixes to LUCENE-2921?

        Show
        Michael McCandless added a comment - What is the benefit of naming the constant according to what has changed? Because then devs trying to work w/ the code have some sense of what the change was? EG for debugging maybe it's helpful, eg if something has gone wrong, later, in how SegmentInfos is handling that version or what not. And what if two changes occur in the same release? Well, we can handle that case by case? I agree it's messy... maybe pick a name describing/subsuming both? Or favor one name (maybe the "bigger change") and use comments to explain the other change? But if there is a comment/comments above the constant containing this same information that's just as good... These constants, IMO, are used only to detect code that is needed to support a certain version, and nothing more. Right, but for the devs that need to revisit such code, it's helpful to know what "real" change occurred within that version... else, during debugging they'd have to eg go do some svn archaeology to understand the change. And since the purpose of LUCENE-2921 is to move all index format tracking to be at the 'code'-level and not 'feature'-level, I'd assume the constants would be named accordingly. True... so maybe we take this up under that issue? I would be OK with just having comments that describe what changed in each version... So for this issue maybe re-commit just the CheckIndex fix, and leave the constant naming fixes to LUCENE-2921 ?
        Hide
        Shai Erera added a comment -

        I would be OK with just having comments that describe what changed in each version...

        Yeah, that's what I thought. Constant name denotes the code version, documentation denotes the actual changes.

        So for this issue maybe re-commit just the CheckIndex fix

        I think that that's what Robert and I agreed to do, and we moved to discuss what should be the actual message printed, so it's less confusing to the users.

        Show
        Shai Erera added a comment - I would be OK with just having comments that describe what changed in each version... Yeah, that's what I thought. Constant name denotes the code version, documentation denotes the actual changes. So for this issue maybe re-commit just the CheckIndex fix I think that that's what Robert and I agreed to do, and we moved to discuss what should be the actual message printed, so it's less confusing to the users.
        Hide
        Michael McCandless added a comment -

        OK I agree, I think

        Who will re-commit the CheckIndex fix here...?

        Show
        Michael McCandless added a comment - OK I agree, I think Who will re-commit the CheckIndex fix here...?
        Hide
        Shai Erera added a comment -

        Patch improves CheckIndex output (includes information about oldest/newest segments.

        On the way I fixed a bug in StringHelper.versionComparator (could overflow if Integer.MIN/MAX_VAL were used.

        The changes to TestDemo won't be committed. I just included them here so you can run the test and check the output.

        Show
        Shai Erera added a comment - Patch improves CheckIndex output (includes information about oldest/newest segments. On the way I fixed a bug in StringHelper.versionComparator (could overflow if Integer.MIN/MAX_VAL were used. The changes to TestDemo won't be committed. I just included them here so you can run the test and check the output.
        Shai Erera made changes -
        Attachment LUCENE-3226.patch [ 12483694 ]
        Hide
        Michael McCandless added a comment -

        Patch looks good – I like how CheckIndex now tells you version range of your segments.

        Show
        Michael McCandless added a comment - Patch looks good – I like how CheckIndex now tells you version range of your segments.
        Hide
        Shai Erera added a comment -

        Changed the message format a bit (Thanks Robert for the feedback). Now it prints 'version=x.y' if all segments are on the same version, or 'versions=[a.b .. c.d]' if there is more than one version.

        I plan to commit this.

        Show
        Shai Erera added a comment - Changed the message format a bit (Thanks Robert for the feedback). Now it prints 'version=x.y' if all segments are on the same version, or 'versions= [a.b .. c.d] ' if there is more than one version. I plan to commit this.
        Shai Erera made changes -
        Attachment LUCENE-3226.patch [ 12483697 ]
        Hide
        Robert Muir added a comment -

        +1

        Show
        Robert Muir added a comment - +1
        Hide
        Robert Muir added a comment -

        lets backport this to 3.3? a few issues have been found/fixed already, so i don't mind respinning with this one too, since i think it will eliminate confusion.

        Show
        Robert Muir added a comment - lets backport this to 3.3? a few issues have been found/fixed already, so i don't mind respinning with this one too, since i think it will eliminate confusion.
        Hide
        Shai Erera added a comment -

        Committed revision 1139284 (trunk).
        Committed revision 1139286 (3x).
        Committed revision 1139300 (3.3).

        Thanks Robert and Mike for the review !

        Show
        Shai Erera added a comment - Committed revision 1139284 (trunk). Committed revision 1139286 (3x). Committed revision 1139300 (3.3). Thanks Robert and Mike for the review !
        Shai Erera made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Lucene Fields [New] [New, Patch Available]
        Assignee Shai Erera [ shaie ]
        Fix Version/s 3.4 [ 12316675 ]
        Resolution Fixed [ 1 ]
        Hide
        Robert Muir added a comment -

        bulk close for 3.3

        Show
        Robert Muir added a comment - bulk close for 3.3
        Robert Muir made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Shai Erera
            Reporter:
            Hoss Man
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development