Lucene - Core
  1. Lucene - Core
  2. LUCENE-3084

MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos & more refactoring

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused.

      We should cutover to List<SI> instead.

      Also SegmentInfos subclasses Vector<SI>, this should be removed and the collections be hidden inside the class. We can add unmodifiable views on it (asList(), asSet()).

      1. LUCENE-3084.patch
        12 kB
        Michael McCandless
      2. LUCENE-3084-3.x-only.patch
        34 kB
        Uwe Schindler
      3. LUCENE-3084-trunk-only.patch
        33 kB
        Uwe Schindler
      4. LUCENE-3084-trunk-only.patch
        33 kB
        Uwe Schindler
      5. LUCENE-3084-trunk-only.patch
        33 kB
        Uwe Schindler
      6. LUCENE-3084-trunk-only.patch
        31 kB
        Uwe Schindler
      7. LUCENE-3084-trunk-only.patch
        28 kB
        Uwe Schindler
      8. LUCENE-3084-trunk-only.patch
        24 kB
        Uwe Schindler
      9. LUCENE-3084-trunk-only.patch
        23 kB
        Uwe Schindler
      10. LUCENE-3084-trunk-only.patch
        7 kB
        Uwe Schindler
      11. LUCENE-3084-trunk-only.patch
        7 kB
        Uwe Schindler

        Activity

        Hide
        Uwe Schindler added a comment -

        For 3.2 it would be a minor backwards break, but code using SegmentInfos whould still compile. Only drop-in replacements will not work.

        Show
        Uwe Schindler added a comment - For 3.2 it would be a minor backwards break, but code using SegmentInfos whould still compile. Only drop-in replacements will not work.
        Hide
        Michael McCandless added a comment -

        Patch.

        Show
        Michael McCandless added a comment - Patch.
        Hide
        Michael McCandless added a comment -

        I think the minor break for 3.2 is acceptable...

        Show
        Michael McCandless added a comment - I think the minor break for 3.2 is acceptable...
        Hide
        Uwe Schindler added a comment -

        After some discussion with Mike we decided, to make some further API changes in 4.0:

        • No longer subclass java.util.Vector, instead ArrayList
        • rename SegmentInfos.range to cloneSubList() and let it also return List<SegmentInfo>
        • make OneMerge's list unmodifiable to protect against changes in consumers of the MergeSpecification (this item should in my opionion also backported to 3.x)

        I'll atach simple patch.

        Show
        Uwe Schindler added a comment - After some discussion with Mike we decided, to make some further API changes in 4.0: No longer subclass java.util.Vector, instead ArrayList rename SegmentInfos.range to cloneSubList() and let it also return List<SegmentInfo> make OneMerge's list unmodifiable to protect against changes in consumers of the MergeSpecification (this item should in my opionion also backported to 3.x) I'll atach simple patch.
        Hide
        Uwe Schindler added a comment -

        The above patch shows the problem with the current merge policy code: it seems that the list returned in OneMerge is sometimes modified, we should fix that (so patch not yet commitable)

        Show
        Uwe Schindler added a comment - The above patch shows the problem with the current merge policy code: it seems that the list returned in OneMerge is sometimes modified, we should fix that (so patch not yet commitable)
        Hide
        Earwin Burrfoot added a comment -
        • Speaking logically, merges operate on Sets of SIs, not List?
        • Let's stop subclassing random things? : ) SIS can contain a List of SIs (and maybe a Set, or whatever we need in the future), and only expose operations its clients really need.
        Show
        Earwin Burrfoot added a comment - Speaking logically, merges operate on Sets of SIs, not List? Let's stop subclassing random things? : ) SIS can contain a List of SIs (and maybe a Set, or whatever we need in the future), and only expose operations its clients really need.
        Hide
        Uwe Schindler added a comment -

        Different patch idea:

        • OneMerge clones the list, so all MergePolicys don't need to take care, that the list may change later (e.g. is it's based on SegmentInfos.subList() which changes later during merge)
        • Removed SIs.range() completely
        • Speaking logically, merges operate on Sets of SIs, not List?
        • Let's stop subclassing random things? : ) SIS can contain a List of SIs (and maybe a Set, or whatever we need in the future), and only expose operations its clients really need.

        Merges are ordered, so it must be an ordered Set like LinkedHashSet.

        SegmentInfos itsself must be list, as the segments of an index are ordered.

        All tests pass with this patch.

        The cloning should also be moved to OneMerge in 3.x. range() should simply delegate to subList (and no longer clone).

        Show
        Uwe Schindler added a comment - Different patch idea: OneMerge clones the list, so all MergePolicys don't need to take care, that the list may change later (e.g. is it's based on SegmentInfos.subList() which changes later during merge) Removed SIs.range() completely Speaking logically, merges operate on Sets of SIs, not List? Let's stop subclassing random things? : ) SIS can contain a List of SIs (and maybe a Set, or whatever we need in the future), and only expose operations its clients really need. Merges are ordered, so it must be an ordered Set like LinkedHashSet. SegmentInfos itsself must be list, as the segments of an index are ordered. All tests pass with this patch. The cloning should also be moved to OneMerge in 3.x. range() should simply delegate to subList (and no longer clone).
        Hide
        Earwin Burrfoot added a comment -

        Merges are ordered

        Hmm.. Why should they be?

        SegmentInfos itself must be list

        It may contain list as a field instead. And have a much cleaner API as a consequence.

        On another note, I wonder, is the fact that Vector is internally synchronized used somewhere within SegmentInfos client code?

        Show
        Earwin Burrfoot added a comment - Merges are ordered Hmm.. Why should they be? SegmentInfos itself must be list It may contain list as a field instead. And have a much cleaner API as a consequence. On another note, I wonder, is the fact that Vector is internally synchronized used somewhere within SegmentInfos client code?
        Hide
        Uwe Schindler added a comment -

        It may contain list as a field instead. And have a much cleaner API as a consequence.

        I agree, I dont like subclassing ArrayList (this is the only class in Lucene that subclasses java collection classes left over. We should rmeove.

        Problem: List access is heavy used in code, I have to first review all places. Alternatively we could make ist simply implement List<SegmentInfo>, so it does not get the extra methods from ArrayList that go beyond List. This would be a good way in the middle.

        On another note, I wonder, is the fact that Vector is internally synchronized used somewhere within SegmentInfos client code?

        Thats safe, synchronization is not an issue here, all access to SegmentInfos is protected by IndexWriter synchronization (see e.g. mergeInit()). The use of Vector is just a relict from earlier days. I should have removed that for 3.0 already but missed that exactly because I did not know if synchronization was needed.

        Show
        Uwe Schindler added a comment - It may contain list as a field instead. And have a much cleaner API as a consequence. I agree, I dont like subclassing ArrayList (this is the only class in Lucene that subclasses java collection classes left over. We should rmeove. Problem: List access is heavy used in code, I have to first review all places. Alternatively we could make ist simply implement List<SegmentInfo>, so it does not get the extra methods from ArrayList that go beyond List. This would be a good way in the middle. On another note, I wonder, is the fact that Vector is internally synchronized used somewhere within SegmentInfos client code? Thats safe, synchronization is not an issue here, all access to SegmentInfos is protected by IndexWriter synchronization (see e.g. mergeInit()). The use of Vector is just a relict from earlier days. I should have removed that for 3.0 already but missed that exactly because I did not know if synchronization was needed.
        Hide
        Uwe Schindler added a comment -

        Merges are ordered

        Hmm.. Why should they be?

        See also LUCENE-1076, in trunk we may remove this, but 3.x - not sure. There is currently nothing in Lucene that guarantees that docIDs of MultiReaders are not reoredered during merges, but lots of people expect/rely on this.

        Show
        Uwe Schindler added a comment - Merges are ordered Hmm.. Why should they be? See also LUCENE-1076 , in trunk we may remove this, but 3.x - not sure. There is currently nothing in Lucene that guarantees that docIDs of MultiReaders are not reoredered during merges, but lots of people expect/rely on this.
        Hide
        Michael McCandless added a comment -

        I would love to cutover to Set<SI>, but, I don't think we can. There are apps out there that want merges to remain contiguous (so docIDs keep their monotonicity).

        But I do think we should not keep that by default (I reopened LUCENE-1076 to switched to TieredMP in 3.x by default).

        Show
        Michael McCandless added a comment - I would love to cutover to Set<SI>, but, I don't think we can. There are apps out there that want merges to remain contiguous (so docIDs keep their monotonicity). But I do think we should not keep that by default (I reopened LUCENE-1076 to switched to TieredMP in 3.x by default).
        Hide
        Michael McCandless added a comment -

        Patch looks good – thanks Uwe!

        Show
        Michael McCandless added a comment - Patch looks good – thanks Uwe!
        Hide
        Simon Willnauer added a comment -

        Let's stop subclassing random things? : ) SIS can contain a List of SIs (and maybe a Set, or whatever we need in the future), and only expose operations its clients really need

        +1 we should if at all implement Iterable<SegmentInfo> instead of vector or arraylist.

        I would love to cutover to Set<SI>, but, I don't think we can. There are apps out there that want merges to remain contiguous (so docIDs keep their monotonicity).

        I think this should be feasible with a sorted set? It might make sense is the SIS case to hold a NavigableSet and I personally would prever that much over subclassing some random collection. If we refactor this we should refactor towards an interface not an implementation.

        Show
        Simon Willnauer added a comment - Let's stop subclassing random things? : ) SIS can contain a List of SIs (and maybe a Set, or whatever we need in the future), and only expose operations its clients really need +1 we should if at all implement Iterable<SegmentInfo> instead of vector or arraylist. I would love to cutover to Set<SI>, but, I don't think we can. There are apps out there that want merges to remain contiguous (so docIDs keep their monotonicity). I think this should be feasible with a sorted set? It might make sense is the SIS case to hold a NavigableSet and I personally would prever that much over subclassing some random collection. If we refactor this we should refactor towards an interface not an implementation.
        Hide
        Uwe Schindler added a comment -

        Just to note:

        Cutover to Set<SegmentInfos> was not intended for SegmentInfos itsself, it was proposed for the merge code (MergePolicy.OneMerge) only that returns which segments to merge. And there currently the interface is List, because of ordering (see LUCENE-1076).

        NavigableSet ist Java 6. SortedSet only works if the set ordering is defined by its contents, which is not the case for SegmentInfos (the ordering is given by the file on disk). The only thing that could work is combination of List and Set, the Set only to check for duplicates. SegmentInfos is still required to be List-style, but should not allow to add the same SegmentInfo two times.

        This is why I said let it implement List<SI>, this would also break no code.

        Show
        Uwe Schindler added a comment - Just to note: Cutover to Set<SegmentInfos> was not intended for SegmentInfos itsself, it was proposed for the merge code (MergePolicy.OneMerge) only that returns which segments to merge. And there currently the interface is List, because of ordering (see LUCENE-1076 ). NavigableSet ist Java 6. SortedSet only works if the set ordering is defined by its contents, which is not the case for SegmentInfos (the ordering is given by the file on disk). The only thing that could work is combination of List and Set, the Set only to check for duplicates. SegmentInfos is still required to be List-style, but should not allow to add the same SegmentInfo two times. This is why I said let it implement List<SI>, this would also break no code.
        Hide
        Simon Willnauer added a comment -

        NavigableSet ist Java 6. SortedSet only works if the set ordering is defined by its contents

        fair enough. I still think we can make it SortedSet as we know the ordering otherwise we could not build the list right?

        for now I think we should implement Iterable<SI> and offer a method List<SI> asList() to make it consistent with FIS and Document (conclusion from the discussion with uwe on IRC).

        Eventually I would prefer having a set here really.

        Show
        Simon Willnauer added a comment - NavigableSet ist Java 6. SortedSet only works if the set ordering is defined by its contents fair enough. I still think we can make it SortedSet as we know the ordering otherwise we could not build the list right? for now I think we should implement Iterable<SI> and offer a method List<SI> asList() to make it consistent with FIS and Document (conclusion from the discussion with uwe on IRC). Eventually I would prefer having a set here really.
        Hide
        Uwe Schindler added a comment -

        Here a patch that removes List interface.

        It needs te readd lots of methods of List also IndexWriter uses the SegmentsInfos class in my opinion quite often in a way, that it should not be used, so I am not happy. There are some improvements to be done.

        Also its stupid that addAll() does not accept Iterable, but that's a known problem (backwards compatibility of Java 5)!

        Show
        Uwe Schindler added a comment - Here a patch that removes List interface. It needs te readd lots of methods of List also IndexWriter uses the SegmentsInfos class in my opinion quite often in a way, that it should not be used, so I am not happy. There are some improvements to be done. Also its stupid that addAll() does not accept Iterable, but that's a known problem (backwards compatibility of Java 5)!
        Hide
        Uwe Schindler added a comment -

        Here updated patch that removes some List<SI> usage from DirectoryReader and IndexWriter for rollback when commit fails. I am still not happy with interacting of IndexWriter code directly with the list, but this should maybe fixed later.

        This patch could also be backported to cleanup 3.x, but for backwards compatibility, the SegmentInfos class should still extend Vector<SI>, but we can make the fields "segment" simply point to this. I am not sure how to "deprecated" extension of a class? A possibility would be to add each Vector method as a overridden one-liner and deprecated, but thats a non-brainer and stupid to do

        Show
        Uwe Schindler added a comment - Here updated patch that removes some List<SI> usage from DirectoryReader and IndexWriter for rollback when commit fails. I am still not happy with interacting of IndexWriter code directly with the list, but this should maybe fixed later. This patch could also be backported to cleanup 3.x, but for backwards compatibility, the SegmentInfos class should still extend Vector<SI>, but we can make the fields "segment" simply point to this. I am not sure how to "deprecated" extension of a class? A possibility would be to add each Vector method as a overridden one-liner and deprecated, but thats a non-brainer and stupid to do
        Hide
        Michael McCandless added a comment -

        Uwe, this looks like a great step forward? Even if there are other things to fix later, we should commit this first (progress not perfection)? Thanks!

        On backporting, this is an experimental API, and it's rather "expert" for code to be interacting with SegmentInfos, so I think we can just break it (and advertise we did so)?

        Show
        Michael McCandless added a comment - Uwe, this looks like a great step forward? Even if there are other things to fix later, we should commit this first (progress not perfection)? Thanks! On backporting, this is an experimental API, and it's rather "expert" for code to be interacting with SegmentInfos, so I think we can just break it (and advertise we did so)?
        Hide
        Uwe Schindler added a comment -

        New patch that also has BalancedMergePolicy from contrib refactored to new API (sorry that was missing).

        Show
        Uwe Schindler added a comment - New patch that also has BalancedMergePolicy from contrib refactored to new API (sorry that was missing).
        Hide
        Uwe Schindler added a comment -

        Further refactoring:

        • I was able to move more internal ArrayList-modifying code out of IndexWriter
        • the returned List "view" is now unmodifiable!
        • It's now possible to also add a Set view for better contains.

        ...working...

        Show
        Uwe Schindler added a comment - Further refactoring: I was able to move more internal ArrayList-modifying code out of IndexWriter the returned List "view" is now unmodifiable! It's now possible to also add a Set view for better contains. ...working...
        Hide
        Uwe Schindler added a comment -

        Now I improved SegmentInfos more:

        • It now uses a Map/Set to enforce that the SI only contains each segment one time.
        • Faster contains() because Set-backed

        As said before: asList() and asSet() are unmodifiable, so consistency between List and Set/Map is enforced.

        The Set is itsself a Map<SI,Integer>. The values contain the index of segment in the infos. This speeds up indexOf() calls, needed for asserts and remove(SI). As on remove or reorder operations the indexes are no longer correct, a separate boolean is used to mark the Map as inconsistent. It is then regenerated on the next indexOf() call. IndexOf is seldom, butthe keySet() is still consistent, so delaying this update is fine.

        All tests pass. I think the cleanup of SegmentInfos is ready to commit.

        Show
        Uwe Schindler added a comment - Now I improved SegmentInfos more: It now uses a Map/Set to enforce that the SI only contains each segment one time. Faster contains() because Set-backed As said before: asList() and asSet() are unmodifiable, so consistency between List and Set/Map is enforced. The Set is itsself a Map<SI,Integer>. The values contain the index of segment in the infos. This speeds up indexOf() calls, needed for asserts and remove(SI). As on remove or reorder operations the indexes are no longer correct, a separate boolean is used to mark the Map as inconsistent. It is then regenerated on the next indexOf() call. IndexOf is seldom, butthe keySet() is still consistent, so delaying this update is fine. All tests pass. I think the cleanup of SegmentInfos is ready to commit.
        Hide
        Uwe Schindler added a comment -

        This patch only additionally has a cache of the unmodified collections (like Java's core collections do). This prevents creation of new instance on each asList() call.

        Mike: Do you have any further comments, else I will commit in a day or two (before leaving to Lucene Rev).

        Show
        Uwe Schindler added a comment - This patch only additionally has a cache of the unmodified collections (like Java's core collections do). This prevents creation of new instance on each asList() call. Mike: Do you have any further comments, else I will commit in a day or two (before leaving to Lucene Rev).
        Hide
        Michael McCandless added a comment -

        Looks awesome Uwe! +1 to commit. Some small variable naming
        suggestions:

        • Rename cloneChilds -> cloneChildren (sis.createBackupSIS)
        • Maybe call it (and, invert) "mapIndexesValid" instead of
          "mapIndexesInvalid" (in SIS.java)? I generally prefer not putting
          not into boolean variables when possible, for sanity...
        Show
        Michael McCandless added a comment - Looks awesome Uwe! +1 to commit. Some small variable naming suggestions: Rename cloneChilds -> cloneChildren (sis.createBackupSIS) Maybe call it (and, invert) "mapIndexesValid" instead of "mapIndexesInvalid" (in SIS.java)? I generally prefer not putting not into boolean variables when possible, for sanity...
        Hide
        Uwe Schindler added a comment -

        OK! Thanks Mike

        mapIndexesInvalid

        I will remove the map again and replace by a simple Set. Using a map that maps to indexes is too complicated and does not bring us anything. contains() works without and indexOf() needs to rebuild the map whenever an insert or remove is done. Especially on remove(SI) it will rebuild the map two times in the badest case.

        A linear scan for indexOf is in my opinion fine. We can only optimize by doing a contains on the set first.

        Show
        Uwe Schindler added a comment - OK! Thanks Mike mapIndexesInvalid I will remove the map again and replace by a simple Set. Using a map that maps to indexes is too complicated and does not bring us anything. contains() works without and indexOf() needs to rebuild the map whenever an insert or remove is done. Especially on remove(SI) it will rebuild the map two times in the badest case. A linear scan for indexOf is in my opinion fine. We can only optimize by doing a contains on the set first.
        Hide
        Uwe Schindler added a comment -

        New patch with the renaming and removal of Map in favour of a simple Set.

        Again ready to commit.

        Show
        Uwe Schindler added a comment - New patch with the renaming and removal of Map in favour of a simple Set. Again ready to commit.
        Hide
        Michael McCandless added a comment -

        Patch looks great Uwe! +1 to commit. Thanks!

        Show
        Michael McCandless added a comment - Patch looks great Uwe! +1 to commit. Thanks!
        Hide
        Uwe Schindler added a comment -

        Committed trunk revision: 1124307, 1124316 (copy-paste error)

        Now backporting...

        Show
        Uwe Schindler added a comment - Committed trunk revision: 1124307, 1124316 (copy-paste error) Now backporting...
        Hide
        Uwe Schindler added a comment -

        Merged patch. Will commit now.

        Show
        Uwe Schindler added a comment - Merged patch. Will commit now.
        Hide
        Uwe Schindler added a comment -

        Committed 3.x revision: 1124339

        Show
        Uwe Schindler added a comment - Committed 3.x revision: 1124339
        Hide
        Robert Muir added a comment -

        Bulk closing for 3.2

        Show
        Robert Muir added a comment - Bulk closing for 3.2

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development