Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (7.0), 6.2
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Today index sorting is a very expert feature. You need to use a custom merge policy, custom collectors, etc. I would like to explore making it a first-class citizen so that:

      • the sort order could be configured on IndexWriterConfig
      • segments would record the sort order that was used to write them
      • IndexSearcher could automatically early terminate when computing top docs on a sort order that is a prefix of the sort order of a segment (and if the user is not interested in totalHits).
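The early-termination condition in the last point can be pictured with a small standalone sketch. It is plain Java rather than Lucene code, and it rests on assumptions: a sort is modeled as a list of field names, a segment as an array of per-doc match flags stored in index-sort order, and `isPrefix` and `firstHits` are hypothetical helper names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EarlyTerminationSketch {

  /** Returns true if querySort is a non-empty prefix of the segment's index sort (null = unsorted segment). */
  public static boolean isPrefix(List<String> querySort, List<String> indexSort) {
    if (indexSort == null || querySort.isEmpty() || querySort.size() > indexSort.size()) {
      return false;
    }
    return indexSort.subList(0, querySort.size()).equals(querySort);
  }

  /**
   * Collects up to n hits from one segment; matches[doc] says whether doc matches the query.
   * If the segment is stored in the requested order, the first n matching doc IDs are already
   * the top n, so the loop may stop early. The price is that the total hit count stays unknown.
   */
  public static List<Integer> firstHits(boolean[] matches, int n) {
    List<Integer> hits = new ArrayList<>();
    for (int doc = 0; doc < matches.length && hits.size() < n; doc++) {
      if (matches[doc]) {
        hits.add(doc);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<String> indexSort = Arrays.asList("timestamp", "id");
    boolean[] matches = {false, true, true, false, true, true};
    if (isPrefix(Arrays.asList("timestamp"), indexSort)) {
      System.out.println(firstHits(matches, 2)); // visits docs 0..2 only
    }
  }
}
```

The sketch stops as soon as n hits are found, which is why it only applies when the user does not need totalHits.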
      1. LUCENE-6766.patch (540 kB, Michael McCandless)
      2. LUCENE-6766.patch (481 kB, Michael McCandless)
      3. LUCENE-6766.patch (173 kB, Adrien Grand)

        Activity

        mikemccand Michael McCandless added a comment -

        +1

        jpountz Adrien Grand added a comment -

        Here is a first prototype that:

        • moves sorting logic from misc to core
        • removes SortingMergePolicy
        • adds an "indexSort" parameter to IndexWriterConfig and SegmentInfo, with null meaning that the index order is unspecified
        • SimpleTextCodec (de)serializes this indexSort parameter; other codecs ignore it for now
        • slightly refactors the doc ID remapping logic in IndexWriter for the case where deletions happened while some segments were being merged

        Open question: how should we serialize the SortField objects? Should we have a fixed list of supported SortField parameters or should we allow SortField parameters to serialize themselves?

        There are lots of things we could do on the search side, but for now I'd like to focus on the indexing side and making sure the sort order of segments is easily accessible.

        ebradshaw Elliott Bradshaw added a comment -

        This would be great!

        +1

        mikemccand Michael McCandless added a comment -

        This looks like a great patch! Probably we can make SortingLeafReader private?

        I think it's OK to restrict the allowed SortField that we need to support and serialize/deserialize?

        Can we fix IW to insist on open that the incoming index sort matches whatever the current index has (if the current index exists)?

        Since this patch, we moved SlowCompositeReaderWrapper out of core ... I wonder if we can 1) fix flush to also write new segments in correct sort order, and 2) fix default merge implementation to look at sort order? Merging should be an efficient merge sort (vs. what SortingLeafReader on top of SlowCompositeReaderWrapper does today).

        jpountz Adrien Grand added a comment -

        I think a challenge to sorting flushed segments is how we write stored fields and term vectors directly to the directory at index time. We should somehow buffer them in memory and sort on flush when a non-default sort order is configured? Or do you see an easier way?

        I agree merge sorting feels like the right approach to this problem. The reason why I used SlowCompositeReaderWrapper in the first place was that merging can be quite tricky and using SlowCompositeReaderWrapper allowed me to reuse the existing merging logic of all codec components. But it is likely less efficient, as you said.

        mikemccand Michael McCandless added a comment -

        I think a challenge to sorting flushed segments is how we write stored fields and term vectors directly to the directory at index time. We should somehow buffer them in memory and sort on flush when a non-default sort order is configured? Or do you see an easier way?

        Hmm tricky. Yeah, we could buffer in heap if IWC.indexSort is set, or ... we could just write as we do today, but then ask the codec for a stored fields (and term vectors) reader to do the sorting at flush time.

        Or we separate "sorting on flushed segments" out for the future, keeping SortingLeafReader, since the rest of this is already plenty hard, and focus here on making merging more efficient (don't use SlowCompositeReaderWrapper)? I think it would mean fixing the default merge impls ... today they all assume they concatenate each segment's documents sequentially (mapping around deletions), but with indexSort in use, they just need to merge sort instead. Maybe we can abstract "concat vs merge sort" away so that all default merge impls could re-use it ... seems like it could be fairly clean maybe.

        mikemccand Michael McCandless added a comment -

        Maybe we can abstract "concat vs merge sort" away

        I'm exploring this and it looks like it may be a promising baby step, hopefully letting us stop using SlowCompositeReaderWrapper for index sorting...

        jpountz Adrien Grand added a comment -

        Please let me know if you need help!

        mikemccand Michael McCandless added a comment -

        I've been slowly iterating here and pushing changes to https://github.com/mikemccand/lucene-solr/tree/index_sort

        There are tons of nocommits, but tests do pass, including index sorting tests (though they still need improving).

        Some details:

        • I added a new DocIDMerger helper class, and the default merge impls use this to abstract away how to iterate the documents from the N sub-readers, whether they are simply concatenated or merge-sorted. I think this should be quite a bit more efficient than what SortingMergePolicy does today, but it does add some increase in code complexity, which I think is OK/contained.
        • SlowCompositeReaderWrapper is no longer used for index sorting
        • Points now work fine w/ index sorting
        • CheckIndex verifies the claimed per-segment index sort is in fact true
        • IW gets angry if you open an existing index with a different index sort
        • Only simple sort types are allowed; no CUSTOM, SCORE or REWRITEABLE
        • I made a new Lucene62Codec, with a new Lucene62SegmentInfoFormat that supports index sorting.
        • I added LeafReader.getIndexSort so apps can check if a given segment was sorted
        • I disable bulk merge optimizations when index sorting is present

        IW flush still does not sort, so at merge time we wrap such segments with SortingLeafReader. It is quite ugly that an index can have some segments sorted and some not. E.g. it means IW's check for whether the new index sort matches the existing one is just best effort ... but this is already an enormous change, so I think we really have to look into "sort on flush" (which is hairy by itself) later, separately.
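
The "concat vs merge sort" abstraction can be illustrated with a minimal standalone sketch. This is not the DocIDMerger code from the patch: the class and method names below are hypothetical, each segment is reduced to an array of long sort keys that is already in index-sort order, and remapping around deleted docs is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class DocMergeSketch {

  /** No index sort: append each segment's docs one segment after another. */
  public static List<Long> concat(List<long[]> segments) {
    List<Long> out = new ArrayList<>();
    for (long[] segment : segments) {
      for (long key : segment) {
        out.add(key);
      }
    }
    return out;
  }

  /** Index sort in use: every segment is already sorted, so a k-way heap merge yields sorted output. */
  public static List<Long> mergeSorted(List<long[]> segments) {
    // Heap entries are {segmentIndex, positionInSegment}, ordered by the key at that
    // position; ties are broken by segment index to keep the merge stable.
    PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> {
      int cmp = Long.compare(segments.get(a[0])[a[1]], segments.get(b[0])[b[1]]);
      return cmp != 0 ? cmp : Integer.compare(a[0], b[0]);
    });
    for (int i = 0; i < segments.size(); i++) {
      if (segments.get(i).length > 0) {
        heap.add(new int[] {i, 0});
      }
    }
    List<Long> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] top = heap.poll();
      long[] segment = segments.get(top[0]);
      out.add(segment[top[1]]);
      if (top[1] + 1 < segment.length) {
        heap.add(new int[] {top[0], top[1] + 1}); // advance within the same segment
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<long[]> segments = Arrays.asList(new long[] {1, 4, 9}, new long[] {2, 3, 10});
    System.out.println(concat(segments));
    System.out.println(mergeSorted(segments));
  }
}
```

Both methods expose the same "next document" sequence to the caller, which is the property that lets the default merge implementations share one iteration helper.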

        mikemccand Michael McCandless added a comment -

        Here's the current patch (generated from diffSources.py)...

        mikemccand Michael McCandless added a comment -

        And here's the same patch on github: https://github.com/apache/lucene-solr/compare/master...mikemccand:index_sort?expand=1

        jpountz Adrien Grand added a comment -

        This looks great!

        it does add some increase in code complexity, which I think is OK/contained.

        Agreed. The only thing I am slightly worried about is how all optimized bulk mergers need to opt out if a sort order is configured. I am wondering if our base consumer classes should have two merge methods so that you would not have to check the sort order when overriding the method for regular merges? This is just an idea, it has drawbacks too since there would not be a single entry point to merging anymore and we would need another method in our API, but I'm suggesting it anyway hoping that it might give somebody a better idea.

        but this is already an enormous change so I think we really have to look into "sort on flush" (which is hairy by itself) later, separately

        +1

        +// nocommit if index time sorting is in use, don't try to bulk merge ... later we can make crazy bulk merger that looks for long runs from
        +// one sub?
        

        Maybe this one could be made a simple TODO. I think it is totally fine if index sorting always bypasses optimized bulk mergers, at least for now? Since we are still pulling a merge instance, it should not be too bad (no worse than merging across different codecs)?

         // nocommit in the unsorted case, this should map correctly, e.g. apply per segment docBase
        

        This seems to already be the case based on the code?

        // nocommit isn't liveDocs redundant?  docMap returns -1 for us?
        

        +1 I think it would be easier if this part of the code only used the docMap.

        // nocommit is it sub's job to skip deleted docs?
        

        I think it is since there is no mapped doc ID for deleted docs?

          // nocommit doesn't support index sorting?  or sorts must be the same?
          public void addIndexes(Directory... dirs) throws IOException {
        

        Can we do as the nocommit on addIndexes(CodecReader...) suggests and just make sure that we cannot end up with segments that have different sort orders in the index?

        // nocommit what about MergedReaderWrapper in here?
        

        I think we should still wrap with MergedReaderWrapper? This will help stored fields if two documents from the same block are read consecutively (which could likely happen if the order in which docs are indexed is somehow correlated to the index sort, like if sorting by timestamp)?

        +    Sort indexSort = null;
        +
             // build FieldInfos and fieldToReader map:
             for (final LeafReader reader : this.parallelReaders) {
        +      if (indexSort == null) {
        +        indexSort = reader.getIndexSort();
        +      } else if (indexSort.equals(reader.getIndexSort()) == false) {
        +        throw new IllegalArgumentException("cannot combine LeafReaders that have different index sorts: saw both sort=" + indexSort + " and " + reader.getIndexSort());
        +      }
        

        I think this is buggy since it ignores null sorts at the beginning of the list but not at the end, so the same list of readers may or may not raise an exception depending on the order in which readers are provided?
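
The order dependence can be reproduced with a small standalone sketch. It uses plain strings in place of Sort objects, and both method names are hypothetical. `quotedCheckPasses` mirrors the loop quoted above (returning false where the patch throws); `strictCheckPasses` is only one possible order-independent variant, assuming a null sort should match nothing but another null, and is not necessarily the fix that was eventually committed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

class IndexSortCheckSketch {

  /** Mirrors the quoted logic: a leading null is skipped, a trailing null is rejected. */
  public static boolean quotedCheckPasses(List<String> sorts) {
    String indexSort = null;
    for (String sort : sorts) {
      if (indexSort == null) {
        indexSort = sort;
      } else if (!indexSort.equals(sort)) {
        return false; // the patch throws IllegalArgumentException here
      }
    }
    return true;
  }

  /** Order-independent variant: every reader must agree with the first one, null included. */
  public static boolean strictCheckPasses(List<String> sorts) {
    for (String sort : sorts) {
      if (!Objects.equals(sorts.get(0), sort)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> nullFirst = Arrays.asList(null, "timestamp");
    List<String> nullLast = Arrays.asList("timestamp", null);
    // Same readers, different order: the quoted check disagrees with itself.
    System.out.println(quotedCheckPasses(nullFirst) + " vs " + quotedCheckPasses(nullLast));
    System.out.println(strictCheckPasses(nullFirst) + " vs " + strictCheckPasses(nullLast));
  }
}
```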

        // nocommit does search time "do the right thing" automatically when segment is sorted?
        

        Agreed it should. I see you also left nocommits about moving the early-terminating collectors from misc to core, but leveraging index sorting at search time looks like a big task to me so maybe we should defer it to a follow-up issue like sorting on flush?

        // nocommit just do assertReaderEquals, don't use @BeforeClass, etc.?
        

        +1!

        --- trunk/lucene/misc/src/java/org/apache/lucene/search/BlockJoinComparatorSource.java  2016-02-16 11:18:34.753021816 -0500
        +++ indexsort/lucene/misc/src/java/org/apache/lucene/search/BlockJoinComparatorSource.java      2016-05-06 19:17:29.893848515 -0400
        @@ -20,13 +20,14 @@
        
        +// nocommit what to do here?
        

        Let's remove it for now and later see whether this is something that could be added back?

        +    @Override
        +    public int nextDoc() {
        +      try {
        +        return postings.nextDoc();
        +      } catch (IOException ioe) {
        +        throw new RuntimeException(ioe);
        +      }
        +    }
        

        Should DocIDMerger.Sub.nextDoc throw an IOException?

        mikemccand Michael McCandless added a comment -

        Thanks Adrien Grand!

        I folded in most of your feedback, except:

        The only thing I am slightly worried about is how all optimized bulk mergers need to opt out if a sort order is configured. I am wondering if our base consumer classes should have two merge methods so that you would not have to check the sort order when overriding the method for regular merges? This is just an idea, it has drawbacks too since there would not be a single entry point to merging anymore and we would need another method in our API, but I'm suggesting it anyway hoping that it might give somebody a better idea.

        I think it's OK to keep a single merge method? This merge method
        already must deal with wild per-segment variabilities, e.g. different
        fields across segments, some have deletions some don't, etc., so I
        don't think we need to single out "has an index sort" into a separate
        method?

        Also, implementing merge methods is really an uber-expert thing to
        do, so such devs should be up to the task of handling an incoming
        index sort, I think.

        I think this is buggy since it ignores null sorts at the beginning of the list but not at the end,

        Nice catch! I added a test showing the bug, and then fixed it (pushed).

        Let's remove it for now and later see whether this is something that could be added back?

        OK I did that. I think at least there is a simple solution for doc-block
        users: just index a doc values field with the "id" for each block, and
        then sort on that.

        but leveraging index sorting at search time looks like a big task to me so maybe we should defer it to a follow-up issue like sorting on flush?

        I did move the early terminating to core, and I do think going forward
        we should make it easier to use this ... it should somehow be the
        default, and not a "make your own Collector" situation ...

        As Rob has pointed out, even today (before promoting index sorting)
        we could early-terminate in cases where the query is sorting on
        index order, such as collecting first N hits for a filter.

        But I agree we should do this separately. I will open follow-on issues
        for "can we sort on flush too" and "searching should take advantage
        of index sort by default".

        Should DocIDMerger.Sub.nextDoc throw an IOException?

        I tried this out, but it started to sprawl: the doc values all wrap
        `DocIDMerger` under a Java `Iterator`, which cannot throw `IOException`
        ... I could move the `try/catch` up there, but there are many places
        I'd have to move this to, so leaving it where it is seemed like the
        lesser evil.
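
The constraint described here, that `java.util.Iterator` methods cannot declare checked exceptions, can be shown with a minimal standalone sketch. `DocSource` and `asIterator` are hypothetical names, the -1 end marker is an assumption of the sketch, and it wraps with `java.io.UncheckedIOException` where the quoted patch uses a plain `RuntimeException`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

class IteratorIOExceptionSketch {

  /** Hypothetical stand-in for a doc ID source that may hit disk; returns -1 when exhausted. */
  public interface DocSource {
    int nextDoc() throws IOException;
  }

  /** Adapts a DocSource to Iterator, whose methods cannot declare IOException. */
  public static Iterator<Integer> asIterator(DocSource source) {
    return new Iterator<Integer>() {
      private int pending;            // next doc ID to hand out, valid only if loaded
      private boolean loaded = false;

      private void load() {
        if (!loaded) {
          try {
            pending = source.nextDoc();
          } catch (IOException ioe) {
            // The checked exception has to be wrapped somewhere below the Iterator API.
            throw new UncheckedIOException(ioe);
          }
          loaded = true;
        }
      }

      @Override
      public boolean hasNext() {
        load();
        return pending != -1;
      }

      @Override
      public Integer next() {
        load();
        if (pending == -1) {
          throw new NoSuchElementException();
        }
        loaded = false;
        return pending;
      }
    };
  }

  public static void main(String[] args) {
    int[] docs = {3, 7, 12};
    int[] pos = {0};
    Iterator<Integer> it = asIterator(() -> pos[0] < docs.length ? docs[pos[0]++] : -1);
    while (it.hasNext()) {
      System.out.println(it.next());
    }
  }
}
```

Wrapping at the lowest level keeps every `Iterator`-based caller unchanged, which matches the "lesser evil" trade-off described above.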

        mikemccand Michael McCandless added a comment -

        I think this is ready ... here's the current patch against master.

        I still need to run "first do no harm" indexing performance tests to
        make sure there is not too much of a hit when indexing without an
        index sort.

        I don't plan to rush this in for 6.1 ... I'll commit to master, and
        after we release 6.1 (soon, I think? So many geo improvements!), I
        plan to backport for 6.2.

        mikemccand Michael McCandless added a comment -

        I tested master vs patch indexing performance on luceneutil's "wikimedium10m" docs. I ran indexing 5 times each. This is just a "first do no harm test", i.e. in both cases I'm indexing without an index sort.

        I use SerialMergeScheduler (SMS) and frequent flushing, so this is a very merge-heavy benchmark.

        Master:

        /l/logs/before0.log:Indexer: finished (675550 msec)
        /l/logs/before1.log:Indexer: finished (671058 msec)
        /l/logs/before2.log:Indexer: finished (683297 msec)
        /l/logs/before3.log:Indexer: finished (670856 msec)
        /l/logs/before4.log:Indexer: finished (671516 msec)
        

        Patch:

        /l/logs/after0.log:Indexer: finished (673302 msec)
        /l/logs/after1.log:Indexer: finished (674855 msec)
        /l/logs/after2.log:Indexer: finished (679655 msec)
        /l/logs/after3.log:Indexer: finished (680151 msec)
        /l/logs/after4.log:Indexer: finished (681921 msec)
        

        Net/net I think any performance hit is very small, well within measurement noise.
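
As a quick check of the "within noise" reading, the ten timings above can be averaged with a small standalone sketch (class and method names are hypothetical; the numbers are copied from the logs above).

```java
class IndexingNoiseSketch {

  /** Arithmetic mean of the run times, in msec. */
  public static double mean(long[] msec) {
    long sum = 0;
    for (long m : msec) {
      sum += m;
    }
    return (double) sum / msec.length;
  }

  /** Relative change from before to after, in percent. */
  public static double percentChange(double before, double after) {
    return 100.0 * (after - before) / before;
  }

  public static void main(String[] args) {
    long[] master = {675550, 671058, 683297, 670856, 671516};
    long[] patch = {673302, 674855, 679655, 680151, 681921};
    double masterMean = mean(master);
    double patchMean = mean(patch);
    System.out.printf("master mean=%.1f msec, patch mean=%.1f msec, change=%.2f%%%n",
        masterMean, patchMean, percentChange(masterMean, patchMean));
  }
}
```

The means come out at 674455.4 msec for master and 677976.8 msec for the patch, a difference of about 0.5%, while the master runs alone span roughly 1.8% between fastest and slowest.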

        jpountz Adrien Grand added a comment -

        +1

        mikemccand Michael McCandless added a comment -

        I tried sorting with the 10M wikipedia index.

        Sort by last-modified-date:

          Indexer: indexing done (900389 msec); total 10000000 docs
          Indexer: force merge done (took 134020 msec)
        

        Sort by title:

          Indexer: indexing done (907923 msec); total 10000000 docs
          Indexer: force merge done (took 135041 msec)
        

        vs. no sorting:

          Indexer: indexing done (702761 msec); total 10000000 docs
          Indexer: force merge done (took 65726 msec)
        

        Index size was about the same in all cases, ~3.1 GB.

        I also confirmed CheckIndex verifies the sorted indices are OK (it checks the sort order).

        So ~28% slower with sorting overall... but this uses a single thread, SerialMergeScheduler, and small IW buffer, so it's very merge-heavy.
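
The slowdown can be recomputed from the timings above with a short standalone sketch (hypothetical class and method names; numbers copied from the logs above).

```java
class SortOverheadSketch {

  /** How much slower the sorted run is than the baseline, in percent. */
  public static double percentSlower(long baselineMsec, long sortedMsec) {
    return 100.0 * (sortedMsec - baselineMsec) / baselineMsec;
  }

  public static void main(String[] args) {
    long indexNoSort = 702761, mergeNoSort = 65726;
    long indexByDate = 900389, mergeByDate = 134020;
    long indexByTitle = 907923, mergeByTitle = 135041;

    System.out.printf("indexing, by date:     %.1f%% slower%n", percentSlower(indexNoSort, indexByDate));
    System.out.printf("indexing, by title:    %.1f%% slower%n", percentSlower(indexNoSort, indexByTitle));
    System.out.printf("force merge, by date:  %.1f%% slower%n", percentSlower(mergeNoSort, mergeByDate));
    System.out.printf("force merge, by title: %.1f%% slower%n", percentSlower(mergeNoSort, mergeByTitle));
    System.out.printf("total, by date:        %.1f%% slower%n",
        percentSlower(indexNoSort + mergeNoSort, indexByDate + mergeByDate));
  }
}
```

With these numbers the indexing phase alone is roughly 28% (date) and 29% (title) slower, which appears to be what the "~28%" figure refers to; the force merge takes about twice as long, and indexing plus force merge together comes to roughly 35% slower for the date sort.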

        jira-bot ASF subversion and git services added a comment -

        Commit 87690f8b13b1def6c822ba36a42e4cb6939ab4c2 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=87690f8 ]

        LUCENE-6766: add another random test case; move early terminating collector to core

        jira-bot ASF subversion and git services added a comment -

        Commit fa37241e784e0479da1637f863e07f1d909f40a9 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fa37241 ]

        LUCENE-6766: add deletions to random test

        jira-bot ASF subversion and git services added a comment -

        Commit 1e82c13184621f6cefac35f8d10d8fe74d2a356c in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1e82c13 ]

        LUCENE-6766: resolve remaining nocommits; add more IW infoStream logging during merge

        jira-bot ASF subversion and git services added a comment -

        Commit 8361de87becd64c8b217313877b996ac20167856 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8361de8 ]

        LUCENE-6766: fix parallel reader's detection of conflicting index sort

        jira-bot ASF subversion and git services added a comment -

        Commit e3ecc6a5361948c28679c7ac76161f167824e514 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e3ecc6a ]

        LUCENE-6766: merge master

        jira-bot ASF subversion and git services added a comment -

        Commit e283271aaf6da3033156f36b421d3241b5499d4e in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e283271 ]

        LUCENE-6766: more IW.infoStream logging around sorting; fix test bug

        jira-bot ASF subversion and git services added a comment -

        Commit 5fb7413ccb9c690d3a59d7227b3cb194943290ef in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5fb7413 ]

        LUCENE-6766: remove leftover sop

        jira-bot ASF subversion and git services added a comment -

        Commit 3cde9eb3d027b273a3c136e9eb284ae18f1824fe in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3cde9eb ]

        LUCENE-6766: keep SortingMergePolicy for solr back-compat; fix Solr tests; fix precommit failures

        jira-bot ASF subversion and git services added a comment -

        Commit d715210467a4907ca34e7f0fe1a438908737894f in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d715210 ]

        LUCENE-6766: merged

        jira-bot ASF subversion and git services added a comment -

        Commit 9d5b834b09d4ff23e89755e5d1af407a2bd96c16 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9d5b834 ]

        LUCENE-6766: put Placeholder back so javadocs are OK; deprecate Lucene60Codec

        mikemccand Michael McCandless added a comment -

        I pushed this to master ... I will hold off on backporting to 6.x until we release 6.1, giving it time to bake.

        I'll go open a bunch of follow-on issues now.

        jira-bot ASF subversion and git services added a comment -

        Commit c26bb87140eacbcdfa6c083a10714af275fe4ab6 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c26bb87 ]

        LUCENE-6766: simplify test case

        jira-bot ASF subversion and git services added a comment -

        Commit 3010ffacafd5cc371f4d62413105294d0df37450 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3010ffa ]

        LUCENE-6766: add another random test case; move early terminating collector to core

        jira-bot ASF subversion and git services added a comment -

        Commit a4722befb3f878faa0a5ee9752ae21070c771cf2 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a4722be ]

        LUCENE-6766: add deletions to random test

        jira-bot ASF subversion and git services added a comment -

        Commit 2703b827bf2316e8d39025666ed5f1d42ed70d64 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2703b82 ]

        LUCENE-6766: resolve remaining nocommits; add more IW infoStream logging during merge

        jira-bot ASF subversion and git services added a comment -

        Commit 4740056f0987aef4eb727332d7ce9770964543c2 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4740056 ]

        LUCENE-6766: fix parallel reader's detection of conflicting index sort

        jira-bot ASF subversion and git services added a comment -

        Commit 0dd65f6130dbcb1a9caae7963fed246c1068ebe0 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0dd65f6 ]

        LUCENE-6766: more IW.infoStream logging around sorting; fix test bug

        jira-bot ASF subversion and git services added a comment -

        Commit 2f6cdea9a9ec3bb62cf0d111768969c2a6275276 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2f6cdea ]

        LUCENE-6766: remove leftover sop

        jira-bot ASF subversion and git services added a comment -

        Commit a3270ac6e64012ec0a5b6864cdfcf190a1a36346 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a3270ac ]

        LUCENE-6766: keep SortingMergePolicy for solr back-compat; fix Solr tests; fix precommit failures

        jira-bot ASF subversion and git services added a comment -

        Commit ea26dd5855ec45dcdaa385dd240a6ef91aa1c4d9 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ea26dd5 ]

        LUCENE-6766: finish 6.x backport

        jira-bot ASF subversion and git services added a comment -

        Commit 8bd27977dd993d4443be359a6f7ec92c7f012247 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8bd2797 ]

        LUCENE-6766: add changes

        mikemccand Michael McCandless added a comment -

        I backported to 6.x

        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.


          People

          • Assignee:
            Unassigned
            Reporter:
            jpountz Adrien Grand
          • Votes:
            3
            Watchers:
            13

            Dates

            • Created:
              Updated:
              Resolved:

              Development