Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      It would be awesome if Lucene could write the documents out in a segment based on a configurable order. This of course applies to merging segments too. The benefit is increased locality on disk for documents that are likely to be accessed together. This often applies to documents that are near each other in time, but also spatially.
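The core idea can be illustrated with a small, self-contained sketch (plain Java, not Lucene code): given a per-document sort value such as a timestamp, compute the old-to-new doc ID permutation that a sorting writer or merger would apply when rewriting the segment.

```java
import java.util.Arrays;
import java.util.Comparator;

class DocSortSketch {
    // oldToNew[oldId] = position the doc would get once the segment is
    // rewritten in sort-value order (e.g. by timestamp).
    static int[] oldToNew(long[] sortValues) {
        Integer[] order = new Integer[sortValues.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort old doc IDs by their sort value.
        Arrays.sort(order, Comparator.comparingLong(i -> sortValues[i]));
        // Invert the permutation: order[newId] = oldId  =>  map[oldId] = newId.
        int[] map = new int[sortValues.length];
        for (int newId = 0; newId < order.length; newId++) {
            map[order[newId]] = newId;
        }
        return map;
    }
}
```

For example, sort values {30, 10, 20} yield the map {2, 0, 1}: old doc 0 moves to position 2, old doc 1 to position 0, and old doc 2 to position 1.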

      1. LUCENE-4752-2.patch
        17 kB
        Adrien Grand
      2. LUCENE-4752.patch
        22 kB
        Adrien Grand
      3. LUCENE-4752.patch
        99 kB
        Adrien Grand
      4. sorting_10M_ingestion.log
        234 kB
        Adrien Grand
      5. natural_10M_ingestion.log
        163 kB
        Adrien Grand
      6. LUCENE-4752.patch
        77 kB
        Adrien Grand
      7. LUCENE-4752.patch
        77 kB
        Adrien Grand
      8. LUCENE-4752.patch
        68 kB
        Adrien Grand
      9. LUCENE-4752.patch
        8 kB
        Adrien Grand
      10. LUCENE-4752.patch
        37 kB
        Adrien Grand

        Issue Links

          Activity

          Adrien Grand added a comment -

          I don't think it can help: this feature only makes documents sorted per-segment so this can't be used to get documents globally sorted.

          Ertio Lew added a comment -

          I'm looking to retrieve the documents added to the index in the order they were added, or sorted by descending order of doc IDs. This seems to be a fix for that. Which Lucene version will this be available in?

          David Smiley added a comment -

          You guys rock for getting this done! My only contribution was to the conversation here; it was generous to list me in the CHANGES.txt. I need to create more JIRA issues for others to work on.

          Adrien Grand added a comment -

          Thank you for the review Mike, I hope it will pass tests now!

          Michael McCandless added a comment -

          +1, tricky!

          Adrien Grand added a comment -

          Patch:

          • fixes the issue by allowing OneMerges to return a doc map that translates doc IDs to their new values so that IndexWriter can commit merged deletes,
          • TestSortingMergePolicy has been modified to make deletions more likely to happen concurrently with a merge.
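A minimal illustration of the doc-map fix (hypothetical code, not the actual IndexWriter internals): deletes recorded against old doc IDs while the merge was running are translated through the map before being applied to the merged segment, instead of assuming sequential ID assignment.

```java
import java.util.Arrays;

class MergedDeletesSketch {
    // oldToNew is the doc map returned by the merge; deletedOldIds are deletes
    // that arrived during the merge, expressed as old doc IDs.
    static boolean[] applyDeletes(int[] oldToNew, int[] deletedOldIds, int newMaxDoc) {
        boolean[] liveDocs = new boolean[newMaxDoc];
        Arrays.fill(liveDocs, true);
        for (int oldId : deletedOldIds) {
            // Without this translation, a sequential-ID assumption would mark
            // the wrong document as deleted in the sorted segment.
            liveDocs[oldToNew[oldId]] = false;
        }
        return liveDocs;
    }
}
```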
          Adrien Grand added a comment -

          I just found what caused the last Jenkins failures: sometimes deletions happen concurrently with a merge. In this case, deletes are still applied to the old ReaderAndLiveDocs and once the merge is finished, IndexWriter runs commitMergedDeletes to apply deletes to the new segment too, but since it assumes doc IDs are assigned sequentially, it doesn't work with SortingMergePolicy. (This explains why the bug was hard to reproduce too.)

          Adrien Grand added a comment -

          Adrien, you didn't put your name in the CHANGES entry. +1 to commit.

          Fixed and committed. Thank you Shai!

          Commit Tag Bot added a comment -

          [trunk commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1459794

          LUCENE-4752: Added SortingMergePolicy that sorts documents when merging.

          Commit Tag Bot added a comment -

          [branch_4x commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1459808

          LUCENE-4752: Added SortingMergePolicy that sorts documents when merging (merged from r1459794).

          Shai Erera added a comment -

          Looks good! I like OneMerge.getMergeReaders() – it's one way to customize your readers as they are merged. This has been asked on the user list a couple times that I remember.

          Adrien, you didn't put your name in the CHANGES entry. +1 to commit.

          Adrien Grand added a comment -

          I plan to commit it tomorrow unless someone objects.

          Adrien Grand added a comment -

          New patch, focused on SortingMergePolicy, ready to be reviewed!

          Commit Tag Bot added a comment -

          [trunk commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1459037

          LUCENE-4752: Preliminaries:

          • move useful assert*Equals from TestDuelingCodecs to LuceneTestCase,
          • rename sort to wrap in SortingAtomicReader to better suggest that the returned reader is a view.
          Adrien Grand added a comment -

          I opened LUCENE-4858 to deal with early query termination (as you suggested earlier) so that we can concentrate on sorting in this issue.

          Adrien, perhaps in order to keep the patch small, commit separately the changes to LTC and TestDuelingCodec (as well as the SortingAtomicReader.wrap change)

          I'll do that soon if nobody objects.

          Shai Erera added a comment -

          What in the patch guarantees that any segment with more than maxBufferedDocs is sorted? Perhaps I've missed it, but I looked for code which ensures every such segment gets picked up by SortingMP, however didn't find it.

          I don't think that in general we should make assumptions based on a maxBufferedDocs setting, because the default setting in IWC is by RAM consumption, and it also seems slightly unrelated. I.e. if a segment is sorted but has deletions such that numDocs < maxBufferedDocs, we do full collection, while we could early terminate as usual?

          EarlyTerminatingCollector, I think, need not have getFullCollector. Rather, it should wrap any other Collector (not limited to top docs) and, if it detects a sorted segment in setNextReader (we still need to figure out how to detect that), early terminate after enough docs were seen; otherwise it keeps on calling in.collect(). It's the app's responsibility to wrap its collector (which could be a ChainingCollector too) with this collector, and to make sure that its early-termination logic fits with its collectors. And so I don't think we need EarlyTerminationTopDocsCollector, but rather a concrete EarlyTerminatingCollector. BTW, EarlyTerminationTopDocsCollector has an uninitialized and unused maxUnsortedSize?
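A rough sketch of that wrapping collector (the interfaces here are simplified stand-ins, not the actual Lucene Collector API, and how to detect a sorted segment is still the open question):

```java
// Simplified stand-in for the real Collector API.
interface SimpleCollector {
    void collect(int doc);
}

// Stand-in for a dedicated EarlyTerminationException.
class EarlyTerminationSketchException extends RuntimeException {
}

class EarlyTerminatingCollectorSketch implements SimpleCollector {
    private final SimpleCollector in;
    private final int numDocsToCollect;
    private final boolean segmentIsSorted; // would be detected in setNextReader
    private int numCollected;

    EarlyTerminatingCollectorSketch(SimpleCollector in, int numDocsToCollect, boolean segmentIsSorted) {
        this.in = in;
        this.numDocsToCollect = numDocsToCollect;
        this.segmentIsSorted = segmentIsSorted;
    }

    @Override
    public void collect(int doc) {
        in.collect(doc);
        if (segmentIsSorted && ++numCollected >= numDocsToCollect) {
            // The searcher would catch this and move on to the next segment.
            throw new EarlyTerminationSketchException();
        }
    }
}
```

On an unsorted segment the wrapper simply delegates every hit; on a sorted one it stops after the first numDocsToCollect docs, which is what makes per-segment early termination safe only when the segment order matches the query sort.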

          And hopefully we can stuff the early termination logic down to IndexSearcher eventually. There are other scenarios for early termination, such as time limit, and therefore I think it's ok if we have an EarlyTerminationException which IndexSearcher responds to.

          Adrien, perhaps in order to keep the patch small, commit separately the changes to LTC and TestDuelingCodec (as well as the SortingAtomicReader.wrap change)? These are good changes to commit anyway, and they only bloat out the patch and mask the actual issue's development? Is it possible?

          Adrien Grand added a comment -

          I think these are not bad numbers.

          Me neither! I'm rather happy with them actually.

          As for search, perhaps we can quickly hack up IndexSearcher to allow terminating per-segment and then compare two Collectors TopFields and TopSortedFields [...] but in order to do that, we must make sure that each segment is sorted (i.e. those that are not hit by MP are still in random order), or we somehow mark on each segment whether it's sorted or not

          The attached patch contains a different approach: the idea is to use SortingMergePolicy together with IndexWriterConfig.getMaxBufferedDocs, which guarantees that all segments whose size is above maxBufferedDocs are sorted. Then there is a new EarlyTerminationIndexSearcher that extends search to collect segments in random order normally and to early terminate collection on segments which are sorted.

          Accessing "close" documents together ... we can make an artificial test which accesses documents with sort-by-value in a specific range. But that's a too artificial test, not sure what it will tell us.

          Yes, I think the important thing to validate here is that merging does not get exponentially slower as segments grow. Other checks are just bonus.

          Shai Erera added a comment -

          I think these are not bad numbers. Sorting is costly, no one argues about that. We've discussed in this issue a few possible usages for ending up with sorted segments: compression, accessing "close" documents together, and early termination.

          What are your observations regarding compression? I am not sure the way you tested it can yield interesting observations, since you sort by a random field (which means the index may be randomly sorted), but perhaps you could sort by Wikipedia's date value or something else? Just curious.

          As for search, perhaps we can quickly hack up IndexSearcher to allow terminating per-segment and then compare two Collectors, TopFields and TopSortedFields (new), the latter terminating after visiting the first N docs in each segment? Hmm, but in order to do that, we must make sure that each segment is sorted (i.e. those that are not hit by the MP are still in random order), or we somehow mark on each segment whether it's sorted or not. If we can hack this comparison, I think it's worth noting the differences here. Actually, enabling per-segment termination should happen in a separate issue.

          Accessing "close" documents together ... we can make an artificial test which accesses documents with sort-by-value in a specific range. But that's a too artificial test, not sure what it will tell us.

          Adrien Grand added a comment -

          Maybe just put a comment in IW where it calls merge.getReaders() why we don't access the readers list directly

          Done.

          I started working on this (LUCENE-4830 for memory and LUCENE-4839 for complexity) and will run some indexing benchmarks with the Wikipedia corpus to see how it behaves compared to natural merging.

          Now that SortingAtomicReader uses TimSort to compute the doc ID mapping and sort postings lists, using SortingMergePolicy only increases the merge complexity by constant factors compared to a natural merge if the readers to merge are sorted (I'm assuming the number of segments to merge is bounded). I think this makes online sorting a viable option.

          I ran some indexing benchmarks to see how much slower indexing is with SortingMergePolicy. To do this I quickly patched luceneutil to add a random NumericDocValuesField to all documents and wrap the merge policy with SortingMergePolicy. Indexing 10M docs from the wikimedium collection was 2x slower with SortingMergePolicy (see the ingestion rate logs attached). To measure pure merge performance, I ran a forceMerge(1) on those indexes, and SortingMergePolicy made this forceMerge 3.5x slower (856415 ms vs 250054 ms). If you're curious, here is where the merging time is spent with SortingMergePolicy according to my profiler:

          • 32%: CompressingStoredField.visitDocument (vs. < 1% when using a regular merge policy)
          • 17%: TimSort: to sort the doc mapping and postings lists
          • 6%: Sorter.DocMap.oldToNew: used by SortingDocsEnum to map the old IDs to the new ones

          Most of the time is not spent in actual sorting but in visitDocument, because the codec-specific merge routine can't be used, so the stored fields format decompresses every chunk multiple times (a few hundred times, given that my docs are really small; this would be less noticeable with larger docs).

          I think it's close, what do you think?

          Shai Erera added a comment -

          Ok, I see that IW needs to access OneMerge.readers for various other reasons, so let's leave it like that. Maybe just put a comment in IW where it calls merge.getReaders() explaining why we don't access the readers list directly here and go through the method?

          Shai Erera added a comment -

          Why is the size of the class a concern?

          Just to keep things readable. But it's not critical, however you prefer.

          About OneMerge.readers and IW, not sure if you disagree with what I wrote? In the patch, IW calls OneMerge.getMergeReaders to add them to SegmentMerger. It also does merge.readers.add() and I propose to replace that call by merge.add(reader). So OneMerge.readers is not exposed directly. Will that be a problem?

          Michael McCandless added a comment -

          This approach looks nice: it's minimally invasive; MP just overrides OneMerge.getMergeReaders.

          And it's really nice to have those asserts more accessible: various tests have their own [different!] impls for these equality tests.

          Adrien Grand added a comment -

          But, since LTC is quite big, perhaps we can move these methods to a util, e.g. CompareIndexes?

          Why is the size of the class a concern? I think it's more convenient to have all assert*Equals methods in the same class? (LuceneTestCase already has many assert*Equals methods inherited from Assert.) And it makes these methods easier to find when writing a test?

          Can we make OneMerge.readers private and add OneMerge.add(AtomicReader) for IW to use? It looks odd that IW manipulates OneMerge.readers directly, but then calls OneMerge.getMergeReaders()

          I think it would be odd if getMergeReaders were just a getter, but it is more than that, since it filters out empty readers and can even return an arbitrary view over the readers to merge. But here it is just a method that computes data based on the class members, like segString?
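A toy version of that distinction (illustrative names and signatures only, not the real OneMerge/AtomicReader API): getMergeReaders is not a plain getter because it computes a filtered view over the private list, and a sorting subclass could return wrapped readers instead.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for an AtomicReader, exposing only a live doc count.
class ReaderStub {
    private final int numDocs;
    ReaderStub(int numDocs) {
        this.numDocs = numDocs;
    }
    int numDocs() {
        return numDocs;
    }
}

class OneMergeSketch {
    private final List<ReaderStub> readers = new ArrayList<>(); // kept private

    // What IW would call instead of touching the list directly.
    void add(ReaderStub reader) {
        readers.add(reader);
    }

    // Computes a view rather than exposing the raw list: fully-deleted readers
    // are filtered out; a SortingMergePolicy-style subclass could override this
    // to return a sorted view over the remaining readers.
    List<ReaderStub> getMergeReaders() {
        List<ReaderStub> result = new ArrayList<>();
        for (ReaderStub reader : readers) {
            if (reader.numDocs() > 0) {
                result.add(reader);
            }
        }
        return result;
    }
}
```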

          Can we remove SegmentMerger.add()

          Good point, I updated the patch.

          Shai Erera added a comment -

          Patch looks awesome! Few comments:

          • LuceneTestCase: looks like the changes are that you ported the assertXYZ methods from TestDuelingCodecs, so I didn't review them. But, since LTC is quite big, perhaps we can move these methods to a util, e.g. CompareIndexes? I remember that when I wrote the sorting tests, I was looking for such methods!
          • Can we make OneMerge.readers private and add OneMerge.add(AtomicReader) for IW to use? It looks odd that IW manipulates OneMerge.readers directly, but then calls OneMerge.getMergeReaders(). If you do that, then put a comment on readers explaining why it's private, so that we don't forget.
          • Can we remove SegmentMerger.add() methods in favor of a single merge(List<AtomicReader>)? Don't go overboard with it, only if it's trivial (as it's not directly related to this issue).
          Adrien Grand added a comment -

          Patch with tests that makes OneMerge responsible for reordering doc IDs. Thoughts?

          Adrien Grand added a comment -

          Good point! I'll update the patch!

          Shai Erera added a comment -

          I actually think that whoever wants to sort during addIndexes should pass a SortingAR directly? We have that code example in SortingAR javadocs. I didn't think we need to handle IW.addIndexes at all here.

          Adrien Grand added a comment -

          This looks less invasive indeed, but I feel that MP.reorder() is kind of out of the blue. Maybe we should find a way to stuff it into OneMerge?

          Indeed, I thought about OneMerge too and liked this option better, but I think this is a problem for addIndexes(IndexReader...): this method doesn't need to find merges and as a consequence doesn't manipulate OneMerge instances. How would we make addIndexes(IndexReader...) sort doc IDs?

          Shai Erera added a comment -

          This looks less invasive indeed, but I feel that MP.reorder() is kind of out of the blue. Maybe we should find a way to stuff it into OneMerge? I.e.:

          • Make OneMerge readers private and add a method add(AtomicReader) which will be called by IW
          • Add OneMerge.getReaders(), which returns a List<AtomicReader>
          • IW.mergeMiddle() won't add the readers to SegmentMerger directly, but rather accumulate them in OneMerge and then just before it calls merger.merge() it will either do merger.add(oneMerge.getReaders()) (requires changing SM to take a list of readers, not one at a time), or if we don't want to touch SM, just add the returned readers one at a time.
          • Then SortingMP will return its own OneMerge that wraps all the readers by SortingAR.

          What do you think?

          Adrien Grand added a comment -

          i think it would be way better to provide whatever 'hook' is needed for this kinda stuff rather than allow subclassing of segmentmerger. like a proper pluggable api (e.g. codec is an example of this) versus letting people just subclass concrete things.

          Here is a patch that allows for reordering via a simple hook instead of having to subclass a class that does concrete things like SegmentMerger. The hook is on MergePolicy because I felt like it makes sense to think about doc ID reordering at merging time as part of a "merge policy" but it could also be put somewhere else or have its own class. (The patch is just here to gather some API feedback, I haven't tried to run anything with it yet). Does it look more reasonable?

          Adrien Grand added a comment -

          Is it possible to make fieldInfos final?

          Sure. I removed the final keyword because it was easier to hack up a quick patch but this can definitely be fixed.

          Adrien, perhaps add a SortingSegmentMerger to the sorter package? Or at least add a test that verifies merges keep things sorted?

          I'll do that in the next patches!

          And finally i think it would be way better to provide whatever 'hook' is needed for this kinda stuff rather than allow subclassing of segmentmerger.

          I'm fine with that option too, I need to think more about how to name it and where to plug it.

          In addition to the API, I think something important to validate is whether sorting the segments to merge is viable and doesn't blow up memory or indexing time... I started working on this (LUCENE-4830 for memory and LUCENE-4839 for complexity) and will run some indexing benchmarks with the Wikipedia corpus to see how it behaves compared to natural merging.

          Shai Erera added a comment -

Adrien, I think that the change to LTC is a bit of an overkill. It doesn't test any new functionality except that you can sub-class SegmentMerger. Yet it did cause you to track down tests that might be affected b/c e.g. they are already slow tests and it will make them slower. Therefore I think that we should test this extensibility in a separate, dedicated test, which e.g. sub-classes SM and say reorders the list of atomic readers, so that we're sure this sub-classing not only compiled, but actually did something different than if we didn't sub-class.

          Also, Robert made this comment:

          SegmentWriteState's fieldinfos should be final. Some of these little things took a long time to accomplish to give us a great deal of safety inside lucene internally.

          Is it possible to make fieldInfos final? Since sorting segments has nothing to do with field infos (all that we need is to wrap the readers with SortingAR), we should only modify what's absolutely necessary to accomplish that.

          Robert Muir added a comment -

          If there isn't a better (or any other) way to achieve that, then I think we should expose SegmentMerger. We shouldn't tell people "Lucene doesn't support online sorting of an index because we didn't want to expose super-expert API".

          We absolutely should.

          Shai Erera added a comment -

Perhaps instead of telling us what we can't do, make suggestions on what we can do? You realize that we care about Lucene API at least as much as you do? And that if we propose to expose SegmentMerger it's not because we want to kill the project, but to allow flexibility (which has been expressed on the list a few times) which eventually drives innovation and brings more progress to Lucene?

          If you have a better idea, e.g. how can this be done with Codec, would you mind sharing with us? Nothing has been committed yet, we played with an idea, Adrien hacked it together ... that's it.

          If there isn't a better (or any other) way to achieve that, then I think we should expose SegmentMerger. We shouldn't tell people "Lucene doesn't support online sorting of an index because we didn't want to expose super-expert API". That seems wrong to me. SegmentMerger can be marked @lucene.internal which gives you the freedom to do whatever you want with it. And given that all SortingSM will do is replace all ARs by a single SortingAR, I find it hard to see how overhauling (even rewriting from scratch) SM could possibly make that tiny hook SortingSM requires, a challenge.

          Robert Muir added a comment -

          And finally i think it would be way better to provide whatever 'hook' is needed for this kinda stuff rather than allow subclassing of segmentmerger. like a proper pluggable api (e.g. codec is an example of this) versus letting people just subclass concrete things.

          Robert Muir added a comment -

          Another example of a 'super expert api' that i think is ok, is indexingchain, where it stays package private, and the IWConfig setter is also package private. i think something like this might be reasonable.

          Robert Muir added a comment -

          I disagree... and I guess I'm willing to go to bat for this.

          There is real cost in exposing stuff like this. I'm already frustrated about the amount of stuff around this area that is 'public' solely due to packaging (e.g. the .codecs package and the .index package both need it).

          Finally, if we have code in lucene itself that relies upon the inner details because it e.g. subclasses segmentmerger, this makes it harder to refactor core lucene and evolve it in the future because we have modules doing sorting or shuffling or god knows what that rely upon its api. in such a case where i want to refactor SM, i could just eradicate those modules and nobody would complain, right?

          I don't think we should do it.

          Shai Erera added a comment -

This looks great. I think that it's fine that we let people override SegmentMerger ... it's super expert API, no sane person would ever want to do that, but for those that want to, it's good to have the option.

          Adrien, perhaps add a SortingSegmentMerger to the sorter package? Or at least add a test that verifies merges keep things sorted?

          Robert Muir added a comment -

          SegmentWriteState's fieldinfos should be final. Some of these little things took a long time to accomplish to give us a great deal of safety inside lucene internally.

Personally i don't think stuff like segmentmerger should be exposed publicly.

          Adrien Grand added a comment -

          I've tried playing with SegmentMerger to make it configurable. This could be used to reorder document IDs (if you look at the diff in LuceneTestCase, all that is needed to reorder doc IDs is to wrap the SlowCompositeReaderWrapper with a SortingAtomicReader). Do you think it is a step in the right direction?

          Shai Erera added a comment -

          Well, first if you want to keep the index sorted even per-segment, you need some hooks so that segments that are merged stay sorted, otherwise, very quickly they'll just become arbitrarily sorted again.

About global sorting, I think that if we allow terminating per-segment, then global sorting has lower priority (though it would still be good if we can tackle that nicely) because it will be just like doing distributed search on a single index ... no big deal.

          Adrien Grand added a comment -

          the SortingSegmentMerger will accumulate the readers in add(SegmentReader) and open a SortingAtomicReader over a MultiReader of all SegReaders... what do you think?

          I think this is a good idea!

          However, I don't understand this global sorting issue. What would it bring?

          Shai Erera added a comment -

          Let's define first what do we want to achieve? SortingAtomicReader lets you sort a single segment as-is, or by wrapping a MultiReader, sort a bunch of segments while merging them together. However, SortingAR is not usable today in an online sorting scenario, as it cannot be used by IndexWriter during merge. For that, we need to open up SegmentMerger to provide a SortingSegmentMerger.

          That gets you in a state of an index that is sorted per-segment, however not globally sorted. So for compression scenarios, that's enough. For early termination, that should be close to enough, provided you can "abort collection on a per-segment basis" – so you'll accumulate N first docs from each segment and then return top-N docs globally. Kind of like how distributed search works.

          Now you raise a different scenario - accessing N docs together, which may be located in different segments ... that's a new thing. If we want to achieve that, we have to have a relationship between segments such that seg1 > seg2 > seg3, and then you potentially visit one or only few segments when accessing them. Of course, if we achieve that, then you could also early terminate after exactly first N docs, no matter from which segment they came.

          But that's not at all easy .. I wonder if there is a good way to achieve that without globally sorting the index (i.e. what IndexSorter set out to do in the first place). For example, if you sort the docs according to their PageRank measure, then doc Y (somewhere near IR.maxDoc()) might be preferred over all previously existing documents ... how will you pull it from that segment? You'll need to sort the index globally so that it bubbles up to the start of the index.

          If you sort the documents by date, and assuming docs come at a relatively already-sorted order (i.e. today you'll meet newer docs than you met yesterday, and older than you'll meet tomorrow), then you could do some tricks to stay in a globally-sorted (or close to) state.

          I think that in light of that, online global sorting of an index is a challenging task, one that will need a SortingSegmentMerger / SortingMP or whatever anyway, so this issue is still valid, and perhaps we indeed need a separate issue to discuss how to achieve the global sorting (something I'm trying to achieve in a side-project today, although I have the easy case - sort-by-date). BTW, to keep an index globally sorted you cannot use TieredMP (at least, you must merge consecutive segments together) and might also need to reverse the order of the segments in the index (as viewed by the reader), depending on your sort criteria.

          David Smiley added a comment -

          Shai,
I don't think it's enough to have a SortingSegmentMerger. That's the simple part, thanks to your just-finished SortingAtomicReader. You pointed out that sorted segments alone will result in segments that are internally sorted and thus should compress a little better, but are effectively random-access between segments. So if some thousand adjacent documents need to be retrieved, it'll probably have to touch every segment, versus one or two. Granted, the newly rewritten title I chose for this issue "merge segments to sort them" is limited to just making a SortingSegmentMerger. Should another issue be filed with a title such as "MergePolicy to sort across segments"? The key word being "across" – I'm not sure how that could be clarified in a succinct title.

          Shai Erera added a comment -

          I traced the code and it seems like OneMerge holds the list of SegInfoPerCommit, and IW is the one that populates OneMerge.readers list by using its readersPool, which in turn opens a SegmentReader. I think that SegmentMerger is therefore the right entry point – the SortingSegmentMerger will accumulate the readers in add(SegmentReader) and open a SortingAtomicReader over a MultiReader of all SegReaders... what do you think?

          Adrien Grand added a comment -

          I took a look at MP API, and maybe if we change OneMerge from holding a List<SegmentReader> to List<AtomicReader>, we could write an MP which sorts segments together by opening a SortingAtomicReader over the segments that were picked for merge?

          Making the sorting stuff part of MergePolicy makes sense. However, I think that the (package-private) List<SegmentReader> in MergePolicy is only used to track the list of segment readers being used while merging (this reference is only used in IndexWriter). What MP actually manipulates is a list of SegmentInfoPerCommit, it is possible that no reader is open for a segment when MergePolicy picks it, and Lucene should not force a reader to be open until the merge actually starts. So maybe we should have an additional method in MergePolicy (or OneMerge for finer-grained control?) to tell IndexWriter how to view a list of segment readers? (either sequentially as today or a sorted view as suggested in this issue description).

          Shai Erera added a comment -

          When I thought about AtomicReaderFactory I thought about something that returns a single AtomicReader. IW-wise, that makes more sense (as an API) than something that reorders a list of segments. Reordering can be left to things like MergePolicy. In fact, I took a look at MP API, and maybe if we change OneMerge from holding a List<SegmentReader> to List<AtomicReader>, we could write an MP which sorts segments together by opening a SortingAtomicReader over the segments that were picked for merge?

          I think, if we do that, we don't need any hook on IWC (AtomicReaderFactory), although that might be useful in other cases, but outside the scope of this issue. All we'll need is a SortingMP?

          Adrien Grand added a comment -

          How can you early terminate a query for a single segment? [...] To early terminate efficiently, you must have the segments in a consistent order, e.g. S1 > S2 > S3.

          I think this is just an API limitation? Segments being processed independently, we should be able to terminate collection on a per-segment basis?

          instead of stuffing into IWC what seems like a random setting (pick-segments-for-sorting), we should have something more generic, like AtomicReaderFactory

          I didn't mean this should be a boolean. Of course it should be something more flexible/configurable! I'm very bad at picking names, but following your naming suggestion, we could have something like

          abstract class AtomicReaderFactory {
            abstract List<AtomicReader> reorder(List<SegmentReader> segmentReaders);
          }
          

          ?

          The default impl would be the identity whereas the sorting impl would return a singleton containing a sorted view over the segment readers?
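The two implementations suggested here could be sketched like this. This is a hedged illustration only, with stand-in types (`SegmentReader`/`AtomicReader` are placeholders, not the Lucene classes), matching the shape of the abstract class above:

```java
import java.util.*;

// Placeholders standing in for the Lucene reader classes.
interface AtomicReader { }
class SegmentReader implements AtomicReader { }

abstract class AtomicReaderFactory {
    abstract List<AtomicReader> reorder(List<SegmentReader> segmentReaders);
}

// Default implementation: the identity — readers are merged in their
// natural, sequential order, as IndexWriter does today.
class IdentityReaderFactory extends AtomicReaderFactory {
    @Override
    List<AtomicReader> reorder(List<SegmentReader> segmentReaders) {
        return new ArrayList<>(segmentReaders);
    }
}

// Sorting implementation: a singleton containing one sorted view over all
// segment readers (in Lucene terms, a SortingAtomicReader over a
// MultiReader). Placeholder wrapper here.
class SortingReaderFactory extends AtomicReaderFactory {
    @Override
    List<AtomicReader> reorder(List<SegmentReader> segmentReaders) {
        AtomicReader sortedView = new AtomicReader() { };
        return Collections.singletonList(sortedView);
    }
}
```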

          Also, a custom SegmentMerger to implement the zig-zag merge would help too.

          This is another option. I actually started exploring this option when David opened this issue, but it can become complicated, especially for postings lists merging, whereas reusing the sorted view from LUCENE-3918 would make merging trivial.

          David Smiley added a comment -

I wonder what other big-data software is also sorting its data files both within-file and across them (the latter being the trickier part I think)? Cassandra, HBase, or Accumulo? The code details are going to be specific to the platform but I'm interested in the scheduling / merging algorithm, which seems like the biggest challenge to me. I bet this has been solved before. My initial attempts at coming up with an algorithm on my notepad aren't showing much promise.

          Shai Erera added a comment -

          I was commenting on Adrien's statement about how this would benefit early termination. If we open up the hooks (AtomicReaderFactory and custom SegmentMerger), I think that will open the way for many interesting things to do during merging, including merge-sorting segments.

          David Smiley added a comment -

          Shai, Early-termination is a non-goal for this issue. Locality of reference is, with further index compression being a bonus.

          Shai Erera added a comment -

          How can you early terminate a query for a single segment? Say that you have 3 sorted segments (individually) and your query asks to get the top-10 of some criteria. The top-10 may come from the 3 segments as follows: seg1=4, seg2=4, seg3=2. But you don't know that until you processed all 3 segments right? While you could make a decision on a per-segment basis to 'terminate', there's no mechanism today to tell IndexSearcher "I'm done w/ that segment, move on". Today, if you want to early terminate, you need to throw an exception from the Collector, and catch it outside, in your application code?

          To early terminate efficiently, you must have the segments in a consistent order, e.g. S1 > S2 > S3. Then, after you've processed enough elements from S1, you can early terminate the entire query because you're guaranteed that successive documents will be "smaller".

          Unless you add to your Collector.collect() an "if (done) return" and consider that a no-op, or write your own IndexSearcher logic ... then per-segment early termination is doable.

          As for the approach you describe, I think that instead of stuffing into IWC what seems like a random setting (pick-segments-for-sorting), we should have something more generic, like an AtomicReaderFactory, which IW will use instead of always loading SegmentReader. That would let you load a custom AtomicReader. Or perhaps this can be a SortingCodec? Also, a custom SegmentMerger to implement the zig-zag merge would help too.
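To make the consistent-order idea concrete, here is a minimal, self-contained sketch (plain Java, not Lucene's Collector/IndexSearcher API; all names are invented for illustration). If segments are ordered S1 > S2 > S3 with disjoint value ranges, a top-K query can stop as soon as K hits are collected, because every remaining segment only holds smaller values:

```java
import java.util.Arrays;
import java.util.List;

// Sketch (not Lucene API): segments whose value ranges are disjoint and
// globally ordered (S1 > S2 > S3) allow terminating the whole query once K
// hits have been collected, because later segments only hold smaller values.
public class EarlyTerminationSketch {

  static int[] topK(List<int[]> orderedSegments, int k) {
    int[] top = new int[k];
    int collected = 0;
    for (int[] segment : orderedSegments) {  // segments in global order
      for (int value : segment) {            // each segment sorted descending
        top[collected++] = value;
        if (collected == k) {
          return top;                        // early terminate: all remaining
        }                                    // segments are strictly smaller
      }
    }
    return Arrays.copyOf(top, collected);    // fewer than k docs in total
  }

  public static void main(String[] args) {
    List<int[]> segments = List.of(
        new int[] {90, 80, 70, 60},          // S1: largest values
        new int[] {50, 40, 30},              // S2
        new int[] {20, 10});                 // S3: smallest values, never read
    System.out.println(Arrays.toString(topK(segments, 5)));
    // prints [90, 80, 70, 60, 50]
  }
}
```

Without the cross-segment ordering, the loop would have to visit every segment before knowing the true top-K, which is exactly the per-segment-only limitation described above.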

          Adrien Grand added a comment -

          I think a very simple first step could be to have an experimental IndexWriterConfig option that tells IndexWriter to provide an atomic sorted view (easy once LUCENE-3918 is committed) of the segments to merge to SegmentMerger instead of the segments themselves (see the calls to SegmentMerger.add(SegmentReader) in IndexWriter.mergeMiddle). Initial segments would remain unsorted, but the big ones, those that are interesting for both index compression and early query termination, would be sorted.

          It can seem inefficient to sort segments over and over but I don't think we should worry too much:

          • if we are merging "initial" segments (those created from IndexWriter.flush), they would be small, so sorting/merging them would be fast;
          • if we are merging big segments, I think that the following reasons could make merging slower than a regular merge:
            1. computing the new doc ID mapping,
            2. random I/O access,
            3. not being able to use the specialized codec merging methods.

          But if the segments to merge are already sorted, computing the new doc ID mapping could actually be fast (some sorting algorithms such as TimSort perform in O(n) when the input is a succession of sorted sequences), and the access patterns to the individual segments would be I/O cache-friendly (because each segment would be read sequentially). So I think this approach could be fast enough.
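A hedged sketch of why the new doc ID mapping is cheap when the inputs are already sorted (plain Java with invented names, not Lucene code): a k-way merge over the per-segment sorted runs visits each document once, reads every segment strictly sequentially, and emits the new ordering directly:

```java
import java.util.PriorityQueue;

// Sketch (hypothetical names, not Lucene code): when every segment to merge
// is already sorted on the sort field, the merged doc ID ordering is a k-way
// merge of sorted runs — near-linear work with a small heap, and each
// segment is scanned front to back (I/O-cache friendly), instead of a full
// re-sort of the union.
public class SortedMergeSketch {

  // values[s][d] = sort value of doc d in segment s; each row sorted asc.
  // Returns newToOld[newDocId] = {segmentIndex, oldDocId}.
  static int[][] mergeDocIdMap(int[][] values) {
    PriorityQueue<int[]> heap = new PriorityQueue<>(
        (a, b) -> Integer.compare(values[a[0]][a[1]], values[b[0]][b[1]]));
    int total = 0;
    for (int s = 0; s < values.length; s++) {
      if (values[s].length > 0) heap.add(new int[] {s, 0});
      total += values[s].length;
    }
    int[][] newToOld = new int[total][];
    for (int newDoc = 0; newDoc < total; newDoc++) {
      int[] head = heap.poll();              // smallest remaining value
      newToOld[newDoc] = head;
      if (head[1] + 1 < values[head[0]].length) {
        heap.add(new int[] {head[0], head[1] + 1});  // advance that segment
      }
    }
    return newToOld;
  }

  public static void main(String[] args) {
    int[][] values = {{1, 4, 9}, {2, 3, 10}, {5, 6}};
    for (int[] m : mergeDocIdMap(values)) {
      System.out.print("(" + m[0] + "," + m[1] + ") ");
    }
    // prints (0,0) (1,0) (1,1) (0,1) (2,0) (2,1) (0,2) (1,2)
  }
}
```

Note how each inner array index only ever increases, which is the sequential-read property the comment above relies on.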

          David Smiley added a comment -

          In response to Shai's comment https://issues.apache.org/jira/browse/LUCENE-3918?focusedCommentId=13591774&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13591774

          I suppose there's little point in writing a sorted segment if it is going to have to be merged anyway. In the scope of a document's lifespan, it will live shortly in its initial segment, and that segment will be the smallest. And sorting it before writing slows down real-time search requirements. So this issue should be about merging to create sorted segments.

          Shai pointed out that if we simply merge to create sorted segments, the result doesn't improve very much since the documents are effectively randomly striped across the segments, even if a segment itself is sorted.

          So I think the aim should be to create segments that are not only internally sorted but are sorted across them. This is hard but I think it's doable.

          So assume the segments are initially written unsorted. Eventually there are too many of these segments and we need to start merging. So if we have N such segments, we want N/2 (or use the merge factor, but assume 1/2 for the discussion) segments that are sorted both individually and across each other. I think this could be done by looking at a merged DocValues view of the sorted field (using RAM, of course, and I suspect plenty of existing Lucene code does this for cache/search context), dividing up the value space, and then plucking out documents from these segments in order to generate the first segment, then the second, etc. Thinking of how to do this for a particular range of DocValues that will become a segment: I think you first generate a bitset of those docids. Then perhaps you use a SlowCompositeReaderWrapper to see a merged view of the applicable segments, using the bitset as a filter for the documents. I expect I'm overlooking challenges, but I'm sure other smart people will point them out.
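The value-space partitioning step can be sketched as follows (illustrative plain Java, not Lucene code; bare int values stand in for (segment, docID) pairs plus DocValues, and duplicate values at range boundaries are ignored for simplicity). The idea: sort the merged view of the sort field, pick split points so each output segment receives roughly total/numOut documents, then build each output segment from the docs of every input segment that fall in its value range:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the partitioning idea (illustrative, not Lucene code): merge a
// view of the sort field across all input segments, split the value space
// into numOut ranges of ~equal document counts, then gather each range from
// every input segment — yielding output segments that are sorted both
// internally and across one another.
public class RangePartitionSketch {

  static List<List<Integer>> partition(int[][] segValues, int numOut) {
    // Merged, sorted view of the sort field across all input segments.
    int[] all = Arrays.stream(segValues)
        .flatMapToInt(Arrays::stream).sorted().toArray();
    int per = (all.length + numOut - 1) / numOut;  // docs per output segment
    List<List<Integer>> out = new ArrayList<>();
    for (int o = 0; o < numOut; o++) {
      int lo = all[o * per];                       // inclusive lower bound
      int hi = (o + 1) * per < all.length
          ? all[(o + 1) * per] : Integer.MAX_VALUE;  // exclusive upper bound
      List<Integer> docs = new ArrayList<>();      // docs for output seg o
      for (int[] seg : segValues)                  // "bitset" pass per input
        for (int v : seg)
          if (v >= lo && v < hi) docs.add(v);
      docs.sort(null);                             // sort within the segment
      out.add(docs);
    }
    return out;
  }

  public static void main(String[] args) {
    int[][] segs = {{7, 1, 9}, {4, 8, 2}, {6, 3, 5}};  // 3 unsorted segments
    System.out.println(partition(segs, 2));
    // prints [[1, 2, 3, 4, 5], [6, 7, 8, 9]]
  }
}
```

A real implementation would of course track (segment, docID) bitsets rather than copy values, and would have to handle duplicates straddling a split point, but the range-assignment logic is the core of the proposal above.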


            People

            • Assignee: Adrien Grand
            • Reporter: David Smiley
            • Votes: 3
            • Watchers: 6
