Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.3, Trunk
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      LUCENE-2482 added an IndexSorter to 3.x, but we need to port this
      functionality to 4.0 apis.

      1. LUCENE-3918.patch
        51 kB
        Shai Erera
      2. LUCENE-3918.patch
        51 kB
        Shai Erera
      3. LUCENE-3918.patch
        50 kB
        Shai Erera
      4. LUCENE-3918.patch
        68 kB
        Adrien Grand
      5. LUCENE-3918.patch
        67 kB
        Robert Muir
      6. LUCENE-3918.patch
        67 kB
        Shai Erera
      7. LUCENE-3918.patch
        76 kB
        Shai Erera
      8. LUCENE-3918.patch
        75 kB
        Shai Erera
      9. LUCENE-3918.patch
        70 kB
        Shai Erera
      10. LUCENE-3918.patch
        103 kB
        Robert Muir
      11. LUCENE-3918.patch
        104 kB
        Shai Erera
      12. LUCENE-3918.patch
        107 kB
        Shai Erera
      13. LUCENE-3918.patch
        108 kB
        Shai Erera
      14. LUCENE-3918.patch
        108 kB
        Shai Erera
      15. LUCENE-3918.patch
        147 kB
        Anat
      16. LUCENE-3918.patch
        115 kB
        Anat

        Issue Links

          Activity

          Anat added a comment -

          We have been working on porting IndexSorter to the trunk API, as well as adding some more features to it.
          The IndexSorter sorts documents in the index according to IDs, DocValues or Payloads; each of these is a different implementation
          of the Sorter interface. In addition, we provide a way to correctly access and iterate over documents in a sorted index,
          such as SortingIndexReader, SortingFields, SortingTerms etc.

          Michael McCandless added a comment -

          Wow, big patch! It looks good! Some comments:

          • Can you change indent to 2 spaces (not tab)?
          • I see two compilation errors:
                [javac] Compiling 31 source files to /l/indexsorter/lucene/build/misc/classes/java
                [javac] /l/indexsorter/lucene/misc/src/java/org/apache/lucene/index/sorter/SortingDocsAndPositionsEnum.java:49: error: diamond operator is not supported in -source 1.6
                [javac] 	private final Map<Integer, Position[]> newPositions = new HashMap<>();
                [javac] 	                                                                  ^
                [javac]   (use -source 7 or higher to enable diamond operator)
                [javac] /l/indexsorter/lucene/misc/src/java/org/apache/lucene/index/sorter/SortingDocsEnum.java:37: error: diamond operator is not supported in -source 1.6
                [javac] 		final List<Integer> newDocIds = new ArrayList<>();
                [javac] 		                                              ^
                [javac]   (use -source 7 or higher to enable diamond operator)
            
          • I'm confused: shouldn't SortingIndexReader.document use a
            newToOld[docID] instead of oldToNew[docId]? Same with other
            places, eg SortingSource.getBytes/Float/etc. (At least the 3.6.x
            version seems to do that...).
          • These new impls seem to be ... rather memory intensive? Eg
            SortingDocsEnum allocates List<Integer>, reads in all docs for
            this term up front, sorts them, and keeps Integer[] for iterating
            over. Sorter.oldToNew impls allocate new Doc[maxDoc] ... the 3.6
            impl seemed somewhat more RAM efficient (left things on disk and
            seek'd on each lookup). I don't think this is a showstopper
            ... but it may limit how large an index this can practically run
            on ... but I guess requiring high RAM to sort may be OK ...
          • Some places seem to open a DirectoryReader but not close it, eg
            DocumentSorter.oldToNew.
          • It's nice that you can sort by anything now ... first payload, doc
            values, stored fields, etc.
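The two diamond-operator compilation errors above have a mechanical fix under -source 1.6: repeat the type arguments on the right-hand side instead of using `<>`. A minimal sketch (`Position` here is only a stand-in for the patch's own class):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DiamondFix {
  // Stand-in for the patch's Position class, for illustration only.
  static class Position {}

  // Under -source 1.6 the diamond operator is unavailable, so the type
  // arguments must be spelled out on both sides of the assignment.
  final Map<Integer, Position[]> newPositions = new HashMap<Integer, Position[]>();
  final List<Integer> newDocIds = new ArrayList<Integer>();

  public static void main(String[] args) {
    DiamondFix f = new DiamondFix();
    System.out.println(f.newPositions.isEmpty() && f.newDocIds.isEmpty()); // true
  }
}
```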
          Shai Erera added a comment -

          the 3.6 impl seemed somewhat more RAM efficient (left things on disk and seek'd on each lookup)

          From what I remember (and I looked at the code again now), the 3.6 impl read the posting into a RAMOutputStream:

              private static final String TEMP_FILE = "temp";
              private final RAMDirectory tempDir = new RAMDirectory();
              private RAMOutputStream out;
          

          You can look at how seek(TermsEnum) iterates on original, writing everything to out and then calls to next() are served from this in-memory file.
          That concept has not changed in this version.

          I agree that List<Integer> seems like overkill here, and a growable int[] would be better. Since the array is sorted, perhaps we can grow the array (using ArrayUtil.grow), but keep an 'upto' and fill the extra values in the array with Integer.MAX_VALUE, so that they are always sorted last (instead of trimming the array). Then nextDoc() will just iterate up to 'upto'.
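The growable-int[] idea above can be sketched in a few lines: grow the array, pad the tail with Integer.MAX_VALUE so padding always sorts last, and track 'upto' so iteration stops at the real values. A simple doubling strategy stands in for Lucene's ArrayUtil.grow; names are illustrative, not the patch's API.

```java
import java.util.Arrays;

public class GrowableDocIds {
  int[] docs = new int[4];
  int upto = 0; // number of real values in 'docs'

  { Arrays.fill(docs, Integer.MAX_VALUE); } // padding sorts last

  void add(int docId) {
    if (upto == docs.length) {
      int[] bigger = new int[docs.length * 2]; // stand-in for ArrayUtil.grow
      Arrays.fill(bigger, Integer.MAX_VALUE);
      System.arraycopy(docs, 0, bigger, 0, upto);
      docs = bigger;
    }
    docs[upto++] = docId;
  }

  void sort() {
    // Sorting the whole array is safe: the MAX_VALUE padding lands at the end,
    // so no trimming is needed before the sort.
    Arrays.sort(docs);
  }

  public static void main(String[] args) {
    GrowableDocIds g = new GrowableDocIds();
    for (int d : new int[] {7, 2, 9, 4, 1}) g.add(d);
    g.sort();
    for (int i = 0; i < g.upto; i++) System.out.print(g.docs[i] + " "); // 1 2 4 7 9
  }
}
```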

          Some places seem to open a DirectoryReader but not close it, eg DocumentSorter.oldToNew.

          I think first that tests need to extend LuceneTestCase (from what I can tell, they don't now) and then use newDirectory(). This should catch all these places I think?

          Also, from a quick look at the tests:

          • They don't need a package.html, it can be removed
          • You can omit the Assert.* from all assert calls. E.g. rather than calling Assert.assertEquals, just call assertEquals.
          • Tests should use MockAnalyzer, rather than Standard (this thing should not add a dependency on analyzers module)
          Michael McCandless added a comment -

          From what I remember (and I looked at the code again now), the 3.6 impl read the posting into a RAMOutputStream:

          OK, but that approach is quite memory efficient (the postings are stored compressed in RAM) ... vs. creating Java objects per-doc and per-position (as SortingD&PEnum does in the patch).

          Shai Erera added a comment -

          Yeah, I see what you mean. Instead of Map<Integer,Position>, we could encode the positions (including payload) in a stream more efficiently and then sort a Position class which contains the docId + offset in the encoded stream. I'm guessing we could do that using a growable DataOutput (instead of RAMOutputStream which uses arrays of 1024 bytes) or something like that.

          I think that will indeed be more efficient. Also, the Map is a bit of overkill.
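The stream-encoding idea above, in miniature: rather than a Map<Integer, Position[]>, append each document's positions to a single growing byte stream and keep only (docId, offset) pairs on the side, decoding positions lazily. java.io.DataOutputStream stands in here for Lucene's growable DataOutput; class and method names are illustrative, not the patch's actual API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class EncodedPositions {
  private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
  private final DataOutputStream out = new DataOutputStream(bytes);
  private int[] docs = new int[0];
  private long[] offsets = new long[0];

  void addDoc(int docId, int[] positions) throws IOException {
    docs = Arrays.copyOf(docs, docs.length + 1);
    offsets = Arrays.copyOf(offsets, offsets.length + 1);
    docs[docs.length - 1] = docId;
    offsets[offsets.length - 1] = out.size(); // where this doc's slice starts
    out.writeInt(positions.length);
    for (int p : positions) out.writeInt(p);
  }

  int[] positionsOf(int idx) throws IOException {
    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
    in.skipBytes((int) offsets[idx]); // seek to the doc's slice
    int[] positions = new int[in.readInt()];
    for (int i = 0; i < positions.length; i++) positions[i] = in.readInt();
    return positions;
  }

  public static void main(String[] args) throws IOException {
    EncodedPositions enc = new EncodedPositions();
    enc.addDoc(5, new int[] {1, 3});
    enc.addDoc(2, new int[] {7});
    System.out.println(Arrays.toString(enc.positionsOf(1))); // [7]
  }
}
```

Sorting then only needs to permute the (docId, offset) pairs, not the encoded bytes.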

          Anat added a comment -

          Thanks for the feedback, I'll look into it and post back when I have any updates.

          David Smiley added a comment -

          I was thinking today that it would be nice if a document's stored values file could be sorted by some field (e.g. a geohash, or time, or ...). Wouldn't that be cool? Would such a feature obsolete the use of the index sorter that this issue speaks of?

          Shai Erera added a comment -

          One of the use cases for IndexSorter is when you want to, e.g., do quick early termination for a query. Then, if you sort the index (say by date), and you're interested in documents sorted by date, you can do a fast early termination after visiting exactly N docs.
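The early-termination idea above, in miniature: if the index is sorted by the same key as the query's sort, a collector can stop after the first N documents it sees, since every later doc can only sort worse. Lucene signals this with CollectionTerminatedException; a plain RuntimeException stands in here so the sketch runs without Lucene on the classpath, and all names are illustrative.

```java
public class EarlyTermination {
  // Stand-in for Lucene's CollectionTerminatedException.
  static class StopCollecting extends RuntimeException {}

  static int[] topN(int[] sortedDocs, int n) {
    int[] hits = new int[Math.min(n, sortedDocs.length)];
    int count = 0;
    try {
      for (int doc : sortedDocs) {
        hits[count++] = doc;
        if (count == n) throw new StopCollecting(); // stop the scan early
      }
    } catch (StopCollecting expected) {
      // collection terminated after exactly n docs; the rest were never visited
    }
    return hits;
  }

  public static void main(String[] args) {
    // Doc IDs already in sort order (e.g. by date) after index sorting.
    int[] docs = {0, 1, 2, 3, 4, 5, 6, 7};
    System.out.println(java.util.Arrays.toString(topN(docs, 3))); // [0, 1, 2]
  }
}
```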

          I'm not sure that I understand your proposal though, or whether it'll help such use cases?

          David Smiley added a comment -

          Oh, that's interesting. My proposal doesn't help your early-termination query use case. It does help cache locality for spatially oriented queries, or queries in the vicinity of a recent time, or whatever.

          Adrien Grand added a comment -

          I didn't know about IndexSorter. It would be nice to have a codec wrapper that would override all *Writer/Consumer.merge methods to make the merged segments sorted. This could help make the I/O cache work happily with stored fields/term vectors, and early terminate queries after having visited min(maxBufferedDocs, numHits) documents, without requiring the index to be read-only or having to store a doc ID mapping in memory (but at a higher indexing/merging cost)?

          Eks Dev added a comment -

          This is the right way to give some really good meaning to the venerable optimize call.

          We were, and are, sorting our data before indexing just to achieve exactly this: an improvement in locality of reference. Depending on the data (it has to be somehow sortable, e.g. a hierarchical structure, on URL ...), the speedup (and likely the compression gains Adrien made) are sometimes unbelievable ...

          David Smiley added a comment -

          Then let's do this! (The right way: in memory during indexing, before it hits the disk, not re-ordering existing on-disk segments after the fact.) I created LUCENE-4752 for this. I won't be the one to implement it, at least not now.

          Anat added a comment -

          Added an updated version of the patch containing changes addressing the comments.

          The major changes are:
          • Tests now extend LuceneTestCase.
          • List<Integer> was replaced with a growable int array.
          • Instead of using a Java Map object, we now encode the positions information in a DataOutput stream.

          Adrien Grand added a comment -

          This patch seems to have been created against an old 4.x branch (4.0 or 4.1 maybe?). We usually commit new features to trunk before backporting them to branch_4x, could you update the patch so that it can apply on top of trunk? Some comments on this new patch:

          • All Sorter implementations open the provided Directory and close it before returning, shouldn't this interface directly take an IndexReader as an argument?
          • SorterUtil.sort uses the stored fields API to create a new sorted index, this won't work in a few cases, especially if fields are not stored. I think it should rather use IndexWriter.addIndexes(IndexReader...).
          • SortingIndexReader constructor expects a CompositeIndexReader and calls new SlowCompositeReaderWrapper() to have an atomic view of this reader. I think it should take any index reader and wrap it using SlowCompositeReaderWrapper.wrap (compared to new SlowCompositeReaderWrapper(), this optimizes the case where the composite reader only wraps a single atomic reader).
          • Why does SortingIndexReader.getLiveDocs always return null?
          Shai Erera added a comment -

          Patch addresses the following:

          • Updated patch to trunk API, following the DocValues changes. As a result, I implemented only NumericDocValues sorting, throwing UnsupportedOperationException for the rest. We can add respective impls later.
          • Cleaned up all dependencies on analysis module.
          • Replaced PostingMap[] by two parallel int[] (docs) and long[] (offsets) arrays. To sort them, I wrote a simple SorterTemplate extension which swaps the entries in both arrays when asked.
          • Replaced Position by direct members (no need to allocate a new instance on every nextPosition()).
          • addPosition wrote the BytesRef.offset – we don't need to write the offset. Rather, nextPosition always returns a BytesRef with offset=0.
            • In that regard, modified the payload BytesRef to initialize once and reused in all nextPosition calls, growing its array as needed.
          • Cleaned up tests:
            • Rather than try-catch(Exception) followed by fail, test methods can throw Exception. Cleaner code.
            • Handled some warnings.
            • Used TEST_VERSION_CURRENT.
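The parallel-array replacement described above can be sketched without Lucene's SorterTemplate: compare on docs[] and, whenever two slots are swapped, swap the matching offsets[] entries too, keeping the arrays aligned. A simple insertion sort keeps the sketch short; SorterTemplate uses faster sorts in practice, and the names here are illustrative.

```java
public class ParallelSort {
  static void sort(int[] docs, long[] offsets) {
    for (int i = 1; i < docs.length; i++) {
      // Compare only on docs[]; every swap is mirrored in offsets[] so the
      // i-th doc always stays paired with the i-th offset.
      for (int j = i; j > 0 && docs[j - 1] > docs[j]; j--) {
        int d = docs[j]; docs[j] = docs[j - 1]; docs[j - 1] = d;
        long o = offsets[j]; offsets[j] = offsets[j - 1]; offsets[j - 1] = o;
      }
    }
  }

  public static void main(String[] args) {
    int[] docs = {5, 1, 3};
    long[] offsets = {50L, 10L, 30L};
    sort(docs, offsets);
    System.out.println(java.util.Arrays.toString(docs));    // [1, 3, 5]
    System.out.println(java.util.Arrays.toString(offsets)); // [10, 30, 50]
  }
}
```

Two primitive arrays avoid the per-entry object allocation a PostingMap[] would incur.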

          Also,

          ...shouldn't this interface directly take an IndexReader as an argument?

          You're right, fixed in this patch too!

          SorterUtil.sort uses the stored fields API...

          Fixed in the patch!

          SortingIndexReader constructor expects a CompositeIndexReader...

          Fixed!

          Why does SortingIndexReader.getLiveDocs always return null?

          I don't remember at the moment why, but this utility currently doesn't handle source indexes with deleted documents ... again, not sure why; maybe the old sorter didn't support them either? Anyway, I made SortingIndexReader yell at you if you pass such a reader, and we can separately (or still in this issue) enhance it to support such indexes. I will think about it.

          I wonder if perhaps instead of To we should have an extension point on each of the sorters, to parse the relevant information, returning a Comparable?

          Shai Erera added a comment -

          Then let's do this! (the right way, in memory during indexing before it hits the disk, not re-ordering existing on-disk segments after the fact)

          This issue is about porting the previous IndexSorter implementation to the trunk API. The previous one offered a one-time sorting of an index, and so does this one. While that doesn't mean we shouldn't explore alternatives, I find this a much lower-hanging fruit than LUCENE-4752, especially as no one has yet assigned that issue to himself, nor does it look like any progress has been made. If LUCENE-4752 eventually sees the light of day, I don't mind if we nuke IndexSorter completely (replaced by a SortingCodec I guess?), but until then, I think that offering users A way to sort their index is valuable too.

          Also, it's not clear to me at the moment (but I admit I haven't thought about it much) how you can sort documents during indexing, while the values to be sorted by may still be unknown. I.e. what if your sort-by key is a NumericDocValues which the Codec hasn't seen yet? How should it write posting lists, stored fields etc.? Does this mean the Codec must cache the entire to-be-written segment in RAM? That will consume much more RAM than the approach in this issue ...

          I think that online sorting is much more powerful than a one-time sort, but there's work to do to make it happen efficiently. Therefore, until then, I think we should proceed with this offline sorting strategy, which is better than nothing.

          Uwe Schindler added a comment -

          I would rename SortingIndexReader to SortingAtomicReader for consistency. We did this with all 3.x IndexReaders.

          Shai Erera added a comment -

          Thanks Uwe, good idea! I also made it take a Sorter and not int[] old2New, which is too low-level IMO.

          Shai Erera added a comment -

          Got rid of To. StoredDocumentSorter and PayloadSorter are now abstract classes with an abstract parse() method, taking the relevant input and returning a Comparable.

          Robert Muir added a comment -

          Also, it's not clear to me at the moment (but I admit I haven't thought about it much) how you can sort documents during indexing, while the values to be sorted by may still be unknown. I.e. what if your sort-by key is a NumericDocValues which the Codec hasn't seen yet? How should it write posting lists, stored fields etc.? Does this mean the Codec must cache the entire to-be-written segment in RAM? That will consume much more RAM than the approach in this issue ...

          I totally agree with Shai.

          If someone opened an issue to provide an option where IndexWriter constantly optimized in the background, I would be totally against that. Same goes for it doing sorting in the background: unless someone comes up with some serious magic, I don't see this happening with acceptable runtime.

          Shai Erera added a comment -

          I noticed that package.html got seriously outdated, so I shortened it and removed class references from it. From my experience, it's not easy to maintain these lists of classes in package.html, especially for an experimental feature.

          Robert Muir added a comment -

          Patch with the other DocValues types too.

          Shai Erera added a comment -

          Well, in theory, one could do the following:

          • Let segments be flushed unsorted
          • During merge, open an AtomicReader which detects that it's an unsorted segment and build the old2New[], reading the documents in the right order
          • Provide their own SegmentMerger which doesn't append seg2 after seg1, but rather does a true "merge" of the two, by comparing the sort-by value of each document

          But that's all theory; none of these APIs exist today. Also, even if you did that, your index would still be unsorted, because a sorted segment says nothing about the total sort order of the index...
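The "true merge" in the last bullet above is, at its core, just a merge of two key-sorted document streams. In miniature, with plain arrays of sort keys standing in for segments (no real Lucene merge APIs are used, since, as noted, none exist for this yet):

```java
import java.util.Arrays;

public class SortedSegmentMerge {
  static long[] merge(long[] seg1, long[] seg2) {
    long[] merged = new long[seg1.length + seg2.length];
    int i = 0, j = 0, k = 0;
    // Interleave by comparing each document's sort-by value, instead of
    // appending seg2 after seg1 the way a normal merge would.
    while (i < seg1.length && j < seg2.length) {
      merged[k++] = seg1[i] <= seg2[j] ? seg1[i++] : seg2[j++];
    }
    while (i < seg1.length) merged[k++] = seg1[i++];
    while (j < seg2.length) merged[k++] = seg2[j++];
    return merged;
  }

  public static void main(String[] args) {
    long[] a = {1L, 4L, 9L};  // sort keys of segment 1, already sorted
    long[] b = {2L, 3L, 10L}; // sort keys of segment 2, already sorted
    System.out.println(Arrays.toString(merge(a, b))); // [1, 2, 3, 4, 9, 10]
  }
}
```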

          Shai Erera added a comment -

          patch with the other docvalues types too

          Thanks Rob!

          Indeed, I didn't separate the ability to sort by other DocValues types from the ability to read them from an index, given some permutation on the docs. So now we have full support for reading such values from the index, given a permutation on the docs.

          For sorting by such values, we offer for now only NumericDVSorter. Others can be added later on demand.

          I think that in terms of the API and functionality, this is quite ready. However, we should beef up the tests ... for instance:

          • Move to use newDirectory(), newIWC(), RandomIndexWriter etc. to incorporate some more randomness.
            • I'm sure that most tests will fail since currently none seem to close the directory.
          • Write some heavier tests which index random values to sort by (in different sources - payload, stored field ...), and lots of documents, to make sure our handling of the in-memory cached stuff works correctly.
          • Write a test which creates a large index (few GBs) to make sure we're not limited to e.g. 2GB indexes while sorting ...

          We don't need multi-threaded tests, as IndexSorter is not written for use in such scenarios.

          Robert Muir added a comment -

          In my mind, the most important thing when testing such tools is to ensure that they won't lead to IndexWriter.addIndexes making a corrupt index.

          previously you could only really check this by calling IW.addIndexes and checkIndex on the resulting index.

          but now we also have _TestUtil.checkReader(IndexReader), so you can verify the consistency of the reader itself, which makes it easier to debug.

          Shai Erera added a comment -

          Ok, I'm making some changes to the tests now, compact them a little bit and add some randomness. I'll factor in checkReader too!

          Robert Muir added a comment -

          the slowwrapping in the sortingatomicreader must be removed before committing as well.

          Shai Erera added a comment -

          the slowwrapping in the sortingatomicreader must be removed before committing as well.

          How? This utility sorts an entire index ... it doesn't work per-segment. The old2New mapping is a global mapping.
          I thought about this, but I didn't see any real choice. I.e., even if we sort each segment separately, we'd still need to sort the index globally (see my comment about the online sorting).

          If you have an idea how it can be done per-segment, I'll be happy to hear it.

          Robert Muir added a comment -

          Please make it only take an atomic reader in its ctor; the user can slow-wrap it themselves.

          The current API is bogus.

          Shai Erera added a comment -

          Ok I will. That way, one can use it to achieve per-segment sorting .. if it means something for anyone (perhaps for better compression).

          David Smiley added a comment -

          Shai, why do you think index sorting is done after the index has been written (this patch) vs. before writing it (the LUCENE-4752 proposal)? Is it harder and more complicated to write it sorted in the first place? Or maybe it would slow writing down, when this is the kind of thing that could be scheduled as a nightly job? Just guessing. The lack of support for deleted documents is surprising; is this API typically used in a batch job that first deletes then sorts?

          Shai Erera added a comment -

          The approach taken by this issue (well, originally by the 3x IndexSorter) is that you do an offline sorting of an entire index. So if you e.g. have a 10-segments index, you end up with a single segment, totally sorted across all documents.

          At least from my understanding of how the online sorting would work (LUCENE-4752), the Codec would need to determine beforehand the permutation on the documents, or build an in-memory segment and then, when it's done, sort it and write it sorted, right? Otherwise, I don't understand how it can handle this series of addDocument calls (assume the value denotes the location of the document in the sorted index): doc(2), doc(1), doc(7), doc(0)...? The stored fields and term-vectors are not cached in-memory today. The location of the document in the sorted index is unknown until all keys (by which you sort) are encountered, which may be too late for the Codec?

          And even if you get past that hurdle (say you're willing to cache everything in-memory and then flush to disk sorted), how will you handle merges? So now you have an index with segments 1,2,3 (each sorted). How do you merge-sort them? Today, you don't have the API for it, so let's say that we add it (plugging in your own SegmentMerger). Now MP selects segments 1,2 for merge, so you end up with segments 3,4, which are again each sorted separately, but the index is not globally sorted, right? In a sorted index, the segments need to have a consistent > (or <) relationship between the segments .. or otherwise you're just traversing documents in random order.

          In short, if you do come up with a reasonable way to do online index sorting (on LUCENE-4752), I'll be all for it. And if it will make sense, we can even drop the offline index sorter too. But I think that there are many challenges in getting it right, and efficiently. It's not a mere Codec trick IMO.

          Also, note that as far as memory consumption for offline sorting, we only cache in memory the current posting list that's being sorted (the rest relies on pre-existing random-access API).

          But, I could be totally missing your idea for online sorting, in which case I'd appreciate if you elaborate how you think it can be done. But I prefer that we discuss that on LUCENE-4752.

          David Smiley added a comment -

          Thanks so much for your insights Shai. I responded on LUCENE-4752 and I added you to the watcher list there.

          Andrzej Bialecki added a comment -

          Shai,

          That way, one can use it to achieve per-segment sorting .. if it means something for anyone (perhaps for better compression).

          Global ordering over a multi-segment index seems less useful than the local per-segment ordering - after all, the main use cases are compression improvements and early termination strategies, both of which operate on a per-segment basis.

          Shai Erera added a comment -

          Patch:

          • Consolidates tests into fewer classes; also creates the test index in BeforeClass
          • Randomizing tests: #docs, sort permutations
          • Added _TestUtil.checkIndex and checkReader (see below) - this uncovered few bugs
          • Moved relevant classes to inherit from their FilterXYZ counterpart - reduced code

          In SortingAtomicReaderTest I had to call _TestUtil.checkReader with crossCheckTermVectors=false. I put a comment why, but I'll repeat it here. Let's say that your permutation is [2,0,1] which means doc 0 goes to 2, 1 goes to 0 and 2 goes to 1. You then ask for termVectors(0), and receive, correctly, those of document 2. Now, when crossCheck validation occurs, it iterates on the terms of the TV, expecting to find each term associated with document 0 (since CheckIndex doesn't know about sorting reader, and that in fact it works now w/ doc #2). However, the terms of doc #2 are mapped, again - correctly, to document #1 (as the permutation specifies). Therefore CheckIndex fails since it expects doc 0 but receives doc 1.

          Unless I misunderstand how this cross-checking should work, I think that it's ok to disable cross-checking on the reader, in this case. In SorterUtilTest, where I addIndexes, I do checkIndex with cross-checking, and there it passes (as it should).

          All that's missing is CHANGES, but I'll let others review ...

          Shai Erera added a comment -

          I agree that per-segment sorting may improve compression, but I don't see how it can help with early termination. If Lucene allowed you to terminate on a per-segment basis then maybe, but today (as far as I know), the only way to terminate a query early is by throwing an exception and catching it outside, right?

          At any rate, per-segment sorting doesn't help much when segments are merged, because then you lose sorting again. Unless, we add some API to IW / SegmentMerger for you to control how the segments are merged...

          David Smiley added a comment -

          Global ordering over a multi-segment index seems less useful than the local per-segment ordering - after all, the main use cases are compression improvements and early termination strategies, both of which operate on a per-segment basis.

          Those are two "main use-cases" but there is a third that (to me) is far more important than either – locality of reference. Retrieving documents near each other (according to the configured sort order) should ideally have their stored fields data very close together. The early-termination use-case is clearly important to you but is not the goal of LUCENE-4752.

          Shai Erera added a comment -

          Patch adds support for indexes with deleted documents too. While doing that, I noticed that the implementation was slightly buggy – it only used an old2new mapping, which is important for mapping the posting lists, but it also had to use a new2old mapping, for "random access" methods such as document(), termVectors() etc...
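The old2new/new2old relationship can be illustrated with a tiny self-contained helper (all names here are hypothetical, not the patch's actual classes): the sorter only has to produce old2new, and the inverse mapping is derived from it.

```java
// Hypothetical sketch, not patch code: a sorter produces old2new, and the
// reader derives new2old by inverting it. Postings enumerate old doc ids
// and must be remapped forward; random-access methods like document(n)
// receive new doc ids and must be mapped back to the original ones.
class DocMap {
    private final int[] old2new; // old2new[oldDoc] = oldDoc's sorted position
    private final int[] new2old; // new2old[newDoc] = original id of that doc

    DocMap(int[] old2new) {
        this.old2new = old2new;
        this.new2old = new int[old2new.length];
        for (int oldDoc = 0; oldDoc < old2new.length; oldDoc++) {
            new2old[old2new[oldDoc]] = oldDoc; // invert the permutation
        }
    }

    int oldToNew(int oldDoc) { // used when remapping posting lists
        return old2new[oldDoc];
    }

    int newToOld(int newDoc) { // used by document(), termVectors(), etc.
        return new2old[newDoc];
    }
}
```

For example, with old2new = {2, 0, 1}, document(0) must fetch the stored fields of old doc newToOld(0) = 1, while a posting that hits old doc 0 is reported as new doc oldToNew(0) = 2.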

          I enhanced the tests to cover these cases and now there's a:

          • SorterTestBase which creates an index in BeforeClass with all needed information in the documents
            • It also contains all the test methods.
          • SortingAtomicReaderTest opens a SortingAtomicReader in its BeforeClass, setting it as the tests' reader.
            • It also now successfully passes CheckIndex(reader). (so I was wrong in my previous analysis, which was misled by the buggy code before)
          • SorterUtilTest sorts the source index by calling IW.addIndexes, and then tests run on the sorted index (via a regular reader).
            • It also CheckIndex(directory) to make sure the sorted index is valid as well.

          I put a comment on Sorter.old2new that implementations must return a mapping for deleted documents too. it is important for SortingAtomicReader to operate correctly, however these documents are dropped when the index is actually sorted.

          I would like to give the tests some more runs (it seems that every time I thought I was done, a new seed uncovered another bug).

          I already know that the tests cannot work w/ Lucene40 and Lucene41 codecs (because of SortedDV), but also with SepPostingsWriter (since it doesn't support indexing offsets, and I test that too). How can I disable it for the tests? I didn't see a SepCodec or something that I can add to SuppressCodecs...

          Michael McCandless added a comment -

          I already know that the tests cannot work w/ Lucene40 and Lucene41 codecs (because of SortedDV), but also with SepPostingsWriter (since it doesn't support indexing offsets, and I test that too). How can I disable it for the tests? I didn't see a SepCodec or something that I can add to SuppressCodecs...

          Can you just add "MockSep", "MockFixedIntBlock", "MockVariableIntBlock" to the @SuppressCodecs (it can contain posting format names too...). Hmm maybe also "MockRandom".

          Shai Erera added a comment -

          I added to SuppressCodecs the ones Mike mentioned, plus MockRandom since it also randomly picks the Sep writer.

          Improved some javadocs, more cleanups. Added CHANGES.

          I think it's ready, would appreciate if someone can take another look.

          Adrien Grand added a comment -

          Thanks for your work Shai. Indeed it looks really good now! Here are a few suggestions/questions:

          • Are there actual use-cases for sorting by stored fields or payloads? If not I think we should remove StoredFieldsSorter and PayloadSorter?
          • Remove IndexSorter.java and make SortDoc package-private?
          +      // we cannot reuse the given DocsAndPositionsEnum because we return our
          +      // own wrapper, and not all Codecs like it.
          

          Maybe we could check if the docs enum to reuse is an instance of SortingDocsEnum and reuse its wrapped DocEnum?
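A rough sketch of that reuse check, using stand-in classes rather than the real DocsEnum/SortingDocsEnum APIs (every name below is illustrative):

```java
// Illustrative stand-ins for DocsEnum and the sorting wrapper; these are
// not the actual Lucene classes, just a sketch of the instanceof-unwrap idea.
interface Postings {
    int nextDoc();
}

class SortingPostings implements Postings {
    private final Postings in; // the codec-provided enum being wrapped

    SortingPostings(Postings in) {
        this.in = in;
    }

    Postings getWrapped() {
        return in;
    }

    @Override
    public int nextDoc() {
        // A real sorting enum would buffer and remap doc ids here; elided.
        return in.nextDoc();
    }

    // Only hand a reuse candidate down to the codec if it is one of our own
    // wrappers; then the codec sees the enum class it originally returned.
    static Postings unwrapForReuse(Postings reuse) {
        if (reuse instanceof SortingPostings) {
            return ((SortingPostings) reuse).getWrapped();
        }
        return null; // not ours: don't pass a foreign wrapper to the codec
    }
}
```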

          Shai Erera added a comment -

          Are there actual use-cases for sorting by stored fields or payloads? If not I think we should remove StoredFieldsSorter and PayloadSorter?

          The original 3x implementation allowed sorting only by stored fields, so we thought if it was useful to someone then, it might be useful now. And sort by payload value was useful in the past, before DV days, and e.g. we still have code that stores sort-by values in payloads. I don't think it's harmful to keep PayloadSorter, but I don't mind removing it as well.

          Remove IndexSorter.java

          IndexSorter is a convenient utility for sorting a Directory end-to-end. Why remove it? On the opposite, I was thinking if we want to make it more of an instance utility, i.e. that you do something like new IndexSorter().sort() and perhaps allow to sort a Directory in-place (by opening an IW, deleteAll() and addIndexes SortingReader...).

          make SortDoc package-private

          SortDoc, while not visible on the API today, might be useful for Sorter implementations, as a convenient way to create the old2new[]. Otherwise, if you need to sort both 'docs' and 'values', you need to work with something similar to SorterTemplate. I can remove it and implement sorting docs[] and values[] in each Sorter impl, or make Sorter an abstract class (parameterized with T extends Comparable<T>) which does the sorting for you?

          Maybe we could check if the docs enum to reuse is an instance of SortingDocsEnum and reuse its wrapped DocEnum?

          Yeah, I've been thinking about it while writing the comment. Since FilterDocsEnum doesn't expose 'in' in a public way, I was wondering if I should expose the wrapped 'in' in SortingXYZ just for that purpose. How important is it to reuse DocsEnum/AndPositions?

          Shai Erera added a comment -

          or make Sorter an abstract class (parameterized with T extends Comparable<T>) which does the sorting for you?

          Hmm, Sorter should not be parameterized since it only returns an old2new[] mapping. The interface doesn't care about the values at all.

          Shai Erera added a comment -

          Patch does:

          • Get rid of SortDoc. Sorter is now abstract class with a helper int[] compute(int[] docs, T[] values) method which sorter implementations can call to implement oldToNew().
            • Thanks Uwe for helping me sort out the generics signatures!!
          • Folded all SortingXYZ classes as private inner classes of SortingAtomicReader. While it increases SortingAR's code size, I feel that's the right thing to do, as these classes have no business on their own (even as package-private). If in the future we'll want to open them up for extension, we can make them public or something.
          • Add getWrapped to both SortingDocsEnum and Positions, and implemented reuse as suggested by Adrien.
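The compute(docs, values) helper described above can be approximated by the following self-contained sketch (boxing-based for brevity, and not the patch's actual in-place implementation; the class name is illustrative):

```java
// Simplified sketch of a compute(docs, values) helper: order the given doc
// ids by their sort-key values and return the old->new permutation.
class PermutationSorter {
    static <T extends Comparable<T>> int[] compute(int[] docs, T[] values) {
        // order[k] = position (into docs/values) of the k-th smallest value.
        Integer[] order = new Integer[docs.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Arrays.sort on objects is stable, so equal keys keep doc order.
        java.util.Arrays.sort(order, (a, b) -> values[a].compareTo(values[b]));
        int[] old2new = new int[docs.length];
        for (int newPos = 0; newPos < order.length; newPos++) {
            old2new[docs[order[newPos]]] = newPos;
        }
        return old2new;
    }
}
```

With docs = {0, 1, 2} and values = {"b", "a", "c"}, this returns {1, 0, 2}: doc 1 ("a") comes first in the sorted index.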
          Robert Muir added a comment -

          The current tests suppress a long list of codecs because some methods test full functionality including some newer index features.

          I think if we are going to test this way, then sorting reader itself should throw unsupported operation exception (refuse to sort segments from 3.x, 4.0, 4.1 codecs).

          Because it's really unsupported/untested. Otherwise, we should pull these newer methods out.

          Shai Erera added a comment -

          Good idea. Is there a quick way to identify such indexes? Basically I need to iterate over leaves() and check version/Codec of each? Since we get an AtomicReader, what will leaves() return for a SlowWrapper?

          Uwe Schindler added a comment -

          Since we get an AtomicReader, what will leaves() return for a SlowWrapper?

          Itself - because it's atomic. With a slow wrapper you have no chance to get the underlying codec information. With atomic readers implemented by SegmentReader there might be solutions.

          Adrien Grand added a comment -

          Regarding PayloadSorter and StoredFieldsSorter I'm just afraid that the fact that they exist might make users think these are viable options...

          IndexSorter is a convenient utility for sorting a Directory end-to-end. Why remove it?

          I think taking an AtomicReader as an argument (instead of a Directory) and feeding an IndexWriter (instead of another Directory) would be much more flexible, but then it would just be a call to IndexWriter.addIndexes... If we want a utility to sort indexes, maybe it should rather be something callable from the command line? (java oal.index.sorter.IndexSorter fromDir toDir sortField)

          Get rid of SortDoc. Sorter is now abstract class with a helper int[] compute(int[] docs, T[] values)

          I think it's better! Maybe a List instead of an array would be even better so that NumericDocValuesSorter could use a view over the doc values instead of loading all of them into memory?

          Reuse of DocsEnum looks great!

          Shai Erera added a comment -

          Robert, now that I think about it again, I'm not sure if SortingAtomicReader should do anything to prevent usage over such Codecs. The list of codecs that do not support indexing offsets is there because the test indexes offsets. But SortingAtomicReader itself doesn't care about whether or not the Codec supported offsets – it will get -1 for offsets (that's what happened before I set IndexOptions correctly!). I think that's fine?

          As for the new DV features, again, the reader itself is immune to this? I.e. you would get same exception if you opened a 4.0 index and called getSortedSetDV?

          I think it's fine that the reader doesn't yell at you, and I wish we could test those codecs somehow, ignoring the unsupported methods, but I also think that these are not tests that must cover all Codecs. It's a generic filter reader and it will yell at you, just like any other filter reader, if you call an unsupported API?

          Shai Erera added a comment -

          Maybe a List instead of an array would be even better

          The thing is, I use two parallel arrays to sort the documents (docs and values) and SorterTemplate swaps entries in the arrays, so I think that kinda means we need to load everything into memory (well, certainly we cannot swap entries on disk)? But this is just the initial loading used to determine the docs order ... and also, someone can implement a Sorter.oldToNew without loading anything into RAM, but I doubt if anyone will.

          As for IndexSorter utility, I thought about adding main(), but then sortField is not enough (only works for NumericDV). On the other hand, I do get your point that there's not much use for a Java API that takes Directory-ies... and I don't want to add API that takes reader/writer and only calls .addIndexes. So one option is to remove the class, but still keep a test around which does the addIndexes to make sure it works.

          I don't want however to add a main that is limited to NumericDV ... and I do think that stored fields / payload value are viable options. At least, stored field is, I can let go of the payload, if that really bothers anyone.
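
          For illustration, the two-parallel-array approach described above can be sketched in plain Java. This is only a sketch of the technique (swap entries in both arrays in lockstep, which is why all values must be in memory), not Lucene's actual SorterTemplate code; all names here are made up:

```java
import java.util.Arrays;

public class ParallelArraySort {
    // Swap entries in both arrays together, the way a SorterTemplate-style
    // implementation swaps docs and values in lockstep.
    static void swap(int[] docs, long[] values, int i, int j) {
        int d = docs[i]; docs[i] = docs[j]; docs[j] = d;
        long v = values[i]; values[i] = values[j]; values[j] = v;
    }

    // Simple insertion sort over the parallel arrays, ordering by value.
    static void sort(int[] docs, long[] values) {
        for (int i = 1; i < docs.length; i++) {
            for (int j = i; j > 0 && values[j - 1] > values[j]; j--) {
                swap(docs, values, j - 1, j);
            }
        }
    }

    public static void main(String[] args) {
        int[] docs = {0, 1, 2, 3};
        long[] values = {42, 7, 19, 3}; // all loaded into memory up front
        sort(docs, values);
        System.out.println(Arrays.toString(docs));   // [3, 1, 2, 0]
        System.out.println(Arrays.toString(values)); // [3, 7, 19, 42]
    }
}
```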

          Robert Muir added a comment -

          Patch fixing tests to not suppress whole codecs.

          Instead, testSortedSet() has an assume (and is ignored for ancient codecs).

          In the case of offsets, ancient codecs just index and test docs/freqs/positions without offsets.

          Adrien Grand added a comment -

          I use two parallel arrays to sort the documents (docs and values)

          I updated the patch to use doc IDs as ords so that values are never swapped (only doc IDs) and the numeric doc values don't all need to be loaded into memory.

          So one option is to remove the class, but still keep a test around which does the addIndexes to make sure it works.

          +1

          I don't want however to add a main that is limited to NumericDV ... and I do think that stored fields / payload value are viable options.

          I still don't get why someone would use stored fields rather than doc values (either binary, sorted or numeric) to sort his index. I think it's important to make users understand that stored fields are only useful to display results?
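
          The doc-IDs-as-ords idea mentioned above can be sketched in plain Java (an illustrative sketch only, not the patch's actual code): the comparator looks values up by doc ID, so only the IDs move and the values are never swapped or copied.

```java
import java.util.Arrays;
import java.util.Comparator;

public class DocIdsAsOrds {
    public static void main(String[] args) {
        // Hypothetical per-document values, indexed by doc ID (in Lucene
        // these would come from a doc values source rather than an array).
        final long[] values = {42, 7, 19, 3};

        // Sort doc IDs by their values; values[] is only read, never moved.
        Integer[] docs = {0, 1, 2, 3};
        Arrays.sort(docs, Comparator.comparingLong(d -> values[d]));
        System.out.println(Arrays.toString(docs)); // [3, 1, 2, 0]
    }
}
```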

          Shai Erera added a comment -

          Thanks Rob - I didn't know we could check these things. Certainly better than suppressing the entire Codec.

          Adrien, thanks for the update as well. So if someone loads NumericDV (default), indeed there's no need to copy the values again into an array. If someone uses DiskDVFormat though, list.get will access the disk on every call ... but I guess that's fine since if someone wanted to save RAM, he should be ready to pay the price, and we should respect him.

          I still don't get why someone would use stored fields rather than doc values (either binary, sorted or numeric) to sort his index. I think it's important to make users understand that stored fields are only useful to display results?

          Someone might have an existing index without DV. Also, who said that a stored field used for display cannot be used to sort the index?
          But, since it's quite trivial to implement, I'll remove both Payload and StoredFields. I'll also make Reverse and Numeric sorters inner classes (though public) of Sorter.

          I added a check in SortingAtomicReader ctor that old2new.length == reader.maxDoc(), to ensure that sorters provide a mapping for every document in the index.
          I'll get rid of IndexSorter, but keep a test around + add a code example to the SortingAR javadocs showing how to use it for addIndexes.

          Will upload a new patch later.
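
          A minimal sketch of the old-to-new mapping and the maxDoc check described above (plain Java with illustrative names, not the actual patch code): the sorter's output, new order to old doc ID, is inverted into the old2new map a sorting reader would use, after verifying it covers every document.

```java
import java.util.Arrays;

public class Old2NewSketch {
    public static void main(String[] args) {
        // Hypothetical sorter output: newDocs[i] is the old doc ID that
        // should become new doc ID i.
        int[] newDocs = {2, 0, 3, 1};
        int maxDoc = 4; // reader.maxDoc() in the real code

        // A sorter must provide a mapping for every document in the index.
        if (newDocs.length != maxDoc) {
            throw new IllegalArgumentException("sorter must map every doc");
        }

        // Invert the permutation into the old-to-new mapping used to remap
        // doc IDs on the fly.
        int[] old2new = new int[maxDoc];
        for (int newId = 0; newId < maxDoc; newId++) {
            old2new[newDocs[newId]] = newId;
        }
        System.out.println(Arrays.toString(old2new)); // [1, 3, 0, 2]
    }
}
```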

          Andrzej Bialecki added a comment -

          I still don't get why someone would use stored fields rather than doc values (either binary, sorted or numeric) to sort his index. I think it's important to make users understand that stored fields are only useful to display results?

          This is a legacy of the original usage of this tool in Nutch - indexes would use a PageRank value as a document boost, and that was the value to be used for sorting - but since the doc boost is not recoverable from an existing index the value itself was stored in a stored field.

          And definitely DV didn't exist yet at that time.

          Shai Erera added a comment -

          Patch removes IndexSorter (but keeps IndexSortingTest). I also:

          • Moved ReverseDocIDSorter to a singleton on Sorter, and made IndexSortingTest randomly pick it or NumericDVSorter.
          • Removed Payload and StoredFields sorter. As a consequence, removed SorterTest (sorters are covered by IndexSortingTest).
          • Added example code to SortingAtomicReader jdocs.
          Shai Erera added a comment -

          I think it's ready. If there are no objections, I'd like to commit it later today.

          Shai Erera added a comment -

          Patch avoids encoding offsets in memory if offsets are not indexed. This saves 10 bytes per position in most cases (since offsets are not indexed by default, even for positions-enabled fields, e.g. TextField).

          Shai Erera added a comment -

          Avoid encoding freqs in memory if freqs were not indexed (even if they are requested in flags).

          Commit Tag Bot added a comment -

          [trunk commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1454801

          LUCENE-3918: port IndexSorter to trunk API

          Shai Erera added a comment -

          Committed to trunk and 4x. Thanks Anat, your work has re-ignited this issue!

          Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1454804

          LUCENE-3918: port IndexSorter to trunk API

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee: Shai Erera
            • Reporter: Robert Muir
            • Votes: 2
            • Watchers: 8