Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available
    1. LUCENE-4858.patch
      31 kB
      Shai Erera
    2. LUCENE-4858.patch
      31 kB
      Shai Erera
    3. LUCENE-4858.patch
      31 kB
      Adrien Grand
    4. LUCENE-4858.patch
      23 kB
      Shai Erera
    5. LUCENE-4858.patch
      25 kB
      Shai Erera
    6. LUCENE-4858.patch
      21 kB
      Shai Erera
    7. LUCENE-4858.patch
      18 kB
      Adrien Grand

      Issue Links

        Activity

        Hide
        Adrien Grand added a comment -

        What in the patch guarantees that any segment with more than maxBufferedDocs is sorted? Perhaps I've missed it, but I looked for code which ensures every such segment gets picked up by SortingMP, however didn't find it.

        I don't think that in general we should make assumptions based on a maxBufferedDocs setting because the default setting in IWC is per RAM consumption and also it seems slightly unrelated. I.e. if a segment is sorted, but has deletions such that numDocs < maxBufferedDocs, we do full collection, while we can early terminate as usual?

        Indeed I think that finding out which segments are sorted is the main issue. My idea was to say that if you want to use early query termination, you need to set maxBufferedDocs to a given limit (low values improve early query termination while high values improve indexing speed), so that large segments (the ones that are interesting for early query termination since they require time to collect) that have more than maxBufferedDocs documents (deleted or not) are known to be sorted, because they result from a merge. Of course, this could miss some small segments which are sorted but since they are small, they're not as interesting for early query termination?

        What options do we have here? I think you mentionned tagging sorted segments, do you have an idea where/how we could do that?

        And hopefully we can stuff the early termination logic down to IndexSearcher eventually. There are other scenarios for early termination, such as time limit, and therefore I think it's ok if we have an EarlyTerminationException which IndexSearcher responds to.

        Inded, I think this makes sense.

        Show
        Adrien Grand added a comment - What in the patch guarantees that any segment with more than maxBufferedDocs is sorted? Perhaps I've missed it, but I looked for code which ensures every such segment gets picked up by SortingMP, however didn't find it. I don't think that in general we should make assumptions based on a maxBufferedDocs setting because the default setting in IWC is per RAM consumption and also it seems slightly unrelated. I.e. if a segment is sorted, but has deletions such that numDocs < maxBufferedDocs, we do full collection, while we can early terminate as usual? Indeed I think that finding out which segments are sorted is the main issue. My idea was to say that if you want to use early query termination, you need to set maxBufferedDocs to a given limit (low values improve early query termination while high values improve indexing speed), so that large segments (the ones that are interesting for early query termination since they require time to collect) that have more than maxBufferedDocs documents (deleted or not) are known to be sorted, because they result from a merge. Of course, this could miss some small segments which are sorted but since they are small, they're not as interesting for early query termination? What options do we have here? I think you mentionned tagging sorted segments, do you have an idea where/how we could do that? And hopefully we can stuff the early termination logic down to IndexSearcher eventually. There are other scenarios for early termination, such as time limit, and therefore I think it's ok if we have an EarlyTerminationException which IndexSearcher responds to. Inded, I think this makes sense.
        Hide
        Shai Erera added a comment -

        I am thinking for some time on segment-level metadata. Something like SegmentInfo.attributes(). These attributes are not exposed on the API today, and therefore they are not a good fit, but we should be able to expose something like that, which by default is empty, but SortingAtomicReader can set. A Map<String,String> basically. I wanted to use something similar in facets to record per segment some of the FacetIndexingParams, so that the index is more self-descriptive to code that reads it.

        Perhaps we should do that (separate issue)? Maybe AtomicReader.getMetadata() returns a Map?

        Show
        Shai Erera added a comment - I am thinking for some time on segment-level metadata. Something like SegmentInfo.attributes(). These attributes are not exposed on the API today, and therefore they are not a good fit, but we should be able to expose something like that, which by default is empty, but SortingAtomicReader can set. A Map<String,String> basically. I wanted to use something similar in facets to record per segment some of the FacetIndexingParams, so that the index is more self-descriptive to code that reads it. Perhaps we should do that (separate issue)? Maybe AtomicReader.getMetadata() returns a Map?
        Hide
        Robert Muir added a comment -

        Can't we split this issue up? I think the current discussion is focused much on this sorted segments thing, but thats not the only possible implementation for this kind of thing.

        Already you can terminate the whole query (globally) by throwing an exception from your collector.

        So to me the next step is to make it easier to terminate the collection of just one segment and have IndexSearcher proceed to the other ones. This can just be a special marker exception (extends RuntimeException) in IndexSearcher. Then any collector can do this.

        Show
        Robert Muir added a comment - Can't we split this issue up? I think the current discussion is focused much on this sorted segments thing, but thats not the only possible implementation for this kind of thing. Already you can terminate the whole query (globally) by throwing an exception from your collector. So to me the next step is to make it easier to terminate the collection of just one segment and have IndexSearcher proceed to the other ones. This can just be a special marker exception (extends RuntimeException) in IndexSearcher. Then any collector can do this.
        Hide
        Adrien Grand added a comment -

        Can't we split this issue up? I think the current discussion is focused much on this sorted segments thing, but thats not the only possible implementation for this kind of thing.

        I created LUCENE-4862.

        Show
        Adrien Grand added a comment - Can't we split this issue up? I think the current discussion is focused much on this sorted segments thing, but thats not the only possible implementation for this kind of thing. I created LUCENE-4862 .
        Hide
        Shai Erera added a comment -

        You're right Rob. And I don't believe we intended to tackle the marking of a segment sorted/not in this issue (I wrote that we should do it separately). It's just what span-off this issue. We can introduce here this runtime exception and discuss on yet another issue how to add metadata at the segment-level.

        Show
        Shai Erera added a comment - You're right Rob. And I don't believe we intended to tackle the marking of a segment sorted/not in this issue (I wrote that we should do it separately). It's just what span-off this issue. We can introduce here this runtime exception and discuss on yet another issue how to add metadata at the segment-level.
        Hide
        Robert Muir added a comment -

        OK... i was just confused by the comments. I think here can add the general capability of "Move on" via this exception, add a simple test or two that just use little MockCollectors, and resolve this issue pretty easily.

        Show
        Robert Muir added a comment - OK... i was just confused by the comments. I think here can add the general capability of "Move on" via this exception, add a simple test or two that just use little MockCollectors, and resolve this issue pretty easily.
        Hide
        Adrien Grand added a comment -

        I am thinking for some time on segment-level metadata. Something like SegmentInfo.attributes().

        I agree that something like SegmentInfo.attributes would be helpful but why not SegmentInfo.attributes themselves? (I'm not trying to push for it, just curious what their use-cases are, they seem to be unused today?)

        Show
        Adrien Grand added a comment - I am thinking for some time on segment-level metadata. Something like SegmentInfo.attributes(). I agree that something like SegmentInfo.attributes would be helpful but why not SegmentInfo.attributes themselves? (I'm not trying to push for it, just curious what their use-cases are, they seem to be unused today?)
        Hide
        Robert Muir added a comment -

        Why is additional metadata necessary?

        Isnt SegmentInfo.getDiagnostics().get("source") enough to tell you if the segment was created via a flush or a merge... maybe a little evil but the data is already there.

        Show
        Robert Muir added a comment - Why is additional metadata necessary? Isnt SegmentInfo.getDiagnostics().get("source") enough to tell you if the segment was created via a flush or a merge... maybe a little evil but the data is already there.
        Hide
        Adrien Grand added a comment -

        Why is additional metadata necessary? Isnt SegmentInfo.getDiagnostics().get("source") enough to tell you if the segment was created via a flush or a merge... maybe a little evil but the data is already there.

        It looks good, I hadn't noticed that we store this information in the diagnostics, thanks!

        Show
        Adrien Grand added a comment - Why is additional metadata necessary? Isnt SegmentInfo.getDiagnostics().get("source") enough to tell you if the segment was created via a flush or a merge... maybe a little evil but the data is already there. It looks good, I hadn't noticed that we store this information in the diagnostics, thanks!
        Hide
        Robert Muir added a comment -

        I'm not trying to push for it, just curious what their use-cases are, they seem to be unused today?)

        They are not unused really, its a way for a codec to store special stuff that can then be read later. One use case today is backwards compat, for example 3.x codec stores lots of crazy stuff here that only it cares about, like shared doc store information

        Show
        Robert Muir added a comment - I'm not trying to push for it, just curious what their use-cases are, they seem to be unused today?) They are not unused really, its a way for a codec to store special stuff that can then be read later. One use case today is backwards compat, for example 3.x codec stores lots of crazy stuff here that only it cares about, like shared doc store information
        Hide
        Adrien Grand added a comment -

        Here is a first patch:

        • New convenient abstract collector class: EarlyTerminationCollector which makes no assumption about the readers it collects (it relies on sub-classes in order to know whether the collected context is sorted and how many docs should be collected at most).
        • New collector: SortingMergePolicyCollector that assumes that segments that result from a merge are sorted (to do so it inspect the diagnostics of the SegmentInfo). I named it this way to make it clear it needs to be used with SortingMergePolicy.
        • I made SegmentReader.getSegmentInfo public (instead of pkg-private) to be able to read the diagnostics. Is it OK to do so/Is there a cleaner way to expose diagnostics to high-level APIs?
        Show
        Adrien Grand added a comment - Here is a first patch: New convenient abstract collector class: EarlyTerminationCollector which makes no assumption about the readers it collects (it relies on sub-classes in order to know whether the collected context is sorted and how many docs should be collected at most). New collector: SortingMergePolicyCollector that assumes that segments that result from a merge are sorted (to do so it inspect the diagnostics of the SegmentInfo). I named it this way to make it clear it needs to be used with SortingMergePolicy. I made SegmentReader.getSegmentInfo public (instead of pkg-private) to be able to read the diagnostics. Is it OK to do so/Is there a cleaner way to expose diagnostics to high-level APIs?
        Hide
        Adrien Grand added a comment -

        Shai Erera, maybe you want to take a look at this patch?

        Show
        Adrien Grand added a comment - Shai Erera , maybe you want to take a look at this patch?
        Hide
        Shai Erera added a comment -

        I will! Slow to respond because I'm on vacation, but I will review it when I'm back (Tue).

        Show
        Shai Erera added a comment - I will! Slow to respond because I'm on vacation, but I will review it when I'm back (Tue).
        Hide
        Adrien Grand added a comment -

        No problem, enjoy your vacation!

        Show
        Adrien Grand added a comment - No problem, enjoy your vacation!
        Hide
        Shai Erera added a comment -

        Patch looks good. I wonder if we can make the 'sorted' decision more specific though. E.g., if OneMerge.info was not assigned directly by IndexWriter, but rather IW called OneMerge.setInfo() and SortingOneMerge would add to the diagnostics another property 'sorted=true', then the collector would be more robust – right now you can use it with unsorted segments and you'd silently get wrong results. Even though it is documented, I feel that with this tiny hook, we can make it work correctly, not tripping users who don't read javadocs. What do you think?

        Show
        Shai Erera added a comment - Patch looks good. I wonder if we can make the 'sorted' decision more specific though. E.g., if OneMerge.info was not assigned directly by IndexWriter, but rather IW called OneMerge.setInfo() and SortingOneMerge would add to the diagnostics another property 'sorted=true', then the collector would be more robust – right now you can use it with unsorted segments and you'd silently get wrong results. Even though it is documented, I feel that with this tiny hook, we can make it work correctly, not tripping users who don't read javadocs. What do you think?
        Hide
        Shai Erera added a comment -

        Also, I don't think that this works with addIndexes right? So if someone follows SortingAtomicReader.addIndexes jdoc example, he cannot use this collector.

        Show
        Shai Erera added a comment - Also, I don't think that this works with addIndexes right? So if someone follows SortingAtomicReader.addIndexes jdoc example, he cannot use this collector.
        Hide
        Shai Erera added a comment -

        Here's a quick patch that adds OneMerge.setInfo which SortingOneMerge overrides to add 'sorted' property. SortingEarlyTerminationCollector modified to read that property instead of SOURCE. 'core' and 'misc' tests pass.

        This still does not address addIndexes. I think it will be good if we can have a SortingEarlyTerminationCollector which works with both modes. I'll try that later.

        Show
        Shai Erera added a comment - Here's a quick patch that adds OneMerge.setInfo which SortingOneMerge overrides to add 'sorted' property. SortingEarlyTerminationCollector modified to read that property instead of SOURCE. 'core' and 'misc' tests pass. This still does not address addIndexes. I think it will be good if we can have a SortingEarlyTerminationCollector which works with both modes. I'll try that later.
        Hide
        Adrien Grand added a comment -

        I like making it more robust by ensuring the segment is sorted and not only the result from a merge! Maybe we could even go further and add an identifier of the Sorter which has been used to sort the segment. This way, early query termination could keep working correctly even if the user decides to change the sort order of the index?

        This still does not address addIndexes. I think it will be good if we can have a SortingEarlyTerminationCollector which works with both modes. I'll try that later.

        I'm still thinking about whether IndexWriter should sort the provided readers in addIndexes. On the one hand, I understand that given that the user can wrap the readers with a SortingAtomicReader himself, there is little added value in sorting the readers in addIndexes. But on the other hand, this method "merges the provided indexes into this index" (quoting the javadocs) so not sorting the readers while a SortingMergePolicy is used feels like the MergePolicy is being bypassed. So net/net I think I prefer making addIndexes sort the readers and have a dedicated method in MergePolicy to handle addIndexes? (And this would make it easy to add additional diagnostic to the segments resulting from a call to addIndexes.)

        Show
        Adrien Grand added a comment - I like making it more robust by ensuring the segment is sorted and not only the result from a merge! Maybe we could even go further and add an identifier of the Sorter which has been used to sort the segment. This way, early query termination could keep working correctly even if the user decides to change the sort order of the index? This still does not address addIndexes. I think it will be good if we can have a SortingEarlyTerminationCollector which works with both modes. I'll try that later. I'm still thinking about whether IndexWriter should sort the provided readers in addIndexes. On the one hand, I understand that given that the user can wrap the readers with a SortingAtomicReader himself, there is little added value in sorting the readers in addIndexes. But on the other hand, this method "merges the provided indexes into this index" (quoting the javadocs) so not sorting the readers while a SortingMergePolicy is used feels like the MergePolicy is being bypassed. So net/net I think I prefer making addIndexes sort the readers and have a dedicated method in MergePolicy to handle addIndexes? (And this would make it easy to add additional diagnostic to the segments resulting from a call to addIndexes.)
        Hide
        Shai Erera added a comment -

        Maybe we could even go further and add an identifier of the Sorter which has been used to sort the segment

        +1. This makes sense. We need to be as robust as possible. If a user makes a mistake, it's best if he can avoid tripping himself. It needs to be something unique, i.e. not just the sorter class, but e.g. for NumericDV also the field. Perhaps Sorter should have a sortKey? Then we record Sorter.class_Sorter.sortKey?

        I agree that addIndexes should use MergePolicy. Unlike the Directory version, which shallow-copies the segments, including whatever Diagnostics information they contain, the IR version uses SegmentMerger, however bypasses MP. So e.g. if the app uses TieredMP, limiting the merged segment size to 10 GB, you can addIndexes a 20-segment index, totalling 100 GB, and end up in a single 100 GB segment. That's ... uexpected.

        So I think we need something on MP, maybe findMergesForAddIndexes... and then it will be easier to control how these indexes are added. If that's the direction, perhaps we do this in a different issue, as it's unrelated to sorting?

        And, while diagnostics allow us to record sorted + sorter, we're still limited to SegmentReader. In practice this may not be a true limitation, but I feel that if AtomicReader exposed metadata(), like commitData() for the composite, it will give us more freedom. This collector does not need to be limited to SegmentReader only ... but I guess it's ok for now, at least, I know others don't like the idea of having metadata() on AR.

        Show
        Shai Erera added a comment - Maybe we could even go further and add an identifier of the Sorter which has been used to sort the segment +1. This makes sense. We need to be as robust as possible. If a user makes a mistake, it's best if he can avoid tripping himself. It needs to be something unique, i.e. not just the sorter class, but e.g. for NumericDV also the field. Perhaps Sorter should have a sortKey? Then we record Sorter.class_Sorter.sortKey? I agree that addIndexes should use MergePolicy. Unlike the Directory version, which shallow-copies the segments, including whatever Diagnostics information they contain, the IR version uses SegmentMerger, however bypasses MP. So e.g. if the app uses TieredMP, limiting the merged segment size to 10 GB, you can addIndexes a 20-segment index, totalling 100 GB, and end up in a single 100 GB segment. That's ... uexpected. So I think we need something on MP, maybe findMergesForAddIndexes... and then it will be easier to control how these indexes are added. If that's the direction, perhaps we do this in a different issue, as it's unrelated to sorting? And, while diagnostics allow us to record sorted + sorter, we're still limited to SegmentReader. In practice this may not be a true limitation, but I feel that if AtomicReader exposed metadata(), like commitData() for the composite, it will give us more freedom. This collector does not need to be limited to SegmentReader only ... but I guess it's ok for now, at least, I know others don't like the idea of having metadata() on AR.
        Hide
        Robert Muir added a comment -

        +1. This makes sense. We need to be as robust as possible. If a user makes a mistake, it's best if he can avoid tripping himself.

        Then step number 1 is not to futz around with adding more and more metadata to the index and changing the behavior of addindexes and so on, but instead to enforce the collector is only used when sorting by index order!

        The current patch goes to great extremes to add more and more and more complexity at the index-level but misses the forrest for the trees.

        Show
        Robert Muir added a comment - +1. This makes sense. We need to be as robust as possible. If a user makes a mistake, it's best if he can avoid tripping himself. Then step number 1 is not to futz around with adding more and more metadata to the index and changing the behavior of addindexes and so on, but instead to enforce the collector is only used when sorting by index order! The current patch goes to great extremes to add more and more and more complexity at the index-level but misses the forrest for the trees.
        Hide
        Robert Muir added a comment -

        Just to be clear, I'm not against allowing someone to tradeoff relevance for speed (even where the requested sort is e.g. by score), but I just think that its way more important to make these tradeoffs clear in the APIs than to worry about expert things like addIndexes.

        By having good APIs that are clear about this (e.g. a "safe" way and also an "unsafe" way) with good javadocs, then its more likely users will be happy and not run into traps.

        The stuff like addIndexes is still good, I just dont think its as important in the big picture: its so esoteric that it need not even be addressed in the initial commit. I know good APIs and javadocs arent as sexy as adding more stuff to the index, but its much more important here IMO.

        Show
        Robert Muir added a comment - Just to be clear, I'm not against allowing someone to tradeoff relevance for speed (even where the requested sort is e.g. by score), but I just think that its way more important to make these tradeoffs clear in the APIs than to worry about expert things like addIndexes. By having good APIs that are clear about this (e.g. a "safe" way and also an "unsafe" way) with good javadocs, then its more likely users will be happy and not run into traps. The stuff like addIndexes is still good, I just dont think its as important in the big picture: its so esoteric that it need not even be addressed in the initial commit. I know good APIs and javadocs arent as sexy as adding more stuff to the index, but its much more important here IMO.
        Hide
        Shai Erera added a comment -

        I don't disagree, I just think that completely orthogonal to this issue, addIndexes should respect IW's MP. I've been working with an app that relied on having segments no bigger than X GB (controlled by MP), and having addIndexes completely ignore these settings is a bug, at least to me. Who said that addIndexes must end up w/ a single segment? Anyway, this is something we should discuss on a separate issue, since it's unrelated to sorting. I'm going to reproduce this separately, and then open a dedicated issue. This one can wait or proceed unrelated.

        The current patch goes to great extremes to add more and more and more complexity at the index-level but misses the forrest for the trees.

        I don't think so? It only attempts to protect the user from silently tripping on his bugs. To the user, nothing really changes (except that his Sorter impls will need to define a sortKey or whatever .. big deal). The rest is completely opaque. I'm sure you won't object to protecting users from themselves, at least not when all we need is to make this tiny little change to OneMerge.

        Show
        Shai Erera added a comment - I don't disagree, I just think that completely orthogonal to this issue, addIndexes should respect IW's MP. I've been working with an app that relied on having segments no bigger than X GB (controlled by MP), and having addIndexes completely ignore these settings is a bug, at least to me. Who said that addIndexes must end up w/ a single segment? Anyway, this is something we should discuss on a separate issue, since it's unrelated to sorting. I'm going to reproduce this separately, and then open a dedicated issue. This one can wait or proceed unrelated. The current patch goes to great extremes to add more and more and more complexity at the index-level but misses the forrest for the trees. I don't think so? It only attempts to protect the user from silently tripping on his bugs. To the user, nothing really changes (except that his Sorter impls will need to define a sortKey or whatever .. big deal). The rest is completely opaque. I'm sure you won't object to protecting users from themselves, at least not when all we need is to make this tiny little change to OneMerge.
        Hide
        Shai Erera added a comment -

        Another way to fix addIndexes is to never merger, but rather behave like addIndexes(Directory) – iterate on leaves and call SegmentMerger.merge() on each one of them. App can call maybeMerge afterwards. addIndexes(IndexReader) is intended, mostly I think, for filtering readers, otherwise the Directory version is much faster. Fixing addIndexes like that makes it consistent with the Directory version, and still accomplishing its goal. Note, it does not address the 'sorted' issue, but as I wrote a couple times already, the two are unrelated.

        Show
        Shai Erera added a comment - Another way to fix addIndexes is to never merger, but rather behave like addIndexes(Directory) – iterate on leaves and call SegmentMerger.merge() on each one of them. App can call maybeMerge afterwards. addIndexes(IndexReader) is intended, mostly I think, for filtering readers, otherwise the Directory version is much faster. Fixing addIndexes like that makes it consistent with the Directory version, and still accomplishing its goal. Note, it does not address the 'sorted' issue, but as I wrote a couple times already, the two are unrelated.
        Hide
        Shai Erera added a comment -

        Adrien, let's keep this issue focused on the Collector working in conjunction with SortingMP. I think the added info to diagnostics is important, to prevent users tripping themselves. Let's also add the Sorter ID/Key, as another way to make the collector robust.

        We can deal w/ addIndexes later. If it will be modified to use MP, we'll get that support for free; if not, we'll think about what we need to do.

        Show
        Shai Erera added a comment - Adrien, let's keep this issue focused on the Collector working in conjunction with SortingMP. I think the added info to diagnostics is important, to prevent users tripping themselves. Let's also add the Sorter ID/Key, as another way to make the collector robust. We can deal w/ addIndexes later. If it will be modified to use MP, we'll get that support for free; if not, we'll think about what we need to do.
        Hide
        Adrien Grand added a comment -

        We can deal w/ addIndexes later. If it will be modified to use MP, we'll get that support for free; if not, we'll think about what we need to do.

        Sounds good to me.

        Show
        Adrien Grand added a comment - We can deal w/ addIndexes later. If it will be modified to use MP, we'll get that support for free; if not, we'll think about what we need to do. Sounds good to me.
        Hide
        Shai Erera added a comment -

        Patch adds Sorter.getID() and the collector to check whether the given Sorter matches the one that's recorded in the segment. I added a test to TestEarlyTermination which fails if the added check to sorted() is removed.

        Show
        Shai Erera added a comment - Patch adds Sorter.getID() and the collector to check whether the given Sorter matches the one that's recorded in the segment. I added a test to TestEarlyTermination which fails if the added check to sorted() is removed.
        Hide
        Adrien Grand added a comment -

        Thanks Shai, storing the Sorter is a good step forward to avoid traps. Now that SORTER_ID_PROP is added to the diagnostics, maybe that SORTED_DIAGNOSTICS_PROP is redundant?

        I would like that we make the collector more robust too. Nothing prevent you from wrapping eg. a TotalHitCountCollector inside it. I've been thinking about adding an optional operation to Sorter that returns a oal.search.Sort that describes the index order so that we could hide all the early-termination complexity behind a custom IndexSearcher that could never return wrong results. This would make the API accessible to casual users while super-expert users could still write custom collectors if they need approximate results. But maybe we don't even need to provide an expert API given how straightforward it is to detect if the segment is sorted and to terminate the collection on a segment. What do you think?

        I'll work on improving documentation as well.

        Show
        Adrien Grand added a comment - Thanks Shai, storing the Sorter is a good step forward to avoid traps. Now that SORTER_ID_PROP is added to the diagnostics, maybe that SORTED_DIAGNOSTICS_PROP is redundant? I would like that we make the collector more robust too. Nothing prevent you from wrapping eg. a TotalHitCountCollector inside it. I've been thinking about adding an optional operation to Sorter that returns a oal.search.Sort that describes the index order so that we could hide all the early-termination complexity behind a custom IndexSearcher that could never return wrong results. This would make the API accessible to casual users while super-expert users could still write custom collectors if they need approximate results. But maybe we don't even need to provide an expert API given how straightforward it is to detect if the segment is sorted and to terminate the collection on a segment. What do you think? I'll work on improving documentation as well.
        Hide
        Shai Erera added a comment -

        Now that SORTER_ID_PROP is added to the diagnostics, maybe that SORTED_DIAGNOSTICS_PROP is redundant?

        Good point.

        Nothing prevent you from wrapping eg. a TotalHitCountCollector inside it

        Well, I think that's the point where we're trying to protect the user too much. I suggest that we commit what we have, maybe clarify the javadocs to explain better what early termination means, and fix/make it more robust when users actually trip on something.

        I'll work on improving documentation as well.

        Thanks. Would you mind to get rid of SORTED prop in that patch too?

        Show
        Shai Erera added a comment - Now that SORTER_ID_PROP is added to the diagnostics, maybe that SORTED_DIAGNOSTICS_PROP is redundant? Good point. Nothing prevent you from wrapping eg. a TotalHitCountCollector inside it Well, I think that's the point where we're trying to protect the user too much. I suggest that we commit what we have, maybe clarify the javadocs to explain better what early termination means, and fix/make it more robust when users actually trip on something. I'll work on improving documentation as well. Thanks. Would you mind to get rid of SORTED prop in that patch too?
        Hide
        Shai Erera added a comment -

        Adrien and I chatted about it and we think that what's best to do as part of this issue is:

        • Make a concrete EarlyTerminatingSortedCollector (instead of the ones in the patch), which takes a Sorter and numDocs. For every segment, it checks whether the reader is sorted and collects up to numDocs from each sorted segment, or all docs from unsorted segments.
          • That's basically a combination of what EarlyTerminationCollector and SortingMPCollector do in the current patch.
        • Add isSorted(AtomicReader, Sorter) to SortingMP, as a utility method for other early terminating collectors (Adrien has an idea about a few, but not as part of this issue).
        • Clarify documentation of this collector (e.g. what early termination really means).

        I'll work on patch in the next few days, maybe during the weekend.

        Show
        Shai Erera added a comment - Adrien and I chatted about it and we think that what's best to do as part of this issue is: Make a concrete EarlyTerminatingSortedCollector (instead of the ones in the patch), which takes a Sorter and numDocs. For every segment, it checks whether the reader is sorted and collects up to numDocs from each sorted segment, or all docs from unsorted segments. That's basically a combination of what EarlyTerminationCollector and SortingMPCollector do in the current patch. Add isSorted(AtomicReader, Sorter) to SortingMP, as a utility method for other early terminating collectors (Adrien has an idea about a few, but not as part of this issue). Clarify documentation of this collector (e.g. what early termination really means). I'll work on patch in the next few days, maybe during the weekend.
        Hide
        Shai Erera added a comment -

        Patch implements the above mentioned changes. EarlyTerminationSortingCollector is a simple implementation and complements SortingAR and SortingMP.

        Show
        Shai Erera added a comment - Patch implements the above mentioned changes. EarlyTerminationSortingCollector is a simple implementation and complements SortingAR and SortingMP.
        Hide
        Adrien Grand added a comment -

        Thanks Shai, this looks good! I modified a bit your patch to fix the collector constructor visiblity (from protected to public) and added some documentation. I'd like to discuss whether we should actually add the name of the Sorter class in the "sorter" property of the diagnostics. I would rather remove it so that renaming a Sorter class doesn't break compatibility, what do you think?

        Show
        Adrien Grand added a comment - Thanks Shai, this looks good! I modified a bit your patch to fix the collector constructor visiblity (from protected to public) and added some documentation. I'd like to discuss whether we should actually add the name of the Sorter class in the "sorter" property of the diagnostics. I would rather remove it so that renaming a Sorter class doesn't break compatibility, what do you think?
        Hide
        Shai Erera added a comment -

        collector constructor visiblity

        Good catch, thanks!

        I'd like to discuss whether we should actually add the name of the Sorter class in the "sorter" property of the diagnostics

        +1, let's remove it and rely solely on Sorter.getID(). Besides class refactoring, it's kinda odd that we don't trust the Sorter ID, and add the class name. Rather, Numeric should return ID "numericdv_" + fieldName. And REVERSE reverse_docs. And we should also clarify this in getID() javadocs. What do you think?

        Show
        Shai Erera added a comment - collector constructor visiblity Good catch, thanks! I'd like to discuss whether we should actually add the name of the Sorter class in the "sorter" property of the diagnostics +1, let's remove it and rely solely on Sorter.getID(). Besides class refactoring, it's kinda odd that we don't trust the Sorter ID, and add the class name. Rather, Numeric should return ID "numericdv_" + fieldName . And REVERSE reverse_docs . And we should also clarify this in getID() javadocs. What do you think?
        Hide
        Shai Erera added a comment -

        Patch removes sorter class from the recorded property and fixes existing sorters getID to return a unique ID.

        Adrien, do we have anything else to do here, or are we ready to go? If so, I'll add a CHANGES entry and commit later.

        Show
        Shai Erera added a comment - Patch removes sorter class from the recorded property and fixes existing sorters getID to return a unique ID. Adrien, do we have anything else to do here, or are we ready to go? If so, I'll add a CHANGES entry and commit later.
        Hide
        Adrien Grand added a comment -

        Thanks for updating the patch, Shai.

        Adrien, do we have anything else to do here, or are we ready to go? If so, I'll add a CHANGES entry and commit later.

        The patch looks good to me. Maybe NumericDocValuesSorter.getID() could just return 'fieldName'? I think it's not necessary to describe the doc values type since they are exclusive and doc values are the natural way to sort documents by field values in Lucene? Otherwise +1.

        Show
        Adrien Grand added a comment - Thanks for updating the patch, Shai. Adrien, do we have anything else to do here, or are we ready to go? If so, I'll add a CHANGES entry and commit later. The patch looks good to me. Maybe NumericDocValuesSorter.getID() could just return 'fieldName'? I think it's not necessary to describe the doc values type since they are exclusive and doc values are the natural way to sort documents by field values in Lucene? Otherwise +1.
        Hide
        Shai Erera added a comment -

        Maybe NumericDocValuesSorter.getID() could just return 'fieldName'?

        The reason I did that is in case someone will want to sort by a stored field and numeric field which have same names. I know it's probably very low chance, but "numericdv_field" is really unique, as you cannot have two numeric DV fields with the same name, but different meaning.

        Show
        Shai Erera added a comment - Maybe NumericDocValuesSorter.getID() could just return 'fieldName'? The reason I did that is in case someone will want to sort by a stored field and numeric field which have same names. I know it's probably very low chance, but "numericdv_field" is really unique, as you cannot have two numeric DV fields with the same name, but different meaning.
        Hide
        Adrien Grand added a comment -

        The reason I did that is in case someone will want to sort by a stored field and numeric field which have same names.

        A Sorter which sorts by stored field values would indeed need to add more information to its ID (at least to say that it is a stored field).

        "numericdv_field" is really unique, as you cannot have two numeric DV fields with the same name, but different meaning.

        Since doc values types are exclusive, could we then just say that these are doc values without mentioning the type? I think this would help keep up with doc values types evolutions (for example there used to be BYTES_FIXED_SORTED and BYTES_VAR_SORTED which have been merged into SORTED) and/or additions (SORTED_SET). I would also prefer having something even more human-readable (like "DocValues(fieldName=$fieldName,order=asc|desc)"?).

        Show
        Adrien Grand added a comment - The reason I did that is in case someone will want to sort by a stored field and numeric field which have same names. A Sorter which sorts by stored field values would indeed need to add more information to its ID (at least to say that it is a stored field). "numericdv_field" is really unique, as you cannot have two numeric DV fields with the same name, but different meaning. Since doc values types are exclusive, could we then just say that these are doc values without mentioning the type? I think this would help keep up with doc values types evolutions (for example there used to be BYTES_FIXED_SORTED and BYTES_VAR_SORTED which have been merged into SORTED) and/or additions (SORTED_SET). I would also prefer having something even more human-readable (like "DocValues(fieldName=$fieldName,order=asc|desc)"?).
        Hide
        Shai Erera added a comment -

        Since doc values types are exclusive, could we then just say that these are doc values without mentioning the type?

        +1. I mistakenly thought you can add both a numeric and binary DocValues to a document, under the same name. I prefer slightly less verbosity, but just because I think the fieldName and order part are redundant. So "DocValues($field,asc|desc)"?

        Show
        Shai Erera added a comment - Since doc values types are exclusive, could we then just say that these are doc values without mentioning the type? +1. I mistakenly thought you can add both a numeric and binary DocValues to a document, under the same name. I prefer slightly less verbosity, but just because I think the fieldName and order part are redundant. So "DocValues($field,asc|desc)"?
        Hide
        Adrien Grand added a comment -

        Sounds good to me!

        Show
        Adrien Grand added a comment - Sounds good to me!
        Hide
        Shai Erera added a comment -

        Patch adds CHANGES and improves getID impls. I think it's ready. I'll run some tests and if everything's ok, commit.

        Show
        Shai Erera added a comment - Patch adds CHANGES and improves getID impls. I think it's ready. I'll run some tests and if everything's ok, commit.
        Hide
        Adrien Grand added a comment -

        +1

        Show
        Adrien Grand added a comment - +1
        Hide
        Shai Erera added a comment -

        Committed to trunk and 4x. Thanks Adrien for the fun collaboration!

        Show
        Shai Erera added a comment - Committed to trunk and 4x. Thanks Adrien for the fun collaboration!
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development