Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I realized this glancing at LUCENE-7091.

      I think this should have points support or else people cannot move off of the deprecated LegacyXXX encodings?

      1. LUCENE-7093.patch
        25 kB
        Martijn van Groningen
      2. LUCENE-7093.patch
        23 kB
        Martijn van Groningen
      3. LUCENE-7093.patch
        21 kB
        Martijn van Groningen

        Issue Links

          Activity

          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          Attached a patch that adds point values support for the memory index.
          This patch is based on LUCENE-7091, which will need to get merged first before this one.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - Attached a patch that adds point values support for the memory index. This patch is based on LUCENE-7091 , which will need to get merged first before this one.
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          Synched patch with LUCENE-7091 and removed the an added constructor to BytesRefArray (after realising that BytesRefArray can only work with its own privately constructed byte pool) .

          Show
          martijn.v.groningen Martijn van Groningen added a comment - Synched patch with LUCENE-7091 and removed the an added constructor to BytesRefArray (after realising that BytesRefArray can only work with its own privately constructed byte pool) .
          Hide
          mikemccand Michael McCandless added a comment -

          Thanks Martijn van Groningen, this is nice!

          You don't have to box the point dimensionCount/numBytes up from int to Integer: 0 can safely mean "this field has no points".

          In addField you return right away if the field had point values, but this is dangerous because the field could also have e.g. doc values: (LatLonPoint just recently started doing this), maybe we can add a test case for that?

          Instead of making BytesRefArray.sort fully public, can we make a version that always sorts by natural order public? This class's entire existence scares me so I like to minimize what methods we make public, even for internal usage.

          Or maybe MemoryIndex should just use BytesRef[], instead of BytesRefArray, and sort that? The vast majority of the time we are looking at a single point value for the fields here? Your visit function could be simplified too: just call the compare only once, and if it crosses, visit all points (with doc and value); if the cell is inside the query, visit with just doc; else, do nothing?

          Show
          mikemccand Michael McCandless added a comment - Thanks Martijn van Groningen , this is nice! You don't have to box the point dimensionCount/numBytes up from int to Integer : 0 can safely mean "this field has no points". In addField you return right away if the field had point values, but this is dangerous because the field could also have e.g. doc values: ( LatLonPoint just recently started doing this), maybe we can add a test case for that? Instead of making BytesRefArray.sort fully public, can we make a version that always sorts by natural order public? This class's entire existence scares me so I like to minimize what methods we make public, even for internal usage. Or maybe MemoryIndex should just use BytesRef[] , instead of BytesRefArray , and sort that? The vast majority of the time we are looking at a single point value for the fields here? Your visit function could be simplified too: just call the compare only once, and if it crosses, visit all points (with doc and value); if the cell is inside the query, visit with just doc; else, do nothing?
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          In addField you return right away if the field had point values, but this is dangerous because the field could also have e.g. doc values: (LatLonPoint just recently started doing this), maybe we can add a test case for that?

          Good point. I'll add a test for this.

          Or maybe MemoryIndex should just use BytesRef[], instead of BytesRefArray, and sort that?

          +1, lets simplify here. MemoryIndex only exists for a short amount of time and then it is tossed away, so we shouldn't spend to0 much effort in space efficient data structures here.

          Your visit function could be simplified too: just call the compare only once, and if it crosses, visit all points (with doc and value);

          If we there many values (lets say a couple of hundred values) for a single field, wouldn't the current visit method be better. (because less comparisons will need to happen). I have seen cases where fields had relatively a large number of terms.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - In addField you return right away if the field had point values, but this is dangerous because the field could also have e.g. doc values: (LatLonPoint just recently started doing this), maybe we can add a test case for that? Good point. I'll add a test for this. Or maybe MemoryIndex should just use BytesRef[], instead of BytesRefArray, and sort that? +1, lets simplify here. MemoryIndex only exists for a short amount of time and then it is tossed away, so we shouldn't spend to0 much effort in space efficient data structures here. Your visit function could be simplified too: just call the compare only once, and if it crosses, visit all points (with doc and value); If we there many values (lets say a couple of hundred values) for a single field, wouldn't the current visit method be better. (because less comparisons will need to happen). I have seen cases where fields had relatively a large number of terms.
          Hide
          mikemccand Michael McCandless added a comment -

          If we there many values (lets say a couple of hundred values) for a single field, wouldn't the current visit method be better

          It's hard to say ... it invokes compare more than the normal BKD tree would, and the compare can be more costly than filtering a single value?

          Also, I think the visit method only works in the 1D case ... for e.g. the 2D case, I think it may be buggy because the byte[] values were sorted only by the first dimension?

          I feel like it's best to get a simple, correct, implementation in at first, and then worry about optimizing for the "many points in a single document" case later?

          Show
          mikemccand Michael McCandless added a comment - If we there many values (lets say a couple of hundred values) for a single field, wouldn't the current visit method be better It's hard to say ... it invokes compare more than the normal BKD tree would, and the compare can be more costly than filtering a single value? Also, I think the visit method only works in the 1D case ... for e.g. the 2D case, I think it may be buggy because the byte[] values were sorted only by the first dimension? I feel like it's best to get a simple, correct, implementation in at first, and then worry about optimizing for the "many points in a single document" case later?
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          Also, I think the visit method only works in the 1D case ... for e.g. the 2D case, I think it may be buggy because the byte[] values were sorted only by the first dimension?

          I didn't realize that this was buggy in 2d case. I assumed that I was sorting the values correctly for any dimension, because LongPoint and friends pack multiple dimensions into a single BytesRef. Just curious, how would the sort then work?

          I feel like it's best to get a simple, correct, implementation in at first, and then worry about optimizing for the "many points in a single document" case later?

          +1. I'll simplify the visitor method.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - Also, I think the visit method only works in the 1D case ... for e.g. the 2D case, I think it may be buggy because the byte[] values were sorted only by the first dimension? I didn't realize that this was buggy in 2d case. I assumed that I was sorting the values correctly for any dimension, because LongPoint and friends pack multiple dimensions into a single BytesRef. Just curious, how would the sort then work? I feel like it's best to get a simple, correct, implementation in at first, and then worry about optimizing for the "many points in a single document" case later? +1. I'll simplify the visitor method.
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          Added a new patch:

          • Removed the unnecessary boxing of the dimensionCount and numBytes.
          • Undo the changes made to BytesRefArray and simply use ByteRef[]
          • The MemoryIndexPointValues#intersect(...) was broken for anything with more than 1 dimension. This was caused by how the terms were sorted, anything beyond the first dimension was ignored, because the values were sorted by natural order. So removed the sorting and instead the memory index now only picks the lowest and highest packed value. The intersect() method now simple visits all points. This makes >2d cases work and is good enough for now.
          • If there are points in a field, don't stop indexing, so that if doc values is enabled, these get stored too.
          Show
          martijn.v.groningen Martijn van Groningen added a comment - Added a new patch: Removed the unnecessary boxing of the dimensionCount and numBytes. Undo the changes made to BytesRefArray and simply use ByteRef[] The MemoryIndexPointValues#intersect(...) was broken for anything with more than 1 dimension. This was caused by how the terms were sorted, anything beyond the first dimension was ignored, because the values were sorted by natural order. So removed the sorting and instead the memory index now only picks the lowest and highest packed value. The intersect() method now simple visits all points. This makes >2d cases work and is good enough for now. If there are points in a field, don't stop indexing, so that if doc values is enabled, these get stored too.
          Hide
          mikemccand Michael McCandless added a comment -
          Show
          mikemccand Michael McCandless added a comment - +1, thanks Martijn van Groningen !
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit fe58313de027b454395af2d8d13d18393fa7e2e0 in lucene-solr's branch refs/heads/branch_6_0 from Martijn van Groningen
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fe58313 ]

          LUCENE-7093: Added point values support to the memory index

          Show
          jira-bot ASF subversion and git services added a comment - Commit fe58313de027b454395af2d8d13d18393fa7e2e0 in lucene-solr's branch refs/heads/branch_6_0 from Martijn van Groningen [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fe58313 ] LUCENE-7093 : Added point values support to the memory index
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit d4fe0c3849d9e1140e5d6cd5d7b4d0e0495d9ea5 in lucene-solr's branch refs/heads/branch_6x from Martijn van Groningen
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d4fe0c3 ]

          LUCENE-7093: Added point values support to the memory index

          Show
          jira-bot ASF subversion and git services added a comment - Commit d4fe0c3849d9e1140e5d6cd5d7b4d0e0495d9ea5 in lucene-solr's branch refs/heads/branch_6x from Martijn van Groningen [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d4fe0c3 ] LUCENE-7093 : Added point values support to the memory index
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit c1dfeb8ef85be924f17f8aece46d008382d538e9 in lucene-solr's branch refs/heads/master from Martijn van Groningen
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c1dfeb8 ]

          LUCENE-7093: Added point values support to the memory index

          Show
          jira-bot ASF subversion and git services added a comment - Commit c1dfeb8ef85be924f17f8aece46d008382d538e9 in lucene-solr's branch refs/heads/master from Martijn van Groningen [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c1dfeb8 ] LUCENE-7093 : Added point values support to the memory index
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          pushed to branch_6_0, branch_6x and master

          Show
          martijn.v.groningen Martijn van Groningen added a comment - pushed to branch_6_0, branch_6x and master
          Hide
          dsmiley David Smiley added a comment -

          Mike, just curious, what scares you about BytesRefArray? (I make no objection to any of your feedback above)

          Show
          dsmiley David Smiley added a comment - Mike, just curious, what scares you about BytesRefArray? (I make no objection to any of your feedback above)
          Hide
          mikemccand Michael McCandless added a comment -

          Assignee: David Smiley (was: Martijn van Groningen)

          Hmm David Smiley, why do you assign Martijn van Groningen's recent MemoryIndex issues to yourself? This confuses me, e.g. this one is already pushed.

          I think you did the same for his recent issue adding doc values.

          Mike, just curious, what scares you about BytesRefArray? (I make no objection to any of your feedback above)

          It's a very very expert internal class, it has scary package-private methods, we should not be making it more public.

          It also duplicates various other "append byte[] to big pool of byte[] blocks" that Lucene has and we should at some point consolidate, yet they all have small but important differences in their behavior.

          Show
          mikemccand Michael McCandless added a comment - Assignee: David Smiley (was: Martijn van Groningen) Hmm David Smiley , why do you assign Martijn van Groningen 's recent MemoryIndex issues to yourself? This confuses me, e.g. this one is already pushed. I think you did the same for his recent issue adding doc values. Mike, just curious, what scares you about BytesRefArray? (I make no objection to any of your feedback above) It's a very very expert internal class, it has scary package-private methods, we should not be making it more public. It also duplicates various other "append byte[] to big pool of byte[] blocks" that Lucene has and we should at some point consolidate, yet they all have small but important differences in their behavior.
          Hide
          dsmiley David Smiley added a comment -

          Assignee: David Smiley (was: Martijn van Groningen)

          That is so weird; I did not intend to do that! I'm fixing it now. Perhaps I hit some keys accidentally that changed it; I don't know. Very disconcerting!

          Show
          dsmiley David Smiley added a comment - Assignee: David Smiley (was: Martijn van Groningen) That is so weird; I did not intend to do that! I'm fixing it now. Perhaps I hit some keys accidentally that changed it; I don't know. Very disconcerting!

            People

            • Assignee:
              martijn.v.groningen Martijn van Groningen
              Reporter:
              rcmuir Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development