Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 1
    • Component/s: Core
    • Labels: None

      Description

      This is logically a subtask of CASSANDRA-2710, but Jira doesn't allow sub-sub-tasks.

      We need to support multiple ranges in a SliceQueryFilter, and we want querying them to be efficient, i.e., one pass through the row to get all of the ranges, rather than one pass per range.

      Supercolumns are irrelevant since the goal is to replace them anyway. Ignore supercolumn-related code or rip it out, whichever is easier.

      This is ONLY dealing with the storage engine part, not the StorageProxy and Command intra-node messages or the Thrift or CQL client APIs. Thus, a unit test should be added to ColumnFamilyStoreTest to demonstrate that it works.

      1. 3885-v2.txt
        98 kB
        Sylvain Lebresne
      2. CASSANDRA-3885.patch
        56 kB
        David Alves
      3. CASSANDRA-3885.patch
        56 kB
        David Alves
      4. CASSANDRA-3885.patch
        57 kB
        David Alves
      5. CASSANDRA-3885.patch
        58 kB
        David Alves
      6. CASSANDRA-3885.patch
        53 kB
        David Alves

        Activity

        Sylvain Lebresne added a comment -

        Alright, committed. Thanks all.

        David Alves added a comment -

        +1

        I reviewed the patch, looks good overall.
        It does not add failures other than the previously mentioned SerializationsTest (CompactionsTest fails on a different method, for a different reason, on my pc; that's expected, right?).

        David Alves added a comment -

        Right, I can confirm that; sorry for the n00biness.

        Patch applies cleanly, running tests.

        Sylvain Lebresne added a comment -

        I always thought we have /cassandra/test/data/serialization/x.x if you want to test the older versions.

        Ok, I see. So there are different ant targets to run the serialization test against older versions. I've personally never run that. Is there any reason why we don't just always run the serialization test on all versions?

        But anyway, for this patch I'll regenerate the message binaries for 1.2 before committing.

        Sylvain Lebresne added a comment -

        Be sure to check against http://git-wip-us.apache.org/repos/asf/cassandra.git, not any other repository. In particular the github mirror very often gets behind.

        Sylvain Lebresne added a comment -

        as far as I can see from the 3885 branch there are other changes in there beyond this patch that are more recent than the last change to trunk.

        Then you have the wrong version of trunk. The attached 3885-v2.txt patch applies cleanly to current trunk (as of this comment).

        David Alves added a comment -

        I still can't apply this cleanly to trunk. As far as I can see from the 3885 branch, there are other changes in there beyond this patch that are more recent than the last change to trunk.

        Sylvain, do you want me to try and pick that up? (I mean, take what you did and make it applicable to trunk.)

        Vijay added a comment -

        So I could regenerate the binary messages, but I'm confused on what SerializationsTest is actually testing.

        I always thought we have /cassandra/test/data/serialization/x.x if you want to test the older versions.

        LGTM +1

        nit: the setStart() method should have an assert which checks that there is only one slice, since it kind of affects:

        // As soon as we'd done our first call, we want to reset the start column if we're paging
        if (isPaging)
            ((SliceQueryFilter) initialFilter()).setStart(ByteBufferUtil.EMPTY_BYTE_BUFFER);
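
        As a minimal sketch of that nit (ColumnSlice and the slices field here are simplified stand-ins, not the actual Cassandra classes), the guard could look like:

```java
// Hypothetical sketch of the suggested guard; these are simplified
// stand-in classes, not the real Cassandra types.
import java.nio.ByteBuffer;

class ColumnSlice {
    ByteBuffer start;
    final ByteBuffer finish;
    ColumnSlice(ByteBuffer start, ByteBuffer finish) { this.start = start; this.finish = finish; }
}

class SliceQueryFilter {
    final ColumnSlice[] slices;
    SliceQueryFilter(ColumnSlice... slices) { this.slices = slices; }

    // Resetting the start is only well-defined when there is a single slice;
    // with several slices it would be ambiguous which one to reset.
    void setStart(ByteBuffer newStart) {
        assert slices.length == 1 : "setStart assumes exactly one slice, got " + slices.length;
        slices[0].start = newStart;
    }
}
```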

        PS: I was actually doing some parallel work without the prefetch queue and thought of sharing it. (https://github.com/Vijay2win/cassandra/commit/31ca6fd9e1bafc1f4d8dfe858929586637bffdef#L3L18)

        Sylvain Lebresne added a comment -

        Patch needed rebase. Rebased version attached. I've also pushed it at https://github.com/pcmanus/cassandra/tree/3885.

        David Alves added a comment -

        Thanks for reviewing and extending the patch to the rest of the system, Sylvain.

        PS: wrt code style, I've made changes to the imports to meet the code style, but I was already using tjake's cassandra eclipse profile...

        Sylvain Lebresne added a comment -

        Ok, I like the last patch much more and I've convinced myself that this prefetched thing is a good idea with multiple slices.

        I believe this can be simplified a tiny bit further though. Typically the readBackward business is a tad confusing, and since only SimpleBlockFetcher really cares about it, it can be moved there. I also think the switch of slice/lookup of index block can be cleaned up by doing both together. I'm attaching a v2 that does those things, a few other cleanups, adds a bunch of comments, and also adds two optimizations:

        • Since we always know the order on which we read columns, we can remember when we've "entered" a slice to avoid a bunch of comparisons until we "leave" it.
        • When we switch to the next slice, we must reuse the binary search to locate where that new slice starts rather than going to the next block, otherwise we may end up reading lots of unnecessary blocks.
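
        The first optimization can be sketched as follows (plain strings stand in for column names and their comparator; this is an illustration, not the actual IndexedSliceReader code): once we have "entered" a slice, each column needs only one comparison, against the slice's finish, until we "leave" it.

```java
import java.util.*;

class SliceScan {
    // Return, in one pass, the columns falling in any of the given
    // non-overlapping slices; both columns and slices are in comparator order
    // (plain String ordering stands in for the column comparator).
    static List<String> scan(List<String> columns, String[][] slices) {
        List<String> result = new ArrayList<>();
        int s = 0;                        // index of the current slice
        boolean inSlice = false;          // have we "entered" slices[s]?
        for (String col : columns) {
            while (s < slices.length) {
                if (!inSlice) {
                    if (col.compareTo(slices[s][0]) < 0)
                        break;            // before the current slice: skip this column
                    inSlice = true;       // entered the slice
                }
                if (col.compareTo(slices[s][1]) <= 0) {
                    result.add(col);      // inside the slice: only the finish was compared
                    break;
                }
                s++;                      // past the finish: "leave" and try the next slice
                inSlice = false;
            }
            if (s == slices.length)
                break;                    // past the last slice: stop scanning the row
        }
        return result;
    }
}
```

        With columns a..g and slices [b,c] [f,g], scan returns [b, c, f, g] in a single pass over the row.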

        I'll note that imho, with those optimizations, SimpleSliceReader (not to be confused with SimpleBlockFetcher) isn't really useful anymore, but we probably want to make sure by benchmarking it.

        Anyway, outside of IndexedSliceReader, there were a few problems (all solved by the attached v2):

        • We needed to deal with multiple slices for the memtable iterator too, otherwise we'd end up returning the wrong columns.
        • We needed to correctly serialize multiple slices for the inter-node protocol.

        I'll note that with this patch, db.SerializationsTest is broken, but that is no mystery: it's trying to read old messages using the current protocol version. So I could regenerate the binary messages, but I'm confused about what SerializationsTest is actually testing. I thought it was making sure we don't break backward compatibility. If we regenerate the binary messages at each release we're not testing that at all.

        ps: @david, it seems your editor is splitting lines where it shouldn't and is reordering imports in a way that doesn't respect http://wiki.apache.org/cassandra/CodeStyle.

        David Alves added a comment -

        clarified, thanks.

        Sylvain Lebresne added a comment -

        Why don't we make the "count" information reach ISR so that we can stop at the precise point the user query specifies?

        Because it's more complex than that. The final result is based on merging (potentially) multiple sstables, so the Nth column in a given sstable is not necessarily the Nth returned column.
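
        A toy illustration of that point (sorted string lists stand in for sstables; this is not the actual merge code): with sstable1 = [a, c, e] and sstable2 = [b, d], the first 3 merged columns are [a, b, c], and "c" is only the 2nd column of sstable1.

```java
import java.util.*;

class MergeCount {
    // Merge the (sorted) column names of several sstables and apply the query
    // count to the merged result, the way the final query result is built.
    static List<String> firstN(int n, List<List<String>> sstables) {
        TreeSet<String> merged = new TreeSet<>();
        for (List<String> sstable : sstables)
            merged.addAll(sstable);       // merge and dedup by column name
        List<String> out = new ArrayList<>();
        for (String col : merged) {
            if (out.size() == n)
                break;                    // the count applies to the merged stream only
            out.add(col);
        }
        return out;
    }
}
```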

        David Alves added a comment -

        but why do we need to do this?

        I'm not completely sure I understood what you are asking. If you're asking why the params specify slices in reverse order when reading reversed, it's because I thought it would make sense to specify the ranges we want the first results to come from first.

        If you're asking about this particular piece of code:

        if (reversed && readBackwards)
            currentSlice--;
        else
            currentSlice++;

        Let's say we have the following two slices [a,c] [f,j] and, in reverse, [j,f] [c,a]. In IndexedBlockFetcher we always start with the first slice ([a,c] and [j,f], respectively) and continue from there, as we want to make sure we don't fetch more than necessary. In SimpleBlockFetcher, however, we always fetch the whole row, so we might as well process [j,f] [c,a] as if it were [a,c] [f,j] and simply add the blocks to the queue in reverse order. That's why we read reversed & backwards.
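
        A sketch of that scheme (plain strings and assumed names; not the actual SimpleBlockFetcher code): walk the reversed slices as if they were forward ones and push matches to the front of a deque, so the output comes out in reversed order.

```java
import java.util.*;

class ReversedSimpleFetch {
    // row: all columns of the row in forward comparator order.
    // reversedSlices: [finish, start] pairs in reverse order, e.g. [j,f] [c,a].
    static List<String> fetch(List<String> row, String[][] reversedSlices) {
        Deque<String> out = new ArrayDeque<>();
        // Walk the slices back to front, i.e. [c,a] as [a,c], then [j,f] as [f,j].
        for (int i = reversedSlices.length - 1; i >= 0; i--) {
            String start = reversedSlices[i][1], finish = reversedSlices[i][0];
            for (String col : row)
                if (col.compareTo(start) >= 0 && col.compareTo(finish) <= 0)
                    out.addFirst(col);    // push to the front: output ends up reversed
        }
        return new ArrayList<>(out);
    }
}
```

        For a row a..j with reversed slices [j,f] [c,a], this yields j, i, h, g, f, c, b, a.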

        I hope that clarified it.

        Your question reminds me of another one I had, though. Why don't we make the "count" information reach ISR so that we can stop at the precise point the user query specifies? It seems wasteful to be reading the whole row in SimpleBlockFetcher if the user only requested the first 5 cols. This wouldn't work for reverse reads, of course, but it would speed up forward ones, which are probably the most frequent, no?

        IndexedSliceReader is itself an iterator, and within it Simple- and IndexedFetcher are also kind of iterators ... it might be better to make IndexedSliceReader an abstract class and add the functionality into subclasses.

        I'm not sure I understood what you are suggesting... is it that we move the code that decides whether we need a Simple- or IndexedBlockFetcher out of ISR, and make ISR itself the abstract parent class of Simple- and IndexedBlockFetcher?

        Vijay added a comment -

        SliceQueryFilter.serializer needs to deal with multiple ColumnSlices, or at least we have to create a ticket to follow up.

        I am a bit confused by the patch... The comments on IndexedSliceReader say

        forward: [a,b],[d,e],[g,h] reverse: [h,g],[e,d],[b,a].

        but why do we need to do this?

        if (reversed && readBackwards)
            currentSlice--;
        else
            currentSlice++;

        IndexedSliceReader is itself an iterator, and within it Simple- and IndexedFetcher are also kind of iterators ... it might be better to make IndexedSliceReader an abstract class and add the functionality into subclasses.

        David Alves added a comment -

        Corrected a bug where, in reverse, the index was being looked up based on the start of the slice and not on the finish.

        David Alves added a comment -

        Much simpler and more readable patch (imo).

        • adds a column slice object (gets rid of the nasty unchecked cast warnings)
        • gets rid of the isColumnNeeded method and diminishes the number of bb comparisons
        David Alves added a comment -

        Hi Jonathan.

        I'm sorry I wasn't able to explain myself better. It's not that I wouldn't want to change things; it's that the most straightforward way to use the same add-cols-then-recheck approach is to add more checks to the isColumnNeeded method, and that is not a good idea since it is pretty convoluted and slow as is (I ran some tests, and the cache version is about 5-10% faster than the add-then-recheck version: about 5% on forward reads and 10% on reverse).

        Still, in retrospect, the code can surely be improved a lot in terms of readability, and Sylvain has made a couple of interesting suggestions that I'm trying out; hopefully we'll get the best of both worlds.

        Jonathan Ellis added a comment -

        While I'm sympathetic to the time spent getting code working, one of the reasons we do code reviews is we have a strong bias towards doing things "right" as opposed to "good enough." Granted, that can be taken too far, but I'm pretty happy with the balance we've achieved between improving what we commit on the one hand, and bikeshedding on the other.

        In short, I'd like to see Sylvain's remaining concerns addressed. Of course, reviewers aren't infallible either, and getting rid of the cache (for instance) may not actually produce much simplification, but in the interest of achieving consensus it's worth spending the time to try a second approach.

        David Alves added a comment -

        Thank you for the thorough review, Sylvain.

        My notes are:

        With regard to bugs, that's why there's extensive testing.

        With regard to the symmetry of forward and reverse, it is indeed broken, but that's because the two operations (IMO) are really not symmetric, since columns are always read from left to right. Not adding unnecessary data to blockColumns avoids having to check ranges for every tuple that comes out of it, as was done before, which I think is an advantage in most cases, but it bears the penalty of making it impossible to add out-of-order values (as was done before, because values were always range-checked twice). Moreover, the code is now naturally more complex because the operation is more complex.

        All in all I think most of what you're saying makes sense, but since things are working (and commented throughout) I'd prefer to leave it. The whole cache thing is arguable, though, and I'd be happy to add the bb checks to make sure only relevant data is added to it if that's the consensus.

        With regard to the Pair<ByteBuffer,ByteBuffer>, I agree; I think I jumped the gun on that. I particularly don't like the unchecked casts when using arrays. I'll change that as per your suggestion.

        Sylvain Lebresne added a comment -

        On an unrelated note, I disagree with using Pair<ByteBuffer, ByteBuffer> everywhere in the code, as it's not explicit enough (and it's a lot more characters than it's worth). So I (strongly) think we should define a specific class for this (say ColumnRange). But I'm fine having that class extend Pair<ByteBuffer, ByteBuffer>; it's only a matter of having an explicit name.

        Sylvain Lebresne added a comment -

        It's not so much about necessarily mimicking the current behavior, but that the cache makes the code a lot more asymmetric between forward and reverse slices, which imo makes it harder to follow and more prone to having bugs for reversed but not forward slices. Also (and though that is arguably minor), reusing the blockColumns queue as the "cache", as is done currently, saves the cache allocation.

        David Alves added a comment - edited

        The new version of the code has one difference in regard to the blockColumns queue: it never adds values to it that aren't going to be necessary (as was done previously), so the blockColumns queue is not really a cache at all; all values there are returned (if computeNext() is called, of course). That's why the "cache" is needed only for a (partial) index segment's worth of data, and only in reverse. In fact I can add a couple of bb checks so that it is no longer a cache, and simply holds stuff that hasn't been added to the blockColumns queue yet.

        Do you think it's important to change the code to mimic how it behaved before?

        Sylvain Lebresne added a comment -

        simply storing it anyway in case it is needed.

        Yeah, but we kind of do that in the current code already with the blockColumns queue. blockColumns is a cache of sorts, and that's why I think we shouldn't add a second one. Much like we currently do, during deserialization we can simply check whether it's worth deserializing the next column of the block (which should only mean comparing each column once, against the "biggest" range intersecting the index block). Then, in computeNext(), when polling the column from blockColumns, we can validate whether the column is really needed (again, since we know the order in which we poll the columns, this should only require a comparison against one of the ranges). This should be roughly equivalent to the cache of your patch, but it avoids having both blockColumns and the cache.

        David Alves added a comment -

        Thank you for your input, Sylvain.

        That crossed my mind.

        The thing is, we have to deserialize "d" anyway. So the question becomes what's faster/more optimized: keep comparing every column to see if it matches the previous range (the next range to be processed, in reverse), doing two bb comparisons, and then store it for later if yes (because we're not finished with the current range), or simply storing it anyway in case it is needed.

        I figured the latter option would probably be faster, but I might be wrong.

        Sylvain Lebresne added a comment -

        Now while the current range we're looking for only needs "e" and "f" we have also deserialized "d"

        That's what looks wrong to me (without having read the patch, I'll admit). It seems that as soon as we deserialize "d", we should realize that we're outside of [f-e], and thus we should start considering [d-b] and add the column to the result without any need for caching. Provided the requested ranges don't overlap (which I agree could be assumed by the sstable reader and validated way upfront), we shouldn't need to deserialize anything twice and we shouldn't need any cache.

        David Alves added a comment -

        Refactored to use Pair<ByteBuffer,ByteBuffer> instead of ColumnSlice.

        Thanks for the tip, Vijay.

        David Alves added a comment -

        With regard to count/reverse per slice, I tried to keep it relatively simple. My reasoning was that writing a query that specifies count and order (fwd, rev) per slice would be hard to do clearly, grammar-wise, and that most people would mostly want global (for all ranges) ordering and counting. That being said, I can always rewrite if people think that per-slice count and order are important.

        David Alves added a comment -

        Thanks for reviewing and for your input Vijay.

        Now I understand what you were saying, but caching is used in a different circumstance. In fact the IndexedReader assumes that the query has already been adequately rewritten and that the ranges are in order (either forward or reverse) and non-overlapping.

        Caching is used in the following, very particular, circumstance. Let's say we have the following columns: a, b, c, d, e, f, and that index segments cover the columns as such: i1=[a-c] i2=[d-f], and that we want the following (reversed) ranges: [f-e] [d-b]. We start by looking for the index segment for "f", which falls into i2, and we'll deserialize columns d through f. Now while the current range we're looking for only needs "e" and "f", we have also deserialized "d". Had we not cached "d" we would have to re-read "d" to handle the next range (now [d-b]), but because we cached it we don't. In general, if the next range requires any of the data we've already read we're sure to have it in cache and won't need to re-read; if it doesn't, we clear the cache.

        As you can see it's a very particular optimization for indexed, reverse reads.
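The scenario above can be sketched as follows. This is a simplified model, not Cassandra's actual IndexedSliceReader: the class name, the `deserialize`/`sliceReversed` helpers, and the read counter are all hypothetical, with index blocks modelled as sorted string lists and a block deserialization standing in for a disk read.

```java
import java.util.*;

// Simplified model of the reverse-read cache: index blocks i1=[a-c] and
// i2=[d-f]; deserializing a block counts as one "disk read". Serving the
// reversed ranges [f-e] then [d-b] touches i2 twice, but the cache means
// it is only deserialized once.
public class ReverseSliceCacheSketch {
    static final List<List<String>> BLOCKS = List.of(
            List.of("a", "b", "c"),   // index block i1
            List.of("d", "e", "f"));  // index block i2

    static int blockReads = 0;
    static final Map<Integer, List<String>> cache = new HashMap<>();

    static List<String> deserialize(int blockIdx) {
        return cache.computeIfAbsent(blockIdx, i -> {
            blockReads++;             // a cache miss costs a read
            return BLOCKS.get(i);
        });
    }

    /** Reversed slice [start..finish] with start >= finish; scans blocks
     *  from the last one backwards, as a reversed query would. */
    static List<String> sliceReversed(String start, String finish) {
        List<String> out = new ArrayList<>();
        for (int b = BLOCKS.size() - 1; b >= 0; b--) {
            List<String> block = BLOCKS.get(b);
            if (block.get(0).compareTo(start) > 0)
                continue;             // whole block above the range
            if (block.get(block.size() - 1).compareTo(finish) < 0)
                break;                // whole block below the range
            List<String> cols = deserialize(b);
            for (int i = cols.size() - 1; i >= 0; i--) {
                String c = cols.get(i);
                if (c.compareTo(start) <= 0 && c.compareTo(finish) >= 0)
                    out.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sliceReversed("f", "e")); // [f, e]
        System.out.println(sliceReversed("d", "b")); // [d, c, b]
        System.out.println(blockReads);              // 2, not 3: i2 read once
    }
}
```

Without the cache, the second range would deserialize i2 a second time just to reach "d"; with it, only i1 costs a new read.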

        Vijay added a comment -

        Hi David,

        Caching the results is not about query validation, at least I can't see how better query validation would discard the necessity of caching to avoid re-reads.

        We can validate and also rewrite the query, Example: user wants a query from

        {A:B-A:D, A:D-A:H, A:D-A:G}

        can be rewritten as

        {A:B-A:H}

        ... At the end of the day the user needs a list of sorted columns.

        In addition, with the new patch ColumnSlice can be a Pair<T1,T2> (but I thought we also needed count and isReverse in it?)
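The rewrite Vijay describes is a standard merge of sorted, overlapping intervals. A minimal sketch of that pre-validation step (a hypothetical helper, not code from the patch; ranges are modelled as String[2] pairs):

```java
import java.util.*;

// Merge overlapping [start, finish] ranges into the minimal equivalent set,
// e.g. {B-D, D-H, D-G} -> {B-H}. Hypothetical pre-validation helper.
public class SliceRangeMerge {
    static List<String[]> merge(List<String[]> ranges) {
        List<String[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparing(r -> r[0]));   // sort by start
        List<String[]> out = new ArrayList<>();
        for (String[] r : sorted) {
            String[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last[1].compareTo(r[0]) >= 0) {
                if (r[1].compareTo(last[1]) > 0)
                    last[1] = r[1];                     // extend the last range
            } else {
                out.add(new String[] { r[0], r[1] });   // disjoint: keep as-is
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> merged = merge(List.of(
                new String[] { "B", "D" },
                new String[] { "D", "H" },
                new String[] { "D", "G" }));
        System.out.println(merged.size());                             // 1
        System.out.println(merged.get(0)[0] + "-" + merged.get(0)[1]); // B-H
    }
}
```

Doing this once at the API boundary lets the storage-engine reader assume non-overlapping, ordered ranges, as discussed above.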

        David Alves added a comment -

        Here's a patch that addresses most of the issues:

        • A slice range is now specified as a ColumnSlice, a db package immutable object.
        • Ranges are passed as arrays and never changed or copied.

        The only thing that was not addressed was the cache issue. IMO the cache is useful as it prevents disk re-reads. It might be better to make sure the columns are needed before caching them, but even that I'm really not sure about, since making sure would require a series of bytebuffer comparisons, and keeping a single index-length (max) of cols is probably not that bad.

        David Alves added a comment - - edited

        Hi Vijay

        Sorry for the delay also; somehow I didn't get notified that the issue was updated/commented on.
        1) - With regard to SliceRange you're right, I just wanted to get some feedback before changing thrift stuff
        2) - We can definitely avoid that, maybe make IndexedSliceReader receive the ranges as an array?
        3) - Hadn't thought about that... maybe we could have a internal (within org.apache.cassandra.db) version and transform?
        4) - Caching the results is not about query validation, at least I can't see how better query validation would discard the necessity of caching to avoid re-reads. There is always a scenario where caching the results avoids re-reads (when we're sure that the next reversed range will need them); the thing is that right now we're caching without knowing whether they will be necessary (vs. making sure the previous range will need them). In any case, the maximum that gets cached is an index range of records per query, so that might not be a terrible issue. What do you think?

        Vijay added a comment - - edited

        Hi David,
        Thanks for the patch, (Needs a rebase and sorry for the delay in responding)
        1) We should not be hand editing generated code/ SliceRange.
        2) Can we avoid doing ranges = ranges.toArray(new SliceRange[ranges.size()]); so we don't make unnecessary copies?
        3) we should not create org.apache.cassandra.thrift.SliceRange inside of SliceQueryFilter we might want to keep the dependency of the thrift outside of the storage layer.
        4) Not sure about caching the results... We can just validate in the thrift layer so users will send a proper query (this reduces complexity and comparisons on reads)?

        David Alves added a comment - - edited

        Here's a patch that implements multi-slice in ISR. It behaves pretty much as expected/suggested; additionally, it implements re-read prevention. To clarify:

        • for forward scans it checks whether the next slice is within the current index segment and continues the scan there if it is.
        • for reverse scans it stores the columns before the current range in cache. If the next range (which in reverse is before) starts in/uses the same index segment, then it uses the cache instead of re-reading the columns.

        The patch also reduces the number of bytebuffer comparisons in computeNext(), as all the columns added to blockColumns are within the argument ranges and may be read, so there is no need to make range checks.

        The patch slightly changes SliceQueryFilter to use range sets but maintains backwards compatibility. Public access to start and finish is changed to be done through accessors.

        The patch includes a somewhat (maybe too) verbose test of the new and old behavior.

        The slice reader assumes that either all of the ranges are read reversed or none are (there is no mixing of forward and reversed ranges).

        The slice reader assumes that ranges are sorted correctly, i.e. that for forward lookup ranges are in lexicographic order of start elements and that for reverse lookup they are in reverse lexicographic order of finish (reverse start) elements. E.g. forward: [a,b],[d,e],[g,h]; reverse: [h,g],[e,d],[b,a]. The reader also assumes that validation has been performed in terms of intervals (no overlapping intervals).
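That precondition can be checked up front with a small validation pass. A sketch (a hypothetical helper illustrating the assumption, not code from the patch; ranges are modelled as String[2] pairs and compared lexicographically):

```java
// Check the reader's precondition: for forward queries each range must start
// strictly after the previous one finishes; for reversed queries each range
// must start strictly before the previous one finishes (comparator order),
// which also rules out overlapping intervals.
public class SliceOrderCheck {
    static boolean validOrder(String[][] ranges, boolean reversed) {
        for (int i = 1; i < ranges.length; i++) {
            int cmp = ranges[i][0].compareTo(ranges[i - 1][1]);
            if (reversed ? cmp >= 0 : cmp <= 0)
                return false;       // out of order or overlapping
        }
        return true;
    }

    public static void main(String[] args) {
        String[][] fwd = { { "a", "b" }, { "d", "e" }, { "g", "h" } };
        String[][] rev = { { "h", "g" }, { "e", "d" }, { "b", "a" } };
        String[][] bad = { { "a", "d" }, { "c", "f" } };  // overlaps
        System.out.println(validOrder(fwd, false));  // true
        System.out.println(validOrder(rev, true));   // true
        System.out.println(validOrder(bad, false));  // false
    }
}
```

Running such a check before handing the ranges to the reader keeps the storage-engine code free of overlap handling.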

        Tests are mostly green with the following exception:

        [junit] Testsuite: org.apache.cassandra.db.CommitLogTest
        [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
        [junit]
        [junit] Testcase: org.apache.cassandra.db.CommitLogTest:terminated successfully: Caused an ERROR
        [junit] Timeout occurred. Please note the time in the report does not reflect the time until the timeout.
        [junit] junit.framework.AssertionFailedError: Timeout occurred. Please note the time in the report does not reflect the time until the timeout.
        [junit]
        [junit]
        [junit] Test org.apache.cassandra.db.CommitLogTest FAILED (timeout)

        Jonathan Ellis added a comment -

        (Most of the work will be pushed down into Memtable.getSliceIterator and SSTableSliceIterator.)


          People

          • Assignee: David Alves
          • Reporter: Jonathan Ellis
          • Reviewer: Vijay
          • Votes: 2
          • Watchers: 8
