Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None
    • Environment:

      N/A

      Description

      Allow a ColumnFamily to be versioned via vector clocks, instead of long timestamps. Purpose: enable incr/decr; flexible conflict resolution.
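The core vector-clock semantics behind this proposal, sketched minimally (all names here are illustrative, not from any attached patch): a clock maps node ids to counters, two clocks conflict when neither dominates the other, and merging takes the pairwise max.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal vector clock: node id -> counter. Names are illustrative only.
class VectorClock {
    final Map<String, Long> counts = new HashMap<>();

    void increment(String nodeId) {
        counts.merge(nodeId, 1L, Long::sum);
    }

    // true if this clock is causally at-or-after 'other' for every node id
    boolean dominates(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counts.entrySet())
            if (counts.getOrDefault(e.getKey(), 0L) < e.getValue())
                return false;
        return true;
    }

    // pairwise max: the merged clock dominates both inputs
    static VectorClock merge(VectorClock a, VectorClock b) {
        VectorClock m = new VectorClock();
        m.counts.putAll(a.counts);
        b.counts.forEach((k, v) -> m.counts.merge(k, v, Long::max));
        return m;
    }
}
```

When neither clock dominates, the two writes are concurrent, and that is where the "flexible conflict resolution" mentioned above comes in.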

      Attachments

      1. 0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch
        20 kB
        Sylvain Lebresne
      2. 580-version-vector-wip.patch
        126 kB
        Kelvin Kakugawa
      3. 580-1-Add-ColumnType-as-enum.patch
        40 kB
        Sylvain Lebresne
      4. ASF.LICENSE.NOT.GRANTED--580-counts-wip1.patch
        328 kB
        Kelvin Kakugawa
      5. 580-context-v4.patch
        72 kB
        Kelvin Kakugawa
      6. 580-thrift-v6.patch
        2 kB
        Kelvin Kakugawa
      7. 580-thrift-v3.patch
        2 kB
        Jonathan Ellis
      There are no Sub-Tasks for this issue.

        Activity

        Jonathan Ellis added a comment -

        Closing as reasoned above.

        Jonathan Ellis added a comment - edited

        As discussed here and in CASSANDRA-1072, this ticket is a dead end for counting, and the primary use case for classic vector clocks, merging non-conflicting updates to different fields within a value, is already handled by Cassandra breaking a row into columns.

        Is there a good reason not to close this?

        Kelvin Kakugawa added a comment -

        Hi Sylvain,

        I read through your patch:
        0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch

        It makes sense and generalizes to 1072. I agree that it's better than the current implementation in 580.

        If you don't mind, I'm working on integrating the above diff with the 1072 logic, here:
        https://issues.apache.org/jira/browse/CASSANDRA-1397

        Sylvain Lebresne added a comment -

        I believe the attached patch (580-version-vector-wip.patch) has a problem. At
        CL.ZERO and CL.ONE, it doesn't replicate writes (the ones using version
        vectors) at all (SP.updateDestinationByClock() clears the destinationEndpoints
        but still returns an empty collection). This is (overly) unsafe.

        This certainly could be fixed by adding new WriteResponseHandler for those
        cases. But I believe that there is a much better alternative.

        This alternative consists of changing the write protocol (for version
        vectors only, of course) to do the following (note that the protocol of the
        current patch already differs from the one for timestamps):

        1. a node receives a write request (with a version vector clock) from a client.
           If it's a replica for the write, go to step 3; otherwise go to step 2.
        2. the node delegates the write to one replica (along with the requested CL) and
           then waits only for an ack from that replica before answering the client (it
           doesn't replicate anything itself).
        3. the chosen replica applies the mutation locally first (we must do this before
           replication).
        4. it then sends the mutation to the other replicas, waiting for as many
           responses as are necessary to achieve the requested consistency level.

        To make this more concrete, I'm attaching a patch (0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch)
        that implements this protocol (it all starts in SP.delegateMutateBlocking()).
        Small disclaimer: this should work but is not really tested (so please be
        nice). The function RowMutation.updateBeforeReplication() can safely be
        ignored on a first read, but it would be needed if #1072 were to use this. It
        could also probably be slightly optimized by allowing the
        DelegatedRowMutationVerbHandler to handle multiple mutations at once. This is
        also just the protocol mentioned above; #580 would have to be rebased on top
        of it.

        Anyway, I think this alternative is superior to the one used by the currently
        attached #580 patch for the following reasons:

        • the protocol used by the current patch (write to one replica, wait for the
          ack and then replicate to the others, which differs from what I propose in
          that this is done from a potentially non-replica node) doesn't work for
          #1072 (because of a potential race condition with read repairs). The
          protocol I'm proposing does not suffer from this problem and (I'm quite
          convinced, let's hope I'm not wrong) would reconcile #1072 with the EC
          model of Cassandra. This is obviously the most important point.
        • it is slightly faster (network-latency-wise), as we don't wait for a full
          round trip to a node before starting the replication.
        • it more cleanly separates the protocols of timestamped writes and versioned
          ones (without much code duplication, really). I suppose whether this is
          better is a matter of opinion, but at the very least it makes it clearer
          that version vectors don't slow down or break the other writes.

        I'd be happy if someone had a look at this and confirmed that I'm not
        completely wide of the mark. If I'm not, I may be able to spare some cycles
        merging this idea with #580 (and #1072).

        Kelvin Kakugawa added a comment -

        Both 580 and 1072 are against 0.7.

        580 works now. However, I'd like to add the improvements that Cliff and Andy recommended.
        1072 is closer, but we're writing up a design document for how distributed counters work.

        Kazuki Aranami added a comment -

        Hi Kelvin Kakugawa,

        What is the status of your work on 580 and 1072?
        I wonder whether this work can safely be added to 0.7, if things look favorable.

        Kelvin Kakugawa added a comment -

        580 and 1072 are forks of each other. So, either could be applied first.

        Basically, 1072 was built off of the original 580 work. However, 1072 made a lot of various improvements to the underlying code. So, when I brought 580 up to trunk, I incorporated all of the underlying improvements in 1072, but replaced the counter-specific code w/ the vector clock implementation.

        I believe more people are interested in 1072. So, after that gets committed to trunk, I'll refactor the above patch so it can be applied on top of 1072.

        Johan Oskarsson added a comment -

        CASSANDRA-1243 was my somewhat failed attempt to split out code from CASSANDRA-1072 to make that patch more manageable, but there is not really a clear-cut component to split out, so I have canceled that patch. I can close it for now and reopen it if we want to try again. Alternatively, we could create a few smaller and more focused issues instead. It won't make much of a dent in 1072, though.

        The idea is to get CASSANDRA-1072 committed first and then resume work on CASSANDRA-580. They do both use a lot of the same changes that we have already committed to Cassandra (clocks, reconciler) and probably share a bit of code still.

        Jonathan Ellis added a comment -

        What is the current relationship between 580, CASSANDRA-1072, CASSANDRA-1243? Is 1072 a subset of 580? Or should the 580 patch be applied first?

        Kelvin Kakugawa added a comment - edited

        bringing 580 up to trunk (git@714eab0c4079276eefcdd5a1d8d9ca7b225c9f80)

        TODO:
        1) modify version vector node ids to use a unique, per-connection id (thanks to Cliff Moon and Andy Gross for pointing this out as an improvement)

        • note: see Riak's implementation, which derives unique client ids from the Erlang process id.
          pros: less replica coordination w.r.t. node ids.
          necessary evil: must add timestamps to version vector tuples and prune the version vector (based on time).

        2) modify the tuple format

        • add a timestamp (to allow version vector pruning)
        • shorten the count type to int, instead of long (node ids will be updated for a much shorter period of time)
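A minimal sketch of the tuple layout these two TODO items imply, with hypothetical names (the real on-disk format is a packed byte array, not objects):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-entry tuple sketched from the TODO above: each version
// vector entry carries a timestamp so stale node ids can be pruned.
class VersionEntry {
    final String nodeId;
    int count;           // int, not long, per TODO item 2
    long lastUpdatedMs;  // added to allow time-based pruning

    VersionEntry(String nodeId, int count, long lastUpdatedMs) {
        this.nodeId = nodeId;
        this.count = count;
        this.lastUpdatedMs = lastUpdatedMs;
    }
}

class VersionVector {
    final List<VersionEntry> entries = new ArrayList<>();

    // drop entries not updated within maxAgeMs of 'now'
    void prune(long nowMs, long maxAgeMs) {
        entries.removeIf(e -> nowMs - e.lastUpdatedMs > maxAgeMs);
    }
}
```

Pruning is the "necessary evil" noted above: with per-connection ids the vector would otherwise grow without bound.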
        Kazuki Aranami added a comment -

        Hi Kelvin Kakugawa,

        I'm keeping an eye on the code on GitHub.

        Thanks!

        Kelvin Kakugawa added a comment -

        Hi Kazuki,

        I'm going to bring vector clocks back up to trunk, here:
        http://github.com/kakugawa/cassandra/tree/trunk-580

        Hopefully, by this week.

        Kazuki Aranami added a comment -

        Hello.

        I think this Vector Clock implementation is very important for the architecture.

        Has the Vector Clock work stalled?
        I am concerned about its impact on the roadmap.

        0.7 roadmap
        http://www.mail-archive.com/dev@cassandra.apache.org/msg00404.html

        There is also this project:

        itclocks
        http://code.google.com/p/itclocks/

        Thanks!

        Kelvin Kakugawa added a comment -

        (as discussed on IRC)
        break out ColumnType enum into two separate enums:
        RowType (Super, Standard)
        ClockType (Timestamp, Version, IncrementalCounter)

        1058 will just introduce RowType
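The split discussed above, written out as plain enums (the values are the ones listed in the comment; anything beyond the declarations would be speculation):

```java
// The two enums proposed above in place of the single ColumnType.
enum RowType { SUPER, STANDARD }
enum ClockType { TIMESTAMP, VERSION, INCREMENTAL_COUNTER }
```

The point of the split is that row shape and clock scheme become orthogonal: any RowType can pair with any ClockType.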

        Sylvain Lebresne added a comment -

        > Can you create a new (sub) ticket for that change and attach the patch to it?

        Sure: CASSANDRA-1058

        > Also, it looks like the actual ColumnType java file didn't make it into the patch.

        Ahah, my bad. I forgot the add. That's corrected in the patch attached to #1058.

        Johan Oskarsson added a comment -

        Sylvain: Good work splitting out the ColumnType into a new patch. Can you create a new (sub) ticket for that change and attach the patch to it? Also, it looks like the actual ColumnType java file didn't make it into the patch.

        Sylvain Lebresne added a comment -

        The attached patch is the first part of the wip patch (ColumnType as enum) rebased
        against trunk. It includes the addition of the ColumnType in SuperColumns, which is
        not strictly needed per se but makes sense here, I believe (note that this patch depends
        on CASSANDRA-1054).

        Since I did not write this and merely rebased it, I'll +1 it (just this first part,
        of course) at the same time.

        Kelvin Kakugawa added a comment -

        Hey Sylvain,

        Any help is appreciated. I'm still testing and refactoring the counters logic.

        However, the ColumnType as enum is pretty distinct, so if you want to rebase that and submit the patch I'm totally fine w/ that. Thanks.

        -Kelvin

        Sylvain Lebresne added a comment -

        Where are you with this patch splitting/rebasing? In the process of
        giving a hand in reviewing this, I've extracted the first part of it
        (ColumnType as enum) and rebased it against trunk. I certainly don't
        want to go behind your back, but if it can help, I can attach it so
        that maybe this first part gets committed.

        Kelvin Kakugawa added a comment -

        Yes, you're right. This WIP patch will be broken out into 3 separate patches: ColumnType as enum, VC extension, and counters extension.

        And, most of the TODO: Test comments can be covered by system tests.

        Jonathan Ellis added a comment -

        If there is any "prep" refactoring in here it should be pulled out into a separate patch. It's impossible to tell, looking at a 300k patch.

        Adding system tests should make most of the TODO: TESTs go away, right?

        Kelvin Kakugawa added a comment -

        WIP patch against svn r923420

        This is the last patch we have against cassandra 0.6. The internal version has more test coverage and bug fixes, but it needs to be backported.

        Kelvin Kakugawa added a comment -

        np, I'll remove the historical patches.

        Jonathan Ellis added a comment -

        Kelvin, can you remove the patches from this issue that are of historical interest only, for the benefit of those who want to see what the current (or at least most-recently-posted) patchset is?

        Kelvin Kakugawa added a comment -

        Moved utility methods in VersionVectorContext to FBUtilities.
        Modified impl to use System.arraycopy().
        Modified internal methods to be protected/private (depending on whether necessary for testing).
        Removed now unused methods.

        Note: FBUtilities.compareByteSubArrays() will throw an IllegalArgumentException if a length is passed in that extends past either array.
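A sketch of the bounds check being described; the real FBUtilities method's signature and null handling may differ:

```java
// Illustrative sketch of a bounds-checked sub-array comparison; not the
// actual FBUtilities implementation.
class ByteCompare {
    // compares 'length' unsigned bytes starting at aOff/bOff;
    // throws IllegalArgumentException if the length extends past either array
    static int compareByteSubArrays(byte[] a, int aOff, byte[] b, int bOff, int length) {
        if (a == null || b == null)
            throw new IllegalArgumentException("null array");
        if (aOff + length > a.length || bOff + length > b.length)
            throw new IllegalArgumentException("length extends past an array");
        for (int i = 0; i < length; i++) {
            int cmp = Integer.compare(a[aOff + i] & 0xff, b[bOff + i] & 0xff);
            if (cmp != 0)
                return cmp;
        }
        return 0;
    }
}
```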

        Kelvin Kakugawa added a comment -

        For the initial server-side reconciliation, we'll probably adopt an approach similar to the way custom ColumnFamily comparators are specified.

        It might help to refer to:
        http://wiki.apache.org/cassandra/StorageConfiguration

        In particular, see the section called "Keyspaces and ColumnFamilies," which discusses compareWith and how you can extend org.apache.cassandra.db.marshal.AbstractType to use your own ColumnFamily comparator.

        For the initial implementation to get this out the door, we won't support a javascript API. However, I'm open to suggestions. It probably would be nice to support a high-level scripting language for certain components.
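By analogy with compareWith, a pluggable reconciler would presumably reduce to an interface like the following. This is guesswork, not an actual Cassandra API, and the MaxValueReconciler policy is purely illustrative:

```java
// Hypothetical shape of a pluggable server-side reconciler, by analogy with
// how compareWith names a custom comparator class.
interface Reconciler {
    // given two conflicting values, return the winner (or a merged value)
    byte[] reconcile(byte[] left, byte[] right);
}

// Trivial stand-in policy: keep the lexically greater value, so reconcile()
// is deterministic and commutative regardless of argument order.
class MaxValueReconciler implements Reconciler {
    public byte[] reconcile(byte[] left, byte[] right) {
        return java.util.Arrays.compare(left, right) >= 0 ? left : right;
    }
}
```

A real policy would of course inspect the vector-clock context, not just the raw bytes; the point here is only the pluggable-class shape.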

        Pedro Gomes added a comment -

        I have been following this issue for some time now and I'm curious how you will deal with reconciliation.
        I suppose, for the beginning of the discussion, that some sort of interface will be implemented to allow pluggable logic to be added to the server; personalized scripts were one idea, I have heard.

        Is something planned? As a suggestion: the Java Scripting API, with scripts stored in Cassandra?
        Sorry if I seem hasty, but I will probably work on some dummy implementation as I need this for my work, and I didn't want to diverge from a future release of Cassandra.
        If you can shed some light on this, I would be thankful.

        Kelvin Kakugawa added a comment -

        np, I'll clean up the API. i.e. make internal methods private and move public methods to FBUtilities.

        Jonathan Ellis added a comment -

        (It's okay to have helper methods that are only used in one class local to that class, but in that case they should be private.)

        Kelvin Kakugawa added a comment -

        Thanks for your comments.

        (1) I thought about putting the utility methods in FBUtilities. However, I left them in VVC for this pass. I can move them over.

        (2) You're right about cBSA. I didn't implement the cases that were already handled by VVC. However, when I move it to FBUtilities, I'll cover all of the missing cases.

        (3) Thanks for the heads-up. I'll look into replacing those loops with arraycopy.

        (4) My goal is to keep the context an opaque array, because I want to support other version implementations, e.g. interval tree clocks, which have a different format. So, if I wanted to use an object representation, VVC would have to internally inflate the opaque context. However, the manual byte manipulation isn't as easy to read as an object-based implementation would be, and this was a concern of mine.

        Right now, I'm probably going to modify the reconcile() method's interface. Instead of using a List of Pair objects, I have a new IColumn impl, VectorColumn, that would probably be more appropriate than Pair.

        Jaakko Laine added a comment -

        I read through the patch; comments/questions follow:

        (1) Utility methods should be put into FBUtilities.

        (2) compareByteSubArrays seems to have the following issues: (a) it does not handle the case where bytes2 is null and bytes1 is non-null; (b) it does not handle the case where the 'length' parameter is bigger than a byte array's length.

        (3) Bytes are copied in manual for loops. Is there a reason for not using arraycopy? It would perhaps make the code slightly cleaner and faster (I don't know about the latter, though, as this involves copying elements within the same array).

        (4) Would it make sense to use some level of data abstraction? Sorting and comparisons always involve copying and handling of individual bytes, which makes the code slightly cumbersome to read, and it might be less efficient than using object references as well (again, not sure about this, have to do some research).
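Point (3) above, illustrated with a small sketch (not code from the patch):

```java
// The two copy styles from point (3): a manual loop vs System.arraycopy.
// Both produce the same result; arraycopy states the intent in one call
// (and also works for shifting ranges within a single array).
class CopyDemo {
    static byte[] copyManual(byte[] src) {
        byte[] dst = new byte[src.length];
        for (int i = 0; i < src.length; i++)
            dst[i] = src[i]; // the loop style used in the patch
        return dst;
    }

    static byte[] copyFast(byte[] src) {
        byte[] dst = new byte[src.length];
        System.arraycopy(src, 0, dst, 0, src.length); // the suggested replacement
        return dst;
    }
}
```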

        Kelvin Kakugawa added a comment -

        Modify Deletion to use Clock, instead of i64 timestamp.

        Kelvin Kakugawa added a comment -

        potential incompatibility w/ IColumn and IColumnContainer interfaces.

        The key issue is that Column and SuperColumn rely on timestamps to indicate when deletion should occur. In particular, through these methods:
        getMarkedForDeleteAt()
        mostRecentLiveChangeAt()

        A separate, less interesting, issue is that SuperColumn auto-assumes Column for its sub-column serializer.

        Right now, I've put together a new class, VersionColumn, that's context-based. I've added context() methods to IColumn and added some instanceof checks. However, I'm still exploring the implementation for the delete logic.

        atm, I'm trying to avoid having to add a SuperVersionColumn class. However, what is an interesting way to define the version "context" of a SuperColumn? An aggregated context of all the sub-columns may be worth exploring. Alternatively, since all the updates are timestamped, we could use that as a rough approximation for mostRecentLiveChangeAt(), but that seems to break the spirit of versioned contexts.

        Kelvin Kakugawa added a comment -

        minor modification:
        add comments to reconcile()'s param and return value

        Kelvin Kakugawa added a comment -

        minor modification:
        added toString() method in IContext to create a human-readable string from a given context.

        Kelvin Kakugawa added a comment -

        re: the node id format.
        If Cassandra supports IPv6, then it might be advantageous to just use the IPv4 portion of it. If that's possible. Otherwise, IPv6's 16 bytes is kind of rough.

        Kelvin Kakugawa added a comment -

        Nomenclature: When I use the terms superset/subset/disjoint, I really mean: dominates, dominated, neither dominated nor dominates. However, it's easier for me to read the code w/ standard set terms.

        Kelvin Kakugawa added a comment - - edited

        The patch adds a new package:
        db/context

        which contains:
        IContext
        VersionVectorContext

        IContext is just a general interface to manipulate context byte[]. Basically, create(), update() and reconcile().

        create(): create a new/empty context.
        update(): update this context w/ the local node's id.
        reconcile(): pass in a list of context-value pairs and it'll return a merged context (that supersedes all the passed in contexts) and a list of values that it couldn't automatically reconcile.

        VersionVectorContext is a version vector implementation.

        The VV format is a concatenated list of node id (IPv4's 4 bytes), count (int), and timestamp (long) tuples in a byte[].

        create(): returns an empty byte[].
update(): will look for the local node's tuple in the byte[], increment its count, then prepend it to the front of the byte[] w/ an updated timestamp, so that the byte[] is always in descending timestamp order.
reconcile(): looks for all disjoint (incompatible) VVs and collapses all VVs that are a subset of another VV in the list. (implementation note: if 2 VVs are equal, but their values are not equivalent, both values will be added to the set of values that need to be manually reconciled. It seems inefficient, though, so when I go through the rest of the system, I'm going to see if I can avoid this check, since it's a problem that can only happen on the local node.)
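The 16-byte entry layout described above (4-byte node id, int count, long timestamp) could be packed roughly like this. This is only a sketch: the class and method names are illustrative, not taken from the patch.

```java
import java.nio.ByteBuffer;

// Sketch of the packed tuple format described above: each entry is a
// 4-byte node id, a 4-byte count, and an 8-byte timestamp (16 bytes total).
// Class and field names are hypothetical, not from the actual patch.
public class VVFormat
{
    static final int ENTRY_SIZE = 4 + 4 + 8;

    public static byte[] writeEntry(byte[] nodeId, int count, long timestamp)
    {
        ByteBuffer buf = ByteBuffer.allocate(ENTRY_SIZE);
        buf.put(nodeId);        // IPv4 address, 4 bytes
        buf.putInt(count);      // logical count for this node
        buf.putLong(timestamp); // wall-clock timestamp, used only for truncation
        return buf.array();
    }
}
```

Reading a field back is then just an absolute get at the right offset, e.g. `ByteBuffer.wrap(entry).getInt(4)` for the count.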

        VersionVectorContext helper methods of interest:
        compareContexts(): sorts contexts by id, then steps through both contexts to determine pairwise: equality, superset, subset, disjoint.
mergeContexts(): creates a map from node id to count-timestamp pairs, then creates a timestamp-sorted array and pulls off up to the max entries to form the new merged context.
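To illustrate the pairwise relationships compareContexts() distinguishes, here is a minimal sketch over maps instead of the packed byte[]: one vector dominates another iff every node's count is greater than or equal to the other's. The class, enum, and method names are assumptions for illustration, not the patch's actual code.

```java
import java.util.*;

// Illustrative dominance check between two version vectors, represented
// here as nodeId -> count maps (the real patch works over packed byte[]).
public class VVCompare
{
    public enum Relation { EQUAL, DOMINATES, DOMINATED, DISJOINT }

    public static Relation compare(Map<String, Integer> a, Map<String, Integer> b)
    {
        boolean aGreater = false, bGreater = false;
        Set<String> ids = new HashSet<>(a.keySet());
        ids.addAll(b.keySet());
        for (String id : ids)
        {
            int ca = a.getOrDefault(id, 0);
            int cb = b.getOrDefault(id, 0);
            if (ca > cb) aGreater = true;
            if (cb > ca) bGreater = true;
        }
        if (aGreater && bGreater) return Relation.DISJOINT;
        if (aGreater) return Relation.DOMINATES;
        if (bGreater) return Relation.DOMINATED;
        return Relation.EQUAL;
    }
}
```

compare(a, b) returns DISJOINT exactly when neither vector dominates the other, which is the case reconcile() has to surface for manual resolution.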

        Kelvin Kakugawa added a comment -

        first-pass at version vector context.

        comments appreciated.

        Kelvin Kakugawa added a comment -

        Honestly, I'm starting to lean that way, as well. If anything, we can make it a configuration setting, in the future.

        Jonathan Ellis added a comment -

        If it's going to be complex – and it sounds like it is – I'd be inclined to prefer the "timestamped vector clock, with truncation" approach. (And since it's opaque to the client this can be changed later if we determine that the truncation actually is a problem in practice.)

        Kelvin Kakugawa added a comment -

        I've been talking w/ the authors of the interval tree clocks (ITC) paper about how to apply ITC to Cassandra, and it looks like we may need to modify the ITC algorithm for our use-case.

        The crux of the matter is Cassandra's hinted hand-off feature. The ITC algorithm composes an id-tree and event-tree to represent the version of a given value. The id-tree is a nice way to create unique ids on-the-fly for any node (by splitting the id-tree, as necessary) and the event-tree represents causality. However, the problem is that for a node to update the event-tree for a value, it has to be assigned a part of the id-tree beforehand.

A short example follows:
If a node tries to forward a value but (because of failure scenarios) has to store the value locally, it wouldn't be able to update the version of the value unless it had been assigned a part of the id-tree beforehand from the set of nodes responsible for the value.

        The authors have a couple of solutions:
        1) Split the id-tree between all nodes in the cluster from the very start. This solves the problem, but it does mute the attractive benefits of ITC over traditional version vectors. i.e. dynamically partitioning the id space at run-time and only to the extent necessary to conserve space.
        2) On client reads, doing a "fork" instead of a "peek" and sharing the id-tree w/ the client. However, this is a more complicated approach that may need to be worked out some more.

        In any case, since we're using an opaque context, these decisions won't affect the interface. However, it's an interesting implementation concern. Depending on the average size of a Cassandra cluster, it may or may not be worth pre-forking the id-tree to all nodes from the very start.

        Jonathan Ellis added a comment -

        LGTM

        Kelvin Kakugawa added a comment -

        use binary for context

        Jonathan Ellis added a comment -

        just use "binary" instead of list<byte>

        Kelvin Kakugawa added a comment -

        modified interface to use opaque context for versioning

        Jonathan Ellis added a comment -

        making it a byte[] context makes sense.

replacing the existing timestamps with that does not, since they are client-provided by design, which is the opposite of the VC context.

        Kelvin Kakugawa added a comment -

        Thanks Stu for the lead.

        I studied the Interval Tree Clocks paper and it looks promising. Let me work out the algorithm, so that we can investigate its implementation in Cassandra.

        Having said the above, I think we should re-do the "clock" aspect of the interface and make it an opaque context object (like Dynamo). We pass it out to clients on a read, and when they update a given value they pass back the context that they're updating. I'm not sure if we want to extend the concept so far as to make existing timestamps just a special case of an opaque context, though.

        Stu Hood added a comment -

        Regarding the timestamp being necessary in a version vector: have you looked at Interval Tree Clocks? The paper is slightly over my head, but the algorithm is supposed to be a generalization of version vectors / vector clocks, and it has a natural solution for changing members: http://en.wikipedia.org/wiki/Version_vector#cite_note-5

        Kelvin Kakugawa added a comment -

        I think you're right about the timestamps not necessarily being useful to a client. It shouldn't be part of the interface, since it's really an implementation detail to manage the size of the vector. I'll make it internal to the server.

        Jonathan Ellis added a comment -

        ... but shouldn't that timestamp be internal to the server? we definitely don't want the client to specify what amounts to an implementation detail, and i don't think there is any reason to send it to the client on a read, either.

        Jonathan Ellis added a comment -

        That makes sense, thanks.

        Kelvin Kakugawa added a comment -

        Ah, I see what you were going after. That's a more concise interface change.

        My understanding about including a timestamp along w/ the logical clock is to limit the potential growth of the vector. Basically, whenever a new node updates a value, the vector size grows by one. However, the problem is that if many different nodes happen to update a given value (for various reasons--failure scenarios, etc.), the potential size of a vector could grow to an unmanageable length and it would keep that length forever. So, the Dynamo authors chose to tag each update with a timestamp, so they could truncate the vector to only the last 10 nodes to update the value. There is a possibility that an inconsistency could arise, because of the truncation. However, the paper said in practice, it was a non-issue.

        In summary, the timestamp is not there to help resolve consistency problems, it's there to make the vector more manageable.
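The Dynamo-style truncation described above can be sketched as follows: keep only the most recently updated entries, ordered by their per-entry timestamps. The class name, the entry representation, and MAX_ENTRIES = 10 (the value the Dynamo paper reports) are assumptions for illustration, not code from this ticket.

```java
import java.util.*;

// Hypothetical sketch of timestamp-based clock truncation: each entry
// is a long[]{nodeId, count, timestamp}, and only the MAX_ENTRIES most
// recently updated nodes are kept. Names are illustrative only.
public class ClockTruncation
{
    static final int MAX_ENTRIES = 10;

    public static List<long[]> truncate(List<long[]> entries)
    {
        List<long[]> sorted = new ArrayList<>(entries);
        // sort newest timestamp first (timestamp is index 2)
        sorted.sort((x, y) -> Long.compare(y[2], x[2]));
        // drop everything past the cap; oldest entries are lost,
        // which is the (rare, per Dynamo) source of false conflicts
        return sorted.subList(0, Math.min(MAX_ENTRIES, sorted.size()));
    }
}
```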

        Jonathan Ellis added a comment -

        The reason I switched to favoring server-side-only is because it lets us get by with much smaller API changes, as in the attached.

        Also, why do we need a timestamp in LogicalClock when we have the counter? I thought the whole point was to get away from the problems posed by timestamps.

        Kelvin Kakugawa added a comment -

        rename vector remove to vector_remove (for consistency).

        Kelvin Kakugawa added a comment -

        proposed vector clock interface

        The vector clock is returned by gets and required for insert/remove. I assume that server-side conflict resolution will be implemented, so I only pass back the definitive version of the row. If client-side conflict resolution will be implemented in the future, another data structure will be needed that encapsulates a list of conflicts and, maybe, a summary vector clock to be used for an insert/remove operation that resolves the conflict.

        Jonathan Ellis added a comment -

        Yes, that's how you'd have to change it. I'd rather not; it would get messy.

        (If we do server-side resolution we could still conceivably support both "classic" columnfamilies and vector clocked ones in the same Column and SuperColumn objects, just differing in their clock/timestamp field. But if we have to potentially store multiple versions of a column in a single row for the vector clock version, then I think that is diverging too far and we'd have to split the implementation. Remember that a classic ColumnFamily object just has a hash of its columns by name.)

        Stu Hood added a comment -

        > if the client has to resolve writes, you can no longer compact without involving the client.
Kind of: you can still compact SSTables, but you need to keep all conflicting versions, which is why options (a) and (c) are still feasible.

        Kelvin Kakugawa added a comment -

        I'll take Stu and Jonathan's advice and restrict the scope of this ticket to server-side conflict resolution.

        If client-side resolution is interesting, we can pursue it in a later ticket.

        Jonathan Ellis added a comment -

        As Stu points out, if the client has to resolve writes, you can no longer compact without involving the client. This is a big big lose. +1 pluggable server-side conflict resolution from me.

        (This doesn't have to be complicated; just allow a class name to be specified per-CF like we do for CompareWith.)

        Also, I think you can make a good case that this is a better stylistic fit for Cassandra, which tries to support "dumb" clients more than Dynamo did.

        Kelvin Kakugawa added a comment -

        Right now, I'm leaning towards client-side conflict resolution.

Basically, all updates are written out and conflict resolution is handled at read time; the exception is a version in the Memtable that can be resolved syntactically. However, it would require more copies of the data and a more complex API. It would make the storage system more flexible for end users, though, since they wouldn't have to write server-side logic. However, they would have to parse a list of conflicting versions and pass back a context/summary version vector of the merged conflict.

        My reasoning is that Cassandra is write-optimized, so we should shift the burden to reads rather than writes.

        Stu Hood added a comment - - edited

        One important difference between Cassandra and some other systems utilizing vector clocks is that in Cassandra, writes do not read from disk any old versions of the value that is being written to. This means that conflict resolution currently happens in 3 different places:
        1. At read time - Two nodes have different versions,
        2. At write time - The version being written is older than the version a node has in a Memtable,
        3. At compaction time - Versions of values persisted to different SSTables disagree.

        NB: For the purposes of this ticket, I think that all resolution should be handled server side, deterministically, and that one of the following options should be implemented as part of a separate ticket.

        But before too much progress is made, we will probably want to decide whether we want to support:
        a) Client side conflict resolution (logic implemented on the client side),
        b) Server side resolution (pluggable logic on the server),
c) A hybrid (pluggable resolution, which can optionally send the versions to the client at resolution time #1 or #2)

        If we decide to implement client-side resolution (a), then we will need to remove resolution at steps #2 and #3 (or make it optional for option (c)), and keep more copies of the data. For #2, a Memtable could store conflicting versions in memory until they are resolved by a read or flushed to disk. For #3, SSTables will need to be able to store multiple versions of a row/cf until they are resolved by a read.

        Kelvin Kakugawa added a comment -

        Thanks Mateusz, I believe you are right that we want the version vector variant.

        Mateusz Berezecki added a comment -

        you probably want this: http://en.wikipedia.org/wiki/Version_vector

        instead of : http://en.wikipedia.org/wiki/Vector_clock

        see the section on different update rules.


          People

          • Assignee:
            Kelvin Kakugawa
            Reporter:
            Kelvin Kakugawa
• Votes: 14
  Watchers: 40

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 682h
              682h
              Remaining:
              Remaining Estimate - 682h
              682h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development