Cassandra / CASSANDRA-2045

Simplify HH to decrease read load when nodes come back

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

      Description

      Currently when HH is enabled, hints are stored, and when a node comes back, we begin sending that node data. We do a lookup on the local node for the row to send. To help reduce read load (if a node is offline for a long period of time) we should store the data we want to forward to the node locally instead. We wouldn't have to do any lookups; we could just take the byte[] and send it to the destination.
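
      The core of the proposal (store the serialized mutation bytes locally and forward them verbatim, instead of storing a pointer and re-reading the row at delivery time) can be sketched roughly as follows. This is a hedged illustration; the class and method names are hypothetical and not the actual HintedHandOffManager API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of hint storage that keeps the serialized mutation itself,
// so delivery requires no local row lookup.
class HintStore
{
    // pending hints per target endpoint: the serialized mutation bytes, ready to send
    private final Map<String, List<byte[]>> hints = new HashMap<>();

    // write path: remember the exact bytes we would have sent to the down node
    void storeHint(String endpoint, byte[] serializedMutation)
    {
        hints.computeIfAbsent(endpoint, k -> new ArrayList<>()).add(serializedMutation);
    }

    // delivery path: no read of the data table, just replay the stored bytes
    List<byte[]> drainHints(String endpoint)
    {
        List<byte[]> pending = hints.remove(endpoint);
        return pending == null ? Collections.emptyList() : pending;
    }
}
```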

      1. 0001-Changed-storage-of-Hints-to-store-a-serialized-RowMu.patch
        13 kB
        Nicholas Telford
      2. 0002-Refactored-HintedHandoffManager.sendRow-to-reduce-co.patch
        2 kB
        Nicholas Telford
      3. 0003-Fixed-some-coding-style-issues.patch
        3 kB
        Nicholas Telford
      4. 0004-Fixed-direct-usage-of-Gossiper.getEndpointStateForEn.patch
        1 kB
        Nicholas Telford
      5. 0005-Removed-duplicate-failure-detection-conditionals.-It.patch
        2 kB
        Nicholas Telford
      6. 0006-Removed-handling-of-old-style-hints.patch
        7 kB
        Nicholas Telford
      7. 2045-v3.txt
        17 kB
        Jonathan Ellis
      8. 2045-v5.txt
        17 kB
        Jonathan Ellis
      9. 2045-v6.txt
        17 kB
        Jonathan Ellis
      10. CASSANDRA-2045-simplify-hinted-handoff-001.diff
        12 kB
        Nicholas Telford
      11. CASSANDRA-2045-simplify-hinted-handoff-002.diff
        12 kB
        Nicholas Telford
      12. CASSANDRA-2045-v4.diff
        12 kB
        Nicholas Telford

        Activity

        Jonathan Ellis added a comment -

        It's a tradeoff – storing the full mutation would be a much bigger hit on writes. Since we store hints on other replicas, storing just a "pointer" is almost free.

        when we were doing the more expensive HH of storing the hint + row on non-replica nodes, we saw a lot of cascading failures; this would be a similar bump in extra work done when HH kicks in, so that scares me.

        Chris Goffinet added a comment - edited

        I disagree; storing a pointer is not almost free. The trade-off you make is read performance when nodes come back up. At the moment read performance could very well cause cascading failures too. You trade storage vs. I/O. Nodes down for long periods of time would have a bigger impact on the nodes trying to send HH data.

        Can you clarify: aren't we sending the entire row as well? We might be doing paging, but if I modify 1 column and the row has 1M columns, 1M + 1 columns still get sent? I would agree to having it be tunable, but I prefer consistency of performance over storage when it comes to I/O.

        We brought this up because we actually saw this in our live clusters.

        Ryan King added a comment - edited

        I think the two approaches are suitable for different kinds of data models. The pointer approach is almost certainly better for narrow rows, while worse for large, dynamic rows.

        Jonathan Ellis added a comment -

        Hmm. We could probably pick which one to use based on the CF histograms we collect now.

        Brandon Williams added a comment -

        Do we still need to get this fancy now that we have CASSANDRA-2161?

        Chris Goffinet added a comment -

        Yes. I think the trade-off of storage vs. reads is still worth it, even if you throttle.

        Jonathan Ellis added a comment -

        You want to use the pointer approach when your ratio of overwrites to row size is sufficiently high – the biggest win there is when you can turn dozens or hundreds of mutations into replay of just the latest version.

        Not sure what the best way to estimate that is – Brandon suggested checking SSTable bloom filters on writes, which is probably low-overhead enough, especially if we do it for only 10% of writes, for instance. I kind of like that idea; I think it will be useful in multiple places down the road.

        ("Sufficiently high" depends on SSD vs magnetic – time to introduce a postgresql-like random vs sequential penalty setting?)
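
        The sampling idea could be sketched like this. This is only an illustration under assumptions: the bloom-filter lookup itself is abstracted into a boolean the caller passes in, and the class name is hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: only a sampled fraction of writes pay for the bloom-filter
// check used to estimate the overwrite ratio of a column family.
class OverwriteEstimator
{
    private final double sampleRate;   // e.g. 0.1 == check roughly 10% of writes
    private long sampled;
    private long overwrites;

    OverwriteEstimator(double sampleRate)
    {
        this.sampleRate = sampleRate;
    }

    // keyPossiblyPresent is the result of a bloom-filter check
    // performed by the caller (a "maybe present" answer counts as an overwrite)
    void onWrite(boolean keyPossiblyPresent)
    {
        if (ThreadLocalRandom.current().nextDouble() >= sampleRate)
            return;                    // skip the check for most writes
        sampled++;
        if (keyPossiblyPresent)
            overwrites++;
    }

    double estimatedOverwriteRatio()
    {
        return sampled == 0 ? 0.0 : (double) overwrites / sampled;
    }
}
```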

        Edward Capriolo added a comment -

        I wanted to suggest a few ideas.

        • Commit log
          Store hints in commit log format.
          Stream the commit log to the awoken host.
          Let the awoken host worry about replay throttling.
        • Store hints in a separate physical column family, still with hinted pointers.
          When finding data for the "pointer", the read only has to go through the much smaller hinted handoff table rather than the actual data table.
        • Store hints in a separate physical column family (i.e. write twice).
          Then stream the files to the awoken node.
        Nicholas Telford added a comment -

        I've been looking in to this and I have a few observations/questions, although I'm still quite new to the Cassandra codebase, so if I'm wrong, please let me know.

        • Currently, when a node receives a RowMutation containing a hint, it stores it to the application CF and places a hint in the system hints CF. This is fine in the general case, but writes using CL.ANY may result in hinted RowMutations being sent to nodes that don't own that key. They still write the RowMutation to their application CF so they can pass it on to the destination node when it recovers. But this data is only ever deleted during a manual cleanup. Doesn't this mean that, given a very unstable cluster (e.g. EC2) writes using CL.ANY can cause nodes to fill up with data unexpectedly quickly?
        • The JavaDoc for HintedHandOffManager mentions another issue caused by the current strategy: cleanup compactions on the application CF will cause the hints to become invalid. It goes on to suggest a strategy similar to what's being discussed here (placing the individual RowMutations in a separate HH CF).
        • It's probably a good idea to try to retain backwards compatibility here as much as possible so that rolling upgrades of a cluster are possible - hints stored by the old version need to be deliverable to nodes coming back up with the new version, and vice versa.
        • I think Edward's idea of storing hints in a per-node CommitLog is a pretty elegant solution; unfortunately, it's quite a lot more invasive and would be a nightmare for maintaining backwards compatibility. Thoughts?
        Jonathan Ellis added a comment -

        Doesn't this mean that, given a very unstable cluster (e.g. EC2) writes using CL.ANY can cause nodes to fill up with data unexpectedly quickly?

        Sort of. It means you can fill up by at most 1/RF faster than you thought, yes, since rows can only be stored on at most one node that is not a replica (the coordinator). The correct fix for that is "stabilize your cluster."

        It's probably a good idea to try to retain backwards compatibility here as much as possible so that rolling upgrades of a cluster is possible

        Right, but as discussed above we're not planning to move to materialized-hints entirely, so ripping out "classic" hints isn't an option anyway.

        I think Edward's idea of storing hints in a per-node CommitLog is a pretty elegant solution, unfortunately it's quite a lot more invasive and would be a nightmare for maintaining backwards compatibility.

        Serialized mutation objects as columns in a row are pretty close to commitlog format, only you can query it w/ normal tools.

        Nicholas Telford added a comment -

        Implements serialized RowMutations for Hints.

        This should be optional, but currently isn't. The "if (true)" should be replaced with some logic to either detect the appropriate strategy from the CF histogram or using a manual per-CF setting. I've left this out for now pending a consensus on the matter.

        I'm not hugely familiar with the Cassandra codebase, so it's quite possible I've missed something.

        Unit tests are currently missing, I'll get those sorted out next. I wanted to get feedback on the implementation before continuing.

        I've optimised the patch for fewest changes, as such there's lots of room for refactoring (e.g. HHM.sendRow() and HHM.sendMutation() share a lot of validation code).

        Importantly, the RowMutations are indexed under a sub-column representing the MessagingService.version_ that serialized them. This allows nodes running on a different version to classify these hints as invalid and discard them.
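
        The version tagging described above might look roughly like this as a standalone sketch. The names here are hypothetical (the actual patch stores the version as a sub-column key); this only illustrates keeping the serialization version next to the payload so a node on a different MessagingService version can recognize and discard hints it cannot deserialize.

```java
import java.nio.ByteBuffer;

// Sketch: a hint payload tagged with the MessagingService version
// that serialized it, encoded as <version:int><payload bytes>.
class VersionedHint
{
    final int version;     // serialization version of the writing node
    final byte[] payload;  // serialized RowMutation bytes

    VersionedHint(int version, byte[] payload)
    {
        this.version = version;
        this.payload = payload;
    }

    // encode for storage in a single column value
    ByteBuffer encode()
    {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(version).put(payload);
        buf.flip();
        return buf;
    }

    static VersionedHint decode(ByteBuffer buf)
    {
        int v = buf.getInt();
        byte[] p = new byte[buf.remaining()];
        buf.get(p);
        return new VersionedHint(v, p);
    }

    // a node only replays hints written at its own serialization version
    boolean readableAt(int localVersion)
    {
        return version == localVersion;
    }
}
```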

        Jonathan Ellis added a comment -

        Looks pretty good to me. I'd only add that you should use RowMutation.getSerializedBuffer (which caches, so you don't redo the serialize unnecessarily for the commitlog as well), instead of manually using the serializer. And of course the obvious about following the C* code style (http://wiki.apache.org/cassandra/CodeStyle).

        The more I think about it the less I think it's worth keeping the old-style hinting around. The cleanup caveat (cleanup will throw out rows that don't belong to this replica, even if they have hints) is a pretty big one, even if it's fairly obscure (rows that don't belong will only be hinted for CL.ANY when all other replicas are down).

        Nicholas Telford added a comment - edited
        • Rebased patch to latest trunk.
        • Changed serialization to use cached RowMutation.getSerializedBuffer

        I agree with your thoughts on removing the old-style hinting, although I haven't given much thought to the impact on compatibility. As far as I can tell, using my patch and removing the old style hinting would be a matter of removing hintedMutation.apply(); from RowMutationVerbHandler@63. For now, I've left it as-is.

        I tried to keep to the CodingStyle as much as possible; is there anything specific you noticed that was wrong?

        Btw, the tests I promised have been delayed by nightmarish work-schedule. I'll try to get them sorted ASAP though.

        Jonathan Ellis added a comment -

        Code style is generally fine, though there are a few violations of brace-on-newline.

        Nicholas Telford added a comment -

        Attached is my current patchset for this issue, rebased against trunk as of 23rd June.

        Patches 2, 5 and 6 are optional. 2 and 5 simply remove what I perceive to be redundant code; 6 entirely removes the old-style hint storage.

        Thanks for the tip on coding style, I actually noticed my mistakes shortly after asking - those damn braces, old habits die hard!

        Jonathan Ellis added a comment -

        Sorry, patch failures on 0001 already.

        Nicholas Telford added a comment -

        It's quite possible that my previous patchset wasn't quite right. I'm still getting used to the git/apache workflow.

        I've rebased against trunk again and tested this set with `git am`. Should apply now.

        Jonathan Ellis added a comment -

        Thanks!

        For v3 I merged patches and did some cleanup.

        It looks like we do a query per hint to look up its version on replay? I think we can avoid that (one of the benefits of the new approach is we should be able to just do seq reads of a hint row on replay). Why not just add version in as another subcolumn of the hint entry?

        HHOM javadoc needs to be updated.

        Minor: looks like HHOM.getTableAndCFNames is unused, can be removed.

        Jonathan Ellis added a comment -

        Thought of something else: if we're storing the full mutation, why add the complexity of hint headers and forwarding? Can we just make the coordinator responsible for all hints instead?

        Nicholas Telford added a comment -

        What if the coordinator happens to be one of the replicas for that key? Having the coordinator store the hint would mean it wasn't replicated at the replication_factor. The same is true for a coordinator that's not a replica for a key, but has to store a hint for multiple nodes (i.e. when multiple replicas are down).

        I don't like this; I was under the impression that HintedHandoff helps to retain the replication factor even in the face of failed replicas.

        Jonathan Ellis added a comment -

        I was under the impression that HintedHandoff helps to retain the replication factor even in the face of failed replicas.

        Nope. If you require N replicas to be written, then you should use an appropriate consistency level.

        In 0.6+ hints are stored to other live replicas whenever possible (i.e. you still have fewer total replicas written) unless, as you noted, no replicas are alive and you're writing at CL.ANY.

        So my point is that after we move away from storing hints as pointers to row data, there's no reason for the "prefer other replicas" optimization, so we might as well just always store it on the coordinator.

        Nicholas Telford added a comment -

        It looks like we do a query per hint to look up its version on replay? I think we can avoid that (one of the benefits of the new approach is we should be able to just do seq reads of a hint row on replay). Why not just add version in as another subcolumn of the hint entry?

        I don't quite follow this. The new schema for hints doesn't really allow sequential reads of the row. Here's what I currently have:

        Old
        -----
        Hints: {                    // cf
          <dest ip>: {              // key
            <key>: {                // super-column
              <table>-<cf>: null    // column
            }
          }
        }
        
        New
        ------
        Hints: {                    // cf
          <dest ip>: {              // key
            <key>: {                // super-column
              <table>-<cf>: <id>    // column
            }
          }
        }
        
        HintedMutations: {          // cf
          <dest ip>: {              // key
            <id>: {                 // super-column
              <version>: <mutation> // column
            }
          }
        }
        

        The point was to retain backwards compatibility with the old hints (so we don't have to expunge old ones on upgrade), but if we feel that we gain more by breaking this compatibility, I'm open to it. As has been previously mentioned, losing hints during an upgrade isn't the end of the world, as they're little more than an optimization.

        Nicholas Telford added a comment -

        if we're storing the full mutation, why add the complexity of hint headers and forwarding? Can we just make the coordinator responsible for all hints instead?

        So my point is that after we move away from storing hints as pointers to row data, there's no reason for the "prefer other replicas" optimization so we might as well just always store it on the coordinator.

        While I agree with this, it seems that changing this is non-trivial (lots of changes to StorageProxy by the looks of it) so I'm leaning towards not including it in this ticket. It seems like an isolated idea though, albeit one that depends on this issue. Can we open this as a dependent ticket?

        Nicholas Telford added a comment -

        Another consideration: If we're moving away from the old hint storage layout, we can optimize for cases where the same RowMutation needs to be delivered to multiple endpoints (i.e. multiple replicas are down). This can be done by moving the "destination IP" down to the bottom level of the map so each RowMutation maps to multiple destinations.

        Thoughts?

        Brandon Williams added a comment -

        That's bad though, because then we can't access hints efficiently on a node up/down message (we actually did it that way in 0.6 and learned our lesson.)

        Jonathan Ellis added a comment - edited

        losing hints during upgrade isn't the end of the world

        Right. I'm saying we should do this:

        Hints: {                    // cf
          <dest ip>: {              // key
            <key>: {                // super-column
              <table>-<cf>: <id>    // column
              mutation: <mutation>  // column
            }
          }
        }
        

        So we denormalize but we gain not having to do secondary-lookup-per-mutation, which is our main motivation for the change. (And single-destination-per-hint is by far the common case.)

        Can we open this as a dependent ticket?

        WFM.

        Nicholas Telford added a comment -

        That's bad though, because then we can't access hints efficiently on a node up/down message (we actually did it that way in 0.6 and learned our lesson.)

        Good point. I retract that idea.

        So we denormalize but we gain not having to do secondary-lookup-per-mutation, which is our main motivation for the change. (And single-destination-per-hint is by far the common case.)

        I'm a bit confused here. There could be many mutations for a single key; we'd need to store each of them. I do like the idea of being able to slice the mutations, though. Perhaps we could form the key from a compound of key-table-cf, so it would look something like this:

        Hints: {                    // cf
          <dest ip>: {              // key
            <key>-<table>-<cf>: {   // super-column
              <version>: <mutation> // column
            }
          }
        }
        

        Or is it vital that the key is stored separately from the table and cf?

        Jonathan Ellis added a comment - - edited

        oops, didn't look too closely to what I was pasting.

        Hints: {                    // cf
          <dest ip>: {              // key
            <uuid>: {               // super-column
              table: <table>        // columns
              key: <key>
              mutation: <mutation>  
            }
          }
        }
        

        (Mutations can contain multiple CFs so storing a single CF value wouldn't make sense.)
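        The layout above can be sketched in plain Java to show why delivery no longer needs a secondary lookup. This is an illustrative model only (nested maps standing in for the Hints column family and its supercolumns), not Cassandra's actual storage API; all class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Illustrative model of the agreed-on hint layout (hypothetical names, not
// Cassandra's API): the Hints CF is keyed by destination endpoint; each hint
// is a supercolumn named by a UUID whose subcolumns carry everything needed
// to replay it, including the serialized mutation itself.
public class HintLayoutSketch {
    static class Hint {
        final String table;    // keyspace the mutation targets
        final byte[] key;      // row key of the original mutation
        final byte[] mutation; // serialized RowMutation bytes
        Hint(String table, byte[] key, byte[] mutation) {
            this.table = table; this.key = key; this.mutation = mutation;
        }
    }

    // destination endpoint -> (hint id -> hint columns)
    static final Map<String, Map<UUID, Hint>> hints = new LinkedHashMap<>();

    static void storeHint(String destIp, String table, byte[] key, byte[] serializedMutation) {
        hints.computeIfAbsent(destIp, ip -> new LinkedHashMap<>())
             .put(UUID.randomUUID(), new Hint(table, key, serializedMutation));
    }

    // Delivery needs no secondary lookup: the serialized mutation is denormalized
    // into the hint, so we just collect the bytes to put on the wire.
    static List<byte[]> drainHintsFor(String destIp) {
        Map<UUID, Hint> pending = hints.remove(destIp);
        List<byte[]> toSend = new ArrayList<>();
        if (pending != null)
            for (Hint h : pending.values())
                toSend.add(h.mutation);
        return toSend;
    }

    public static void main(String[] args) {
        byte[] rm = "serialized-row-mutation".getBytes(StandardCharsets.UTF_8);
        storeHint("10.0.0.2", "Keyspace1", "row1".getBytes(StandardCharsets.UTF_8), rm);
        storeHint("10.0.0.2", "Keyspace1", "row2".getBytes(StandardCharsets.UTF_8), rm);
        System.out.println("hints queued for 10.0.0.2: " + drainHintsFor("10.0.0.2").size());
        System.out.println("hints left for 10.0.0.2: " + (hints.containsKey("10.0.0.2") ? "some" : "none"));
    }
}
```

        The cost of this design is the write amplification Jonathan mentions earlier in the thread: the full mutation is copied into the hint row instead of a small pointer.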

        Nicholas Telford added a comment -

        Ok, that makes sense. I've implemented this in my tree (albeit with an additional "version" column to store the serialization version). I won't post the patch yet as I need to go through it all and ensure it's correct and clean it up a little.

        As an aside: while digging into RowMutationSerializer, I noticed that the "version" passed to deserialize() is ignored - is this intentional?

        Jonathan Ellis added a comment -

        I noticed that the "version" passed to deserialize() is ignored - is this intentional

        Just means RM serialization hasn't changed since we started versioning the protocol.
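        The pattern being described can be sketched generically: callers always pass the wire version, and the deserializer ignores it until the first time the on-wire layout actually changes. This is a hedged, self-contained illustration with hypothetical version constants, not the real RowMutationSerializer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Generic sketch of the versioned-serializer pattern (illustrative names):
// the version parameter is threaded through from the messaging layer so a
// future format change only needs a branch inside deserialize().
public class VersionedSerializerSketch {
    static final int VERSION_080 = 2; // hypothetical protocol version numbers
    static final int VERSION_100 = 3;

    static byte[] serialize(String table, String key) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(table);
        out.writeUTF(key);
        return bos.toByteArray();
    }

    static String[] deserialize(byte[] bytes, int version) throws IOException {
        // 'version' is unused today because the format has never changed;
        // it becomes a branch point later, e.g. if (version < VERSION_100) {...}
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        return new String[] { in.readUTF(), in.readUTF() };
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize("Keyspace1", "row1");
        String[] fields = deserialize(bytes, VERSION_100);
        System.out.println(fields[0] + "/" + fields[1]);
    }
}
```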

        Nicholas Telford added a comment -

        Since we're now deserializing the RowMutation we find for a hint, we're going to need to start at least ensuring the version matches MessagingService.version_. At the moment, this is done by HHOM, but I'd feel more comfortable having it in RowMutationSerializer.deserialize().

        What would be preferable here? I was thinking of throwing an Exception in deserialize, but I'm concerned about other callers not handling it properly.

        Jonathan Ellis added a comment -

        It's fine the way it is. It's simpler if the deserializers can assume they will get a valid version.

        Patricio Echague added a comment -

        Nicholas, what is the status of this ticket? I'm willing to help if you need me to. Please let me know.

        Nicholas Telford added a comment -

        Sorry about the delay to this, my free time this month has been pretty sparse.

        This patch combines all previous changes and uses the new hint storage format we discussed above.

        On my machine, a pile of unrelated tests are failing (mostly in SSTable.estimateRowsFromIndex), but I suspect this is an issue with my test setup.

        If you prefer, the full workflow for this patch is available on GitHub: https://github.com/nicktelford/cassandra/tree/CASSANDRA-2045

        Jonathan Ellis added a comment -

        if we're storing the full mutation, why add the complexity of hint headers and forwarding? Can we just make the coordinator responsible for all hints instead?

        Created CASSANDRA-2914 for this followup.

        Jonathan Ellis added a comment -

        v5 fixes a couple bugs and updates the javadoc for HintedHandoffManager.

        Patricio, can you test that hint creation / replay works as expected?

        Patricio Echague added a comment -

        Getting this:

        ERROR [MutationStage:40] 2011-07-19 11:52:58,649 AbstractCassandraDaemon.java (line 138) Fatal exception in thread Thread[MutationStage:40,5,main]
        java.lang.AssertionError: -836382071
        	at org.apache.cassandra.db.ExpiringColumn.<init>(ExpiringColumn.java:55)
        	at org.apache.cassandra.db.ExpiringColumn.<init>(ExpiringColumn.java:48)
        	at org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:135)
        	at org.apache.cassandra.db.RowMutation.add(RowMutation.java:173)
        	at org.apache.cassandra.db.RowMutation.hintFor(RowMutation.java:114)
        	at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:61)
        	at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        	at java.lang.Thread.run(Thread.java:680)
        
        Jonathan Ellis added a comment -

        v6 fixes some variable name confusion in RM.hintFor, and deserialization from a direct buffer in replay.
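        The direct-buffer issue is a classic pitfall worth spelling out: a direct ByteBuffer has no backing array, so code that assumes buffer.array() before wrapping the bytes in a stream will throw. A minimal sketch of the safe pattern (illustrative only, not the actual v6 patch):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Sketch of handling both heap and direct buffers before deserializing
// (hypothetical helper, not Cassandra's code): direct buffers must be
// copied out via get(), since array() is unsupported for them.
public class DirectBufferReplaySketch {
    static DataInputStream streamFor(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        if (buf.hasArray()) {
            // heap buffer: copy straight from the backing array
            System.arraycopy(buf.array(), buf.arrayOffset() + buf.position(), bytes, 0, buf.remaining());
        } else {
            // direct buffer: no backing array, copy through get()
            // (duplicate() so the caller's position is untouched)
            buf.duplicate().get(bytes);
        }
        return new DataInputStream(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer direct = ByteBuffer.allocateDirect(8);
        direct.putInt(42).putInt(7).flip();
        DataInputStream in = streamFor(direct);
        System.out.println(in.readInt() + " " + in.readInt());
    }
}
```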

        Patricio Echague added a comment -

        Jonathan/all, hint delivery seems to be broken in trunk. However, manually triggering deliverHints() through JConsole effectively delivers the hints to the replica and removes them from the sender.

        Jonathan Ellis added a comment -

        So delivery is broken prior to this ticket?

        Brandon Williams added a comment -

        I think CASSANDRA-2668 broke it. state.hasToken() is set in the Gossiper's status check, which won't have happened when the onAlive event is sent to SS. hasToken relies entirely on TM.isMember though, so maybe that is the better check.

        Jonathan Ellis added a comment -

        state.hasToken() is set in the Gossiper's status check, which won't have happened when the onAlive event is sent to SS.

        It looks to me though like once set it's permanent. How does it get un-set?

        Brandon Williams added a comment -

        It never gets unset, since a participating node can never suddenly become a fat client. However, the Ack and Ack2 verb handlers are applying a new ep state every time there is a generation change via Gossiper.applyStateLocally, so it's always unset initially when the node starts up. We should probably make hasToken private, since really that is what the Gossiper uses internally to determine if it needs to evict a dead fat client.

        Jonathan Ellis added a comment -

        Created CASSANDRA-2928 to fix this.

        Patricio Echague added a comment -

        Tested with CASSANDRA-2928 patch and it works perfectly.

        Test environment:

        • 2 nodes on localhost (127.0.0.2 and 127.0.0.3)

        Test case:

        • start both nodes
        • create the schema for testing
        • stop node 1
        • insert 5 keys into node 2
        • verify that HintsColumnFamily has 5 entries on node 2
        • start node 1
        • verify that node 1 has the new data
        • verify that node 2 deleted the delivered hints
        Hudson added a comment -

        Integrated in Cassandra #968 (See https://builds.apache.org/job/Cassandra/968/)
        store hints as serialized mutations instead of pointers to data rows
        patch by Nick Telford, jbellis, and Patricio Echague for CASSANDRA-2045

        jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1149396
        Files :

        • /cassandra/trunk/src/java/org/apache/cassandra/db/RowMutation.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/RowMutationVerbHandler.java
        • /cassandra/trunk/CHANGES.txt
        • /cassandra/trunk/src/java/org/apache/cassandra/db/HintedHandOffManager.java

          People

          • Assignee:
            Nicholas Telford
            Reporter:
            Chris Goffinet
            Reviewer:
            Jonathan Ellis
          • Votes:
            2 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development