Cassandra

Make Read Repair unnecessary when Hinted Handoff is enabled

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: Core
    • Labels: None

      Description

      Currently, HH is purely an optimization – if a machine goes down, enabling HH means RR/AES will have less work to do, but you can't disable RR entirely in most situations since HH doesn't kick in until the FailureDetector does.

      Let's add a scheduled task to the mutate path, such that we return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets.

      This would make disabling RR when HH is enabled a much more reasonable option, which has a huge impact on read throughput.
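
      As a rough, hypothetical sketch only (none of these names are the real API, and the discussion below replaces the scheduled task with a hook on callback expiration), the proposal reads roughly as:

          import java.net.InetAddress;
          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;

          // Illustrative sketch: the client is acked as soon as ConsistencyLevel is met;
          // after rpc_timeout we hint every replica that still has not responded.
          class AggressiveHintSketch
          {
              private static final ScheduledExecutorService SCHEDULER = Executors.newSingleThreadScheduledExecutor();
              private static final long RPC_TIMEOUT_MS = 10000;

              // stand-in for the response handler's view of replicas still missing an ack
              private final Set<InetAddress> pendingAcks = ConcurrentHashMap.newKeySet();

              void onConsistencyLevelAchieved(final Object mutation)
              {
                  SCHEDULER.schedule(new Runnable()
                  {
                      public void run()
                      {
                          for (InetAddress replica : pendingAcks)
                              writeLocalHint(replica, mutation);   // replayed later by hinted handoff
                      }
                  }, RPC_TIMEOUT_MS, TimeUnit.MILLISECONDS);
              }

              private void writeLocalHint(InetAddress replica, Object mutation)
              {
                  // placeholder: the real code stores a hint row locally
              }
          }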

      1. 2034-v21.txt
        79 kB
        Patricio Echague
      2. 2034-v20.txt
        75 kB
        Patricio Echague
      3. 2034-v19-rebased.txt
        70 kB
        Jonathan Ellis
      4. 2034-v19.txt
        69 kB
        Patricio Echague
      5. 2034-v18.txt
        28 kB
        Jonathan Ellis
      6. 2034-v17.txt
        65 kB
        Patricio Echague
      7. 2034-v16.txt
        58 kB
        Jonathan Ellis
      8. CASSANDRA-2034-trunk-v15.patch
        66 kB
        Patricio Echague
      9. CASSANDRA-2034-trunk-v14.patch
        71 kB
        Patricio Echague
      10. CASSANDRA-2034-trunk-v13.patch
        63 kB
        Patricio Echague
      11. CASSANDRA-2034-trunk-v12.patch
        39 kB
        Patricio Echague
      12. CASSANDRA-2034-trunk-v11.patch
        39 kB
        Patricio Echague
      13. CASSANDRA-2034-trunk-v11.patch
        37 kB
        Patricio Echague
      14. CASSANDRA-2034-trunk-v10.patch
        34 kB
        Patricio Echague
      15. CASSANDRA-2034-trunk-v9.patch
        36 kB
        Patricio Echague
      16. CASSANDRA-2034-trunk-v8.patch
        35 kB
        Patricio Echague
      17. CASSANDRA-2034-trunk-v7.patch
        40 kB
        Patricio Echague
      18. CASSANDRA-2034-trunk-v6.patch
        29 kB
        Patricio Echague
      19. 2034-formatting.txt
        18 kB
        Jonathan Ellis
      20. CASSANDRA-2034-trunk-v5.patch
        31 kB
        Patricio Echague
      21. CASSANDRA-2034-trunk-v4.patch
        28 kB
        Patricio Echague
      22. CASSANDRA-2034-trunk-v3.patch
        25 kB
        Patricio Echague
      23. CASSANDRA-2034-trunk-v2.patch
        23 kB
        Patricio Echague
      24. CASSANDRA-2034-trunk.patch
        20 kB
        Patricio Echague

        Issue Links

          Activity

          Jonathan Ellis added a comment -

          This would need a separate executor for local writes that doesn't drop writes when it's behind (and blocks when it's full) to avoid problems in overcapacity situations.

          I'm about halfway convinced that while blocking clients for writes to

          {any node in the cluster}

          is bad to control overload situations, blocking clients when the coordinator itself is overloaded is ok.

          Jonathan Ellis added a comment -

          add a scheduled task

          this is the wrong approach, as we found out when we tried something similar for read repair, which we fixed in CASSANDRA-2069.

          Better would be to add a hook to messagingservice callback expiration, and fire hint recording from there if MS expires the callback before all acks are received. (We could refactor the dynamic snitch latency update into a similar hook for reads.)

          This would need a separate executor for local writes that doesn't drop writes when it's behind

          I'm more worried about this; there is the potential to take us back to the Bad Old Days when HH could cause cascading failure. (Of course the right answer is, "Don't run your cluster so close to the edge of capacity," but we still want to degrade gracefully when this is ignored.)

          Jonathan Ellis added a comment -

          there is the potential to take us back to the Bad Old Days when HH could cause cascading failure

          To elaborate, the scenario here is, we did a write that succeeded on some nodes, but not others. So we need to write a local hint to replay to the down-or-slow nodes later. But, those nodes being down-or-slow mean load has increased on the rest of the cluster, and writing the extra hint will increase that further, possibly enough that other nodes will see this coordinator as down-or-slow, too, and so on.

          So I think what we want to do, with this option on, is to attempt the hint write but if we can't do it in a reasonable time, throw back a TimedOutException which is already our signal that "your cluster may be overloaded, you need to back off."

          Specifically, we could add a separate executor here, with a blocking, capped queue. When we go to do a hint-after-failure we enqueue the write but if it is rejected because queue is full we throw the TOE. Otherwise, we wait for the write and then return success to the client.

          The tricky part is the queue needs to be large enough to handle load spikes but small enough that wait-for-success-post-enqueue is negligible compared to RpcTimeout. If we had different timeouts for writes than reads (which we don't – CASSANDRA-959) then it might be nice to use say 80% of the timeout for the normal write, and reserve 20% for the hint phase.
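
          Purely as an illustration (hypothetical names and sizes, not the committed code), the capped, blocking hint executor described above could look something like this, with rejection surfaced as the usual back-off signal:

            import java.util.concurrent.ArrayBlockingQueue;
            import java.util.concurrent.ExecutionException;
            import java.util.concurrent.Future;
            import java.util.concurrent.RejectedExecutionException;
            import java.util.concurrent.ThreadPoolExecutor;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.TimeoutException;

            // Hypothetical sketch: a bounded hint executor whose "queue full" condition
            // is reported to the client as a timeout.
            class CappedHintExecutorSketch
            {
                private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
                        1, 1, 1, TimeUnit.MINUTES,
                        new ArrayBlockingQueue<Runnable>(1024),   // big enough for spikes, small enough to bound the wait
                        new ThreadPoolExecutor.AbortPolicy());    // full queue -> RejectedExecutionException

                void hintOrTimeOut(Runnable hintWrite, long hintPhaseTimeoutMs) throws TimeoutException
                {
                    Future<?> future;
                    try
                    {
                        future = executor.submit(hintWrite);
                    }
                    catch (RejectedExecutionException e)
                    {
                        throw new TimeoutException("hint queue full; cluster may be overloaded");
                    }
                    try
                    {
                        future.get(hintPhaseTimeoutMs, TimeUnit.MILLISECONDS);   // e.g. the 20% reserved for the hint phase
                    }
                    catch (InterruptedException e)
                    {
                        throw new RuntimeException(e);
                    }
                    catch (ExecutionException e)
                    {
                        throw new RuntimeException(e.getCause());
                    }
                }
            }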

          T Jake Luciani added a comment -

          Don't hints timeout? would there be a chance of never resolving the discrepancy with this approach?

          Stu Hood added a comment -

          Better would be to add a hook to messagingservice callback expiration, and fire hint recording from there

          So I think what we want to do, with this option on, is to attempt the hint write but if we can't do it in a reasonable time...

          +1. An expiration handler on the messaging service queue, and throttled local hint writing should work very well.

          Jonathan Ellis added a comment -

          Don't hints timeout?

          No (but cleanup can purge hinted rows, so don't do that unless all hints have been replayed).

          would there be a chance of never resolving the discrepancy with this approach?

          Definitely in the case of "a node went down, so I wrote some hints, but then the hinted node lost its hdd too." In that case you'd need to run AE repair.

          So the idea is not to make AE repair (completely) obsolete, only RR.

          Stu Hood added a comment -

          IMO, disabling RR entirely is never a good idea unless we are going to guarantee hint delivery. But I agree that this ticket is a good idea because increasing the probability of hint delivery is healthy.

          Jonathan Ellis added a comment -

          We should probably make this "aggressive HH" optional to start with, in case of bugs if nothing else.

          Or maybe change hinted_handoff_enabled from a bool to a string – "off", "old", "aggressive".

          Jonathan Ellis added a comment -

          Note that hint writes MUST be synchronous w/ the writes-as-perceived-by-client (which the design above accomplishes) or else you lose hints silently if a coordinator crashes (or is shut down/killed normally).

          Jonathan Ellis added a comment -

          If we had different timeouts for writes than reads then it might be nice to use say 80% of the timeout for the normal write, and reserve 20% for the hint phase

          Different r/w timeouts is being added in CASSANDRA-2819.

          Patricio Echague added a comment -

          Depends on CASSANDRA-2045.

          Nicholas Telford added a comment - - edited

          Don't hints timeout?

          Yes, the timeout is set to the GCGraceSeconds of the CF the hint is for. This is to prevent deletes from being undone by an old hint being replayed.

          Post-#2045 hints contain the RM for multiple CFs; as such, the TTL for a hint is the minimum GCGraceSeconds from all of the CFs it references. This could cause hints to expire before delivery if one of the CFs for a mutation has a particularly short GCGraceSeconds.

          Jonathan Ellis added a comment -

          the timeout is set to the GCGraceSeconds of the CF the hint is for

          That's right.

          So the guidance we can give is, "you need to run repair if a node is down for longer than GCGraceSeconds, or if you lose data because of a hardware problem."

          Patricio Echague added a comment -

          after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets.

          CASSANDRA-2914 handles the local storage of hints.

          Patricio Echague added a comment -

          Attaching a preliminary patch.

          I still need to handle hints on failure.

          Please have a look at how the timeouts are handled and if that part needs to be improved.

          Jonathan Ellis added a comment -

          Looks reasonable. I'd move the wait method into FBUtilities as an overload of waitOnFutures.

          Does the timeout on get start when the future is created, or when get is called? I think it is the latter but I am not sure. If so, we should track the total time waited and reduce it after each get() so we do not allow a total of up to timeout * hints.
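
          For illustration, a minimal self-contained sketch of the "measure from creation time" idea (a hypothetical class, not the CreationTimeAwareFuture attached later):

            import java.util.concurrent.ExecutionException;
            import java.util.concurrent.Future;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.TimeoutException;

            // Wraps a Future and measures the timeout from creation time, so waiting on
            // N of these in a loop cannot add up to more than one overall timeout.
            class CreationTimeAwareFutureSketch<V>
            {
                private final Future<V> delegate;
                private final long creationNanos = System.nanoTime();

                CreationTimeAwareFutureSketch(Future<V> delegate)
                {
                    this.delegate = delegate;
                }

                V get(long timeout, TimeUnit unit) throws InterruptedException, ExecutionException, TimeoutException
                {
                    long remaining = unit.toNanos(timeout) - (System.nanoTime() - creationNanos);  // deduct time already spent
                    if (remaining <= 0)
                        throw new TimeoutException();
                    return delegate.get(remaining, TimeUnit.NANOSECONDS);
                }
            }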

          Patricio Echague added a comment -

          Does the timeout on get start when the future is created, or when get is called? I think it is the latter but I am not sure. If so, we should track the total time waited and reduce after each get() so we do not allow total of up to timeout * hints.

          Yeah, I need to add a creation time and do something similar to what IAsyncResult does.

          I noticed I forgot to skip hint creation when HH is disabled.
          Some thoughts on this that I would like feedback on:

          Note: remember that hints are written locally on the coordinator node now.

          Hinted Handoff | Consistency Level | Behavior
          on             | >= ONE            | wait for hints; we DO NOT notify the handler with handler.response() for hints
          on             | ANY               | wait for hints; responses count towards consistency
          off            | >= ONE            | DO NOT fire hints, and DO NOT wait for them to complete
          off            | ANY               | fire hints but don't wait for them; they count towards consistency
          Jonathan Ellis added a comment -

          Looks good to me for the first 3.

          I think ANY should be equal to ONE for hints=off. I.e., when it's off we never create hints.

          Patricio Echague added a comment - - edited

          v2 patch replaces v1.

          Changes in v2:

          • It fixes the add-up timeouts by adding a CreationTimeAwareFuture
          • moves the wait method into FBUtilities
          • Implements the matrix previously discussed
          Hinted Handoff | Consistency Level | Behavior
          on             | >= ONE            | wait for hints; we DO NOT notify the handler with handler.response() for hints
          on             | ANY               | wait for hints; responses count towards consistency
          off            | >= ONE            | DO NOT fire hints, and DO NOT wait for them to complete
          off            | ANY               | DO NOT fire hints, and DO NOT wait for them to complete
          Patricio Echague added a comment - - edited

          v3 replaces all previous patches.

          Add missing local writes for the case where a node is the coordinator, is a replica, and also holds a hint.

          Wire up the node-replied logic in the WriteHandler a bit.

          Still need to write hints on failure. (next patch)

          Patricio Echague added a comment -

          v4 replaces previous ones.

          Fire hints when a replica does not reply

          Patricio Echague added a comment - - edited

          This ticket assumes that counter mutations will not be written with CL == ANY (CASSANDRA-2990)

          Jonathan Ellis added a comment -

          I think CreationTimeAwareFuture is missing?

          Patricio Echague added a comment -

          v5 replaces v4.

          add src/java/org/apache/cassandra/concurrent/CreationTimeAwareFuture.java

          Jonathan Ellis added a comment -

          Attaching formatting patch that applies on top of v5. (Easier than explaining in English.) Looking at the actual hints next.

          Patricio Echague added a comment -

          v6 replaces (v5 + 2034-formatting)

          • applies Jonathan's formatting patch
          • add assertion in AbstractWriteResponseHandler#markTargetAsReplied
          • log a warning when the hint queue is full and a RejectedExecutionException is thrown when trying to queue up a new task
          Patricio Echague added a comment -

          New v7 patch.

          • When HH is enabled it waits for min(RPC timeout, all replicas' ACKs)

          The downside is that even for CL.ONE we wait for all the replicas to reply in order to decide whether we need to write a hint.
          So if a node is slow, requests/mutations with CL.ONE will perform slightly slower when HH is enabled.

          Jonathan Ellis added a comment -

          So I proposed two mutually exclusive approaches here:

          • return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets
          • add a separate executor here, with a blocking, capped queue. When we go to do a hint-after-failure we enqueue [and] wait for the write and then return success to the client

          The difference can be summarized as: do we wait for all hints to be written before returning to the client? If you do, then CL.ONE write latency becomes worst-of-N instead of best-of-N. But, you are guaranteed that successful writes have been hinted (if necessary) so you do not have to repair unless there is hardware permadeath. (Otherwise you would have to repair after power failure or crashes, too.)

          I'm inclined to think that the first option is better, partly because writes are fast so worst-of-N really isn't that different from best-of-N. Also, we could use CASSANDRA-2819 to reduce the default write timeout while still being conservative for reads (which might hit disk).

          Lior Golan added a comment -

          Writes are fast, but you also have network latency for communicating with all the nodes. What would happen in the worst-of-N case for multi-datacenter deployments?

          Jonathan Ellis added a comment -

          Ugh. Yeah, that pretty much rules out the always-wait idea, doesn't it?

          Patricio Echague added a comment -

          return to the client normally after ConsistencyLevel is achieved, but after RpcTimeout we check the responseHandler write acks and write local hints for any missing targets

          The way I'm planning to implement this last part is by adding a maybeWriteHint call in the hook that already exists for the ExpiringMap in MessagingService.
          The only problem is that I don't have the mutation available in MessagingService to generate a hint from.
          I might track which nodes haven't replied in StorageProxy and store one mutation per unreplied node in an ExpiringMap in MessagingService.

          The downside is that for CL.ONE I will end up storing basically one mutation per replica minus the one that responded, which may lead to memory issues.

          Thoughts?
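
          To make the shape of this concrete, a rough sketch (all names hypothetical, not the patch) of an expiration hook that turns a stored mutation into a local hint for an unacked target:

            import java.net.InetAddress;

            // Hypothetical sketch: the coordinator keeps the mutation alongside the
            // pending write; if the entry expires before the target acks, it becomes
            // a local hint.
            class ExpiringHintHookSketch
            {
                static class PendingWrite
                {
                    final InetAddress target;
                    final Object mutation;      // stand-in for RowMutation
                    volatile boolean acked;

                    PendingWrite(InetAddress target, Object mutation)
                    {
                        this.target = target;
                        this.mutation = mutation;
                    }
                }

                // called by the expiring map when an entry reaches rpc_timeout unanswered
                void onExpired(String messageId, PendingWrite pending)
                {
                    if (!pending.acked)
                        writeLocalHint(pending.target, pending.mutation);
                }

                void writeLocalHint(InetAddress target, Object mutation)
                {
                    // placeholder for the local hint write (CASSANDRA-2914)
                }
            }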

          T Jake Luciani added a comment -

          I think you can reuse the same RowMutation object across messages, so it shouldn't cause duplicates in memory.

          The only issue is that if too many mutations queue up you might OOM, but this is the same problem we currently have with the write stage. So if you use an ExpiringMap you should add an onExpiration hook to write the hint locally for the replicas that never responded to the mutation. This covers the case where mutations expire before a response is received.

          Then all the MessageTask needs to do is clear the messageId from the expiring map before the expiration time.

          Jonathan Ellis added a comment -

          The only issue is if too many mutations queue up you might OOM but this is the same problem we currently we have with the write stage

          Right. Either way worst case is already "you need to be able to buffer up to rpc_timeout's worth of writes in memory."

          if you use a Expiring map you should add a onExpiration hook to write the hint locally for the replicas that never responded to the Mutation

          I'm not sure you want a separate ExpiringMap from the one you already have in MS. Might make for weird corner cases. But that's an implementation detail; the approach is sound.

          Patricio Echague added a comment -

          The reason I need the extra map is to store the mutation. Otherwise I cannot generate a hint.

          Patricio Echague added a comment -

          But, you are guaranteed that successful writes have been hinted (if necessary) so you do not have to repair unless there is hardware permadeath. (Otherwise you would have to repair after power failure or crashes, too.)

          Since we are queuing up a hint on failure/RPC timeout after acknowledging to the client, it looks like we need to repair every time a node is shut down, given the crash-only way of shutting down Cassandra, since we can have tasks in the queue yet to be executed. Right? I don't think repair is needed only on hardware permadeath.

          T Jake Luciani added a comment -

          There is a shutdown hook we added for these kinds of things; see StorageService.

          Jonathan Ellis added a comment -

          I don't think repair is need only on hardware permadeath.

          It's right there in what you quoted: "Otherwise [if you're not waiting for all replica acks before returning to client] you would have to repair after power failure or crashes, too."

          it looks like we need to repair every time a node is shut down given the crash-only way of shutting down Cassandra

          As discussed on chat, you need to add a shutdown hook to let the executor finish for non-crash shutdowns. Look for Runtime.getRuntime().addShutdownHook in StorageService.
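
          For reference, a minimal sketch of that kind of shutdown hook (illustrative only; the real one lives in StorageService):

            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.TimeUnit;

            // Hypothetical sketch: on a clean shutdown, stop accepting new work and let
            // queued hint writes drain before the JVM exits. Crash shutdowns still lose them.
            class HintShutdownHookSketch
            {
                static void install(final ExecutorService hintExecutor)
                {
                    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable()
                    {
                        public void run()
                        {
                            hintExecutor.shutdown();
                            try
                            {
                                hintExecutor.awaitTermination(1, TimeUnit.MINUTES);
                            }
                            catch (InterruptedException e)
                            {
                                Thread.currentThread().interrupt();
                            }
                        }
                    }));
                }
            }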

          Patricio Echague added a comment -

          Thanks.

          Regarding the TimeoutException when trying to queue up a hint while writing hints after ACK timeout: would it be ok to have a second executor with an uncapped queue? I'm currently using a capped queue for writing hints to replicas that are known to be down before starting the mutation.

          The downside is that the queue can grow big (by default we stop scheduling hints once a replica has been down for more than one hour).

          Jonathan Ellis added a comment -

          Remember our algorithm looks something like this:

          • if we know a replica is down, hint it on the capped executor. if this is full, throw timeout.
          • perform writes to believed-to-be-healthy replicas
          • return ok to client once CL is achieved
          • wait for remaining replicas in bg and hint on uncapped executor if necessary

          The last part is what Jake and I were talking about when I said worst case is already "you need to be able to buffer up to rpc_timeout's worth of writes in memory." (Referring to how the write stage on a replica buffers up mutations.)

          This only comes into play during the delay between a replica becoming unreachable, and the coordinator detecting that. Otherwise it goes to the first, capped-write stage. So scheduling hints after 1h or w/e doesn't really matter.
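
          In outline form (every helper below is a stand-in, not the actual StorageProxy API), the algorithm reads roughly as:

            import java.net.InetAddress;
            import java.util.List;

            // Hypothetical outline of the write path described above.
            abstract class WritePathOutline
            {
                void write(Object mutation, List<InetAddress> replicas) throws Exception
                {
                    for (InetAddress replica : replicas)
                    {
                        if (isKnownDown(replica))
                            hintOnCappedExecutorOrTimeout(replica, mutation);  // full queue -> TimedOutException
                        else
                            sendMutation(replica, mutation);
                    }

                    waitForConsistencyLevel();   // the client is answered here

                    // remaining acks are waited on in the background; any replica whose
                    // callback expires is hinted on the uncapped executor
                }

                abstract boolean isKnownDown(InetAddress replica);
                abstract void hintOnCappedExecutorOrTimeout(InetAddress replica, Object mutation) throws Exception;
                abstract void sendMutation(InetAddress replica, Object mutation);
                abstract void waitForConsistencyLevel() throws Exception;
            }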

          Patricio Echague added a comment -

          Replaces v7.

          • Add a new executor in SP with an uncapped queue.
          • Add proper shutdown to MessagingService and SP.
          • Add scheduling hints on RPC timeout through MessagingService
          T Jake Luciani added a comment - - edited

          Comments on v8:

          • You are still waiting on hints in the client, I don't think we need CreationTimeAwareFuture anymore.
          • I don't think local hints need to be put on their own queue / thread-pool. Just write the hint to the local mutation queue and increment the hint counters.
          • The shutdown process should not wait for the mutation map to expire, it should simply write any outstanding hints from the map and shutdown.

          Nitpick:

          • The new expiring map logic is backwards in my opinion. I would rather see the expiration callback handle the hint write than have MessagingService call StorageProxy.
          Patricio Echague added a comment -

          v9 replaces v8.

          added:

          • one more shutdown for hintsWriter (that was missing in StorageProxy).
          • write all the pending hints in MessagingService instead of waiting for them to time out.
          Jonathan Ellis added a comment -

          I don't think local hints need to be put on their own queue / thread-pool. Just write the hint to the local mutation queue and increment the hint counters

          Agreed, that is a better solution than complicating things with extra queues / executors. Since the message dropping is done by the Task, not the executor, there's no problem in that respect. And we can use a simple counter to avoid clogging the queue with hints.
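
          As a sketch of that counter idea (hypothetical names and cap), hints can share the mutation stage while a simple in-flight counter keeps them from clogging it:

            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.atomic.AtomicInteger;

            // Hypothetical sketch: hints run on the shared mutation stage, but are
            // refused once too many are already in flight.
            class HintCounterGateSketch
            {
                private static final int MAX_HINTS_IN_PROGRESS = 1024;
                private final AtomicInteger hintsInProgress = new AtomicInteger();

                boolean submitHint(ExecutorService mutationStage, final Runnable hintWrite)
                {
                    if (hintsInProgress.get() >= MAX_HINTS_IN_PROGRESS)
                        return false;                    // caller decides whether to log or fail the request

                    hintsInProgress.incrementAndGet();   // incremented before submit, not inside the runnable
                    mutationStage.execute(new Runnable()
                    {
                        public void run()
                        {
                            try
                            {
                                hintWrite.run();
                            }
                            finally
                            {
                                hintsInProgress.decrementAndGet();
                            }
                        }
                    });
                    return true;
                }
            }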

          Patricio Echague added a comment -

          v10 patch.

          • Remove the hint executor and use MutationStage instead.
          • Add a queueSize counter to reject hints when the queue is full. It only applies to hints for nodes we know beforehand are down; I don't check it for the hints written from MessagingService on write timeout.
          • Add more JMX info in StorageProxyMBean
          Sylvain Lebresne added a comment -

          I looked at the last patch to check that we don't mess up counters here, and it looks ok on that side.

          A few comments on the patch while I'm at it:

          • There are a few formatting issues that don't respect the codeStyle: some in MessagingService, and some imports moved away from what the codeStyle dictates in AbstractWriteResponseHandler and StorageProxy (which imports quicktime.std.movies.ExecutingWiredAction!!)
          • It seems we use the exact same timeout for hints and callbacks. However, given that ExpiringMap is only so precise, it seems to me we have roughly a 50% chance that the hint will expire before the callback, which means no hint will be recorded (we could bump the hint timeout to twice that of the callback, and remove from mutationMessages once the hint is written; though I would rather go with the next proposal if possible).
          • Can't we reuse the callback ExpiringMap for hints instead of creating a new one? It's not like ExpiringMap is the cheapest thing around, and it seems like each mutation message always has an attached callback with the exact same lifetime. We could replace the Pair<InetAddress, IMessageCallback> with a CallbackInfo (or some better name) class holding an InetAddress and an IMessageCallback, and subclass it to CallbackInfoWithMessage when we need to store the message too (a rough sketch follows after this list). I'm pretty sure it would be much more efficient.
          • In SP.sendToHintedEndpoints, there is no need to test if hintedHandoff is enabled, because the hintedEndpoints map will only contain non-hinted targets if it is disabled (see AbstractReplicationStrategy.getHintedEndpoints).
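
          A rough sketch of the CallbackInfo shape proposed in the third point above (names illustrative, not the committed classes):

            import java.net.InetAddress;

            // Hypothetical sketch: one entry type in the existing callback ExpiringMap,
            // with a mutation-carrying subclass used only when a hint may be needed.
            class CallbackInfoSketch
            {
                final InetAddress target;
                final Object callback;       // stand-in for IMessageCallback

                CallbackInfoSketch(InetAddress target, Object callback)
                {
                    this.target = target;
                    this.callback = callback;
                }
            }

            class CallbackInfoWithMessageSketch extends CallbackInfoSketch
            {
                final Object message;        // the mutation message, kept so it can become a hint on expiration

                CallbackInfoWithMessageSketch(InetAddress target, Object callback, Object message)
                {
                    super(target, callback);
                    this.message = message;
                }
            }
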
          Patricio Echague added a comment -

          v11 replaces v10.

          • add volatile to more variables that can be changed through JMX
          • Add CallbackInfo and CallbackInfoWithMessage and store the message in the same ExpiringMap.
          • Clean up some code style violations
          • Simplify the logic in SP.sendToHintedEndpoints a bit.
          Jonathan Ellis added a comment -

          I think v11 is missing the new Callback classes. Can you rebase to trunk when you add those?

          Patricio Echague added a comment -

          rebase and add missing classes.

          Jonathan Ellis added a comment -
          • there's an unused overload of MS.addCallback
          • shouldHint should probably be a CallbackInfo method
          • the .warn in scheduleMH should be .error
          • any reason scheduleMH is protected instead of private?
          • technically MS.shutdown should probably collect the Futures of the hints being written and wait on them

          in

          if (consistencyLevel != null && consistencyLevel == ConsistencyLevel.ANY)
          

          did you mean this?

          if (responseHandler != null && consistencyLevel == ConsistencyLevel.ANY)
          
          • not a big fan of the "subclass just exposes a different constructor than its parent" idiom. I think the normal expectation is that a subclass encapsulates some kind of different behavior than its parent. I'd get rid of CIWM and just expose a with-Message constructor in CallbackInfo.
          • Would also make CI message and isMutation final
          • avoid declaring @Override on abstract methods (looking at writeLocalHint Runnable, also CTAF, may be others)
          • avoid comments that repeat what the code says, like "// One more task in the hints queue."
          • would name hintCounter -> totalHints (I had to look at usages to figure out what it does)
          • there's no actual hint queue anymore so would name currentHintsQueueSize -> hintsInProgress (similarly, maxHintsQueueSize)
          • prefer camelCase variable names (CTAF overall_timeout)
          • unit.toMillis(timeout) would be more idiomatic than TimeUnit.MILLISECONDS.convert(timeout, unit)
          • otherwise CTAF is a good clean encapsulation, nice job
          • generics is upset about "hintsFutures.add(new CreationTimeAwareFuture(hintfuture))" – can you fix the unchecked warning there?
          • unnecessary whitespace added to the method below "// wait for writes. throws TimeoutException if necessary"
          • would prefer to avoid allocating the hintFutures list unless we actually need to write hints, since this is on the write inner loop
          • still think we can simplify sendToHinted by getting rid of getHintedEndpoints and operating directly on the raw writeEndpoints (and consulting FD to decide whether to write a hint)
          • currentHintsQueueSize increment needs to be done OUTSIDE the runnable or it will never get above the number of task executors
          • it would be nice if we could tell the CallbackInfo whether the original write reached its ConsistencyLevel. Maybe we could do this by changing isMutation to a volatile isHintRequired, something like that (and just set it to false if CL is not reached). If CL is not reached, we don't have to write hints for timed-out replicas – this will help avoid OOM if nodes in the cluster start getting overloaded and dropping writes. Otherwise, the coordinator could run out of memory queuing up hints for writes that timed out and that the client will retry.
          Jonathan Ellis added a comment -
          • the timeout-based waitOnFutures overload should only accept CTAF objects since it will NOT behave as expected with others, e.g., FutureTask objects
          Patricio Echague added a comment -

          v12 replaces v11.

          • All comments were addressed except for the following two.
            • I took a slightly different path for when CL is not achieved, which I believe works out just as well.
            • Still putting more thought into how to get rid of getHintedEndpoints, since it affects the 3 types of WriteHandlers. Once I tackle that I will also avoid instantiating the hint future prematurely when hints are not needed.
          Patricio Echague added a comment -
          • remove that extra @Override and refactor SPMBean names.
          Patricio Echague added a comment -

          V13 replaces v12.

          • Simplify the sendToHintedEndpoints logic in SP.
          Jonathan Ellis added a comment -

          I don't think passing a list of unavailableEndpoints around everywhere is actually necessary. I may be missing a use case, but what I see is

          • in sendToHintedEndpoints
          • in assureSufficientLiveNodes implementation

          Both of which can be replaced in a straightforward manner with FailureDetector calls. (Note that it is not necessary for FD state to remain unchanged between assureSufficient and sending.)

          In fact using the same list in both places is a bug: assureSufficient only cares about what FD thinks, so mixing hinted-handoff-enabledness in as getUnavailableEndpoints does will cause assureSufficient to return false positives w/ HH off.

          So I'd make assureSufficient use FD directly, and sendTHE use FD + HH state. Bonus: no List allocation in the common case of "everything is healthy."

          Patricio Echague added a comment -

          I can do that.

          Also, I'm not quite happy with waiting for local hints to complete per mutation. I'm thinking of adding them to the handler so that we can wait for the hints after scheduling all the mutations.

          It has pros and cons:

          Pros: If the coordinator node is overwhelmed, we can tell the client right away.
          Cons: For large mutations, we are actually blocking on local hints (if any) per mutation, which is not ideal.

          Jonathan Ellis added a comment -

          Yes, that should be in mutate() or the handler so the waiting is parallelized.

          Patricio Echague added a comment -

          v14 replaces v13.

          • Wait for the hints in SP.mutate()
          • Avoid passing around a list of down hosts; handle them with the FD.
          • Simplify the logic in SP.sendToHintedEndpoints even further
          Jonathan Ellis added a comment -

          Hmm. I think you're right: it would work better to do the hints in the handler instead of passing these lists around. Sorry; let's change it to do it that way.

          Other notes:

          SP.shouldHint is broken (will always return true when hints are disabled). I would write it like this:

              public static boolean shouldHint(InetAddress ep)
              {
                  if (!isHintedHandoffEnabled())
                      return false;
                  
                  boolean hintWindowExpired = Gossiper.instance.getEndpointDowntime(ep) > maxHintWindow;
                  if (hintWindowExpired)
                      logger.debug("not hinting {} which has been down {}ms", ep, Gossiper.instance.getEndpointDowntime(ep));
                  return !hintWindowExpired;
              }
          

          CallbackInfo.shouldHint is broken a different way. It should be returning true if and only if the write to the target failed. (Calling this variable "from" is odd – "from" is used to refer to localhost in a MessageService context.) Currently, it returns true if the overall CL is achieved, which in the general case tells us nothing about the individual replica in question.

          Patricio Echague added a comment -

          Thanks Jonathan for the snippet of code. I didn't notice it was broken.

          I don't see where CallbackInfo.shouldHint is broken.

             
              public boolean shouldHint()
              {
                  if (StorageProxy.shouldHint(target) && isMutation)
                  {
                      try
                      {
                          ((IWriteResponseHandler) callback).get();
                          return true;
                      }
                      catch (TimeoutException e)
                      {
                          // CL was not achieved. We should not hint.
                      }
                  }
                  return false;
              }
          

          I process the callback after the message expires. If the CL was achieved (and the requirements for a hint are met), I return true for this target, meaning that a hint needs to be written.
          On the other hand, if the message expires and the CL was not achieved, then I return false (for this target).

          Perhaps it needs special treatment during shutdown?

          Patricio Echague added a comment -

          v15 replaces v14.

          • Waiting for hints is now handled in the IWriteResponseHandler
          • Fixed the broken SP.shouldHint

          Note: I think CallbackInfo.shouldHint needs an enhancement for when MessagingService is shutting down.

          Jonathan Ellis added a comment -

          I process the callback after the message expires

          That makes sense.

          I think CallbackInfo.shouldHint needs an enhancement for when MessagingService is shutting down

          Yes. We should probably either wait for the messages to time out (which is mildly annoying to the user) or just write hints for everything (which may be confusing: "why are there hints being sent after I restart, when no node was ever down?"). I don't see a perfect solution.

          Also, still need to address this:

          currentHintsQueueSize [now totalHints] increment needs to be done OUTSIDE the runnable or it will never get above the number of task executors

          v16 attached: rebased to current head, fixed import ordering, and added some comments.

          Patricio Echague added a comment -

          currentHintsQueueSize [now totalHints] increment needs to be done OUTSIDE the runnable or it will never get above the number of task executors

          Interesting. I must have lost it in one of the patches; I remember fixing it before.

          Yes. We should probably either wait for the messages to time out (which is mildly annoying to the user) or just write hints for everything (which may be confusing: "why are there hints being sent after I restart, when no node was ever down?"). I don't see a perfect solution.

          I think I prefer making the user wait for RpcTimeout, since it is not that long and is a bit clearer than just saving the hints just in case.

          Patricio Echague added a comment -

          Also, still need to address this:

          currentHintsQueueSize [now totalHints] increment needs to be done OUTSIDE the runnable or it will never get above the number of task executors

          hintsInProgress.incrementAndGet() happens outside of the executor, and actually before scheduling the task.
          totalHints.incrementAndGet(), on the other hand, is called right after the hint is written, i.e. within the task.

          Is that not right?
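
          For reference, the placement being described looks roughly like this (the executor and helper names are illustrative, not the actual patch):

              // Incremented outside the task, before scheduling, so the value is not
              // capped by the number of executor threads.
              hintsInProgress.incrementAndGet();
              hintExecutor.submit(new Runnable()                 // hintExecutor is illustrative
              {
                  public void run()
                  {
                      try
                      {
                          writeHintLocally(mutation, target);    // hypothetical helper
                          totalHints.incrementAndGet();          // counted only once the hint is written
                      }
                      finally
                      {
                          hintsInProgress.decrementAndGet();
                      }
                  }
              });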

          Patricio Echague added a comment -

          Fixed the defective v1 patch.

          CallbackInfo and CreationTimeAware were missing.

          Jonathan Ellis added a comment -

          hintsInProgress.incrementAndGet() happens outside of the executor, and actually before scheduling the task

          You're right, I mis-read that.

          I process the callback after the message expires. If the CL was achieved (and the requirements for a hint are met), I return true for this target, meaning that a hint needs to be written.

          Actually I think this does not achieve our goal of making read-repair unnecessary. For that, we need to always hint when an attempted write fails.

          I think I prefer make the user wait for RPCTimeout

          That's reasonable.

          v18 attached (applies on top of 17). I've added waiting for hints to finish to MS.shutdown, added a call to MS.shutdown from the shutdown hook, and added a call to stopRPCServer from the shutdown hook and from drain. (This is important, because there is nothing else to prevent new messages from being added to MS's callback map.)
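
          A sketch of the shutdown ordering described here (the hook body is illustrative; the actual changes live in StorageService and MessagingService):

              // Sketch: stop accepting new work before shutting down messaging, so no new
              // callbacks can be added while we wait for in-flight hints to be written.
              Thread drainOnShutdown = new Thread(new Runnable()
              {
                  public void run()
                  {
                      stopRPCServer();                         // no new client requests, no new callbacks
                      MessagingService.instance().shutdown();  // now also waits for outstanding hints
                  }
              });
              Runtime.getRuntime().addShutdownHook(drainOnShutdown);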

          Patricio Echague added a comment -

          Rebases and consolidates patch files 17 and 18.

          Patricio Echague added a comment -

          Jonathan, I noticed you modified CallbackInfo.shouldHint() a bit:

              public boolean shouldHint()
              {
                  return message != null && StorageProxy.shouldHint(target);
              }
          

          Not sure if you meant to say that your change addresses the issue of not hinting when CL is not reached.
          The new "shouldHint" method you added should be ok, as it is processed upon RpcTimeout regardless of whether the CL was achieved or not.
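
          For clarity, a sketch of the timeout path this refers to (the field access and helper names are assumptions for the sketch, not the committed code):

              // Sketch: when a callback expires after RpcTimeout without a response, write a
              // local hint for its target, regardless of whether the overall CL was reached.
              public void maybeHintOnExpiration(CallbackInfo expired)
              {
                  if (!expired.shouldHint())
                      return;  // not a mutation, hints disabled, or target down past the hint window

                  RowMutation mutation = deserializeMutation(expired.message);  // hypothetical helper
                  scheduleLocalHint(mutation, expired.target);                  // hypothetical helper
              }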

          Jonathan Ellis added a comment -

          Right, that's what I was referring to when I said "[the old shouldHint] does not achieve our goal of making read-repair unnecessary. For that, we need to always hint when an attempted write fails."

          Jonathan Ellis added a comment -

          Rebased v19. No other changes made.

          Patricio Echague added a comment -

          Rebased and fixed ConsistencyLevelTest.

          RemoveTest is timing out. Not sure if it is related.

          Update coming soon.

          Patricio Echague added a comment -

          Rebased and fixed the test.

          RemoveTest is timing out after the last test and I cannot find the reason.
          The ExpiringMap (callbacks) is not freeing up.

          Patricio Echague added a comment -

          This new patch fixes the unit tests.

          Added a way to force an ExpiringMap.shutdown() without waiting for it to empty out.

          Jonathan Ellis added a comment -

          What messages is RemoveTest waiting for that are causing the problem?

          Patricio Echague added a comment -

          I fixed the test.

          If I'm not wrong,

                  for (InetAddress host : hosts)
                  {
                      Message msg = new Message(host, StorageService.Verb.REPLICATION_FINISHED, new byte[0], MessagingService.version_);
                      MessagingService.instance().sendRR(msg, FBUtilities.getBroadcastAddress());
                  }
          

          sends 5 messages, but 4 of them are caught by a local SinkManager implementation (for testing purposes) and not processed.
          And since you added a wait for the callbacks to be processed before exiting, I had to add a forced shutdown (for testing purposes) in order to make the test complete successfully.

          Jonathan Ellis added a comment -

          That makes sense. I modified it slightly by keeping shutdown the same and adding MessagingService.clearCallbacksUnsafe instead. This makes it harder to "force" shutdown the wrong way by accident.
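
          A sketch of the contrast (the method bodies and the ExpiringMap calls are assumptions for the sketch, not the committed code):

              // Production path: shutdown still waits for outstanding callbacks, so any
              // pending hints get written before the node goes away.
              public void shutdown()
              {
                  callbacks.shutdown();   // callbacks is the ExpiringMap discussed above
              }

              // Test-only escape hatch: drop outstanding callbacks without waiting, so a
              // unit test cannot accidentally bypass the real shutdown semantics.
              public void clearCallbacksUnsafe()
              {
                  callbacks.reset();      // assumed ExpiringMap method, for the sketch only
              }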

          Also removed the calls to stopRPCServer from drain/shutdown hook. There's no code in the Thrift socket server to time out a read preemptively, so this does not complete in a timely fashion; this was causing timeouts in CliTest.

          Committed with these changes. Thanks Patricio!

          Hudson added a comment -

          Integrated in Cassandra #1063 (See https://builds.apache.org/job/Cassandra/1063/)
          generate hints for replicas that timeout
          patch by Patricio Echague and jbellis for CASSANDRA-2034

          jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163760
          Files :

          • /cassandra/trunk/CHANGES.txt
          • /cassandra/trunk/src/java/org/apache/cassandra/concurrent/CreationTimeAwareFuture.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/RowMutation.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/RowMutationVerbHandler.java
          • /cassandra/trunk/src/java/org/apache/cassandra/locator/AbstractReplicationStrategy.java
          • /cassandra/trunk/src/java/org/apache/cassandra/net/CallbackInfo.java
          • /cassandra/trunk/src/java/org/apache/cassandra/net/MessagingService.java
          • /cassandra/trunk/src/java/org/apache/cassandra/net/ResponseVerbHandler.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/DatacenterSyncWriteResponseHandler.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/DatacenterWriteResponseHandler.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/IWriteResponseHandler.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageProxy.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageProxyMBean.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageService.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/WriteResponseHandler.java
          • /cassandra/trunk/src/java/org/apache/cassandra/utils/ExpiringMap.java
          • /cassandra/trunk/src/java/org/apache/cassandra/utils/FBUtilities.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/service/ConsistencyLevelTest.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/service/RemoveTest.java

            People

            • Assignee: Patricio Echague
            • Reporter: Jonathan Ellis
            • Reviewer: Jonathan Ellis
            • Votes: 1
            • Watchers: 7

                Time Tracking

                • Original Estimate: 8h
                • Remaining Estimate: 8h
                • Time Spent: Not Specified
