CASSANDRA-6364

There should be different disk_failure_policies for data and commit volumes or commit volume failure should always cause node exit

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.0.6
    • Component/s: None
    • Labels:
      None
    • Environment:

      JBOD, single dedicated commit disk

      Description

      We're doing fault testing on a pre-production Cassandra cluster. One of the tests was to simulate failure of the commit volume/disk, which in our case is on a dedicated disk. We expected failure of the commit volume to be handled somehow, but what we found was that Cassandra took no action when the commit volume failed. We simulated this simply by pulling the physical disk that backed the commit volume, which resulted in filesystem I/O errors on the mount point.

      What then happened was that the Cassandra heap filled up to the point that the JVM was spending 90% of its time doing garbage collection. No errors were logged regarding the failed commit volume. Gossip on other nodes in the cluster eventually flagged the node as down. Gossip on the local node showed itself as up, and all other nodes as down.

      The most serious problem was that connections to the coordinator on this node became very slow due to the ongoing GC, as, I assume, uncommitted writes piled up on the JVM heap. What we believe should have happened is that Cassandra should have caught the I/O error and exited with a useful log message, or otherwise done some sort of useful cleanup. Otherwise the node goes into a sort of zombie state, spending most of its time in GC, and thus slowing down any transactions that happen to use the coordinator on said node.

      A limit on in-memory, unflushed writes before refusing requests may also work. The point being, something should be done to handle the commit volume dying, as doing nothing ends up affecting the entire cluster. I should note that we are using disk_failure_policy: best_effort.


          Activity

          mishail Mikhail Stepura added a comment -

          What we believe should have happened is that Cassandra should have caught the I/O error and exited with a useful log message

          I believe you should use disk_failure_policy: stop for that

          jre J. Ryan Earl added a comment -

          Mikhail Stepura No, that is not the desired behavior. We do not want to "stop" on data disk failures; we want to stop on "commit" disk failure. This is specifically about the commit disk. At a minimum, there need to be different failure policies for commit and data disks/volumes. Given that only one commit disk is currently supported, however, it makes no sense to have the commit disk governed by the disk_failure_policy.

          Furthermore, what you say doesn't align with the default cassandra.yaml documentation which states:

          # policy for data disk failures:
          # stop: shut down gossip and Thrift, leaving the node effectively dead, but
          #       can still be inspected via JMX.
          # best_effort: stop using the failed disk and respond to requests based on
          #              remaining available sstables.  This means you WILL see obsolete
          #              data at CL.ONE!
          # ignore: ignore fatal errors and let requests fail, as in pre-1.2 Cassandra
          disk_failure_policy: best_effort
          

          Again, "disk_failure_policy" above specifically says "policy for data disk failures" and I'm talking about the commit disk here.

          iamaleksey Aleksey Yeschenko added a comment -

          Yes and no. Even with the commitlog disk dead, the node would be able to serve read requests w/ best_effort, which is the intended behavior.

          jre J. Ryan Earl added a comment - edited

          Aleksey Yeschenko So again, that's bad behavior: the node will essentially crater and be unable to handle read requests in a timely manner, or at all, not because the data isn't there to be read, but due to GC death as uncommitted writes pile up on the heap and the JVM spends all of its time doing garbage collection. Furthermore, it affects reads and writes not just to said node, but on any connection that uses said node as its coordinator. At a minimum, there should be different failure policies for commit and data volumes. The scope or description of the ticket can be changed to that effect; maybe there is a corner case where people only read from Cassandra, such that "best_effort" makes sense on the commit volume, but it's really hard to see any plausible use case where that would be desired.

          Cassandra needs to be able to do "best_effort" on the data volumes, where it makes no sense for a node to die when one disk in a JBOD of data disks fails, while gracefully and immediately exiting on commit disk failures, which will otherwise guarantee the node becomes unresponsive in a short amount of time under write load.

          iamaleksey Aleksey Yeschenko added a comment -

          I think I agree w/ stopping upon getting a write error on the commitlog drive.

          jbellis Jonathan Ellis added a comment -

          I would like to fix this in 2.0.x as well as 2.1, but I will settle for 2.1 if that's onerous.

          benedict Benedict added a comment -

          How far do we want to go with this?

          Adding a simple exit on error is very straightforward, but in my experience you can also get hang-style failures, so we should definitely have a separate thread checking the liveness of the CLSegmentManager and CLService. A user-configurable not-alive time in the yaml could be used to mark the CL as non-responsive if either hasn't heartbeated within that time. We probably don't want to die immediately on an error either, but simply stop heartbeating and die only if the error doesn't recover within some interval, so that anyone monitoring the error logs has time to correct the issue (let's say it's just out of space) before the node dies.

          The bigger question is, do we want to do anything clever if we don't want to die? Should we start draining the mutation stage and just dropping the messages? If so, should we attempt to recover if the drive starts responding again after draining the mutation stage?
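
The heartbeat idea above could be sketched as follows. This is a minimal sketch: HeartbeatWatchdog, beat, and isStale are illustrative names, not actual Cassandra classes, and the wiring into CLSegmentManager/CLService is assumed.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the heartbeat idea: the commit log threads call beat()
// on every successful iteration, and a separate monitoring thread polls
// isStale() to decide whether the CL should be marked non-responsive.
// All names here are illustrative, not actual Cassandra classes.
public class HeartbeatWatchdog {
    private final AtomicLong lastBeatMillis;
    private final long timeoutMillis; // the user-configurable "not-alive" time

    public HeartbeatWatchdog(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastBeatMillis = new AtomicLong(startMillis);
    }

    // Called by the monitored threads whenever they make progress.
    public void beat(long nowMillis) {
        lastBeatMillis.set(nowMillis);
    }

    // Called by the monitor; true means no heartbeat within the timeout.
    public boolean isStale(long nowMillis) {
        return nowMillis - lastBeatMillis.get() > timeoutMillis;
    }
}
```

On a stale heartbeat the monitor could first log, and kill the node only if the condition persists past a grace interval, matching the "don't die immediately" point above.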

          jbellis Jonathan Ellis added a comment -

          Just make it die on IOError like the existing code. For now people can deal with "hangs instead of erroring" manually.

          benedict Benedict added a comment -

          A straightforward patch for this is available here.

          There are three options: stop, stop_commit, and ignore. stop_commit is largely the same as before: the thread that experienced the error exits (whichever thread it is will ultimately result in the CLE dying). stop behaves as it does for data disk failures, bringing down all RPC channels (as well as letting the thread die). ignore logs the exception and continues as if nothing happened; this may result in a huge loop of log spam, depending on the cause, so we might want to either drop this option or have a configurable sleep period after an exception before continuing. Either seems acceptable to me.
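
For illustration, the resulting cassandra.yaml block might look something like the following, modeled on the data-disk policy block quoted earlier; the option descriptions are paraphrased from this discussion, not taken from the shipped file.

```yaml
# policy for commit disk failures (a sketch, not the shipped wording):
# stop:        shut down gossip and Thrift, leaving the node effectively
#              dead, but still inspectable via JMX
# stop_commit: shut down the commit log, letting the erroring thread die,
#              but continue serving reads
# ignore:      log the error and carry on, risking repeated log spam
commit_failure_policy: stop
```
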

          krummas Marcus Eriksson added a comment -

          About the ignore case, let's hard-code something for now - rate-limit to one error log message per second, perhaps?

          I don't think we should default to 'ignore' in Config.java - if someone does a minor upgrade they most likely won't check NEWS or update their config files to add the new parameter.

          The shipped config in cassandra.yaml looks wrong; it should be commit_failure_policy, not disk_failure_policy, I guess.
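
The hard-coded rate limit suggested above could be sketched like this; ErrorLogRateLimiter and tryAcquire are hypothetical names for illustration, not part of the patch.

```java
// Hypothetical sketch of a hard-coded log rate limit: allow at most one
// error log line per interval, however fast the erroring thread spins.
public class ErrorLogRateLimiter {
    private final long intervalMillis;
    private long nextAllowedMillis = 0;

    public ErrorLogRateLimiter(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Returns true if a log line may be emitted at this timestamp.
    public synchronized boolean tryAcquire(long nowMillis) {
        if (nowMillis < nextAllowedMillis)
            return false;
        nextAllowedMillis = nowMillis + intervalMillis;
        return true;
    }
}
```

The caller would log the exception only when tryAcquire(now) returns true, so at most one error line per interval reaches the log.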

          benedict Benedict added a comment -

          I don't think we should default to 'ignore' in Config.java

          Well, I wasn't too sure about this. On the one hand, switching the default to "stop" means we could overcautiously kill users' hosts unexpectedly, maybe resulting in interruption of service (especially, say, for our users running on SAN, as much as that is strongly discouraged). Whereas defaulting to "ignore" means we may not be durable. Neither is a great default, but both are better than before. I'm comfortable with both, so if you feel strongly it should be "stop", I'll happily switch it. Perhaps I lean slightly in favour of it too, but it really depends on whether the user favours durability over availability, so there doesn't seem to be a single correct answer to me. Note that the default disk_failure_policy is also ignore, and the prior behaviour was closest to ignore, so introducing a default that results in a failing node is somewhat unprecedented for disk failure.

          The shipped config in cassandra.yaml looks wrong; it should be commit_failure_policy, not disk_failure_policy, I guess

          Right, looks like I didn't update the first or last lines I copy-pasted. Thanks.

          About the ignore case, let's hard-code something for now - rate-limit to one error log message per second, perhaps?

          If we're just rate-limiting the log messages, I'd say one per minute might be better. But I'm not sure having the threads spin trying to make progress is useful. The PCLES, for instance, will just start burning one core until it can successfully sync, assuming it doesn't actually have to wait each time to encounter the error. I'm tempted to have a 1s pause after an error, during which we just sleep the erroring thread.

          Another issue that slightly concerns me is what happens if the CLES sync() starts failing, but the append and CLA don't. With "ignore" this could potentially result in us mapping in and allocating huge amounts of disk space, but not being able to sync or clear it. This might result in lots of swapping, and/or in us exceeding our max log space goal by a large margin. Since we never guarantee to keep to this, I'm not sure how much of a problem it would be, but an error down to ACLs that stops us syncing one file might potentially end up eating huge quantities of commit disk space. I'm tempted to have the CLA thread block once it hits twice its goal max space (or maybe introduce a second config parameter for a hard maximum). But I'm also tempted to leave these changes for the 2.1 branch, since it's a fairly specific failure case, and what we have is a big improvement over the current state of affairs.

          benedict Benedict added a comment -

          Attaching a patch for 2.0 that switches the default policy to stop and, on policy ignore, pauses the erroring thread for 1s before continuing.

          I'm still not sure this is definitely the correct behaviour for ignore, but it will do for now, since ignore is hardly going to be the most recommended setting.

          krummas Marcus Eriksson added a comment -

          +1 on the 2.0-patch, committed


            People

            • Assignee: benedict Benedict
            • Reporter: jre J. Ryan Earl
            • Reviewer: Marcus Eriksson
            • Votes: 0
            • Watchers: 6
