Cassandra
  CASSANDRA-809

Full disk can result in being marked down

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      We had a node fill up the disk under one of its two data directories. The result was that the node stopped making progress. The problem appears to be this (I'll update with more details as we find them):

      When new tasks are put onto most queues in Cassandra, if there isn't a thread in the pool to handle the task immediately, the task is run in the caller's thread
      (org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor:69 sets the caller-runs policy). The queue in question here is the queue that manages flushes, which is enqueued to from various places in our code (and therefore likely from multiple threads). Assuming that the full disk meant that no threads doing flushing could make progress (it appears that way) eventually any thread that calls the flush code would become stalled.
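      The failure mode described above can be reproduced with a minimal sketch using plain `java.util.concurrent` (this is an illustration, not Cassandra's own `DebuggableThreadPoolExecutor`): once the lone worker and the bounded queue are both occupied by stalled tasks, the next `execute` call runs the task synchronously in the submitting thread.

```java
import java.util.concurrent.*;

public class CallerRunsDemo {
    // Returns the name of the thread the third (rejected) task ran in.
    public static String runDemo() throws InterruptedException {
        // One worker, queue capacity 1, caller-runs on rejection --
        // the same rejection policy the report points at.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());

        CountDownLatch stall = new CountDownLatch(1);
        Runnable stalledFlush = () -> {
            try { stall.await(); } catch (InterruptedException ignored) { }
        };
        pool.execute(stalledFlush);  // occupies the lone worker
        pool.execute(stalledFlush);  // fills the queue

        // This task is rejected, so it runs synchronously in the
        // submitting thread. If it too blocked on a full disk, the
        // caller would stall right here.
        final String[] ranIn = new String[1];
        pool.execute(() -> ranIn[0] = Thread.currentThread().getName());

        stall.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return ranIn[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("third task ran in: " + runDemo());
    }
}
```

      The third task runs in the caller's thread, which is how a stalled flush pool ends up stalling every thread that touches it.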

      Assuming our analysis is right (and we're still looking into it) we need to make a change. Here's a proposal so far:

      SHORT TERM:

      • change the ThreadPoolExecutor policy to not be caller runs. This will let other threads make progress in the event that one pool is stalled

      LONG TERM:

      • It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory
      • Perhaps we could use the failure detector on disks?
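      The thread-per-directory idea could be sketched as below. This is a hypothetical illustration (class and method names are made up, not Cassandra code): each data directory gets its own single-threaded executor, so a full disk stalls only its own queue instead of the shared flush pool.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: one dedicated single-threaded flush executor
// per data directory.
public class PerDirectoryFlusher {
    private final Map<String, ExecutorService> flushers = new ConcurrentHashMap<>();

    public void flush(String dataDirectory, Runnable flushTask) {
        // Lazily create a dedicated executor for each directory; a
        // stalled directory backs up only its own queue.
        flushers.computeIfAbsent(dataDirectory,
                dir -> Executors.newSingleThreadExecutor())
                .execute(flushTask);
    }

    public void shutdown() {
        flushers.values().forEach(ExecutorService::shutdown);
    }
}
```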

        Activity

        Jonathan Ellis added a comment -

        Created https://issues.apache.org/jira/browse/CASSANDRA-4292 to pursue the thread-per-disk idea. CASSANDRA-2116 and CASSANDRA-2118 address the issue of what to do when disks error out.

        Jonathan Ellis added a comment -

        Updating for the past 10 months' worth of changes:

        > Our node that hit this condition is essentially dead (it's not gossiping or accepting any writes or reads, but is still alive).

        This is basically fixed now that flow control is implemented (CASSANDRA-685) and refined (CASSANDRA-1358).

        > It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory

        At least until we cap sstable size (CASSANDRA-1608?), one data volume is going to be the recommended configuration, so this is low priority.

        > if a disk fills up, we stop trying to write to it

        > if we're about to write more data to a disk than space available, we don't try and write to that disk

        These two Cassandra has always done on compaction; less sure about flush.

        The nice thing about writes is that erroring out is almost identical to being completely down for ConsistencyLevel purposes.

        > we balance data relatively evenly between disks

        Also low priority given the above.

        > if a disk is misbehaving for a period of time, we stop using it and assume that data is lost (potentially notify an operator as well)

        This is the biggest problem right now: if a disk/volume goes down, the rest of the node (in particular gossip) will keep functioning, so other nodes will continue trying to read from it.

        Short term the best fix for this is to provide timeout information to the dynamic snitch (CASSANDRA-1905) so it can route around such nodes.

        Ryan King added a comment -

        > > change the ThreadPoolExecutor policy to not be caller runs. This will let other threads make progress in the event that one pool is stalled
        >
        > disagree. you can only do this by uncapping the collection, which is a cure worse than the disease. (you go back to being able to make a node GC storm to death really really easily when you give it more data than it can flush)

        I'll have to think more about this. I agree that we need to not let the queues grow in an unbounded way, but our current setup is also a problem (basically, all threads can be consumed by one queue, and some of them will wait indefinitely on conditions).

        We need to decide which kind of failure we want here. Our node that hit this condition is essentially dead (it's not gossiping or accepting any writes or reads, but is still alive).

        > > It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory
        >
        > yes, this would be my preferred design. should be straightforward code to write, just hasn't been done yet.

        I agree. We'll take a look at this early next week.

        > > Perhaps we could use the failure detector on disks?
        >
        > Not sure what this looks like but I agree our story here needs a lot of improvement.

        I'm not entirely sure what this looks like either, but here are the properties I'd like cassandra to have:

        • if a disk fills up, we stop trying to write to it
        • if we're about to write more data to a disk than space available, we don't try and write to that disk
        • we balance data relatively evenly between disks
        • if a disk is misbehaving for a period of time, we stop using it and assume that data is lost (potentially notify an operator as well)
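        The second property could look something like the following sketch. This is purely illustrative (the helper name is made up, not a Cassandra API): before flushing, check the volume's usable space against the projected write size and skip directories that can't hold it.

```java
import java.io.File;

// Hypothetical helper: skip a data directory when the projected write
// would exceed the usable space remaining on its volume.
public class DiskSpaceCheck {
    public static boolean hasRoomFor(File dataDirectory, long bytesNeeded) {
        // getUsableSpace() accounts for OS-level restrictions such as
        // quotas, not just raw free bytes.
        return dataDirectory.getUsableSpace() > bytesNeeded;
    }
}
```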

        > Short term my recommendation is to run w/ data files on a single raid0 unless you're sure you'll never get near the filling up point.

        This is probably the best advice for new clusters. Unfortunately we can't easily implement this right now.

        Jonathan Ellis added a comment -

        > change the ThreadPoolExecutor policy to not be caller runs. This will let other threads make progress in the event that one pool is stalled

        disagree. you can only do this by uncapping the collection, which is a cure worse than the disease. (you go back to being able to make a node GC storm to death really really easily when you give it more data than it can flush)
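        The trade-off here can be sketched in a few lines (an illustration, not Cassandra code; names are made up): a bounded hand-off queue applies back-pressure by blocking producers when it is full, instead of letting the backlog grow without limit and GC-storming the node.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the bounded-queue alternative: producers block when the
// queue is full (back-pressure) rather than growing memory unboundedly.
public class BoundedHandoff {
    private final BlockingQueue<Runnable> queue;

    public BoundedHandoff(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Blocks the producer when the queue is full -- back-pressure
    // instead of unbounded growth.
    public void submit(Runnable task) throws InterruptedException {
        queue.put(task);
    }

    public Runnable take() throws InterruptedException {
        return queue.take();
    }
}
```

        Blocking producers is itself a form of the stall being debated here, which is why where the bound sits (and who blocks) is the crux of the disagreement.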

        > It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory

        yes, this would be my preferred design. should be straightforward code to write, just hasn't been done yet.

        > Perhaps we could use the failure detector on disks?

        Not sure what this looks like but I agree our story here needs a lot of improvement.

        Short term my recommendation is to run w/ data files on a single raid0 unless you're sure you'll never get near the filling up point.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Ryan King
          • Votes:
            0
          • Watchers:
            3
