Details

    • Type: New Feature
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Legacy/Core, Local/Config
    • Labels: None

    Description

      Operators should be able to specify, via YAML, the amount of usable disk space on a node as a percentage of the total available or as an absolute value. If both are specified, the absolute value should take precedence. This allows operators to reserve space available to the database for background tasks – primarily compaction. When a node reaches its quota, gossip should be disabled to prevent it taking further writes (which would increase the amount of data stored), being involved in reads (which are likely to be more inconsistent over time), or participating in repair (which may increase the amount of space used on the machine). The node re-enables gossip when the amount of data it stores is below the quota.

      The proposed option differs from min_free_space_per_drive_in_mb, which reserves some amount of space on each drive that is not usable by the database.
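
      As a rough sketch only, the proposed cassandra.yaml options might look like the following; the option names and values are hypothetical placeholders, not a final interface:

        # Hypothetical option names, for illustration only.
        # Usable space as a percentage of total data-directory capacity:
        data_disk_usage_max_percentage: 80
        # Usable space as an absolute value; if both are set, this takes precedence:
        data_disk_usage_max_size_in_mb: 750000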

      Attachments

        Activity

          jeromatron Jeremy Hanna added a comment -

          I just want to add a note of caution to anything automatic happening when certain metrics trigger. I've seen cases where metrics misfire under certain circumstances, which leads to unpredictable cluster behavior. I would favor having a warning system over anything done automatically if it were my cluster.

          jwest Jordan West added a comment -

          jeromatron I understand those concerns. This would be opt-in for folks who wanted automatic action taken and any such action should take care to not cause the node to flap, for example. One use case where we see this as valuable is QA/perf/test clusters that may not have the full monitoring setup but need to be protected from errant clients filling up disks to a point where worse things happen. The warning system can be accomplished today with monitoring and alerting on the same metrics.
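
          As a sketch of the "monitoring and alerting on the same metrics" option: an external check could poll the node's reported live data size over JMX and alert on a threshold. This assumes the standard org.apache.cassandra.metrics:type=Storage,name=Load metric and the default JMX port; the class name and threshold below are illustrative.

            import javax.management.MBeanServerConnection;
            import javax.management.ObjectName;
            import javax.management.remote.JMXConnector;
            import javax.management.remote.JMXConnectorFactory;
            import javax.management.remote.JMXServiceURL;

            // Illustrative external warning check: read the node's live data size over JMX
            // and print a warning when it crosses a threshold.
            public class DiskUsageAlert
            {
                public static void main(String[] args) throws Exception
                {
                    long warnBytes = 750L * 1024 * 1024 * 1024; // example threshold: 750 GiB of live data
                    JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
                    try (JMXConnector connector = JMXConnectorFactory.connect(url))
                    {
                        MBeanServerConnection mbs = connector.getMBeanServerConnection();
                        ObjectName load = new ObjectName("org.apache.cassandra.metrics:type=Storage,name=Load");
                        long liveBytes = (Long) mbs.getAttribute(load, "Count");
                        if (liveBytes >= warnBytes)
                            System.out.println("WARN: live data size " + liveBytes + " bytes exceeds threshold");
                    }
                }
            }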


          jjordan Jeremiah Jordan added a comment -

          Isn't this pretty easy to do with OS-level settings? Getting this tracking right across all the places we use disk seems like something we are bound to fail at, where using your OS would not?

          jjirsa Jeff Jirsa added a comment -

          Not clear to me how you'd do this as gracefully at the OS level as you can at the cassandra level (by, e.g., blocking writes and inbound streaming).

          It's also not clear to me that disabling gossip is the right answer. You can still serve reads, the coordinator will know if it's out of sync and can attempt a (now non-blocking and speculating) read repair if necessary. If read repair is required to meet consistency, we'll fail there, but that's still likely better than not serving the already consistent read.

          jwest Jordan West added a comment - edited

          The other reason the OS level wouldn't work is that we are trying to track live data, which the OS can't distinguish from the rest. EDIT: also to clarify, the goal here isn't to implement a perfect quota. There will be some room for error where the quota can be exceeded. The goal is to mark the node unhealthy when it reaches this level and to have enough headroom for compaction or other operations to get it to a healthy state.

          Regarding taking reads, jasobrown, krummas, and I discussed this some offline. Since the node can only get more and more out of sync while not taking write traffic and can't participate in (read) repair until the amount of storage used is below quota, we thought it better to disable both reads and writes. Less-blocking and speculative read repair makes us more available in this case (as it should).

          Disabling gossip is a quick route to disabling reads/writes. Is it the best approach to doing so? I'm not 100% sure. My concern is how the operator gets back to a healthy state once a quota is reached on a node. They have a few options: migrate data to a bigger node, let compaction catch up and delete data, raise the quota so it's no longer exceeded, add node(s) to take storage responsibility away from the node, or forcefully delete data from the node. We need to ensure we don't prevent those operations from taking place. I've been discussing this with jasobrown offline as well.
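
          To illustrate the anti-flapping point, a minimal sketch of how a periodic check could apply the quota with separate disable/re-enable thresholds; DiskQuotaGuard and its Node hooks are hypothetical stand-ins, not the actual patch:

            // Hypothetical sketch only: apply the quota with hysteresis so the node does not flap.
            public class DiskQuotaGuard
            {
                public interface Node // stand-in hooks into the database
                {
                    long liveDataBytes();   // current live data size on this node
                    void disableTraffic();  // e.g. stop gossip and the native transport
                    void enableTraffic();   // re-enable both once healthy again
                }

                private final Node node;
                private final long quotaBytes;         // configured quota for live data
                private final double reenableFraction; // re-enable only once usage is comfortably below quota
                private boolean trafficDisabled = false;

                public DiskQuotaGuard(Node node, long quotaBytes, double reenableFraction)
                {
                    this.node = node;
                    this.quotaBytes = quotaBytes;
                    this.reenableFraction = reenableFraction;
                }

                // Intended to be run periodically by a scheduled task.
                public synchronized void check()
                {
                    long used = node.liveDataBytes();
                    if (!trafficDisabled && used >= quotaBytes)
                    {
                        node.disableTraffic(); // mark unhealthy: stop taking reads/writes
                        trafficDisabled = true;
                    }
                    else if (trafficDisabled && used < (long) (quotaBytes * reenableFraction))
                    {
                        node.enableTraffic(); // healthy again, with headroom so we don't immediately re-trip
                        trafficDisabled = false;
                    }
                }
            }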

          jjirsa Jeff Jirsa added a comment -

          1) Disabling gossip alone is insufficient; we also need to disable the native transport.

          2) Recovery is likely some combination of compactions and host replacement.

          3) Still not sure I buy the argument that it's wrong to serve reads in this case - it may be true that some table is getting out of sync, but that doesn't mean every table is, and we already have a mechanism to deal with nodes that can serve reads but not writes (speculating on the read repair). If you don't serve reads either, then any GC pause will be guaranteed to impact client request latency as we can't speculate around it in the common rf=3 case.

          jwest Jordan West added a comment -

          Disabling gossip alone is insufficient; we also need to disable the native transport.

          Agreed. I hadn't updated the description to reflect it but what I am working on does this as well. 

          still not sure I buy the argument that it’s wrong to serve reads in this case - it may be true that some table is getting out of sync, but that doesn’t mean every table is,

          I agree it depends on the workload for each specific dataset, but since we can't know which one we have, we have to assume it could get really out of sync.

          and we already have a mechanism to deal with nodes that can serve reads but not writes (speculating on the read repair).

          Even if we speculate, we still attempt it. That work will always be for naught, and being at quota is likely a prolonged state (the ways out of it take a while).

          If you don't serve reads either, then any GC pause will be guaranteed to impact client request latency as we can't speculate around it in the common rf=3 case.

          This is true. But that's almost the same as losing a node because its disk has filled up completely. If we have one unhealthy node, we are only another unhealthy node away from unavailability in the rf=3/quorum case.

          That said, I'll consider the reads more over the weekend. It's a valid concern.

          jjordan Jeremiah Jordan added a comment - edited

          If one node has reached “full”, how likely is it that others are about to as well? Without monitoring, how will an operator know to do something to “fix” the situation? I’m just not convinced that it’s worth adding the logic and complications in the rest of the code to allow this feature, which will maybe add a short bandaid of time before things completely fall over, and possibly just have things fall over early. If you are lucky and one node has enough more data than the others that it hits this first, without the others following shortly behind, you might give a small amount of breathing room for compaction to clean a little space out, but that is only going to do so much; it won’t fix the problem. You need to recognize as an operator that your nodes are full and add more nodes to your cluster, or add more disk space to your cluster.

          jwest Jordan West added a comment -

          Since the goal isn't to strictly enforce the quota (it's ok if it's violated, but once noticed, action should be taken), the code isn't invasive. It's a small amount of new code, with the only change being to schedule the check on optional tasks. That being said, if the concern is complexity, one potential place for this (and I think it may be a better home regardless) is CASSANDRA-14395.

          While this may seem like a small bandaid, and there are cases where multiple nodes can go down at once, it is exactly meant to give some headroom. This headroom makes it considerably easier to get the cluster into a healthy state again.
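
          For the "schedule the check on optional tasks" point above, the wiring might look like the following sketch, assuming Cassandra's ScheduledExecutors.optionalTasks executor and the hypothetical DiskQuotaGuard from the earlier comment; the interval is illustrative:

            import java.util.concurrent.TimeUnit;

            import org.apache.cassandra.concurrent.ScheduledExecutors;

            public class DiskQuotaTask
            {
                // Hypothetical wiring only: run the quota check alongside Cassandra's other
                // optional background tasks.
                public static void schedule(DiskQuotaGuard guard)
                {
                    ScheduledExecutors.optionalTasks.scheduleWithFixedDelay(guard::check, 5, 5, TimeUnit.MINUTES);
                }
            }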

          jjirsa Jeff Jirsa added a comment -

          You need to recognize as an operator that your nodes are full and add more nodes to your cluster, or add more disk space to your cluster.

          Even the best operators can't always add disk/instances fast enough to avoid running a cluster 100% out of space in some cases, and if/when that happens, you end up in a situation where you have no good options, perhaps a few "ok" options (block client connections and expand, maybe), but plenty of bad options (removing commitlogs, etc).

          This isn't entirely new ground here, there's precedent elsewhere. Hadoop has dfs.datanode.du.reserved per disk, for example.

          kohlisankalp Sankalp Kohli added a comment -

          We should give a way to recover after an application has filled a node. Can we allow deletes or truncate? 

          kohlisankalp Sankalp Kohli added a comment -

          Why not add nodetool stop insert/delete/select and have the sidecar call these when required?

          jwest Jordan West added a comment -

          Nothing would disallow truncate, although if one node is at quota, is dropping all data what is desired? In some use cases, perhaps. Since deletes temporarily inflate storage use, for a node-level quota I don't think they should be allowed (for a keyspace-level quota that would perhaps be different). The client also can't be expected to know exactly which keys live on the node(s) that are at quota, which makes remediation by delete less viable. The most likely remediations are adding more nodes or truncation. A correct implementation would prevent neither of these.

          I agree that this could/should live in the management process.

          jeromatron Jeremy Hanna added a comment -

          I still have a concern about including something into the codebase that shuts down node operations automatically - even if it's opt-in. Considering that under normal circumstances, nodes will have around the same amount of data, that leads to some fairly normal cascading failure scenarios when this is enabled. That leads me to wonder when this would be useful.

          One use case where we see this as valuable is QA/perf/test clusters that may not have the full monitoring setup but need to be protected from errant clients filling up disks to a point where worse things happen.

          So is it that there is not a lot of access to the machine, the VM, or the OS in those QA/perf/test clusters, but there is access to Cassandra, so the idea is to use that access to make sure an errant client doesn't do things that would require getting access (or contacting the people with access) to the machine to rectify, like filling up the volume?

          Would the only circumstances where this is useful be in QA/perf/test clusters and therefore cascading failure of the cluster isn't the end of the world?

          I'm just concerned that while a very mature user is going to use this appropriately, others out there will inadvertently misuse the feature. If this is something that gets into the codebase, I would just want to make extra sure that people are aware of both the intended use cases/scenarios and especially the risks of cascading failure. That said, adding something that can automatically trigger cascading failure for the sake of test environments seems unwise.

          I'm happy to be wrong about the probability of cascading failure or the expected use cases, but please help me understand.


          People

            Assignee: jwest Jordan West
            Reporter: jwest Jordan West
            Authors: Jordan West

            Votes: 0
            Watchers: 10
