Cassandra / CASSANDRA-9917

MVs should validate gc grace seconds on the tables involved

    Details

      Description

      For correctness reasons (potential resurrection of dropped values), batchlog entries are TTL'd with the lowest gc_grace_seconds of all the tables involved in a batch.

      This means that if gc gs is set to 0 on one of the tables, the batchlog entry will be dead on arrival, and never replayed.

      We should probably warn against such LOGGED writes taking place in general, but for MVs we must validate that gc gs on the base table (and on the MV table, if we should allow altering gc gs there at all) is never set too low; otherwise the batchlog entries that propagate MV updates can expire before they are ever replayed.
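The TTL rule described above can be sketched as follows. This is a minimal illustration, not the actual Cassandra code; the `Table` record and its field names are hypothetical stand-ins for table metadata.

```java
import java.util.List;

public class BatchlogTtlSketch
{
    // Hypothetical stand-in for a table's metadata.
    record Table(String name, int gcGraceSeconds) {}

    // The batchlog entry is TTL'd with the lowest gc_grace_seconds of all
    // tables in the batch, so a single table with gc gs = 0 makes the
    // entry expire immediately ("dead on arrival", never replayed).
    static int batchlogTtl(List<Table> tables)
    {
        return tables.stream()
                     .mapToInt(Table::gcGraceSeconds)
                     .min()
                     .orElse(0);
    }

    public static void main(String[] args)
    {
        List<Table> batch = List.of(new Table("t1", 864000),  // 10-day default
                                    new Table("t2", 0));      // gc gs disabled
        System.out.println(batchlogTtl(batch));               // prints 0
    }
}
```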

        Issue Links

          Activity

          pauloricardomg Paulo Motta added a comment -

          so, if I understood correctly, we want to validate on CreateMVStatement if baseTable.gcGraceSeconds < MIN_GC_GRACE_SECONDS, and also on AlterTableStatement if baseTable.NewGcGraceSeconds < MIN_GC_GRACE_SECONDS, is that correct? I will assume this is the case, please let me know if I should validate in the actual write instead.
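The validation Paulo describes could look roughly like this. It is a sketch of the check shared by the CREATE MATERIALIZED VIEW and ALTER TABLE / ALTER MV paths, not the actual patch; the method and exception names are hypothetical (the real code would throw Cassandra's InvalidRequestException).

```java
public class GcGraceValidationSketch
{
    // Hypothetical minimum; the thread below debates what this value should
    // be, and the final patch simply rejects 0.
    static final int MIN_GC_GRACE_SECONDS = 0;

    // Stand-in for Cassandra's InvalidRequestException.
    static class InvalidRequest extends RuntimeException
    {
        InvalidRequest(String msg) { super(msg); }
    }

    // Called from both the CREATE MV and ALTER TABLE/ALTER MV paths.
    static void validateGcGrace(String table, int gcGraceSeconds, boolean hasViews)
    {
        if (hasViews && gcGraceSeconds <= MIN_GC_GRACE_SECONDS)
            throw new InvalidRequest(String.format(
                "Cannot set gc_grace_seconds of table %s to %d: tables with " +
                "materialized views require gc_grace_seconds > %d so that " +
                "batchlog entries can be replayed",
                table, gcGraceSeconds, MIN_GC_GRACE_SECONDS));
    }

    public static void main(String[] args)
    {
        validateGcGrace("base", 864000, true);     // accepted
        try
        {
            validateGcGrace("base", 0, true);      // rejected
        }
        catch (InvalidRequest e)
        {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```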

          iamaleksey Aleksey Yeschenko added a comment -

          Yes, CREATE MV, ALTER TABLE, ALTER MV. More generally, batchlog writes should warn if any involved tables have extremely low gc gs - whatever extremely low actually is - with or without MVs.

          That said, I have no idea what MIN_GC_GRACE_SECONDS should be. Just bringing up the issue that batchlog is completely useless for such writes.

          pauloricardomg Paulo Motta added a comment -

          3.0 v1 patch available for review. I chose a MIN_GC_GRACE_SECONDS of 3 hours for logged batches, which is the default value of max_hint_window_in_ms.

          Dtests available in this PR

          Tests will be available shortly here: 3.0 dtests, 3.0 testall, trunk dtests, trunk testall

          jbellis Jonathan Ellis added a comment -

          How about using the hint window cutoff as the minimum? Since they are both saying, "if a node is down longer than X, you're going to need to repair."

          iamaleksey Aleksey Yeschenko added a comment -

          max_hint_window_in_ms is misunderstood by almost everyone, it seems.

          It only affects the decision to write a hint in the first place (down for longer than the window? stop writing hints).

          1) Batchlog ignores max_hint_window_in_ms and always writes its hints, no matter what, if a node is down
          2) max_hint_window_in_ms does not affect replay decisions - it's ignored on replay - only gc gs - turned into ttl - matters

          It's most often true that if a node has been down for longer than max_hint_window_in_ms, it is going to have data missing, yes. But there are no guarantees that it being down for shorter than that means it doesn't. Any number of nodes with hints for the target node could go down for a time longer than hints' expiration time, die forever, lose their hints - you name it.

          And many outright disable hints (batchlog would still write them), or have ridiculously low max_hint_window_in_ms.

          What I'm saying is that I don't like tying any correctness logic to max_hint_window_in_ms param.

          iamaleksey Aleksey Yeschenko added a comment -

          I also want to point out that this is neither a batchlog nor a hints issue; it also applies to commit log replay - see CASSANDRA-8498.

          pauloricardomg Paulo Motta added a comment -

          It means that if gc gs is set to 0 in one of the tables, the batchlog entry will be dead on arrival, and never replayed.

          We want a min-batchlog-ttl that accounts at least for network delay + min-batchlog-replay-period, but that can also protect from small transient failures, so I think 1-3 hours should be sufficient without restricting gc_grace_seconds too much.

          I made the default 3 hours, but the operator can decrease it if necessary by setting the cassandra.min_batchlog_ttl system property.
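Reading such an override from a system property might look like the sketch below. The property name `cassandra.min_batchlog_ttl` comes from the comment above; the surrounding class and constant are illustrative, not the actual patch.

```java
public class MinBatchlogTtlSketch
{
    // Default of 3 hours, matching the default max_hint_window_in_ms.
    static final long DEFAULT_MIN_BATCHLOG_TTL_SECONDS = 3 * 3600;

    // Operators can lower the floor with -Dcassandra.min_batchlog_ttl=<seconds>.
    static long minBatchlogTtl()
    {
        return Long.getLong("cassandra.min_batchlog_ttl",
                            DEFAULT_MIN_BATCHLOG_TTL_SECONDS);
    }

    public static void main(String[] args)
    {
        System.out.println(minBatchlogTtl()); // 10800 unless overridden
    }
}
```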

          Updated patch link with new version. Dtests available here

          jbellis Jonathan Ellis added a comment -

          It only affects the decision to write a hint in the first place (down for longer than the window? stop writing hints)... It's most often true that if a node has been down for longer than max_hint_window_in_ms, it is going to have data missing, yes. But there are no guarantees that it being down for shorter than that means it doesn't.

          True, but let me clarify: if no node in the cluster is down for longer than max hint window, and you have no hardware failures, and you have batch commitlog enabled, then you won't need repair. Fair?

          I don't really see a difference vs min_batchlog_ttl. Especially since Paulo independently came up with the same value we use as default max hint window here.

          Let's not offer users more tuning knobs than they can meaningfully distinguish between.

          iamaleksey Aleksey Yeschenko added a comment -

          True, but let me clarify: if no node in the cluster is down for longer than max hint window, and you have no hardware failures, and you have batch commitlog enabled, then you won't need repair. Fair?

          I might be misunderstanding the context, but no, in general this is not true. A request times out, a hint - or a batchlog entry gets written - a table in the mutation has a low gc gs - the batchlog/hint entry expires before it can be replayed, and we now need repair.

          Especially since Paulo independently came up with the same value we use as default max hint window here.

          Right. Independently.

          jbellis Jonathan Ellis added a comment -

          we now need repair

          Which is why "low" gcgs should be defined as lower than max hint window, because that's what causes problems.

          iamaleksey Aleksey Yeschenko added a comment -

          I'm going to be a pedant and bring up another example, just because: you have nodes A and B, A has some hints/batches for B; A is down for 1.5 hours, then A comes up, but B goes down for 1.5 hours. No single node has been down for longer than max hint window, but, assuming gc gs/max hints window of 3 hours, none of the batches or hints have survived. You need repair. And with current MVs - to drop and recreate the MV?

          What I'm saying is that whatever we do, it's going to be broken. Also that max_hint_window_in_ms should not be part of any calculations whatsoever, as you can ultimately infer nothing from it.

          So let's just validate that it's not set to 0 and properly document the effects of gc gs in the MV documentation.

          pauloricardomg Paulo Motta added a comment -

          There are three possible failure scenarios we want to protect against by limiting the gc_grace_seconds of tables involved in a batchlog/MV write:
          1. the batchlog entry must survive long enough to reach the destination node (< 1s)
          2. minor load spikes and GC pauses (a few minutes)
          3. medium to major outages (> 1h)

          By using a large threshold we can protect against 1-3, but we can't define the optimal value (since it varies from cluster to cluster), nor can we use max_hint_window_in_ms (for the valid reasons given by Aleksey). By validating only that the value is not 0, we protect against 1 alone, and users might work around the limitation by setting gc_grace_seconds to 1, which can still lead to problems.

          I think 2 is pretty common and guarding against it could save us some headache in the future, so how about using a minimum threshold of 10 minutes to protect against 1 and 2? We can either hard-code this, or leave the unpublished property cassandra.min_batchlog_ttl for advanced operators who want to modify the constant. We can still properly document the effects of gc gs in the MV documentation to protect against 3.

          iamaleksey Aleksey Yeschenko added a comment -

          There are no shades of 'correct'. Let's not overthink it.

          By far the most common GC GS is the default one, which is 10 days. The second most common one is 0. There aren't many out there that are in-between.

          Please, let's just validate that it's not 0, since with the current implementation of BL and hints it just doesn't work (but something we should revisit once we introduce the write-once option for the tables).

          jbellis Jonathan Ellis added a comment -

          Validate nonzero WFM.

          pauloricardomg Paulo Motta added a comment -

          Updated patch available for review here
          Tests updated: cassandra-dtest PR

          Tests will appear shortly in the links below: 3.0 dtests, 3.0 testall, trunk dtests, trunk testall

          iamaleksey Aleksey Yeschenko added a comment -

          Yep, this is it. Can almost commit as is.

          Pushed some minor tweaking here to make the language slightly less verbose and also avoid allocating a hashset in the common case of no tables having a gc gs of 0. This is going to require updating dtests a bit.

          I want one more thing addressed, though. Logging in BatchStatement needs to be changed. For one, we definitely should not log a warning for each table. It should be once per batch. Secondly, it should be rate-limited (use NoSpamLogger). Thirdly, it should utilize the now available client warning facility (ClientWarn).

          You should look at how this is handled by verifyBatchType.
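The once-per-batch, rate-limited warning requested above can be sketched as follows. `NoSpamLogger` and `ClientWarn` are Cassandra internals, so this sketch uses a minimal hypothetical rate limiter in their place; the table record and method names are also illustrative. It folds in both review points: one warning per batch (not per table) and lazy set allocation for the common case of no offending tables.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;

public class BatchWarningSketch
{
    // Minimal stand-in for NoSpamLogger: suppress repeats within an interval.
    static class RateLimitedLogger
    {
        private final long intervalNanos;
        private long lastLogNanos = Long.MIN_VALUE;

        RateLimitedLogger(long interval, TimeUnit unit)
        {
            this.intervalNanos = unit.toNanos(interval);
        }

        synchronized boolean warn(String msg)
        {
            long now = System.nanoTime();
            if (now - lastLogNanos < intervalNanos)
                return false; // suppressed
            lastLogNanos = now;
            System.out.println("WARN: " + msg);
            return true;
        }
    }

    record Table(String name, int gcGraceSeconds) {}

    static final RateLimitedLogger LOGGER = new RateLimitedLogger(1, TimeUnit.MINUTES);

    // Emits at most one warning per batch, listing all offending tables.
    // Returns the message (null if none), which the real patch would also
    // send to the client via ClientWarn.
    static String warnIfZeroGcGs(List<Table> batch)
    {
        Set<String> offenders = null; // allocated lazily: common case has none
        for (Table t : batch)
        {
            if (t.gcGraceSeconds() == 0)
            {
                if (offenders == null)
                    offenders = new LinkedHashSet<>();
                offenders.add(t.name());
            }
        }
        if (offenders == null)
            return null;
        String msg = "Tables " + offenders + " have gc_grace_seconds = 0; "
                   + "logged batches involving them cannot be replayed";
        LOGGER.warn(msg); // rate-limited server-side log
        return msg;
    }
}
```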

          pauloricardomg Paulo Motta added a comment -

          Updated patch available for review here
          Tests updated: cassandra-dtest PR

          Tests will appear shortly in the links below: 3.0 dtests, 3.0 testall, trunk dtests, trunk testall

          iamaleksey Aleksey Yeschenko added a comment -

          Thank you Paulo, this looks good. Will commit once cassci results are up, and once beta1 is out.

          pauloricardomg Paulo Motta added a comment -

          nice, thanks!

          iamaleksey Aleksey Yeschenko added a comment -

          Committed as d0e8ba4947d3e7804421869bcd1997ca6aad3840 to cassandra-3.0 and merged with trunk, thanks.


            People

            • Assignee:
              pauloricardomg Paulo Motta
              Reporter:
              iamaleksey Aleksey Yeschenko
              Reviewer:
              Aleksey Yeschenko
            • Votes: 0
              Watchers: 4
