Cassandra / CASSANDRA-4753

Timeout reporter writes hints for the local node

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.2.1
    • Component/s: None
    • Labels:
      None

      Description

      MessagingService.java:330 calls StorageProxy.scheduleLocalHint() without checking whether the local node is the target. This sometimes causes StorageProxy.scheduleLocalHint() to throw an AssertionError.

      I got this error when some batches timed out. This can happen because even local batches now go through StorageProxy.sendMessages() and aren't just rm.apply()'d.

      1. 4753.txt
        0.8 kB
        Aleksey Yeschenko
      2. 4753-v2.txt
        2 kB
        Jonathan Ellis
      3. 4753-v3.txt
        2 kB
        Aleksey Yeschenko

        Activity

        Aleksey Yeschenko added a comment -

        See https://issues.apache.org/jira/browse/CASSANDRA-4542?focusedCommentId=13451479&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13451479 and https://issues.apache.org/jira/browse/CASSANDRA-4542?focusedCommentId=13451486&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13451486
        Aleksey Yeschenko added a comment -

        Special-casing for local inserts might be worth it now that abm is the default and we went even further with CASSANDRA-4738.

        Jonathan Ellis added a comment -

        Sounds like the right solution is to add a local-only apply the way we do for normal Mutations.

        Aleksey Yeschenko added a comment -

        That particular error was fixed some time ago (the same issue that was causing batchlog timeouts in beta1), so there is really no need to special-case the timeout reporter for the local node - such cases just should not happen.

        However, I added a check for the local node to StorageProxy.shouldHint so that no issue like this one pops up in the future.
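        A minimal, self-contained sketch of that check (illustrative names only: String endpoints stand in for InetAddress, LOCAL stands in for the broadcast address, and the real shouldHint also consults hinted-handoff configuration):

        ```java
        // Sketch of a shouldHint-style guard: never generate a hint when the
        // target is the local node, which avoids the AssertionError path.
        public class ShouldHintSketch {
            static final String LOCAL = "127.0.0.1"; // stand-in for the local broadcast address

            static boolean shouldHint(String target) {
                if (target.equals(LOCAL))
                    return false; // hinting ourselves makes no sense
                // the real method also checks hinted-handoff settings here
                return true;
            }

            public static void main(String[] args) {
                System.out.println(shouldHint("127.0.0.1")); // prints false
                System.out.println(shouldHint("10.0.0.2"));  // prints true
            }
        }
        ```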

        Jonathan Ellis added a comment -

        It's valid for local writes to time out, but we don't want to just drop them on the floor if they do. v2 attached.

        Aleksey Yeschenko added a comment -

        You are right. There is still the option of special-casing the local endpoint for syncWriteToBatchlog and asyncRemoveFromBatchlog. I know single-node clusters are not a priority, but still. That would also make this issue go away.
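        That special case could be sketched roughly like this (hypothetical, simplified names - the real methods take InetAddress endpoints and RowMutations and go through MessagingService):

        ```java
        // Sketch: when the chosen batchlog endpoint is the local node, apply
        // the mutation directly instead of routing it through messaging, so it
        // can never time out or need a hint.
        public class BatchlogWriteSketch {
            interface Mutation { void apply(); }

            static final String LOCAL = "127.0.0.1";
            static int remoteSends = 0, localApplies = 0;

            static void sendToBatchlog(Mutation mutation, String endpoint) {
                if (endpoint.equals(LOCAL)) {
                    mutation.apply(); // short-circuit: no message, no timeout
                    localApplies++;
                } else {
                    // ... a MessagingService send would go here ...
                    remoteSends++;
                }
            }

            public static void main(String[] args) {
                Mutation m = () -> { /* write the batchlog row */ };
                sendToBatchlog(m, "127.0.0.1");
                sendToBatchlog(m, "10.0.0.2");
                System.out.println(localApplies + " " + remoteSends); // prints "1 1"
            }
        }
        ```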

        Aleksey Yeschenko added a comment -

        Something like v3.

        Jonathan Ellis added a comment -

        Hmm. We actually have a couple problems here.

        1. We do allow local writes to be dropped, but
        2. Nobody writes hints for dropped local writes
        3. If local hints were attempted, they would error out (what v2 was trying to fix)

        We actually do want to hint batchlog inserts to maintain the contract of "timed out means it's in progress, you do not have to retry."

        I'm not sure what the best way to add hinting (actually, retry-on-hint-stage) code for insertLocal is, though. One option would be to just make local writes non-droppable, but then we lose our backpressure mechanism of a full hint stage causing OverloadedException, and a poorly behaving client could OOM the coordinator with a lot of local writes.

        Jonathan Ellis added a comment -

        Here is one solution: http://github.com/jbellis/cassandra/branches/4753-5

        • Created LocalMutationRunnable that retries with hint backpressure if it gets dropped
        • Added an updated v3, with an additional assert to make sure we don't introduce a bug if we decide to allow using the coordinator as a batchlog member in non-single-node clusters
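        The idea behind LocalMutationRunnable can be modeled roughly as follows. This is a simplified, hypothetical sketch using plain executors in place of Cassandra's stages; the timeout value, class layout, and method names are illustrative, not the committed code:

        ```java
        import java.util.concurrent.CountDownLatch;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.TimeUnit;

        public class LocalMutationSketch {
            static final long WRITE_TIMEOUT_MS = 100;
            static final ExecutorService MUTATION_STAGE = Executors.newSingleThreadExecutor();
            static final ExecutorService HINT_STAGE = Executors.newSingleThreadExecutor();

            // If the task waited past the write timeout it counts as "dropped";
            // instead of losing the write, resubmit it on the hint stage, which
            // keeps the OverloadedException backpressure path intact.
            static abstract class LocalMutationRunnable implements Runnable {
                final long created = System.currentTimeMillis();

                public final void run() {
                    if (System.currentTimeMillis() - created > WRITE_TIMEOUT_MS) {
                        HINT_STAGE.submit(this::runMayThrow); // retry, don't drop
                        return;
                    }
                    runMayThrow();
                }

                protected abstract void runMayThrow();
            }

            public static void main(String[] args) throws Exception {
                CountDownLatch applied = new CountDownLatch(1);
                LocalMutationRunnable write = new LocalMutationRunnable() {
                    protected void runMayThrow() { applied.countDown(); }
                };
                Thread.sleep(150); // simulate the write languishing on a full stage
                MUTATION_STAGE.submit(write);
                System.out.println(applied.await(1, TimeUnit.SECONDS)); // prints true
                MUTATION_STAGE.shutdown();
                HINT_STAGE.shutdown();
            }
        }
        ```

        The point of the design is that a "dropped" local write still flows through the hint machinery, so a full hint stage can push back on the client the same way it does for remote writes.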
        Aleksey Yeschenko added a comment -

        +1

        Jonathan Ellis added a comment -

        committed


          People

          • Assignee: Jonathan Ellis
          • Reporter: Aleksey Yeschenko
          • Reviewer: Aleksey Yeschenko
          • Votes: 0
          • Watchers: 2
