CASSANDRA-6079

Memtable flushes are delayed when there is a lot of batchlog activity, making the node OOM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.2.11, 2.0.2
    • Component/s: None
    • Labels: None

      Description

      Both MeteredFlusher and BatchlogManager share the same OptionalTasks thread, so while the batchlog manager is processing its tasks no flushes can occur. Worse, the batchlog manager also waits for the batchlog CF compaction to finish.

      Under heavy batchlog activity this prevents memtables from flushing for a long time, making the node OOM.

      Fixed this by moving the batchlog to its own thread and not waiting for the batchlog compaction to finish.
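
      A minimal sketch of the first half of that fix, assuming hypothetical names (the class and scheduling interval below are illustrative, not the actual patch): batchlog replay gets its own single-threaded scheduled executor so it can no longer starve MeteredFlusher on the shared OptionalTasks thread.

      import java.util.concurrent.Executors;
      import java.util.concurrent.ScheduledExecutorService;
      import java.util.concurrent.TimeUnit;

      public class BatchlogTasksSketch
      {
          // Dedicated daemon thread for batchlog work; flush scheduling on the
          // shared OptionalTasks executor is no longer blocked by replay.
          private static final ScheduledExecutorService BATCHLOG_TASKS =
                  Executors.newSingleThreadScheduledExecutor(r -> {
                      Thread t = new Thread(r, "BatchlogTasks");
                      t.setDaemon(true);
                      return t;
                  });

          public void start()
          {
              // periodically replay batches whose coordinator died (illustrative interval)
              BATCHLOG_TASKS.scheduleWithFixedDelay(this::replayAllFailedBatches,
                                                    10, 60, TimeUnit.SECONDS);
          }

          private void replayAllFailedBatches()
          {
              // scan the batchlog CF and re-send any batches that were never acknowledged
          }
      }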

        Activity

        Jonathan Ellis added a comment -

        Committed the executorservice change. I'm less convinced by making cleanup not block for the compaction – if batchlog is really behind, you don't want it to have to re-scan all the tombstoned rows you just replayed.

        Jonathan Ellis added a comment -

        Committed the executorservice change

        ... also to 1.2.11.

        Oleg Anastasyev added a comment -

        On the other hand, if all compaction pool threads are occupied by large CFs, the batchlog manager stops replaying batches for a long time, which may not be acceptable.
        Rescanning tombstones looks more acceptable to me than delaying mutations.

        Ideally, the batchlog should ensure the compaction really happens before waiting, either by running it in its own executor or by communicating with CompactionManager. And if the compaction cannot be done at the moment, just go ahead and rescan tombstones, trading efficiency for consistency.
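
        (One way to express that trade-off as code, purely as a sketch: the class and the 30-second timeout below are invented for illustration, not part of CompactionManager.)

        import java.util.concurrent.Future;
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.TimeoutException;

        class BoundedCleanupWaitSketch
        {
            // Wait a bounded time for the forced batchlog compaction; if the
            // compaction pool is saturated with large CFs, stop waiting and accept
            // that the next replay pass will re-scan some tombstoned rows.
            void awaitCompaction(Future<?> compactionFuture) throws Exception
            {
                try
                {
                    compactionFuture.get(30, TimeUnit.SECONDS); // timeout chosen arbitrarily here
                }
                catch (TimeoutException poolBusy)
                {
                    // trade efficiency (extra tombstone re-scan) for timely batch replay
                }
            }
        }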

        Jonathan Ellis added a comment -

        Remember, this is the replay path. If it's running, the mutation is already delayed, probably because the target node was down. So delaying for compactions doesn't seem like a dealbreaker to me. Especially since that leaves more iops to keep the rest of the system healthy and not go down ourselves.

        Oleg Anastasyev added a comment -

        Not the target, but the coordinator. It could be a target node at the same time, of course, but even if it is, at least 2 other targets are ready to accept the mutation. Delaying a quorum mutation for more than the RPC write timeout practically means it is lost.

        Jonathan Ellis added a comment -

        The coordinator doesn't block for compaction. Only replayAllFailedBatches.

        Oleg Anastasyev added a comment -

        Exactly.
        The case here could be:
        1. The coordinator writes the batchlog to the 2 chosen nodes within the same DC.
        2. The coordinator crashes.
        3. Those 2 nodes are supposed to run replayAllFailedBatches to replay the crashed coordinator's batch,
        3.1 but they cannot, because the compaction pool is busy with an ongoing large CF compaction and they are blocked waiting on submitUserDefined().get() (sketched below).
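
        (The stall described in 3.1, in a self-contained form; the interface and class below are stand-ins for BatchlogManager's cleanup path and CompactionManager.submitUserDefined(), not the actual source.)

        import java.util.concurrent.ExecutionException;
        import java.util.concurrent.Future;

        class BatchlogCleanupSketch
        {
            // Stand-in for the shared compaction pool.
            interface CompactionPool
            {
                Future<?> submitUserDefined(String cf);
            }

            // After replaying, a user-defined compaction of the batchlog sstables is
            // submitted to the shared pool and the replay thread blocks on its future.
            // If the pool is saturated with large CF compactions, replay of the dead
            // coordinator's batches stalls right here.
            void cleanup(CompactionPool pool) throws ExecutionException, InterruptedException
            {
                Future<?> compaction = pool.submitUserDefined("batchlog");
                compaction.get(); // <-- the blocking call discussed in this thread
            }
        }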

        Jonathan Ellis added a comment -

        Right. I don't think it's worth sacrificing throughput in the common case, to reduce delay in the exceptional case.

        Jonathan Ellis added a comment -

        (But, I'd be fine with avoiding the cleanup call unless there was actually something replayed.)
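
        (A minimal sketch of that guard, with invented stand-in methods rather than the actual BatchlogManager code.)

        class ConditionalCleanupSketch
        {
            // Only force the batchlog compaction when this pass actually replayed
            // something; an idle pass never touches the compaction pool.
            void replayAllFailedBatches()
            {
                int replayed = replayBatches();
                if (replayed > 0)
                    cleanup();
            }

            int replayBatches() { return 0; }  // stand-in for the replay loop
            void cleanup() { }                 // stand-in for flush + forced compaction
        }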

        Oleg Anastasyev added a comment -

        Yeah, this makes a stall on get() less probable.

        I could measure how big the impact of rescanning tombstones is on common-case throughput, if you have not performed this measurement earlier.

        Yet another idea is to reserve a special thread in CompactionManager for compactions on the batchlog (and possibly hints) CF. That way it at least wouldn't be delayed for a long time.
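
        (Roughly what that reservation could look like; the class and routing rule below are invented for illustration, CompactionManager does not expose anything like this.)

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        class ReservedCompactionThreadSketch
        {
            // Regular pool for ordinary CF compactions, plus one reserved thread so
            // batchlog/hints compactions never queue behind large CF compactions.
            private final ExecutorService regularCompactions = Executors.newFixedThreadPool(4);
            private final ExecutorService systemCfCompactions = Executors.newSingleThreadExecutor();

            Future<?> submit(String cf, Runnable compactionTask)
            {
                ExecutorService target = (cf.equals("batchlog") || cf.equals("hints"))
                                       ? systemCfCompactions
                                       : regularCompactions;
                return target.submit(compactionTask);
            }
        }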

        Jonathan Ellis added a comment -

        I'm not a huge fan of making CompactionManager more complex, either.

        Oleg Anastasyev added a comment -

        Well, I did a quick test. As I suspected, there is no measurable performance impact from this.

        3-node cluster, RF=3, separate client, writing batches of 3 inserts each to 3 different CFs, on 8-core (not so) recent iron servers.
        (No batches were actually replayed, so this is the pure common case.)

        I measured a 1-hour run in both cases.

        Performance WITH the wait on get():
        avg 3052 batches/sec, std deviation 131

        WITHOUT the wait on get():
        avg 3030 batches/sec, std deviation 106

        Jonathan Ellis added a comment -

        I think you're missing that we need to replay not just for coordinator failure but for replica failure. So "only scan for last couple seconds" is not correct; you need to scan every partition since the replica went down.

        (Note that for typical deployments, replica failure will be 3x more likely than coordinator failure.)

        Oleg Anastasyev added a comment - edited

        Do you mean the replica which is the target for one of the batched mutations? I see a hint is written by the BM if it cannot deliver one of the mutations serialized into the batchlog record (in replaySerializedMutation and attemptDirectDelivery), so as I understand it, HintManager takes care of delivering it to the failed replica once it is back online. So as soon as the BM writes hint(s) for (all of) the failed replica(s), the BM has nothing more to do with the batchlog record, so no scanning of all partitions is necessary.

        Am I missing something again?
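
        (The deliver-or-hint behavior described above, as a self-contained stand-in; the Target interface and method bodies are invented, and replaySerializedMutation / attemptDirectDelivery are only name references to the real code.)

        class ReplayOrHintSketch
        {
            // Stand-in for a replica target of one mutation in a replayed batch.
            interface Target
            {
                boolean isAlive();
                void deliver(byte[] mutation); // direct delivery
                void hint(byte[] mutation);    // hand off to hinted handoff
            }

            // For each target: deliver directly if it is up, otherwise write a hint.
            // Once every mutation is delivered or hinted, the batchlog record is done
            // with and never needs to be scanned again.
            void replaySerializedMutation(byte[] mutation, Iterable<Target> targets)
            {
                for (Target t : targets)
                {
                    if (t.isAlive())
                        t.deliver(mutation);
                    else
                        t.hint(mutation);
                }
            }
        }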

        Jonathan Ellis added a comment -

        as soon as the BM writes hint(s) for (all of) the failed replica(s), the BM has nothing more to do with the batchlog record, so no scanning of all partitions is necessary

        You are right. Carry on!

        Oleg Anastasyev added a comment -

        OK then. Started writing code.


          People

          • Assignee: Oleg Anastasyev
          • Reporter: Oleg Anastasyev
          • Reviewer: Jonathan Ellis
          • Votes: 0
          • Watchers: 1
