Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 2.1 beta1
    • Component/s: None
    • Labels:

      Description

      Brian Aker pointed out a while ago that allowing multiple threads to modify the commitlog simultaneously (reserving space for each with a CAS first, the way we do in SlabAllocator.Region.allocate) can improve performance, since you no longer bottleneck on a single thread doing all the copying and CRC computation.
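The reservation scheme described above can be sketched as follows; this is a hypothetical illustration (class and field names are invented), not the actual SlabAllocator code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: each writer CASes a shared position forward to claim
// a private span of the segment, then serializes into it without locking.
class RegionSketch {
    final byte[] buffer;                           // stands in for the mmap'd segment
    final AtomicInteger nextFree = new AtomicInteger(0);

    RegionSketch(int capacity) { buffer = new byte[capacity]; }

    /** Returns the start offset of the reserved span, or -1 if the segment is full. */
    int allocate(int size) {
        while (true) {
            int cur = nextFree.get();
            if (cur + size > buffer.length)
                return -1;                         // caller must switch to a fresh segment
            if (nextFree.compareAndSet(cur, cur + size))
                return cur;                        // this span now belongs to the caller
        }
    }
}
```

Because each thread owns a disjoint span after the CAS, copying and CRC computation can proceed in parallel with no further coordination.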

      Now that we use mmap'd CommitLog segments (CASSANDRA-3411) this becomes doable.

      (moved from CASSANDRA-622, which was getting a bit muddled.)

      1. TestEA.java
        2 kB
        Benedict
      2. parallel_commit_log_2.patch
        40 kB
        Piotr Kołaczkowski
      3. oprate.svg
        7 kB
        Benedict
      4. Multi-Threded-CL.png
        173 kB
        Vijay
      5. latency.svg
        467 kB
        Benedict
      6. Current-CL.png
        189 kB
        Vijay
      7. ComitlogStress.java
        3 kB
        Vijay
      8. 3578-logging-v3.txt
        4 kB
        Jonathan Ellis
      9. 3578-logging-v2.txt
        4 kB
        Jonathan Ellis
      10. 0001-CASSANDRA-3578.patch
        91 kB
        Michaël Figuière

        Issue Links

          Activity

          pkolaczk Piotr Kołaczkowski added a comment -

          Changes contained in this patch:

          1. Serialization and CRC computation are moved to a separate thread pool, making these operations concurrent. However, appending serialized buffers to the commit log is still serial, so this patch cannot be viewed as the final fix for the issue.

          2. The semantics of the CommitLog.add method when configured with a periodic CLES are slightly changed: in that case the add method enqueues the request and returns immediately. It doesn't wait even for the serialization, CRC, and copying of the RM into the commit log memory-mapped segment. If this behaviour causes problems, the old behaviour can easily be brought back, with only a performance penalty for the additional synchronisation.

          3. Segment syncing is done in parallel with CLS appending. This works perfectly, at least on Linux.

          My observations while performing some limited performance testing while developing this patch:

          1. CRC calculation is the most CPU-intensive operation when saving the RM.

          2. Writing to the memory-mapped buffer is extremely fast. My Dell Latitude can easily achieve copying speeds of several GB/s. The serial commit log executor was not fully loaded, even when everything was running on a RAMDISK with 4 parallel serializer threads running on all 4 cores of the CPU. Parallelizing CL appends might not improve performance by a huge factor, because we probably hit the memory throughput limit before the CPU limit. But I think it still makes sense to parallelize, in order to avoid temporary serialized buffer creation and thereby reduce GC pressure.

          3. When testing on a RAMDISK, I was able to get a minor performance improvement by avoiding unnecessary waits on locks, e.g. Future objects. It is very important that the queues are long enough for small RMs. If CL.add is blocking, the queues are usually short: their size is limited by the number of active worker threads using the CL. And with short queues, the frequency of thread context switches rises.

          4. I propose to limit the capacity of the BlockingQueues not by the number of RMs, but by the predicted size of the RMs. For large RMs, we probably don't want to enqueue too many of them, so as not to waste memory or even run out of it. On the other hand, for small RMs, longer queues are better for keeping thread context switches down.
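One way to realize the size-based bound described in point 4 is a byte-budget semaphore; this is an illustrative sketch (ByteBoundedQueue is an invented name), not code from the patch:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

// Sketch of a queue bounded by total bytes rather than element count:
// small RMs can queue deeply, while large RMs are throttled early.
class ByteBoundedQueue {
    final ConcurrentLinkedQueue<byte[]> queue = new ConcurrentLinkedQueue<>();
    final Semaphore bytesFree;

    ByteBoundedQueue(int maxBytes) { bytesFree = new Semaphore(maxBytes); }

    void put(byte[] rm) {
        bytesFree.acquireUninterruptibly(rm.length);   // blocks while over the byte budget
        queue.add(rm);
    }

    byte[] poll() {
        byte[] rm = queue.poll();
        if (rm != null)
            bytesFree.release(rm.length);              // return the bytes to the budget
        return rm;
    }
}
```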

          mfiguiere Michaël Figuière added a comment - - edited

          In this patch I propose a different approach from Piotr's. In this implementation we have only one thread to handle syncs; all the processing, that is, serialization, CRC, and copying the RM into the mmap segment, is handled directly in the writer threads. These threads exchange data with the syncer thread in a non-blocking way, so the ExecutorService abstraction has been replaced by a lighter structure.
          Several components of the CL presented challenges to implementing in this manner:

          CL Segment switch

          Switching the CL segment when it's full isn't straightforward without locks. Here we use a boolean mark that is atomically CASed by a writer thread, giving it the responsibility for performing the switch. If the mark can't be grabbed, the thread waits on a condition, which is later reused using stamps to avoid any ABA problem.
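The switch election described above might look roughly like this; the sketch is illustrative (invented names), with the waiting-on-a-condition part elided:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: exactly one writer wins the CAS and performs the segment switch;
// losing writers would wait on a stamped condition until the switch completes.
class SwitchElection {
    final AtomicBoolean switching = new AtomicBoolean(false);
    volatile int activeSegment = 0;

    boolean trySwitch() {
        if (!switching.compareAndSet(false, true))
            return false;            // another writer is already performing the switch
        activeSegment++;             // stand-in for allocating a fresh segment
        switching.set(false);        // release the mark for the next switch
        return true;
    }
}
```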

          Batch CL

          The Batch CL strategy is considered the safer mode for Cassandra, as it guarantees the client that the RM is synced to disk before answering. Making the CL multithreaded, we must ensure that we don't acknowledge an RM that is synced to disk but preceded in the CL segment by an unsynced RM, as that would make replaying the RM impossible. For this reason, we track the state of each RM's processing, and mark as synced any contiguous set of fully written RMs when the sync() call is executed.

          Avoiding any blocking queue, we still need a way to put the writer threads on hold while the sync is being ensured. LockSupport.park()/unpark() provides a nice way to do it without relying on any coarse-grained synchronization, and it avoids any condition reuse/renewal issue.
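A minimal sketch of the park/unpark handshake between a writer and the syncer thread (names invented; the volatile flag re-check guards against lost wakeups and spurious returns):

```java
import java.util.concurrent.locks.LockSupport;

// Sketch: a writer parks until the syncer thread unparks it after fsync.
// park() may return spuriously, so the loop re-checks the volatile flag;
// an unpark issued before park leaves a permit, so no wakeup is lost.
class ParkDemo {
    volatile boolean synced = false;

    void awaitSync() {
        while (!synced)
            LockSupport.park();
    }

    void signalSync(Thread writer) {
        synced = true;
        LockSupport.unpark(writer);
    }

    // Self-contained demonstration: returns true if the writer woke up.
    static boolean demo() {
        ParkDemo d = new ParkDemo();
        Thread writer = new Thread(d::awaitSync);
        writer.start();
        d.signalSync(writer);
        try { writer.join(5000); } catch (InterruptedException e) { return false; }
        return !writer.isAlive();
    }
}
```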

          Periodic CL

          The Periodic CL's challenge is mostly around throttling the writers, as here again we don't use any synchronized queue, to reduce contention. Actually, here we just need "half a blocking queue", as nothing is really added or consumed. For this reason, we just use an atomic counter and an empty/full condition pair. Here again, a pool of conditions and a stamp are used to avoid the ABA problem.

          End of Segment marker

          Another point is that this implementation doesn't use any end-of-segment marker. As we now have several concurrent writers, it's no longer possible to write a temporary marker after an entry. That means the recently committed code that fixes CASSANDRA-3615 is obviously not included in this patch.

          Nevertheless, a mechanism to avoid unwanted replay of entries from a recycled segment is still required. I haven't included it in the patch, as I think it's a design choice that needs to be debated, but it seems straightforward to implement. The options I can see are the following:

          • Fill the CL segment file with zeros on recycling. Doing so avoids any problem, but will typically require several seconds of writing on recycling, which will lead to write-latency hiccups.
          • Include the segment id in every entry. This avoids any problem as well, but increases the entry size by 8 bytes, which has a cost but isn't a drama, and can be seen as spreading the cost of the previous option over the entire CL writing.
          • Salt the two checksums included in the entry with the segment id. Doing so lowers the probability of any unwanted replay to a level that seems fairly acceptable. The advantage of this solution is that its performance cost is nil.
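The checksum-salting option could look roughly like this, assuming CRC32 stays the checksum (the entry layout and names are illustrative):

```java
import java.util.zip.CRC32;

// Sketch of salting a checksum with the segment id: an entry left over in a
// recycled segment was salted with the OLD segment id, so verification
// against the new id fails and the stale entry is not replayed.
class SaltedCrc {
    static long checksum(long segmentId, byte[] entry) {
        CRC32 crc = new CRC32();
        for (int i = 0; i < 8; i++)                 // mix all 8 id bytes into the CRC
            crc.update((int) (segmentId >>> (i * 8)) & 0xFF);
        crc.update(entry, 0, entry.length);
        return crc.getValue();
    }
}
```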

          Other noteworthy observations

          Finally, here are some noteworthy observations:

          • Here the writer thread WAITS for the processing to complete. Compared to a push-on-queue-and-forget approach, this slightly increases write latency when using the Periodic CL (the Batch CL still being synchronous), especially for large RMs. Nevertheless, on a highly loaded server, the next writes waiting to be executed would have to wait anyway for their thread to be scheduled, so the latency cost might eventually be paid regardless. Increasing the number of writer threads should help make small RMs less sensitive to large RMs.
          • If extensive benchmarks show that the previous point is an issue, there's some room to make this Periodic CL asynchronous with the writer threads.
          • To reduce contention as much as possible on the atomic states that can be modified several times by each thread, some naughty packing of several states within a single AtomicLong is used, as it decreases the likelihood of an extra spin compared to a more classical AtomicReference approach to non-blocking synchronization. The downside is code complexity, so I think AtomicReference remains an option to make the code more readable and maintainable.
          • For now, to ensure the required throttling of incoming RMs, we use a constant function with a fixed threshold of unsynced mutations. But we now have the tools to easily make the function more complex, for instance making it non-constant and relating it to the size of the mutations.
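The AtomicLong state-packing trick mentioned above can be illustrated with two 32-bit counters packed into one word (the layout and names here are invented, not the patch's):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of packing two related counters into one AtomicLong so both can be
// read consistently and updated with a single CAS.
class PackedState {
    // high 32 bits: entries appended; low 32 bits: entries synced
    final AtomicLong state = new AtomicLong(0);

    void recordAppend() { state.addAndGet(1L << 32); }

    int appended() { return (int) (state.get() >>> 32); }
    int synced()   { return (int) state.get(); }

    // Atomically advance synced up to appended, as a syncer would after fsync.
    void markSynced() {
        long cur, next;
        do {
            cur = state.get();
            next = (cur & 0xFFFFFFFF00000000L) | (cur >>> 32);
        } while (!state.compareAndSet(cur, next));
    }
}
```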
          jbellis Jonathan Ellis added a comment -

          Interesting that the author of Murmur claims that Murmur3 is 6-10x faster than CRC: http://www.reddit.com/r/programming/comments/ozodk/cityhash_new_hash_function_by_google_faster_than/c3lzl7b

          mfiguiere Michaël Figuière added a comment -

          I've used this public domain Java implementation on another project and it was behaving pretty well:
          https://github.com/yonik/java_util/blob/master/src/util/hash/MurmurHash3.java

          jbellis Jonathan Ellis added a comment -

          I've seen several cases of nodes failing to keep up with write requests where the commitlog was the bottleneck. These were all workloads throwing MB-sized columns around. Granted, that's not exactly our bread and butter.

          vijay2win@yahoo.com Vijay added a comment -

          Pushed my changes to https://github.com/Vijay2win/cassandra/commits/3578

          • The above takes a different approach: we update the commit log as part of the mutation thread, with no extra threads to deal with serialization. A CAS operation reserves a block of bytes in the mmapped segment (similar to the slab allocator) and activates segments.
          • Sync is still managed in a separate thread.
          • There is no end-of-segment marker for each mutation; we just have a header which holds the end.

          We could clean it up a little more if the approach looks good.
          Performance tests show a slight improvement... maybe once we remove other bottlenecks the improvements will be more visible (we also have to test on spinning drives).

          jbellis Jonathan Ellis added a comment -

          Maybe it's worth doing a custom CL test harness so the rest of the stack doesn't get in the way?

          vijay2win@yahoo.com Vijay added a comment -

          Ohhh great idea, will give it a shot...

          vijay2win@yahoo.com Vijay added a comment -

          Micro-benchmark code attached; it tries to update the commit log as fast as possible (I chose a small mutation to avoid active segment starvation; we are still creating ~1 CL segment per second).

          It was creating a commit log segment per second, so I'm not sure this is a valid comparison to the real world at this time. But the good part is that the patch consumes less memory and has fewer swings.

          http://pastebin.com/WeJ0QL8p

          jbellis Jonathan Ellis added a comment -

          I take it that max_mb, allocated_mb, and free_mb are heap numbers? I'm not sure what to make of those, really; e.g. we could have higher free_mb for one because it's CMSing constantly. I suggest measuring pause time from the JVM GC log instead.

          vijay2win@yahoo.com Vijay added a comment -

          Hi Jonathan, you can ignore those... I was experimenting with a few other things (like UUID.random locking, which made the numbers all bad, etc.) and hence added those metrics (I didn't mean to confuse). But if you are interested in the GC profile, please see the attached.

          jbellis Jonathan Ellis added a comment -

          Hmm, interesting that the new MT code doesn't have pauses every 10s or so. Is that where the current code has to block when it runs out of segments?

          vijay2win@yahoo.com Vijay added a comment -

          Yeah, we do a CAS instead of queue.take() in http://goo.gl/JbNWM5 , but we do allocate new segments every second; not sure why the dip... will do more profiling on it.

          vijay2win@yahoo.com Vijay added a comment - - edited

          Found the bottleneck in the current implementation!
          It actually happens during buffer.force()... the CL.add queue is capped by commitlog_periodic_queue_size:

          public int commitlog_periodic_queue_size = 1024 * FBUtilities.getAvailableProcessors();
          

          Hence, until flush() is called (every 10 seconds), writes to the CL are blocked. Hope that makes sense...

          alexliu68 Alex Liu added a comment -

          The above takes a different approach, we update commit log as a part of the mutation thread and no more threads to deal with serialization

          It slows down the mutation thread by making it wait until the commitlog write of the mutation is done. It's better to keep it async by using a separate thread.

          vijay2win@yahoo.com Vijay added a comment -

          It slows down the mutation thread by waiting for commitlog writing mutation is done

          Well, it depends on where you are bottlenecked: updating the mmap buffer is not that expensive, and it's mostly CPU work; on the other hand, it reduces the variability, as shown in the stress test.

          jbellis Jonathan Ellis added a comment -

          The above takes a different approach, we update commit log as a part of the mutation thread and no more threads to deal with serialization.

          I like it.

          Doesn't have a end of segment on each mutation, we just have header which will hold the end.

          Meaning the cfLastWrite business, right? I must be missing where this gets persisted back to disk, but in any case I'm not a fan; we got rid of the old per-segment header in part because seeking back and forth between the header and the tail, where the appends happen, was bad for performance.

          mfiguiere had a good summary of the options for dealing with end-of-segment above. Note that with any of these, I think allocate needs to write the length to the segment before returning. That way, if I have two threads writing mutations X and Y, and Y's is written before a crash while X's is not, at least the length will be there to allow replay to skip the unfinished write of X and replay Y.
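The replay behaviour described here, where a length written at reservation time lets replay hop over an unfinished entry, can be sketched with an illustrative entry layout of [length][bodyChecksum][body] (not Cassandra's actual on-disk format):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Sketch: the length is written at reservation time, so even if a body never
// made it to disk before a crash, replay can skip over it by its length and
// still recover the fully written entries that follow.
class ReplaySketch {
    static long crc(byte[] b) { CRC32 c = new CRC32(); c.update(b); return c.getValue(); }

    static void append(ByteBuffer seg, byte[] body, boolean finished) {
        seg.putInt(body.length);
        seg.putLong(crc(body));
        if (finished) seg.put(body);
        else seg.position(seg.position() + body.length);   // simulate a crash mid-write
    }

    /** Returns how many entries verified; unfinished ones are skipped by length. */
    static int replay(ByteBuffer seg, int entries) {
        seg.flip();
        int replayed = 0;
        for (int i = 0; i < entries; i++) {
            int len = seg.getInt();
            long expected = seg.getLong();
            byte[] body = new byte[len];
            seg.get(body);
            if (crc(body) == expected) replayed++;   // else skip: the length got us past it
        }
        return replayed;
    }
}
```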

          vijay2win@yahoo.com Vijay added a comment -

          Hi Jonathan,

          I must be missing where this gets persisted back to disk

          The first 4 bytes at the beginning of the file; maybe we can get rid of it and stop when the size and checksum don't match?

          But the header is pretty light, and needs only one additional seek every 10 seconds (it just marks the end of the file at the beginning of the file, just before fsync).

          I think allocate needs to write the length to the segment before returning

          The first thing the thread does after allocation is write the size and its checksum... are we talking about synchronization in the allocation, so that only one thread writes the size and the end (-1)? Currently the atomic operation is only on an AtomicLong (the position)...

          We might be able to do something similar to the current implementation, without headers, using a read-write lock, where the write lock ensures that we write the end (-1 to mark it; the lock ensures no one else overwrites the end marker) just before fsync (but the OS can also write before we force the buffers)... that might not be desirable though, since it might stall the system like the current one does.

          Not sure the header is that bad though... Let me know what you think, thanks!

          vijay2win@yahoo.com Vijay added a comment -

          Another option is to replace recycle with discard: we always create new segments and never recycle (instead discarding, in the sync thread)... then we can get rid of the header. We still need to skip corrupted/partial writes during commit log recovery...

          jbellis Jonathan Ellis added a comment -

          are we talking about synchronization in the allocation, so only 1 thread writes the size and end

          Right. Otherwise we can have a mutation that has been fsynced but that we can't actually replay, because an earlier mutation didn't finish before the fsync.

          OTOH I think this is a non-problem for Batch mode, and maybe we can wave it away for Periodic as "you should use Batch if you care."

          Other option is replace recycle with discard

          That's basically what mfiguiere was saying about zeroing out recycled segments. (Probably still more efficient than creating new ones.)

          benedict Benedict added a comment -

          I liked the basic approach of your patch, Vijay. There were a number of problems unaddressed:

          • sync() would mark things as flushed to disk that weren't, which would result in log messages never being persisted
          • BatchCommitLogExecutor would ack the first message before syncing it
          • markClean would mark current segment entirely clean, as opposed to up to latest ReplayPosition; guessing this was for testing, but has performance implications. Fixed.
          • log replay was broken, minor fix
          • checksum calculation was broken, minor fix
          • sync() could be called simultaneously at shutdown, and wasn't threadsafe - could have persisted garbled end pointer at start of log

          I've made the following changes:

          • data is marked as written in CLS after serializing to the buffer, so sync() only advances the replay position to the most recent contiguously written point in the buffer. Initially I used a skip list for this, but this seemed to be a bit of a bottleneck. Now I buffer write positions in the order they arrive, and merge them periodically (generally asynchronously, though blocking if we run out of room to buffer). Possibly needs a slight tweak to absolutely guarantee it can never run out of buffer space, but as it stands it's pretty much impossible due to compaction of any adjacent records.
          • Periodic and Batch CLEs are now the same class, with the only difference being that batch commit waits for a signal from the CommitLogSegment that it has been synced to disk
          • PeriodicCLE is now fixed rate rather than fixed delay, and blocks any writes if the previous sync hasn't completed after one commit log poll interval. This is to maintain similar guarantees as before, although potentially we should commit every 1/2 configured period, as we can lose up to 2 periods of data if the previous sync is failing badly.
          • Moved flushOldestKeyspaces into the CLE thread, and it now flushes keyspaces from multiple old log files, not just the oldest, up to the number needed to bring us below our limit. Since we can rapidly add many log files now, this seemed necessary

          I've also made some changes to encapsulate concurrency better, to minimise risk of bugs. Allocation is all done inside CLA, as opposed to split between CLA and CL. CLA also now allocates new segments as soon as the last reserve segment is used, as opposed to every 100ms, in case we ever have a situation where we exhaust segments in < 100ms.

          After making all of these changes, I actually found very little improvement in performance. It turns out this was because the point of contention is moving from the CL to the switchLock, and I still see huge spikes when waiting for a flush here. I've tested these changes with my in progress patch for 6271 and found up to 2x performance boost for high thread counts. It's quite likely a patch for 5549 will help further.

          https://github.com/belliottsmith/cassandra/tree/iss-3578
          https://github.com/belliottsmith/cassandra/tree/iss-3578+6271 - note this is far from prod ready, just useful for performance testing. might only work with stress.
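The "most recent contiguously written point" bookkeeping described above can be illustrated with a toy version. Hypothetical names; the real branch buffers positions and merges them asynchronously rather than sorting on every call, so this is illustrative only:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch (not the actual patch): writers record the [start, end)
// slices they have finished serializing, and sync() may only advance the
// replay position to the end of the contiguous prefix, so a slow writer with
// an unfinished slice holds the marker back.
final class WrittenTracker {
    private final List<int[]> written = new ArrayList<>(); // finished [start, end) slices

    synchronized void markWritten(int start, int end) {
        written.add(new int[]{ start, end });
    }

    /** Highest offset up to which the buffer is contiguously written. */
    synchronized int contiguousLimit() {
        written.sort(Comparator.comparingInt(s -> s[0]));
        int limit = 0;
        for (int[] slice : written) {
            if (slice[0] > limit)
                break; // gap: an earlier allocation hasn't finished writing yet
            limit = Math.max(limit, slice[1]);
        }
        return limit;
    }
}
```

This is the property that makes the batch executor safe: nothing past an unfinished write is ever reported as durable.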

          jbellis Jonathan Ellis added a comment -

          How does your branch deal with replaying?

          I note that replay still checks for END_OF_SEGMENT_MARKER but nothing actually writes that value anymore.

          benedict Benedict added a comment -

          I left Vijay's solution mostly unchanged here (bar a minor fix) - sync() writes the end position to the start of the commit log before each buffer.force(). The END_OF_SEGMENT_MARKER is still read for backwards compatibility; MessagingService.VERSION is used to determine if we should read the start marker, although I'm not familiar enough with this property to know if this is suitable for deciding this. We might want to consider versioning the commit logs themselves, although we could do that easily with the name pattern.

          jbellis Jonathan Ellis added a comment -

          We do have CommitLogDescriptor.version.

          benedict Benedict added a comment -

          Ah, my mistake, it looks like this is the version that was being used.... but it was being compared with MessagingService.version. I would guess this is because of the slightly confusingly named CommitLogDescriptor.getMessagingVersion(). Should have looked at this part of the patch more closely.

          Are the versions meant to correspond to release versions? i.e. VERSION_21 -> C* 2.1 and above (so we're safe using the current version)? or should I bump to VERSION_22?

          jbellis Jonathan Ellis added a comment -

          Yes, the versions correspond w/ releases.

          benedict Benedict added a comment -

          Thanks. Repos updated.

          benedict Benedict added a comment -

          I think there are two further improvements that need to be made, excluding some tidying up, before we should think about integration:

          1) Obviously address optimising the CL format; and
          2) Consider scrapping cfLastWrite beyond a set of names present in the log file

          I haven't worried about (1) up until now as I'm testing on SSDs so the seeking shouldn't cause a major performance penalty, and I've mostly been interested in finding out how we can improve performance, but if we were to scrap (2) we could make (1) resilient to reading past not-yet-synced records, and we could stop worrying about linearising the write positions, eliminating a huge amount of the threading overhead. We would just increment a counter indicating we've dirtied the segment. The work we currently do would still be necessary for batch executors.

          The only negatives here are that we might unnecessarily flush a Cf that has already fully synced in flushOldestKeyspaces (when trying to reclaim logs) and that we will be slightly slower to reclaim log files. On the whole I think it is a cost well worth paying for reducing the CL write path.

          vijay2win@yahoo.com Vijay added a comment -

          Hi Benedict, archiver.maybeArchive(segment.getPath(), segment.getName()) is a blocking call and will need to be a separate thread it might involve user defined archival.

          sync() would mark things as flushed to disk that weren't, which would result in log messages never being persisted

          My understanding is that calling force will sync the dirty pages, and if we do concurrent writes to the same page they will be marked dirty again and synced in the next call — so how would we lose the log messages?

          I still like the original approach of creating files (it may be just me) because of its simplicity, and we can be aggressive in the allocator threads, similar to your patch (creating empty files and deleting them).

          benedict Benedict added a comment -

          archiver.maybeArchive(segment.getPath(), segment.getName()) is a blocking call and will need to be a separate thread it might involve user defined archival.

          This is already an asynchronous call, which is from your patch

          how will we loose the log messages?

          You only call force when you think there is something dirty, not when the buffer does, and this is determined by where the allocation position is at the time we write the buffer (if it has been updated since the last force). A thread may have allocated itself some room but not yet written to it before the force() call, so the end point in the log will point past the position we have last written to (and any log records after it will be unreadable due to the junk in between, or worse we may have some old records), and we may never update it if the allocation position has reached its limit, or if we die.

          If we're scrapping cfLastWrite we could plausibly just fall back on calling force() for every sync() and let it decide if any work needs to be done, though. I would prefer to still use a counter, as there's still a minor, improbable concurrency bug where we could have a log file recycled before a thread has finished updating its portion of the buffer.

          I still like the original approach of creating files (it may be just me) because of simplicity and we can be aggressive in allocator threads similar to your patch (to create empty files and deleting them).

          Not sure I follow this one

          benedict Benedict added a comment -

          Okay, so thinking on this a little further, I am going to try the following for the log file:

          1) Before each sync, we will reserve a portion of the log file for writing another header, and will update the previous header to point to this new header, so we chain headers without needing to seek;
          2) In order to allocate ourselves a position in the buffer, we place ourselves on a reverse-order linked list, with our position and size. Once we have written this we clear the size, but in the sync() we iterate over the list and ensure that the size (and size checksum) are written if they have not yet been cleared, so that the replay can be certain that an incomplete write of the size doesn't garble the log file. A race to write the size shouldn't matter, as they will both write the same value.
          3) Each log message's tail checksum will be salted with the segment id. This is only to ensure we never see an object that is exactly the right size in exactly the right position in the file.

          In doing this we should make ourselves able to cleanly replay with incompletely written log messages, and not worry about tracking exactly which log messages have been written (beyond whether or not we're dirty and/or completely written), shortening the CL write path.

          jbellis Jonathan Ellis added a comment -

          I like the chained headers idea.

          I'm not sure it's worth the gymnastics in (2) since if we have unsynced data at power loss time all bets are off anyway.

          benedict Benedict added a comment -

          If we do neither (2) nor my previous approach of linearizing the writes and only marking complete up until the furthest contiguously written point then, at the very least, our batch executor could ACK a commit that may be unreadable on replay because earlier log messages haven't been written (without their sizes being synced we can't read the later messages). We don't need a power-off scenario; just a kill would do.

          For Periodic sync we probably don't care so much, although there's no guarantee how long a thread could have been paused for. Should be micros, but we could theoretically have multiple flushes unreadable due to a stalled / failed log write.

          jbellis Jonathan Ellis added a comment -

          How ugly would it get to either wait for previous (in CL order) mutations before syncing, or alternatively, waiting for any in-progress ops on sync which should amount to the same thing?

          benedict Benedict added a comment -

          Well, one possibility is allocating a new ReadWriteLock object (well, a wrapper with an extra state flag) after each sync, which we readLock() before we allocate a section, and writeLock() when we come to sync, after which we invalidate the object and start again. There is some extra synchronization (two extra CASes, though might be able to eliminate one of the existing ones), but that's unlikely to hurt too much, and it is pretty simple.

          Obviously, we need to make sure that no new writes block on the attempt to lock this object, so before we go to writeLock on sync() we should probably swap in a new lock object, so that new writes go there instead. We'll consider the log complete up to the last position allocated prior to our swapping the lock object. This might permit some writes to be ahead of where the CL thinks we've gotten to, which won't be replayable, but that's fine - those are in "the future" anyway, so we're okay to drop them until next sync.
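The lock-swap described above might be sketched like this (hypothetical names; a sketch of the idea, not the branch's code): writers hold the current "sync window"'s read lock around their buffer copy, and sync() first swaps in a fresh window — so new writers queue on the new object — then takes the old window's write lock, which waits only for writes already in flight.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Speculative sketch of the per-sync lock-swap idea discussed above.
final class SyncBarrier {
    private final AtomicReference<ReentrantReadWriteLock> window =
            new AtomicReference<>(new ReentrantReadWriteLock());

    /** A writer calls this before copying into the buffer, and unlocks after. */
    ReentrantReadWriteLock.ReadLock enterWrite() {
        ReentrantReadWriteLock lock = window.get();
        lock.readLock().lock();
        return lock.readLock();
    }

    /** sync() calls this; it blocks only until in-flight writes complete. */
    void awaitInFlightWrites() {
        // Swap first so new writers acquire the fresh window and never queue here.
        ReentrantReadWriteLock old = window.getAndSet(new ReentrantReadWriteLock());
        old.writeLock().lock();   // waits for read locks taken before the swap
        old.writeLock().unlock();
    }
}
```

A writer that fetched the old window just before the swap may briefly block behind sync()'s write lock; that narrow window matches the observation above that allocations after the swap are "the future" and simply land in the next sync.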

          vijay2win@yahoo.com Vijay added a comment - - edited

          You only call force when you think there is something dirty, not when the buffer does,

          Ahaa.... that might be an oversight; we can call buffer.force all the time and let the OS decide if it has to sync the filesystem. If we do that then we just need to stop during recovery/replay when we see corrupted columns (which are there because the OS or the force didn't complete the fsync).

          How ugly would it get to either wait for previous (in CL order) mutations before syncing

          We can do that with another counter which holds the bytes written by all the threads, comparing it with the allocated count..... We don't need a lock in that case.

          benedict Benedict added a comment -

          when we see corrupted columns (which are because the OS or the force didn't complete the fsync).

          Or because we haven't flushed because we didn't know there were unfinished writes behind us - this is easily solved with a dirty counter, but see above discussion for other issues with guarantees.

          We can do that with another counter which holds the bytes written by all the threads, comparing it with the allocated count..... We don't need a lock in that case.

          Counting bytes written isn't safe, as it is unordered - so we could have A allocate, B begin sync, C allocate, C write, B see counters equal, sync, A update counter. B thought it had got A, but it hadn't. Unless we spin in the allocator until a double-check of both counters is the same, but this is an unpleasant solution. The lock is asynchronous from the POV of the writers, only the sync() ever blocks, which is better than it spinning for an undetermined amount of time.
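For concreteness, the spin being objected to here — read allocated, compare written, then re-read allocated to confirm a stable snapshot — could look like the sketch below. Hypothetical names; and, as noted above, under constant write traffic this loop may spin for an unbounded time, which is why the asynchronous lock is preferable.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustration of the hazard discussed above: "allocated" and "written"
// byte counters can transiently agree with a stale snapshot while a write is
// in flight, so a single comparison is unsafe. The three-way re-read below
// only returns on a stable snapshot, at the cost of potentially lengthy
// spinning while both counters keep moving.
final class ByteCounters {
    final AtomicLong allocated = new AtomicLong();
    final AtomicLong written = new AtomicLong();

    /** Spin until a consistent snapshot shows no allocated-but-unwritten bytes. */
    long awaitQuiescent() {
        while (true) {
            long a = allocated.get();
            if (written.get() == a && allocated.get() == a)
                return a; // stable: every reserved byte has been written
        }
    }
}
```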

          benedict Benedict added a comment -

          Or because we haven't flushed because we didn't know there were unfinished writes behind us - this is easily solved with a dirty counter, but see above discussion for other issues with guarantees.

          Sorry, this wasn't at all clear. I meant in the case of shutdown or recycling. This is a highly unlikely scenario, but we might not realise we haven't finished updating the buffer.

          vijay2win@yahoo.com Vijay added a comment -

          we could have A allocate, B begin sync, C allocate, C write, B see counters equal

          I am talking about counting all the allocations and writes within a segment.... i.e. (A + B + C) != (A + B) means C (or someone else) is still writing.

          we didn't know there were unfinished writes behind us

          That's fine, we will skip those; that's what the current implementation does too: if you are writing in a sequence and the server stops, the commits which were in the queue are not written.... We are just moving that queue to the buffer.
          Practically this is less of a concern because they are only a few nanos out of sync.

          Anyway, I should stop selling....

          benedict Benedict added a comment -

          I am talking about counting all the allocations and writes within a segment.... i.e. (A + B + C) != (A + B) means C (or someone else) is still writing.

          but this gets updated as you're reading, so unless you read allocationPosition = writtenPosition = allocationPosition (unchanged) you can fail — and so you need to spin until you see this three-way check succeed, which I'm not keen on. Since both counters are being updated all the time, this could potentially mean a lengthy period of spinning.

          Thats fine we will skip those, thats what the current implementation does too

          The current implementation never ACKs a message it cannot later replay under batch, but this approach would permit that to happen.

          benedict Benedict added a comment -

          Please don't stop selling if you think I'm wrong though... a better solution is a better solution

          vijay2win@yahoo.com Vijay added a comment -

          The current implementation never ACKs a message it cannot later replay under batch

          We don't guarantee that in PeriodicCommitLogExecutorService; all this time I was trying to optimize for the general case (PeriodicCommitLogExecutorService)....
          The BatchCommitLogExecutorService in my patch (https://github.com/Vijay2win/cassandra/commit/0d982e840145d466b8bcbc863d6218b24b0842ad#diff-05c1e4fd86fea19b8e0552b1f289be85L191) ACKs only after we write (we wait for the sync after that write), so the write and sync of that particular write happen before acking back.

          benedict Benedict added a comment -

          We don't guarantee that in PeriodicCommitLogExecutorService

          No, you're right. I misspoke. For PeriodicCLE we only guarantee not to commit anything we cannot later replay.

          Unfortunately the BatchCLE in your patch could still ACK messages that weren't replayable: a message could reserve space, not finish writing it, and so not get placed on the waitIfNeeded() call/queue, but a later writer could race past it and get there. Since the earlier writer hasn't serialized its size, the later message is not replayable. Besides, the queue has two locks, so you'd still have a lock.
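          A simplified sketch of why such a gap is fatal to replay (the real replayer and framing are more involved; this only illustrates the length-prefix argument): entries are size-prefixed, and a reserved-but-unwritten slot still holds zeroes, so replay stops there and never reaches a later, fully written entry.

          ```java
          import java.nio.ByteBuffer;
          import java.util.ArrayList;
          import java.util.List;

          // Illustrative replay loop: a zeroed length prefix (the unfinished writer's
          // reservation) looks like end-of-segment, orphaning everything after it.
          class ReplaySketch
          {
              static List<byte[]> replay(ByteBuffer segment)
              {
                  List<byte[]> mutations = new ArrayList<>();
                  while (segment.remaining() >= 4)
                  {
                      int size = segment.getInt();
                      if (size <= 0 || size > segment.remaining())
                          break; // zeroed or bogus length prefix: stop replaying
                      byte[] entry = new byte[size];
                      segment.get(entry);
                      mutations.add(entry);
                  }
                  return mutations;
              }
          }
          ```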

          vijay2win@yahoo.com Vijay added a comment -

          https://github.com/Vijay2win/cassandra/commits/3578-v2 addresses most of the concerns here.

          The only thing we have discussed here that has not been addressed yet is the aggressive allocation and deallocation of commit logs, but I'm not sure it's needed yet...

          benedict Benedict added a comment -

          A patch for this is available for review at 3578-2

          Already discussed:

          • Chained headers
          • Ensures commits are persistent, using the suggested synchronisation scheme (read/write lock)

          Further changes:

          • Writes are completely non-blocking unless the CLE is behind or you're using Batch CLE
          • On activating a new CLS, we trigger a sync() of the log; so now we sync() every pollInterval elapsed, OR every commit_log_segment_size_in_mb written, whichever condition is met first after the previous sync. This allows us to stay a little ahead of pollInterval, giving us some breathing room during "brief" spikes in write load in excess of what the disk can handle.
          • Once we've completely written a CLS we immediately close/unmap the buffer
          • On any drop keyspace or column family command, or on a node drain, we force the recycling of any CLS in use at the time of the call (this addresses CASSANDRA-5911. I included it in this ticket as it was easier to think about both at once)
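          The dual sync trigger in the second bullet can be sketched roughly as follows (names are illustrative, not the patch's actual ones):

          ```java
          // Hedged sketch of the sync policy described above: sync when pollInterval
          // has elapsed since the last sync, OR a full segment's worth of bytes has
          // been written, whichever comes first.
          class SyncPolicy
          {
              final long pollIntervalMillis;
              final long segmentSizeBytes;
              long lastSyncAtMillis;
              long bytesWrittenSinceSync;

              SyncPolicy(long pollIntervalMillis, long segmentSizeBytes)
              {
                  this.pollIntervalMillis = pollIntervalMillis;
                  this.segmentSizeBytes = segmentSizeBytes;
              }

              boolean shouldSync(long nowMillis)
              {
                  return nowMillis - lastSyncAtMillis >= pollIntervalMillis
                      || bytesWrittenSinceSync >= segmentSizeBytes;
              }

              void onSynced(long nowMillis)
              {
                  lastSyncAtMillis = nowMillis;
                  bytesWrittenSinceSync = 0;
              }
          }
          ```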

          Some implementation detail changes:

          • We maintain a separate cfDirty and cfClean set now, which we merge on demand, to avoid allocating/deallocating AtomicIntegers all of the time
          • We now reject row mutations larger than HALF the size of a CL segment, rather than only those equal in size - this is to stop burning through lots of CLS if we try to switch to a new segment but are then beaten to allocating the first item in it.
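          The half-segment size check in the last bullet amounts to something like this (segment size and exception type are assumptions for illustration):

          ```java
          // Illustrative version of the size check described above: mutations larger
          // than half a segment are rejected outright, so a mutation that loses the
          // race for the first slot of a fresh segment can't force yet another segment.
          class MutationSizeCheck
          {
              static final long SEGMENT_SIZE = 32L * 1024 * 1024; // assumed 32MB segment

              static void validateSize(long serializedSize)
              {
                  if (serializedSize > SEGMENT_SIZE / 2)
                      throw new IllegalArgumentException(
                          "Mutation of " + serializedSize + " bytes exceeds half the commit log segment size");
              }
          }
          ```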

          Some future work:

          • Could reasonably easily have a guaranteed non-blocking CL.add method, which yields a Future if blocking becomes necessary; this could allow us to short-circuit the write-path a little to reduce latency in the majority of cases where blocking doesn't happen
          • Compressed CL to improve IO
          • Need to improve error handling in CL in general

          Note, Vijay, that I briefly switched to a simpler blocking approach to switching in a new segment, as you suggested you preferred the simpler approach, but I decided to revert to non-blocking, due to potential future dividends with this guarantee.

          I've attached two graphs to demonstrate the effect of this patch in a real 4-node cluster. Note the latency graph has a logarithmic y-axis, so this patch looks to be an order of magnitude better on the worst measured write latency; variance in latency at the tail end is also lower. This is also why there are fewer measurements: the stderr of the measurements was smaller, so stress finished earlier. There is also a roughly 12% increase in maximum throughput on this particular cluster.

          jbellis Jonathan Ellis added a comment -

          Working through this still: https://github.com/jbellis/cassandra/commits/3578

          Some comments so far:

          Does it still make sense to track Pending Tasks when we don't really have an executor per se? I'd be okay with removing it.

          The guarantee people really want with PCLS is, "I won't lose more than X ms of writes." That's not what we deliver, but that's what they want. Is that even feasible?

          Why does sync re-check non-current segments when shutting down? They are synced in advanceAllocatingFrom.

          We use "allocate" to mean both creating new Segments and reserving space within a segment. Can we disambiguate these?

          benedict Benedict added a comment -

          Does it still make sense to track Pending Tasks when we don't really have an executor per se? I'd be okay with removing it.

          I think it still serves a purpose - it now means something different, though, which is 'those tasks that are blocked on a sync()'. This can happen for both PCLE and BCLE, and I expect if some scenario crops up that holds the sync() up, it could well be useful for debugging what's happening if you can see that statistic. Now that I think about it, another 'statistic' should be exposed as well, perhaps, which is timeOfLastSync(), which would help spot that scenario.

          The guarantee people really want with PCLS is, "I won't lose more than X ms of writes." That's not what we deliver, but that's what they want. Is that even feasible?

          We provide this guarantee just as well as we used to - the maybeWaitForSync() in your modified patch blocks if the sync() ever gets more than pollInterval behind (i.e. doesn't update its lastAliveAt heartbeat in that period), until it catches up. I've tested this by pumping too much data through, so that the sync() couldn't quite keep up, and it works as intended. The only major difference is I tack an extra 10ms on as a cludgey guarantee we don't wait when the poll is just waking up and getting itself together. We could, and perhaps should, instead subtract this from the pollInterval (with perhaps the cludge dropping to 1ms if the pollInterval is < 20ms).
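          The heartbeat scheme described here can be sketched as below (names like lastAliveAt and the 10ms grace are taken from the comment; the blocking mechanism is simplified to a spin for illustration, where the real patch would park/await on the syncer):

          ```java
          // Rough sketch of maybeWaitForSync(): the sync thread updates its heartbeat
          // as it runs; writers block only when the heartbeat is more than
          // pollInterval (+ a 10ms grace) stale, i.e. the syncer has fallen behind.
          class SyncHeartbeat
          {
              volatile long lastAliveAtMillis = System.currentTimeMillis();
              final long pollIntervalMillis;
              static final long GRACE_MILLIS = 10;

              SyncHeartbeat(long pollIntervalMillis)
              {
                  this.pollIntervalMillis = pollIntervalMillis;
              }

              void maybeWaitForSync()
              {
                  while (System.currentTimeMillis() - lastAliveAtMillis > pollIntervalMillis + GRACE_MILLIS)
                      Thread.yield(); // stand-in for parking until the syncer catches up
              }
          }
          ```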

          Why does sync re-check non-current segments when shutting down? They are synced in advanceAllocatingFrom.

          advanceAllocatingFrom only requests a sync(), it doesn't actually perform one - shutdown and the normal pollInterval sync are now the same process, so they actually both "sync" CLS we have already finished syncing. This is the same behaviour as before wrt resyncing already synced segments. I'm not too worried about this, since they're simply a monitor acquire/release followed by a comparison of two integers if they're already synced, so we're not wasting any time going to disk, or even more than a hundred cycles or so per segment, probably fewer since the monitor will be biased. We could move them into another queue of completed CLS once we've finished syncing, and only check that queue for recycling. I'd be happy with that. But I don't think it's an issue unless you've got tens of thousands of finished-but-not-yet-recycled segments.

          We use "allocate" to mean both creating new Segments and reserving space within a segment. Can we disambiguate these?

          You're right, and it looks to me like you've achieved this with your segmentManagementTasks rename - if we rename the thread to CommitLogSegmentManager or something, and the wake?() method accordingly, I think that probably suffices.

          One thing I would say about your cleanup: I deliberately allocated the Allocation object in the caller and set its fields in the callee to guarantee escape analysis kicks in to avoid allocating the objects on the heap. It's likely that it will find the allocations don't escape, but I'm not sure how well it handles back-tracking out of the allocating method to do so.
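          The caller-allocates pattern being defended here looks roughly like this (field names and the tryAllocate signature are illustrative, not the patch's actual API):

          ```java
          // Sketch of the pattern described above: the caller news up the Allocation
          // holder and the callee fills its fields, so the object provably never
          // escapes the caller's frame and escape analysis can elide the heap allocation.
          class AllocationSketch
          {
              static final class Allocation
              {
                  Object segment;
                  int position;
                  int size;
              }

              // callee fills the caller-provided holder instead of returning a new object
              static boolean tryAllocate(int size, Allocation out)
              {
                  out.segment = "segment"; // stand-in for the real CommitLogSegment
                  out.position = 0;
                  out.size = size;
                  return true;
              }

              static int add(int mutationSize)
              {
                  Allocation alloc = new Allocation(); // never escapes this frame
                  if (!tryAllocate(mutationSize, alloc))
                      return -1;
                  return alloc.position + alloc.size;
              }
          }
          ```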

          Also, my IDEA fu is weak, I tried to find a SameThreadES in our dependencies, but couldn't! Obviously I didn't try very hard.

          benedict Benedict added a comment -

          Thinking a little more on the PCLE guarantees about syncing, I think perhaps a better scheme is to attempt to sync() twice every "poll interval" (or once every 0.49*interval), and to only update the heartbeat at the end of a sync(); at the moment I update before and after, to deal with the fact that the blockInterval and pollInterval are approximately the same (so we don't simply block every time we start a sync()). This means you could lose up to 2x interval commits. I addressed this in a previous comment, but didn't deal with it in the code.

          It also occurs to me that pollInterval is a really bad name, and it should be called maxSyncInterval or something.

          jbellis Jonathan Ellis added a comment -

          We provide this guarantee just as well as we used to

          I get that, I'm just asking if we can improve on that.

          I think perhaps a better scheme is to attempt to sync() twice every "poll interval"

          Dunno, if we're going to do that let's just make more explicit what our guarantees are and let them halve the interval themselves if that's what they want.

          It's likely that it will find the allocations don't escape, but I'm not sure how well it handles back-tracking out of the allocating method to do so.

          How can we find out?

          jbellis Jonathan Ellis added a comment -

          if we rename the thread to CommitLogSegmentManager or something, and the wake?() method accordingly, I think that probably suffices

          I dunno, and I realize my OCD is showing, but Allocation isn't actually allocating anything; it's really more of a reservation. Whereas creating a new segment actually is allocating a resource.

          benedict Benedict added a comment -

          Was browsing your changes and noticed your CL.getContext() comment clarification, which I think probably doesn't go far enough - also realised its return value no longer needs to be a Future (the callee also only uses immediateFuture). I think something like the following is appropriate:

               /**
                 * @return a ReplayPosition representing the current position of the commit log; 
                 * any CL.add() that completed prior to this call will occur at an earlier ReplayPosition. 
                 * This makes no guarantees about persistence, only ordering
                 */
                public ReplayPosition getContext()
                {
                    return allocator.allocatingFrom().getContext();
                }
          
          jbellis Jonathan Ellis added a comment -

          +1

          benedict Benedict added a comment -

          I dunno, and I realize my OCD is showing, but Allocation isn't actually allocating anything; it's really more of a reservation. Whereas creating a new segment actually is allocating a resource.

          Well, I'm sanguine about this, and since I had OCD for realz when I was younger, that probably means it doesn't matter too much. We could go crazy, and call neither an allocation - a LogRecordReservation, and SegmentManagement. Although it really depends how you define allocate - the dictionary definition definitely supports using it in either context. Just because we allocated the system resources earlier doesn't mean we aren't also allocating them now, to specific use cases. I've already typed more on this than is sensible, so I'll let you decide!

          How can we find out?

          Experimentation, and checking the byte-code (or cludgily checking performance of an isolated run of each approach - it'll be a noticeable difference), and hoping they don't change the behaviour in future releases. There is literally no documentation on escape analysis, which frustrates me no end, given that its correct deployment probably has one of the largest performance implications on critical code paths. I'm pretty confident it does backtrack object allocation outside of some methods, but the extent to which it is willing to do so, and what behaviour of the inner method might prevent it, is not clear.

          Dunno, if we're going to do that let's just make more explicit what our guarantees are and let them half the interval themselves if that's what they want.

          Sure, either works. If we make it explicit, then I think depending on how you define "no data loss of more than Xms" we meet the criteria with this patch: a sync() happens at least once every Xms; if (X+10)ms after starting it has either failed or blocked indefinitely any CL.add() from that point will block until it succeeds, preventing the writes from being acknowledged. So we guarantee that at most (2X+10)ms of acknowledged writes can be lost.

          Obviously, this isn't strictly no data loss beyond that interval, because there could be lots of blocked writes that will all also be lost, that could stack up for hours or days if they don't spot the problem. But we won't acknowledge any of them, and I think we'd need herds of ponies to save them.

          On that topic, I just noticed a pretty glaring bug though. The first lastAliveAt assignment in CLE should go at the very end of the try block, else on exceptions we'll continue to acknowledge when we haven't written anything, in which case we really aren't providing those guarantees at all. Whoops :-/

          benedict Benedict added a comment -

          Hmm.. Actually, now that I think about it, the worst case is 2(X+10)ms of lost writes, as we only sync() those segments that were active when we started the sync(), so we could have (X+10)ms of writes backed up, plus the following (X+10)ms of new writes that will come in before we realise we're stuffed.

          benedict Benedict added a comment -

          How can we find out?

          I just had a quick hack at the cludgey route, and it looks like my suspicions were well founded that this particular spot wouldn't benefit from escape analysis without preallocation. I've replicated a very basic comparison of the two possible code paths in the file I'm uploading called TestEA.java. The results of the runs, with GC logging, are:

          0.128: [GC 7874K->936K(120832K), 0.0024800 secs]
          0.131: [Full GC 936K->763K(120832K), 0.0131020 secs]
          0.163: [GC 32507K->795K(120832K), 0.0008780 secs]
          0.169: [GC 32539K->795K(120832K), 0.0012210 secs]
          0.175: [GC 32539K->827K(152576K), 0.0016170 secs]
          0.194: [GC 64315K->827K(152576K), 0.0006820 secs]
          0.203: [GC 64315K->827K(211456K), 0.0012330 secs]
          0.239: [GC 127803K->795K(215552K), 0.0012810 secs]
          0.259: [GC 131355K->779K(338944K), 0.0012560 secs]
          0.331: [GC 254731K->779K(338944K), 0.0017270 secs]
          0.366: [GC 254731K->779K(491008K), 0.0006980 secs]
          0.471: [GC 407307K->779K(592384K), 0.0015460 secs]
          0.567: [GC 508171K->779K(735744K), 0.0011170 secs]
          0.692: [GC 651531K->779K(735744K), 0.0008890 secs]
          0.777: [GC 651531K->779K(755712K), 0.0018440 secs]
          0.870: [GC 671499K->779K(755712K), 0.0009420 secs]
          0.971: [GC 671499K->779K(755712K), 0.0014780 secs]
          1.060: [GC 672011K->779K(755712K), 0.0050500 secs]
          1.154: [GC 672011K->779K(756224K), 0.0013940 secs]
          1.241: [GC 672523K->779K(756224K), 0.0010910 secs]
          1.332: [GC 672523K->779K(756224K), 0.0010210 secs]
          1.421: [GC 672523K->779K(756224K), 0.0015880 secs]
          1.514: [GC 672523K->779K(756224K), 0.0011860 secs]
          1.602: [GC 672523K->779K(756224K), 0.0047140 secs]
          1.696: [GC 672523K->779K(756224K), 0.0011850 secs]
          1.783: [GC 672523K->779K(756224K), 0.0037350 secs]
          1.875: [GC 672523K->779K(756224K), 0.0013470 secs]
          1.967: [GC 672523K->779K(756224K), 0.0011120 secs]
          2.053: [GC 672523K->779K(756224K), 0.0014090 secs]
          2.143: [GC 672523K->779K(756224K), 0.0015660 secs]
          2.231: [GC 672523K->779K(756224K), 0.0028500 secs]
          2.328: [GC 672523K->779K(756224K), 0.0009810 secs]
          2.424: [GC 672523K->779K(756224K), 0.0010350 secs]
          2.511: [GC 672523K->779K(756224K), 0.0014540 secs]
          2.596: [GC 672523K->779K(756224K), 0.0017300 secs]
          2.685: [GC 672523K->779K(756224K), 0.0034170 secs]
          2.774: [GC 672523K->779K(756224K), 0.0006720 secs]
          2.868: [GC 672523K->779K(756224K), 0.0014270 secs]
          2.961: [GC 672523K->779K(756224K), 0.0016080 secs]
          3.047: [GC 672523K->779K(756224K), 0.0020300 secs]
          3.134: [GC 672523K->779K(756224K), 0.0012080 secs]
          3.230: [GC 672523K->779K(756224K), 0.0014360 secs]
          3.318: [GC 672523K->779K(756224K), 0.0020730 secs]
          3.408: [GC 672523K->779K(756224K), 0.0013830 secs]
          3.498: [GC 672523K->779K(756224K), 0.0012430 secs]
          3.585: [GC 672523K->779K(756224K), 0.0012390 secs]
          3.681: [GC 672523K->779K(756224K), 0.0074200 secs]
          3.781: [GC 672523K->779K(756224K), 0.0011200 secs]
          3.869: [GC 672523K->779K(756224K), 0.0053850 secs]
          3.973: [GC 672523K->779K(756224K), 0.0014770 secs]
          4.062: [GC 672523K->779K(756224K), 0.0017590 secs]
          4.149: [GC 672523K->779K(756224K), 0.0011470 secs]
          4.236: [GC 672523K->779K(756224K), 0.0011720 secs]
          4.325: [GC 672523K->779K(756224K), 0.0007650 secs]
          4.418: [GC 672523K->779K(756224K), 0.0006690 secs]
          4.510: [GC 672523K->779K(756224K), 0.0030140 secs]
          NoPreAlloc: 4435.45ms
          4.580: [GC 511644K->779K(756224K), 0.0008540 secs]
          4.581: [Full GC 779K->554K(756224K), 0.0113780 secs]
          PreAlloc: 1656.08ms
          

          So we're definitely producing garbage, and incurring a slightly higher cost as a result. Admittedly this won't make a huge difference here, but given it's a fairly hot path I think it's worth the optimisation, given that (IMO) it isn't really at all ugly either.

          I just hacked in a third test to see if escape analysis ever works out backwards to allocate in the caller if we know all callers don't let the result escape, and this appears to not be the case. Which is disappointing. For this test I simply loop calling a method that allocates the object - the method should be inlined, and it still isn't optimising it away, so presumably escape analysis happens prior to inlining.
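The shape of the comparison in the attached TestEA.java is roughly the following (a hand-rolled sketch with hypothetical names, not the attached file itself): one loop allocates a small holder object per iteration and relies on escape analysis to elide it, the other reuses a single preallocated instance. Running with -verbose:gc shows whether the per-iteration version produces garbage.

```java
// Sketch (hypothetical names): per-iteration allocation vs. a preallocated,
// reused holder. Run with -verbose:gc to observe the garbage each produces.
public class AllocComparison
{
    static final class Holder { long value; }

    // Allocates a new Holder every iteration; each instance is a candidate
    // for escape analysis / scalar replacement, but only if the JIT elects to.
    static long noPreAlloc(int iterations)
    {
        long sum = 0;
        for (int i = 0; i < iterations; i++)
        {
            Holder h = new Holder();
            h.value = i;
            sum += h.value;
        }
        return sum;
    }

    // Reuses one Holder for the whole loop; no per-iteration garbage.
    static long preAlloc(int iterations)
    {
        long sum = 0;
        Holder h = new Holder();
        for (int i = 0; i < iterations; i++)
        {
            h.value = i;
            sum += h.value;
        }
        return sum;
    }

    public static void main(String[] args)
    {
        int n = 10_000_000;
        long t0 = System.nanoTime();
        long a = noPreAlloc(n);
        long t1 = System.nanoTime();
        long b = preAlloc(n);
        long t2 = System.nanoTime();
        System.out.println("NoPreAlloc: " + (t1 - t0) / 1e6 + "ms, PreAlloc: " + (t2 - t1) / 1e6 + "ms");
        if (a != b) throw new AssertionError();
    }
}
```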

          jbellis Jonathan Ellis added a comment -

          Dropped the Allocation refactor and added some more: https://github.com/jbellis/cassandra/commits/3578-2 (stopped tonight before digging into CLS).

          • By the time we request sync in advanceAllocatingFrom, we've already modified allocatingFrom so the new empty segment is what will get synced. No?
          • "Now recycle segments that are unused, as we may not have triggered a discardCompletedSegments." Why not?
          benedict Benedict added a comment - - edited

          By the time we request sync in advanceAllocatingFrom, we've already modified allocatingFrom so the new empty segment is what will get synced. No?

          We only request an out-of-band regular sync, so it will just try to sync anything that hasn't yet been synced (which should only be the just finished segment). It's possible we will also sync the beginning of the new segment if it has already been written to. We could put in behaviour to prevent that if we want to avoid the potential seek cost, but since it's only incurred once per CLS allocation, I'm not sure it's worth making the code any more convoluted than it currently is.

          "Now recycle segments that are unused, as we may not have triggered a discardCompletedSegments." Why not?

          discardCompletedSegments() is only called if a memtable flushes something. We could already be marked clean (both in CL and Memtable), but the most recent segment would have still been "in use" (because "in use" is effectively !current && clean), so once we switch in a new current segment we can recycle it, but discardCompletedSegments() won't be invoked to do so.

          A couple of comments on your changes:

          List<CommitLogSegment> segmentsToRecycle = new ArrayList<>(availableSegments);
          

          should be

          List<CommitLogSegment> segmentsToRecycle = new ArrayList<>(activeSegments);
          

          ... and I'm a little concerned about bringing the flushDataFrom() calls to forceFlush() into the same thread in general, as most calls will be from the CLSManager thread, and switchMemtable() is currently (and may well remain, to some degree) a blocking call. So we could block incoming CL.add() calls unnecessarily as they wait for a new segment. We could perform the entire flushDataFrom() call from the CLSManager thread on the optionalTasks executor, though, to keep your clarification.

          benedict Benedict added a comment - - edited

          Just realised there's a deadlock possibility when you move flushDataFrom() into executing on the CLSManager thread: somebody owning the switchLock can be performing a CL.add(), which is waiting on the CLSManager, which could itself be waiting on the switch lock.

          This same scenario can occur even when we move to the changes I propose in CASSANDRA-5549.

          jbellis Jonathan Ellis added a comment -

          Updated my github branch to address above and dig into CLS.

          • We don't assume that segments we recycle are the "right" size, but I don't see anywhere that we actually extend a segment past the standard length.
          • Looks like discardUnusedTail can be called before sync() in a couple ways. This means that we won't have room to write a sync marker at the end of the allocated space which could confuse replay.
          • Is there a race in removeCleanFromDirty where we compare the dirty position w/ the synced one, while someone dirties it with a higher position?
          benedict Benedict added a comment -

          We don't assume that segments we recycle are the "right" size, but I don't see anywhere that we actually extend a segment past the standard length.

          Are you referring to the fact that I always use buffer.capacity() instead of the constant defining their size? I just felt it was clearer whilst writing the code, but you're right we could switch it, and in particular in CLS.allocate(int) it might be a good idea to do so, though it is unlikely to have a measurable impact. It might do if we start pushing closer to a million writes/sec/node, so no harm future proofing.

          Looks like discardUnusedTail can be called before sync() in a couple ways. This means that we won't have room to write a sync marker at the end of the allocated space which could confuse replay.

          This is actually the intended use case. The sync marker doesn't really occur at the end of a sync(), but at the beginning. When we sync we are always guaranteed to be directly preceded by a preallocated marker, that has been zeroed out for us (the zeroing is not necessary, but allows us to log warnings if it is missing) - when we sync we attempt to allocate a new marker for the next sync, and we point the previous marker to the next one. If there's no room, we point to the end of the file, which is completely safe. Note that this can occur without calling discardUnusedTail().

          It might clarify to call them headers, though I settled on marker because at some point my nomenclature felt like it was getting confused with header. I'm sure we could make it work though.

          Is there a race in removeCleanFromDirty where we compare the dirty position w/ the synced one, while someone dirties it with a higher position?

          No, for this very reason (and simplicity's sake) removeCleanFromDirty NOOPs until isFullySynced() holds, which only occurs once the segment is both fully used and synced to disk, i.e. is no longer being updated.
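A toy model of the marker chain being described (my own sketch, not the actual CLS code): each marker slot holds the offset of the next marker; at each sync we zero a newly reserved slot, then point the previous slot at it, so replay walks pointer-to-pointer until it hits a zeroed (uninitialised) slot.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Toy model of chained sync markers in a segment buffer (not the real CLS code).
public class MarkerChain
{
    final ByteBuffer buffer;
    int lastMarkerOffset;   // offset of the marker we zeroed at the previous sync
    int writePosition;      // end of data appended so far

    MarkerChain(int capacity)
    {
        buffer = ByteBuffer.allocate(capacity);
        lastMarkerOffset = 0;
        buffer.putInt(0, 0);   // head marker starts out zeroed
        writePosition = 4;
    }

    // Append some data after the current write position.
    void append(byte[] data)
    {
        for (byte b : data)
            buffer.put(writePosition++, b);
    }

    // Sync: reserve and zero the next marker slot, then publish it
    // by pointing the previous marker at it.
    void sync()
    {
        int nextMarker = writePosition;
        buffer.putInt(nextMarker, 0);                 // zero the new slot first
        buffer.putInt(lastMarkerOffset, nextMarker);  // then link from the old one
        lastMarkerOffset = nextMarker;
        writePosition = nextMarker + 4;
    }

    // Replay: walk marker to marker until we hit a zeroed slot -
    // the "tail marker" is just an uninitialised head marker.
    static List<Integer> replay(ByteBuffer buffer)
    {
        List<Integer> markers = new ArrayList<>();
        int offset = 0;
        while (true)
        {
            int next = buffer.getInt(offset);
            if (next == 0)
                break;
            markers.add(next);
            offset = next;
        }
        return markers;
    }
}
```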

          benedict Benedict added a comment -

          Nice catch on the RAF leak - I just copied the close(false) without thinking about it. Though I'm not sure why it wasn't leaking already in that case, as it doesn't look like we were ever closing them before either.

          I would suggest rolling back your isFullySynced() removal and next=>last rename changes though

          jbellis Jonathan Ellis added a comment -

          Are you referring to the fact that I always use buffer.capacity() instead of the constant defining their size?

          No, I'm talking about

                      // Map the segment, extending or truncating it to the standard segment size
                      logFileAccessor.setLength(DatabaseDescriptor.getCommitLogSegmentSize());
          

          There's also

                  // check against SEGMENT_SIZE avoids recycling odd-sized or empty segments from old C* versions and unit tests
          

          which I suspect is obsolete although I'm not sure what the effect on tests is.
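For what it's worth, RandomAccessFile.setLength handles both directions in one call, which is presumably why the comment says "extending or truncating". A standalone demo (temp file, hypothetical names):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Demonstrates that setLength both extends (zero-filling) and truncates a file,
// so recycled segments of the "wrong" size could in principle be resized in place.
public class SetLengthDemo
{
    static long resizeTo(File f, long target) throws IOException
    {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw"))
        {
            raf.setLength(target); // extends with zeros or truncates, as needed
            return raf.length();
        }
    }

    // Convenience: resize a fresh temp file and report its resulting length.
    static long demo(long target) throws IOException
    {
        File f = File.createTempFile("segment", ".log");
        f.deleteOnExit();
        return resizeTo(f, target);
    }

    public static void main(String[] args) throws IOException
    {
        System.out.println(demo(1 << 20)); // extend a new (empty) file to 1 MiB
        System.out.println(demo(1 << 10)); // or cap it at 1 KiB
    }
}
```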

          benedict Benedict added a comment -

          I assumed that bit was to resize segments we recycle at restart after the user redefines their commit segment size

          jbellis Jonathan Ellis added a comment -

          If there's no room, we point to the end of the file, which is completely safe

          My point is that the "no room" case can be a false positive when discardUnusedTail has consumed space that isn't actually written to.

          I would suggest rolling back your isFullySynced() removal and next=>last rename changes though

          Which part? The nomenclature is obviously correct this way. (Re-read sync, we point the "last" to the "next", before we were pointing "next" to "next" which is kind of nonsensical.)

          jbellis Jonathan Ellis added a comment -

          I assumed that bit was to resize segments we recycle at restart after the user redefines their commit segment size

          That makes sense. I'll add a comment.

          benedict Benedict added a comment -

          Huh, just looked and I see what you mean. In recycle we always delete it if it's the wrong size. Not sure that's the correct behaviour

          benedict Benedict added a comment -

          My point is that the "no room" case can be a false positive when discardUnusedTail has consumed space that isn't actually written to.

          Check out sync() (the bit inside if (nextMarker < 0) { }) - we check for this by checking the discardedTailFrom property, and write a marker if there's room at the start of the discarded region.

          Which part? The nonclemature is obviously correct this way.

          We point next to next, which is absolutely correct! In my book, at least, though maybe we have different books. The nextMarker we define in the sync() method is the next value we want to set nextSyncMarkerPosition to, and it is also the next sync marker we will write a length to at the next sync (during the current sync we only zero it). Although you could legitimately consider it the previous tail marker, I don't, because I consider the file to be effectively a list of [length:data][length:data][0:undefined]. The tail marker is a convenience of an uninitialised head marker. I would prefer to rename the local variable 'offset' to 'lastMarker'.

          The isFullySynced() is necessary to avoid the race condition you raised in your previous comment.

          benedict Benedict added a comment - - edited

          On the topic of recycling / size issues, I'd drop the following line from CLSM.recycle(File):

                      || file.length() != DatabaseDescriptor.getCommitLogSegmentSize()
          
          jbellis Jonathan Ellis added a comment - - edited

          We point next to next, which is absolutely correct

          I understand what sync is doing; my point is that xxxxSyncMarkerPosition always points to the last offset before which the contents are synced, which we'll revisit when we sync again / write the next marker (referred to appropriately as nextMarker in the code). Your own comments refer to "last" or "previous" in multiple places which speaks to how difficult it is to avoid thinking of it that way.

          (N.B. This is also clear in the usage of isFullySynced – intuitively, we want to know if "the last place we synced was the end of the buffer" which is unclear if the variable is named next.)

          jbellis Jonathan Ellis added a comment -

          we check for this by checking the discardedTailFrom property, and write a marker if there's room at the start of the discarded region.

          Ah, right. I misread that initially.

          jbellis Jonathan Ellis added a comment -

          isFullySynced() is necessary to avoid the race condition you raised in your previous comment

          Pushed https://github.com/jbellis/cassandra/commits/3578-3 with that restored.

          Last question: couldn't we dispense with WaitQueue in favor of AtomicReference<Condition>?

          benedict Benedict added a comment - - edited

          I understand what sync is doing

          I just wanted to spell it all out so I could write next three times

          Your own comments refer to "last" or "previous" in multiple places which speaks to how difficult it is to avoid thinking of it that way

          I agree it's difficult to avoid talking about last when talking about next, but this is typical since it's chained. I still disagree about the rename, but as I alluded to, I do agree that both are justifiable positions. To me it is very unintuitive in the sync() method (which is the most complex method) to refer to the sync position we haven't yet populated as the "last" sync marker position. Also, in isFullySynced(), I think both work - if the next is past the end of the file, we're clearly referring to the next segment, no? I think we wouldn't be having this argument if I'd stuck with "header" instead of "marker".

          That all said, it's definitely not worth any more argument, so I'll bow out at this point if I still haven't persuaded you!

          couldn't we dispense with WaitQueue in favor of AtomicReference<Condition>

          Well, WaitQueue is non-blocking, in a few senses:

          1. It is non-blocking in its abstraction which makes it absolutely the correct abstraction to use here. To use a Condition object would require considerably more ugliness than a WaitQueue, as we attempted to mimic its behaviour. However, you could implement a WaitQueue supporting only signalAll() using a Lock and an AtomicReference<Condition> under the hood.
          2. It is non-blocking in its implementation (well, almost; certainly in this use it is non-blocking with respect to consumers/producers blocking each other, but not necessarily consumers/consumers or producers/producers, although this is easily rectified in its current design). In this use case we could certainly handle blocking each other if we had to, but it is ugly to do so when it isn't necessary in my book.
          3. When signalling threads, all waiting threads are activated immediately, whereas with a Condition object they must wake up serially, incurring scheduler delay for each wake-up.

          Something tangential also occurred to me, maybe worth spinning out a ticket for: As we currently stand we can only support max_concurrent_write TOTAL writes per batch with BatchCLE. If we were now to split out the CL.add() into the calling thread instead of the write stage, returning a Future to wait on in the write stage, we could support max_connection writes instead, which would be much larger. Although admittedly we would have to be careful about not OOMing if there are a lot of large writes outstanding. It could do a lot for throughput in BatchCLE potentially though.
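Point 1 above - a signalAll()-only wait queue built from a Lock and an AtomicReference to the current Condition "generation" - might look something like this (my sketch, hypothetical names, not Cassandra's WaitQueue; note how the waiter still has to take the lock, which is exactly the mutual blocking the lock-free WaitQueue avoids):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a broadcast-only wait queue: each "generation" is one Condition;
// signalAll() retires the current generation and wakes everyone parked on it,
// while later waiters park on the fresh generation for the next broadcast.
public class BroadcastQueue
{
    private static final class Generation
    {
        final Condition condition;
        boolean signalled; // guarded by the enclosing lock
        Generation(Condition condition) { this.condition = condition; }
    }

    private final Lock lock = new ReentrantLock();
    private final AtomicReference<Generation> current;

    public BroadcastQueue()
    {
        current = new AtomicReference<>(new Generation(lock.newCondition()));
    }

    // Block until the next signalAll(); the signalled flag guards
    // against spurious wake-ups.
    public void await() throws InterruptedException
    {
        lock.lock();
        try
        {
            Generation generation = current.get();
            while (!generation.signalled)
                generation.condition.await();
        }
        finally
        {
            lock.unlock();
        }
    }

    public void signalAll()
    {
        lock.lock();
        try
        {
            Generation generation = current.getAndSet(new Generation(lock.newCondition()));
            generation.signalled = true;
            generation.condition.signalAll();
        }
        finally
        {
            lock.unlock();
        }
    }
}
```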

          jbellis Jonathan Ellis added a comment -

          It is non-blocking in its abstraction which makes it absolutely the correct abstraction to use here

          How is WaitQueue different from using Object.wait/notifyAll?

          If we were now to split out the CL.add() into the calling thread instead of the write stage, returning a Future to wait on in the write stage, we could support max_connection writes instead, which would be much larger

          Well, in the general case of writing to more than just a single local replica, this distinction isn't really meaningful. At some point you have to do blocking i/o to disk so whether you call that concurrent_writes or fake_aio_threads doesn't really matter.

          benedict Benedict added a comment -

          How is WaitQueue different from using Object.wait/notifyAll?

          Monitor acquisition. With Object.wait/notifyAll, the monitor must be obtained before either issuing a notifyAll or waiting. This means the notifier can block whilst a waiter is checking whether it should wait, and that even if the monitor is released immediately after waking up, all waiters must still acquire it one at a time (one wakes up, releases the monitor, exits, then the next wakes up, and so on), so for N waiting threads you incur N scheduler delays. With WaitQueue the notifier "never" blocks (never in the context of this use case, and easily extended to absolutely never), and issues wake-ups to all waiting threads pretty much simultaneously, all of which may in theory wake up immediately, incurring 1 scheduler delay (in reality N/cores delays).
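
A stripped-down illustration of that idea (not Cassandra's actual WaitQueue): waiters register interest and park, and the notifier unparks everyone in one pass without ever taking a monitor.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

// Minimal lock-free wait queue sketch: signalAll() never acquires a monitor,
// so all waiters can be woken near-simultaneously rather than serially.
class SimpleWaitQueue {
    private final Queue<Thread> waiting = new ConcurrentLinkedQueue<>();

    // Register interest before (re-)checking the condition, then park.
    void register() {
        waiting.add(Thread.currentThread());
    }

    void await() {
        LockSupport.park();
    }

    // Wake every registered waiter without blocking; an unpark issued before
    // the corresponding park simply makes that park return immediately.
    void signalAll() {
        Thread t;
        while ((t = waiting.poll()) != null)
            LockSupport.unpark(t);
    }
}
```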

          Well, in the general case of writing to more than just a single local replica

          True for RF>1, but for RF=1 smart clients are hopefully becoming the norm now that the Java Driver is knocking around, so it's not a small use case. It could potentially be extended to RF>1 as well, by having separate queues for the blocking and non-blocking stages of writing, so that we don't issue a response until the blocking portion of the write completes, but can continue to service more incoming requests for the non-blocking portion. This would mean the total cluster throughput would be limited to nodes * max_connections, instead of nodes * concurrent_writes.

          At some point you have to do blocking i/o to disk so whether you call that concurrent_writes or fake_aio_threads doesn't really matter.

          But the blocking i/o isn't the rate limiting factor here; with BatchCLE throughput is capped at the number of writes per batch divided by the batch period. We don't really want to limit writes per batch artificially if we can avoid it. Obviously we have a limit on the number of clients that can be connected, so that limit can't be avoided, but concurrent_writes can be, and probably should be much lower, so constraining by that is undesirable.
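
To put illustrative numbers on that ceiling (the figures below are assumptions for the example, not Cassandra defaults):

```java
// Illustrative arithmetic only: with BatchCLE each write-stage thread blocks
// until its batch syncs, so at most concurrent_writes writes can complete
// per batch window.
class BatchThroughput {
    static long maxWritesPerSecond(int concurrentWrites, long batchWindowMs) {
        return concurrentWrites * (1000L / batchWindowMs);
    }
}
```

For example, a 10ms batch window with concurrent_writes = 32 caps a node at 3,200 writes/s, regardless of how fast the disk itself is.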

          At some point we could eliminate the connection limit as well, by returning to the client a request id that will be followed up later by a confirmation of write/failure. I'm not necessarily suggesting we do this any time soon, but it is a possibility if we make the proposed change.

          Hide
          jbellis Jonathan Ellis added a comment -

          for RF=1 smart clients are hopefully becoming the norm with the Java Driver now knocking around, so it's not a small use case

          CL.ONE is common, but RF=1 is not.

          Hide
          jbellis Jonathan Ellis added a comment -

          I agree it's difficult to avoid talking about last when talking about next, but this is typical since it's chained.

          How about this?

              // Everything before this offset has been synced and written.  The SYNC_MARKER_SIZE bytes after
              // this offset are reserved, and point forwards to the next sync position.  The final
              // sync marker in a segment will be zeroed out, or point to EOF.
              private volatile int lastSyncedOffset;
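
As a hypothetical illustration of the chaining that comment describes (the marker layout and names here are invented for the example), locating the final marker means following each pointer until it is zeroed or points at or past EOF:

```java
import java.nio.ByteBuffer;

// Sketch of walking a chain of sync markers in a segment buffer. Assumes
// each marker stores an int giving the offset of the next marker; a zeroed
// marker, or one pointing at/past the end of the segment, terminates it.
class SyncMarkerChain {
    static int lastMarkerOffset(ByteBuffer segment, int firstMarker) {
        int offset = firstMarker;
        while (true) {
            int next = segment.getInt(offset); // absolute read of the pointer
            if (next == 0 || next >= segment.capacity())
                return offset;                 // final marker in the chain
            offset = next;
        }
    }
}
```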
          
          Hide
          benedict Benedict added a comment -

          Sounds good to me

          Hide
          jbellis Jonathan Ellis added a comment -

          https://github.com/jbellis/cassandra/commits/3578-3 good to commit?

          Hide
          benedict Benedict added a comment -

          I think we've hashed this one out pretty thoroughly

          Looks good to me, had another quick scan but I don't think I'm likely to spot anything we haven't already dealt with, and I'm happy we tied up everything we discussed.

          Hide
          jbellis Jonathan Ellis added a comment -

          If I don't commit anything from tools/stress or tools/lib, is that sufficient to keep the stress-ng stuff separate?

          Hide
          jbellis Jonathan Ellis added a comment -

          Committed!

          Hide
          benedict Benedict added a comment -

          As discussed on IRC, with Batch CL in particular you could see a lot of warning messages in the log about the sync lagging. Whilst this was working as intended, it was a bit much for BatchCL with a small window.

          I've uploaded a patch to https://github.com/belliottsmith/cassandra/tree/iss-3578-4 that should address this by aggregating lags over a 5-minute period, and only reporting when the total lag exceeds 5% of the period's wall time. That way it should only crop up when it's really appreciably affecting performance.
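
The aggregation described could be sketched roughly like this (names, thresholds and message format are invented for illustration; this is not the actual patch):

```java
// Rough sketch of the lag-aggregation idea: accumulate lagged syncs over a
// 5-minute reporting period and only emit a warning when the total lag
// exceeds 5% of the period's wall time.
class LagReporter {
    static final long PERIOD_MS = 5 * 60 * 1000;
    long periodStart;   // wall-clock start of the current reporting period
    long laggedMs;      // total lag accumulated in this period
    int laggedCount;    // number of lagged syncs in this period

    // Called once per sync; returns a warning string when the period rolls
    // over and the accumulated lag was appreciable, null otherwise.
    String record(long nowMs, long lagMs) {
        if (lagMs > 0) { laggedMs += lagMs; laggedCount++; }
        if (nowMs - periodStart < PERIOD_MS)
            return null;
        String msg = null;
        if (laggedMs * 100 > (nowMs - periodStart) * 5) // lag > 5% of wall time
            msg = String.format("commit log sync lagged %d times over the past %ds, by an average of %dms",
                                laggedCount, (nowMs - periodStart) / 1000, laggedMs / laggedCount);
        periodStart = nowMs;
        laggedMs = 0;
        laggedCount = 0;
        return msg;
    }
}
```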

          Hide
          jbellis Jonathan Ellis added a comment -

          I'm not really sure what users should take away from the % numbers. v2 simplifies to just a count and average lag duration.

          Hide
          benedict Benedict added a comment - edited

          The idea of the % was to give some sense of the scale of the problem. The trouble with offering just an average is that it's either averaged over only the lagged commits, which doesn't tell the whole story (if most commits didn't lag, it might not be worth worrying about), or averaged over all commits, which might be overly dismissive because many commits may be no-ops. Giving the ratio of time spent lagged to time spent committing seemed a useful compromise, though perhaps simply reporting total time spent lagged and total time spent committing would be more useful.

          Hide
          jbellis Jonathan Ellis added a comment - edited

          v3 logs the average sync duration (for all syncs) and the average overage (for just the laggy ones). I think that's as good as we'll get without logging histograms.

          Hide
          benedict Benedict added a comment -

          +1

          Hide
          jbellis Jonathan Ellis added a comment -

          committed


            People

            • Assignee:
              benedict Benedict
              Reporter:
              jbellis Jonathan Ellis
              Reviewer:
              Jonathan Ellis
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development