Cassandra / CASSANDRA-3411

Pre-allocated, Recycled Commitlog Segment Files

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.1.0
    • Component/s: Core
    • Labels: None

      Description

      An approach for improving commitlog performance is to pre-allocate the full 128MB segment files and reuse them once all the mutations have been flushed. Pre-allocation allows writes to be performed without modifying the file size metadata, and should (in theory) allow the filesystem to allocate a contiguous block of space for the file. Recycling the segment files prevents the overhead of pre-allocation from impacting overall performance.
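      A minimal sketch of the pre-allocation step described above (illustrative only, not code from the attached patches; the class and method names are invented):

      {code:java}
import java.io.IOException;
import java.io.RandomAccessFile;

public class SegmentPreallocator
{
    public static final int SEGMENT_SIZE = 128 * 1024 * 1024; // 128MB

    // Pre-allocate a commitlog segment by writing zeros out to its full
    // length. Writing real blocks (rather than just calling setLength(),
    // which may create a sparse file) encourages the filesystem to reserve
    // contiguous space, so later appends never touch file-size metadata.
    public static void preallocate(String path) throws IOException
    {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw"))
        {
            byte[] zeros = new byte[64 * 1024]; // write in 64KB chunks
            for (long written = 0; written < SEGMENT_SIZE; written += zeros.length)
                raf.write(zeros);
            raf.getFD().sync(); // make the allocation durable before first use
        }
    }
}
      {code}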

      Attachments

      1. 001-pre-allocated-recycled-commitlog-segment-files-CASSANDRA-3411.patch
        33 kB
        Rick Branson
      2. 003-pre-allocated-recycled-commitlog-segment-files-CASSANDRA-3411.patch
        34 kB
        Rick Branson
      3. 004-pre-allocated-recycled-commitlog-segment-files-CASSANDRA-3411.patch
        34 kB
        Rick Branson
      4. 006-pre-allocated-recycled-commitlog-segment-files-CASSANDRA-3411.patch
        40 kB
        Rick Branson
      5. 3411-cleaned.txt
        29 kB
        Jonathan Ellis
      6. 3411-v5.txt
        35 kB
        Jonathan Ellis
      7. 3411-v6-retry.txt
        49 kB
        Rick Branson
      8. 3411-v7.txt
        56 kB
        Jonathan Ellis
      9. 3411-v8.txt
        58 kB
        Jonathan Ellis

        Activity

        Hudson added a comment -

        Integrated in Cassandra #1217 (See https://builds.apache.org/job/Cassandra/1217/)
        Recycle commitlog segments for improved performance
        patch by Rick Branson and jbellis for CASSANDRA-3411

        jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1205203
        Files :

        • /cassandra/trunk/CHANGES.txt
        • /cassandra/trunk/NEWS.txt
        • /cassandra/trunk/src/java/org/apache/cassandra/db/RowMutation.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/commitlog/CommitLog.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/commitlog/CommitLogAllocator.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/commitlog/CommitLogSegment.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/AbstractCassandraDaemon.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/CommitLogTest.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/RecoveryManager2Test.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/RecoveryManager3Test.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/RecoveryManagerTest.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/RecoveryManagerTruncateTest.java
        Jonathan Ellis added a comment -

        committed!

        Rick Branson added a comment -

        +1

        Jonathan Ellis added a comment -

        v8 attached. Decided the gentlest upgrade path is to continue to accept oversize mutations but skip the commitlog (see: CL.add). Documented in NEWS.
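
        A hypothetical sketch of that behavior (the real logic lives in CL.add; the names below are invented for illustration):

        {code:java}
// Hypothetical sketch of the v8 behavior described above; not the patch code.
public class OversizeMutationPolicy
{
    static final long SEGMENT_SIZE = 128L * 1024 * 1024;

    /**
     * Returns true if the mutation was written to the commitlog. An oversize
     * mutation is still accepted (applied to memtables by the caller) but
     * skips the commitlog, since it can never fit in one segment.
     */
    public static boolean maybeAppend(long serializedSize)
    {
        if (serializedSize > SEGMENT_SIZE)
        {
            System.err.printf(
                "Mutation of %d bytes is larger than a commitlog segment; applying without commitlog%n",
                serializedSize);
            return false;
        }
        // ... append to the active segment here ...
        return true;
    }
}
        {code}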

        Rick Branson added a comment -

        Nice. There was definitely some refactoring I was a little conservative about that needed to be done, specifically moving more work into the CommitLogAllocator and that magic flag. We discussed elsewhere that this patch will enforce a commit segment size (128MB) limit on the row mutation size, which needs to be communicated to users via the NEWS or CHANGES files. I also didn't see anything that would catch a mutation that was too large and log an error; this could lead to some pretty nasty behavior if a user attempts such a thing.

        Jonathan Ellis added a comment -

        v7 attached.

        Cleaned out uses of CLS.length which were mostly incorrect now.

        Cleaned up the size tracking code.

        Made allocator responsible for recycle/discard decision.

        Added enableReserveSegmentCreation "magic flag" to avoid so much churn on restart/replay.

        Switched to a pre-append capacity check instead of a post-append one, so there's no need to mmap more than a single buffer per segment (see the sketch after this list).

        Moved all segment tracking into allocator.

        Will create tickets for switching to a txid-based replay, and for saving the extra local copy.
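
        The pre-append capacity check mentioned above, as a sketch (illustrative names, not the patch code):

        {code:java}
// Illustrative sketch of the pre-append capacity check; names are invented.
import java.nio.MappedByteBuffer;

public class SegmentBuffer
{
    private final MappedByteBuffer buffer; // the single mmap'd buffer per segment

    public SegmentBuffer(MappedByteBuffer buffer)
    {
        this.buffer = buffer;
    }

    // Checked *before* writing: if the record won't fit, the caller rotates
    // to a fresh segment instead of mapping a second buffer onto this one.
    public boolean hasCapacityFor(int recordSize)
    {
        return buffer.remaining() >= recordSize;
    }

    public void append(byte[] record)
    {
        assert hasCapacityFor(record.length) : "caller must rotate segments first";
        buffer.put(record);
    }
}
        {code}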

        Jonathan Ellis added a comment -

        Oops, hit submit prematurely. Still reviewing the rest.

        Jonathan Ellis added a comment -

        One more concern. I'm not a huge fan of END_OF_SEGMENT_MARKER:

        • Zero is a very common value. With a recycled log, it doesn't seem impossible to me that we'd encounter a zero in a partially synced record prematurely
        • More generally, writing a value and then seeking back feels like a bug waiting to happen (if we forget the seek-back step)
        • Finally and most importantly, the marker approach generalizes poorly to CASSANDRA-622 (should we ever bite that bullet)

        The PostgreSQL approach addresses these without much complexity: add an incrementing transaction id to each record, and if we encounter a lower transaction id than the previous one on replay, that's the end of the segment.
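
        A sketch of that replay rule (illustrative only, not committed code):

        {code:java}
// Illustrative sketch of the PostgreSQL-style end-of-segment detection
// described above: ids increase monotonically, so an id that goes backwards
// marks stale data left over from a previous use of a recycled segment.
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TxidReplayer
{
    public static void replaySegment(DataInputStream in) throws IOException
    {
        long previousId = -1;
        while (true)
        {
            long id;
            try { id = in.readLong(); }
            catch (EOFException e) { break; }  // physical end of the file
            if (id <= previousId)
                break;                         // logical end: stale recycled records
            previousId = id;
            byte[] record = new byte[in.readInt()];
            in.readFully(record);
            // ... deserialize and apply the mutation ...
        }
    }
}
        {code}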

        Rick Branson added a comment -

        Here's a fixed version of the v6 patch.

        Rick Branson added a comment -

        006 recycles commit log segments after recovery. This holds the segment count steady across restarts. I removed the initial segment allocation as it seemed unnecessary after this change.

        At this point, there's a deadlock condition that arises if the cap is held hard. If the cap is hit, in order to make some room in the commit log, a memtable flush must take place. The memtable flush thread submits a task for execution on the commit log thread, which is already blocked waiting for the memtable flush to finish.
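
        The shape of that deadlock, reduced to a self-contained demonstration (not Cassandra code):

        {code:java}
// Self-contained demonstration of the deadlock shape described above:
// a task on a single-threaded executor blocks on a second task submitted
// to the same executor, which can never run.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FlushDeadlockDemo
{
    public static void main(String[] args) throws Exception
    {
        ExecutorService commitLogThread = Executors.newSingleThreadExecutor();
        Future<?> writer = commitLogThread.submit(() -> {
            // The size cap is hit, so we need a "flush" to reclaim space...
            Future<?> flush = commitLogThread.submit(() -> System.out.println("flushed"));
            try
            {
                flush.get(); // ...but the flush is queued behind us: deadlock
            }
            catch (Exception e)
            {
                throw new RuntimeException(e);
            }
        });
        writer.get(); // hangs forever
    }
}
        {code}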

        Jonathan Ellis added a comment -

        Corrected v5 includes the updated CommitLogAllocator.java.

        I think the sync should probably be pushed into the Periodic shutdown – Batch shouldn't need it.

        Rick Branson added a comment -

        Another thing: to hold hard to the size cap limit, it probably also needs to take into consideration the segments pending recovery.

        Rick Branson added a comment -

        Recycling the replayed segments is a good idea. Very elegant.

        I wasn't quite sure how strict the cap limit was – 004 treats the cap as "best effort" rather than a hard limit. If it's meant to be strict, it definitely needs to be reworked: CommitLog should detect a potential overflow before it even allows the write to take place and "stop the world" to perform a flush.

        submitCreateFreshSegment was an oversight for sure, a refactor relic.

        CLS.recycle() returning a new CLS is more of a preference than anything else – I tend to reach for immutability when it doesn't hurt, and overall it seemed cleaner than resetting the state in place.

        You have a good point about the order of shutdownBlocking, and actually, is the sync() even necessary here or is it cargo cult?

        Jonathan Ellis added a comment -

        v5 attached. Minor changes, like moving log lines to debug level. Also changed the LogRecordAdder rotate check to >=.

        I think we still need to solve the segment sizing problem though. If you don't want to actually record the high water mark, how about recycling the segments we just replayed?

        Basically I don't want to inflict bad performance on people who have > 500MB of unflushed data while CL "extends" itself. Once is acceptable, but every server restart is not.

        I think the logic around the size cap needs some work too. We should be strict about keeping under the cap – the idea is we could be given a partition that size, so overflowing is Bad.

        If we get stricter there then I think we can simplify the "high water mark" tracking: you never discard segments (since you never violate the size constraints) so the high water mark must be exactly the current number of segments.

        submitCreateFreshSegment is unused. Oversight?

        Why does recycle bother creating a new Segment object? (Contrariwise, I think I'd prefer creating a new SequentialWriter object to using seek here.)

        Is the order in CL.shutdownBlocking correct? Seems to me it would be less dangerous to leave the allocator active until after we're done processing mutations.

        Jonathan Ellis added a comment -

        Sounds reasonable.

        Rick Branson added a comment -

        So as soon as the number of spare segments drops to 0, we'll pre-allocate a new one. It ties in with the last point you made about using a separate thread, which is a no brainer. This sounds like a good strategy, and should smooth out some of the latency spikes.

        I like the idea of throwing away the extra config variable, but I'm not sure I'd like to save the high watermark to the system table. An anomaly could lead to it getting permanently set too high and riding up against the total commitlog size limit. We'd have to provide some user-accessible way to adjust the high watermark at that point, or add some "intelligence" that could bring it back down.

        I propose an initial 4 segments (or up to the total commitlog size limit, whichever is lower) at startup, allocating more as needed, but no longer discarding them. Combined with the eager background pre-allocation, the vast majority of use cases should quickly and smoothly reach a steady state.

        I'll also drag my way back to svn-land.

        Jonathan Ellis added a comment -

        Attaching -cleaned, which is mostly undoing some renaming that obscured the real changes here.

        I think there's some work to do around the commitlog preallocation:

        • Seems like we should try to keep an extra segment ready, so that after we exhaust our preallocated quota we don't block for each new segment
        • Segments to keep around should be our high water mark of active segments
        • If we record that high water mark in the system table, that feels like a better way of self-tuning to me than exposing the preallocation count as a tuneable
        • Segment allocation should be done off of the CLExecutor thread. If we make "segments" just "inactive segments" instead of "all segments", then we can make it into a blocking queue for that purpose (see the sketch below)
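
        A sketch of that last idea, reduced to its essentials (names are invented):

        {code:java}
// Illustrative sketch of the blocking-queue idea in the last bullet: a
// dedicated allocator thread keeps a spare segment ready so that writers
// never pay allocation cost on the hot path.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class InactiveSegmentPool
{
    private final BlockingQueue<String> inactiveSegments = new LinkedBlockingQueue<>();

    public InactiveSegmentPool()
    {
        Thread allocator = new Thread(() -> {
            try
            {
                while (!Thread.currentThread().isInterrupted())
                {
                    if (inactiveSegments.isEmpty())
                        inactiveSegments.put(preallocateSegmentFile());
                    Thread.sleep(10); // crude polling; a real version blocks on demand
                }
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        }, "COMMIT-LOG-ALLOCATOR");
        allocator.setDaemon(true);
        allocator.start();
    }

    // Writer side: take a ready segment, blocking only if allocation lags.
    public String nextSegment() throws InterruptedException
    {
        return inactiveSegments.take();
    }

    private String preallocateSegmentFile()
    {
        // ... create and zero a fresh 128MB file, return its path ...
        return "CommitLog-" + System.nanoTime() + ".log";
    }
}
        {code}
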
        Rick Branson added a comment -

        Should be in there, see DatabaseDescriptor.validateCommitLogSegmentCount().
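
        For reference, the requested check boils down to something like this (illustrative sketch only; the real check is DatabaseDescriptor.validateCommitLogSegmentCount()):

        {code:java}
// Illustrative sketch; the real check lives in
// DatabaseDescriptor.validateCommitLogSegmentCount().
static void validateCommitLogSegmentCount(long segmentSizeBytes,
                                          int preallocatedSegments,
                                          long totalCommitLogSizeBytes)
{
    if (segmentSizeBytes * preallocatedSegments > totalCommitLogSizeBytes)
        throw new IllegalArgumentException(
            "segment size * commitlog_preallocate_segments exceeds the total commitlog size");
}
        {code}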

        Jonathan Ellis added a comment -

        can you add a check that segment size * preallocated segments <= total commitlog size parameter?

        Rick Branson added a comment - edited

        This patch pre-allocates the 128MB segment files, writes using memory-mapped I/O, and recycles the segment files once they're safe to be reused. End-to-end performance gains were found across the board, increasing with the size of the row mutation.

        EDIT: I should also mention that the number of segments initially allocated is controlled by commitlog_preallocate_segments in cassandra.yaml and defaults to 4. If more are needed, they will be created on demand and deleted when no longer necessary.
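
        A minimal sketch of the memory-mapped write path described here (illustrative names, not the patch code):

        {code:java}
// Minimal sketch of memory-mapped segment writes: map the pre-allocated
// file once, append via put(), and force() dirty pages on sync. Names are
// invented for illustration.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedSegmentWriter implements AutoCloseable
{
    static final int SEGMENT_SIZE = 128 * 1024 * 1024;

    private final RandomAccessFile file;
    private final MappedByteBuffer buffer;

    public MappedSegmentWriter(String path) throws Exception
    {
        file = new RandomAccessFile(path, "rw");
        file.setLength(SEGMENT_SIZE); // sized up front (sparse here; the patch writes real blocks)
        buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, SEGMENT_SIZE);
    }

    public void append(byte[] record)
    {
        buffer.put(record); // no write() syscall, no file-size metadata change
    }

    public void sync()
    {
        buffer.force(); // flush dirty pages to disk
    }

    @Override
    public void close() throws Exception
    {
        sync();
        file.close();
    }
}
        {code}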


          People

          • Assignee: Rick Branson
          • Reporter: Rick Branson
          • Reviewer: Jonathan Ellis