  Cassandra / CASSANDRA-8729

Commitlog causes read before write when overwriting

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.2.0 beta 1
    • Component/s: None
    • Labels:

      Description

      The memory-mapped commit log implementation writes directly to the page cache. If a page is not in the cache, the kernel will read it in even though we are going to overwrite it.

      The way to avoid this is to write to private memory, and then pad the write with 0s at the end so it is page (4k) aligned before writing to a file.

      The commit log would benefit from being refactored into something that looks more like a pipeline with incoming requests receiving private memory to write in, completed buffers being submitted to a parallelized compression/checksum step, followed by submission to another thread for writing to a file that preserves the order.
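
      For illustration, a minimal sketch of the private-memory-plus-4k-padding idea described above. It is only a sketch: PaddedSegmentWriter and its methods are hypothetical names, not code from any patch, and capacity/overflow handling is omitted.

      import java.io.IOException;
      import java.nio.ByteBuffer;
      import java.nio.channels.FileChannel;

      // Sketch only: stage mutations in private memory, zero-pad to a 4k
      // boundary, then append the whole buffer with a channel write so the
      // kernel never has to fault a page in before it is overwritten.
      final class PaddedSegmentWriter
      {
          private static final int PAGE = 4096;

          private final FileChannel channel;
          private final ByteBuffer buffer;   // private memory, not a mapping

          PaddedSegmentWriter(FileChannel channel, int capacity)
          {
              this.channel = channel;
              this.buffer = ByteBuffer.allocateDirect(capacity);
          }

          void append(ByteBuffer mutation)
          {
              buffer.put(mutation);          // assumes capacity is checked elsewhere
          }

          void flush() throws IOException
          {
              // pad with zeros so the write ends on a page boundary
              int padding = (PAGE - (buffer.position() % PAGE)) % PAGE;
              for (int i = 0; i < padding; i++)
                  buffer.put((byte) 0);
              buffer.flip();
              while (buffer.hasRemaining())
                  channel.write(buffer);
              buffer.clear();
          }
      }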

        Issue Links

          Activity

          blambov Branimir Lambov added a comment -

          This particular problem is caused by segment recycling. Mixing segment reuse with memory-mapped writing appears to have been a very bad idea; removing reuse solves the problem immediately. If we insist on reuse, we need to get rid of memory-mapping and necessarily use 4k padding (padding will have a benefit without reuse as well, but not that pronounced).

          I'm not that sold on the benefits of recycling, though. If you delete a segment file and immediately create a new one with the same size, isn't the OS supposed to reuse the space anyway? I remember that's what they did ~15yrs ago, but things have probably changed.

          On the other hand I am seeing quite different performance writing to memmapped vs. writing to channel (using a very thin non-compressing version of the compression path of CASSANDRA-6809 with direct buffers). Memmapped appears to allow a ~20% higher throughput on Windows.

          I think we should get rid of the recycling, and later do the rest of the improvements you list.

          aweisberg Ariel Weisberg added a comment -

          There are other reasons to use private memory that maybe aren't so important. For in-memory write workloads you get outliers if you have threads write to memory-mapped files. These tended to show up in the very long tail (P99.99, P99.999). With a dedicated thread draining to the filesystem you can control how much data is buffered when the filesystem is out to lunch.
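
           A rough sketch of what such a dedicated draining thread could look like (DrainingWriter and the queue depth of 16 are arbitrary assumptions, not from any patch); the bounded queue is what caps how much data can pile up while the filesystem is out to lunch:

           import java.nio.ByteBuffer;
           import java.nio.channels.FileChannel;
           import java.util.concurrent.ArrayBlockingQueue;
           import java.util.concurrent.BlockingQueue;

           // Sketch only: writers hand completed buffers to a bounded queue and a
           // single thread drains it to the file, so producers block instead of
           // letting the backlog grow without bound during a filesystem stall.
           final class DrainingWriter implements Runnable
           {
               private final BlockingQueue<ByteBuffer> queue = new ArrayBlockingQueue<>(16);
               private final FileChannel channel;

               DrainingWriter(FileChannel channel)
               {
                   this.channel = channel;
               }

               // called by mutation threads; blocks once 16 buffers are outstanding
               void submit(ByteBuffer completed) throws InterruptedException
               {
                   queue.put(completed);
               }

               public void run()
               {
                   try
                   {
                       while (!Thread.currentThread().isInterrupted())
                       {
                           ByteBuffer buf = queue.take();
                           while (buf.hasRemaining())
                               channel.write(buf);
                       }
                   }
                   catch (Exception e)
                   {
                       Thread.currentThread().interrupt();
                   }
               }
           }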

          If you write a quick benchmark that just spits out zeroes to a file via write vs. a memory-mapped file, do you see a difference in throughput or CPU utilization? I am skeptical that mmap is actually much faster (it may even be slower).

          JoshuaMcKenzie Joshua McKenzie added a comment -

          While testing for CASSANDRA-6890 (debating removing the mmap path), it was pretty clear from my testing that mmap'ed I/O on Windows has a considerable advantage over buffered I/O, to a degree that Linux does not. I'm of the opinion we should make efforts to memory-map our I/O on Windows wherever possible, with the known caveat that it makes deleting and renaming files more complicated (all segments have to be unmapped before either of those operations).

          aweisberg Ariel Weisberg added a comment -

          Jira ate my response to this twice so far so I will be super brief.

          The linked ticket is a different use case (cacheable random reads?) and not bulk append.

          I put together a quick benchmark. Code http://pastebin.com/TFstk2uA
          Tested on Windows 8.1, Samsung 840 EVO 250 GB SSD

          Testing with sync at end
          Channel took 5575
          Preallocated Channel took 7445
          Mapped took 8517
          Preallocated Mapped took 7859
          Testing with periodic syncing
          Channel took 6795
          Preallocated Channel took 8728
          Mapped took 9991
          Preallocated Mapped took 10123
          

          There is no scenario where memory-mapped I/O is faster at bulk appending.
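
           For reference, a minimal sketch of this kind of channel-vs-mapped append comparison. It is not the pastebin code; the class name, sizes, and millisecond timing are assumptions.

           import java.io.File;
           import java.io.RandomAccessFile;
           import java.nio.ByteBuffer;
           import java.nio.MappedByteBuffer;
           import java.nio.channels.FileChannel;

           // Sketch only: append the same amount of zeroed data once through
           // FileChannel.write and once through a MappedByteBuffer, syncing at
           // the end, and compare wall-clock time.
           public class AppendBench
           {
               static final int CHUNK = 1 << 20;   // 1 MB per write
               static final int CHUNKS = 1024;     // 1 GB total

               public static void main(String[] args) throws Exception
               {
                   ByteBuffer zeros = ByteBuffer.allocateDirect(CHUNK);

                   File f1 = File.createTempFile("channel", ".bin");
                   f1.deleteOnExit();
                   try (FileChannel ch = new RandomAccessFile(f1, "rw").getChannel())
                   {
                       long start = System.currentTimeMillis();
                       for (int i = 0; i < CHUNKS; i++)
                       {
                           zeros.clear();
                           while (zeros.hasRemaining())
                               ch.write(zeros);
                       }
                       ch.force(true);
                       System.out.println("Channel took " + (System.currentTimeMillis() - start));
                   }

                   File f2 = File.createTempFile("mapped", ".bin");
                   f2.deleteOnExit();
                   try (FileChannel ch = new RandomAccessFile(f2, "rw").getChannel())
                   {
                       long start = System.currentTimeMillis();
                       MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) CHUNK * CHUNKS);
                       for (int i = 0; i < CHUNKS; i++)
                       {
                           zeros.clear();
                           map.put(zeros);
                       }
                       map.force();
                       System.out.println("Mapped took " + (System.currentTimeMillis() - start));
                   }
               }
           }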

          JoshuaMcKenzie Joshua McKenzie added a comment -

          You're totally right - reads vs. writes are different beasts with regard to this, and my work was regarding reads.

          Also: I highly approve of your benchmark class name.

          tjake T Jake Luciani added a comment -

          I've been testing on low-memory, high-CPU machines and have found this to be a potentially major issue.

          In a write-heavy workload, when the mmapped data > RAM, I think there is a massive number of page faults when writing to the CL, which causes writes to stall and spin out of control. Is there any workaround here? When I disable the CL on the same system the issues go away, which makes me think it's this issue.

          tjake T Jake Luciani added a comment -

          I found a workaround by setting the segment size > the total commit log space size. This drops each segment once it is full, and it resolved the issue.
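
           A hedged example of that workaround in cassandra.yaml (the values are illustrative only; the point is that a segment is at least as large as the total commit log space, so a full segment is dropped rather than recycled):

           # illustrative values, not recommended defaults
           commitlog_segment_size_in_mb: 1024
           commitlog_total_space_in_mb: 1024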

          aweisberg Ariel Weisberg added a comment -

          This was fixed as part of CASSANDRA-6809 by not recycling commit log segments so there is no read before write issue.

          jjordan Jeremiah Jordan added a comment -

          Do we have a fix for 2.1 users?

          aweisberg Ariel Weisberg added a comment -

          There is a workaround, I think. T Jake Luciani said he changed the size of the commit log to 0 and that caused it to not retain any segments.

          jjordan Jeremiah Jordan added a comment -

          His fix is actually to make the segment size = commit log size, aka 2 GB, which means you get one giant segment. That seems a little extreme, and it makes commit log archiving much harder. If this really causes such a big performance degradation, can we just turn off segment recycling in 2.1? Seems like that isn't too invasive of a change, since we don't always recycle anyway?


            People

            • Assignee: Unassigned
            • Reporter: aweisberg Ariel Weisberg
            • Votes: 0
            • Watchers: 8
