Cassandra / CASSANDRA-4784

Create separate sstables for each token range handled by a node

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None

      Description

      Currently, each sstable holds data for all of the token ranges the node handles. If we instead keep separate sstables for each range the node handles, several improvements become possible.
      Improvements:
      1) Node rebuild becomes very fast, since sstables can be copied directly to the bootstrapping node. It minimizes application-level logic: we can use native Linux mechanisms to transfer sstables without burning CPU, putting less pressure on the serving node. In theory this is the fastest way to transfer the data.
      2) Backups only need to transfer the sstables that belong to a node's primary key range.
      3) An ETL process can copy just one replica of the data and will be much faster.
      Changes:
      We can split writes into multiple memtables, one per range the node handles. The sstables flushed from them can then record which range of data they cover (see the sketch after this description).
      Reads should need no changes, since they already work with interleaved data, though perhaps there is room for improvement there as well.

      Complexities:
      The change does not look very complicated. I am not taking into account how it will work when a node's ranges change.
      Vnodes might make this work more complicated. We could also add a bit on each sstable indicating whether it holds primary data.
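      The "Changes" idea above amounts to routing each write to a memtable chosen by the token range that owns the partition. A minimal sketch of that routing follows; the class and method names are illustrative stand-ins, not Cassandra's actual memtable or flush code.

        import java.util.List;
        import java.util.Map;
        import java.util.TreeMap;

        // Illustrative only: one memtable per token range owned by the node, so each
        // flushed sstable covers exactly one range and can record that range.
        class PerRangeMemtables {
            // Map from the (inclusive) right end of each owned range to its memtable.
            private final TreeMap<Long, TreeMap<String, String>> byRangeEnd = new TreeMap<>();

            PerRangeMemtables(List<Long> ownedRangeEnds) {
                for (long end : ownedRangeEnds)
                    byRangeEnd.put(end, new TreeMap<String, String>());
            }

            void write(long token, String key, String value) {
                // The smallest range end >= token owns the token; wrap to the first range otherwise.
                Map.Entry<Long, TreeMap<String, String>> owner = byRangeEnd.ceilingEntry(token);
                if (owner == null)
                    owner = byRangeEnd.firstEntry();
                owner.getValue().put(key, value);
            }

            // On flush, each per-range memtable would become one sstable tagged with its range.
            Map<Long, TreeMap<String, String>> flushViews() {
                return byRangeEnd;
            }
        }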

      Attachments

      1. 4784.patch (31 kB) - Benjamin Coverston

          Issue Links

          This issue is related to CASSANDRA-6696

          Activity

          sankalp kohli created issue -
          sankalp kohli added a comment -

          We also need to change compaction so that we don't merge sstables from two different token ranges.
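          A rough sketch of that constraint, not the actual compaction code: bucket the candidate sstables by the token range recorded on them, then pick compaction candidates only within a bucket. The SSTableInfo descriptor and rangeId field are hypothetical.

            import java.util.ArrayList;
            import java.util.HashMap;
            import java.util.List;
            import java.util.Map;

            // Illustrative only: never hand the compactor a mix of sstables from different ranges.
            class RangeAwareCompaction {
                static class SSTableInfo {              // hypothetical descriptor
                    final int rangeId;                  // which owned token range this sstable covers
                    final long sizeBytes;
                    SSTableInfo(int rangeId, long sizeBytes) { this.rangeId = rangeId; this.sizeBytes = sizeBytes; }
                }

                static Map<Integer, List<SSTableInfo>> groupByRange(List<SSTableInfo> sstables) {
                    Map<Integer, List<SSTableInfo>> groups = new HashMap<>();
                    for (SSTableInfo s : sstables)
                        groups.computeIfAbsent(s.rangeId, id -> new ArrayList<>()).add(s);
                    return groups;
                }

                // Candidate selection (e.g. size-tiered bucketing) runs per group, never across groups.
                static List<List<SSTableInfo>> candidates(List<SSTableInfo> sstables, int minThreshold) {
                    List<List<SSTableInfo>> tasks = new ArrayList<>();
                    for (List<SSTableInfo> group : groupByRange(sstables).values())
                        if (group.size() >= minThreshold)
                            tasks.add(group);
                    return tasks;
                }
            }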

          sankalp kohli added a comment -

          As I have said in the description, this will also enable backing up only one copy of the data, which will greatly reduce the space requirements for backups.

          Jonathan Ellis added a comment -

          It may be worth trying. The backup/restore duplication of data is a pain point right now (CASSANDRA-4756). Not sure if we can actually synchronize sstable/index data enough that we can avoid rebuilding that on a stream; if not the difference is negligible in that respect (CASSANDRA-4297).

          sankalp kohli added a comment -

          The difference will be marginal; it can only be the data in the memtable that has not been flushed yet. We can copy all the sstables present on the replica and keep a View of the sstables copied. If there is any addition or deletion of sstables in the meanwhile, we can do another sync.
          So the diff will only be the content in the memtable, and we can run a repair like we do today after a bootstrap.
          The main advantage is the speed of recovery for a node, especially one with lots of data. Currently it is bounded by the application layer. Also, the node serving the data will not have to do any work at the application level.
          Another small benefit is that we will not create objects in the JVM while transferring data.

          sankalp kohli added a comment -

          So here is how a bootstrap will work on the node serving the data.

          do {
              1. Get a View of the sstables and start copying all of them to the bootstrapping node.
              2. Compare the current view against the View from step 1 and calculate the number of sstables to copy or remove (flushes add sstables; compactions remove and add sstables).
          } while (the number of sstables to transfer or remove is greater than N, for some N we can decide);

          Then run a repair like we normally do after a bootstrap.
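          A minimal sketch of that loop under assumed helpers (liveSSTables and copy are hypothetical; a real implementation would stream the files with something like sendfile):

            import java.util.HashSet;
            import java.util.Set;

            // Illustrative only: keep copying whole sstable files and re-diffing the live view
            // until the remaining delta is small, then fall back to a normal repair for the
            // un-flushed memtable data.
            class SSTableCopyBootstrap {
                interface Source {
                    Set<String> liveSSTables();               // current view of sstable file names
                    void copy(String sstable, String peer);   // raw file transfer to the new node
                }

                static void bootstrap(Source source, String peer, int threshold) {
                    Set<String> copied = new HashSet<>();
                    int delta;
                    do {
                        Set<String> view = new HashSet<>(source.liveSSTables());
                        // Copy anything that appeared since the last pass (memtable flushes, compaction outputs).
                        Set<String> toCopy = new HashSet<>(view);
                        toCopy.removeAll(copied);
                        for (String s : toCopy) {
                            source.copy(s, peer);
                            copied.add(s);
                        }
                        // Anything we copied that compaction has since removed would be deleted on the peer.
                        Set<String> toRemove = new HashSet<>(copied);
                        toRemove.removeAll(view);
                        copied.removeAll(toRemove);
                        delta = toCopy.size() + toRemove.size();
                    } while (delta > threshold);
                    // Only un-flushed memtable data remains: run a repair as described above.
                }
            }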

          Brandon Williams added a comment -

          It's worth noting that vnodes in 1.2 will already solve the bootstrap performance problem.

          Run a repair like we normally do after a bootstrap.

          We don't do that; we begin forwarding writes to the new node as a first step, to obviate the need for repair.

          sankalp kohli added a comment - edited

          Vnodes will improve the performance, but we still need to go through the application layer to filter out the data from each sstable that needs to be transferred. This affects CPU and the page cache and creates short-lived Java objects. I have another JIRA that describes how a new connection is created for each sstable transferred.

          My point is that this change will make the bootstrap of a node the fastest possible in theory. This is the reason many people restore data from a backup and then run a repair, instead of bootstrapping a node and streaming the data.

          Jonathan Ellis added a comment -

          But if vnodes (which are already complete and ready to use in 1.2) make bootstrap fast enough to saturate 10 gigE, I don't see much utility in adding further complexity for 1.3.

          I do still see this as useful for backup/restore in the general case though.

          Sylvain Lebresne added a comment -

          I do still see this as useful for backup/restore in the general case though

          While I kind of agree, I want to note that backing up data from only one replica runs the risk of missing whatever data that node is not yet consistent on, unless you've run a repair just before. I suppose some may prefer just running a repair prior to the backup or, especially if they are running regular repairs, consider that good enough; I just wanted to note that, at least in theory, this is not completely bulletproof.

          sankalp kohli added a comment -

          Yes, it is required. But running a repair is easier and cheaper than storing N copies of the same data.

          Jonathan Ellis added a comment -

          Ben Coverston points out that since we already have an interval tree for LCS, we wouldn't need any extra region-tracking code layer for the read path.
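          For context, a minimal sketch of the lookup an interval tree answers on the read path: given a token, find the sstables whose [min, max] token interval contains it. The structures here are illustrative, not Cassandra's actual IntervalTree API, and a linear scan stands in for the O(log n + k) tree query.

            import java.util.ArrayList;
            import java.util.List;

            class SSTableIntervalLookup {
                static class Interval {
                    final long min, max;       // token bounds recorded in the sstable metadata
                    final String sstable;
                    Interval(long min, long max, String sstable) { this.min = min; this.max = max; this.sstable = sstable; }
                }

                // Which sstables could contain this token? With per-range sstables most are skipped.
                static List<String> sstablesFor(long token, List<Interval> intervals) {
                    List<String> hits = new ArrayList<>();
                    for (Interval i : intervals)
                        if (i.min <= token && token <= i.max)
                            hits.add(i.sstable);
                    return hits;
                }
            }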

          Benjamin Coverston made changes -
          Assignee Benjamin Coverston [ bcoverston ]
          Benjamin Coverston added a comment -

          I have a working implementation of this for STCS. One issue is that it has the unfortunate (or fortunate) side effect of also partitioning the SSTables for LCS, since I put the implementation inside the CompactionTask, which makes the already small SSTables much smaller.

          I feel like this puts us at a crossroads: should we create a completely partitioned data strategy for vnodes (a directory per vnode), or should we continue to mix the data files in a single data directory?

          L0 to L1 compactions become particularly hairy if we do that, unless we first partition the L0 SSTables and then compact the partitioned L0 with the L1 for the vnode.

          Benjamin Coverston added a comment -

          Adding patch to partition SSTables into discrete ranges per vnode.

          Benjamin Coverston made changes -
          Status Open [ 1 ] → Patch Available [ 10002 ]
          Affects Version/s 1.2.0 beta 1 [ 12319262 ]
          Fix Version/s 1.3 [ 12322954 ]
          Benjamin Coverston made changes -
          Attachment 4784.patch [ 12552212 ]
          Jonathan Ellis added a comment -

          Sankalp, can you make a first pass at review?

          Jonathan Ellis added a comment -

          Ben, it might be worth publishing a github tree so we don't need to keep rebasing during review.

          Benjamin Coverston added a comment -

          Like this? https://github.com/bcoverston/apache-hosted-cassandra/branches/4784
          Gavin made changes -
          Workflow no-reopen-closed, patch-avail [ 12729003 ] → patch-available, re-open possible [ 12753716 ]
          Gavin made changes -
          Workflow patch-available, re-open possible [ 12753716 ] → reopen-resolved, no closed status, patch-avail, testing [ 12758895 ]
          Jouni Hartikainen added a comment -

          I'm not really sure if I understood this correctly, but wouldn't this change lead to memtable flushes creating much more random I/O than before? Especially when using vnodes, wouldn't the incoming data be spread across num_tokens files per CF instead of one per CF? Wouldn't this affect compactions as well? E.g. with the default size-tiered strategy, instead of compacting 4 larger SSTables into one even larger one per CF, we would be compacting num_tokens * 4 smaller files into num_tokens larger ones per CF.

          Am I missing something here?
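          To make the fan-out concrete, a back-of-the-envelope calculation with assumed numbers (num_tokens = 256, a 64 MB memtable, data spread evenly across ranges; these are assumptions, not measurements):

            // Illustrative arithmetic only, with assumed values.
            public class FlushFanout {
                public static void main(String[] args) {
                    int numTokens = 256;                 // assumed num_tokens
                    long memtableBytes = 64L << 20;      // assumed 64 MB flushed per CF
                    long perRangeBytes = memtableBytes / numTokens;
                    System.out.printf("One flush: %d files of ~%d KB instead of one %d MB file%n",
                                      numTokens, perRangeBytes >> 10, memtableBytes >> 20);
                    // With the default STCS min_threshold of 4, a compaction round would touch
                    // numTokens * 4 small inputs rather than 4 large ones per CF.
                    System.out.printf("STCS round: %d inputs instead of 4%n", numTokens * 4);
                }
            }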

          Jonathan Ellis added a comment -

          The data itself could still be sequential on disk even if the blocks belong to different files... Not sure how much overhead the file creation itself would have. Certainly worth testing.

          Jonathan Ellis added a comment -

          Someone (Axel at Spotify?) pointed out to me that another use case here would be mapping different vnodes to separate disks so that on disk failure, we can invalidate the affected vnodes and repair them, rather than continuing to serve incomplete data or halting the entire node.

          Jonathan Ellis made changes -
          Status Patch Available [ 10002 ] → Open [ 1 ]
          Jonathan Ellis added a comment -

          another use case here would be mapping different vnodes to separate disks so that on disk failure, we can invalidate the affected vnodes and repair them

          (This would require a lot more work to support "migrating" vnodes when new disks are added.)

          Jonathan Ellis made changes -
          Assignee Benjamin Coverston [ bcoverston ]
          Fix Version/s 2.1 [ 12324159 ]
          Fix Version/s 2.0 [ 12322954 ]
          Jonathan Ellis made changes -
          Labels perfomance
          sankalp kohli made changes -
          Link This issue is related to CASSANDRA-6696 [ CASSANDRA-6696 ]
          Sylvain Lebresne made changes -
          Fix Version/s 2.1 beta2 [ 12326276 ]
          Fix Version/s 2.1 [ 12324159 ]
          T Jake Luciani added a comment -

          Another issue would be splitting the sstables when new vnode ranges are added to the cluster. Replicas can change on any node, causing the old ranges to split. The same happens if anyone decommissions a node.

          It may be simpler to start by taking a contiguous range of vnodes and putting them in a directory partition. You could have a fixed number of these (one per disk?). It would limit the damage of a dead drive in JBOD mode to a section of vnode ranges.
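          A minimal sketch of that layout under assumptions: a fixed number of directory partitions (say one per data disk), each owning a contiguous slice of the node's sorted vnode list. The class and path layout are illustrative only.

            import java.nio.file.Path;
            import java.util.Collections;
            import java.util.List;

            // Illustrative only: locate the vnode that owns a token, then map contiguous
            // slices of the sorted vnode list onto a fixed number of directory partitions,
            // so a dead drive in JBOD mode only affects one contiguous section of ranges.
            class VnodeDirectoryPartitioner {
                private final List<Long> sortedVnodeEnds;   // right ends of this node's vnode ranges, sorted
                private final int partitions;               // fixed count, e.g. number of data disks
                private final Path dataRoot;

                VnodeDirectoryPartitioner(List<Long> sortedVnodeEnds, int partitions, Path dataRoot) {
                    this.sortedVnodeEnds = sortedVnodeEnds;
                    this.partitions = partitions;
                    this.dataRoot = dataRoot;
                }

                Path directoryFor(long token) {
                    int idx = Collections.binarySearch(sortedVnodeEnds, token);
                    if (idx < 0) idx = -idx - 1;                 // first vnode end >= token
                    if (idx == sortedVnodeEnds.size()) idx = 0;  // wrap around the ring
                    int partition = idx * partitions / sortedVnodeEnds.size();   // contiguous slices
                    return dataRoot.resolve("partition-" + partition);
                }
            }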

          Sylvain Lebresne made changes -
          Fix Version/s 2.1 rc1 [ 12326658 ]
          Fix Version/s 2.1 beta2 [ 12326276 ]
          Jonathan Ellis added a comment -

          superseded by CASSANDRA-6696

          Jonathan Ellis made changes -
          Status Open [ 1 ] → Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Jonathan Ellis made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] → Reopened [ 4 ]
          Jonathan Ellis made changes -
          Status Reopened [ 4 ] → Resolved [ 5 ]
          Fix Version/s 2.1 rc1 [ 12326658 ]
          Resolution Duplicate [ 3 ]
          Transition                  Time In Source Status   Execution Times   Last Executer        Last Execution Date
          Open → Patch Available      27d 5h 21m              1                 Benjamin Coverston   06/Nov/12 04:30
          Patch Available → Open      149d 12h 13m            1                 Jonathan Ellis       04/Apr/13 17:44
          Open → Resolved             402d 22h 57m            1                 Jonathan Ellis       12/May/14 16:41
          Resolved → Reopened         43d 6h 44m              1                 Jonathan Ellis       24/Jun/14 23:25
          Reopened → Resolved         11s                     1                 Jonathan Ellis       24/Jun/14 23:26

            People

            • Assignee: Unassigned
            • Reporter: sankalp kohli
            • Votes: 0
            • Watchers: 10
