Cassandra
  1. Cassandra
  2. CASSANDRA-6737

A batch statements on a single partition should not create a new CF object for each update

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 2.0.6
    • Component/s: None
    • Labels:

      Description

      BatchStatement creates a new ColumnFamily object (as well as a new RowMutation object) for every update in the batch, even if all those update are actually on the same partition. This is particularly inefficient when bulkloading data into a single partition (which is not all that uncommon).

      1. 6737.txt
        10 kB
        Sylvain Lebresne
      2. 6737.2.patch
        10 kB
        Benedict

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        12m 37s 1 Sylvain Lebresne 19/Feb/14 18:36
        Patch Available Patch Available Resolved Resolved
        1d 15h 18m 1 Sylvain Lebresne 21/Feb/14 09:54
        Sylvain Lebresne made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Sylvain Lebresne added a comment -

        Committed, thanks.

        I did realized while rebasing post-CASSANDRA-6561 that there was no point in rebuilding the clustering prefix multiple time when a statement was on multiple keys so inlined the new addUpdatesForKey inside BatchStatement to avoid (which ends up being almost cleaner imo). There was also a bit of now unused code in ModificationStatement which I removed. Anyway, that was just moving code around so I just went ahead with the commit.

        Show
        Sylvain Lebresne added a comment - Committed, thanks. I did realized while rebasing post- CASSANDRA-6561 that there was no point in rebuilding the clustering prefix multiple time when a statement was on multiple keys so inlined the new addUpdatesForKey inside BatchStatement to avoid (which ends up being almost cleaner imo). There was also a bit of now unused code in ModificationStatement which I removed. Anyway, that was just moving code around so I just went ahead with the commit.
        Jonathan Ellis made changes -
        Labels performance
        Benedict made changes -
        Attachment 6737.2.patch [ 12630047 ]
        Hide
        Benedict added a comment -

        Mostly LGTM. One actual problem, and a couple of suggestions:

        • Looks like in MS.addUpdatesForKey() "rows" should be being passed as the last parameter to the constructor of "params".
        • It would be nice if addUpdateForKey and addUpdatesForKey were next to each other in MS
        • Maybe fast path in unzipMutations should use values().iterator instead of lookup based on keyspace hash. Pretty ambivalent about that though.
        • Nit: unused import in BS

        Attached new patch with my suggestions.

        Show
        Benedict added a comment - Mostly LGTM. One actual problem, and a couple of suggestions: Looks like in MS.addUpdatesForKey() "rows" should be being passed as the last parameter to the constructor of "params". It would be nice if addUpdateForKey and addUpdatesForKey were next to each other in MS Maybe fast path in unzipMutations should use values().iterator instead of lookup based on keyspace hash. Pretty ambivalent about that though. Nit: unused import in BS Attached new patch with my suggestions.
        Hide
        Sylvain Lebresne added a comment -

        FTR I'm at the point where other things being equal I'd prefer to put optimizations in 2.1.

        FTR, I agree in general. But in this case, we have a rather simple to fix bottleneck that makes a non-really crazy use case be 20 times slower than what you would go with thrift. At this point, it's not really an optimization imo, it's a bug that needs to be fixed. But I do not mean this ticket to be "let's do every optimizations for BatchStatement on single partition that comes into mind", that would definitively belong to 2.1.

        Show
        Sylvain Lebresne added a comment - FTR I'm at the point where other things being equal I'd prefer to put optimizations in 2.1. FTR, I agree in general. But in this case, we have a rather simple to fix bottleneck that makes a non-really crazy use case be 20 times slower than what you would go with thrift. At this point, it's not really an optimization imo, it's a bug that needs to be fixed. But I do not mean this ticket to be "let's do every optimizations for BatchStatement on single partition that comes into mind", that would definitively belong to 2.1.
        Hide
        Aleksey Yeschenko added a comment -

        (I'd prefer this one to go into 2.0)

        Show
        Aleksey Yeschenko added a comment - (I'd prefer this one to go into 2.0)
        Hide
        Jonathan Ellis added a comment -

        FTR I'm at the point where other things being equal I'd prefer to put optimizations in 2.1.

        Show
        Jonathan Ellis added a comment - FTR I'm at the point where other things being equal I'd prefer to put optimizations in 2.1.
        Jonathan Ellis made changes -
        Reviewer Benedict [ benedict ]
        Hide
        Jonathan Ellis added a comment -

        Benedict to review

        Show
        Jonathan Ellis added a comment - Benedict to review
        Sylvain Lebresne made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Sylvain Lebresne made changes -
        Attachment 6737.txt [ 12629839 ]
        Hide
        Sylvain Lebresne added a comment -

        Attaching patch to make batch statements only create one CF and RowMutation object per partition. On a relatively simple benchmark inserting a 10k rows batch into a single partition (using the DataStax java driver, code here: https://gist.github.com/pcmanus/9098347, this isn't meant to be fancy) I get up to more than 20x improvement with this patch (on batch insertion drop from >1.2 seconds to ~50-100ms).

        Note that there is more optimization that we can be done for single partition batches through some special casing, but this is a very simple start.

        Show
        Sylvain Lebresne added a comment - Attaching patch to make batch statements only create one CF and RowMutation object per partition. On a relatively simple benchmark inserting a 10k rows batch into a single partition (using the DataStax java driver, code here: https://gist.github.com/pcmanus/9098347 , this isn't meant to be fancy) I get up to more than 20x improvement with this patch (on batch insertion drop from >1.2 seconds to ~50-100ms). Note that there is more optimization that we can be done for single partition batches through some special casing, but this is a very simple start.
        Jonathan Ellis made changes -
        Field Original Value New Value
        Summary A batch statements on a single partition should create a new CF object for each update A batch statements on a single partition should not create a new CF object for each update
        Sylvain Lebresne created issue -

          People

          • Assignee:
            Sylvain Lebresne
            Reporter:
            Sylvain Lebresne
            Reviewer:
            Benedict
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development