Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Labels:
      None

      Description

      This could be done using the atomic ConcurrentMap operations from the Memtable and something like http://code.google.com/p/pcollections/ to replace the ConcurrentSkipListMap in ThreadSafeSortedColumns. The trick is that pcollections does not provide a SortedMap, so we probably need to write our own.

      Googling [persistent sortedmap] I found http://code.google.com/p/actord/source/browse/trunk/actord/src/main/scala/ff/collection (in Scala) and http://clojure.org/data_structures#Data Structures-Maps.

      1. snaptree-0.1-SNAPSHOT.jar
        111 kB
        Sylvain Lebresne
      2. row_isolation_test.py
        3 kB
        Cathy Daw
      3. latency-plain.svg
        79 kB
        Brandon Williams
      4. latency.svg
        78 kB
        Brandon Williams
      5. 2893-pauses.txt
        1 kB
        Jonathan Ellis
      6. 0003-Add-AtomicSortedColumn-and-snapTree-v2.patch
        29 kB
        Sylvain Lebresne
      7. 0003-Add-AtomicSortedColumn-and-snapTree.patch
        28 kB
        Sylvain Lebresne
      8. 0002-Make-memtable-use-CF.addAll-v2.patch
        11 kB
        Sylvain Lebresne
      9. 0002-Make-memtable-use-CF.addAll.patch
        11 kB
        Sylvain Lebresne
      10. 0001-Move-deletion-infos-into-ISortedColumns-v2.patch
        31 kB
        Sylvain Lebresne
      11. 0001-Move-deletion-infos-into-ISortedColumns.patch
        30 kB
        Sylvain Lebresne

        Activity

        Jonathan Ellis created issue -
        Jonathan Ellis made changes -
        Field: Description
        Original Value: This could be done using an AtomicReference and something like http://code.google.com/p/pcollections/ but with a SortedMap (for the columnfamily.columns collection). So we probably need to write our own.
        New Value: This could be done using the atomic ConcurrentMap operations from the Memtable and something like http://code.google.com/p/pcollections/ to replace the ConcurrentSkipListMap in ThreadSafeSortedColumns. The trick is that pcollections does not provide a SortedMap, so we probably need to write our own. Googling [persistent sortedmap] I found http://code.google.com/p/actord/source/browse/trunk/actord/src/main/scala/ff/collection (in Scala) and http://clojure.org/data_structures#Data Structures-Maps.
        Jonathan Ellis made changes -
        Assignee Sylvain Lebresne [ slebresne ]
        Jonathan Ellis added a comment - edited

        https://github.com/nbronson/snaptree boasts O(1) clone. (See also CASSANDRA-3389.)
        Jonathan Ellis added a comment -

        ... which means we would have the option to clone on read, instead of the persistent collections which essentially clone on write.

        Brandon Williams added a comment -

        The bad news from CASSANDRA-3389 is that I didn't see any performance difference at all with snaptree, but that's good news too since it's quite possibly viable here.

        Sylvain Lebresne added a comment -

        Attaching initial patches.

        The basic idea is to make it so that applying a mutation to a memtable is atomic, or in other words, make it use CF.addAll() and have that last operation be atomic and isolated (adding to the row cache also needs to be atomic and isolated but it uses CF.addAll already so making CF.addAll atomic is the solution for that too).

        To do that, addAll copies the initial cf, adds the new columns, and atomically compare-and-swaps the result with the old cf. To make this efficient, the patch uses the snapTree O(1) cloning (copy-on-write) facilities.
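
        To illustrate the mechanism, here is a minimal sketch of the clone-then-compare-and-swap loop in Python (illustration only, not the actual Java in the patch; the AtomicReference class and the plain dict copy are stand-ins for Java's AtomicReference and snapTree's O(1) copy-on-write clone):

```python
import threading

class AtomicReference:
    """Minimal stand-in for Java's AtomicReference (illustration only)."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expected, new):
        # Succeeds only if nobody swapped the map since our snapshot.
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

def add_all(ref, new_columns):
    """Sketch of the atomic addAll: snapshot, clone, merge, swap, retry."""
    while True:
        current = ref.get()           # snapshot the current column map
        modified = dict(current)      # stand-in for the O(1) clone
        modified.update(new_columns)  # add the new columns to the clone
        if ref.compare_and_set(current, modified):
            return                    # swap succeeded: the update is atomic
        # Another writer won the race; retry against the new snapshot.

# Concurrent writers retry until their swap lands, so no update is lost.
ref = AtomicReference({})
threads = [threading.Thread(target=add_all, args=(ref, {"c%d" % i: i}))
           for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```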

        I'm attaching the snapTree jar, but note that it's modified from the original (https://github.com/nbronson/snaptree) because it has a bug. The modified version with the small fix is at https://github.com/pcmanus/snaptree (I've issued a pull request). Btw, I don't know if the license of snapTree is compatible with the ASF one. Note that we only use the copy-on-write clone facility of snapTree, and not really the fact that it is thread-safe outside of that. So in particular a persistent sorted map could be used in place of snapTree if we wanted to, though the copy-on-write used by snapTree is likely to generate less garbage overall.

        I'm attaching 3 patches:

        • The first patch pushes the CF deletion infos from the AbstractColumnContainer to the ISortedColumns implementation. The reason is that we want both updates and deletes to be atomic and isolated, so we need to have those in the same structure.
        • The second patch modifies Memtable.apply() to use CF.addAll directly.
        • The third patch introduces AtomicSortedColumns using snapTree and uses it whenever thread-safety/isolation is needed. Note that it fully replaces ThreadSafeSortedColumns, which is removed, and also that the patch tries to limit the use of AtomicSortedColumns to concurrent contexts, making TreeMapBackedSortedColumns the default for other, non-concurrent contexts.

        There are two gray areas with this patch that I know of:

        • It would be easy to break isolation for super columns. If cf is an AtomicSortedColumns-backed (super) column family and you do sc = cf.getColumn(someSCname) and then sc.addAll(cols), that last operation won't be isolated. I don't think we do that in any context where it matters, but it's still something to be aware of.
        • Iterator-based removal is not thread-safe. Basically, if you iterate, doing removes using the iterator's remove() method, and there is a concurrent mutation on the cf, the remove may well just be ignored. I think the main place where we do iterator-based removes is during CFS.removeDeleted(). But that's mostly done during queries/compaction, so not in a concurrent context. We sometimes do a removeDeleted on cachedRow during compaction, but in that case it won't be the end of the world if that remove is ignored because of a concurrent mutation. Still, it's not very beautiful, but I don't see a simple solution (outside of not using iterator-based removes, that is).
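
        The iterator-removal race can be shown with a small single-threaded simulation of the interleaving (illustrative Python, not the patch's Java; Ref mimics the atomically swapped column map):

```python
class Ref:
    """Single-threaded stand-in for an atomically swapped column map."""
    def __init__(self, value):
        self.value = value

    def compare_and_set(self, expected, new):
        if self.value is expected:
            self.value = new
            return True
        return False

# The row initially holds a removable column plus a live one.
ref = Ref({"dead": 0, "live": 1})

# The remover takes a snapshot and prunes its own copy,
# as iterator-based removal effectively does.
snapshot = ref.value
pruned = {k: v for k, v in snapshot.items() if k != "dead"}

# A concurrent mutation swaps in a new map before the remover finishes.
current = ref.value
ref.compare_and_set(current, dict(current, new=2))

# The remover's swap now fails against its stale snapshot; since the
# removal path does not retry, the remove is silently ignored.
removed = ref.compare_and_set(snapshot, pruned)
```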

        Overall, I think the patch is ready for benchmarking (all unit tests are passing). I did a very quick stress test on my localhost and I didn't see any noticeable difference with or without the patch (neither writes nor reads). But 1) that was not a very scientific benchmark and 2) it was a short benchmark. I don't think raw performance will be a problem with this patch; the problem is that it generates more garbage, which itself may degrade performance in the long run. That's probably what we'll want to benchmark.

        Sylvain Lebresne made changes -
        Sylvain Lebresne made changes -
        Attachment snaptree-0.1-SNAPSHOT.jar [ 12503858 ]
        Sylvain Lebresne made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Fix Version/s 1.1 [ 12317615 ]
        Jonathan Ellis added a comment -

        SnapTree License looks like standard bsd-with-attribution, which is ASL-compatible.

        Jonathan Ellis added a comment -

        I didn't see any noticeable difference

        Do we even have an insert test mode that can generate same-row contention?

        it generates more garbage

        but young-gen garbage, which is close to free.

        Sylvain Lebresne added a comment -

        Do we even have an insert test mode that can generate same-row contention?

        I don't think so, but that would be useful, even outside of testing this.

        but young-gen garbage, which is close to free

        I suppose it depends, but I don't think it will necessarily be young-gen; on the contrary, even. The added garbage comes from the fact that when we add new columns to a row already in the memtable, we copy a number of nodes of the underlying SnapTree. But those old nodes have been live since their initial insertion, and the new ones will be live until the next insertion into this row (or longer, since we don't do a full copy). So this roughly depends on the average time between two updates to the same row in a memtable. I'm not totally sure it fits into the young gen. That being said, I'm not particularly worried.

        Jonathan Ellis added a comment -

        Do we even have an insert test mode that can generate same-row contention?

        CASSANDRA-3498

        Jonathan Ellis added a comment -

        Turns out we already have the -F option which lets us test overwrites. A more realistic row contention test would be for appends: CASSANDRA-3541

        Brandon Williams added a comment -

        Sylvain, can you rebase for trunk? Thanks.

        Sylvain Lebresne added a comment -

        Rebased

        Sylvain Lebresne made changes -
        Brandon Williams added a comment - edited

        I tested both trunk and trunk w/2893 applied. I did 100M+ appends to a single row in each case with 300 threads/conns at a time. Attached is a chart of the write latency histogram for both.

        The median latency was the same, though overall 2893 was higher. Overall this doesn't seem too bad, especially considering 300 clients contending for a single row isn't too common a scenario.

        Throughput was roughly the same for both (2893 was actually a tiny bit faster, but well within standard deviation.)

        Brandon Williams made changes -
        Attachment latency.svg [ 12508280 ]
        Jonathan Ellis added a comment -

        Just to make sure, can you verify that there's no latency penalty when NOT contending for same row?

        Brandon Williams added a comment -

        Here is a chart of write latencies against both in 'plain old stress' mode.

        Brandon Williams made changes -
        Attachment latency-plain.svg [ 12508301 ]
        Jonathan Ellis added a comment -

        So, we're talking about 10% higher latency in the non-contended case, and about 100% higher in the contended case? How can throughput be about the same given the same client concurrency in each case? That doesn't make sense to me.

        Brandon Williams added a comment - edited

        Maybe a chart isn't the best way to look at this. First, let me note that I mislabeled the X axis: these are microseconds, since I measured from cfhistograms to avoid any network interference, so the differences are smaller than I originally implied (and that may explain the throughput similarity, since network latency would dominate.)

        Let's look at this statistically (as derived from the histogram data) instead.

        Non-contended

        type     trunk    2893
        Mean     9.0105   9.5486
        SD       7.5795   7.6594
        10th %   5        5
        25th %   6        7
        50th %   8        8
        75th %   10       10
        95th %   42       42
        99th %   103      103

        Contended

        type     trunk     2893
        Mean     10.2212   14.9205
        SD       8.1631    11.7420
        10th %   6         8
        25th %   7         10
        50th %   8         12
        75th %   12        17
        95th %   42        72
        99th %   103       124

        Non-contended is almost exactly the same in every way. Things begin to differ with a contended row, but nothing I would call twice as bad.

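
        For reference, statistics like the ones above can be derived from cfhistograms-style bucket counts as follows (illustrative Python; the bucket bounds and counts below are made up for the example, not the measured data):

```python
def percentile(buckets, p):
    """buckets: (upper_bound_us, count) pairs sorted by bound.
    Returns the smallest bucket bound covering p percent of samples."""
    total = sum(count for _, count in buckets)
    cutoff = total * p / 100.0
    running = 0
    for bound, count in buckets:
        running += count
        if running >= cutoff:
            return bound
    return buckets[-1][0]

def mean(buckets):
    """Approximate mean, counting every sample at its bucket's bound."""
    total = sum(count for _, count in buckets)
    return sum(bound * count for bound, count in buckets) / total

# Hypothetical histogram: (latency bound in microseconds, sample count).
hist = [(5, 10), (8, 40), (12, 40), (42, 9), (103, 1)]
```
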
        Jonathan Ellis added a comment -

        Ah, I was curious about the millis vs micros. I figured you must be including network latency. micros makes more sense.

        I think something is wrong with either your chart or your percentiles though. The contended graph clearly shows a mean of 12+ – there's about 1/3 around 10, 1/3 around 12, and 1/3 14+. No way does that work out to 10.3.

        Brandon Williams added a comment -

        Oops, you're right, I accidentally used the same total for both when calculating the mean. Edited with the correct mean and SD for 2893 (percentiles not affected).

        Jonathan Ellis added a comment -

        Okay, so more accurately about 6% on uncontended and 50% on contended. Since the contended case is likely to be where people want the isolation, that seems reasonable.

        Brandon Williams added a comment -

        Okay, so more accurately about 6% on uncontended and 50% on contended. Since the contended case is likely to be where people want the isolation, that seems reasonable.

        +1 (though it's also important to remember this is in a local context; network latency makes these percentages lower in practice)

        Jonathan Ellis added a comment -

        On the code side: should we just get rid of AbstractColumnContainer now since it's basically just a wrapper around ISortedColumns?

        Doesn't Guava Functions.identity do the same as ACC.identity?

        + * In case we are adding a lot of columns, failing the final compare
        + * and swap could be expensive. To mitigate, we check we haven't been
        + * beaten by another thread after every column addition. If we have,
        + * we bail early, avoiding unnecessary work if possible.

        I wonder if this is premature optimization or even counterproductive for the uncontended case. Did you do any testing around this?

        More generally it may be worth looking at the before- and after- yourkit traces for uncontended to see if there's any improvement we can make in AtomicSortedColumns or if most of the overhead is from SnapTreeMap.

        Sylvain Lebresne added a comment -

        should we just get rid of AbstractColumnContainer now since it's basically just a wrapper around ISortedColumns?

        It's still useful to factor the code between ColumnFamily and SuperColumn. If we remove it, we'll basically have to duplicate ACC into both CF and SC. But I'll be more than happy to remove it when/if we remove SCs.

        Doesn't Guava Functions.identity do the same as ACC.identity?

        Totally, I guess I don't know Guava too well. I'll replace it.

        I wonder if this is premature optimization or even counterproductive for the uncontended case. Did you do any testing around this?

        To be honest, I did not test it. But the rationale is that I would be surprised if checking the atomic ref on each iteration has any kind of visible impact (compared to adding a column to a snaptree), while I have no doubt that in the contended case with non-trivially-small batches, not doing this will have a visible impact. It was more about trading a probably non-noticeable hit in the non-contended case to avoid bad outliers.

        I also thought about doing the reference check only every few iterations, but I figured that would likely not change anything since, again, an atomic ref get should be really fast.

        But obviously it doesn't cost much to check those assumptions and I'll do some testing.

        I'll also note that we could easily replace snaptree with a persistent sorted map. This could be faster than snaptree because there would be no internal synchronization (snaptree has internal synchronization that we mostly don't use, except for the fact that it allows a concurrency-safe copy-on-write). I have a persistent sorted map implementation almost ready, so I'll probably do that test at some point, but if we think snaptree is acceptable I suggest we go with it at first, since it is not at all clear the persistent sorted map will be faster (it will generate more garbage, for instance) and I'm not sure how long such a test will take.

        Jonathan Ellis added a comment - edited

        Rebased and committed with the Functions.identity change. Leaving open for potential performance enhancements.

        Hudson added a comment -

        Integrated in Cassandra #1270 (See https://builds.apache.org/job/Cassandra/1270/)
        add row-level isolation via SnapTree
        patch by slebresne; reviewed by jbellis for CASSANDRA-2893

        jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225001
        Files :

        • /cassandra/trunk/lib/licenses/snaptree-0.1-SNAPSHOT.txt
        • /cassandra/trunk/lib/snaptree-0.1-SNAPSHOT.jar
        • /cassandra/trunk/src/java/org/apache/cassandra/db/AbstractColumnContainer.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/AbstractThreadUnsafeSortedColumns.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ArrayBackedSortedColumns.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/AtomicSortedColumns.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/CollationController.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamily.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilySerializer.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ISortedColumns.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/Memtable.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/Row.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/RowMutation.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/SuperColumn.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ThreadSafeSortedColumns.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/TreeMapBackedSortedColumns.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/ArrayBackedSortedColumnsTest.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/ColumnFamilyTest.java
        Radim Kolar added a comment -

        What is the improvement over Cassandra 1.0? Writes to the same key in a CF were already atomic.

        You mean that a reader now waits until the new data has been written to the row by the writer, instead of returning old data?

        Jonathan Ellis added a comment -

        Atomic != Isolated: http://en.wikipedia.org/wiki/ACID
        Cathy Daw added a comment -

        I have a question about the test case attached: row_isolation_test.py

        Based on how the inserts and deletes are constructed, if I continually poll the column count of the test key while a large mutation is being applied, the only values I should see, if the operation is isolated, are 0, 100, or 200k. Otherwise I print an error message (my version of a 'diff').

        Running this script on different versions, I saw the following behavior. I am not sure whether the experiment is wrong or I have hit a bug, because my expectation was to see no diffs (i.e., no messages from the second thread).

        • 0.8.6: incremental column counts on the large column delete and insert (expected)
        • 1.0.7: incremental column counts on the large column delete (expected to see insert diffs as well)
        • 1.0 branch: incremental column counts on the large column delete (no diffs expected)
        • trunk: incremental column counts on the large column delete (no diffs expected)

        0.8.9 test run

        Thread 1: Insert to 100 - Start
        Thread 1: Insert End - Expect # columns = 100
        Thread 1: Insert to 200k - Start
        Thread 1: Insert End - Expect # columns = 200000
        Thread 1: Remove 200k - Start
        --- Thread 2: Mismatch Column Count.  Current # columns: 184628 Current Test: test3 - remove 200k
        --- Thread 2: Mismatch Column Count.  Current # columns: 44765 Current Test: test3 - remove 200k
        Thread 1: Remove End - Expect # columns = 0
        Thread 1: Insert to 100 - Start
        Thread 1: Insert End - Expect # columns = 100
        Thread 1: Remove 100 - Start
        Thread 1: Remove End - Expect # columns = 0
        Thread 1: Insert 100k to 300k - Start
        --- Thread 2: Mismatch Column Count.  Current # columns: 569 Current Test: test5 - insert 200k
        Thread 1: Insert End - Expect # columns = 200000
        
        Cathy Daw made changes -
        Attachment row_isolation_test.py [ 12511215 ]
        Brandon Williams added a comment -

        This is only in 1.1; are you seeing problems in trunk?

        Cathy Daw added a comment -

        I see the issue with the delete columns on trunk.

        What's confusing me is that I only see an issue inserting columns on 0.8.9; the insert issue doesn't reproduce on 1.0.x or trunk.

        Sylvain Lebresne added a comment -

        The problem you're seeing is due to the call to get_count, and more precisely to the paging we do server side since CASSANDRA-2894. As a result, get_count is not atomic. I'm not sure we have a better solution than documenting this as a limitation, but we'll have to keep it in mind if we introduce paging for CQL.

        Note that this explains not only why you're seeing delete mismatches on trunk, but also why you don't see mismatches on insert on 1.0: insertions happen in the same order as the pages are read, so I suspect the insertions were simply faster than the paging, and the timing happened to differ for deletes.

        Anyway, I've tested with get_count paging disabled (which requires hacking the server), and the results are as expected:

        • on 1.0, there are mismatches for insertions
        • on trunk, no mismatches are found

        For your tests, you'll want to use a get_slice call instead of get_count since we don't page those.

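        Why a paged count is not atomic can be shown deterministically with a toy model. This is a sketch under assumptions, not the Thrift get_count implementation: the mutation is injected between two pages to stand in for a racing delete, and all names are hypothetical.

```python
# Toy paged count: fetch sorted column names one page at a time,
# resuming each page after the last name seen, and sum the page sizes.
def paged_count(columns, page_size, mutate_between_pages=None):
    total, start = 0, ""
    while True:
        page = sorted(c for c in columns if c > start)[:page_size]
        if not page:
            return total
        total += len(page)
        start = page[-1]
        if mutate_between_pages:
            mutate_between_pages(columns)  # a racing writer strikes here
            mutate_between_pages = None    # mutate only once

columns = {"c%03d" % i for i in range(200)}

# A delete of all 200 columns landing between pages: the first page of
# 100 was already counted, so the total is neither 200 nor 0.
result = paged_count(columns, 100, lambda cols: cols.clear())
print(result)  # 100: an intermediate value an isolated read would never return
```

        With no concurrent mutation, the paged total matches the real size; the anomaly only appears when a write lands between pages, which is exactly what the polling thread in row_isolation_test.py was hitting.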
        Jonathan Ellis added a comment -

        I can also reproduce the apparently atomic insert behavior in 1.0. I'm not sure what quirk of Java optimization is causing it, to be honest. But I'm not worried about it because when I forced pauses between the columns being added ("pauses" patch attached) then I did see the counts change. (Specifically, I initialized the row first with "stress -t 1 -n 1 -c 1 -K 1", then sent 10000 columns with "stress -t 1 -n 1 -c 10000 -K 1".)

        In other words, whatever makes the updates highly likely to appear isolated does not guarantee it, so I'm comfortable that our mental model of the 1.0 Memtable is valid.

        Jonathan Ellis made changes -
        Attachment 2893-pauses.txt [ 12512166 ]
        Sylvain Lebresne added a comment -

        Closing, as this seems, as far as I can tell, to be working as expected.

        On the performance front, I ran a small benchmark. As far as the 'bail early during addAll' code is concerned, while how much it helps in practice remains to be proven, I'm fairly sure it has no noticeable negative impact on the uncontended case. It may be worth investigating further, but I propose to keep the code as it is, at least as far as 1.1.0 is concerned.

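        The copy-then-compare-and-swap loop behind an isolated addAll can be sketched as follows. This is a simplified model, not the Java AtomicSortedColumns code: the SnapTree clone is stood in for by a plain dict copy, CAS is simulated with a lock (CPython has no AtomicReference), and the class and method names are made up.

```python
import threading

class AtomicColumns:
    """Copy-on-write column container with an atomically swapped snapshot."""

    def __init__(self):
        self._ref = {}
        self._lock = threading.Lock()

    def _compare_and_set(self, expected, updated):
        # Simulated CAS: swap only if no other writer got there first.
        with self._lock:
            if self._ref is expected:
                self._ref = updated
                return True
            return False

    def add_all(self, columns):
        # Retry loop: snapshot, modify a private copy, attempt to publish.
        while True:
            current = self._ref
            modified = dict(current)  # stands in for a cheap SnapTree clone
            modified.update(columns)
            if self._compare_and_set(current, modified):
                return
            # Another writer won the race; retry against the new snapshot.

    def snapshot(self):
        return self._ref  # readers get a stable, never-mutated snapshot

cols = AtomicColumns()
cols.add_all({"a": 1, "b": 2})
cols.add_all({"b": 3, "c": 4})
print(cols.snapshot())  # {'a': 1, 'b': 3, 'c': 4}
```

        The trade-off discussed above lives in the retry loop: under contention a writer can discard a fully built copy and start over, which is where bailing out early can help, while the uncontended path is a single copy and swap.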
        Sylvain Lebresne made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Reviewer jbellis
        Resolution Fixed [ 1 ]
        Gavin made changes -
        Workflow no-reopen-closed, patch-avail [ 12620177 ] patch-available, re-open possible [ 12748948 ]
        Gavin made changes -
        Workflow patch-available, re-open possible [ 12748948 ] reopen-resolved, no closed status, patch-avail, testing [ 12756746 ]
        Transition                        Time In Source Status   Executions   Last Executer       Last Execution Date
        Open -> Patch Available           126d 10h 10m            1            Sylvain Lebresne    16/Nov/11 09:16
        Patch Available -> Resolved       85d 1h 7m               1            Sylvain Lebresne    09/Feb/12 10:24

          People

          • Assignee:
            Sylvain Lebresne
            Reporter:
            Jonathan Ellis
            Reviewer:
            Jonathan Ellis
          • Votes:
            0
            Watchers:
            7
