Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 1
    • Component/s: None
    • Labels: None

Activity

      Sylvain Lebresne added a comment -

      @Vijay: are you actively working on this? I'd be keen to give it a shot myself unless you are.

      Vijay added a comment -

      I just started looking into this, so please go ahead.

      Sylvain Lebresne added a comment -

      I've pushed a first version of this at https://github.com/pcmanus/cassandra/commits/3708.

      It's an inefficient version in that range tombstones are stored where the row deletion info is stored today. In other words, we deserialize all of a row's range tombstones on every read, so this won't scale beyond a few tombstones. I have plans to improve that, but in the meantime this version felt like a good first step: aside from not scaling, it is fully functional, and an improved version will build on it anyway. As such, feedback on this would already be useful.

      Basically, what this does is abstract the deletion details inside the (existing) DeletionInfo class, and then add support for range tombstones (stored in an interval tree) to that class. Among other details, this changes the serialized form of a ColumnFamily, and thus required pushing the message versioning down to ColumnFamilySerializer.
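
      To make the shape of that abstraction concrete, here is a minimal self-contained sketch of the idea (hypothetical names, not the actual class: the real DeletionInfo stores the ranges in an interval tree rather than a list, and compares ByteBuffer column names with the table's comparator):

      import java.util.ArrayList;
      import java.util.List;

      // Hypothetical sketch: a row-level deletion extended with range tombstones.
      final class DeletionInfoSketch
      {
          static final class RangeTombstone
          {
              final String min, max;          // inclusive column-name bounds
              final long markedForDeleteAt;   // deletion timestamp

              RangeTombstone(String min, String max, long markedForDeleteAt)
              {
                  this.min = min;
                  this.max = max;
                  this.markedForDeleteAt = markedForDeleteAt;
              }
          }

          private final long topLevelMarkedForDeleteAt;  // whole-row deletion
          private final List<RangeTombstone> ranges = new ArrayList<RangeTombstone>();

          DeletionInfoSketch(long topLevelMarkedForDeleteAt)
          {
              this.topLevelMarkedForDeleteAt = topLevelMarkedForDeleteAt;
          }

          void add(RangeTombstone rt)
          {
              ranges.add(rt);
          }

          // A column is shadowed if the whole-row deletion is newer than it, or if
          // some range tombstone both contains its name and is newer than it.
          boolean isDeleted(String columnName, long timestamp)
          {
              if (timestamp <= topLevelMarkedForDeleteAt)
                  return true;
              for (RangeTombstone rt : ranges)
                  if (rt.min.compareTo(columnName) <= 0
                      && rt.max.compareTo(columnName) >= 0
                      && timestamp <= rt.markedForDeleteAt)
                      return true;
              return false;
          }
      }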

      Unit tests are passing, and I've also put some simple tests at https://github.com/pcmanus/cassandra-dtest/commits/3708_tests. Note that only CQL3 has support for those range tombstones.

      Sylvain Lebresne added a comment -

      I've updated my branch at https://github.com/pcmanus/cassandra/commits/3708 to add efficient on-disk handling of the new range tombstones.

      The idea is that we don't want to read every range tombstone for each query, only the ones covering the columns queried. To that end, the range tombstones are written along with the columns themselves. So the basic principle of the patch is that if we have a range tombstone RT[x, y] deleting all columns between x and y, we write a tombstone marker on disk before column x. Of course, in practice it's more complicated, because we want to be sure to read that tombstone even if we read only, say, y. To ensure that, the tombstone marker is repeated at the beginning of every column block (index block) the range covers (the code is smart enough not to repeat a marker that is superseded by another one, so in practice there won't be many repeated markers at the beginning of each block).
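
      To illustrate the repetition scheme, here is a small self-contained sketch (the Atom/Marker types, block layout, and writeBlocks method are all illustrative assumptions, not the actual serializer): open range tombstone markers are carried across index block boundaries and re-emitted at the head of every block their range reaches into.

      import java.util.ArrayList;
      import java.util.Iterator;
      import java.util.List;

      final class MarkerRepetitionSketch
      {
          static class Atom
          {
              final String name;   // sort key of the atom
              Atom(String name) { this.name = name; }
          }

          // An open marker for RT[min, max], written on disk just before column min.
          static final class Marker extends Atom
          {
              final String max;
              Marker(String min, String max) { super(min); this.max = max; }
          }

          // Split sorted atoms into index blocks of blockSize entries. At each block
          // boundary, markers whose range extends into the new block are re-emitted
          // at its head (the real code also skips markers superseded by newer ones).
          static List<List<Atom>> writeBlocks(List<Atom> sorted, int blockSize)
          {
              List<List<Atom>> blocks = new ArrayList<List<Atom>>();
              List<Marker> open = new ArrayList<Marker>();
              List<Atom> block = new ArrayList<Atom>();
              for (Atom a : sorted)
              {
                  if (block.size() >= blockSize)
                  {
                      blocks.add(block);
                      block = new ArrayList<Atom>();
                      for (Iterator<Marker> it = open.iterator(); it.hasNext(); )
                      {
                          Marker m = it.next();
                          if (m.max.compareTo(a.name) < 0)
                              it.remove();       // range ended before this block
                          else
                              block.add(m);      // repeat the still-open marker
                      }
                  }
                  if (a instanceof Marker)
                      open.add((Marker) a);
                  block.add(a);
              }
              if (!block.isEmpty())
                  blocks.add(block);
              return blocks;
          }
      }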

      Note that those tombstone markers are specific to the on-disk format (in memory we use an interval tree), which has two consequences for the patch:

      1. The on-disk format now diverges a little from the wire format, so the code (hopefully) cleanly separates the serialization functions that deal with the on-disk format from the others. I don't think that distinction is a bad idea anyway, since we don't want to break the wire protocol but it's OK to change the on-disk format.
      2. On-disk column iterators (SSTable{Slice,Name}Iterator) have to handle those tombstone markers, which are not columns per se: after reading them from disk, we want to store them in the interval tree of the ColumnFamily object, not as an IColumn in the ColumnFamily map. To make that distinction, the code introduces an interface called OnDiskAtom, which basically represents either a column or a range tombstone; the sstable iterators return OnDiskAtoms, which are then added to the right place in the resulting ColumnFamily object (a minimal sketch of the idea follows this list). I do think this is the clean way to handle it, but it is responsible for quite a bit of the diff.
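
      Sketch of the OnDiskAtom idea (the names are illustrative, not the exact interface): the iterators yield a common supertype, and the consumer dispatches on the concrete kind when rebuilding a row.

      import java.nio.ByteBuffer;

      interface AtomSketch
      {
          ByteBuffer name();   // sort key, used for merging and indexing
      }

      final class ColumnAtomSketch implements AtomSketch
      {
          private final ByteBuffer name;
          ColumnAtomSketch(ByteBuffer name) { this.name = name; }
          public ByteBuffer name() { return name; }
      }

      final class RangeTombstoneAtomSketch implements AtomSketch
      {
          private final ByteBuffer min, max;
          RangeTombstoneAtomSketch(ByteBuffer min, ByteBuffer max)
          {
              this.min = min;
              this.max = max;
          }
          public ByteBuffer name() { return min; }   // sorts at the range's start
          ByteBuffer max() { return max; }
      }

      // Reconstruction then routes each atom to the right place, roughly:
      //   for each AtomSketch returned by the iterator:
      //       if it is a RangeTombstoneAtomSketch, add it to the ColumnFamily's
      //       DeletionInfo (the interval tree); otherwise add it to the column map.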

      I'll also note that both of those changes should be useful for CASSANDRA-4180 as well, to handle the end-of-row marker described in that issue.

      Now, I admit this patch is not a small one, but unit tests are passing and there are a few basic tests at https://github.com/pcmanus/cassandra-dtest/commits/3708_tests.

      Lastly, I'll add that CQL3 support for this is minimal as of this patch. We only allow what is basically the equivalent of the 'delete a whole super column' behavior, but it would be simple to allow more generic use of range tombstones, i.e. to allow statements like:

      DELETE FROM test WHERE k = 0 AND c > 3 AND c <= 10
      

      But the patch is big enough as it is, so that can wait.

      Sylvain Lebresne added a comment -

      I've rebased the patches to https://github.com/pcmanus/cassandra/commits/3708-2. I also reshuffled a bit which changes went into which patch, hopefully making it easier to review, though I don't know how well I succeeded (some changes in the earlier patches really only make sense together with the following ones).

      Yuki Morishita added a comment -

      I checked out Sylvain's latest branch, and overall it looks good to me. The new on-disk format works fine with compaction and streaming, as well as with RowMutations between nodes.

      One thing that doesn't work, though, is the following.
      I created a table as in the dtest for 3708:

      CREATE TABLE test1 (
        k int,
        c1 int,
        c2 int,
        v1 int,
        v2 int,
        PRIMARY KEY (k, c1, c2)
      );
      

      and inserted several rows.
      Deleting with the following statement works as expected in the dtest:

      DELETE FROM test1 WHERE k = 0 AND c1 = 0;
      

      But the following delete statement doesn't work:

      DELETE FROM test1 WHERE k = 0 AND c1 = 0 AND c2 = 0;
      

      Specifying which columns to delete does work, though:

      DELETE v1, v2 FROM test1 WHERE k = 0 AND c1 = 0 AND c2 = 0;
      

      From the log, I didn't see any DeletionInfo in the RowMutation for the first (non-working) statement:

      DEBUG [Thrift:4] 2012-05-17 16:27:46,515 StorageProxy.java (line 174) Mutations/ConsistencyLevel are [RowMutation(keyspace='Keyspace1', key='00000000', modifications=[ColumnFamily(test1 [])])]/ONE
      

      I think this is just a matter of fixing DeleteStatement. Other than that, LGTM.

      Sylvain Lebresne added a comment -

      You're right, DeleteStatement wasn't handling this case correctly. I pushed the simple fix to the 3708-2 branch.

      Yuki Morishita added a comment -

      +1

      Sylvain Lebresne added a comment -

      Committed, thanks

      Vijay added a comment -

      There is a bug in DeletionInfo:

      public int dataSize()
      {
          int size = TypeSizes.NATIVE.sizeof(topLevel.markedForDeleteAt);
          for (RangeTombstone r : ranges)
              size += r.data.markedForDeleteAt;
          return size;
      }

      1) We should do TypeSizes.NATIVE.sizeof(r.data.markedForDeleteAt) instead of adding the raw timestamp value.
      2) We should also account for the sizes of the range tombstones themselves (a sketch of both fixes follows).
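
      A corrected version along those lines might look like the following (a sketch of the two suggested fixes, not the exact code committed in 5ab69b6; the exact fields counted for each range tombstone are an assumption):

      public int dataSize()
      {
          int size = TypeSizes.NATIVE.sizeof(topLevel.markedForDeleteAt);
          for (RangeTombstone r : ranges)
          {
              // Fix 1: add the serialized size of the timestamp, not its raw value.
              size += TypeSizes.NATIVE.sizeof(r.data.markedForDeleteAt);
              // Fix 2 (assumed fields): also count the rest of the tombstone,
              // i.e. its bounds and its local deletion time.
              size += r.min.remaining() + r.max.remaining();
              size += TypeSizes.NATIVE.sizeof(r.data.localDeletionTime);
          }
          return size;
      }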

      Sylvain Lebresne added a comment -

      Right, I was arguably a little too hasty with my rebase-before-commit. Fixed in commit 5ab69b6, thanks.

      Vijay added a comment -

      Looks like there is another bug in the following code.

      AbstractType.onDiskAtomComparator

      int comp2 = AbstractType.this.compare(t1.max, t1.max);
      if (comp2 == 0)
          return t1.data.compareTo(t2.data);
      else
          return comp2;
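
      (Presumably the first comparison was meant to be between t1.max and t2.max, i.e.:)

      int comp2 = AbstractType.this.compare(t1.max, t2.max);
      if (comp2 == 0)
          return t1.data.compareTo(t2.data);
      else
          return comp2;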
      
      Sylvain Lebresne added a comment -

      Fixed, thanks for spotting it.

      Daniel Doubleday added a comment -

      IntervalTree:254

      if (comparator == null)
          Collections.sort(allEndpoints, comparator);
      else
          Collections.sort((List<Comparable>)allEndpoints);
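
      (The two branches appear swapped: the comparator is used exactly when it is null. Presumably the intent was:)

      if (comparator != null)
          Collections.sort(allEndpoints, comparator);
      else
          Collections.sort((List<Comparable>)allEndpoints);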
      
      Sylvain Lebresne added a comment -

      Good catch. Fixed in b4e47bca8c80de488, thanks Daniel.

      Daniel Doubleday added a comment -

      Don't want to reopen this again because it's not a bug, but maybe an improvement...

      It seems that searchInternal always descends into the tree even when there is no chance of finding a result anymore.
      Unless I'm missing something, an early-exit test could help (unit tests pass and do hit the early exit):

      void searchInternal(Interval<C, D> searchInterval, List<D> results)
      {
          if (comparePoints(searchInterval.max, low) < 0 ||
              comparePoints(searchInterval.min, high) > 0) return;
      
      Show
      Daniel Doubleday added a comment - Dont want to reopen again because its not a bug but maybe an improvement ... It seems that searchInternal is always descending into the tree even when there are no chances to find a result anymore. Unless I missing something a break early test could help (unit tests pass and hit the break early test): void searchInternal(Interval<C, D> searchInterval, List<D> results) { if (comparePoints(searchInterval.max, low) < 0 || comparePoints(searchInterval.min, high) > 0) return ;
      Hide
      Jonathan Ellis added a comment -

      lgtm, committed


People

    • Assignee: Sylvain Lebresne
    • Reporter: Jonathan Ellis
    • Reviewer: Yuki Morishita
    • Votes: 2
    • Watchers: 9
