Cassandra / CASSANDRA-8180

Optimize disk seek using min/max column name meta data when the LIMIT clause is used

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 3.4
    • Component/s: Local Write-Read Paths
    • Labels: None
    • Environment: Cassandra 2.0.10

Description

      I was working on an example of a sensor data table (time series) and faced a use case where C* does not optimize reads on disk.

      cqlsh:test> CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) WITH CLUSTERING ORDER BY (col DESC);
      cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
      ...
      >nodetool flush test test
      ...
      cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
      ...
      >nodetool flush test test
      ...
      cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
      ...
      >nodetool flush test test
      

      After that, I activate request tracing:

      cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
       activity                                                                  | timestamp    | source    | source_elapsed
      ---------------------------------------------------------------------------+--------------+-----------+----------------
                                                              execute_cql3_query | 23:48:46,498 | 127.0.0.1 |              0
                                  Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 23:48:46,498 | 127.0.0.1 |             74
                                                             Preparing statement | 23:48:46,499 | 127.0.0.1 |            253
                                        Executing single-partition query on test | 23:48:46,499 | 127.0.0.1 |            930
                                                    Acquiring sstable references | 23:48:46,499 | 127.0.0.1 |            943
                                                     Merging memtable tombstones | 23:48:46,499 | 127.0.0.1 |           1032
                                                     Key cache hit for sstable 3 | 23:48:46,500 | 127.0.0.1 |           1160
                                     Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 |           1173
                                                     Key cache hit for sstable 2 | 23:48:46,500 | 127.0.0.1 |           1889
                                     Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 |           1901
                                                     Key cache hit for sstable 1 | 23:48:46,501 | 127.0.0.1 |           2373
                                     Seeking to partition beginning in data file | 23:48:46,501 | 127.0.0.1 |           2384
       Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:48:46,501 | 127.0.0.1 |           2768
                                      Merging data from memtables and 3 sstables | 23:48:46,501 | 127.0.0.1 |           2784
                                              Read 2 live and 0 tombstoned cells | 23:48:46,501 | 127.0.0.1 |           2976
                                                                Request complete | 23:48:46,501 | 127.0.0.1 |           3551
      

      We can clearly see that C* hits 3 SSTables on disk instead of just one, although it has the min/max column name metadata to decide which SSTable contains the most recent data.

      Funnily enough, if we add a clause on the clustering column to the select, this time C* optimizes the read path:

      cqlsh:test> SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1;
       activity                                                                  | timestamp    | source    | source_elapsed
      ---------------------------------------------------------------------------+--------------+-----------+----------------
                                                              execute_cql3_query | 23:52:31,888 | 127.0.0.1 |              0
                     Parsing SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1; | 23:52:31,888 | 127.0.0.1 |             60
                                                             Preparing statement | 23:52:31,888 | 127.0.0.1 |            277
                                        Executing single-partition query on test | 23:52:31,889 | 127.0.0.1 |            961
                                                    Acquiring sstable references | 23:52:31,889 | 127.0.0.1 |            971
                                                     Merging memtable tombstones | 23:52:31,889 | 127.0.0.1 |           1020
                                                     Key cache hit for sstable 3 | 23:52:31,889 | 127.0.0.1 |           1108
                                     Seeking to partition beginning in data file | 23:52:31,889 | 127.0.0.1 |           1117
       Skipped 2/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:52:31,889 | 127.0.0.1 |           1611
                                      Merging data from memtables and 1 sstables | 23:52:31,890 | 127.0.0.1 |           1624
                                              Read 1 live and 0 tombstoned cells | 23:52:31,890 | 127.0.0.1 |           1700
                                                                Request complete | 23:52:31,890 | 127.0.0.1 |           2140
      
      1. 8180_001.yaml
        0.7 kB
        Stefania
      2. 8180_002.yaml
        0.7 kB
        Stefania

        Issue Links

          Activity

          mshuler Michael Shuler added a comment -

          Querying for all the rows with the primary key WHERE id=1 indeed requires scanning all the rows. Limiting your query to only those clustering column rows that also have col > 25 is precisely the kind of query optimization that users should be doing, with a well-thought-out schema and queries that answer efficiently. This is by design. If I'm misunderstanding something here, please re-open and explain further.

          slebresne Sylvain Lebresne added a comment -

          I'm also really curious as to how you expect this optimization to work, because I don't see how this could work.

          slebresne Sylvain Lebresne added a comment -

          Oh, unless you mean to order the sstables by their min column names and to query them one at a time (like we do for names filters) in the specific case where their column name intervals don't intersect. In which case, yes, that could indeed be useful with the compaction strategy from CASSANDRA-6602.

          doanduyhai DOAN DuyHai added a comment -

          The idea is:

          If there is no restriction on clustering columns but only a LIMIT:

          1) Order SSTables by min/max column depending on the first clustering order
          2) Hit the first SSTable and start sequential read until reaching LIMIT
          3) If LIMIT is large enough, switch to another SSTable and so on
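The three steps above can be sketched as follows. This is a hedged, self-contained toy model, not Cassandra's internals: `SSTableStub`, its fields, and `read()` are all illustrative stand-ins, and it assumes non-overlapping min/max clustering ranges, as in the flush-per-insert example from the description.

```java
import java.util.*;

// Illustrative stand-in for an sstable plus its min/max clustering metadata.
class SSTableStub {
    final String name;
    final int minCol, maxCol;      // min/max clustering column metadata
    final List<Integer> rows;      // clustering values, stored descending
    SSTableStub(String name, int minCol, int maxCol, List<Integer> rows) {
        this.name = name; this.minCol = minCol; this.maxCol = maxCol; this.rows = rows;
    }
}

class LimitOptimizedRead {
    // For CLUSTERING ORDER BY (col DESC): order sstables by max clustering
    // value descending, read sequentially, and stop once LIMIT rows are
    // produced; later sstables are never touched.
    static List<Integer> read(List<SSTableStub> sstables, int limit) {
        List<SSTableStub> ordered = new ArrayList<>(sstables);
        ordered.sort(Comparator.comparingInt((SSTableStub s) -> s.maxCol).reversed());
        List<Integer> result = new ArrayList<>();
        for (SSTableStub s : ordered)
            for (int row : s.rows) {
                result.add(row);
                if (result.size() == limit)
                    return result;
            }
        return result;
    }
}
```

With the three flushed sstables from the description, a LIMIT 1 read would visit only the sstable whose max clustering value is 30.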

          doanduyhai DOAN DuyHai added a comment -

          Another test showing that it could be optimized further:

          cqlsh:test> SELECT * FROM test WHERE id=1 AND col<40 LIMIT 1;
          
           id | col | val
          ----+-----+-----
            1 |  30 |  30
          
          (1 rows)
          
          
          Tracing session: 2725c710-5b86-11e4-aeed-814585a29e7b
          
           activity                                                                  | timestamp    | source    | source_elapsed
          ---------------------------------------------------------------------------+--------------+-----------+----------------
                                                                  execute_cql3_query | 16:00:46,850 | 127.0.0.1 |              0
                           Parsing SELECT * FROM test WHERE id=1 AND col<40 LIMIT 1; | 16:00:46,850 | 127.0.0.1 |             77
                                                                 Preparing statement | 16:00:46,850 | 127.0.0.1 |            244
                                            Executing single-partition query on test | 16:00:46,851 | 127.0.0.1 |           1485
                                                        Acquiring sstable references | 16:00:46,851 | 127.0.0.1 |           1500
                                                         Merging memtable tombstones | 16:00:46,851 | 127.0.0.1 |           1547
                                                         Key cache hit for sstable 3 | 16:00:46,852 | 127.0.0.1 |           1641
                                   Seeking to partition indexed section in data file | 16:00:46,852 | 127.0.0.1 |           1651
                                                         Key cache hit for sstable 2 | 16:00:46,854 | 127.0.0.1 |           4054
                                   Seeking to partition indexed section in data file | 16:00:46,854 | 127.0.0.1 |           4068
                                                         Key cache hit for sstable 1 | 16:00:46,855 | 127.0.0.1 |           5232
                                   Seeking to partition indexed section in data file | 16:00:46,855 | 127.0.0.1 |           5249
           Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 16:00:46,855 | 127.0.0.1 |           5499
                                          Merging data from memtables and 3 sstables | 16:00:46,855 | 127.0.0.1 |           5515
                                                  Read 2 live and 0 tombstoned cells | 16:00:46,855 | 127.0.0.1 |           5598
                                                                    Request complete | 16:00:46,855 | 127.0.0.1 |           5997
          

          Now, reversing the inequality on the clustering column, C* does scan 3 SSTables instead of just one (even though LIMIT = 1).

          Stefania Stefania added a comment -

          Sylvain Lebresne, Tyler Hobbs, Aleksey Yeschenko :

          I think it might make sense to implement this change directly on a branch based on 8099_engine_refactor. First of all, I found it much easier to understand, and secondly, I don't particularly want to rebase or merge later on once 8099 is merged into trunk. Any concerns?

          I've been looking at the code on 8099 today, and I cannot find a way to implement this unless we iterate twice, the first time to count until the limit has been reached in SinglePartitionSliceCommand and the second time to return the data. Or have I missed something? If not, I think we need to store the data in memory via an ArrayBackedPartition, is this correct?

          Here is a very inefficient and ugly way to do this; may I have some pointers on how to improve it?

          https://github.com/stef1927/cassandra/commits/8180-8099

          Specifically in querySSTablesByClustering() at line 254 of SinglePartitionSliceCommand.java.

          slebresne Sylvain Lebresne added a comment -

          I found it much easier to understand

          Glad that it's the case.

          I think it might make sense if I implement this change directly on a branch based on 8099_engine_refactor

          I wouldn't be the one to blame you for that.

          I cannot find a way to implement this unless we iterate twice, the first time to count until the limit has been reached in SinglePartitionSliceCommand and the second time to return the data

          You actually don't have to care about the limit (in SinglePartitionSliceCommand at least). The way to do this would be to return an iterator that first queries and returns the results of the first sstable and, once it has returned all results, transparently queries the 2nd sstable and starts returning those results, etc.

          That being said, I do suspect doing this at the merging level (in MergeIterator) would be better. The idea would be to specialize the merge iterator to take specific iterators that expose some lowerBound() method. That method would be allowed to return a value that is not returned by the iterator but is lower than anything it will return. The merge iterator would use those lower bounds as initial candidates for the iterators, but know that when it consumes those candidates it should just discard them (and get the actual next value of the iterator). Basically, we'd add a way for an iterator to say "don't bother using me until you've at least reached value X". The sstable iterators would typically implement that lowerBound method by returning the sstable "min column name". Provided we make sure the sstable iterators don't do any work unless their hasNext/next methods are called, we wouldn't actually use a sstable until we've reached its "min column name".

          Doing it that way would have two advantages over doing it at the "collation" level:

          1. It is more general, as it would work even if the sstables' min/max column names intersect (it's harder/uglier to do the same at the collation level IMO).
          2. It would work for range queries too.

          We may want to build that on top of CASSANDRA-8915 however.
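A toy model of this lowerBound() merging, with plain integers standing in for clustering prefixes (all names here are hypothetical, not the actual MergeIterator API): each source seeds the heap with its lower bound, and its real iterator is only opened once that bound reaches the head of the merge. The `opened` flag makes the laziness observable.

```java
import java.util.*;

// A source that advertises a lower bound <= everything it will produce.
class BoundedSource {
    final int lowerBound;
    final List<Integer> values;      // sorted ascending
    boolean opened = false;          // whether the real data was ever touched
    BoundedSource(int lowerBound, List<Integer> values) {
        this.lowerBound = lowerBound;
        this.values = values;
    }
}

class LowerBoundMerge {
    // Merge in sorted order up to `limit` results, seeding each source's
    // first heap candidate with its fake lower bound.
    static List<Integer> merge(List<BoundedSource> sources, int limit) {
        // candidate = {value, sourceIndex, isFakeBound}
        PriorityQueue<int[]> heap =
            new PriorityQueue<int[]>(Comparator.comparingInt((int[] c) -> c[0]));
        List<Iterator<Integer>> iters =
            new ArrayList<>(Collections.nCopies(sources.size(), null));
        for (int i = 0; i < sources.size(); i++)
            heap.add(new int[]{ sources.get(i).lowerBound, i, 1 });
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty() && out.size() < limit) {
            int[] c = heap.poll();
            int i = c[1];
            if (c[2] == 1) {                      // fake bound hit the head:
                sources.get(i).opened = true;     // only now open the source
                iters.set(i, sources.get(i).values.iterator());
            } else {
                out.add(c[0]);                    // real value: emit it
            }
            if (iters.get(i).hasNext())
                heap.add(new int[]{ iters.get(i).next(), i, 0 });
        }
        return out;
    }
}
```

If one source's lower bound lies past everything the query needs, that source is never opened, which is the disk seek this ticket wants to avoid.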

          Stefania Stefania added a comment -

          Using MergeIterator is a great idea, I still have some details to iron out but it's already looking much better.

          I have one question: the iterator is over atoms, but the sstable min and max column names are lists of ByteBuffer, which I can compare with atoms using the ClusteringComparator. It would be nice to convert the lower bound to an atom so we can have only one generic type (the In type) in the MergeIterator specialization, which must feed atoms upstream. Is there a way to do this, or do I just have to settle for having two different (comparable) types in MergeIterator?

          slebresne Sylvain Lebresne added a comment -

          it would be nice to convert the lower bound to an atom

          I suspect you don't need to have an "Atom", only a "Clusterable", in which case you can convert the lower bound to a "Clustering" with something like new SimpleClustering(sstable.minClusteringValues.toArray(new ByteBuffer[metadata.clusteringColumns().size()])).

          Stefania Stefania added a comment -

          You are correct, a Clusterable was sufficient.

          There is one disk access I am not sure if we can remove, when we read the partition level deletion:

          status.mostRecentPartitionTombstone = Math.max(status.mostRecentPartitionTombstone, iter.partitionLevelDeletion().markedForDeleteAt());
          

          We need to read the partition level deletion in the initial for loop of SinglePartitionReadCommand with the sstables ordered by max timestamp in order to skip older sstables:

          if (sstable.getMaxTimestamp() < status.mostRecentPartitionTombstone)
              break;
          

          I don't see how we could skip older sstables if they are not ordered by max timestamp and are instead picked lazily by the merge iterator when they become eligible according to their lower bound. As a consequence I don't know how to postpone reading the partition level deletion.
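A self-contained toy version of the ordering-dependent skip being discussed (TableStub and its fields are illustrative stand-ins) shows why the max-timestamp ordering is load-bearing here:

```java
import java.util.*;

// Illustrative stand-in for an sstable's timestamp metadata.
class TableStub {
    final String name;
    final long maxTimestamp;          // newest cell timestamp in the sstable
    final long partitionTombstoneTs;  // markedForDeleteAt, or Long.MIN_VALUE if none
    TableStub(String name, long maxTimestamp, long partitionTombstoneTs) {
        this.name = name;
        this.maxTimestamp = maxTimestamp;
        this.partitionTombstoneTs = partitionTombstoneTs;
    }
}

class TombstoneSkip {
    // Visiting sstables in descending max-timestamp order lets us stop at the
    // first sstable entirely older than the most recent partition tombstone
    // seen so far; without that ordering, the break is not valid.
    static List<String> tablesToRead(List<TableStub> tables) {
        List<TableStub> ordered = new ArrayList<>(tables);
        ordered.sort(Comparator.comparingLong((TableStub t) -> t.maxTimestamp).reversed());
        long mostRecentTombstone = Long.MIN_VALUE;
        List<String> read = new ArrayList<>();
        for (TableStub t : ordered) {
            if (t.maxTimestamp < mostRecentTombstone)
                break;                // everything from here on is shadowed
            mostRecentTombstone = Math.max(mostRecentTombstone, t.partitionTombstoneTs);
            read.add(t.name);
        }
        return read;
    }
}
```

Once the merge iterator picks sstables lazily by lower bound instead of by max timestamp, this break condition no longer has a natural place, which is exactly the tension described above.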

          So far I have some code that still calls iter.partitionLevelDeletion() in that initial for loop, but other iterator methods should not be called until the sstable is picked by the merge iterator.

          Can we do better?

          Code is here: https://github.com/stef1927/cassandra/tree/8180-8099

          slebresne Sylvain Lebresne added a comment -

          There is one disk access I am not sure if we can remove, when we read the partition level deletion

          It's a good point. A simple optimization, however, could be to collect whether a sstable has any partition-level deletion in the first place. I suspect that it's not that uncommon to not use partition-level deletions at all, and so we could avoid that seek at least in that case.

          Now I'll also note that when we do have to fetch the partition deletion, if the partition is indexed, we get it from the index file. This means we at least avoid seeking into the data file, but it also means that with the index entry comes the partition row index, which is actually more precise than the global sstable min/max clustering values. In other words, an optimization we could do is: when we do have the index entry at the time the sstable iterator's "lower bound" is queried, we should return the min clustering from the row index rather than the one from the sstable min clustering values.
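The two optimizations above could be sketched like this. Every name here (hasPartitionLevelDeletions, indexMinClustering, globalMinClustering) is hypothetical rather than Cassandra's actual metadata API, and an Integer stands in for a clustering prefix:

```java
// Hedged sketch of the per-sstable metadata the two optimizations rely on.
class SSTableMeta {
    boolean hasPartitionLevelDeletions;   // collected when the sstable is written
    Integer indexMinClustering;           // from the row index, if the partition is indexed
    int globalMinClustering;              // sstable-wide min clustering value

    // First optimization: skip the partition-deletion seek entirely when the
    // sstable is known to contain no partition-level deletions at all.
    boolean needsPartitionDeletionSeek() {
        return hasPartitionLevelDeletions;
    }

    // Second optimization: prefer the per-partition row-index bound, which is
    // at least as tight as the sstable-wide min clustering value.
    int lowerBound() {
        return indexMinClustering != null ? indexMinClustering : globalMinClustering;
    }
}
```

The tighter the lower bound, the later the merge iterator has to open the sstable, so the row-index bound should win whenever it is available.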

          Stefania Stefania added a comment - - edited

          Thanks for your input. With these two optimizations in place, I think it's starting to be in decent shape and I would be happy for a first round of review.

          However, we can delay this until CASSANDRA-8099 has been completed, since people are busy with it and this change is based on the same branch.

          Also, once CASSANDRA-8915 is available, we will have to integrate it, so more work and another review will be required then.

          Finally, although I did run all unit tests and the cql dtests, and verified that the only failures also apply to the 8099 branch, it could use more testing once 8099 is stable and 8915 is integrated.

          This is what the trace for the basic example above looks like now (with source and timestamp removed to improve formatting):

          cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
          
          [...]
          
            activity                                                                                        | source_elapsed
          -------------------------------------------------------------------------------------------------+----------------
                                                                                        Execute CQL3 query |              0
                                      Parsing SELECT * FROM test WHERE id=1 LIMIT 1; [SharedPool-Worker-1] |            211
                                                                 Preparing statement [SharedPool-Worker-1] |            430
                                            Executing single-partition query on test [SharedPool-Worker-3] |            849
                                                        Acquiring sstable references [SharedPool-Worker-3] |            915
           Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-3] |           1075
                                            Merging data from memtables and sstables [SharedPool-Worker-3] |           1133
                                                         Key cache hit for sstable 3 [SharedPool-Worker-1] |           1390
                                                                 Accessed 1 sstables [SharedPool-Worker-1] |           1901
                                                  Read 1 live and 0 tombstoned cells [SharedPool-Worker-1] |           1986
                                                                                          Request complete |           2035
          
          Stefania Stefania added a comment - - edited

          Rebased to latest 8099 branch. Test results available here:

          http://cassci.datastax.com/view/Dev/view/Stefania/job/stef1927-8180-8099-testall/
          http://cassci.datastax.com/view/Dev/view/Stefania/job/stef1927-8180-8099-dtest/

          Many tests fail, but so does the plain 8099 branch:

          http://cassci.datastax.com/view/Dev/view/Stefania/job/stef1927-8099_engine_refactor-testall/
          http://cassci.datastax.com/view/Dev/view/Stefania/job/stef1927-8099_engine_refactor-dtest/
          Stefania Stefania added a comment - - edited

          I'm going to rebase this and bring it back to life after the merge of CASSANDRA-8099 and CASSANDRA-8915.

          Stefania Stefania added a comment -

          I have attached the latest patch based on trunk, which is ready for review.

          The main things to note are:

          • In MergeIterator we need to discard fake lower-bound values, and we must advance the same iterator to check that there isn't a real value with the exact same value, in which case we must pass it to the reducer together with the equal values from the other iterators. However, we cannot discard fake values too soon; i.e., in ManyToOne.advance() we cannot just advance as long as we have fake values, we need to wait for the heap to be sorted first. I implemented a peek method in the candidate; it is called for fake values when reducing values. This should not affect the correctness of the algorithm, but there may be a more efficient way to do this, cc Benedict and Branimir Lambov.
          • We need to decide if we are happy with a wrapping iterator, LowerBoundUnfilteredRowIterator, in which case we may need a better name. I understand that wrapping too many iterators may hurt performance, so we may want to look into modifying AbstractSSTableIterator directly, even though this might be a bit more work. I also wrap the merged iterator to make sure we update the metrics of iterated tables in the close method, when we know how many tables were iterated. It would be nice to at least do away with this wrapped iterator.
          • We need a reliable way to signal a fake Unfiltered object (the lower bound). I used an empty row but perhaps we should use a new specialization for Unfiltered or something else.

          Sylvain Lebresne are you happy to still be the reviewer or do you want to suggest someone else?

          slebresne Sylvain Lebresne added a comment -

          Sylvain Lebresne are you happy to still be the reviewer or do you want to suggest someone else?

          I'm leaving on vacation at the end of the week and have enough on my plate until then that it's safe to assume I won't have time to look at this one. I'm happy to have a look when I'm back and have a tad more time, but it's probably a good idea to find another reviewer in the meantime in case that new reviewer has time to get to it sooner. I would suggest Branimir if he has some cycles since that ticket has a lot to do with MergeIterator and he has been dealing with that quite a bit lately.

          Stefania Stefania added a comment -

          Thanks. Branimir Lambov would you have time to review this or shall we wait for Sylvain's return?

          slebresne Sylvain Lebresne added a comment -

          Side note: would be nice to have some benchmarks validating the benefits of this before we commit, since that's a performance improvement.

          Stefania Stefania added a comment -

          Noted. I'll work on a suitable stress profile and then attach some flight recorder or cstar_perf comparisons.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          Do we have a place where we're archiving these stress profiles? It seems like it would be helpful to have each profile paired with a small comment on what specific setup/portions of the system it stresses, for future use. We could also socket those into the perf harness when that's ready.

          blambov Branimir Lambov added a comment -

          Yes, I have started looking at it now.

          Stefania Stefania added a comment - - edited

          We definitely need a place to archive these stress profiles, attaching them to the tickets like I've done so far is not very efficient. CASSANDRA-8503 aims to collect important stress profiles for regression testing, perhaps we can continue the discussion there?

          In terms of this specific performance optimization, so far things aren't going too well. The problem is that I don't know how to create a 'timeseries' profile with cassandra-stress; is this possible, Benedict? We basically need to have many clustering rows per partition, and they must be ordered.

          With the profile attached, at best I have been able to show that we are not worse:

          http://cstar.datastax.com/tests/id/ac8c686c-31de-11e5-95c3-42010af0688f
          http://cstar.datastax.com/tests/id/11dc0080-31dd-11e5-80f3-42010af0688f

          However, when I changed the primary key population distribution from 1..100K to 1B, I ended up with something worse:

          http://cstar.datastax.com/tests/id/ac8de5ca-31de-11e5-a5b9-42010af0688f

          So I think there is still some work to do. I have also seen some weirdness in the flight recorder profiles, which I have not attached due to their size but can share if anyone is interested. I fixed two hot-spots today but there must be at least one more problem. The whole approach of a lazy wrapping iterator is a bit fragile IMO: all it takes is calling a method too soon and the entire optimization is lost, while the overhead of the wrapper iterator and the increased complexity in the merge iterator remain.

          Stefania Stefania added a comment -

          Thanks, make sure you have the latest code as I have rebased and fixed a couple of things today.

          benedict Benedict added a comment -

          Support for that isn't fantastic, but instead of providing a select chance of 0.001, you can use the -insert revisit, visits and contents options, together with the -pop read-lookback option.

          • visits tells stress to split inserts for a single partition up into multiple visits, performing a roughly equal number of inserts per visit, but ultimately inserting the entirety of the partition's contents
          • revisit tells stress how many of these split-up inserts to maintain at once, and in what distribution they should be revisited (i.e. exp(1..1M) would mean the most recently inserted would be exponentially more likely to be visited than the oldest; uniform(1..1M) would make them all uniformly likely; both would limit the number of "in flight" inserts to 1M)
          • read-lookback tells reads to operate only over those partitions that have been inserted, and only their portion that has been inserted
          • contents=SORTED sorts the partition data before insert. This is expensive if there are many items, so you should ensure no single clustering column generates more than a couple of dozen values. If necessary, introduce more clustering columns.

          The main problem is that these features are not well (if at all) tested, so you may encounter problems.
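          Pulling these options together, a concrete invocation might look like the following sketch (the profile name and distribution parameters are illustrative, not taken from this ticket; the flag layout mirrors the commands posted later in the thread):

          ```shell
          # Hypothetical cassandra-stress run: split each partition's inserts over
          # 25 visits, revisit up to 1000 in-flight partitions uniformly, sort
          # partition contents before insert, and restrict reads to already-written data.
          cassandra-stress user profile=timeseries.yaml ops\(insert=1\) n=1M \
            -rate threads=100 \
            -insert revisit=uniform\(1..1K\) visits=fixed\(25\) \
            -pop seq=1..1K read-lookback=uniform\(1..1K\) contents=SORTED
          ```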

          blambov Branimir Lambov added a comment -

          I have some worries that relate to the use of a row as the lower bound: first, there is a small risk that having two equal values produced by an iterator could be mishandled by some existing or future code, and second, I do not know what would happen if an sstable's content starts with a range tombstone, which sorts before rows with equal clustering. Additionally, I see no changes in MergeIterator.OneToOne or TrivialOneToOne, do they not need updates?

          I believe the solution could be simpler and cleaner if the bound is given as a distinct special value that sorts before any content (similar to how range tombstone bounds sort). The existence of these special values needs only concern the arguments given to MergeIterator: the comparator, which should know to sort them before anything else with the same key (including tombstone bounds), as well as the reducer, which can then be certain that no attempt will be made to combine a bound with anything else, and can safely ignore it by perhaps returning null at getReduced; if necessary, bounds can easily be made unequal to each other as well. MergeIterator will then pop bounds before content rather than with it, and will not need to be modified at all.
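          The ordering property this relies on can be illustrated with a small standalone sketch (all names here are illustrative, not Cassandra's actual classes): if the comparator given to the merge iterator orders a bound sentinel strictly before any real item with the same clustering, the heap pops bounds on their own and the reducer can simply skip them.

          ```java
          import java.util.ArrayList;
          import java.util.Comparator;
          import java.util.List;

          public class Main {
              // kind 0 = lower-bound sentinel, kind 1 = real row (illustrative encoding)
              static final class Item {
                  final int clustering;
                  final int kind;
                  Item(int clustering, int kind) { this.clustering = clustering; this.kind = kind; }
              }

              // Primary order: the clustering key; tie-break: sentinels (kind 0) before real rows.
              static final Comparator<Item> CMP =
                  Comparator.<Item>comparingInt(i -> i.clustering).thenComparingInt(i -> i.kind);

              public static void main(String[] args) {
                  List<Item> items = new ArrayList<>();
                  items.add(new Item(10, 1)); // real row at clustering 10
                  items.add(new Item(10, 0)); // lower bound for the same clustering
                  items.sort(CMP);
                  // The bound sorts strictly first, so a reducer never has to merge it
                  // with the real row; it can just ignore it (e.g. return null from getReduced).
                  System.out.println(items.get(0).kind == 0 ? "bound first" : "row first");
              }
          }
          ```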

          Stefania Stefania added a comment -

          I have some worries that relate to the use of a row as the lower bound: for one there is a small risk that having two equal values produced by an iterators could be mishandled by some existing or future code, and as a second I do not know what would happen if an sstable's content starts with a range tombstone, which sorts before rows with equal clustering.

          The idea is that these fake Clusterable instances never leave the merge iterators. I can create a sub-class of Clusterable for clarity, but it's safe to use an empty row as it is only returned as a lower bound and never as an iterated value. Have you looked at the version of the code I pushed today, where the candidate takes the lower bound from the iterator? At the moment the reducer decides if it should ignore it by checking if the row is empty. This is wrong: it should be the candidate that decides the bound should be ignored and returns the next value instead, except this requires a new comparison.

          Additionally, I see no changes in MergeIterator.OneToOne or TrivialOneToOne, do they not need updates?

          No they don't need updating, they never use the lower bound.

          I believe the solution could be simpler and cleaner if the bound is given as a distinct special value that sorts before any content...

          Yes this would solve the problem just mentioned above where we need to check if there is an equal real value immediately afterwards. I'll work on this idea, thank you very much!

          Stefania Stefania added a comment -

          Branimir Lambov I've pushed a new commit where I ensure that lower bounds always compare less than real values when the clustering is the same. This means the existing merge iterator algorithm is now almost unchanged; the only difference is that I moved consume() into the candidate, where we make sure the lower bounds are never consumed by the reducer. We still use empty rows as lower bounds, but they are never used outside of the merge iterator candidate. We could use a specialized Unfiltered if it really bothers you, however. Can you take another look?

          As for performance, with this test the 8180 branch is ahead:

          user profile=https://dl.dropboxusercontent.com/u/15683245/8180.yaml ops\(insert=1,\) n=5M -rate threads=300 -insert revisit=uniform\(1..100\) visits=fixed\(25\) -pop seq=1..1K read-lookback=uniform\(1..1K\) contents=SORTED
          

          http://cstar.datastax.com/graph?stats=094f57cc-3409-11e5-bd2b-42010af0688f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=1609.08&ymin=0&ymax=4637.6

          The read command is unchanged and performance is similar or maybe still better on trunk:

          user profile=https://dl.dropboxusercontent.com/u/15683245/8180.yaml ops\(singleval=1,\) n=5M -rate threads=300
          

          http://cstar.datastax.com/graph?stats=094f57cc-3409-11e5-bd2b-42010af0688f&metric=op_rate&operation=2_user&smoothing=1&show_aggregates=true&xmin=0&xmax=35.53&ymin=0&ymax=188763.3

          I think we are still visiting all sstables on the second command because the global bounds are probably the same. Profile is attached as 8180_002.yaml.

          I've also noticed with flight recorder that BigTableReader.getPosition() and Tracing.trace() are hotspots, both on trunk and on 8180; we should probably optimize them.

          slebresne Sylvain Lebresne added a comment -

          Stefania Could you have a shot at rebasing this at some point, and Branimir Lambov have another look at it once that's done? I'd hate for this code to get so outdated that it's too much effort to rebase.

          Stefania Stefania added a comment -

          Will rebase and also try to see if we can leverage CASSANDRA-9975 to avoid wrapping iterators.

          Stefania Stefania added a comment -

          Rebased and squashed, Branimir Lambov:

          trunk
          patch
          testall
          dtest

          There is one issue: RangeTombstone.Bound.TOP and BOTTOM do not currently update the min and max clustering values of StatsMetadata, since their clustering values are empty arrays; see MetadataCollector.updateClusteringValues() and BTW.applyToMarker(). As a result, some range tombstone tests are failing. It's not clear to me how to capture this information in the metadata without changing its format, given that the smallest/biggest values would depend on the individual clustering types, cc Sylvain Lebresne as well.

          blambov Branimir Lambov added a comment -

          updateClusteringValues() doesn't produce what you need, as it is not just TOP and BOTTOM that are mishandled; any incomplete prefix will be, too. What you need is an actual min/maxClusteringPrefix, which is different from the min/max of each component taken separately.

          I don't know if it is possible to append the data you need at the end of the stats component and have earlier versions happily ignore that data.

          blambov Branimir Lambov added a comment -

          I personally would prefer to not modify the behaviour of MergeIterator and keep it doing one simple thing, but this approach does have its charm.

          An empty row will not work correctly as a lower bound. It does not sort as needed with respect to tombstone bounds, which should also be included in the test (more specifically, one that adds a row, flushes, deletes the same row, flushes again, then checks whether it resurfaces; I believe this would break with the current code). Use a RangeTombstoneBound with DeletionTime.LIVE as the deletion time and a bound obtained by RangeTombstone.Bound.inclusiveOpen, which should do the right thing in both directions.

          IMergeIterator.LowerBound is cryptic, rename it to IteratorWithLowerBound to be explicit about its purpose.

          The choice to set rowIndexLowerBound in partitionLevelDeletion() appears very arbitrary and fragile. What is the reason to do it separately from globalLowerBound? In fact, why have two separate bounds instead of one, set from the most precise information that is available at construction time?

          Stefania Stefania added a comment -

          Thank you for the review.

          updateClusteringValues() doesn't produce what you need, as it is not just TOP and BOTTOM that are mishandled, any incomplete prefix will also be. What you need is an actual min/maxClusteringPrefix which is different from the min/max of each component separately.

          I agree that we need a full clustering prefix. What I don't understand is how things like shouldInclude() in ClusteringIndexNamesFilter or ClusteringIndexSliceFilter work. What is an example of any other incomplete prefix and do we have a gap in the tests then?

          I don't know if it is possible to append the data you need at the end of the stats component and have earlier versions happily ignore that data.

          There are 4 components that we write in the same file: VALIDATION, STATS, COMPACTION and HEADER so we can't simply keep on reading till the end of the file. However we write a TOC with the position of each component. So it should be possible but it would require changes to MetadataSerializer.deserialize() and the signature of IMetadataComponentSerializer.deserialize(), which should receive the total size to work out if there is more stuff to read at the end. I guess we can go for it.
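          The TOC idea can be sketched as follows (hypothetical names, not the actual MetadataSerializer API): since the TOC records each component's start offset, the bytes available to a component are bounded by the next component's offset, or by the file length for the last one, and passing that size to the component serializer is what would let it detect fields appended after the ones it knows about.

          ```java
          import java.util.Arrays;

          public class Main {
              // Given the TOC's start offsets and the total file length, compute how
              // many bytes belong to each component.
              static int[] componentLengths(int[] tocOffsets, int fileLength) {
                  int[] offsets = tocOffsets.clone();
                  Arrays.sort(offsets); // ensure offsets are in file order
                  int[] lengths = new int[offsets.length];
                  for (int i = 0; i < offsets.length; i++) {
                      int end = (i + 1 < offsets.length) ? offsets[i + 1] : fileLength;
                      lengths[i] = end - offsets[i];
                  }
                  return lengths;
              }

              public static void main(String[] args) {
                  // Hypothetical layout: VALIDATION@16, STATS@64, COMPACTION@200, HEADER@240
                  // in a 300-byte file.
                  int[] lengths = componentLengths(new int[]{16, 64, 200, 240}, 300);
                  System.out.println(Arrays.toString(lengths)); // [48, 136, 40, 60]
                  // A serializer told its component spans 136 bytes, but which only
                  // recognises the first part of them, can infer that newer fields follow.
              }
          }
          ```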

          I personally would prefer to not modify the behaviour of MergeIterator and keep it doing one simple thing, but this approach does have its charm.

          The changes to MergeIterator are mostly in the candidate and really minimal; the actual algorithm is unchanged. However, if you do have a less invasive approach in mind, I'm eager to hear it.

          An empty row will not work correctly as a lower bound. It does not sort as needed with respect to tombstone bounds, which should also be included in the test (more specifically, one that adds a row, flushes, deletes same row, flushes again, then checks if it resurfaces-- I believe this would break with the current code).

          Thanks, I'll add this test.

          Use a RangeTombstoneBound with DeletionTime.LIVE as the deletion time and a bound obtained by RangeTombstone.Bound.inclusiveOpen, which should do the right thing in both directions.

          I'm not sure what you mean, do this for the test or the fix? I'm sure I'll work it out when I write the test though.

          IMergeIterator.LowerBound is cryptic, rename it to IteratorWithLowerBound to be explicit about its purpose.

          OK

          The choice to set rowIndexLowerBound in partitionLevelDeletion() appears very arbitrary and fragile. What is the reason to do it separately from globalLowerBound? In fact, why have two separate bounds instead of one, set from the most precise information that is available at construction time?

          The global lower bound is free, since it is available in the metadata. The index lower bound is more accurate but it requires seeking the index file. Calling super.partitionLevelDeletion() also involves initializing the iterator and accessing the data file (AbstractSSTableIterator constructor). So we decided to use this more accurate bound only when we really have to access the index anyway, that is when partitionLevelDeletion() is called and there are tombstones. See this comment above.

          I hope to resume this in the next few days.

          blambov Branimir Lambov added a comment -

          What is an example of any other incomplete prefix and do we have a gap in the tests then?

          Tombstones. A DELETE WHERE pk = ? AND ck1 = ? in a table with key (pk, ck1, ck2) will generate one.

          What I don't understand is how things like shouldInclude() in ClusteringIndexNamesFilter or ClusteringIndexSliceFilter work.

          If you look at the callsites for the method, you will see that they do more work in the presence of tombstones. So one solution is not to use the min/maxClusteringValues in that case.

          [MetadataSerializer.deserialize()] should receive the total size to work out if there is more stuff to read at the end.

          No need for that, you can set a flag in Version to tell you whether or not the information is present.

          I'm not sure what you mean, [use a RangeTombstoneBound] for the test or the fix?

          This is the fix. Instead of an empty row, the lower bound should be a RangeTombstoneBound as described.

          The global lower bound is free, since it is available in the metadata. The index lower bound is more accurate but it requires seeking the index file.

          In the way you use this class, by the time lowerBound() is called, all of this is already done (by UnfilteredRowMergeIterator.create), possibly unnecessarily (if MergeIterator.OneToOne is to be used). I would just move finding the bound to lowerBound(), and I don't think it's even necessary to save the bound-- just retrieve it there, the method won't be called more than once.
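          The lazy-open idea discussed here can be sketched as a toy model (illustrative Python, not Cassandra's actual classes): a merge keeps each source in a heap keyed by its cheap lower bound, and only pays the "seek" (opens the source) when that source reaches the head of the heap.

```python
import heapq

class LazySource:
    """Toy stand-in (not Cassandra's API) for an sstable iterator that
    knows a cheap lower bound for its smallest key and only "opens"
    (pays the disk seek) when the merge actually needs its data."""
    def __init__(self, lower_bound, items):
        self.lower_bound = lower_bound   # e.g. taken from sstable metadata
        self._items = sorted(items)
        self.opened = False              # records whether we paid the seek

    def open(self):
        self.opened = True
        return iter(self._items)

def merge_first(sources, limit):
    """Return the first `limit` keys of the merge; a source is opened
    only once its lower bound reaches the head of the heap."""
    heap = [(s.lower_bound, i, None, s) for i, s in enumerate(sources)]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < limit:
        key, i, it, source = heapq.heappop(heap)
        if it is None:                   # still closed: open lazily now
            it = source.open()
            first = next(it, None)
            if first is not None:
                heapq.heappush(heap, (first, i, it, source))
            continue
        out.append(key)
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i, it, source))
    return out
```

          With a small LIMIT over two sources whose lower bounds do not overlap, only the first source is ever opened; the second is never touched, which is the disk-seek saving this ticket is after.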

          Stefania Stefania added a comment -

          Tombstones. A DELETE WHERE pk = ? AND ck1 = ? in a table with key (pk, ck1, ck2) will generate one.

          I will add this test, thanks.

          So one solution is not to use the min/maxClusteringValues in that case.

          So maybe we could simply return a null lower bound in the presence of tombstones or is this too much of a compromise?

          No need for that, you can set a flag in Version to tell you whether or not the information is present.

          Doesn't this only indicate a different sstable version ("la", "ma", etc)?

          This is the fix. Instead of an empty row, the lower bound should be a RangeTombstoneBound as described.

          Thanks.

          I would just move finding the bound to lowerBound(), and I don't think it's even necessary to save the bound

          OK

          Stefania Stefania added a comment -

          Pushed a commit where we don't use the lower bound in the presence of tombstones and DeleteTest is now passing. CI is still pending however. I would like to point out that iter.partitionLevelDeletion() is currently called for all sstables by queryMemtableAndDiskInternal() and therefore, in the presence of tombstones, we have to access the sstable anyway.

          Tombstones. A DELETE WHERE pk = ? AND ck1 = ? in a table with key (pk, ck1, ck2) will generate one.

          This case already existed in DeleteTest.testDeleteWithRangeAndTwoClusteringColumns(). It does not fail though, because the clustering comparator compares prefix values from first to last and so it works fine with incomplete prefixes.
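          The first-to-last comparison that makes incomplete prefixes safe can be illustrated with a toy component-wise comparator (plain Python tuples standing in for clustering prefixes, not Cassandra's ClusteringComparator):

```python
def compare_prefix(a, b):
    """Toy component-wise clustering comparison on tuples: shared
    components are compared first to last, so an incomplete prefix
    still orders correctly; a shorter prefix whose shared components
    all match sorts before its extensions."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared components equal: the shorter (incomplete) prefix first.
    return (len(a) > len(b)) - (len(a) < len(b))
```

          An incomplete prefix like (1,) compares consistently against full clusterings such as (1, 2) without any special handling.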

          An empty row will not work correctly as a lower bound. It does not sort as needed with respect to tombstone bounds, which should also be included in the test (more specifically, one that adds a row, flushes, deletes same row, flushes again, then checks if it resurfaces-- I believe this would break with the current code). Use a RangeTombstoneBound with DeletionTime.LIVE as the deletion time and a bound obtained by RangeTombstone.Bound.inclusiveOpen, which should do the right thing in both directions.

          I don't think this matters any longer if we never replace a tombstone with a lower bound, but I could not reproduce any failures even when using a lower bound for tombstones (other than for one-sided range tombstones). Here is what I tried:

          • add a row, flush, delete same row, flush again => FINE (we don't even use merge iterator)
          • add two rows in same partition, flush, delete one and flush, delete other one and flush, FINE again
          • add three rows in same partition, flush, delete one and flush, delete other one and flush, FINE again

          We can replace BTreeRow.emptyRow(ret) with new RangeTombstoneBoundMarker(RangeTombstone.Bound.inclusiveOpen(filter.isReversed(), ret.getRawValues()), DeletionTime.LIVE) if there is still a valid reason and ideally a failing test would be useful. I will clean up lowerBound() and add some comments to it in the next round once we have reached a decision.

          blambov Branimir Lambov added a comment -

          iter.partitionLevelDeletion() is currently called for all sstables by queryMemtableAndDiskInternal()

          That is, your code is currently completely disabled by the if (... iterator != null) return null; check in lowerBound()? Is there a reason to want the code committed then?

          We can replace BTreeRow.emptyRow(ret) with new RangeTombstoneBoundMarker(RangeTombstone.Bound.inclusiveOpen(filter.isReversed(), ret.getRawValues()), DeletionTime.LIVE) if there is still a valid reason and ideally a failing test would be useful.

          It is not as easy to fully break it as I was expecting, but in the presence of tombstones you can still break a basic feature of the iterators – the inequality of the returned elements. A test that does

                  createTable("CREATE TABLE %s (a int, b int, c text, primary key (a, b))");
                  execute("INSERT INTO %s (a, b, c) VALUES(1, 1, '1')");
                  execute("INSERT INTO %s (a, b, c) VALUES(1, 3, '3')");
                  execute("DELETE FROM %s where a=1 and b >= 2 and b <= 3");
                  execute("INSERT INTO %s (a, b, c) VALUES(1, 2, '2')");
                  flush();
          
                  execute("DELETE FROM %s where a=1 and b >= 2 and b <= 3");
                  flush();
          
                  execute("SELECT * FROM %s WHERE a = ?", 1);
          

          will end up with an iterator that lists two tombstone markers with equal clustering. Unfortunately that's filtered out before being returned so it's not trivial to write a test that checks this. On the other hand, an assertion in UnfilteredIteratorWithLowerBound that saves the returned lowerBound() and checks the first next() to be greater-than or equal may be something we want to have anyway.
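          The suggested assertion can be sketched as a wrapper iterator (illustrative Python, names are ours rather than Cassandra's UnfilteredIteratorWithLowerBound): it remembers the lower bound handed to the merge and verifies that the first element produced does not sort before it.

```python
class CheckedIterator:
    """Sketch of the proposed sanity check: remember the lower bound
    given to the merge and assert the first real element is not
    smaller than it. Names are illustrative, not Cassandra's."""
    def __init__(self, lower_bound, iterator):
        self.lower_bound = lower_bound
        self._it = iter(iterator)
        self._checked = False

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._it)
        if not self._checked:
            # The invariant under discussion: first element >= lower bound.
            assert item >= self.lower_bound, \
                "first element %r sorts before lower bound %r" % (item, self.lower_bound)
            self._checked = True
        return item
```

          A too-high lower bound trips the assertion on the first next() call, surfacing exactly the kind of ordering violation described above even when later filtering would hide it.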

          Stefania Stefania added a comment -

          That is, your code is currently completely disabled by the if (... iterator != null) return null; check in lowerBound()? Is there a reason to want the code committed then?

          It is enabled when there are no tombstones since the iterator is null. However we can do better: when there are tombstones we can still use the row index bound if there is one.

          On the other hand, an assertion in UnfilteredIteratorWithLowerBound that saves the returned lowerBound() and checks the first next() to be greater-than or equal may be something we want to have anyway.

          I've added this. Whilst the test you proposed is fine, existing tests (e.g. DeleteTest.testDeleteWithRangeAndOneClusteringColumn()) started to fail but only when compaction is enabled. It seems we can have sstables with no tombstones, valid clustering values (e.g. 0) and yet the first item is a Bound.BOTTOM.

          Further, it seems that static columns could also have problems with clustering values so it's not just tombstones. See the comment in SinglePartitionReadCommand.shouldInclude().

          Moving back to in progress, might end up modifying sstable metadata format after all.

          Stefania Stefania added a comment -

          It seems we can have sstables with no tombstones, valid clustering values (e.g. 0) and yet the first item is a Bound.BOTTOM.

          It's once again the metadata that is wrong: maxLocalDeletionTime is Integer.MAX_VALUE even though there are tombstones in the sstable, as a consequence sstable.hasTombstones() returns an incorrect value.

          Stefania Stefania added a comment -

          I think the issue is in MetadataCollector:

              public void update(Cell cell)
              {
                  updateTimestamp(cell.timestamp());
                  updateTTL(cell.ttl());
                  updateLocalDeletionTime(cell.localDeletionTime());
              }
          

          Shouldn't we check that the cell is a tombstone before updating the local deletion time?

          blambov Branimir Lambov added a comment -

          Further, it seems that static columns could also have problems with clustering values so it's not just tombstones. See the comment in SinglePartitionReadCommand.shouldInclude().

          Isn't the static row outside the iteration? Unless I'm very mistaken, it shouldn't enter MergeIterator at all, and thus the bound needn't be corrected for it.

          Additionally, if static rows are needed the merge will query each iterator's staticRow() method which will initialize the iterator.

          It is enabled when there are no tombsones since the iterator is null.

          Sorry, my bad, I have overlooked partitionLevelDeletion() not calling the super method in that case.

          It's once again the metadata that is wrong: maxLocalDeletionTime is Integer.MAX_VALUE even though there are tombstones in the sstable, as a consequence sstable.hasTombstones() returns an incorrect value.

          So we can't really trust hasTombstones()? This may be a significant bug.

          Stefania Stefania added a comment -

          The latest commit is ready for review here. CI is still pending however.

          Isn't the static row outside the iteration?

          That's a very good point, thanks for pointing this out. So we don't need to worry about this.

          So we can't really trust hasTombstones()? This may be a significant bug.

          This method was introduced in this patch but currently the same logic is used in SinglePartitionReadCommand.queryMemtableAndDiskInternal() and queryMemtableAndSSTablesInTimestampOrder to determine which non-overlapping sstables contain no tombstones at all, see especially the comment in the second method:

                  // This mean that nothing queried by the filter can be in the sstable. One exception is the top-level partition deletion
                  // however: if it is set, it impacts everything and must be included. Getting that top-level partition deletion costs us
                  // some seek in general however (unless the partition is indexed and is in the key cache), so we first check if the sstable
                  // has any tombstone at all as a shortcut.
                  if (sstable.getSSTableMetadata().maxLocalDeletionTime == Integer.MAX_VALUE)
                      continue; // Means no tombstone at all, we can skip that sstable
          

          The trouble is that in other places we assume the opposite, for example CompactionController.getFullyExpiredSSTables() requires maxDeletionTime to be Integer.MAX_VALUE in the presence of live cells or else it might expire the sstable. I think the code in SinglePartitionReadCommand is wrong and we should have looked at the minDeletionTime instead, so I've updated the new method accordingly.
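          The max-versus-min distinction can be shown with a small model (illustrative Python; the constant and function names are ours, not Cassandra's actual fields):

```python
# Illustrative model of the shortcut questioned above.
NO_DELETION = 2**31 - 1  # Integer.MAX_VALUE == "this cell is live"

def has_tombstones_via_max(deletion_times):
    # The questioned shortcut: a single live cell pushes the max to
    # MAX_VALUE, hiding tombstones that are also present in the sstable.
    return max(deletion_times) != NO_DELETION

def has_tombstones_via_min(deletion_times):
    # Using the minimum instead: any tombstone (finite local deletion
    # time) pulls the min below MAX_VALUE regardless of live cells.
    return min(deletion_times) != NO_DELETION
```

          With one live cell and one tombstone, the max-based check reports "no tombstones" while the min-based check correctly detects one; with only live cells, both agree.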

          I've had some failures with upgrade dtests, due to the clustering values of old sstables ("jb" format) containing more values than the comparator types. I've added a check on Version.hasNewStatsFile() to prevent this and some assertions to spot it sooner, and this seems to have fixed the failing tests.

          To wrap up the discussion on upgrading sstable stats, this would allow applying this optimization also to sstables with tombstones and that are smaller than DatabaseDescriptor.column_index_size_in_kb, 64kb by default. For larger sstables with tombstones, the lower bound comes from the partition index and this will be read anyway in the presence of tombstones when partitionDeletion() is called. This requires also initializing the iterator, which will cause a disk seek to the partition start if the partition index is not available (sstables smaller than 64kb) or there are static columns in the query, see the AbstractSSTableIterator constructor.

          Therefore I don't see the point in upgrading the stats, but should we wish to do it, here is the work required:

          • upgrade the sstable version from "ma" to "mb" (unsure what this entails exactly, other than changing BigVersion and BigFormat)
          • introduce a serializer for clustering prefixes that is able to distinguish between clusterings and bounds by using functionality similar to what's in UnfilteredSerializer
          • pass the clustering types, or ideally the SerializationHeader, to the StatsMetadata serialize and deserialize methods - or alternatively add the min and max clustering prefixes to SerializationHeader.Component, which is also a metadata component. However in this case we'd have to do some more work to pass to it the min and max clustering prefixes, which can instead be easily collected for StatsMetadata by using the existing method MetadataCollector.updateClusteringValues(). I suppose they could also go into EncodingStats, which is serialized in the header, but I am not sure if this would increase the size of partitions excessively because it seems we also use EncodingStats for partitions.
          blambov Branimir Lambov added a comment -

          Thank you, this now looks almost ready to go. This is not a problem of the new code, but I am worried about the partitionLevelDeletion > minTimestamp check for tables with tombstones, though. It seems that the code is still willing to skip a table with tombstones if it doesn't delete the whole partition. Could that be right?

          Nits:

          Shouldn't minTimestamp be updated only if shouldInclude()?

          The trace on merging doesn't seem to be using the right value.

          You should use Transformation for withSSTablesIterated, and I'd actually do isEmpty() explicitly-- the new code is harder to read but I don't think it improves anything.

          CQLTester.cfs() is redundant (use getCurrentColumnFamilyStore()).

          Stefania Stefania added a comment -

          Latest commit available here.

          I am worried about the partitionLevelDeletion > minTimestamp check for tables with tombstones, though. It seems that the code is still willing to skip a table with tombstones if it doesn't delete the whole partition. Could that be right?

          You are right to be worried, this test fails on trunk:

              createTable("CREATE TABLE %s (a int, b int, c text, primary key (a, b))");
          
              for (int i = 0; i < 10; i++)
                  execute("INSERT INTO %s (a, b, c) VALUES(1, ?, 'abc')", i);
              flush();
          
              execute("DELETE FROM %s WHERE a=1 and b <= 3");
              flush();
          
              assertEmpty(execute("SELECT * FROM %s WHERE a=1 and b <= 2"));
          

          I git-annotated the original code in 2.2, which is in CollationController at line 293 and traced it to CASSANDRA-5514. I think the comment explaining it is this. However, perhaps back then range tombstones did not exist? BTW, this comment explains quite well the raison d'ĂȘtre of the min and max clustering values in the stats.

          Shouldn't minTimestamp be updated only if shouldInclude()?

          I agree and we should also update minTimestamp by looking at memtables as well now that CASSANDRA-9949 is available.

          The trace on merging doesn't seem to be using the right value.

          Done, thanks.

          You should use Transformation for withSSTablesIterated, and I'd actually do isEmpty() explicitly-- the new code is harder to read but I don't think it improves anything.

          Done. Because BaseRows eagerly caches the static row in its constructor, I had to change one thing: rather than counting an sstable as iterated when the iterator is initialized, we do so when computeNext() is called. I think this is fair since the constructor only accesses the data partition in very limited cases as discussed above. However we do access the partition index, which prior to this we would only do for sstables with tombstones. If this is not acceptable then either we don't use Transformation or we change BaseRows.

          I've also attached a boolean to the iterator itself rather than counting the tables via sstablesIterated and I was therefore able to get rid of QueryMemtableAndDiskStatus. It seems a bit cleaner like this. I called the method returning that boolean iterated() but perhaps we can find a better name.

          CQLTester.cfs() is redundant (use getCurrentColumnFamilyStore()).

          Done, thanks.

          Branimir Lambov added a comment -

          LGTM.

          Nits (feel free to ignore):

          You can use count instead of mapToInt(e -> 1).sum().
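          This nit is mechanical; a minimal, self-contained sketch of the refactor (the class name and list contents are illustrative, not from the patch):

```java
import java.util.List;

public class CountExample {
    // Before: counting by mapping every element to 1 and summing.
    static int countBySum(List<String> items) {
        return items.stream().mapToInt(e -> 1).sum();
    }

    // After: Stream.count() expresses the same thing directly.
    static long countDirect(List<String> items) {
        return items.stream().count();
    }

    public static void main(String[] args) {
        List<String> sstables = List.of("ma-1-big", "ma-2-big", "ma-3-big");
        System.out.println(countBySum(sstables) + " " + countDirect(sstables)); // 3 3
    }
}
```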

          The ListUtils terminology is strange: union is the concatenation and sum is the union? It's also not very efficient. I don't think the two separate lists make much sense versus using a single list with an instanceof check in onPartitionClose.
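          The single-list suggestion can be sketched with hypothetical iterator types; the names below (PartitionIterator, MemtableIterator, SSTableIterator) are illustrative stand-ins, not the actual Cassandra classes:

```java
import java.util.List;

public class InstanceofCountExample {
    // Hypothetical marker types standing in for the two iterator kinds
    // (memtable-backed vs sstable-backed); names are illustrative only.
    interface PartitionIterator {}
    static class MemtableIterator implements PartitionIterator {}
    static class SSTableIterator implements PartitionIterator {}

    // One list holds all iterators; filter by type only where the sstable
    // count is needed, instead of maintaining two parallel lists.
    static long countSSTableIterators(List<PartitionIterator> all) {
        return all.stream().filter(it -> it instanceof SSTableIterator).count();
    }

    public static void main(String[] args) {
        List<PartitionIterator> all = List.of(
                new MemtableIterator(), new SSTableIterator(), new SSTableIterator());
        System.out.println(countSSTableIterators(all)); // 2
    }
}
```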

          BaseRows eagerly caches the static row in its constructor

          Isn't mergeIterator.mergeStaticRows the one that does this? Or is the problem the single-source case? In any case, you could override UnfilteredRowIteratorWithLowerBound.staticRow to return empty if no static columns are required.

          Benedict added a comment -

          One other nit: for debugging it's much clearer if we create named instances of Transformation - these can be declared inline in the method, so it's only one extra line of code.
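          The debugging benefit of naming can be shown with a toy stand-in for Transformation (this mimics only the naming point, not the real org.apache.cassandra.db.transform API):

```java
public class NamedTransformationExample {
    // Toy stand-in for Cassandra's Transformation base class; illustrative only.
    static abstract class Transformation {
        abstract int apply(int row);
    }

    static Transformation anonymousInstance() {
        // Anonymous subclass: appears in stack traces as
        // NamedTransformationExample$1, which tells you nothing while debugging.
        return new Transformation() {
            @Override int apply(int r) { return r + 1; }
        };
    }

    static Transformation namedInstance() {
        // Named local class: one extra line of code, but traces and debuggers
        // now show a meaningful name for this step of the pipeline.
        class AddOne extends Transformation {
            @Override int apply(int r) { return r + 1; }
        }
        return new AddOne();
    }

    public static void main(String[] args) {
        System.out.println(anonymousInstance().getClass().getSimpleName()); // "" (empty)
        System.out.println(namedInstance().getClass().getSimpleName());     // AddOne
    }
}
```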

          Stefania added a comment -

          Thank you for your comments; refer to this commit:

          ‱ Re-introduced a single list of iterators; we filter by iterator type before counting the sstables iterated.
          ‱ Overridden UnfilteredRowIteratorWithLowerBound.staticRow, and yes, the problem was the single-source case.
          ‱ Replaced the inline Transformation instance with a named class.
          ‱ Modified the existing tests for static columns to also test when flushing to sstables.
          ‱ Fixed some trivial warnings and removed some duplication in test code (flush(boolean forceFlush)).

          CI is still running. I will also need to squash before we can flag as ready to commit.

          Branimir Lambov added a comment -

          Changes look good and patch looks good.

          Stefania added a comment -

          Thanks for the review Branimir!

          The patch for trunk is here, rebased and squashed.

          Launched one more CI round; if all is good I will move the ticket to ready to commit:

          trunk
          testall
          dtest
          Stefania added a comment -

          CI looks good, this is ready to commit.

          Aleksey Yeschenko added a comment -

          Committed as b11fba750c610de5e97acba070cc571cf0a96416 to trunk (3.4), thanks.


            People

            ‱ Assignee: Stefania
            ‱ Reporter: DOAN DuyHai
            ‱ Reviewer: Branimir Lambov
            ‱ Votes: 7
            ‱ Watchers: 22
