CASSANDRA-1125

Filter out ColumnFamily rows that aren't part of the query (using a KeyRange)

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 0.8.2
    • Component/s: None
    • Labels: None

      Description

      Currently, when running a MapReduce job against data in a Cassandra data store, it reads through all the data for a particular ColumnFamily. This could be optimized to only read through those rows that have to do with the query.

      It's a small change but wanted to put it in Jira so that it didn't fall through the cracks.

      1. CASSANDRA-1125.patch
        6 kB
        mck
      2. CASSANDRA-1125.patch
        6 kB
        mck
      3. 1125-v3.txt
        6 kB
        Jonathan Ellis
      4. 1125-formatted.txt
        6 kB
        Jonathan Ellis

          Activity

          jbellis Jonathan Ellis added a comment -

          iow: allow specifying start and end keys.

          jbellis Jonathan Ellis added a comment -

          another option would be to use the RowPredicate with CASSANDRA-749

          jbellis Jonathan Ellis added a comment -

          I like the RowPredicate approach (introduced in CASSANDRA-1154 which you are reviewing).

          michaelsembwever mck added a comment -

          This would be a very nice feature. It has been brought up at http://thread.gmane.org/gmane.comp.db.cassandra.user/4965 and http://thread.gmane.org/gmane.comp.db.cassandra.user/6135

          michaelsembwever mck added a comment - - edited

          Jonathan: do you mean the IndexExpression and IndexClause and Table.open("Keyspace1").getColumnFamilyStore("Indexed1").scan(clause, filter);
          being used, instead of the KeyRange, inside of ColumnFamilyRecordReader.maybeInit() ??

          jbellis Jonathan Ellis added a comment -

          I think everything I wanted to do here is covered by CASSANDRA-1600, but we weren't able to reach consensus on that for 0.7 so I tabled it.

          jbellis Jonathan Ellis added a comment -

          CASSANDRA-1600 has a patch now.

          So what we need here is to allow specifying a KeyRange to the job. This will give us index queries for free; for start/end limits we'd need to limit each split to its intersection with the KeyRange start/end in CFIF.

          This would be easy but we ripped out the AbstractBounds intersection code (in part because we were never quite sure if it was entirely debugged). Time to take another stab at that, or are there other ideas?
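
The split-limiting idea above can be sketched as follows. This is only an illustration under simplifying assumptions: tokens are modeled as plain longs, ranges are start-exclusive and end-inclusive, and wrapping ranges are not handled. `TokenSpan` and `intersect` are hypothetical names, not Cassandra's Range or AbstractBounds code.

```java
import java.util.Optional;

/**
 * Hypothetical stand-in for a non-wrapping token range (start, end].
 * Tokens are modeled as plain longs; this is not Cassandra's Range class.
 */
final class TokenSpan {
    final long start; // exclusive
    final long end;   // inclusive

    TokenSpan(long start, long end) {
        if (start >= end) {
            throw new IllegalArgumentException("wrapping or empty spans are not modeled");
        }
        this.start = start;
        this.end = end;
    }

    /** Returns the overlap of the two spans, or empty if they do not overlap. */
    Optional<TokenSpan> intersect(TokenSpan other) {
        long s = Math.max(start, other.start);
        long e = Math.min(end, other.end);
        return s < e ? Optional.of(new TokenSpan(s, e)) : Optional.empty();
    }
}
```

In this sketch, a split whose intersection with the job's range is empty would simply be dropped, and every other split would be narrowed to the overlap before being handed to a record reader.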

          michaelsembwever mck added a comment - - edited

          For now (without CASSANDRA-1600) I can use a KeyRange and Range.intersectionWith(..) for start/end rowKey limits in CFIF.

          Upgrading from KeyRange to IndexClause (once it contains an optional KeyRange field) can be done easily enough later by replacing ConfigHelper.setInputKeyRange(..) with ConfigHelper.setInputIndexClause(..) and rewriting the two lines of code in CFRR's RowIterator.maybeInit(..)

          michaelsembwever mck added a comment - - edited

          Can this go into 0.8.1?
          (And can we split this issue into two: 1) for KeyRange and 2) for IndexClause?)

          jbellis Jonathan Ellis added a comment -

          Looks good to me for the most part. (Attaching reformatted version.)

          One part though I'm not 100% sure about – we're using KeyRange for start-exclusive ranges, when the Thrift API always uses it for start-inclusive.

          I'd be more comfortable with any of:

          • using a Pair<String, String>
          • using a new one-off class
          • using KeyRange but with tokens (which Thrift also uses for start-exclusive)
          • using a Range object directly (also requires tokens)
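
To make the start-inclusive versus start-exclusive concern concrete, here is a small sketch. It assumes tokens modeled as plain longs, and `ToyBounds` is a hypothetical class, not a Thrift or Cassandra type. It only shows that the same start and end values select different rows depending on how the start bound is interpreted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bounds type with an explicit flag for the start bound. */
final class ToyBounds {
    final long start;
    final long end; // always inclusive in this sketch
    final boolean startInclusive;

    ToyBounds(long start, long end, boolean startInclusive) {
        this.start = start;
        this.end = end;
        this.startInclusive = startInclusive;
    }

    /** True if the token falls inside the bounds under the chosen start semantics. */
    boolean contains(long token) {
        boolean afterStart = startInclusive ? token >= start : token > start;
        return afterStart && token <= end;
    }

    /** Returns the tokens that fall inside the bounds, in input order. */
    List<Long> select(long[] tokens) {
        List<Long> selected = new ArrayList<Long>();
        for (long token : tokens) {
            if (contains(token)) {
                selected.add(token);
            }
        }
        return selected;
    }
}
```

With rows at tokens 10, 20 and 30, bounds of 10 to 30 select all three rows when the start is inclusive but only 20 and 30 when it is exclusive, which is why reusing one type for both meanings can surprise callers.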
          jbellis Jonathan Ellis added a comment -

          (And I'd be fine with putting this in 0.8.x.)

          jeromatron Jeremy Hanna added a comment -

          So does this only include key ranges? That's what it sounds like. And it sounds like indexes are out for now too, e.g. where timebucket = 12345.

          jbellis Jonathan Ellis added a comment -

          Like Mck said, that will have to be split into another ticket, since it continues to depend on CASSANDRA-1600.

          michaelsembwever mck added a comment -

          "using KeyRange but with tokens (which Thrift also uses for start-exclusive)"

          This is my preference. I'll make a patch for it.

          jbellis Jonathan Ellis added a comment -

          v3 makes the KeyRange an implementation detail (setInputRange just takes Strings for start and end) and fixes a reference to the key fields in CFIF.

          michaelsembwever mck added a comment -

          +1 (tested) on 1125-v3.txt

          michaelsembwever mck added a comment -

          Created CASSANDRA-2878 for the better solution using an IndexClause

          jbellis Jonathan Ellis added a comment -

          committed to 0.8 and trunk

          hudson Hudson added a comment -

          Integrated in Cassandra-0.8 #214 (See https://builds.apache.org/job/Cassandra-0.8/214/)
          add KeyRange option to Hadoop inputformat
          patch by Mck SembWever; reviewed by jbellis for CASSANDRA-1125

          jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1145731
          Files :

          • /cassandra/branches/cassandra-0.8/CHANGES.txt
          • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ConfigHelper.java
          • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java
          michaelsembwever mck added a comment - - edited

          Something broke here in production once we went out with 0.8.2. It may have been some poor testing; I'm not entirely sure, and I'm a little surprised.

          CFIF:135 breaks because inside dhtRange.intersects(jobRange) there's a call to new Range(token, token) which calls StorageService.getPartitioner() and StorageService is null as we're not inside the server.

          A quick fix is to change Range:148 from new Range(token, token) to new Range(token, token, partitioner) making the presumption that the partitioner for the new Range will be the same as this Range. This won't work if the Range wraps in any way (which could be just a limitation of the current KeyRange filtering), but otherwise tests ok.
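
The shape of that quick fix can be sketched with a toy model. Everything here is hypothetical (`ToyPartitioner`, `ToyServerContext`, `ToyRange`, `subRange`); it is not the Cassandra source. It only illustrates why a constructor that looks up server-side global state fails in a client JVM such as a Hadoop task, and how reusing the existing range's partitioner avoids the lookup.

```java
/** Hypothetical minimal partitioner interface; not Cassandra's IPartitioner. */
interface ToyPartitioner {
    String name();
}

/** Stands in for a server-side singleton that is never initialized in a client JVM. */
final class ToyServerContext {
    static ToyPartitioner partitioner = null; // only set when running inside the server

    static ToyPartitioner getPartitioner() {
        if (partitioner == null) {
            throw new IllegalStateException("not running inside the server");
        }
        return partitioner;
    }
}

final class ToyRange {
    final long left;
    final long right;
    final ToyPartitioner partitioner;

    /** Convenience constructor that depends on global server state. */
    ToyRange(long left, long right) {
        this(left, right, ToyServerContext.getPartitioner());
    }

    /** Constructor that takes the partitioner explicitly and needs no global state. */
    ToyRange(long left, long right, ToyPartitioner partitioner) {
        this.left = left;
        this.right = right;
        this.partitioner = partitioner;
    }

    /** Derives a new range that reuses this range's partitioner instead of the global lookup. */
    ToyRange subRange(long newLeft, long newRight) {
        return new ToyRange(newLeft, newRight, partitioner);
    }
}
```

The design choice is the same presumption described above: any range derived from an existing range shares its partitioner, so the derived range never needs the server-side singleton.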

          jbellis Jonathan Ellis added a comment -

          Created CASSANDRA-3108 to address.


            People

            • Assignee: mck (michaelsembwever)
            • Reporter: Jeremy Hanna (jeromatron)
            • Reviewer: Jonathan Ellis
            • Votes: 4
            • Watchers: 7
