Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-2855

Skip rows with empty columns when slicing entire row

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 0.8.8
    • Component/s: CQL
    • Labels:

      Description

      We have been finding that range ghosts appear in results from Hadoop via Pig. This could also happen if rows don't have data for the slice predicate that is given. This leads to having to do a painful amount of defensive checking on the Pig side, especially in the case of range ghosts.

      We would like to add an option to skip rows that have no column values in it. That functionality existed before in core Cassandra but was removed because of the performance penalty of that checking. However with Hadoop support in the RecordReader, that is batch oriented anyway, so individual row reading performance isn't as much of an issue. Also we would make it an optional config parameter for each job anyway, so people wouldn't have to incur that penalty if they are confident that there won't be those empty rows or they don't care.

      It could be parameter cassandra.skip.empty.rows and be true/false.

      1. ASF.LICENSE.NOT.GRANTED--v1-0001-CASSANDRA-2855-ignore-ghosts-when-no-predicate-specifi.txt
        5 kB
        T Jake Luciani
      2. 2855-v5.txt
        3 kB
        Brandon Williams
      3. 2855-v4.txt
        4 kB
        Jeremy Hanna
      4. 2855-v3.txt
        3 kB
        Jeremy Hanna
      5. 2855-v2.txt
        1 kB
        Jeremy Hanna

        Activity

        Hide
        jbellis Jonathan Ellis added a comment -

        I don't like the idea of adding flags to change behavior.

        What I think we could do is not bother including empty rows in the resultset, IF we are doing a slice query for the entire row. (Since, as soon as the tombstones expire, they will be gone anyway.)

        Show
        jbellis Jonathan Ellis added a comment - I don't like the idea of adding flags to change behavior. What I think we could do is not bother including empty rows in the resultset, IF we are doing a slice query for the entire row. (Since, as soon as the tombstones expire, they will be gone anyway.)
        Hide
        jeromatron Jeremy Hanna added a comment -

        What I think we could do is not bother including empty rows in the resultset, IF we are doing a slice query for the entire row. (Since, as soon as the tombstones expire, they will be gone anyway.)

        Yeah - our primary concern is tombstones. Would be great to get that done at a lower level.

        Show
        jeromatron Jeremy Hanna added a comment - What I think we could do is not bother including empty rows in the resultset, IF we are doing a slice query for the entire row. (Since, as soon as the tombstones expire, they will be gone anyway.) Yeah - our primary concern is tombstones. Would be great to get that done at a lower level.
        Hide
        jeromatron Jeremy Hanna added a comment -

        is it more expensive/complicated to do it for an empty slice or is that just orthogonal to this since that is handled in a different place?

        Show
        jeromatron Jeremy Hanna added a comment - is it more expensive/complicated to do it for an empty slice or is that just orthogonal to this since that is handled in a different place?
        Hide
        jbellis Jonathan Ellis added a comment -

        sylvain points out that doing this at the Thrift layer would break the row count contract. We could still do this at the CFRR level.

        Show
        jbellis Jonathan Ellis added a comment - sylvain points out that doing this at the Thrift layer would break the row count contract. We could still do this at the CFRR level.
        Hide
        jbellis Jonathan Ellis added a comment -

        is it more expensive/complicated to do it for an empty slice

        empty result for entire row slice means it really will be gone when tombstone expires, so the two are semantically equivalent. this is not the case for a smaller slice; an empty result for that could mean "there is data in the row, just not in the slice you requested." so leaving that out would be an error.

        Show
        jbellis Jonathan Ellis added a comment - is it more expensive/complicated to do it for an empty slice empty result for entire row slice means it really will be gone when tombstone expires, so the two are semantically equivalent. this is not the case for a smaller slice; an empty result for that could mean "there is data in the row, just not in the slice you requested." so leaving that out would be an error.
        Hide
        jeromatron Jeremy Hanna added a comment -

        Simple patch to skip results that have no values for the key.

        Show
        jeromatron Jeremy Hanna added a comment - Simple patch to skip results that have no values for the key.
        Hide
        jeromatron Jeremy Hanna added a comment -

        Brandon was saying that empty slice comment only referred to core Cassandra, so in the CFRR I just skipped any key didn't have values - hoping that isSetColumns handles all cases for that.

        Show
        jeromatron Jeremy Hanna added a comment - Brandon was saying that empty slice comment only referred to core Cassandra, so in the CFRR I just skipped any key didn't have values - hoping that isSetColumns handles all cases for that.
        Hide
        jeromatron Jeremy Hanna added a comment -

        v2 is tested to skip results with no columns and tombstones. Also fixed where an exception would occur because lastRow looked at the altered set of rows.

        Show
        jeromatron Jeremy Hanna added a comment - v2 is tested to skip results with no columns and tombstones. Also fixed where an exception would occur because lastRow looked at the altered set of rows.
        Hide
        jeromatron Jeremy Hanna added a comment - - edited

        Added a configuration property cassandra.skip.empty.results which defaults to false. We can't skip just complete empty rows because there is no way to tell if the complete row is empty based on a result from a slice predicate.

        Show
        jeromatron Jeremy Hanna added a comment - - edited Added a configuration property cassandra.skip.empty.results which defaults to false. We can't skip just complete empty rows because there is no way to tell if the complete row is empty based on a result from a slice predicate.
        Hide
        brandon.williams Brandon Williams added a comment -

        skip.empty.results should probably be 'skip.empty.rows' or 'skip.tombstones' and there needs to be a check on the predicate to see if it covers the entire row, and if so suppress the tombstone, but if not return the empty slice.

        Show
        brandon.williams Brandon Williams added a comment - skip.empty.results should probably be 'skip.empty.rows' or 'skip.tombstones' and there needs to be a check on the predicate to see if it covers the entire row, and if so suppress the tombstone, but if not return the empty slice.
        Hide
        jeromatron Jeremy Hanna added a comment -

        v4 updates the config var to cassandra.skip.empty.rows and only does so if the slice predicate is empty.

        Show
        jeromatron Jeremy Hanna added a comment - v4 updates the config var to cassandra.skip.empty.rows and only does so if the slice predicate is empty.
        Hide
        jbellis Jonathan Ellis added a comment -

        If we are only only skipping when it the predicate covers the entire row (which is the Right Thing imo), why do we need the configuration setting? Can't we make it just always skip? Look at it this way: you're giving the same result that the user would see anyway if he had a lower tombstone grace.

        Show
        jbellis Jonathan Ellis added a comment - If we are only only skipping when it the predicate covers the entire row (which is the Right Thing imo), why do we need the configuration setting? Can't we make it just always skip? Look at it this way: you're giving the same result that the user would see anyway if he had a lower tombstone grace.
        Hide
        jeromatron Jeremy Hanna added a comment -

        True - wouldn't matter.

        Show
        jeromatron Jeremy Hanna added a comment - True - wouldn't matter.
        Hide
        brandon.williams Brandon Williams added a comment -

        v5 removes the skip.empty.rows option making this behavior always-on.

        Show
        brandon.williams Brandon Williams added a comment - v5 removes the skip.empty.rows option making this behavior always-on.
        Hide
        jeromatron Jeremy Hanna added a comment -

        +1

        Show
        jeromatron Jeremy Hanna added a comment - +1
        Hide
        brandon.williams Brandon Williams added a comment -

        Committed

        Show
        brandon.williams Brandon Williams added a comment - Committed
        Hide
        hudson Hudson added a comment -

        Integrated in Cassandra-0.8 #368 (See https://builds.apache.org/job/Cassandra-0.8/368/)
        Skip empty rows when slicing the entire row.
        Patch by Jeremy Hanna and brandonwilliams, reviewed by Jeremy Hanna for
        CASSANDRA-2855

        brandonwilliams : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182463
        Files :

        • /cassandra/branches/cassandra-0.8/CHANGES.txt
        • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java
        Show
        hudson Hudson added a comment - Integrated in Cassandra-0.8 #368 (See https://builds.apache.org/job/Cassandra-0.8/368/ ) Skip empty rows when slicing the entire row. Patch by Jeremy Hanna and brandonwilliams, reviewed by Jeremy Hanna for CASSANDRA-2855 brandonwilliams : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182463 Files : /cassandra/branches/cassandra-0.8/CHANGES.txt /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java
        Hide
        tjake T Jake Luciani added a comment -

        Reverted will submit a new patch

        Show
        tjake T Jake Luciani added a comment - Reverted will submit a new patch
        Hide
        hudson Hudson added a comment -

        Integrated in Cassandra-0.8 #395 (See https://builds.apache.org/job/Cassandra-0.8/395/)
        Revert CASSANDRA-2855

        jake : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1199245
        Files :

        • /cassandra/branches/cassandra-0.8/CHANGES.txt
        • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java
        Show
        hudson Hudson added a comment - Integrated in Cassandra-0.8 #395 (See https://builds.apache.org/job/Cassandra-0.8/395/ ) Revert CASSANDRA-2855 jake : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1199245 Files : /cassandra/branches/cassandra-0.8/CHANGES.txt /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java
        Hide
        tjake T Jake Luciani added a comment -

        next attempt

        Show
        tjake T Jake Luciani added a comment - next attempt
        Hide
        tjake T Jake Luciani added a comment -

        marking re-opened

        Show
        tjake T Jake Luciani added a comment - marking re-opened
        Hide
        brandon.williams Brandon Williams added a comment -

        +1

        Show
        brandon.williams Brandon Williams added a comment - +1
        Hide
        brandon.williams Brandon Williams added a comment -

        Committed.

        Show
        brandon.williams Brandon Williams added a comment - Committed.
        Hide
        hudson Hudson added a comment -

        Integrated in Cassandra-0.8 #398 (See https://builds.apache.org/job/Cassandra-0.8/398/)
        Skip empty rows when entire row is requested, redux.
        Patch by tjake, reviewed by brandonwilliams for CASSANDRA-2855

        brandonwilliams : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1200471
        Files :

        • /cassandra/branches/cassandra-0.8/CHANGES.txt
        • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java
        Show
        hudson Hudson added a comment - Integrated in Cassandra-0.8 #398 (See https://builds.apache.org/job/Cassandra-0.8/398/ ) Skip empty rows when entire row is requested, redux. Patch by tjake, reviewed by brandonwilliams for CASSANDRA-2855 brandonwilliams : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1200471 Files : /cassandra/branches/cassandra-0.8/CHANGES.txt /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java
        Hide
        jeromatron Jeremy Hanna added a comment -

        fwiw - saw an interesting analogous ticket for hbase storage - https://issues.apache.org/jira/browse/PIG-2114
        it talks about omitNulls and how it's used on the load and on the store side.

        Show
        jeromatron Jeremy Hanna added a comment - fwiw - saw an interesting analogous ticket for hbase storage - https://issues.apache.org/jira/browse/PIG-2114 it talks about omitNulls and how it's used on the load and on the store side.

          People

          • Assignee:
            tjake T Jake Luciani
            Reporter:
            jeromatron Jeremy Hanna
            Reviewer:
            Brandon Williams
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development