CASSANDRA-3777

get_range_slices() always returns list of KeySlice containing all available rows even if column size is empty

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Problem
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None
    • Environment:

      Debian Squeeze

      Description

      Hi,

      we are using Cassandra to store data in super column families with a date as their name. We would like to iterate only over the keys that contain data matching a given slice range (e.g. a certain day). However, get_range_slices() always returns all rows, including those for which getColumnsSize() on the KeySlice is 0.

      In combination with Hadoop we use ColumnFamilyInputFormat, which currently supports only SliceRanges. In our setup we might have billions of rows within a column family. Even when we set a slice range, we still have to iterate over all row keys, which in my opinion doesn't make any sense.

      Let's have a look at a very simple example:

      Cassandra.Client client = ConfigHelper.createConnection("localhost", 9160, true);
      client.set_keyspace("Foo");

      SlicePredicate predicate = new SlicePredicate();
      SliceRange sliceRange = new SliceRange();
      sliceRange.start = Util.bb("I@1327273200");
      sliceRange.finish = Util.bb("I@1327273200~");
      predicate.slice_range = sliceRange;

      KeyRange keyRange = new KeyRange();
      keyRange.start_key = Util.bb("");
      keyRange.end_key = Util.bb("");

      List<KeySlice> rows = client.get_range_slices(new ColumnParent("Bar"), predicate,
              keyRange, ConsistencyLevel.ONE);

      for (KeySlice slice : rows) {
          System.out.println("key: " + new String(slice.getKey()) + ", columns: " + slice.getColumnsSize());
      }

      This is the output:

      key: I@1327359600@14@2074@478@32798@80445@2011@138@205@4320@0, columns: 0
      key: I@1327273200@12@1151@139@801@1728@2033@138@219@4476@0, columns: 1
      key: I@1327359600@14@2055@359@1032@2078@2011@138@205@4320@0, columns: 0
      key: I@1327359600@14@1151@139@801@1728@2011@138@205@4320@0, columns: 0
      key: I@1327273200@12@2074@478@32798@80445@2033@138@219@4476@0, columns: 1
      key: I@1327273200@12@2055@359@1032@2079@2033@138@219@4476@0, columns: 1

      Searching by slice range works fine, but row keys that do not match the given slice range are still part of the result list. We filter out such key slices by checking their column size, but it would make more sense to get back only the keys we are looking for (those with a column size > 0).
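
      The client-side filtering described above could look roughly like the following sketch. To keep it runnable without the Thrift bindings, it uses a hypothetical Row class as a stand-in for KeySlice (only the key and the column count); the class and method names are assumptions for illustration, not part of the Cassandra API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptyRowFilter {

    /** Hypothetical stand-in for the Thrift KeySlice: a row key plus the number of columns returned. */
    public static final class Row {
        public final String key;
        public final int columnsSize;

        public Row(String key, int columnsSize) {
            this.key = key;
            this.columnsSize = columnsSize;
        }
    }

    /** Returns only the rows that carry at least one column for the requested slice range. */
    public static List<Row> nonEmpty(List<Row> rows) {
        List<Row> result = new ArrayList<Row>();
        for (Row row : rows) {
            if (row.columnsSize > 0) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample keys taken from the output above.
        List<Row> rows = Arrays.asList(
            new Row("I@1327359600@14@2074@478@32798@80445@2011@138@205@4320@0", 0),
            new Row("I@1327273200@12@1151@139@801@1728@2033@138@219@4476@0", 1),
            new Row("I@1327359600@14@2055@359@1032@2078@2011@138@205@4320@0", 0));

        for (Row row : nonEmpty(rows)) {
            System.out.println("key: " + row.key + ", columns: " + row.columnsSize);
        }
    }
}
```

      With real KeySlice objects the same check would use slice.getColumnsSize(). Note that this only hides the empty rows from the caller; they are still transferred from the server.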

      ColumnFamilyRecordReader creates sorted maps from the result list, which means creating billions of maps and passing them to the mapper, only for them to be thrown away because they are empty.
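
      On the Hadoop side, the wasted work shows up as an early return at the top of every map call. The sketch below imitates that guard with plain java.util types; EmptyRowGuard and its String-based map signature are hypothetical and merely stand in for a real Mapper receiving the row key and the sorted column map.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class EmptyRowGuard {
    private long processed = 0;
    private long skipped = 0;

    /**
     * Imitates a single map() call: rows whose column map is empty are
     * counted and skipped, all others are counted as processed.
     * Returns true if the row was processed.
     */
    public boolean map(String key, SortedMap<String, String> columns) {
        if (columns == null || columns.isEmpty()) {
            skipped++;
            return false;
        }
        processed++;
        return true;
    }

    public long getProcessed() { return processed; }

    public long getSkipped() { return skipped; }

    public static void main(String[] args) {
        EmptyRowGuard guard = new EmptyRowGuard();

        SortedMap<String, String> empty = new TreeMap<String, String>();
        SortedMap<String, String> filled = new TreeMap<String, String>();
        filled.put("I@1327273200", "value"); // hypothetical column name and value

        guard.map("row-without-match", empty);
        guard.map("row-with-match", filled);

        System.out.println("processed: " + guard.getProcessed() + ", skipped: " + guard.getSkipped());
    }
}
```

      Counting the skipped rows makes the overhead visible, but the empty maps have already been built and handed over by the time the guard runs.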

      The question is: when using slice ranges, is there a way to get only those key slices that match the given slice range? Or is there a reason why the behaviour is as described above?

      Best Regards

      Bert Passek

        Activity

        Jonathan Ellis made changes -
        Field        Original Value    New Value
        Status       Open [ 1 ]        Resolved [ 5 ]
        Resolution                     Not A Problem [ 8 ]
        Jonathan Ellis added a comment - See http://wiki.apache.org/cassandra/FAQ#range_ghosts and CASSANDRA-3982
        Goir Riog added a comment -

        Hi,

        What's the status on this one?
        This "bug" still exists in all versions. Is there any reason why it behaves as described above?
        It's a huge network and processing overhead which could easily be avoided.

        A short comment on this one would be nice.

        Thanks
        Goir

        bert Passek created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            bert Passek
          • Votes:
            2
            Watchers:
            2
