Cassandra
  1. Cassandra
  2. CASSANDRA-1599

Add sort/order support for secondary indexing

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Later
    • Fix Version/s: None
    • Component/s: API
    • Labels:
      None

      Description

      For a lot of users paging is a standard use case on many web applications. It would be nice to allow paging as part of a Boolean Expression.

      Page -> start index
      -> end index
      -> page timestamp
      -> Sort Order

      When sorting, is it possible to sort both ASC and DESC?

        Issue Links

          Activity

          Jonathan Ellis made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Later [ 7 ]
          Gavin made changes -
          Link This issue depends upon CASSANDRA-1684 [ CASSANDRA-1684 ]
          Gavin made changes -
          Link This issue depends on CASSANDRA-1684 [ CASSANDRA-1684 ]
          Gavin made changes -
          Link This issue depends upon CASSANDRA-2915 [ CASSANDRA-2915 ]
          Gavin made changes -
          Link This issue depends on CASSANDRA-2915 [ CASSANDRA-2915 ]
          Gavin made changes -
          Link This issue depends upon CASSANDRA-1598 [ CASSANDRA-1598 ]
          Gavin made changes -
          Link This issue depends on CASSANDRA-1598 [ CASSANDRA-1598 ]
          Gavin made changes -
          Workflow patch-available, re-open possible [ 12752469 ] reopen-resolved, no closed status, patch-avail, testing [ 12755376 ]
          Gavin made changes -
          Workflow no-reopen-closed, patch-avail [ 12523847 ] patch-available, re-open possible [ 12752469 ]
          Jonathan Ellis made changes -
          Link This issue depends on CASSANDRA-1684 [ CASSANDRA-1684 ]
          Jonathan Ellis made changes -
          Assignee Jonathan Ellis [ jbellis ]
          Todd Nine made changes -
          Link This issue depends on CASSANDRA-2915 [ CASSANDRA-2915 ]
          Todd Nine made changes -
          Parent CASSANDRA-2915 [ 12514474 ]
          Issue Type Sub-task [ 7 ] New Feature [ 2 ]
          Todd Nine made changes -
          Parent CASSANDRA-2915 [ 12514474 ]
          Issue Type New Feature [ 2 ] Sub-task [ 7 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.0 [ 12316349 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.0 [ 12316349 ]
          Fix Version/s 0.8 [ 12314820 ]
          Jonathan Ellis made changes -
          Original Estimate 96h [ 345600 ] 32h [ 115200 ]
          Remaining Estimate 96h [ 345600 ] 32h [ 115200 ]
          Jonathan Ellis made changes -
          Original Estimate 96h [ 345600 ]
          Remaining Estimate 96h [ 345600 ]
          Jonathan Ellis made changes -
          Assignee Jonathan Ellis [ jbellis ]
          Hide
          Jingsi Liu added a comment - - edited

          Hi,all

          >Expose 0.8.0 distributed indexes as readonly column families which are sorted by the index value, and which are queried using get_range_slices
          i am wondering how to solve this query with the readonly index:
          email == 'bob@gmail.com' && (lastlogindate > today - 5 days || newmessagedate > today -1 day).

          since i met the similar problem: in SQL expression,it is like:
          select * from sometable where author='somebody' order by last_reply_time desc.

          or do i make some misunderstanding?

          thank very much.

          Show
          Jingsi Liu added a comment - - edited Hi,all >Expose 0.8.0 distributed indexes as readonly column families which are sorted by the index value, and which are queried using get_range_slices i am wondering how to solve this query with the readonly index: email == 'bob@gmail.com' && (lastlogindate > today - 5 days || newmessagedate > today -1 day). since i met the similar problem: in SQL expression,it is like: select * from sometable where author='somebody' order by last_reply_time desc. or do i make some misunderstanding? thank very much.
          Hide
          Stu Hood added a comment - - edited

          >> column families which are sorted by the index value
          > This is the hard part, I am having trouble wrapping my mind
          RandomPartitioner would need to use wide rows for a distributed index, but order preserving partitioners could use skinny rows.

          EDIT: I guess I'm also assuming CASSANDRA-1205, where index values can be converted to a byte[] collation key.

          > As suggested by "adding a CF index type" I think using the same index query api makes more sense.
          The comment above assumes that 1600 is a good idea: we would be using the same index query API: get_range_slices.

          Show
          Stu Hood added a comment - - edited >> column families which are sorted by the index value > This is the hard part, I am having trouble wrapping my mind RandomPartitioner would need to use wide rows for a distributed index, but order preserving partitioners could use skinny rows. EDIT: I guess I'm also assuming CASSANDRA-1205 , where index values can be converted to a byte[] collation key. > As suggested by "adding a CF index type" I think using the same index query api makes more sense. The comment above assumes that 1600 is a good idea: we would be using the same index query API: get_range_slices.
          Hide
          Jonathan Ellis added a comment - - edited

          This is sort of along the lines of what I was thinking, although I think just adding a ColumnFamily index type would be adequate to distinguish it.

          column families which are sorted by the index value

          This is the hard part, I am having trouble wrapping my mind around a scheme that allows both different sort orders in different CFs and always routes things correctly by node token no matter which CF you are talking about.

          which are queried using get_range_slices

          As suggested by "adding a CF index type" I think using the same index query api makes more sense.

          Show
          Jonathan Ellis added a comment - - edited This is sort of along the lines of what I was thinking, although I think just adding a ColumnFamily index type would be adequate to distinguish it. column families which are sorted by the index value This is the hard part, I am having trouble wrapping my mind around a scheme that allows both different sort orders in different CFs and always routes things correctly by node token no matter which CF you are talking about. which are queried using get_range_slices As suggested by "adding a CF index type" I think using the same index query api makes more sense.
          Stu Hood made changes -
          Summary Add paging support for secondary indexing Add sort/order support for secondary indexing
          Fix Version/s 0.8 [ 12314820 ]
          Fix Version/s 0.7.0 [ 12315212 ]
          Hide
          Stu Hood added a comment - - edited

          Another issue with local indexes is that implementing sorting would involve a clusterwide merge sort. A distributed index is required to efficiently return the data in index order. I think this issue should be delayed for 0.8.0 when we have distributed indexes available: the indexes available in 0.7.0 are intended for filtering data.

          As a multi-part solution, (imo) we should:

          1. (optionally) Rename local indexes to "filter_indexes" or "filters"
          2. Expose 0.8.0 distributed indexes as readonly column families which are sorted by the index value, and which are queried using get_range_slices
          3. Implement LT/LTE/GT/GTE operations for the key-range in get_range_slices

          Outcomes:

          • Your "primary" index expression would be consistently queried using the "range" parameter in get_range_slices and would define the sort order
          • "filters" (0.7.0 secondary indexes) would be applied using the IndexClause argument as described on CASSANDRA-1600

          I'm going to open another ticket to suggest some changes to index definitions to make this consistent.

          Show
          Stu Hood added a comment - - edited Another issue with local indexes is that implementing sorting would involve a clusterwide merge sort. A distributed index is required to efficiently return the data in index order. I think this issue should be delayed for 0.8.0 when we have distributed indexes available: the indexes available in 0.7.0 are intended for filtering data. As a multi-part solution, (imo) we should: (optionally) Rename local indexes to "filter_indexes" or "filters" Expose 0.8.0 distributed indexes as readonly column families which are sorted by the index value, and which are queried using get_range_slices Implement LT/LTE/GT/GTE operations for the key-range in get_range_slices Outcomes: Your "primary" index expression would be consistently queried using the "range" parameter in get_range_slices and would define the sort order "filters" (0.7.0 secondary indexes) would be applied using the IndexClause argument as described on CASSANDRA-1600 I'm going to open another ticket to suggest some changes to index definitions to make this consistent.
          Hide
          Stu Hood added a comment - - edited

          This ticket should probably be titled "allow sorting by index value", since that is not yet possible, and the paging concerns are not valid until it is implemented.

          Show
          Stu Hood added a comment - - edited This ticket should probably be titled "allow sorting by index value", since that is not yet possible, and the paging concerns are not valid until it is implemented.
          Stu Hood made changes -
          Component/s API [ 12313742 ]
          Todd Nine made changes -
          Link This issue depends on CASSANDRA-1598 [ CASSANDRA-1598 ]
          Todd Nine made changes -
          Link This issue blocks CASSANDRA-1598 [ CASSANDRA-1598 ]
          Hide
          Todd Nine added a comment - - edited

          Consider a query similar to the following.

          email == 'bob@gmail.com' && (lastlogindate > today - 5 days || newmessagedate > today -1 day).

          Which start key do I advance, one, both? As a client I would have to iterate over every field in the expression tree to determine what my start key should be for two index clauses. While this is not impossible, this becomes very complex for large boolean operand trees. As a user, this functionality would provide a clean interface that abstracts the user from the need to perform an analysis of the previous result set and "diff" it with the expression tree provided. I'm not saying it's an absolute must have, but it would certainly provide a lot of appeal to users that are utilizing Cassandra as an eventually consistent storage mechanism for web based applications once union and intersections are implemented in Cassandra.

          Show
          Todd Nine added a comment - - edited Consider a query similar to the following. email == 'bob@gmail.com' && (lastlogindate > today - 5 days || newmessagedate > today -1 day). Which start key do I advance, one, both? As a client I would have to iterate over every field in the expression tree to determine what my start key should be for two index clauses. While this is not impossible, this becomes very complex for large boolean operand trees. As a user, this functionality would provide a clean interface that abstracts the user from the need to perform an analysis of the previous result set and "diff" it with the expression tree provided. I'm not saying it's an absolute must have, but it would certainly provide a lot of appeal to users that are utilizing Cassandra as an eventually consistent storage mechanism for web based applications once union and intersections are implemented in Cassandra.
          Todd Nine made changes -
          Field Original Value New Value
          Link This issue blocks CASSANDRA-1598 [ CASSANDRA-1598 ]
          Hide
          Jonathan Ellis added a comment -

          how is this different from IndexClause.start_key?

          Show
          Jonathan Ellis added a comment - how is this different from IndexClause.start_key?
          Todd Nine created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Todd Nine
            • Votes:
              9 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 32h
                32h
                Remaining:
                Remaining Estimate - 32h
                32h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development