Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 0.7 beta 1
    • Component/s: Core
    • Labels: None

      Attachments
    1. views-discussion-2.txt
      4 kB
      Stu Hood
    2. views-discussion.txt
      2 kB
      Stu Hood
    3. ASF.LICENSE.NOT.GRANTED--0002-add-IndexClause-to-KeyPredicate.txt
      0.9 kB
      Jonathan Ellis
    4. ASF.LICENSE.NOT.GRANTED--0001-add-rowpredicate-keypredicate-keycount-multiget.txt
      4 kB
      Jonathan Ellis
    5. 0003-Index-Patch-749.txt
      34 kB
      Vijay
    6. 0002-Config-API-Index-Patch-749.txt
      17 kB
      Vijay
    7. 0001-Thrift-Index-Patch-749.txt
      2 kB
      Vijay
    8. 0001-simple-secondary-indices.patch
      20 kB
      Gary Dusbabek

        Activity

        Gary Dusbabek added a comment -

        Any strong opinions on whether this should be supported for binary loading?

        Jonathan Ellis added a comment -

        it should be "supported" in the sense that if you want to load the index rows yourself, you should be able to do that. but we shouldn't try to create indexes from the serialized row blobs sent to bmt (the binary memtable load path).

        Gary Dusbabek added a comment -

        Adds secondary indices on columns as specified in storage-conf:
        <Index On='some-col-name'/>

        One assumption this patch makes is that the column names and values are all strings. If we wanted to lift that restriction, we could specify a class that can 'stringify' a column name or value as part of the XML above.
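
        For illustration, a fuller storage-conf.xml fragment might look like the following, assuming the Index element nests inside the ColumnFamily declaration (the CF name, comparator, and column name here are made up, not from the patch):

            <ColumnFamily Name="Users" CompareWith="UTF8Type">
                <Index On="email"/>
            </ColumnFamily>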

        Stu Hood added a comment -

        Is it worth creating a secondary index that only contains local data, versus a distributed secondary index (a normal ColumnFamily)?

        Also, adding an example/dummy predicate that uses the index would be useful.

        Jonathan Ellis added a comment -

        > Is it worth creating a secondary index that only contains local data, versus a distributed secondary index (a normal ColumnFamily)?

        I'm not sure why that would be useful.

        What we're trying to do here is move to the server a pattern that can be done more efficiently there, so that every client doesn't have to reimplement it manually.

        Stu Hood added a comment -

        > I'm not sure why that would be useful.
        A distributed secondary index would allow you to query one machine to figure out what other machines have columns matching a predicate, as opposed to a local secondary index where you immediately query every machine to figure out whether a predicate matches.

        Jonathan Ellis added a comment -

        you mean, make each machine hold a copy of the full index?

        that's worth thinking about, but it's not useful yet since we don't support index scans anyway. (another ticket.)

        let's keep this ticket's scope to just automating the use of another CF to look up keys by value.

        Stu Hood added a comment -

        > you mean, make each machine hold a copy of the full index?
        I mean a distributed secondary index would be stored in a true ColumnFamily, so it would be partitioned like any other CF.

        > let's keep this ticket's scope to just automating the use of another CF to look up keys by value.
        In the local secondary index case, perhaps the index should be another component of the ColumnFamilyStore, tied to each Memtable/SSTable, rather than being a separate ColumnFamily. This gains us a lot of efficiency, and some consistency.

        We already have the primary index file (Index.db) on disk, so secondary indexes would be similar: (column, datafile_offset) tuples. Consistency-wise, all replication and repair happen at the ColumnFamily level, so replication might repair the data ColumnFamily but not its index, for instance.
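
        Roughly, by analogy (layout simplified to the essentials):

            Index.db (primary):    row key       ->  offset of the row in the data file
            secondary (proposed):  column value  ->  offset of the row in the data file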

        Jonathan Ellis added a comment -

        I think having each node index its CFs locally is a lose for us, because as you say we have to query the full cluster for any index lookup, since we are throwing away our usual partitioning scheme.

        This means we add a ton of complexity (adding a completely different query path) in exchange for not being able to scale these queries, since the work generated increases in lockstep w/ machines added.

        Stu Hood added a comment -

        It would appear that the HBase folks have had this exact same discussion, and have settled on two disparate packages for local and distributed secondary indexes: http://issues.apache.org/jira/browse/HBASE-2037

        If we are going to settle on one (or some kind of hybrid?) we need to think more about use cases.

        Stu Hood added a comment -

        I've been thinking about this more, and I don't think implementing secondary indexes is worth it: distributed or otherwise. Instead, I think the 'view' approach that CouchDB and Riak have taken is definitely superior.

        For instance, it is easy to implement a secondary index as a view of a ColumnFamily: the key for the view is the value of the indexed column, and the value for the view is the key of the original row. But views are considerably more powerful, since you can store any item in the key or value for the view.
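
        As a concrete (made-up) example, indexing the email column of a Users CF as a view:

            base row in Users:            "user42"           -> { email: "ada@example.com", ... }
            row in the UsersByEmail view: "ada@example.com"  -> { value: "user42" }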

        Also, a view is more conducive to duplication of data, which we prefer in Cassandra: rather than having secondary indexes pointing to the one true copy of the data, you can duplicate that data in a view if you'd like, and have it be lazily/eagerly updated serverside.

        Yes, views might mean a server-side scripting language, or an easy way to plug in and configure Java view classes. It might even mean map-reduce.

        Jonathan Ellis added a comment -

        gary, do you think this is worth committing to 0.6 given the String limitation? for 0.7 we will almost certainly move to byte[] keys which would make the column -> key thing much more sane.

        Gary Dusbabek added a comment -

        It might as well wait until 0.7 then.

        There will still be a string restriction on the column name, unless there is a practical way to express a binary column name in storage-conf.xml where the secondary indices are declared. Maybe add an attribute to <Index> to indicate the On attribute contains base64 data or something like that. Ugly crap though.

        Jonathan Ellis added a comment -

        ok, let's push this to 0.7 then.

        Stu Hood added a comment -

        Thoughts on incrementally updated views:

        In a first version a View CF could be defined with a normal column family definition, plus a function that transforms a mutation against a 'base' column family into zero or more mutations against the 'view' column family.

        At mutation time, inserts can immediately be transformed into inserts to the view. But, an insert that overwrites an older value implies deletion from the base, and therefore a potential deletion from the view.

        One disadvantage/advantage Cassandra has is that an entire row is not available at mutation time, so we need to defer all deletes to the view until read time. To handle deletes, all columns in the view cf should be tagged with the column_key from the base cf that caused their creation.

        At read time, the base cf column_keys that are tagging the columns in the view need to be queried and the view function re-applied. If the output of the view function no longer causes an insert to the column in question in the view, the column can be deleted from the view.

        Definitely need to think of ways to efficiently account for deletes, and minimize the number of reads to the base cf that need to occur at read time for the view cf.
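
        A toy sketch of that read-time cleanup, modeling each CF as a plain map and deferring view deletes until the view is read (every name here is hypothetical, not code from any attached patch):

            import java.util.*;

            /** Toy model: a "CF" is key -> (column name -> value); each view column
             *  is tagged with the base-cf key that caused its creation. */
            class ViewCleanupSketch {
                Map<String, Map<String, String>> base = new HashMap<>();
                Map<String, Set<String>> view = new HashMap<>(); // view key -> tagged base keys

                // The view function for a simple secondary index on the "email" column.
                String viewKeyFor(Map<String, String> baseRow) { return baseRow.get("email"); }

                Set<String> readView(String viewKey) {
                    Set<String> tags = view.getOrDefault(viewKey, new HashSet<>());
                    for (Iterator<String> it = tags.iterator(); it.hasNext(); ) {
                        Map<String, String> baseRow = base.get(it.next());
                        // Re-apply the view function: if it no longer maps the base row to
                        // this view key, the entry is stale and is deleted at read time.
                        if (baseRow == null || !viewKey.equals(viewKeyFor(baseRow)))
                            it.remove();
                    }
                    return tags;
                }
            }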

        Filippo Fadda added a comment -

        Hello, I just looked at the patch code, and I saw that there is only a method to search for a row set inside the specified ColumnFamily, for the specified Column, given a value for the column itself. I think we really need a method to retrieve all the rows inside the specified ColumnFamily, for the specified Column, sorted by column value (a-z or z-a). Is there a way to do this? I don't see it in the code. Am I wrong?

        Gary Dusbabek added a comment -

        Filippo, you're conflating indexing and querying. Querying will come later. This patch merely does the indexing. It includes a simple search method for demonstration purposes.

        Filippo Fadda added a comment -

        Thank you for your answer, Gary. Any plans for querying? Do you think querying will be included in 0.7?

        Stu Hood added a comment -

        More thoughts on incrementally updated views:

        In order not to add reads to Cassandra's write path, we could implement a modification to mapreduce that splits the Map function into a map-key function and a map-value function. I'm attaching a conversation about a potential approach, but it doesn't really get into implementation details or what the API for the map-key and map-value functions would look like.

        Jonathan Ellis added a comment -

        > Is it worth creating a secondary index that only contains local data, versus a distributed secondary index (a normal ColumnFamily)?

        I think my initial reasoning was wrong here. I was anti-local-indexes because "we have to query the full cluster for any index lookup, since we are throwing away our usual partitioning scheme."

        Which is true, but it ignores the fact that, in most cases, you will have to "query the full cluster" to get the actual matching rows, b/c the indexed rows will be spread across all machines. So, having local indexes is better in the common case, since it actually saves a round trip from querying the index to querying the rows.

        Also, having each node index the rows it has locally means you don't have to worry about sharding a very large index since it happens automatically.

        Finally, it lets us use the local commitlog to keep index + data in sync.

        Jonathan Ellis added a comment -

        I'm leaning more and more towards "we should implement 2ary indexes + querying first, then later add full view support" since the former we can do w/o opening the whole user-defined-functions box, which is a pretty big deal.

        Chris Goffinet added a comment -

        +1 Jonathan. I'd love to help, what can I do to start?

        Jonathan Ellis added a comment - edited

        Step one is to convert keys from String to byte[], so that they're symmetric with columns, to avoid the problem Gary noted in his original patch. (Should probably get its own ticket.)

        Stu Hood added a comment -

        > Which is true, but it ignores the fact that, in most cases, you will have to "query the full cluster" to get the actual matching rows
        This was the point of the views being "semi-materialized". If your view contains all of the data you were interested in from the base row, and it matches a configured recency, then you don't need to query the base. Please see my comment re: "cribs" in the latest attached conversation.

        > local indexes is better in the common case, since it actually saves a round trip from querying the index to querying the rows
        I disagree. I would expect that the view would contain a large number of rows (typically depending on 1 row each), so querying for one row in the view would usually query one or two rows in the base: not necessarily thousands. Also, the partitioned index has much better best case performance: for the local secondary indexes, you always need to query every unique range/endpoint in the cluster during the first phase, and then merge sort the results from all nodes before you can return a response for even a single row. Federating without partitioning will not scale.

        Being able to implement these skinny rows (rather than the million column rows lazyboy attempts) depends on being able to support non-unique row keys, but that is basically just a compound key of the view-key and the base-key appended, as described on CASSANDRA-767.

        > locally means you don't have to worry about sharding a very large index since it happens automatically
        This is why we have load balancing.

        > since the former we can do w/o opening the whole user defined functions box which is a pretty big deal
        There is no need to allow for arbitrary functions initially if we take the same approach we've taken for comparators: to start, a new view would be defined by extending an abstract class. We could easily have a built-in "SecondaryIndex" view class that uses a matching column name/value as the row key in the view.
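
        A sketch of what that comparator-style plugin point could look like (all names here are invented, not from any attached patch):

            import java.util.List;

            /** A mutation against some CF: row key, column name, column value. */
            record Mutation(String key, String name, String value) {}

            /** Views would plug in the way custom comparators do: subclass and configure. */
            abstract class ViewDefinition {
                /** Map one mutation on the base CF to zero or more mutations on the view CF. */
                abstract List<Mutation> transform(Mutation base);
            }

            /** Built-in view giving plain secondary-index behavior:
             *  the matching column's value becomes the row key in the view. */
            class SecondaryIndexView extends ViewDefinition {
                final String indexedColumn;
                SecondaryIndexView(String indexedColumn) { this.indexedColumn = indexedColumn; }

                @Override
                List<Mutation> transform(Mutation base) {
                    if (!base.name().equals(indexedColumn))
                        return List.of();
                    // view row key = indexed value; view column name = base row key
                    return List.of(new Mutation(base.value(), base.key(), ""));
                }
            }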


        Without a way to use these secondary indexes in queries, they are completely pointless. Is the intention that the indexes would be used to speed up predicates/filters in get_range_slices, or are you proposing that the secondary index/view looks and acts like a normal column family, with all of the row content, but with the secondary key as the row key? The former seems pointless, and the latter seems like it should be implemented using the partitioned secondary index approach.

        Jonathan Ellis added a comment -

        > This is why we have load balancing.

        Load balancing doesn't help if you are indexing something with fewer potential values than you have nodes in the cluster. At the extreme, say booleans, it's probably not worth indexing vs just doing full scans. But if you have 100s of nodes then not being able to usefully index something with 20 or 50 or 100 values kinda sucks.

        > We could easily have a built in "SecondaryIndex" view class that uses a matching column name/value as the row key in the view.

        That would probably work, although I don't want to fall into the trap of overgeneralizing because it's sexy. Nobody wants to write static java code to define a view, I can promise you that.

        > Is the intention that the indexes would be used to speed up predicates/filters in get_range_slices

        No, it's to add a different kind of predicate: "give me these columns [existing functionality] from rows that match this index condition [new functionality]."
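
        For concreteness, such a call might look something like this (a hypothetical shape, just to illustrate; none of these names are committed API):

            get_slices_where(column_parent,
                             index_condition,   // new: e.g. birth_date = 1975 on an indexed column
                             column_predicate,  // existing: which columns to return
                             consistency_level)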

        Stu Hood added a comment -

        > Load balancing doesn't help if you are indexing something with fewer potential values than you have nodes in the cluster
        Again, this brings up the topic of skinny rows: I'm sticking with the idea that we would want skinny rows with a compound key, so that each row key in the index/view might start with "true" in the boolean case, but the actual view row key would be a compound: "true|<base-key>". So, yes, even in the boolean case it is possible to partition the index: you would have very hot spots around "true|" and "false|", but that wouldn't stop our load balancing from splitting based on the remainder of the key.

        > Nobody wants to write static java code to define a view, I can promise you that.
        I know, but it is a temporary solution that allows us to fine tune the interface without providing scripting support or anything else crazy.

        Jonathan Ellis added a comment -

        Skinny rows is more special casing than just partitioning indexes, and doesn't solve the "how do we keep the index consistent w/ the actual data in the case of failures" problem.

        Gary Dusbabek added a comment -

        > Step one is to convert keys from String to byte[], so that they're symmetric with columns, to avoid the problem Gary noted in his original patch.

        We also still have the (minor) problem of describing the indexed column in storage-conf. <Index OnHex="0xabcd"/> or something like that ought to work.

        As to the locality discussion: I'm still in favor of keeping them node-local, at least for the way we intend to use them. (Please give me colB for rows where colA=foo and the row key is unknown.)

        Jonathan Ellis added a comment -

        It's worth pointing out that our row bloom filter rejects requests for non-existing rows very, very performantly, so the overhead for doing requests to all nodes for local indexes (or at least nodes / RF) when cardinality is high is lower than it looks at first.

        So in practice I strongly suspect this will scale at least to hundreds of nodes if not thousands, so saying "we can't do this because it won't scale" is not a strong argument.

        And when you are doing requests against an "index on a single node," the consistency problem is worse than you think. There's no way to make it consistent with a batch m/r, without a Big Lock against the CF being indexed, since if you are examining an index entry w/ no matching "natural" entry, you have no way of knowing if that's because another process is about to clean out the index entry, or add the natural entry. So you have to check each index hit for validity each time, which is a huge hit. (And allowing the user to say "stale data is okay" is wrong, because it's not "eventually consistent": once out of sync it will stay that way.)

        Jonathan Ellis added a comment -

        Another point: local indexes allow us to do indexed inequality comparisons (birth_date > $year) trivially, since we can safely make the local index sstables OPP no matter what the cluster partitioner setting is. Doing this for non-local indexes requires the cluster to be OPP, which experience has demonstrated is not what most people want to use despite its advantages.

        Stu Hood added a comment -

        > So in practice I strongly suspect this will scale at least to hundreds of nodes if not thousands
        > so saying "we can't do this because it won't scale" is not a strong argument.
        I think you're making the "speed == scalability" mistake. It doesn't matter if we could do 30k index queries per second on one node: your bound for index queries for the entire cluster would still be 30k, no matter how many nodes you added.

        > So you have to check each index hit for validity each time which is a huge hit.
        You have to do the same thing for the secondary index: presumably you actually want to find the content of the row that was indexed, and so you need to seek to the row in the indexed CF. Both solutions need this seek: one just performs it across the network.

        > you have no way of knowing if that's because another process is about to clean out the index entry, or add the natural entry.
        This is a problem, I'll admit. One option is to do something like 'view-read-repair': when retrieving the indexed row from the base, only clean up an invalid index entry after enough time has passed since the entry's creation for any in-flight writes to have completed.


        I think I'm convinced that fully materialized views will not be able to be consistent (even eventually), since the nodes storing the base/view are probably in different scopes of serializability. BUT I'm sticking to the idea that the partitioned view that queries the base for the row content is the superior one.

        Stu Hood added a comment -

        > Doing this for non-local indexes requires the cluster to be OPP, which experience has demonstrated is not what
        > most people want to use despite its advantages
        If people don't want to use OPP, most likely that is because we have more work to do on load balancing (fixing CASSANDRA-579 for instance). OPP is one of our key advantages, and throwing it away because it still needs improvement is not wise.

        Jonathan Ellis added a comment -

        > You have to do the same thing for the secondary index: presumably you actually want to find the content of the row that was indexed

        Not if you denormalize into subcolumns, you don't.

        Jonathan Ellis added a comment -

        > I think you're making the "speed == scalability" mistake

        No, I'm simply acknowledging that there's no such thing as "infinite scalability," and if this scales to the machine counts people actually deploy on then it's silly to do something more complex.

        Jonathan Ellis added a comment -

        Another point: having local indexes makes applying AND clauses touching multiple indexes easy. This is difficult to do efficiently when you have two indexes on two different machines.

        Stu Hood added a comment -

        > having local indexes makes applying AND clauses touching multiple indexes easy
        I don't think that is a road we should necessarily be going down: a compound index describing the AND is much more performant, and fits the view model much better.

        Jonathan Ellis added a comment -

        Compound indexes can be useful but they explode in size far too quickly as you add index terms to be the only solution offered.

        Jonathan Ellis added a comment -

        One disadvantage of local indexes is that they have to be rebuilt (on the receiving side) when streaming, and tombstoned on cleanup (on the sending side).

        This isn't without its silver lining, though; having to only transfer the actual data from the sending machines (where resources are scarcer than on the empty, receiving end) is a definite efficiency win.

        We're also already planning to only stream data in 579, and rebuild index + BF on the receiving side – with that in place, rebuilding local indexes will be a simple addition.

        Stu Hood added a comment -

        Cliff mentioned in #cassandra that he preferred the local secondary index approach. I think this was because adding more replicas is supposed to be a solution for scaling out reads, but I don't think it is that simple here.

        Rather than utilizing IO efficiently by partitioning, all of those nodes will have approximately the same stuff cached in memory, because they won't have been assigned anything specific to hold onto. Also, scaling up to more replicas (than you need for HA) across your entire cluster in order to improve your average latency seems like a waste of disks/machines.

        Additionally, you have to keep all of those replicas consistent, which means every node still has to answer reads eventually for read repair. Adding nodes doesn't actually eliminate any queries: it might improve your average latency, but it won't affect your throughput.

        Jonathan Ellis added a comment -

        Making read repair disable-able would be a lot smaller of a design wart than what nonlocal indexes require.

        Jonathan Ellis added a comment -

        And the performance of updating nonlocal indexes really is ass, which is why we shouldn't inflict it on everyone in the name of making things more convenient for the largest sites.

        (Remember, it's either ass because you have to do read base/update index/update index new/write base new, or because you have to set recency super low and cause a ton of check-against-base on your lookups.)

        Jonathan Ellis added a comment -

        In support of the "nonlocal indexes are too slow" argument, I present as exhibit A Google App Engine. From http://www.chariotsolutions.com/slides/pdfs/ete2009-GoogleUndertheCoversApp.pdf it appears that they use nonlocal indexes, and it's so slow that they've "improved" it by automatically retrying timed-out queries.

        Local indexes may not scale well enough to be the only solution you ever need, but nonlocal indexes are too slow to be the only solution you ever need.

        It also bothers me that nonlocal indexes with "recency" violate our "except in failure scenarios, readers see writes in ms" policy. (There is a race: if a reader sees an "empty" cell, and fills it in before the write completes, it will be marked as empty until the recency expires. So we've added a second layer of inconsistency on top of our existing one.)

        To me the decisive argument is that if someone has a workload that really needs nonlocal indexes, they can build those in their app with very little performance difference, just the round trips from app server to cassandra coordinator node. (Using the StorageProxy api would erase even this.) Clients like lazyboy already automate this.

        If you have a workload that is a good fit for local indexes (which is, as I've pointed out, at the very least "every workload in a cluster with less than some unknown number N of nodes") then there's no way to fake that from the client.

        Finally, as I've noted, the query API should be index-implementation-agnostic, so implementing local indexes now doesn't preclude adding a PartitionedByColumnIndex later.

        Stu Hood added a comment -

        > it appears that they use nonlocal indexes, and it's so slow that they've "improved" it by automatically retrying timed-out queries
        First, they're hosted on Bigtable, so of course they need timeouts. Second, they have a section titled "Constraints Precede Performance", which shows why you should constrain queries rather than allowing queries that will be slow.

        > it's either ass because you have to do read base/update index/update index new/write base new
        This applies to both local and distributed indexes. You will have to decide between eager and lazy cleanups of the view, whether you are updating it locally or remotely. The performance hit in the write path for the distributed index is having to do blocking writes to W replicas in the view before starting writes to the base.

        > There is a race: if a reader sees an "empty" cell, and fills it in before the write completes, it will be
        > marked as empty until the recency expires.
        This was one of the points of that timeout parameter being set higher than a write might take to complete: if a write was still in progress (view clock doesn't match base clock), we wouldn't populate the view.

        Jonathan Ellis added a comment - edited

        > First, they're hosted on Bigtable, so of course they need timeouts. Second, they have a section titled "Constraints Precede Performance", which shows why you should constrain queries rather than allowing queries that will be slow.

        The point is that indexed queries are still slow (hundreds of ms) on app engine. Edit: and updates too, which was where I was going here.

        > This applies to both local and distributed indexes

        But with local it's all ... local.

        > This was one of the points of that timeout parameter being set higher than a write might take to complete: if a write was still in progress (view clock doesn't match base clock), we wouldn't populate the view.

        This is failing the "complex solutions will bite you" smoke test.

        Jonathan Ellis added a comment -

        If we had megastore-style entity groups and a distributed transaction log, then keeping non-local indexes in sync becomes a non-problem. (Although performance still sucks, apparently.)

        Of course that's one hell of a prerequisite. So I still think local indexes are the sane short-term solution. And we do need indexes sooner rather than later.

        Stu Hood added a comment -

        Jonathan and I talked this afternoon, and settled a few things: this ticket
        should move forward with local indexes for now.

        We also agree that we should only allow queries against exactly one
        index to start with, which should be sufficient to solve a large number of
        problems.

        We're not entirely sure what the query API should look like yet:

        One possibility is exposing indexes as named views, which support the
        same calls that standard CFs do. Thus, the user names a view, e.g. a
        "UsersByAccount" view for the "Users" CF, which would look like a
        super CF of (indexkey, (basekey, (basecols))) to the thrift API.

        This would preclude support for indexing super CFs initially, since we
        don't support arbitrary nesting yet.

        Jonathan Ellis added a comment -

        The more I think about the pseudo-CF api the less I like it.

        We want to push filtering down to the node with the data on it, where possible. This means that, even if we only actually scan one index, we need to be able to say "AND cond2 op val2" and apply that before moving data off-node. This isn't possible with the "pretend index is a supercolumn row" approach.

        An alternative would be to combine range scans, multigets, and index scans into a single api,

        multiget(rowpredicate, columnpredicate)*

        rowpredicate would then be one of

        • named keys [current multiget]
        • key range [current get_range_slices]
        • column comparisons [index scan]

        This unifies the concept of "get a bunch of rows at once" nicely.

        *open to a different method name if we can come up with one that doesn't suck.
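
        One possible shape for the rowpredicate, sketched as a plain Java struct (field names invented; exactly one of the three variants would be set per call):

            import java.util.List;

            class IndexExpression { String column; String op; byte[] value; } // e.g. birth_date > 1975

            class RowPredicate {
                List<byte[]> keys;                 // named keys  (current multiget)
                byte[] startKey, endKey;           // key range   (current get_range_slices)
                List<IndexExpression> indexClause; // column comparisons (index scan)
            }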

        Stu Hood added a comment -

        > This isn't possible with the "pretend index is a supercolumn row" approach.
        I'm not sure that I understand why... can you give an example? The key in the pseudo CF would be the original indexed value, and each top level column in the index row would be a row from the base (from one node), so filtering within the base row could be applied locally on each node.

        > multiget(rowpredicate, columnpredicate)*
        The rowpredicate containing an "index scan" parameter is very interesting, and does clarify which operations are slow. But I can easily imagine a situation where someone wanted to use both a "named keys" and an "index scan" rowpredicate at once, which would still be very efficient, but which would require a list<rowpredicate>.

        I agree that placing the "index scan" predicate in the first position in the method call is essential, which is why I suggested the pseudo-CF api:


        An interesting parallel is to compare the proposed api to Python's array slicing syntax, which is extremely elegant. I imagine that our ideal API is one that allows either named keys or a key range at every level of nesting. The following paragraphs only refer to key/name slicing, and don't go into 'value' queries.

        As long as you concretely define a key or range of keys to search for at each level (such as [key1:key5][name1:name2][subname5]), your operation can run in bounded time. But, to provide for more flexibility, the get_range_slices method in the current API allows something like: [ ? ][name5] The question mark represents an unbounded level, which may mean a full table scan without finding 'subname5' (very dangerous, not scalable). This is one of the places where we need secondary indexes: we want columns containing any value for subname5 bunched together into an index.

        Comparing to the Python array API highlights the fact that prefix searches are always safe, and that by always having a parent predicate, you achieve bounded time operations. This is why placing the "index scan" predicate in the first position is so clear.


        This brings us back to the pseudo-CF api: why have 3 types of rowpredicates, and 2+ types of columnpredicates when, by asking users to define views that shuffle their data into a form that allows for prefix queries, we can do something like:

        multiget(list<predicate> predicates)

        ... with a predicate (key range or key list) required for every level, and only the last level allowing an unbounded predicate.

        With this API, the "named keys" + "index scan" query I pointed out above would look like (with an indexed 'age' column):

        multiget( [ predicate(key is 27), predicate(name in [ben, george]), predicate(subname is any) ] )
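
        (In Java terms, that call might look something like the following; the
        Predicate type and its factory methods here are invented purely to
        illustrate the shape of the proposal.)

        import java.util.Arrays;
        import java.util.List;

        // Hypothetical per-level predicate for the nested multiget: each level
        // of nesting gets exactly one predicate, and only the last level may
        // be unbounded.
        class Predicate {
            static Predicate is(String key)      { return new Predicate(); } // single key
            static Predicate in(String... names) { return new Predicate(); } // named keys
            static Predicate any()               { return new Predicate(); } // unbounded, last level only
        }

        class NestedMultigetExample {
            public static void main(String[] args) {
                // The "named keys" + "index scan" query above, with an indexed 'age' column:
                List<Predicate> query = Arrays.asList(
                    Predicate.is("27"),            // level 1: the indexed value (age = 27)
                    Predicate.in("ben", "george"), // level 2: named keys under that value
                    Predicate.any());              // level 3: any subname
            }
        }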

        Stu Hood added a comment -

        Thinking about this more, ANY range we allow is basically unbounded, because the distance between A and B may be billions of rows.

        Instead of disallowing them, I think the approach that get_range_slices takes for unbounded ranges (http://wiki.apache.org/cassandra/FAQ#range_ghosts) is reasonable: it bounds the time that can be spent on a specific query, because a query will return null rows rather than doing an unending scan.

        It's not ideal, because a user will probably be tempted to write a loop that pages through the null rows, but it will be a clear anti-pattern, because we can tell users: "if you're thinking about paging through the results from multiget, then you should probably be using a view instead."

        Jonathan Ellis added a comment -

        I suppose that's an option, but

        • That tombstoned keys give empty result sets is correct, if unexpected, and even so it's possibly the #1 FAQ we get. Adding Yet Another Special Case, this time with null sets to indicate that "maybe it had the columns requested, maybe not, but it didn't match your where clause" isn't even remotely intuitive.
        • In 10 years of using relational databases I can count on zero fingers the number of times that an indexed query was killing the system because of extra where clauses that made it scan extra data. Sure, it's theoretically possible, but in practice it just doesn't seem to matter.
        • "my system lets me do queries that are slow" is NOT a complaint most people have about sql. you're addressing a pain point that people just aren't feeling.

        So I'd be more inclined to say, let's do it the intuitive way.

        Stu Hood added a comment - edited

        > "maybe it had the columns requested, maybe not, but it didn't match your where clause"
        I'm not talking about looking at values here: just keys/names. So the only question that query is answering is "did this column key/name exist with these parents".

        The only way to query values with the API I proposed is to create a view by the value.

        EDIT: Hmm... now I'm not so sure. You could technically allow value queries using that same API with an additional predicate level for the value.

        > "my system lets me do queries that are slow" is NOT a complaint most people have about sql
        I disagree, but perhaps we misunderstood each other on that first point.


        Are there any alternatives to 'null rows', that aren't as awkward? Essentially, what a user is supposed to understand is: "the database had to scan 100 null rows for every 1 that it returned"

        Jonathan Ellis added a comment -

        goffinet referred me to http://portal.acm.org/citation.cfm?id=1386118.1386125 (pdf available via google scholar, e.g. http://ww2.cs.mu.oz.au/~jz/fulltext/tods08.pdf)

        Jonathan Ellis added a comment -

        google cache: http://74.125.155.132/scholar?q=cache:0twsVGLA63gJ:scholar.google.com/&hl=en&as_sdt=10000000000000
        Gary Dusbabek added a comment -

        this link works: http://www.csse.unimelb.edu.au/~jz/fulltext/acmtods08.pdf
        Jonathan Ellis added a comment -

        Took a look at the HBase stuff. Like Stu said, they have both local indexes (IHBase) and distributed ones (THBase). Local indexes were implemented second, but seem to be regarded as better for most situations since they are substantially faster. (Neither THBase nor IHBase does materialization, so that is an important difference from what we are proposing here.)

        IHBase indexes are built in memory on regionserver start, and are not persisted.

        THBase stands for transactional. 2PC is used to keep indexes in sync with the original data to avoid inconsistency problems, but "you cannot rely on the transactional properties in the face of node failure." A lot of the tickets dealing with THBase must be searched for under the term OCC [optimistic concurrency control], including the first one, HBASE-669.

        IHBase and THBase are now both part of the same "transactional" contrib package. I'm not sure if you can use both types of indexes in the same CF or Table.

        From my reading, THBase does not deal with the "very large index rows" problem. Possibly rows can already be arbitrarily large under HBase + HDFS? (Note that for us rows will always be limited to the size of a single machine's local disk, even when we fix the "fit in memory" limitation.)

        Jonathan Ellis added a comment -

        Another thought: if you are reading small "pages" worth of indexed rows out at a time, which we've been discussing here as the use case that non-local indexes are good at, then presumably you want the rows in your pages ordered by some property.

        If this property is the original, base row key (or if the user doesn't care and allows us to pick), then local indexes are just as efficient as non-local under OPP, since you can only query a single replica. (Under RP, you can't do non-local indexes at all, so local indexes still win.)

        If the property is NOT the original, base row key, then no matter what kind of index you have, you need to read all the results to sort in the desired order, in which case parallelizing that read is better than not, which you get for free with local indexes; the "index hot spot" reasoning also applies.

        Stu Hood added a comment - edited

        I think we agree that both approaches have their merits. The vast difference between their best use cases needs to be considered as we decide on a query API. In particular:

        Local indexes are better for:

        • Low cardinality fields
        • Filtering of values in base order

        Distributed indexes are better for:

        • High cardinality fields
        • Querying of values in index order

        Since we've decided to move forward with local indexes in this version, I think it would be reasonable to restrict this initial implementation to filtering of known keys (no sorting), and to put a restriction in place to prevent them from being used for querying. Additionally, we should "leave room" to add in the ability to sort/query on the indexed dimension in the future (when we implement distributed indexes).

        Stu Hood added a comment -

        Since local indexes aren't as useful for returning results in index order, and since they are much better for low cardinality fields, perhaps they should be implemented as bitmap indexes? http://en.wikipedia.org/wiki/Bitmap_index
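
        (For reference, a toy Java version of that idea: one bitmap per
        distinct value of a low-cardinality column, with bit i set when the
        row at position i holds that value. The class and its API are
        illustrative only; the point is that AND/OR combinations of clauses
        become cheap bitwise operations.)

        import java.util.BitSet;
        import java.util.HashMap;
        import java.util.Map;

        // Toy bitmap index for a single low-cardinality column.
        class BitmapIndex {
            private final Map<String, BitSet> bitmaps = new HashMap<String, BitSet>();

            // Record that the row at this position has the given value.
            void add(int rowPosition, String value) {
                BitSet bits = bitmaps.get(value);
                if (bits == null) {
                    bits = new BitSet();
                    bitmaps.put(value, bits);
                }
                bits.set(rowPosition);
            }

            // Positions of all rows with this value (returned as a copy).
            BitSet matching(String value) {
                BitSet bits = bitmaps.get(value);
                return bits == null ? new BitSet() : (BitSet) bits.clone();
            }

            // Rows matching both values at once: a single bitwise AND.
            BitSet matchingBoth(String v1, String v2) {
                BitSet result = matching(v1);
                result.and(matching(v2));
                return result;
            }
        }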

        Ryan King added a comment -

        Based on discussion (with many people, Stu included) at the Digg hackathon, we should do this:

        1. Do local indexes for low-cardinality data, including api changes to expose this.

        Among other concerns, Stu mentioned that there might be columns with such low cardinality that an index could be counterproductive. We agreed that in these cases we should still require that the user declare that they'd like an index, but avoid actually writing the index until it would be useful.

        2. For distributed indexes we should build coprocessors/triggers/callbacks (name TBD) as a separate project (and separate JIRA).

        Paul Bohm added a comment -

        > 2. For distributed indexes we should build coprocessors/triggers/callbacks (name TBD) as a separate project (and separate JIRA).
        Please explain what you mean by coprocessors/triggers/callbacks and how they'd be implemented.

        Jonathan Ellis added a comment -

        He's referring to CASSANDRA-1016. Let's keep discussion about that over there.

        Jonathan Ellis added a comment -

        Attached what I think the Thrift changes could look like.

        Schubert Zhang added a comment -

        I think the Cassandra project managers should do some feature control.
        A feature such as this one should be postponed; there are many more important things that should be done first.

        Jonathan Ellis added a comment -

        To the degree that there is project management in an Apache project, we are doing exactly that: this is very, very important to a lot of people (which you can get a sense of from the Watch list).

        Stu Hood added a comment -

        RE: 0002: How do you specify the index to match the IndexClause against? If we're requiring named indexes, the name should probably be a member of the IndexClause object.

        Additionally, RowPredicate should probably hold a List<IndexClause> or some other structure allowing people to AND or OR the clauses.

        Jonathan Ellis added a comment -

        You're right, IndexClause needs a column [name] field.

        Agreed that we will want List<IndexClause> but not sure if we want to tackle that complexity for this ticket's scope. Making it a List for forwards-compatibility, but only supporting a single element to start with, seems reasonable.
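
        (Something along these lines, presumably; this is a plain-Java
        rendering of the amended shapes, and the field names are guesses
        rather than the actual Thrift definitions attached to the subtasks.)

        import java.util.List;

        // Guessed shape of the amended structures; illustrative only.
        class IndexClause {
            String columnName;  // the column [name] field discussed above
            String value;       // EQ is the only operator within this ticket's scope
        }

        class RowPredicate {
            // A list for forwards-compatibility; only a single element
            // supported to start with.
            List<IndexClause> indexClauses;
        }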

        Vijay added a comment -

        I was watching this ticket... and I think we can do one more thing to make the local index more efficient (if the user gives hints)... For example: while writing data the user can say (hint, key) --> which will be used by the partitioner to keep the data close together within a range of nodes.... Scaling will not be a problem; when the user adds nodes it will go to the most populated node anyway.... While querying, if the user gives us a hint like (Org, valueToSearch) then we know where to search. Just an interim solution to scale queries?

        Jonathan Ellis added a comment -

        Created subtasks for this, with patches on the first two.

        Let's leave supporting operators other than EQ for another ticket.

        Vijay added a comment -

        Prototype which I did for work... Version 1 of the changes:

        Need to find a clean way to have the indexCF...
        Currently only supports the ordered partitioner... I think the internals need to be changed to support various partitioners...
        Supports multiple queries... AND and OR operations.

        TODO: respond to deletes

        Ryan King added a comment -

        I'll be OOTO with limited email until June 14.

        Jonathan Ellis added a comment -

        Vijay, did you look at the code I posted in the subtasks to this issue?

        Vijay added a comment -

        Oops, my bad, I didn't.... Should I port the patch or just wait? Either way is fine... as long as AND and OR (and OrderBy) are supported. If you want me to work on any of those tickets that's also OK with me.... Thanks Jonathan

        Jonathan Ellis added a comment -

        Done except for CASSANDRA-1156 (pending review).


          People

          • Assignee:
            Jonathan Ellis
            Reporter:
            Gary Dusbabek
          • Votes:
            2
            Watchers:
            28
