I agree that us doing a full scan for that kind of query is confusing. In fact, that's a break of our otherwise applied rule: we don't allow queries that are not indexed. In that case we don't have an index and fallback to a full scan which we never do otherwise.
So the "logical" thing would just be to refuse that type of query (but to be clear, I think "SELECT * FROM videos" should always be allowed, because there is no surprise here, you've asked everything). We talked about allowing indexing the component of the clustering key (and though it's not done yet, I see no reason not to do it eventually), and once that is done we will be able to do those queries efficiently and it's only then that we should, again in theory, allow them.
Now in practice there is the fact that those queries more or less correspond to range_slice_queries and there is a good chance people would complain if we disallow them. I do note that it's not fully equivalent to the thrift case however, in the sense that in the thrift case you're literally asking for some sub-slice of all rows (or at least a range of rows), and in the result you will get all the rows, but with an empty set of columns if the provided filter selected nothing. In CQL3, you select only the rows where some predicate is true, so you won't get all those internal rows that have nothing for you.
force people to supply a LIMIT clause
I really don't think this is a LIMIT problem and thus I don't think forcing (or doing anything with) LIMIT is the solution. Namely, if you have billions of rows and none of them has videoname = 'My funny cat', then whatever the limit you provide (even 1) this query will timeout. Now I have some things to say about LIMIT and I've created
CASSANDRA-4918 for that, but this is a completely orthogonal problem imo.
So in terms of solutions, here are the ones I would suggest by order of preferences:
- we could add a new ALLOW FULL SCAN option to SELECT queries that would explicitly say "I allow the engine to do a full scan and thus I understand my query performance may suck immensely". We would then not allow queries like
SELECT * FROM videos WHERE videoname = 'My funny cat'
until we support 2ndary indexing videoname, but we would allow
SELECT * FROM videos WHERE videoname = 'My funny cat' ALLOW FULL SCAN
(alternative syntax could be 'ALLOW NON-INDEXED SCAN' or whatever). I think this would be in line with what we want for Cassandra: make the user explicitly conscious of the performance implications of its queries. We could even later extend the support of this 'ALLOW FULL SCAN' bits by bits to other type of queries we refuse today (though I'm certainly not implying this should be a priority).
- if others really don't like my previous idea, I do think that the logical next best thing is to refuse that type of queries pure and simple.
- as a last resort (though I don't really like it tbh), we could add some form a simple explain that would tell you whether a query is indexed or not (but I largely prefer the 'you have to explicitly say you're fine with non-indexed' solution).