Fix Version/s: None
When extracting data for analysis (e.g., in Hadoop) using a token scan, allowing filtering on column that is part of the partition key allow for more efficient processing. For example, assume that we have the following schema (from the example defined here):
Assume that I am primarily interested in doing analysis of more recent data, so I can use a SELECT like the following to extract the data I am interested in:
The filtering is potentially expensive since it touches a lot of columns. Since the date column that is used to fragment wide rows is related to the event_time, I could apply a (redundant) filter to date, as in:
...but currently I can't add the filter on the date column because it is part of the partition key. However, because this query is doing a token scan, there really is no problem in filtering on the partition key. The predicate on date can be evaluated directly on the row index without looking at the values in columns at all. The effect is to efficiently filter out a large swath of rows, and not forcing the scan to filter on rows that couldn't possibly contain those dates.
There are probably lots of ways to optimize predicates on partition key columns. For example, if the date column was made the first column in the partition key, evaluating a range could be done without scanning the entire row index.
In this case, if we have a year of data, but are only interested in extracting the last day, so the overhead of filtering is reduced by a factor of 365.
What I am looking for is:
- If the SELECT is a token scan, allow filtering on partition key columns.
- Predicates on partition key columns are evaluated on for the row as a whole, before filtering on clustering columns.