Currently, ColumnCountGetFilter and ColumnPaginationFilter both suffer from this issue: they always undercount when there are multiple versions of a cell (even when the column family's max versions is set to 1; I think this is because the extra versions still exist until a compaction happens). I looked at the ScanQueryMatcher/StoreScanner/ColumnTracker code, and it seems there is one other plausible approach to resolving this. Currently, if a filter wants to skip over a KeyValue, it has two options: skip to the next KeyValue, which could be in the same column (SKIP), or skip to the next column (SEEK_NEXT_COL). So filters have a mechanism to skip in these two ways when they exclude a value, but not when they "include" it: INCLUDE always advances to the next KeyValue. That probably makes sense for the ColumnTracker, since for column tracking we never want to seek across columns after an INCLUDE, but for filters we probably want symmetry between including and excluding KeyValues. So I am proposing something like:
1) Introduce INCLUDE_AND_SEEK_NEXT_COL to Filter.ReturnCode
2) Introduce INCLUDE_AND_SEEK_NEXT_COL to ScanQueryMatcher.MatchCode
3) Modify StoreScanner accordingly to seek to the next column after the include, and also link the two new codes above in the match() function
4) Finally, modify ColumnPaginationFilter to return SEEK_NEXT_COL and INCLUDE_AND_SEEK_NEXT_COL instead of SKIP and INCLUDE, respectively. Similarly for ColumnCountGetFilter
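To make the difference concrete, here is a rough, self-contained toy model of the steps above. It does not use the real HBase classes: the ReturnCode enum, Cell, PaginationFilter, and the scan loop are all hypothetical stand-ins, and the NEXT_ROW value and the "count up to offset + limit" logic are assumptions about how a pagination filter might behave. It only illustrates why counting every version misses columns, and how an include-and-seek code would avoid that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types are the real HBase classes.
class IncludeAndSeekSketch {

    // Hypothetical stand-in for Filter.ReturnCode, including the proposed new value.
    enum ReturnCode { INCLUDE, SKIP, SEEK_NEXT_COL, INCLUDE_AND_SEEK_NEXT_COL, NEXT_ROW }

    // One version of one column: qualifier plus timestamp.
    static final class Cell {
        final String column;
        final long ts;
        Cell(String column, long ts) { this.column = column; this.ts = ts; }
        @Override public String toString() { return column + "@" + ts; }
    }

    // Simplified pagination filter (assumed logic). seekAware=false mimics
    // SKIP/INCLUDE; seekAware=true uses the proposed seek-style codes.
    static final class PaginationFilter {
        private final int limit;
        private final int offset;
        private final boolean seekAware;
        private int count = 0;

        PaginationFilter(int limit, int offset, boolean seekAware) {
            this.limit = limit;
            this.offset = offset;
            this.seekAware = seekAware;
        }

        ReturnCode filterKeyValue(Cell cell) {
            if (count >= offset + limit) {
                return ReturnCode.NEXT_ROW; // page is full, stop scanning this row
            }
            boolean beforePage = count < offset;
            count++; // counts once per call, so per version unless the scanner seeks
            if (seekAware) {
                return beforePage ? ReturnCode.SEEK_NEXT_COL
                                  : ReturnCode.INCLUDE_AND_SEEK_NEXT_COL;
            }
            return beforePage ? ReturnCode.SKIP : ReturnCode.INCLUDE;
        }
    }

    // Toy scan over one row; cells are sorted by column, newest version first.
    static List<String> scan(List<Cell> cells, PaginationFilter filter) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < cells.size()) {
            Cell cell = cells.get(i);
            ReturnCode rc = filter.filterKeyValue(cell);
            if (rc == ReturnCode.NEXT_ROW) {
                break;
            }
            if (rc == ReturnCode.INCLUDE || rc == ReturnCode.INCLUDE_AND_SEEK_NEXT_COL) {
                out.add(cell.toString());
            }
            if (rc == ReturnCode.SEEK_NEXT_COL || rc == ReturnCode.INCLUDE_AND_SEEK_NEXT_COL) {
                // Seek past all remaining versions of the current column.
                while (i < cells.size() && cells.get(i).column.equals(cell.column)) {
                    i++;
                }
            } else {
                i++; // plain SKIP/INCLUDE: move to the next KeyValue only
            }
        }
        return out;
    }

    // Runs a page request against a fixed demo row with un-compacted versions.
    static List<String> page(int limit, int offset, boolean seekAware) {
        List<Cell> row = Arrays.asList(
            new Cell("a", 3), new Cell("a", 2),
            new Cell("b", 2), new Cell("b", 1),
            new Cell("c", 1));
        return scan(row, new PaginationFilter(limit, offset, seekAware));
    }

    public static void main(String[] args) {
        System.out.println(page(2, 0, false)); // [a@3, a@2] : only one distinct column
        System.out.println(page(2, 0, true));  // [a@3, b@2] : two distinct columns
    }
}
```

In this toy, the SKIP/INCLUDE variant spends its whole page of two on versions of column "a", so once duplicate versions are dropped later only one column would survive. The seek-style variant counts each column once and yields "a" and "b".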
This might be a more direct way of resolving this issue, and it would avoid sandwiching the column tracker between two layers of filters. What do you think, Lars?