Thank you for the review.
updateClusteringValues() doesn't produce what you need, as it is not just TOP and BOTTOM that are mishandled, any incomplete prefix will also be. What you need is an actual min/maxClusteringPrefix which is different from the min/max of each component separately.
I agree that we need a full clustering prefix. What I don't understand is how things like shouldInclude() in ClusteringIndexNamesFilter or ClusteringIndexSliceFilter work. What is an example of any other incomplete prefix and do we have a gap in the tests then?
I don't know if it is possible to append the data you need at the end of the stats component and have earlier versions happily ignore that data.
There are 4 components that we write in the same file: VALIDATION, STATS, COMPACTION and HEADER so we can't simply keep on reading till the end of the file. However we write a TOC with the position of each component. So it should be possible but it would require changes to MetadataSerializer.deserialize() and the signature of IMetadataComponentSerializer.deserialize(), which should receive the total size to work out if there is more stuff to read at the end. I guess we can go for it.
I personally would prefer to not modify the behaviour of MergeIterator and keep it doing one simple thing, but this approach does have its charm.
The changes to MergeIterator are mostly in the candidate and really minimal, the actual algorithm is unchanged. However if you do have a less invasive approach in mind I'm eager to hear it.
An empty row will not work correctly as a lower bound. It does not sort as needed with respect to tombstone bounds, which should also be included in the test (more specifically, one that adds a row, flushes, deletes same row, flushes again, then checks if it resurfaces-- I believe this would break with the current code).
Thanks, I'll add this test.
Use a RangeTombstoneBound with DeletionTime.LIVE as the deletion time and a bound obtained by RangeTombstone.Bound.inclusiveOpen, which should do the right thing in both directions.
I'm not sure what you mean, do this for the test or the fix? I'm sure I'll work it out when I write the test though.
IMergeIterator.LowerBound is cryptic, rename it to IteratorWithLowerBound to be explicit about its purpose.
The choice to set rowIndexLowerBound in partitionLevelDeletion() appears very arbitrary and fragile. What is the reason to do it separately from globalLowerBound? In fact, why have two separate bounds instead of one, set from the most precise information that is available at construction time?
The global lower bound is free, since it is available in the metadata. The index lower bound is more accurate but it requires seeking the index file. Calling super.partitionLevelDeletion() also involves initializing the iterator and accessing the data file (AbstractSSTableIterator constructor). So we decided to use this more accurate bound only when we really have to access the index anyway, that is when partitionLevelDeletion() is called and there are tombstones. See this comment above.
I hope to resume this in the next few days.