Notes from chat:

When we repair [0, 1000], we agree on some level for the merkle tree, say 2, and we say the merkle tree leaves will be [0, 250], [250, 500], [500, 750], [750, 1000]

then each node calculate the hash for those leave base on their keys, and we compare.

We could make it a two step process, where everyone starts w/ the power of 2 tree, but then A can say "i have row 10 with a billion columns, let's subdivide [0, 250] into [0, (10, 500000000)] and [(10, 500000000), 250].

The drawback then is that you will do a first validation pass to agree on the subdivisions, then another to compute the actual hashes.

Or, we could first do a merkle tree as we do now, then for the ranges that differ, if we know they cover lots of columns (which can be computed easily initially), we could compute smaller hash ranges before streaming anything. You'd still read everything twice in the worst case, but if most rows are small then you don't need to read much the second time.

In the meantime, if you can shard your huge rows instead at the app level that will work better.

I don't think this will be necessary with

~~CASSANDRA-5351~~completed.