(1) having validation only require that the components, if they're provided, are of the correct type (looking at Sylvain's code, the compare function should work, but the validation wouldn't)
That's a bug. The last test of validate() should be negated. The only goal was forbid having leading bytes after all the comparator have been used (because compare makes the assumption that there will not have bytes without a proper comparator for them).
However it is clear that we should allow a key having less component than comparator (btw, the unit test in my patch was actually doing it).
This is fixed in the patch I'm attaching.
My other concern is that one of the things that got stripped out was the MATCH_MINIMUM, MATCH_MAXIMUM feature, which made implementing greater-than-equals, less-than-equals, etc. in ranges much easier
I'm not familiar enough with your code to be sure what those were doing exactly, but I suppose that the goal is that, if for instance you have the following column names:
then you want to be able to query anything that starts with the two component test and foo (that is test:foo:24 and test:foo:42 in my example).
Greater-than is already doable in my previous patch (up to the bug in validation). For the less-than part, I agree that it is nice to be able to do it easily. In my new patch, I add a leading byte to each component, whose purpose is to always be 0, except for lesser-than query. That way, you can do the query above easily. The price is a slightly more complicated encoding but I think it's totally worth it.
While i agree in theory, this is why thrift exists right?
I don't know thrift as well as you do, but I would think that changing the type of column name would be a massive compatibility breaker. And I really don't see a not crappy way to do it. Well, actually I do, but it consists in revamping the API completely, ditching the super columns in favor of composite column name. Don't get me wrong, I'm 100% in, but it's well beyond this particular ticket and I think that as long as we have a generic and simple enough encoding for those composite type, I believe it's ok to include this as is and just have a decent description of the encoding for client to use. We can still branch that later on on thrift anyway.
(2) creating a DynamicType that could be used for dynamic types
I do agree with you that it could be useful and I understand the use case. I also agree that whatever we do, we should make it clear that it is not beginner stuff and that you shouldn't use a dynamic composite type unless you really need it (since the possibility of shooting yourself in the foot if you don't know what you do is fairly high). All this being said, I took a stab at a DynamicCompositeType and I think the result really makes sense.
The main idea is really just to modify my static CompositeType so that each component starts with the name of the comparator class (that way, this is again completely generic, custom comparator can be used without problems). There is then 2 optimisations to make it a reasonable solution:
- FBUtilities.getComparator() uses a cache, to avoid the cost of introspection.
- There is a notion of alias. They are user defined and are local to each DynamicCompositeType (cf. test/conf/cassandra.yaml in the patch). An alias is a simple (printable) ascii character (there is around 100 possible alias which should be aplenty). You can then use those alias in the encoded column name, in which case you only need 2 bytes to encode the comparator, making it fairly compact.
The encoded format is very close to the one of the static CompositeType, even though it is a tiny bit more complex due to the aliases handling. But given that dynamic composite type is not meant for the faint of heart, that should be fine.
I'm attaching a new patch (against 0.7 branch) with both the static and dynamic composite type. The bulk of the code is shared between them. This also include a fair amount of unit tests.