Cassandra
  1. Cassandra
  2. CASSANDRA-2231

Add CompositeType comparer to the comparers provided in org.apache.cassandra.db.marshal

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Fix Version/s: 0.8.1
    • Component/s: Examples
    • Labels:
      None

      Description

      CompositeType is a custom comparer that makes it possible to create comparable composite values out of the basic types that Cassandra currently supports, such as Long, UUID, etc. This is very useful in both the creation of custom inverted indexes using columns in a skinny row, where each column name is a composite value, and also when using Cassandra's built-in secondary index support, where it can be used to encode the values in the columns that Cassandra indexes. One scenario for the usage of these is documented here: http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html. Source for contribution is attached and has been previously maintained on github here: https://github.com/edanuff/CassandraCompositeType

        Issue Links

          Activity

          Hide
          Luís Ferreira added a comment -

          I've compiled it with just this patch now and I get:

          Unable to find abstract-type class 'org.apache.cassandra.db.marshal.CompositeType(BytesType,BytesType)'
          

          I check the jar and it is there, I'm using the binaries that I got from doing ant release, am I doing this right?

          Show
          Luís Ferreira added a comment - I've compiled it with just this patch now and I get: Unable to find abstract -type class 'org.apache.cassandra.db.marshal.CompositeType(BytesType,BytesType)' I check the jar and it is there, I'm using the binaries that I got from doing ant release, am I doing this right?
          Hide
          Jonathan Ellis added a comment -

          This is a sign that you should use an official release supporting composites instead of trying to hack it into 0.7.9... Composite types have evolved over many tickets and patchsets, this was just the first.

          Show
          Jonathan Ellis added a comment - This is a sign that you should use an official release supporting composites instead of trying to hack it into 0.7.9... Composite types have evolved over many tickets and patchsets, this was just the first.
          Hide
          Luís Ferreira added a comment -

          Thanks, but this tutorial uses a Composite class that I can't find in the patch. That was the thing I couldn't understand.

          Composite start = compositeFrom(startArg, Composite.ComponentEquality.EQUAL);
          Composite end = compositeFrom(startArg, Composite.ComponentEquality.GREATER_THAN_EQUAL);
          start.addComponent(1,"CA",Composite.ComponentEquality.EQUAL);
          end.addComponent(1,"CA",Composite.ComponentEquality.GREATER_THAN_EQUAL);
          
          Show
          Luís Ferreira added a comment - Thanks, but this tutorial uses a Composite class that I can't find in the patch. That was the thing I couldn't understand. Composite start = compositeFrom(startArg, Composite.ComponentEquality.EQUAL); Composite end = compositeFrom(startArg, Composite.ComponentEquality.GREATER_THAN_EQUAL); start.addComponent(1, "CA" ,Composite.ComponentEquality.EQUAL); end.addComponent(1, "CA" ,Composite.ComponentEquality.GREATER_THAN_EQUAL);
          Show
          Sylvain Lebresne added a comment - http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1 might be a good start
          Hide
          Luís Ferreira added a comment -

          Thanks, I'll give it try without including that class then.

          But how do I create a composite key using just this patch then?

          Sorry for asking all these questions here, but I couldn't find any tutorial or info on this.

          Show
          Luís Ferreira added a comment - Thanks, I'll give it try without including that class then. But how do I create a composite key using just this patch then? Sorry for asking all these questions here, but I couldn't find any tutorial or info on this.
          Hide
          Sylvain Lebresne added a comment -

          I don't think that has anything to do with the problem though, does it?

          Well, actually that might, the class you linked is not related to this patch (and is not compatible with it as far as I can tell).

          Show
          Sylvain Lebresne added a comment - I don't think that has anything to do with the problem though, does it? Well, actually that might, the class you linked is not related to this patch (and is not compatible with it as far as I can tell).
          Hide
          Luís Ferreira added a comment -

          Thanks for the quick reply.

          I'm using a sun jvm and have used it a lot with cassandra without any problem. Actually, if I create CF that use the BytesType comparator only, it works. It just stops working when creating this particular CF.

          I've using the binaries in the build directory after doing an ant release. Not sure if it is the correct way to compile everything.

          I've also added https://github.com/edanuff/CassandraCompositeType/blob/master/src/main/java/comparators/Composite.java to org.apache.cassandra.db since I couldn't understand how to create an actual row key just with this patch. I don't think that has anything to do with the problem though, does it?

          Show
          Luís Ferreira added a comment - Thanks for the quick reply. I'm using a sun jvm and have used it a lot with cassandra without any problem. Actually, if I create CF that use the BytesType comparator only, it works. It just stops working when creating this particular CF. I've using the binaries in the build directory after doing an ant release. Not sure if it is the correct way to compile everything. I've also added https://github.com/edanuff/CassandraCompositeType/blob/master/src/main/java/comparators/Composite.java to org.apache.cassandra.db since I couldn't understand how to create an actual row key just with this patch. I don't think that has anything to do with the problem though, does it?
          Hide
          Sylvain Lebresne added a comment -

          I've patched a 0.7.9 version of cassandra with this patch plus the one on CASSANDRA-2355, and got no errors, but when running it I get:

          Invalid access of stack red zone 0x100401fe8 rip=0x1010c8f5e
          

          That really sound like a JVM bug, so I very much doubt the patch here has anything to do with that (since they don't use anything unsafe in particular). You could check if the JVM you are using is up to date, and preferably use a sun JVM (because that's what more tested).

          Show
          Sylvain Lebresne added a comment - I've patched a 0.7.9 version of cassandra with this patch plus the one on CASSANDRA-2355 , and got no errors, but when running it I get: Invalid access of stack red zone 0x100401fe8 rip=0x1010c8f5e That really sound like a JVM bug, so I very much doubt the patch here has anything to do with that (since they don't use anything unsafe in particular). You could check if the JVM you are using is up to date, and preferably use a sun JVM (because that's what more tested).
          Hide
          Luís Ferreira added a comment -

          Hi,

          I've patched a 0.7.9 version of cassandra with this patch plus the one on CASSANDRA-2355, and got no errors, but when running it I get:

          Invalid access of stack red zone 0x100401fe8 rip=0x1010c8f5e
          

          I create the column family as so:

          CfDef base_columns_cf_definition = new CfDef();
          base_columns_cf_definition.setName("BaseColumns_CF");
          base_columns_cf_definition.setColumn_type("Standard");
          base_columns_cf_definition.setComparator_type("CompositeType(BytesType,BytesType)");
          base_columns_cf_definition.setKeyspace("Table"+this.getContainerid())
          

          I would really like to have this feature, but I cannot upgrade to version 0.8.1. What am I doing wrong?

          Show
          Luís Ferreira added a comment - Hi, I've patched a 0.7.9 version of cassandra with this patch plus the one on CASSANDRA-2355 , and got no errors, but when running it I get: Invalid access of stack red zone 0x100401fe8 rip=0x1010c8f5e I create the column family as so: CfDef base_columns_cf_definition = new CfDef(); base_columns_cf_definition.setName( "BaseColumns_CF" ); base_columns_cf_definition.setColumn_type( "Standard" ); base_columns_cf_definition.setComparator_type( "CompositeType(BytesType,BytesType)" ); base_columns_cf_definition.setKeyspace( "Table" + this .getContainerid()) I would really like to have this feature, but I cannot upgrade to version 0.8.1. What am I doing wrong?
          Hide
          Sylvain Lebresne added a comment -

          The comment still applies to DynamicCompositeType, but what the comment doesn't says is that if you use a 0x01 as the end-of-component, it expects you have no remaining component. The error message tells that apparently there is some bytes remaining after that 0x01. You can look the discussion above on that ticket for why that doesn't make sense to have anything after a 0x01.

          Show
          Sylvain Lebresne added a comment - The comment still applies to DynamicCompositeType, but what the comment doesn't says is that if you use a 0x01 as the end-of-component, it expects you have no remaining component. The error message tells that apparently there is some bytes remaining after that 0x01. You can look the discussion above on that ticket for why that doesn't make sense to have anything after a 0x01.
          Hide
          donal zang added a comment -

          In CompositeType.java I find this:
          37 * 'end-of-component' byte should always be 0 for actual column name.
          38 * However, it can set to 1 for query bounds. This allows to query for the
          39 * equivalent of 'give me the full super-column'. That is, if during a slice
          40 * query uses:
          41 * start = <3><"foo".getBytes()><0>
          42 * end = <3><"foo".getBytes()><1>
          43 * then he will be sure to get all the columns whose first component is "foo".

          Is this also apply to DynamicCompositeType ? When I use it in a query end with '0x01', there's an Exception "Invalid bytes remaining after an end-of-component at component0"

          Show
          donal zang added a comment - In CompositeType.java I find this: 37 * 'end-of-component' byte should always be 0 for actual column name. 38 * However, it can set to 1 for query bounds. This allows to query for the 39 * equivalent of 'give me the full super-column'. That is, if during a slice 40 * query uses: 41 * start = <3><"foo".getBytes()><0> 42 * end = <3><"foo".getBytes()><1> 43 * then he will be sure to get all the columns whose first component is "foo". Is this also apply to DynamicCompositeType ? When I use it in a query end with '0x01', there's an Exception "Invalid bytes remaining after an end-of-component at component0"
          Hide
          donal zang added a comment -

          Thanks sylvain !

          Show
          donal zang added a comment - Thanks sylvain !
          Hide
          Sylvain Lebresne added a comment -

          You'd have to apply the patch on CASSANDRA-2355 first.

          Show
          Sylvain Lebresne added a comment - You'd have to apply the patch on CASSANDRA-2355 first.
          Hide
          donal zang added a comment - - edited

          Hi,
          forgive me if the question is naive.
          I'm using the release 0.7(http://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.7.6-2/) and when I apply the patch , it can't find the file TypeParser.java :

          $patch -p1 < 0001-Add-compositeType-and-DynamicCompositeType_0.7.patch
          patching file src/java/org/apache/cassandra/db/marshal/AbstractCompositeType.java
          patching file src/java/org/apache/cassandra/db/marshal/CompositeType.java
          patching file src/java/org/apache/cassandra/db/marshal/DynamicCompositeType.java
          can't find file to patch at input line 694
          Perhaps you used the wrong -p or --strip option?
          The text leading up to this was:
          --------------------------

          diff --git a/src/java/org/apache/cassandra/db/marshal/TypeParser.java b/src/java/org/apache/cassandra/db/marshal/TypeParser.java
          index a3fabc5..0b21042 100644
          — a/src/java/org/apache/cassandra/db/marshal/TypeParser.java
          +++ b/src/java/org/apache/cassandra/db/marshal/TypeParser.java
          --------------------------
          File to patch:

          So, am I missing something?

          Show
          donal zang added a comment - - edited Hi, forgive me if the question is naive. I'm using the release 0.7( http://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.7.6-2/ ) and when I apply the patch , it can't find the file TypeParser.java : $patch -p1 < 0001-Add-compositeType-and-DynamicCompositeType_0.7.patch patching file src/java/org/apache/cassandra/db/marshal/AbstractCompositeType.java patching file src/java/org/apache/cassandra/db/marshal/CompositeType.java patching file src/java/org/apache/cassandra/db/marshal/DynamicCompositeType.java can't find file to patch at input line 694 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -------------------------- diff --git a/src/java/org/apache/cassandra/db/marshal/TypeParser.java b/src/java/org/apache/cassandra/db/marshal/TypeParser.java index a3fabc5..0b21042 100644 — a/src/java/org/apache/cassandra/db/marshal/TypeParser.java +++ b/src/java/org/apache/cassandra/db/marshal/TypeParser.java -------------------------- File to patch: So, am I missing something?
          Hide
          Sylvain Lebresne added a comment -

          Interpreted the "looks good to me" of Ed as +1, committed for 0.8.1.

          Show
          Sylvain Lebresne added a comment - Interpreted the "looks good to me" of Ed as +1, committed for 0.8.1.
          Hide
          Sylvain Lebresne added a comment -

          Actually attached update version that make the JDBC metadata functions throw UnsupportedOperationException. We probably can do better than binary so let's not commit to anything specific just yet.

          Show
          Sylvain Lebresne added a comment - Actually attached update version that make the JDBC metadata functions throw UnsupportedOperationException. We probably can do better than binary so let's not commit to anything specific just yet.
          Hide
          Sylvain Lebresne added a comment -

          v4 is rebased against current trunk and add the decompose function (which simply returns it's argument as Ed suggested).

          Not that as far as jdbc is concerned, composite type manipulate binary data for now. I'm sure we can do better but I'll gladly leave that to further improvements.

          Show
          Sylvain Lebresne added a comment - v4 is rebased against current trunk and add the decompose function (which simply returns it's argument as Ed suggested). Not that as far as jdbc is concerned, composite type manipulate binary data for now. I'm sure we can do better but I'll gladly leave that to further improvements.
          Hide
          Ed Anuff added a comment -

          It probably should just be a passthrough like it is for the BytesType:

          @Override
          public ByteBuffer decompose(ByteBuffer value)

          { return value; }

          Assuming that's put in into AbstractCompositeType, it looks good to me. The unit tests are the same we've been using in the version of this we've been maintaining at https://github.com/riptano/hector-composite which some folks are using in production.

          Show
          Ed Anuff added a comment - It probably should just be a passthrough like it is for the BytesType: @Override public ByteBuffer decompose(ByteBuffer value) { return value; } Assuming that's put in into AbstractCompositeType, it looks good to me. The unit tests are the same we've been using in the version of this we've been maintaining at https://github.com/riptano/hector-composite which some folks are using in production.
          Hide
          Sylvain Lebresne added a comment -

          Yes, decompose was added back to 0.8 after I rebased those patch and I didn't rebase them back. So you're right that those should be added.

          Show
          Sylvain Lebresne added a comment - Yes, decompose was added back to 0.8 after I rebased those patch and I didn't rebase them back. So you're right that those should be added.
          Hide
          Ed Anuff added a comment -

          Am I correct in that the v3 rebased to .8 doesn't include the decompose() methods? I had to add those to get the v3 patch to build. Once I did that, it passed the unit tests.

          Show
          Ed Anuff added a comment - Am I correct in that the v3 rebased to .8 doesn't include the decompose() methods? I had to add those to get the v3 patch to build. Once I did that, it passed the unit tests.
          Hide
          Sylvain Lebresne added a comment -

          I'm attaching a rebase of this patch to 0.7. I don't think this will go into a 0.7 release though, because 0.7 should really be about small bug fixes at this point, but that way it will be easy for those who wants it enough to apply this patch.

          Show
          Sylvain Lebresne added a comment - I'm attaching a rebase of this patch to 0.7. I don't think this will go into a 0.7 release though, because 0.7 should really be about small bug fixes at this point, but that way it will be easy for those who wants it enough to apply this patch.
          Hide
          Ed Anuff added a comment -

          Yes, will do it today.

          Show
          Ed Anuff added a comment - Yes, will do it today.
          Hide
          Jonathan Ellis added a comment -

          Ed, can you review v3?

          Show
          Jonathan Ellis added a comment - Ed, can you review v3?
          Hide
          Sylvain Lebresne added a comment -

          Attaching v3 rebased to 0.8. It applies on top of CASSANDRA-2355.

          Show
          Sylvain Lebresne added a comment - Attaching v3 rebased to 0.8. It applies on top of CASSANDRA-2355 .
          Hide
          Jeremy Hanna added a comment -

          I'd like to see this in 0.7 if possible for things we're working on.

          Show
          Jeremy Hanna added a comment - I'd like to see this in 0.7 if possible for things we're working on.
          Hide
          Sylvain Lebresne added a comment -

          Rebased and updated patch attached. The new patch use -1, 0, 1 for the end-of-component. It also fixes some issue I add fixed in CASSANDRA-2355 but really belong here, namely handling the parameters of the composite types when deflating CFMetadata into Avro, etc...

          This still a patch against 0.7. I think this is really not very disruptive and could go into 0.7, but I can rebase to 0.8 otherwise. The only problem I can think of is if someone already have a custom comparator named CompositeType or DynamicCompositeType, but that's a problem we face each time we add a new comparator.

          Show
          Sylvain Lebresne added a comment - Rebased and updated patch attached. The new patch use -1, 0, 1 for the end-of-component. It also fixes some issue I add fixed in CASSANDRA-2355 but really belong here, namely handling the parameters of the composite types when deflating CFMetadata into Avro, etc... This still a patch against 0.7. I think this is really not very disruptive and could go into 0.7, but I can rebase to 0.8 otherwise. The only problem I can think of is if someone already have a custom comparator named CompositeType or DynamicCompositeType, but that's a problem we face each time we add a new comparator.
          Hide
          Ed Anuff added a comment -

          If you can add the -1, 0, 1 e-o-c suggestion, then I'm good with this. We've been using a preview version of your patch that we put up at https://github.com/riptano/hector-composite to make sure it works. So far it seems to meet every use case, with the exception of the previous questions from April 4 which you correctly pointed out couldn't be solved the way we were suggesting. So, I'm all for getting this out as soon as possible.

          Show
          Ed Anuff added a comment - If you can add the -1, 0, 1 e-o-c suggestion, then I'm good with this. We've been using a preview version of your patch that we put up at https://github.com/riptano/hector-composite to make sure it works. So far it seems to meet every use case, with the exception of the previous questions from April 4 which you correctly pointed out couldn't be solved the way we were suggesting. So, I'm all for getting this out as soon as possible.
          Hide
          Sylvain Lebresne added a comment -

          So I don't know where you guys are with your JPA implementation, but I for one would be keen on getting this out because I do think this is broadly useful. However it is worth making sure this do handle your needs (once this is out, it will be harder to make changes). So how far is this from those needs ? I agree with the change of using -1, 0, 1 for the e-o-c (as this allows 'strictly greater-than' kind of queries; I don't think the attached patch have that already but I'll add the change next time I rebase this, just want to know if there anything else needed first) and you didn't follow up yet on my last comment. Is there something else ?

          Show
          Sylvain Lebresne added a comment - So I don't know where you guys are with your JPA implementation, but I for one would be keen on getting this out because I do think this is broadly useful. However it is worth making sure this do handle your needs (once this is out, it will be harder to make changes). So how far is this from those needs ? I agree with the change of using -1, 0, 1 for the e-o-c (as this allows 'strictly greater-than' kind of queries; I don't think the attached patch have that already but I'll add the change next time I rebase this, just want to know if there anything else needed first) and you didn't follow up yet on my last comment. Is there something else ?
          Hide
          Sylvain Lebresne added a comment - - edited

          The thing is this, if we agree that for actual values, then end-of-component (eoc) is always 0, I don't see what could be the use for the start or end of a query to have anything after a eoc != 0, since for any comparison of the start (resp. end) with an actual value, the comparaison will return as soon as we compare that eoc.

          Taking your example of start => (100+1): (10+0), when compared to any value (x:0):(y:0), the comparison will either return when comparing x to 100, and if x==100, will stop when comparing the eoc (returning then that start > value, so that query would typically return all value whose first component is strictly greater than 100).

          Show
          Sylvain Lebresne added a comment - - edited The thing is this, if we agree that for actual values, then end-of-component (eoc) is always 0, I don't see what could be the use for the start or end of a query to have anything after a eoc != 0, since for any comparison of the start (resp. end) with an actual value, the comparaison will return as soon as we compare that eoc. Taking your example of start => (100+1): (10+0), when compared to any value (x:0):(y:0), the comparison will either return when comparing x to 100, and if x==100, will stop when comparing the eoc (returning then that start > value, so that query would typically return all value whose first component is strictly greater than 100).
          Hide
          Ed Anuff added a comment -

          One more thing that's come up in testing, we're finding that only the last end-of-component byte is allowed to be non-zero. In lines 231-233 of your path, you do this check in AbstractCompositeType.validate():

          + byte b = bb.get();
          + if (b != 0 && bb.remaining() != 0)
          + throw new MarshalException("Invalid bytes remaining after an end-of-component at component" + i);
          +

          Is this check necessary or would it be ok if the end-of-component byte could be non-zero in any component? We're usually doing the less-than-equals or greater-than-equals comparisons on the first component value, not the last, which is usually just a discriminator value to prevent duplicate column names.

          Show
          Ed Anuff added a comment - One more thing that's come up in testing, we're finding that only the last end-of-component byte is allowed to be non-zero. In lines 231-233 of your path, you do this check in AbstractCompositeType.validate(): + byte b = bb.get(); + if (b != 0 && bb.remaining() != 0) + throw new MarshalException("Invalid bytes remaining after an end-of-component at component" + i); + Is this check necessary or would it be ok if the end-of-component byte could be non-zero in any component? We're usually doing the less-than-equals or greater-than-equals comparisons on the first component value, not the last, which is usually just a discriminator value to prevent duplicate column names.
          Hide
          Todd Nine added a comment -

          Hi Sylvain,
          I seem to have encountered a bug in the comparator. I'm using the composite to perform Cassandra based intersections of fields during queries. For instance, say a user defines an index as this.

          status + unitId.

          The write would always contain values of <status>0<unitId>+0 when using the -1 0 and 1 fields. If the user enters a query such as this

          status > 100 && status < 300 && unitId = 10

          I would need to construct a column scan of the following to get correct result sets.

          start => 100+1+10+0

          end => 300+0+10+1

          However on in the validate function I'm receiving the error "Invalid bytes remaining after an end-of-component at component". This seems incorrect to me. We're ultimately attempting to apply any equality operand and transform it to a range scan for the given field in the composite. This means that -1, or 1 could appear after any component in the composite, not just the last one. Can you please add this functionality/remove this verification check?

          Thanks,
          Todd

          Show
          Todd Nine added a comment - Hi Sylvain, I seem to have encountered a bug in the comparator. I'm using the composite to perform Cassandra based intersections of fields during queries. For instance, say a user defines an index as this. status + unitId. The write would always contain values of <status> 0 <unitId>+0 when using the -1 0 and 1 fields. If the user enters a query such as this status > 100 && status < 300 && unitId = 10 I would need to construct a column scan of the following to get correct result sets. start => 100+1+10+0 end => 300+0+10+1 However on in the validate function I'm receiving the error "Invalid bytes remaining after an end-of-component at component". This seems incorrect to me. We're ultimately attempting to apply any equality operand and transform it to a range scan for the given field in the composite. This means that -1, or 1 could appear after any component in the composite, not just the last one. Can you please add this functionality/remove this verification check? Thanks, Todd
          Hide
          Sylvain Lebresne added a comment -

          It's a good idea. I'll update the patch.

          Show
          Sylvain Lebresne added a comment - It's a good idea. I'll update the patch.
          Hide
          Ed Anuff added a comment -

          Sylvain, in the JPA implementation, we're seeing that we'd like to have a little more flexibility with the trailing end-of-component, specifically, that it be able to have values of -1,0,1 rather than just 0,1. The comparison logic would look like this:

          byte b1 = bb1.get();
          byte b2 = bb2.get();
          if (b1 < 0) {
          	if (b2 >= 0) {
          		return -1;
          	}
          }
          
          if (b1 > 0) {
          	if (b2 <= 0) {
          		return 1;
          	}
          }
          
          if ((b1 == 0) && (b2 != 0)) {
          	return - b2;
          }
          
          Show
          Ed Anuff added a comment - Sylvain, in the JPA implementation, we're seeing that we'd like to have a little more flexibility with the trailing end-of-component, specifically, that it be able to have values of -1,0,1 rather than just 0,1. The comparison logic would look like this: byte b1 = bb1.get(); byte b2 = bb2.get(); if (b1 < 0) { if (b2 >= 0) { return -1; } } if (b1 > 0) { if (b2 <= 0) { return 1; } } if ((b1 == 0) && (b2 != 0)) { return - b2; }
          Hide
          Sylvain Lebresne added a comment -

          Patch updated to now depend on CASSANDRA-2355. The parsing of comparator arguments of the previous patches were too naive, and in particular didn't worked with schema migrations.

          Show
          Sylvain Lebresne added a comment - Patch updated to now depend on CASSANDRA-2355 . The parsing of comparator arguments of the previous patches were too naive, and in particular didn't worked with schema migrations.
          Hide
          Sylvain Lebresne added a comment -

          For parameterized behavior of comparators, my assumption is that this would work within the DynamicCompositeType as well?

          It would.

          but ultimately we'd need a more robust comparator factory

          That's the idea. I intent to add this as part of CASSANDRA-2355.

          Show
          Sylvain Lebresne added a comment - For parameterized behavior of comparators, my assumption is that this would work within the DynamicCompositeType as well? It would. but ultimately we'd need a more robust comparator factory That's the idea. I intent to add this as part of CASSANDRA-2355 .
          Hide
          Ed Anuff added a comment -

          For parameterized behavior of comparators, my assumption is that this would work within the DynamicCompositeType as well? I'll add this to Cassandra-2235, but I'm thinking about the embedded comparator names in the dynamic format. Right now, you're simply calling FBUtilities.getComparator() with the name, but ultimately we'd need a more robust comparator factory that could be passed something like "UUIDType(restrictTo=time,sort=desc)" and parse out the parameters in order to construct the instance and was able to cache the parameterized version in a similar way to how your patch currently caches the comparators it instantiates, and would probably need to be able to know that "UUIDType(restrictTo=time,sort=desc)" and "UUIDType(sort=desc,restrictTo=time)" are the same comparator.

          Show
          Ed Anuff added a comment - For parameterized behavior of comparators, my assumption is that this would work within the DynamicCompositeType as well? I'll add this to Cassandra-2235, but I'm thinking about the embedded comparator names in the dynamic format. Right now, you're simply calling FBUtilities.getComparator() with the name, but ultimately we'd need a more robust comparator factory that could be passed something like "UUIDType(restrictTo=time,sort=desc)" and parse out the parameters in order to construct the instance and was able to cache the parameterized version in a similar way to how your patch currently caches the comparators it instantiates, and would probably need to be able to know that "UUIDType(restrictTo=time,sort=desc)" and "UUIDType(sort=desc,restrictTo=time)" are the same comparator.
          Hide
          Sylvain Lebresne added a comment -

          Patch rebased (against 0.7)

          Show
          Sylvain Lebresne added a comment - Patch rebased (against 0.7)
          Hide
          Sylvain Lebresne added a comment -

          I've removed myself as reviewer as I don't think it make sense for me to review my code. Imho, this can go into 0.7, given how little interaction this has with existing code.

          Show
          Sylvain Lebresne added a comment - I've removed myself as reviewer as I don't think it make sense for me to review my code. Imho, this can go into 0.7, given how little interaction this has with existing code.
          Hide
          Sylvain Lebresne added a comment -

          That would be better solved by having reversed comparator for all comparators, and I would much prefer a general way to define a reversed comparator. That is, for instance, you could declare comparator=LongType(order=desc). Having a way That would be better solved by having reversed comparator for all comparators, and I would much prefer having a general way to define a reversed comparator. That is, for instance, you could declare comparator=LongType(order=desc).

          That way you don't have to mess with the composite types encoded format (and I'm really keen on keeping it as simple as possible). Plus have a reversed comparator for all comparator is generally useful even outside composite types (granted it's a bit less useful for simple types since you can always query in reverse, but reverse queries are probably a bit slower due to reverse seeks).

          Even more generally, I think being able to parametrize the behavior of comparators is genuinely useful. For instance, I could see having only one comparator for UUIDs (in the spirit of CASSANDRA-2233) and being able to say stuffs like

            comparator=UUIDType                     // accepts both time based and lexical
            comparator=UUIDType(restrictTo=time)    // time based only
            comparator=UUIDType(restrictTo=lexical) // lexical only
          

          or something along those lines.

          I've opened CASSANDRA-2355 for doing the reverse part of this.

          Show
          Sylvain Lebresne added a comment - That would be better solved by having reversed comparator for all comparators, and I would much prefer a general way to define a reversed comparator. That is, for instance, you could declare comparator=LongType(order=desc). Having a way That would be better solved by having reversed comparator for all comparators, and I would much prefer having a general way to define a reversed comparator. That is, for instance, you could declare comparator=LongType(order=desc) . That way you don't have to mess with the composite types encoded format (and I'm really keen on keeping it as simple as possible). Plus have a reversed comparator for all comparator is generally useful even outside composite types (granted it's a bit less useful for simple types since you can always query in reverse, but reverse queries are probably a bit slower due to reverse seeks). Even more generally, I think being able to parametrize the behavior of comparators is genuinely useful. For instance, I could see having only one comparator for UUIDs (in the spirit of CASSANDRA-2233 ) and being able to say stuffs like comparator=UUIDType // accepts both time based and lexical comparator=UUIDType(restrictTo=time) // time based only comparator=UUIDType(restrictTo=lexical) // lexical only or something along those lines. I've opened CASSANDRA-2355 for doing the reverse part of this.
          Hide
          Ed Anuff added a comment - - edited

          From the email conversation around this earlier today, I'm wondering if a bit in the trailing byte at the end of each component could be used for a sort order flag?

          I'd like to suggest we put a byte just before the length/value part of the component that if it's non-zero, reverses the comparer results for that component. Both component parts must have the same sort order byte (i.e. both are 0 or both are 1) or a RuntimeException is thrown.

          For context, we're looking at doing something in the JPA implementation via annotations that's functionally similar to how App Engine defines indexes in it's index.yaml - http://code.google.com/appengine/docs/java/configyaml/indexconfig.html

          Show
          Ed Anuff added a comment - - edited From the email conversation around this earlier today, I'm wondering if a bit in the trailing byte at the end of each component could be used for a sort order flag? I'd like to suggest we put a byte just before the length/value part of the component that if it's non-zero, reverses the comparer results for that component. Both component parts must have the same sort order byte (i.e. both are 0 or both are 1) or a RuntimeException is thrown. For context, we're looking at doing something in the JPA implementation via annotations that's functionally similar to how App Engine defines indexes in it's index.yaml - http://code.google.com/appengine/docs/java/configyaml/indexconfig.html
          Hide
          Todd Nine added a comment -

          Since this is primarily needed for ordering well defined queries and collections, can we add a bitset to the composite type to represent sort ordering? A lot of queries need the semantics of "order by logindate desc, firstname asc, lastname asc". This would give is the ability to set the descending flag on any one of the composite types allowing us to always have a correctly order result set stored in the column name. In this case I would simply set the bit 0 to signal to the composite it needs to order the first field descending instead of the standard ascending.

          Show
          Todd Nine added a comment - Since this is primarily needed for ordering well defined queries and collections, can we add a bitset to the composite type to represent sort ordering? A lot of queries need the semantics of "order by logindate desc, firstname asc, lastname asc". This would give is the ability to set the descending flag on any one of the composite types allowing us to always have a correctly order result set stored in the column name. In this case I would simply set the bit 0 to signal to the composite it needs to order the first field descending instead of the standard ascending.
          Hide
          Stu Hood added a comment -

          > Yes, IntegerType would be a better choice because it uses BigInteger that packs
          > stuffs as much as it can (aka, 1 byte if it fits).
          Speaking of which: should we use variable length integers for the composite field lengths here? 99% of the time the length will fit in a single byte.

          Show
          Stu Hood added a comment - > Yes, IntegerType would be a better choice because it uses BigInteger that packs > stuffs as much as it can (aka, 1 byte if it fits). Speaking of which: should we use variable length integers for the composite field lengths here? 99% of the time the length will fit in a single byte.
          Hide
          Sylvain Lebresne added a comment -

          Yes, IntegerType would be a better choice because it uses BigInteger that packs stuffs as much as it can (aka, 1 byte if it fits). I know that because it bitted me while writing the unit test for this )

          Show
          Sylvain Lebresne added a comment - Yes, IntegerType would be a better choice because it uses BigInteger that packs stuffs as much as it can (aka, 1 byte if it fits). I know that because it bitted me while writing the unit test for this )
          Hide
          Ed Anuff added a comment -

          Hmm, that would work, although you certainly wouldn't want to use the LongType as your integer. I guess the minimum overhead for a component is 6 bytes - 2 header, 2 length, 1 value, 1 inclusion flag.

          I'm not seeing anything else that wouldn't let me use this as a functional replacement for the original CompositeType, so I'm +1 on it. Thanks!

          Show
          Ed Anuff added a comment - Hmm, that would work, although you certainly wouldn't want to use the LongType as your integer. I guess the minimum overhead for a component is 6 bytes - 2 header, 2 length, 1 value, 1 inclusion flag. I'm not seeing anything else that wouldn't let me use this as a functional replacement for the original CompositeType, so I'm +1 on it. Thanks!
          Hide
          Sylvain Lebresne added a comment -

          We could do that. But I have to admit that I'm usually not very fond of making things work when it's not clear they should. I think that could hide problems more than it helps. Better fail fast if there is a mess up.

          And if you want to store in the same row column names that have intrinsically different structures, you can prefix it manually to distinguish thinks. That is, in you example, you'll replace <integer,uuid,timestamp> (type1) by <integer,integer,uuid,timestamp> and <string,uuid,timestamp> (type2) by <integer,string,uuid,timestamp> and assign the same integer for all the column name of type1 and the same integer (but a different from first one) to all those of type2. And this has actually a bunch of advantages: because you use a true component, you can query for the whole range of column name having the same 'type' (aka all type1 or all type2). Moreover you do control the prefix so you have more control on how each separate set of columns sort between each other. Yes you have to manage those prefix client side, but honestly I think it's a small price for the type safety it provides.

          Show
          Sylvain Lebresne added a comment - We could do that. But I have to admit that I'm usually not very fond of making things work when it's not clear they should. I think that could hide problems more than it helps. Better fail fast if there is a mess up. And if you want to store in the same row column names that have intrinsically different structures, you can prefix it manually to distinguish thinks. That is, in you example, you'll replace <integer,uuid,timestamp> (type1) by <integer,integer,uuid,timestamp> and <string,uuid,timestamp> (type2) by <integer,string,uuid,timestamp> and assign the same integer for all the column name of type1 and the same integer (but a different from first one) to all those of type2. And this has actually a bunch of advantages: because you use a true component, you can query for the whole range of column name having the same 'type' (aka all type1 or all type2). Moreover you do control the prefix so you have more control on how each separate set of columns sort between each other. Yes you have to manage those prefix client side, but honestly I think it's a small price for the type safety it provides.
          Hide
          Ed Anuff added a comment -

          If two dynamic composite types are compared, the first <integer,uuid,timestamp> and the second <string,uuid,timestamp>, this results in an exception being thrown in line 659, correct? In the original CompositeType, the component types each had an ordinal type value and the comparison was done on those type values if the components were of different types. I might suggest that in your code using the alias character byte or the hashCode() of the classname as the type value and doing a similar comparison, rather than throwing an exception.

          Show
          Ed Anuff added a comment - If two dynamic composite types are compared, the first <integer,uuid,timestamp> and the second <string,uuid,timestamp>, this results in an exception being thrown in line 659, correct? In the original CompositeType, the component types each had an ordinal type value and the comparison was done on those type values if the components were of different types. I might suggest that in your code using the alias character byte or the hashCode() of the classname as the type value and doing a similar comparison, rather than throwing an exception.
          Hide
          Sylvain Lebresne added a comment -

          Yes that is also correct.

          Show
          Sylvain Lebresne added a comment - Yes that is also correct.
          Hide
          Ed Anuff added a comment -

          Greater-than is already doable in my previous patch (up to the bug in validation). For the less-than part, I agree that it is nice to be able to do it easily. In my new patch, I add a leading byte to each component, whose purpose is to always be 0, except for lesser-than query. That way, you can do the query above easily. The price is a slightly more complicated encoding but I think it's totally worth it.

          Just to be clear, the original idea was to make it possible to construct a key for the purposes of doing a range slice that would compare inclusive either or both at the start and finish of the range. This appears to be possible with the "inclusion byte" that you're using in lines 179 through 184 of your patch. Is that correct?

          Show
          Ed Anuff added a comment - Greater-than is already doable in my previous patch (up to the bug in validation). For the less-than part, I agree that it is nice to be able to do it easily. In my new patch, I add a leading byte to each component, whose purpose is to always be 0, except for lesser-than query. That way, you can do the query above easily. The price is a slightly more complicated encoding but I think it's totally worth it. Just to be clear, the original idea was to make it possible to construct a key for the purposes of doing a range slice that would compare inclusive either or both at the start and finish of the range. This appears to be possible with the "inclusion byte" that you're using in lines 179 through 184 of your patch. Is that correct?
          Hide
          Sylvain Lebresne added a comment -

          Does this mean that I can have any number of components in my dynamic composite key and that "b" and "t" are aliases to BytesType and TimeUUIDType for the purpose of space efficienct? Using both or neither of those aliases is valid and the order in which I use them isn't mandated, correct?

          That is correct. The aliases are here only for space efficiency. No obligation of using them of any sort.

          Show
          Sylvain Lebresne added a comment - Does this mean that I can have any number of components in my dynamic composite key and that "b" and "t" are aliases to BytesType and TimeUUIDType for the purpose of space efficienct? Using both or neither of those aliases is valid and the order in which I use them isn't mandated, correct? That is correct. The aliases are here only for space efficiency. No obligation of using them of any sort.
          Hide
          Ed Anuff added a comment -

          Sylvain, this looks like it could do the trick.

          Just to clarify, in your example, if I declare something like this:

          DynamicCompositeType(b => BytesType, t => TimeUUIDType)

          Does this mean that I can have any number of components in my dynamic composite key and that "b" and "t" are aliases to BytesType and TimeUUIDType for the purpose of space efficienct? Using both or neither of those aliases is valid and the order in which I use them isn't mandated, correct?

          Show
          Ed Anuff added a comment - Sylvain, this looks like it could do the trick. Just to clarify, in your example, if I declare something like this: DynamicCompositeType(b => BytesType, t => TimeUUIDType) Does this mean that I can have any number of components in my dynamic composite key and that "b" and "t" are aliases to BytesType and TimeUUIDType for the purpose of space efficienct? Using both or neither of those aliases is valid and the order in which I use them isn't mandated, correct?
          Hide
          Sylvain Lebresne added a comment -

          (1) having validation only require that the components, if they're provided, are of the correct type (looking at Sylvain's code, the compare function should work, but the validation wouldn't)

          That's a bug. The last test of validate() should be negated. The only goal was forbid having leading bytes after all the comparator have been used (because compare makes the assumption that there will not have bytes without a proper comparator for them).

          However it is clear that we should allow a key having less component than comparator (btw, the unit test in my patch was actually doing it).

          This is fixed in the patch I'm attaching.

          My other concern is that one of the things that got stripped out was the MATCH_MINIMUM, MATCH_MAXIMUM feature, which made implementing greater-than-equals, less-than-equals, etc. in ranges much easier

          I'm not familiar enough with your code to be sure what those were doing exactly, but I suppose that the goal is that, if for instance you have the following column names:

              test:bar:42
              test:foo:24
              test:foo:42
              test:foobar
          

          then you want to be able to query anything that starts with the two component test and foo (that is test:foo:24 and test:foo:42 in my example).

          Greater-than is already doable in my previous patch (up to the bug in validation). For the less-than part, I agree that it is nice to be able to do it easily. In my new patch, I add a leading byte to each component, whose purpose is to always be 0, except for lesser-than query. That way, you can do the query above easily. The price is a slightly more complicated encoding but I think it's totally worth it.

          While i agree in theory, this is why thrift exists right?

          I don't know thrift as well as you do, but I would think that changing the type of column name would be a massive compatibility breaker. And I really don't see a not crappy way to do it. Well, actually I do, but it consists in revamping the API completely, ditching the super columns in favor of composite column name. Don't get me wrong, I'm 100% in, but it's well beyond this particular ticket and I think that as long as we have a generic and simple enough encoding for those composite type, I believe it's ok to include this as is and just have a decent description of the encoding for client to use. We can still branch that later on on thrift anyway.

          (2) creating a DynamicType that could be used for dynamic types

          I do agree with you that it could be useful and I understand the use case. I also agree that whatever we do, we should make it clear that it is not beginner stuff and that you shouldn't use a dynamic composite type unless you really need it (since the possibility of shooting yourself in the foot if you don't know what you do is fairly high). All this being said, I took a stab at a DynamicCompositeType and I think the result really makes sense.

          The main idea is really just to modify my static CompositeType so that each component starts with the name of the comparator class (that way, this is again completely generic, custom comparator can be used without problems). There is then 2 optimisations to make it a reasonable solution:

          1. FBUtilities.getComparator() uses a cache, to avoid the cost of introspection.
          2. There is a notion of alias. They are user defined and are local to each DynamicCompositeType (cf. test/conf/cassandra.yaml in the patch). An alias is a simple (printable) ascii character (there is around 100 possible alias which should be aplenty). You can then use those alias in the encoded column name, in which case you only need 2 bytes to encode the comparator, making it fairly compact.
            The encoded format is very close to the one of the static CompositeType, even though it is a tiny bit more complex due to the aliases handling. But given that dynamic composite type is not meant for the faint of heart, that should be fine.

          I'm attaching a new patch (against 0.7 branch) with both the static and dynamic composite type. The bulk of the code is shared between them. This also include a fair amount of unit tests.

          Show
          Sylvain Lebresne added a comment - (1) having validation only require that the components, if they're provided, are of the correct type (looking at Sylvain's code, the compare function should work, but the validation wouldn't) That's a bug. The last test of validate() should be negated. The only goal was forbid having leading bytes after all the comparator have been used (because compare makes the assumption that there will not have bytes without a proper comparator for them). However it is clear that we should allow a key having less component than comparator (btw, the unit test in my patch was actually doing it). This is fixed in the patch I'm attaching. My other concern is that one of the things that got stripped out was the MATCH_MINIMUM, MATCH_MAXIMUM feature, which made implementing greater-than-equals, less-than-equals, etc. in ranges much easier I'm not familiar enough with your code to be sure what those were doing exactly, but I suppose that the goal is that, if for instance you have the following column names: test:bar:42 test:foo:24 test:foo:42 test:foobar then you want to be able to query anything that starts with the two component test and foo (that is test:foo:24 and test:foo:42 in my example). Greater-than is already doable in my previous patch (up to the bug in validation). For the less-than part, I agree that it is nice to be able to do it easily. In my new patch, I add a leading byte to each component, whose purpose is to always be 0, except for lesser-than query. That way, you can do the query above easily. The price is a slightly more complicated encoding but I think it's totally worth it. While i agree in theory, this is why thrift exists right? I don't know thrift as well as you do, but I would think that changing the type of column name would be a massive compatibility breaker. And I really don't see a not crappy way to do it. Well, actually I do, but it consists in revamping the API completely, ditching the super columns in favor of composite column name. Don't get me wrong, I'm 100% in, but it's well beyond this particular ticket and I think that as long as we have a generic and simple enough encoding for those composite type, I believe it's ok to include this as is and just have a decent description of the encoding for client to use. We can still branch that later on on thrift anyway. (2) creating a DynamicType that could be used for dynamic types I do agree with you that it could be useful and I understand the use case. I also agree that whatever we do, we should make it clear that it is not beginner stuff and that you shouldn't use a dynamic composite type unless you really need it (since the possibility of shooting yourself in the foot if you don't know what you do is fairly high). All this being said, I took a stab at a DynamicCompositeType and I think the result really makes sense. The main idea is really just to modify my static CompositeType so that each component starts with the name of the comparator class (that way, this is again completely generic, custom comparator can be used without problems). There is then 2 optimisations to make it a reasonable solution: FBUtilities.getComparator() uses a cache, to avoid the cost of introspection. There is a notion of alias. They are user defined and are local to each DynamicCompositeType (cf. test/conf/cassandra.yaml in the patch). An alias is a simple (printable) ascii character (there is around 100 possible alias which should be aplenty). You can then use those alias in the encoded column name, in which case you only need 2 bytes to encode the comparator, making it fairly compact. The encoded format is very close to the one of the static CompositeType, even though it is a tiny bit more complex due to the aliases handling. But given that dynamic composite type is not meant for the faint of heart, that should be fine. I'm attaching a new patch (against 0.7 branch) with both the static and dynamic composite type. The bulk of the code is shared between them. This also include a fair amount of unit tests.
          Hide
          Todd Nine added a comment -

          As per Ed's previous comment. I use over 30 different indexing schemes on our data currently in my JDO code. The ultimate goal for this feature is to support the implementation of a JPA framework that works similarly to GAE.

          Having the ability to build the indexes that a user specifies without dynamically creating CFs is a must have for us. There are a lot of issues surrounding the complexity of building the index itself in the plugin that are outside the scope of this issue. However we don't really have a comparator mechanism to support these types of indexes.

          In all use cases we defined, our searches and therefore indexes need an order by clause as well as a query criteria to support the paging that most applications will require. This order could simply be a natural ordering of entity keys, or it could be on specific properties of the related entity. As applications grow in size, so will the complexity and number of indexes to support it, I'm concerned creating this many CFs could cause serious issues. This doesn't have to be a well advertised feature for the end user to create CFs with, but I feel very strongly that a dynamic type for CF's is a must in order to proceed with the JPA plugin we've been designing.

          Show
          Todd Nine added a comment - As per Ed's previous comment. I use over 30 different indexing schemes on our data currently in my JDO code. The ultimate goal for this feature is to support the implementation of a JPA framework that works similarly to GAE. Having the ability to build the indexes that a user specifies without dynamically creating CFs is a must have for us. There are a lot of issues surrounding the complexity of building the index itself in the plugin that are outside the scope of this issue. However we don't really have a comparator mechanism to support these types of indexes. In all use cases we defined, our searches and therefore indexes need an order by clause as well as a query criteria to support the paging that most applications will require. This order could simply be a natural ordering of entity keys, or it could be on specific properties of the related entity. As applications grow in size, so will the complexity and number of indexes to support it, I'm concerned creating this many CFs could cause serious issues. This doesn't have to be a well advertised feature for the end user to create CFs with, but I feel very strongly that a dynamic type for CF's is a must in order to proceed with the JPA plugin we've been designing.
          Hide
          Ed Anuff added a comment -

          Why? How many indexes are you creating?

          Do you mean how many indexes or how many index types? Lots of relatively small indexes, one index potentially for every relationship, but I'm not sure that's what you meant. In terms of index types, without a dynamic capability, then if I want to create an index on integer values, that's one CF, if I want to create an index on string values, that's another CF, if I want to create an index sorted first by lastname, then by firstname, that's another CF. I tried that approach and it made for some fairly convoluted code, but more concerning, I had close to 20 CFs, since maintaining a CF index requires at least one other CFs to store related metadata. I was able to consolidate that down to about 4 CFs, much more manageable code and Cassandra became a much happier camper.

          Show
          Ed Anuff added a comment - Why? How many indexes are you creating? Do you mean how many indexes or how many index types? Lots of relatively small indexes, one index potentially for every relationship, but I'm not sure that's what you meant. In terms of index types, without a dynamic capability, then if I want to create an index on integer values, that's one CF, if I want to create an index on string values, that's another CF, if I want to create an index sorted first by lastname, then by firstname, that's another CF. I tried that approach and it made for some fairly convoluted code, but more concerning, I had close to 20 CFs, since maintaining a CF index requires at least one other CFs to store related metadata. I was able to consolidate that down to about 4 CFs, much more manageable code and Cassandra became a much happier camper.
          Hide
          Jonathan Ellis added a comment -

          if for every index type we want to add, we need to add a new CF, this will quickly become as infeasible (this is inherently why we cannot use 2ndary indexes to begin with)

          Why? How many indexes are you creating?

          Show
          Jonathan Ellis added a comment - if for every index type we want to add, we need to add a new CF, this will quickly become as infeasible (this is inherently why we cannot use 2ndary indexes to begin with) Why? How many indexes are you creating?
          Hide
          Nate McCall added a comment -

          I see what tjake means here in that we want to duplicate the structure of a thrift message for the column name, but that is essentially the flexibility we need in order to make JPA relationships "work" over even a moderate number of entities. Eg. if for every index type we want to add, we need to add a new CF, this will quickly become as infeasible (this is inherently why we cannot use 2ndary indexes to begin with).

          Per Ed's last comment, how about a combination of Sylvain's approach with a more dynamic encoding scheme that included a 'type' field between the length and the data?

          Show
          Nate McCall added a comment - I see what tjake means here in that we want to duplicate the structure of a thrift message for the column name, but that is essentially the flexibility we need in order to make JPA relationships "work" over even a moderate number of entities. Eg. if for every index type we want to add, we need to add a new CF, this will quickly become as infeasible (this is inherently why we cannot use 2ndary indexes to begin with). Per Ed's last comment, how about a combination of Sylvain's approach with a more dynamic encoding scheme that included a 'type' field between the length and the data?
          Hide
          Ed Anuff added a comment -

          Ok, I'll let the "hacky" part slide. Sylvain's approach is much more elegant in how it's declared, and using the existing comparers is nice and something I did initially until I decided I'd rather have everything under one roof for performance and maintainability as well as in order to add some things to make queries easier. So, I'm in agreement with the way he's proposing.

          The central question here is does the final implementation provide sufficiently flexibility for index building. That's why a composite type is necessary. I have similar concerns to Todd, losing the dynamic capability is actually bit more of a problem the more I think about it, I'd look to solve that in two ways (1) having validation only require that the components, if they're provided, are of the correct type (looking at Sylvain's code, the compare function should work, but the validation wouldn't), and (2) creating a DynamicType that could be used for dynamic types. Given that a number of the languages that are talking to Cassandra are dynamic languages and even people using Java, like me, might be using JSON types, some form of dynamic support is a good idea, but I'd be happy to separate that into a different comparator. My other concern is that one of the things that got stripped out was the MATCH_MINIMUM, MATCH_MAXIMUM feature, which made implementing greater-than-equals, less-than-equals, etc. in ranges much easier. That might be the wrong place to implement that, though, and might be a bit of a hack.

          We've got somewhat of a challenge at the application tier, which we're seeing in the Hector project, in terms of providing a uniform way to do the type of indexing needed for ORM, which the current secondary indexes just don't satisfy. The benefit of why we need a sufficiently flexible composite mechanism in core is because we're already assuming the capability in order to implement this stuff. It doesn't have to be my code or my format, but it really should meet the needs of the app builders. Might a more forgiving version of Sylvain's CompositeType plus a new DynamicType be the way to go?

          Show
          Ed Anuff added a comment - Ok, I'll let the "hacky" part slide. Sylvain's approach is much more elegant in how it's declared, and using the existing comparers is nice and something I did initially until I decided I'd rather have everything under one roof for performance and maintainability as well as in order to add some things to make queries easier. So, I'm in agreement with the way he's proposing. The central question here is does the final implementation provide sufficiently flexibility for index building. That's why a composite type is necessary. I have similar concerns to Todd, losing the dynamic capability is actually bit more of a problem the more I think about it, I'd look to solve that in two ways (1) having validation only require that the components, if they're provided, are of the correct type (looking at Sylvain's code, the compare function should work, but the validation wouldn't), and (2) creating a DynamicType that could be used for dynamic types. Given that a number of the languages that are talking to Cassandra are dynamic languages and even people using Java, like me, might be using JSON types, some form of dynamic support is a good idea, but I'd be happy to separate that into a different comparator. My other concern is that one of the things that got stripped out was the MATCH_MINIMUM, MATCH_MAXIMUM feature, which made implementing greater-than-equals, less-than-equals, etc. in ranges much easier. That might be the wrong place to implement that, though, and might be a bit of a hack. We've got somewhat of a challenge at the application tier, which we're seeing in the Hector project, in terms of providing a uniform way to do the type of indexing needed for ORM, which the current secondary indexes just don't satisfy. The benefit of why we need a sufficiently flexible composite mechanism in core is because we're already assuming the capability in order to implement this stuff. It doesn't have to be my code or my format, but it really should meet the needs of the app builders. Might a more forgiving version of Sylvain's CompositeType plus a new DynamicType be the way to go?
          Hide
          T Jake Luciani added a comment -

          While i agree in theory, this is why thrift exists right? I suppose this could be done later if it was a issue so I'm +0

          Show
          T Jake Luciani added a comment - While i agree in theory, this is why thrift exists right? I suppose this could be done later if it was a issue so I'm +0
          Hide
          Sylvain Lebresne added a comment -

          I would think to add a proper composite type we would need to change the thrift api. so a row key becomes an array rather than a binary. Otherwise the onus is on the clients to adhere to the composite internal encoding rules (across all languages)

          Actually we still kind of impose some encoding rule with our comparators (IntegerType assumes little-endian encoding). But I agree that those should be as straightforward as could be. That's why I've tried to keep the encoding as simple as possible. It's

          <length of part1><part1><length of part2><part2>....
          

          I don't think this get much more simple and I think this is a ok burden on the client.

          Show
          Sylvain Lebresne added a comment - I would think to add a proper composite type we would need to change the thrift api. so a row key becomes an array rather than a binary. Otherwise the onus is on the clients to adhere to the composite internal encoding rules (across all languages) Actually we still kind of impose some encoding rule with our comparators (IntegerType assumes little-endian encoding). But I agree that those should be as straightforward as could be. That's why I've tried to keep the encoding as simple as possible. It's <length of part1><part1><length of part2><part2>.... I don't think this get much more simple and I think this is a ok burden on the client.
          Hide
          Todd Nine added a comment -

          I do see your point. However, as a user and a developer of a JPA plugin take this use case for an example of our requriements

          Entities: User -> Vehicles

          A User has a collection of vehicles. These can be ordered in several ways. For this example, lets say by time created and name. This would lead to several different column comparators. For properties on the user, first name, last name, these would be UTF8 columns, for the collections, these would be composite. I would end up with columns of the following definitions to allow us to quickly load or them.

          UTF8(firstname):value
          UTF8(lastname):value
          UTF8(vehicletime) LONG(vehicleTime) TIMEUUID(VEHICLEID) UTF8(VehicleProp) : value
          UTF8(vehiclename) UTF8(vehicleName) TIMEUUID(VEHICLEID) UTF8(VehicpeProp) : value

          In the event we move this index to an external CF, we lose the serialization scope and guaranteed atomic write of using a single row key during a mutation. While this is a separate issue as on Cassandra in general, we do need the ability to work around this until serialization scope can extend over multiple rows. Without this, to ensure that "lost writes" don't occur on indexing, a user would need to introduce an extra system such as Zookeeper.

          https://issues.apache.org/jira/browse/CASSANDRA-1684

          Show
          Todd Nine added a comment - I do see your point. However, as a user and a developer of a JPA plugin take this use case for an example of our requriements Entities: User -> Vehicles A User has a collection of vehicles. These can be ordered in several ways. For this example, lets say by time created and name. This would lead to several different column comparators. For properties on the user, first name, last name, these would be UTF8 columns, for the collections, these would be composite. I would end up with columns of the following definitions to allow us to quickly load or them. UTF8(firstname):value UTF8(lastname):value UTF8(vehicletime) LONG(vehicleTime) TIMEUUID(VEHICLEID) UTF8(VehicleProp) : value UTF8(vehiclename) UTF8(vehicleName) TIMEUUID(VEHICLEID) UTF8(VehicpeProp) : value In the event we move this index to an external CF, we lose the serialization scope and guaranteed atomic write of using a single row key during a mutation. While this is a separate issue as on Cassandra in general, we do need the ability to work around this until serialization scope can extend over multiple rows. Without this, to ensure that "lost writes" don't occur on indexing, a user would need to introduce an extra system such as Zookeeper. https://issues.apache.org/jira/browse/CASSANDRA-1684
          Hide
          T Jake Luciani added a comment -

          I would think to add a proper composite type we would need to change the thrift api. so a row key becomes an array rather than a binary. Otherwise the onus is on the clients to adhere to the composite internal encoding rules (across all languages)

          Show
          T Jake Luciani added a comment - I would think to add a proper composite type we would need to change the thrift api. so a row key becomes an array rather than a binary. Otherwise the onus is on the clients to adhere to the composite internal encoding rules (across all languages)
          Hide
          Jonathan Ellis added a comment -

          I'm not a fan of encouraging people to break the practice of "each row in a CF has the same schema." If power users like Ed or Todd want to do that, that's fine, that's why we made it pluggable, but I'm -0 on shipping that functionality as standard too.

          I'm more interested in the "getting rid of supercolumns" use case.

          Show
          Jonathan Ellis added a comment - I'm not a fan of encouraging people to break the practice of "each row in a CF has the same schema." If power users like Ed or Todd want to do that, that's fine, that's why we made it pluggable, but I'm -0 on shipping that functionality as standard too. I'm more interested in the "getting rid of supercolumns" use case.
          Hide
          Sylvain Lebresne added a comment -

          To be perfectly honest, I feel Ed's patch is a bit hacky to go into cassandra core (don't get me wrong, it's a neat hack). In particular the format of the column names is fairly complex and would have to be updated each time we want to add a comparator (not that add those every other day but still).
          I'm moreover not sure it gains much to go into cassandra tree, outside of putting the burden of maintaining it on Cassandra developers instead of Ed (and it's 1300 lines to maintain).
          I don't know, I'm kind of +0 on adding Ed's comparator.

          Show
          Sylvain Lebresne added a comment - To be perfectly honest, I feel Ed's patch is a bit hacky to go into cassandra core (don't get me wrong, it's a neat hack). In particular the format of the column names is fairly complex and would have to be updated each time we want to add a comparator (not that add those every other day but still). I'm moreover not sure it gains much to go into cassandra tree, outside of putting the burden of maintaining it on Cassandra developers instead of Ed (and it's 1300 lines to maintain). I don't know, I'm kind of +0 on adding Ed's comparator.
          Hide
          Todd Nine added a comment - - edited

          Enforcing all columns to be the same would break our indexing. Each row key is a different index, and the columns within that index are composed of different composite types. If this was enforced at the CF level, we would require a different CF for each index. Is it possible to allow both static and dynamic types by creating 2 composite index types? I.E StaticComposite using your patch and DynamicComposite using Eds?

          Show
          Todd Nine added a comment - - edited Enforcing all columns to be the same would break our indexing. Each row key is a different index, and the columns within that index are composed of different composite types. If this was enforced at the CF level, we would require a different CF for each index. Is it possible to allow both static and dynamic types by creating 2 composite index types? I.E StaticComposite using your patch and DynamicComposite using Eds?
          Hide
          Ed Anuff added a comment -

          This makes sense. In practice, I've made use of the fact that the embedded types were dynamic to arbitrarily store additional metadata in the column names, which this is going to preclude due to being strongly typed, but I think that can be worked around here.

          Show
          Ed Anuff added a comment - This makes sense. In practice, I've made use of the fact that the embedded types were dynamic to arbitrarily store additional metadata in the column names, which this is going to preclude due to being strongly typed, but I think that can be worked around here.
          Hide
          Sylvain Lebresne added a comment -

          I'm completely in favor of a CompositeType build-in. And the attached class is neat, but if we're going to add that to Cassandra core, we have some liberty that allows us to do this much more simply and in a more powerful way.

          The idea is to make the user declare which AbstracType the CompositeType is composed of. That is, we can do stuff like declaring in the yaml:

            - compare_with: CompositeType(BytesType, TimeUUIDType, IntegerType)
          

          A CompositeType column name has a number of components (3 above) and you say that each component is 2 bytes for the component bytes followed by the component bytes. You don't have to remember in the serialized form what is the type of each component since it is declared upfront.

          I'm attaching a patch (against 0.7) that implements just that. The advantages are:

          • it's simpler than the attached patch (much less line of code)
          • it's fully generic. You can use any comparator for the sub-component,
            including custom ones.
          • the serialized format of the column names is much simpler, it will be much
            easier for client to add support for this. And there is no reason it will
            ever change since it's already fully generic.
          • it's typed, so you have actual column name validation.
          Show
          Sylvain Lebresne added a comment - I'm completely in favor of a CompositeType build-in. And the attached class is neat, but if we're going to add that to Cassandra core, we have some liberty that allows us to do this much more simply and in a more powerful way. The idea is to make the user declare which AbstracType the CompositeType is composed of. That is, we can do stuff like declaring in the yaml: - compare_with: CompositeType(BytesType, TimeUUIDType, IntegerType) A CompositeType column name has a number of components (3 above) and you say that each component is 2 bytes for the component bytes followed by the component bytes. You don't have to remember in the serialized form what is the type of each component since it is declared upfront. I'm attaching a patch (against 0.7) that implements just that. The advantages are: it's simpler than the attached patch (much less line of code) it's fully generic. You can use any comparator for the sub-component, including custom ones. the serialized format of the column names is much simpler, it will be much easier for client to add support for this. And there is no reason it will ever change since it's already fully generic. it's typed, so you have actual column name validation.
          Show
          Ed Anuff added a comment - https://github.com/edanuff/CassandraCompositeType

            People

            • Assignee:
              Sylvain Lebresne
              Reporter:
              Ed Anuff
              Reviewer:
              Ed Anuff
            • Votes:
              10 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development