Cassandra
  1. Cassandra
  2. CASSANDRA-3625

Do something about DynamicCompositeType

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 1.0.7, 1.1.0
    • Component/s: Core
    • Labels:
      None

      Description

      Currently, DynamicCompositeType is a super dangerous type. We cannot leave it that way or people will get hurt.

      Let's recall that DynamicCompositeType allows composite column names without any limitation on what each component type can be. It was added to basically allow to use different rows of the same column family to each store a different index. So for instance you would have:

      index1: {
        "bar":24 -> someval
        "bar":42 -> someval
        "foo":12 -> someval
        ...
      }
      index2: {
        0:uuid1:3.2 -> someval
        1:uuid2:2.2 -> someval
        ...
      }
      ....
      

      where index1, index2, ... are rows.
      So each row have columns whose names have similar structure (so they can be compared), but between rows the structure can be different (we neve compare two columns from two different rows).

      But the problem is the following: what happens if in the index1 row above, you insert a column whose name is 0:uuid1 ? There is no really meaningful way to compare "bar":24 and 0:uuid1. The current implementation of DynamicCompositeType, when confronted with this, says that it is a user error and throw a MarshalException.
      The problem with that is that the exception is not throw at insert time, and it cannot be because of the dynamic nature of the comparator. But that means that if you do insert the wrong column in the wrong row, you end up corrupting a sstable.

      It is too dangerous a behavior. And it's probably made worst by the fact that some people probably think that DynamicCompositeType should be superior to CompositeType since you know, it's dynamic.

      One solution to that problem could be to decide of some random (but predictable) order between two incomparable component. For example we could design that IntType < LongType < StringType ...

      Note that even if we do that, I would suggest renaming the DynamicCompositeType to something that suggest that CompositeType is always preferable to DynamicCompositeType unless you're really doing very advanced stuffs.

      Opinions?

        Activity

        Hide
        Ed Anuff added a comment -

        Ok, cool, just wanted to make sure we weren't going to see a variant on the original problem.

        +1

        Show
        Ed Anuff added a comment - Ok, cool, just wanted to make sure we weren't going to see a variant on the original problem. +1
        Hide
        Sylvain Lebresne added a comment -

        Shouldn't be a problem. FixedValueComparator is only ever return by DCT.getNextComparator(int, ByteBuffer, ByteBuffer), which is only called during compare() but not any other method. getString() and the like should only ever care about the actual comparator for the component. Getting FixedValueComparator for them would be a bug, so UnsupportedOperationException seems appropriate.

        Show
        Sylvain Lebresne added a comment - Shouldn't be a problem. FixedValueComparator is only ever return by DCT.getNextComparator(int, ByteBuffer, ByteBuffer), which is only called during compare() but not any other method. getString() and the like should only ever care about the actual comparator for the component. Getting FixedValueComparator for them would be a bug, so UnsupportedOperationException seems appropriate.
        Hide
        Ed Anuff added a comment -

        Are the UnsupportedOperationExceptions thrown by the FixedValueComparator for compose(), decompose(), getString(), and validate() going to cause any problems? getString() looks like it might get called.

        Show
        Ed Anuff added a comment - Are the UnsupportedOperationExceptions thrown by the FixedValueComparator for compose(), decompose(), getString(), and validate() going to cause any problems? getString() looks like it might get called.
        Hide
        Sylvain Lebresne added a comment -

        is there a test case for it?

        There wasn't, but I just added a simple unit test if that was the question.

        Show
        Sylvain Lebresne added a comment - is there a test case for it? There wasn't, but I just added a simple unit test if that was the question.
        Hide
        Ed Anuff added a comment -

        Thanks for taking this one on, I'd hoped to get a chance to work on it but wasn't able to. This looks like it will solve the problem nicely by doing the comparison on classnames. I'll try it out, is there a test case for it?

        Show
        Ed Anuff added a comment - Thanks for taking this one on, I'd hoped to get a chance to work on it but wasn't able to. This looks like it will solve the problem nicely by doing the comparison on classnames. I'll try it out, is there a test case for it?
        Hide
        Sylvain Lebresne added a comment -

        Ok, I took a shot at that since it'd be nice to get that fixed soon. Patch is against 1.0 (though I don't think there has been any chance to DCT since 0.8).

        Show
        Sylvain Lebresne added a comment - Ok, I took a shot at that since it'd be nice to get that fixed soon. Patch is against 1.0 (though I don't think there has been any chance to DCT since 0.8).
        Hide
        Sylvain Lebresne added a comment -

        Right, but I thought we were positing that You Shouldn't Do That. In which case as long as it doesn't crash, I'm good

        Without going in the debate of how useful or not that is, I think that as soon as it is allowed (to mix in the same row column with components of different types), some will do it, so we'd rather choose the more "coherent" solution and so I also prefer fixing some order on the types themselves and use that.

        As for picking the actual sort between types, I'd prefer avoiding a hash as it's not a very good uses for hash imo (I don't want to bother about collision, as unlikely as it is). But using the alias character (and falling back to good old string comparison on the class name if there is no alias) seems fine to me.

        Sylvain, are you working on this one or would you like me to take a stab at it?

        I haven't started writing anything so feel free to give it a shot.

        Show
        Sylvain Lebresne added a comment - Right, but I thought we were positing that You Shouldn't Do That. In which case as long as it doesn't crash, I'm good Without going in the debate of how useful or not that is, I think that as soon as it is allowed (to mix in the same row column with components of different types), some will do it, so we'd rather choose the more "coherent" solution and so I also prefer fixing some order on the types themselves and use that. As for picking the actual sort between types, I'd prefer avoiding a hash as it's not a very good uses for hash imo (I don't want to bother about collision, as unlikely as it is). But using the alias character (and falling back to good old string comparison on the class name if there is no alias) seems fine to me. Sylvain, are you working on this one or would you like me to take a stab at it? I haven't started writing anything so feel free to give it a shot.
        Hide
        Ed Anuff added a comment -

        I'm not sure we need a longer term solution than what I'm proposing. I think we're all in agreement that throwing the exception the way it's doing now is bad and that a deterministic though not necessarily transparent sort behavior is the best solution. Sylvain, are you working on this one or would you like me to take a stab at it?

        Show
        Ed Anuff added a comment - I'm not sure we need a longer term solution than what I'm proposing. I think we're all in agreement that throwing the exception the way it's doing now is bad and that a deterministic though not necessarily transparent sort behavior is the best solution. Sylvain, are you working on this one or would you like me to take a stab at it?
        Hide
        Boris Yen added a comment - - edited

        Not sure if this is doable or has any issue.

        I am thinking why not mimic that way the secondary index is implemented right now. Create one extra column family for keeping track of the comparators for each row. So, whenever a column is inserted to the cassandra, the cassandra needs to read before write to make sure the new column is valid. This would sacrifice the write performance of dynamicComposite column family, but at least it allows the cassandra to perform the validation before the actual write.

        Show
        Boris Yen added a comment - - edited Not sure if this is doable or has any issue. I am thinking why not mimic that way the secondary index is implemented right now. Create one extra column family for keeping track of the comparators for each row. So, whenever a column is inserted to the cassandra, the cassandra needs to read before write to make sure the new column is valid. This would sacrifice the write performance of dynamicComposite column family, but at least it allows the cassandra to perform the validation before the actual write.
        Hide
        Matt Stump added a comment -

        Until a long term solution is found, would it be possible to get something in the documentation warning people away from DynamicCompositeType? It was featured rather prominently in Ed's talk so people may mistakingly believe that DynamicCompositeType is the preferred method to create dynamic indexes.

        Show
        Matt Stump added a comment - Until a long term solution is found, would it be possible to get something in the documentation warning people away from DynamicCompositeType? It was featured rather prominently in Ed's talk so people may mistakingly believe that DynamicCompositeType is the preferred method to create dynamic indexes.
        Hide
        Ed Anuff added a comment -

        I'm not positing that at all, I can think of a number of good reasons why it can happen and is even desirable. I'd really strongly urge we do the compare on the component type. I don't think the fix is any more complicated and it will be much preferable from a data modelling standpoint.

        Show
        Ed Anuff added a comment - I'm not positing that at all, I can think of a number of good reasons why it can happen and is even desirable. I'd really strongly urge we do the compare on the component type. I don't think the fix is any more complicated and it will be much preferable from a data modelling standpoint.
        Hide
        Jonathan Ellis added a comment -

        For example, a row that had dynamic composite columns ("ed",5), ("jonathan",6), and (103, 32), that was sliced from ("ed") to ("jonathan") could have the (103, 32) in the middle

        Right, but I thought we were positing that You Shouldn't Do That. In which case as long as it doesn't crash, I'm good.

        Show
        Jonathan Ellis added a comment - For example, a row that had dynamic composite columns ("ed",5), ("jonathan",6), and (103, 32), that was sliced from ("ed") to ("jonathan") could have the (103, 32) in the middle Right, but I thought we were positing that You Shouldn't Do That. In which case as long as it doesn't crash, I'm good.
        Hide
        Ed Anuff added a comment -

        Each component in a composite consists of a type (either an alias byte or a Cassandra comparator type name) and the value. I'm suggesting doing a compare on the type in the case of types not being equivalent. The comparison could be a lexical compare or a hash comparison. I think doing the compare on the component type is better since the purpose of the composite is for slices and if we do a lexical compare of the component values then the slices are going to have weird results in the middle of them. For example, a row that had dynamic composite columns ("ed",5), ("jonathan",6), and (103, 32), that was sliced from ("ed") to ("jonathan") could have the (103, 32) in the middle. If we compare on the type, then that never happens.

        Show
        Ed Anuff added a comment - Each component in a composite consists of a type (either an alias byte or a Cassandra comparator type name) and the value. I'm suggesting doing a compare on the type in the case of types not being equivalent. The comparison could be a lexical compare or a hash comparison. I think doing the compare on the component type is better since the purpose of the composite is for slices and if we do a lexical compare of the component values then the slices are going to have weird results in the middle of them. For example, a row that had dynamic composite columns ("ed",5), ("jonathan",6), and (103, 32), that was sliced from ("ed") to ("jonathan") could have the (103, 32) in the middle. If we compare on the type, then that never happens.
        Hide
        Jonathan Ellis added a comment -

        One solution to that problem could be to decide of some random (but predictable) order between two incomparable component.

        That's the most straightforward suggestion IMO.

        I suggested we use the alias character byte or a hash of the classname

        Couldn't we just fall back to lexical sorting for non-comparable types? Might as well keep it simple.

        Show
        Jonathan Ellis added a comment - One solution to that problem could be to decide of some random (but predictable) order between two incomparable component. That's the most straightforward suggestion IMO. I suggested we use the alias character byte or a hash of the classname Couldn't we just fall back to lexical sorting for non-comparable types? Might as well keep it simple.
        Hide
        Ed Anuff added a comment -

        I don't think you mean "random but predictable" so much as "deterministic but opaque" in your description of the correct behavior.

        I raised this issue with DynamicCompositeType when it was introduced and I suggested we use the alias character byte or a hash of the classname (see https://issues.apache.org/jira/browse/CASSANDRA-2231#comment-13002170 ). I still think that's the best approach.

        Show
        Ed Anuff added a comment - I don't think you mean "random but predictable" so much as "deterministic but opaque" in your description of the correct behavior. I raised this issue with DynamicCompositeType when it was introduced and I suggested we use the alias character byte or a hash of the classname (see https://issues.apache.org/jira/browse/CASSANDRA-2231#comment-13002170 ). I still think that's the best approach.

          People

          • Assignee:
            Sylvain Lebresne
            Reporter:
            Sylvain Lebresne
            Reviewer:
            Ed Anuff
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development