Cassandra
  1. Cassandra
  2. CASSANDRA-4936

Less than operator when comparing timeuuids behaves as less than equal.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.1
    • Component/s: Core
    • Labels:
      None
    • Environment:

      Linux CentOS.

      Linux localhost.localdomain 2.6.18-308.16.1.el5 #1 SMP Tue Oct 2 22:01:37 EDT 2012 i686 i686 i386 GNU/Linux

      Description

      If we define the following column family using CQL3:

      CREATE TABLE useractivity (
      user_id int,
      activity_id 'TimeUUIDType',
      data text,
      PRIMARY KEY (user_id, activity_id)
      );

      Add some values to it.

      And then query it like:

      SELECT * FROM useractivity WHERE user_id = '3' AND activity_id < '2012-11-07 18:18:22-0800' ORDER BY activity_id DESC LIMIT 1;

      the record with timeuuid '2012-11-07 18:18:22-0800' returns in the results.

      According to the documentation, on CQL3 the '<' and '>' operators are strict, meaning not inclusive, so this seems to be a bug.

      1. 4936.txt
        24 kB
        Sylvain Lebresne

        Activity

        Cesar Lopez-Nataren created issue -
        Hide
        Jonathan Ellis added a comment -

        Are you sure the UUIDs do not differ in their non-time-based component? See CASSANDRA-4284.

        Show
        Jonathan Ellis added a comment - Are you sure the UUIDs do not differ in their non-time-based component? See CASSANDRA-4284 .
        Hide
        Sylvain Lebresne added a comment -

        This reminds me, I think that at the minimum we should provide a cqlsh option to display timeuuid as UUID rather than date. Otherwise it's painful to even check that the actual UUID differs. Though tbh I'm still not sure using dates by default is the best idea given this is a lossy representation and I'm afraid will confuse more people not used to type 1 UUID than anything else. Unless of course we adopt of the the format discussed in CASSANDRA-4284 (alternatively, we could just display the date and the uuid in parenthesis, though it'll be verbose).

        Show
        Sylvain Lebresne added a comment - This reminds me, I think that at the minimum we should provide a cqlsh option to display timeuuid as UUID rather than date. Otherwise it's painful to even check that the actual UUID differs. Though tbh I'm still not sure using dates by default is the best idea given this is a lossy representation and I'm afraid will confuse more people not used to type 1 UUID than anything else. Unless of course we adopt of the the format discussed in CASSANDRA-4284 (alternatively, we could just display the date and the uuid in parenthesis, though it'll be verbose).
        Hide
        Cesar Lopez-Nataren added a comment -

        Jonathan: What's the easiest way to find that out?

        Show
        Cesar Lopez-Nataren added a comment - Jonathan: What's the easiest way to find that out?
        Hide
        Tyler Hobbs added a comment -

        Are you sure the UUIDs do not differ in their non-time-based component?

        (Note: I suggested that Cesar open this ticket.) It almost certainly is a problem with the non-time-based component. However, the intention of the user is pretty clear for this query: select anything where the timestamp component is less than the provided one.

        I've been handling this specially in pycassa for quite a while. If the comparator is TimeUUID, and a datetime/timestamp is passed as a slice end, it will create a TimeUUID with special non-timestamp components. For example, since slice ends are inclusive in the thrift api, when creating the TimeUUID for the slice 'finish', the highest possible TimeUUID with the given timestamp will be used.

        We could do something similar with CQL, it would just require passing along information about whether to create the highest or lowest TimeUUID representation for a given datestamp based on the comparison operator that's used.

        If details about what non-timestamp components pycassa uses would be useful, let me know.

        Show
        Tyler Hobbs added a comment - Are you sure the UUIDs do not differ in their non-time-based component? (Note: I suggested that Cesar open this ticket.) It almost certainly is a problem with the non-time-based component. However, the intention of the user is pretty clear for this query: select anything where the timestamp component is less than the provided one. I've been handling this specially in pycassa for quite a while. If the comparator is TimeUUID, and a datetime/timestamp is passed as a slice end, it will create a TimeUUID with special non-timestamp components. For example, since slice ends are inclusive in the thrift api, when creating the TimeUUID for the slice 'finish', the highest possible TimeUUID with the given timestamp will be used. We could do something similar with CQL, it would just require passing along information about whether to create the highest or lowest TimeUUID representation for a given datestamp based on the comparison operator that's used. If details about what non-timestamp components pycassa uses would be useful, let me know.
        Jonathan Ellis made changes -
        Field Original Value New Value
        Fix Version/s 1.2.2 [ 12323924 ]
        Affects Version/s 0.8.0 [ 12316403 ]
        Affects Version/s 1.1.6 [ 12323257 ]
        Hide
        Sylvain Lebresne added a comment - - edited

        I really think the bug here is that TimeUUIDType.fromString() accepts a date as input. But a date is not a valid representation of a timeuuid, and the fromString method does arbitrarily pick some 0's for parts of the resulting UUID.

        In other words, the SELECT query above should be invalid.

        Now don't get me wrong, selecting timeuuid based on dates is useful but that is a slightly different problem. So what I think we should do is:

        1. refuse dates as valid timeuuid values because they just are not.
        2. add convenience methods to translate dates to precise timeuuid. For querying we would have 'startOf()' and 'endOf()' (where 'startOf(<date> )' (resp. 'endOf(<date> )') would return the smallest (resp. biggest) possible timeuuid at time <date> ). And for insertion we could optionally add 'random(<date> )' that would return a random timeuuid at time '<date>' (we could even accept 'now' as syntactic sugar for 'random(now)' if we feel like it).

        That would also mean that cqlsh should stop this non-sense of displaying timeuuid like date. Again, I understand the intention of making it more readable but this will confuse generations of CQL3 users. I do am in favor of finding a non confusing way to make it readable for users. In fact one solution could be to handle that on the CQL side and to allow 'SELECT dateOf(x ) FROM ...' that would return a date string from timeuuid x (but now it's clear, you've explicitly asked for a lossy representation of x).

        I note that this suggestion pretty much fixes the problem discussed in CASSANDRA-4284 too.

        I note that Tyler's solution of basically automatically generating the startOf() and endOf() method under the cover based on whether we've use an inclusive of exclusive operation may appear seductive but I don't think we should do that because:

        1. if you do that, what about SELECT ... WHERE activity_id = '2012-11-07 18:18:22-0800'. You still have no solution for that and by doing magic under the carpet for < and >, you've in fact blurred what = really does.
        2. "it would just require passing along information about whether to create the highest or lowest TimeUUID representation for a given datestamp based on the comparison operator that's used" <- while this seem simple on principle, this will yield very very ugly special cases internally. This is not 2 lines of code.
        3. more generally, this doesn't solve the fact that date are not valid representation of timeuuid. For example, I still think the first point mentioned in CASSANDRA-4284 is a bug in its own right.

        Allowing dates as valid representation of timeuuid is a bug, let's fix it.

        Show
        Sylvain Lebresne added a comment - - edited I really think the bug here is that TimeUUIDType.fromString() accepts a date as input. But a date is not a valid representation of a timeuuid, and the fromString method does arbitrarily pick some 0's for parts of the resulting UUID. In other words, the SELECT query above should be invalid. Now don't get me wrong, selecting timeuuid based on dates is useful but that is a slightly different problem. So what I think we should do is: refuse dates as valid timeuuid values because they just are not. add convenience methods to translate dates to precise timeuuid. For querying we would have 'startOf()' and 'endOf()' (where 'startOf(<date> )' (resp. 'endOf(<date> )') would return the smallest (resp. biggest ) possible timeuuid at time <date> ). And for insertion we could optionally add 'random(<date> )' that would return a random timeuuid at time '<date>' (we could even accept 'now' as syntactic sugar for 'random(now)' if we feel like it). That would also mean that cqlsh should stop this non-sense of displaying timeuuid like date. Again, I understand the intention of making it more readable but this will confuse generations of CQL3 users. I do am in favor of finding a non confusing way to make it readable for users. In fact one solution could be to handle that on the CQL side and to allow 'SELECT dateOf(x ) FROM ...' that would return a date string from timeuuid x (but now it's clear, you've explicitly asked for a lossy representation of x). I note that this suggestion pretty much fixes the problem discussed in CASSANDRA-4284 too. I note that Tyler's solution of basically automatically generating the startOf() and endOf() method under the cover based on whether we've use an inclusive of exclusive operation may appear seductive but I don't think we should do that because: if you do that, what about SELECT ... WHERE activity_id = '2012-11-07 18:18:22-0800'. You still have no solution for that and by doing magic under the carpet for < and >, you've in fact blurred what = really does. "it would just require passing along information about whether to create the highest or lowest TimeUUID representation for a given datestamp based on the comparison operator that's used" <- while this seem simple on principle, this will yield very very ugly special cases internally. This is not 2 lines of code. more generally, this doesn't solve the fact that date are not valid representation of timeuuid. For example, I still think the first point mentioned in CASSANDRA-4284 is a bug in its own right. Allowing dates as valid representation of timeuuid is a bug, let's fix it.
        Hide
        Sylvain Lebresne added a comment -

        Attaching a patch that implements what my previous comment describes.

        A few precisions:

        • the patch allows startOf('2012-11-07 18:18:22-0800') or even startOf(12432343423) but not startOf(? ). It's just that it's simpler to not support prepared marker for now. We can add it later, but I'd rather leave it for later. It's not excessively useful anyway since any good library will provide an equivalent to the startOf() method anyway (and so you can use that client side for prepared statements).
        • the patch change the fact that dates are accepted as valid TimeUUID representation because, as arged previously, this is bogus. However, CQL2 has done that bogus thing to, and I'm not sure it's worth fixing there as there might be people relying on that buggy behavior. So the patch maintain the buggy behavior fro CQL2.
        • I talks about adding a 'random(date)' method for insertions sake in my previous comment, but thinking about that a bit more, I'm not sure it's a good idea. Namely, the only way to generate a version 1 UUID according to the spec is based on the current time. Generating one from a timestamp is kind of not safe. Now I admit that if you use the timestamp but randomize all other bits, you probably end up with something having virtually no chance of collision, but still, I'm slightly reluctant to do that in Cassandra. I'd rather let people do that client side (and provide a UUID string) if they really want to. So instead the patch only provides a now() method that generate a new unique timeuuid based on the current time.
        • The patch also adds the conversion for select statement I mention in my previous comment. In fact it adds 2 methods dateOf() and unixTimestampOf(). This part is kind of optional and I can rip it out if there is objections (I meant to separate in 2 patches but scewed up and got lazy). That being said, I kind of like it and with that I think we can just have cqlsh stop printing timeUUID as dates (which the patch doesn't include however).

        Let's not shy away from the fact that this patch kinds of break backward compatibility. I say "kinds of" because as I've said, I really think allowing date litterals as timeuuid values is a bug so I really think this patch is a bug fix. And if we get that in 1.2.1, I really don't think they'll be any arm done. I also note that there isn't any way to fix this issue that don't "break backward compatibility". So I'm really sorry I didn't got that fix up before 1.2.0, but I still really think we should do it nonetheless.

        Show
        Sylvain Lebresne added a comment - Attaching a patch that implements what my previous comment describes. A few precisions: the patch allows startOf('2012-11-07 18:18:22-0800') or even startOf(12432343423) but not startOf(? ) . It's just that it's simpler to not support prepared marker for now. We can add it later, but I'd rather leave it for later. It's not excessively useful anyway since any good library will provide an equivalent to the startOf() method anyway (and so you can use that client side for prepared statements). the patch change the fact that dates are accepted as valid TimeUUID representation because, as arged previously, this is bogus. However, CQL2 has done that bogus thing to, and I'm not sure it's worth fixing there as there might be people relying on that buggy behavior. So the patch maintain the buggy behavior fro CQL2. I talks about adding a 'random(date)' method for insertions sake in my previous comment, but thinking about that a bit more, I'm not sure it's a good idea. Namely, the only way to generate a version 1 UUID according to the spec is based on the current time. Generating one from a timestamp is kind of not safe. Now I admit that if you use the timestamp but randomize all other bits, you probably end up with something having virtually no chance of collision, but still, I'm slightly reluctant to do that in Cassandra. I'd rather let people do that client side (and provide a UUID string) if they really want to. So instead the patch only provides a now() method that generate a new unique timeuuid based on the current time. The patch also adds the conversion for select statement I mention in my previous comment. In fact it adds 2 methods dateOf() and unixTimestampOf() . This part is kind of optional and I can rip it out if there is objections (I meant to separate in 2 patches but scewed up and got lazy). That being said, I kind of like it and with that I think we can just have cqlsh stop printing timeUUID as dates (which the patch doesn't include however). Let's not shy away from the fact that this patch kinds of break backward compatibility. I say "kinds of" because as I've said, I really think allowing date litterals as timeuuid values is a bug so I really think this patch is a bug fix. And if we get that in 1.2.1, I really don't think they'll be any arm done. I also note that there isn't any way to fix this issue that don't "break backward compatibility". So I'm really sorry I didn't got that fix up before 1.2.0, but I still really think we should do it nonetheless.
        Sylvain Lebresne made changes -
        Attachment 4936.txt [ 12564407 ]
        Sylvain Lebresne made changes -
        Fix Version/s 1.2.1 [ 12322953 ]
        Fix Version/s 1.2.2 [ 12323924 ]
        Sylvain Lebresne made changes -
        Attachment 4936.txt [ 12564407 ]
        Sylvain Lebresne made changes -
        Attachment 4936.txt [ 12564408 ]
        Sylvain Lebresne made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Assignee Sylvain Lebresne [ slebresne ]
        Hide
        Tyler Hobbs added a comment -

        I agree with all of your points. The one thing I might change are the function names startOf() and endOf(). Since these functions are dealing with dates and times, I think the names suggest they might be altering the time component. Perhaps minTimeUUID() and maxTimeUUID()?

        Regarding backwards compatibility: are we carefully tracking changes to CQL by version and documenting them somewhere? These kinds of changes need to be easily discoverable, even if they are only considered bugfixes.

        Show
        Tyler Hobbs added a comment - I agree with all of your points. The one thing I might change are the function names startOf() and endOf() . Since these functions are dealing with dates and times, I think the names suggest they might be altering the time component. Perhaps minTimeUUID() and maxTimeUUID() ? Regarding backwards compatibility: are we carefully tracking changes to CQL by version and documenting them somewhere? These kinds of changes need to be easily discoverable, even if they are only considered bugfixes.
        Hide
        Sylvain Lebresne added a comment -

        Perhaps minTimeUUID() and maxTimeUUID()?

        Agreed, that's probably less confusing. That's trivial to change though, I'll change before committing if it comes to that.

        are we carefully tracking changes to CQL by version and documenting them somewhere?

        Not yet, but I do intend to do it now that CQL3 is final. It'll probably be listed in http://cassandra.apache.org/doc/cql3/CQL.html. Though if that patch is committed, I'll also mention it clearly in the NEWS file.

        Show
        Sylvain Lebresne added a comment - Perhaps minTimeUUID() and maxTimeUUID()? Agreed, that's probably less confusing. That's trivial to change though, I'll change before committing if it comes to that. are we carefully tracking changes to CQL by version and documenting them somewhere? Not yet, but I do intend to do it now that CQL3 is final. It'll probably be listed in http://cassandra.apache.org/doc/cql3/CQL.html . Though if that patch is committed, I'll also mention it clearly in the NEWS file.
        Hide
        Jonathan Ellis added a comment -

        Tyler, can you review the patch?

        Show
        Jonathan Ellis added a comment - Tyler, can you review the patch?
        Hide
        Tyler Hobbs added a comment -

        Yes, I can review it.

        Show
        Tyler Hobbs added a comment - Yes, I can review it.
        Tyler Hobbs made changes -
        Reviewer thobbs
        Hide
        Tyler Hobbs added a comment -

        +1 on the patch

        I'd like to include the dateOf() and unixTimestampOf() functions; do you want update cqlsh here to not automatically convert timeUUIDs to a date format?

        Show
        Tyler Hobbs added a comment - +1 on the patch I'd like to include the dateOf() and unixTimestampOf() functions; do you want update cqlsh here to not automatically convert timeUUIDs to a date format?
        Hide
        Sylvain Lebresne added a comment -

        Alright, committed, thanks

        I'd like to include the dateOf() and unixTimestampOf() functions

        I've left them in the commit. I've also renamed startOf/endOf to minTimeUUID/maxTimeUUID as suggested above.

        do you want update cqlsh here to not automatically convert timeUUIDs to a date format?

        As that was simple, I did that too with the commit. Though if someone more knowledgeable of cqlsh wants to have a look at the commit and make sure I didn't forget something, that'd be appreciated.

        Show
        Sylvain Lebresne added a comment - Alright, committed, thanks I'd like to include the dateOf() and unixTimestampOf() functions I've left them in the commit. I've also renamed startOf/endOf to minTimeUUID/maxTimeUUID as suggested above. do you want update cqlsh here to not automatically convert timeUUIDs to a date format? As that was simple, I did that too with the commit. Though if someone more knowledgeable of cqlsh wants to have a look at the commit and make sure I didn't forget something, that'd be appreciated.
        Sylvain Lebresne made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Gavin made changes -
        Workflow no-reopen-closed, patch-avail [ 12733324 ] patch-available, re-open possible [ 12753817 ]
        Gavin made changes -
        Workflow patch-available, re-open possible [ 12753817 ] reopen-resolved, no closed status, patch-avail, testing [ 12759000 ]
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        63d 12h 17m 1 Sylvain Lebresne 11/Jan/13 13:31
        Patch Available Patch Available Resolved Resolved
        6d 23h 31m 1 Sylvain Lebresne 18/Jan/13 13:02

          People

          • Assignee:
            Sylvain Lebresne
            Reporter:
            Cesar Lopez-Nataren
            Reviewer:
            Tyler Hobbs
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development