Cassandra
  1. Cassandra
  2. CASSANDRA-139

thrift API should use lists instead of colon-delimited strings to specify column path

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Fix Version/s: 0.4
    • Component/s: Core
    • Labels:
      None
    1. 0001-thrift-changes.patch
      7 kB
      Jonathan Ellis
    2. 0001-thrift-changes.patch
      5 kB
      Jonathan Ellis
    3. 0003-update-system-tests.patch
      23 kB
      Jonathan Ellis
    4. 0004-missed-one-rename-column.column_name-column.name.patch
      12 kB
      Jonathan Ellis
    5. 0005-fix-get_column_count.patch
      3 kB
      Jonathan Ellis
    6. ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-139-thrift-changes.txt
      405 kB
      Jonathan Ellis
    7. ASF.LICENSE.NOT.GRANTED--0002-use-ColumnPath-ColumnParent-etc.-structs-to-encapsul.txt
      169 kB
      Jonathan Ellis

      Issue Links

        Activity

        Hide
        Hudson added a comment -

        Integrated in Cassandra #139 (See http://hudson.zones.apache.org/hudson/job/Cassandra/139/)
        missed one: rename column.column_name -> column.name
        patch by jbellis; reviewed by Evan Weaver and Sandeep Tata for
        update system tests; fix get_column_count
        patch by jbellis; reviewed by Evan Weaver and Sandeep Tata for
        thrift changes to remove : api.
        use ColumnPath, ColumnParent, etc. structs to encapsulate optional SuperColumns (and occasionally, optional Columns).
        patch by jbellis; reviewed by Evan Weaver and Sandeep Tata for

        Show
        Hudson added a comment - Integrated in Cassandra #139 (See http://hudson.zones.apache.org/hudson/job/Cassandra/139/ ) missed one: rename column.column_name -> column.name patch by jbellis; reviewed by Evan Weaver and Sandeep Tata for update system tests; fix get_column_count patch by jbellis; reviewed by Evan Weaver and Sandeep Tata for thrift changes to remove : api. use ColumnPath, ColumnParent, etc. structs to encapsulate optional SuperColumns (and occasionally, optional Columns). patch by jbellis; reviewed by Evan Weaver and Sandeep Tata for
        Hide
        Jonathan Ellis added a comment -

        committed

        Show
        Jonathan Ellis added a comment - committed
        Hide
        Evan Weaver added a comment -

        Don't see any obvious issues in the patchset.

        Ship it!

        Show
        Evan Weaver added a comment - Don't see any obvious issues in the patchset. Ship it!
        Hide
        Sandeep Tata added a comment -

        Unit tests and system tests pass.
        I verified the paths that deal with regular column reads/writes/deletes/get_slice*/batch_insert – they work fine for me. +1 on those.
        Haven't looked at the supercolumn APIs.

        Show
        Sandeep Tata added a comment - Unit tests and system tests pass. I verified the paths that deal with regular column reads/writes/deletes/get_slice*/batch_insert – they work fine for me. +1 on those. Haven't looked at the supercolumn APIs.
        Hide
        Jonathan Ellis added a comment -

        Turns out that's an old bug. I'd like to fix it sooner than later but it's related to CASSANDRA-286 (which I already have a patch for, that requires the -139 patchset). So let's address it there.

        Show
        Jonathan Ellis added a comment - Turns out that's an old bug. I'd like to fix it sooner than later but it's related to CASSANDRA-286 (which I already have a patch for, that requires the -139 patchset). So let's address it there.
        Hide
        Evan Weaver added a comment -

        Another issue, possibly not related to this patch:

        Keys don't appear to be really deleted by remove(key, CF). The values are gone, but get_key_range continues to return the key name even after compaction.

        Show
        Evan Weaver added a comment - Another issue, possibly not related to this patch: Keys don't appear to be really deleted by remove(key, CF). The values are gone, but get_key_range continues to return the key name even after compaction.
        Hide
        Jonathan Ellis added a comment -

        05 fixes get_column_count

        Show
        Jonathan Ellis added a comment - 05 fixes get_column_count
        Hide
        Evan Weaver added a comment -

        i assumed they would be all reified but strings are fine when there is only one field

        Show
        Evan Weaver added a comment - i assumed they would be all reified but strings are fine when there is only one field
        Hide
        Jonathan Ellis added a comment -

        signatures: no, it's correct the way it is, since it's not valid to pass a supercolumn name – only the CF is legal. we could alias string to superColumnParent but that's more obfuscatory than otherwise.

        will look at get_column_count.

        Show
        Jonathan Ellis added a comment - signatures: no, it's correct the way it is, since it's not valid to pass a supercolumn name – only the CF is legal. we could alias string to superColumnParent but that's more obfuscatory than otherwise. will look at get_column_count.
        Hide
        Evan Weaver added a comment - - edited

        Problems:

        • signatures for get_key_range, get_slice_super and get_slice_super_by_names need to accept a ColumnParent...currently take a CF string.
        • get_column_count no longer lets you count super columns (InvalidRequestException) or columns of supercolumns (java.lang.UnsupportedOperationException)

        Those are the regressions I hit in my test suite. Will review the actual code next.

        This is a big step forward. I am excited for the day when overlapping structs start to get unified.

        Show
        Evan Weaver added a comment - - edited Problems: signatures for get_key_range, get_slice_super and get_slice_super_by_names need to accept a ColumnParent...currently take a CF string. get_column_count no longer lets you count super columns (InvalidRequestException) or columns of supercolumns (java.lang.UnsupportedOperationException) Those are the regressions I hit in my test suite. Will review the actual code next. This is a big step forward. I am excited for the day when overlapping structs start to get unified.
        Hide
        Evan Weaver added a comment -

        Yep today...was traveling.

        Show
        Evan Weaver added a comment - Yep today...was traveling.
        Hide
        Jonathan Ellis added a comment -

        Evan, can you review?

        Show
        Jonathan Ellis added a comment - Evan, can you review?
        Hide
        Jonathan Ellis added a comment -

        system tests

        Show
        Jonathan Ellis added a comment - system tests
        Hide
        Jonathan Ellis added a comment -

        haven't finished updating the system tests to use the new api yet, but unit tests pass. if someone wants to get a head start on reviewing, great, because it's a lot of code.

        02
        the meat of the change

        01
        generated thrift changes. won't compile alone; isn't meant to.

        Show
        Jonathan Ellis added a comment - haven't finished updating the system tests to use the new api yet, but unit tests pass. if someone wants to get a head start on reviewing, great, because it's a lot of code. 02 the meat of the change 01 generated thrift changes. won't compile alone; isn't meant to.
        Hide
        Evan Weaver added a comment -

        I agree with that. My vote is for the key in the struct. But if we ship 0.4 with the key outside the struct, I won't complain unless I have an actual failing use case.

        Show
        Evan Weaver added a comment - I agree with that. My vote is for the key in the struct. But if we ship 0.4 with the key outside the struct, I won't complain unless I have an actual failing use case.
        Hide
        Jonathan Ellis added a comment -

        Gah, you edited your comment twice more.

        Basically we need to make a call for 04 and this is it. We can try to make it reasonably future-proof but past a certain point it's just bikeshedding.

        Show
        Jonathan Ellis added a comment - Gah, you edited your comment twice more. Basically we need to make a call for 04 and this is it. We can try to make it reasonably future-proof but past a certain point it's just bikeshedding.
        Hide
        Evan Weaver added a comment -

        Agree that this is the common case.

        I am vaguely worried about bigtable-like multigets though where you would select multiple CFs for multiple keys--in that case it makes a lot more sense to specify each path separately, so that the order of the returned values is predictable.

        Show
        Evan Weaver added a comment - Agree that this is the common case. I am vaguely worried about bigtable-like multigets though where you would select multiple CFs for multiple keys--in that case it makes a lot more sense to specify each path separately, so that the order of the returned values is predictable.
        Hide
        Jonathan Ellis added a comment -

        Yeah, I don't think it's worth complicating the common case to support the uncommon one. (And neither did Avinash, from the code fragments we have, even though they're the only ones with a schema that would use something like that.)

        Show
        Jonathan Ellis added a comment - Yeah, I don't think it's worth complicating the common case to support the uncommon one. (And neither did Avinash, from the code fragments we have, even though they're the only ones with a schema that would use something like that.)
        Hide
        Evan Weaver added a comment - - edited

        No multiget from multiple column families for a single key?

        For what it's worth I thought the previous way was more flexible, but I suppose your point is that that may be bad at this point. I liked having the full path excluding table as its own self-contained object though; I can see storing that full predicate as the id in a hydrated row, rather than storing the predicate and key separately.

        Show
        Evan Weaver added a comment - - edited No multiget from multiple column families for a single key? For what it's worth I thought the previous way was more flexible, but I suppose your point is that that may be bad at this point. I liked having the full path excluding table as its own self-contained object though; I can see storing that full predicate as the id in a hydrated row, rather than storing the predicate and key separately.
        Hide
        Jonathan Ellis added a comment -

        new version of proposed thrift file, with consistent naming convention (CamelCase structs; under_scored everything else)

        Show
        Jonathan Ellis added a comment - new version of proposed thrift file, with consistent naming convention (CamelCase structs; under_scored everything else)
        Hide
        Jonathan Ellis added a comment - - edited

        Michael Greene on IRC pointed out that you can make a case for keeping all arguments the same to a slice call, with a list of keys. This appears to be the direction Avinash was going over in CASSANDRA-70.

        This makes sense if most of your ops are "I want to apply the same predicate multiple times to different rows." Like a relational select. This would be sufficiently general for the app I ported to Cassandra (which did after all start on a rdbms).

        The alternative is to be like the batch_insert api is and for each key allow a different predicate. I can appreciate that argument from the standpoint of symmetry, but it seems that in the common case it would make life a pain, writing

        predicate = ...
        get_slice(table, [(key, predicate) for key in list_of_keys])

        instead of just

        get_slice(table, list_of_keys, predicate)

        I think that Lesher's 10th law applies: "Premature optimization is the root of all evil; premature generalization is premature optimization of the feature list." Let's plan on following Avinash's lead on the multiget API, which means that I'll pull both table and key out of the path structs shown in that patch.

        Show
        Jonathan Ellis added a comment - - edited Michael Greene on IRC pointed out that you can make a case for keeping all arguments the same to a slice call, with a list of keys. This appears to be the direction Avinash was going over in CASSANDRA-70 . This makes sense if most of your ops are "I want to apply the same predicate multiple times to different rows." Like a relational select. This would be sufficiently general for the app I ported to Cassandra (which did after all start on a rdbms). The alternative is to be like the batch_insert api is and for each key allow a different predicate. I can appreciate that argument from the standpoint of symmetry, but it seems that in the common case it would make life a pain, writing predicate = ... get_slice(table, [(key, predicate) for key in list_of_keys] ) instead of just get_slice(table, list_of_keys, predicate) I think that Lesher's 10th law applies: "Premature optimization is the root of all evil; premature generalization is premature optimization of the feature list." Let's plan on following Avinash's lead on the multiget API, which means that I'll pull both table and key out of the path structs shown in that patch.
        Hide
        Jonathan Ellis added a comment -

        good idea, Sandeep. I will do that.

        Show
        Jonathan Ellis added a comment - good idea, Sandeep. I will do that.
        Hide
        Evan Weaver added a comment -

        fine with me

        Show
        Evan Weaver added a comment - fine with me
        Hide
        Sandeep Tata added a comment -

        Could we pull Table out of the struct – list<column_path_t> should work fine for multi-get and multi-put.

        Show
        Sandeep Tata added a comment - Could we pull Table out of the struct – list<column_path_t> should work fine for multi-get and multi-put.
        Hide
        Evan Weaver added a comment -

        Optional params is great because we can unify the supercolumn and standardcolumn APIs (no reason for the client to have to introspect on that all the time).

        For my use cases I need multiget from one CF with multiple keys more than vis versa...I suppose we would want to take optional lists for every slot except keyspace, returning a list in some deterministic order.

        Show
        Evan Weaver added a comment - Optional params is great because we can unify the supercolumn and standardcolumn APIs (no reason for the client to have to introspect on that all the time). For my use cases I need multiget from one CF with multiple keys more than vis versa...I suppose we would want to take optional lists for every slot except keyspace, returning a list in some deterministic order.
        Hide
        Jonathan Ellis added a comment -

        you can't have optional parameters in thrift except in a struct. so we're forced to use struct for some; for the rest it seems best to be consistent in struct usage even though it's not necessary.

        for multiget I would take the approach of the batch_ methods and have table + key + list<cf> or list<cf, column>, etc.

        (as I may have noted on the multiget issue, multiget is a superset of ordinary get so it probably doesn't make sense to keep both around ultimately.)

        Show
        Jonathan Ellis added a comment - you can't have optional parameters in thrift except in a struct. so we're forced to use struct for some; for the rest it seems best to be consistent in struct usage even though it's not necessary. for multiget I would take the approach of the batch_ methods and have table + key + list<cf> or list<cf, column>, etc. (as I may have noted on the multiget issue, multiget is a superset of ordinary get so it probably doesn't make sense to keep both around ultimately.)
        Hide
        Evan Weaver added a comment - - edited

        Looks reasonable!

        Once we get multiget we will have the same situation for keys, so i like the struct solution. Internal layer is agnostic but consistent, and clients can go nuts. A standard may eventually emerge.

        Is a struct the same as named parameters in Thrift or are there other benefits to reifying the param list?

        For multi-get/multi-insert, would you pass multiple structs?

        Show
        Evan Weaver added a comment - - edited Looks reasonable! Once we get multiget we will have the same situation for keys, so i like the struct solution. Internal layer is agnostic but consistent, and clients can go nuts. A standard may eventually emerge. Is a struct the same as named parameters in Thrift or are there other benefits to reifying the param list? For multi-get/multi-insert, would you pass multiple structs?
        Hide
        Jonathan Ellis added a comment -

        proposed new thrift api.

        using structs instead of list<string> makes it (a) much clearer for users, (b) much easier to sanity-check server-side, (c) much less important whether "string key" or "string columnFamily" comes first.

        look reasonable?

        Show
        Jonathan Ellis added a comment - proposed new thrift api. using structs instead of list<string> makes it (a) much clearer for users, (b) much easier to sanity-check server-side, (c) much less important whether "string key" or "string columnFamily" comes first. look reasonable?
        Hide
        Jonathan Ellis added a comment -

        If you watch the fb cassandra video (not the nosql one) about 20% of the way in they explain their data model and they have like 4 CFs with the same keys.

        Show
        Jonathan Ellis added a comment - If you watch the fb cassandra video (not the nosql one) about 20% of the way in they explain their data model and they have like 4 CFs with the same keys.
        Hide
        Evan Weaver added a comment -

        Takes an array of CFs right? I see your point.

        I can't imagine that that's the common use case though.

        Show
        Evan Weaver added a comment - Takes an array of CFs right? I see your point. I can't imagine that that's the common use case though.
        Hide
        Jonathan Ellis added a comment -

        One reason to care about routing order:

        The batch_ methods allow inserting to multiple CFs at the same time, under the same key.

        Show
        Jonathan Ellis added a comment - One reason to care about routing order: The batch_ methods allow inserting to multiple CFs at the same time, under the same key.
        Hide
        Chris Goffinet added a comment -

        +1. After speaking to Evan more on this, I tend to agree:

        keyspace.get(column_family, key, [supercolumn], column)

        [12:26:17] <evn> its so you can pretend that all the CFs are your fields for a specific use case where you have to do range scans over fields
        [12:26:28] <evn> rather than a bag of columns you have a bag of column families

        Show
        Chris Goffinet added a comment - +1. After speaking to Evan more on this, I tend to agree: keyspace.get(column_family, key, [supercolumn] , column) [12:26:17] <evn> its so you can pretend that all the CFs are your fields for a specific use case where you have to do range scans over fields [12:26:28] <evn> rather than a bag of columns you have a bag of column families
        Show
        Evan Weaver added a comment - Example of switched order is in these tests: http://github.com/fauna/cassandra_client/blob/4e11d79ebebe4fbafd6ea54d20c435728a12555d/test/cassandra_client_test.rb
        Hide
        Jonathan Ellis added a comment -

        Nobody is arguing in favor of string concatenation. Let's avoid straw men here please.

        "the CF is never optional" is a good point.

        Show
        Jonathan Ellis added a comment - Nobody is arguing in favor of string concatenation. Let's avoid straw men here please. "the CF is never optional" is a good point.
        Hide
        Toby DiPasquale added a comment -

        +1. I would agree with Evan on this one. String concatenation is not awesome.

        Show
        Toby DiPasquale added a comment - +1. I would agree with Evan on this one. String concatenation is not awesome.
        Hide
        Evan Weaver added a comment - - edited

        The switch won't be silent, because we will drop the magic colons at the same time.

        Re. routing order:

        What we care about here is what knowledge the client can act on.

        • It's important to know that a CF is a single a file with keys inside it, because that's what lets you control column-major vs. row-major storage order.
        • It is not important to know that a single key has all its column families on the same host.

        Hbase and Hypertable use routing order, but they also use the colons, so I don't they are trustworthy API design models. In the interests of not needlessly confusing the majority of developers who are familiar with SQL but not BigTable, I think we should switch it.

        This normalizes some signatures because range queries let you look up based on CF and (optional) key:

        get_key_range(@keyspace, column_families, key_range.begin, key_range.end, limit)

        So we already use storage order! It makes sense to trail with the optional parameters, and the CF is never optional.

        Show
        Evan Weaver added a comment - - edited The switch won't be silent, because we will drop the magic colons at the same time. Re. routing order: What we care about here is what knowledge the client can act on. It's important to know that a CF is a single a file with keys inside it, because that's what lets you control column-major vs. row-major storage order. It is not important to know that a single key has all its column families on the same host. Hbase and Hypertable use routing order, but they also use the colons, so I don't they are trustworthy API design models. In the interests of not needlessly confusing the majority of developers who are familiar with SQL but not BigTable, I think we should switch it. This normalizes some signatures because range queries let you look up based on CF and (optional) key: get_key_range(@keyspace, column_families, key_range.begin, key_range.end, limit) So we already use storage order! It makes sense to trail with the optional parameters, and the CF is never optional.
        Hide
        Jonathan Ellis added a comment -

        key-first is also consistent in that (key, cf, [sc], column) is the order it needs things for routing. So while I am +1 on getting rid of string concatenation I am -0 on switching the argument order.

        (I also note that switching the order of identically typed parameters will silently screw over people following trunk – the new client will "work" against your old code but everything will fail with no obvious reason why.)

        Show
        Jonathan Ellis added a comment - key-first is also consistent in that (key, cf, [sc] , column) is the order it needs things for routing. So while I am +1 on getting rid of string concatenation I am -0 on switching the argument order. (I also note that switching the order of identically typed parameters will silently screw over people following trunk – the new client will "work" against your old code but everything will fail with no obvious reason why.)
        Hide
        Evan Weaver added a comment - - edited

        OK, the old style API came from the BigTable paper. HBase uses it too:

        Cell cell = table.get("myRow", "myColumnFamily:columnQualifier1");

        This is bad for three reasons:

        1. Concatenating strings; are you serious?
        2. It is targeted for a specific use case where your table contains just one kind of business object.
        3. It doesn't represent actual storage order, so it's super confusing.

        I think all APIs should take parameters in storage order in all cases. That is:

        keyspace.get(column_family, key, [supercolumn], column)

        Show
        Evan Weaver added a comment - - edited OK, the old style API came from the BigTable paper. HBase uses it too: Cell cell = table.get("myRow", "myColumnFamily:columnQualifier1"); This is bad for three reasons: 1. Concatenating strings; are you serious? 2. It is targeted for a specific use case where your table contains just one kind of business object. 3. It doesn't represent actual storage order, so it's super confusing. I think all APIs should take parameters in storage order in all cases. That is: keyspace.get(column_family, key, [supercolumn] , column)
        Hide
        Jonathan Ellis added a comment -

        I see. No, you're not missing anything. (Although it's just column names that can't have the :, keys are fine.)

        Show
        Jonathan Ellis added a comment - I see. No, you're not missing anything. (Although it's just column names that can't have the :, keys are fine.)
        Hide
        Evan Weaver added a comment - - edited

        If your serialization format produces a dump for a particular object that happens to include a ':' character, you can't use that object as a key or column name in any of the methods that expect a concatenated string where ':' has special meaning...

        All the user-provided values need to support the full range of possible "just byte arrays".

        Maybe I am missing something.

        Show
        Evan Weaver added a comment - - edited If your serialization format produces a dump for a particular object that happens to include a ':' character, you can't use that object as a key or column name in any of the methods that expect a concatenated string where ':' has special meaning... All the user-provided values need to support the full range of possible "just byte arrays". Maybe I am missing something.
        Hide
        Jonathan Ellis added a comment -

        How so?

        (Not defending the status quo by any means, just not sure what you mean.)

        Show
        Jonathan Ellis added a comment - How so? (Not defending the status quo by any means, just not sure what you mean.)
        Hide
        Evan Weaver added a comment -

        This currently makes it impossible to serialize column names.

        Show
        Evan Weaver added a comment - This currently makes it impossible to serialize column names.

          People

          • Assignee:
            Jonathan Ellis
            Reporter:
            Jonathan Ellis
          • Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development