CouchDB
  1. CouchDB
  2. COUCHDB-1373

Time-order​ed document ids including the database identity

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: Database Core
    • Labels:
    • Skill Level:
      Regular Contributors Level (Easy to Medium)

      Description

      This suggestion is for an enhancement to the document id generation algorithms in CouchDb. I am new to CouchDb, and this question addresses an old issue (https://issues.apache.org/jira/browse/COUCHDB-465) so please forgive me if I am retreading old ground.

      My application has a number of mutually replicating CouchDb instances and I would like document ids to be monotonically-increasing per-instance, and globally unique, and for the instance where the document was created to be determinable from the id. (To be more accurate - I don't need to know anything about the instance itself; just whether any two documents originated from the same instance.) The utc_random algorithm is not far from meeting these requirements, as ids are monotonic and almost certainly globally unique. However, the instance cannot be determined from the id, and there is a tiny chance of an id clash between two instances. Both of these issues could be solved if the random part of the id could be replaced with a suffix that is fixed in the ini file for each instance.

      To address this I have a modified version of couch_uuids.erl introducing a new utc_machine_id algorithm which reads a machine_id string from the ini file and then generates ids using an internal utc_suffix method that just appends the string to the usual utc 14-byte string. utc_random then also uses the utc_suffix method, but its suffix is the usual random byte string.

      However, it is obviously a nuisance to have to maintain a non-standard distribution, so I wondered if there is enough call for this sort of thing to make it a part of the standard distribution? If there is, I'd be very happy to make my code available for discussion/modification/inclusion. If there are good reasons why this is a bad idea, then I'd also be very interested to hear them so that I can rethink my ideas. (It happens that the privacy and guessability concerns raised in the original discussion do not apply in my case.) If this question has been beaten to death, then I'm sorry for bothering the group, and would be grateful if someone could point me to the discussions so that I can understand the issues.

      1. 0003-utc_id_suffix.patch
        6 kB
        Nick North
      2. 0002-utc_id_suffix.patch
        5 kB
        Nick North
      3. 0001-Add-etap-for-jira-1373.patch
        3 kB
        Bob Dionne
      4. utc_machine_id.patch
        3 kB
        Nick North
      5. couch_uuids.patch
        2 kB
        Nick North

        Activity

        Hide
        Bob Dionne added a comment -

        Nick,

        This seems reasonable, I'm sure there might be some issues to consider but if it's done the way the others were I don't see why not. I've run into this sort of thing before in systems where you want unique ids with a certain provenance so you assign each player an additional unique id that becomes part of the generated keys.
        couch_uuids is a gen_server whose state is determined by an algorithm defined in the config file. I see no problem with adding another. If you could submit a patch it would be easier to assess or better yet a link to a source repo.

        Best,

        Bob

        Show
        Bob Dionne added a comment - Nick, This seems reasonable, I'm sure there might be some issues to consider but if it's done the way the others were I don't see why not. I've run into this sort of thing before in systems where you want unique ids with a certain provenance so you assign each player an additional unique id that becomes part of the generated keys. couch_uuids is a gen_server whose state is determined by an algorithm defined in the config file. I see no problem with adding another. If you could submit a patch it would be easier to assess or better yet a link to a source repo. Best, Bob
        Hide
        Nick North added a comment -

        Thanks for the reply. I'll submit a patch as soon as I've worked out how to use git.

        Show
        Nick North added a comment - Thanks for the reply. I'll submit a patch as soon as I've worked out how to use git.
        Hide
        Nick North added a comment - - edited

        I've now attached a patch to implement the utc_machine_id uuid algorithm. Quick walkthrough:

        1. state reads in a new machine_id string from the couchdb ini file section, defaulting to "". I put it in the couchdb section as it might be useful elsewhere, but it could live in the uuids section if it were not to be used anywhere else. state also checks to see if the new utc_machine_id algorithm is specified.
        2. handle_call has an extra clause for utc_machine_id algorithm, which just calls utc_suffix with the machine_id string as a suffix to be appended to the UTC timestamp. utc_suffix is also used for the utc_random algorithm, but with the usual random string as the UTC timestamp suffix.

        Show
        Nick North added a comment - - edited I've now attached a patch to implement the utc_machine_id uuid algorithm. Quick walkthrough: 1. state reads in a new machine_id string from the couchdb ini file section, defaulting to "". I put it in the couchdb section as it might be useful elsewhere, but it could live in the uuids section if it were not to be used anywhere else. state also checks to see if the new utc_machine_id algorithm is specified. 2. handle_call has an extra clause for utc_machine_id algorithm, which just calls utc_suffix with the machine_id string as a suffix to be appended to the UTC timestamp. utc_suffix is also used for the utc_random algorithm, but with the usual random string as the UTC timestamp suffix.
        Hide
        Paul Joseph Davis added a comment -

        Your whitespace is wonky. Indentation in Erlang code should be four spaces and not tabs. There's also some pure white-space changes.

        Also, you should add a note in the default.ini.tpl.in file to indicate that the new algorithm exists as well as the fact that it requires the couchdb/machine_id setting to do what it's supposed to. I'm also hesitating if machine_id shouldn't be in the uuid section or not. Not super important but I find it a bit odd to have it there.

        Also, I wonder if it'd be possible to generate some sort of default machine id if one isn't specified. Something like, a sha of all mac addresses in sorted order? Then again I'm not sure how hard it'd be to get that list from Erlang in a portable fashion.

        Show
        Paul Joseph Davis added a comment - Your whitespace is wonky. Indentation in Erlang code should be four spaces and not tabs. There's also some pure white-space changes. Also, you should add a note in the default.ini.tpl.in file to indicate that the new algorithm exists as well as the fact that it requires the couchdb/machine_id setting to do what it's supposed to. I'm also hesitating if machine_id shouldn't be in the uuid section or not. Not super important but I find it a bit odd to have it there. Also, I wonder if it'd be possible to generate some sort of default machine id if one isn't specified. Something like, a sha of all mac addresses in sorted order? Then again I'm not sure how hard it'd be to get that list from Erlang in a portable fashion.
        Hide
        Nick North added a comment -

        Thanks for the comments. Whitespace is now fixed, I've commented default.ini.tpl.in, and moved machine_id to the uuids section of the file as that's less intrusive. Revised patch attached as utc_machine_id.patch.

        Show
        Nick North added a comment - Thanks for the comments. Whitespace is now fixed, I've commented default.ini.tpl.in, and moved machine_id to the uuids section of the file as that's less intrusive. Revised patch attached as utc_machine_id.patch.
        Hide
        Bob Dionne added a comment -

        Nick,

        looks pretty good. There's still a trailing whitespace error in default.ini.tpl.in There are git commands for catching those and/or you can set emacs to correct those on save or whatever. I'm not sure also about putting machine_id in the uuids section, in fact it we're going to have it in the ini then it doesn't even need to be "machine_id" it could be "id_suffix". Just a thought.

        We use etap [1] for testing, among other things. In test/etap you can see the tests. Attached is a patch that extends 041-uuid-gen.t for your new case.

        Cheers,

        Bob

        [1] https://github.com/ngerakines/etap

        Show
        Bob Dionne added a comment - Nick, looks pretty good. There's still a trailing whitespace error in default.ini.tpl.in There are git commands for catching those and/or you can set emacs to correct those on save or whatever. I'm not sure also about putting machine_id in the uuids section, in fact it we're going to have it in the ini then it doesn't even need to be "machine_id" it could be "id_suffix". Just a thought. We use etap [1] for testing, among other things. In test/etap you can see the tests. Attached is a patch that extends 041-uuid-gen.t for your new case. Cheers, Bob [1] https://github.com/ngerakines/etap
        Hide
        Bob Dionne added a comment -

        test for couchdb-1373

        Show
        Bob Dionne added a comment - test for couchdb-1373
        Hide
        Nick North added a comment -

        Thanks for all your help. id_suffix does seem a better name if it is only to be used in uuid generation, so the attached patch 0002-utc_id_suffix.patch changes the algorithm name to utc_id_suffix and uses an id_suffix ini setting. It incorporates your etap test, but with renames of machine_id to id_suffix throughout.

        I'm not sure of the usual convention for patches here. The idea of 0002-utc_id_suffix.patch is that it supersedes all previous attached files and should be applied instead of them.

        Show
        Nick North added a comment - Thanks for all your help. id_suffix does seem a better name if it is only to be used in uuid generation, so the attached patch 0002-utc_id_suffix.patch changes the algorithm name to utc_id_suffix and uses an id_suffix ini setting. It incorporates your etap test, but with renames of machine_id to id_suffix throughout. I'm not sure of the usual convention for patches here. The idea of 0002-utc_id_suffix.patch is that it supersedes all previous attached files and should be applied instead of them.
        Hide
        Bob Dionne added a comment -

        Nick,

        It looks good at a glance. I'll try to test it and review it some more later. We don't really have a hard convention for patches, the 0002-... approach seems fine. Often folks just post drafts[1] and proposed patches as github links while they are under discussion. Sometimes I also delete older patches if it gets confusing. This project is pioneering the use of Git at ASF so there's been some discussion about workflows and so forth.
        Anyway I'll be happy to commit this if no one finds any issue within the next day or so.

        Cheers,

        Bob

        [1] https://github.com/bdionne/couchdb/compare/master...jira-1373

        Show
        Bob Dionne added a comment - Nick, It looks good at a glance. I'll try to test it and review it some more later. We don't really have a hard convention for patches, the 0002-... approach seems fine. Often folks just post drafts [1] and proposed patches as github links while they are under discussion. Sometimes I also delete older patches if it gets confusing. This project is pioneering the use of Git at ASF so there's been some discussion about workflows and so forth. Anyway I'll be happy to commit this if no one finds any issue within the next day or so. Cheers, Bob [1] https://github.com/bdionne/couchdb/compare/master...jira-1373
        Hide
        Nick North added a comment -

        Many thanks. I can see that it would also be nice to have a TestSuffix etap test that checks that the suffix remains unchanged for multiple utc_id_suffix UUIDs. I've knocked together some speculative code for that but won't propose it unless/until I can get enough bits running on my system to try it out.

        Show
        Nick North added a comment - Many thanks. I can see that it would also be nice to have a TestSuffix etap test that checks that the suffix remains unchanged for multiple utc_id_suffix UUIDs. I've knocked together some speculative code for that but won't propose it unless/until I can get enough bits running on my system to try it out.
        Hide
        Nick North added a comment -

        One more tweak: added tests of utc_id_suffix to share/www/script/test/uuids.js. Patch 0003-utc_id_suffix.patch includes this, and all previous changes.

        Show
        Nick North added a comment - One more tweak: added tests of utc_id_suffix to share/www/script/test/uuids.js. Patch 0003-utc_id_suffix.patch includes this, and all previous changes.
        Show
        Robert Newson added a comment - Some food for thought: http://blog.boundary.com/2012/01/12/Flake-A-decentralized-k-ordered-id-generation-service-in-Erlang.html
        Hide
        Nick North added a comment -

        The Flake algorithm is interesting. I also started with Twitter's algorithm and, at the time, shared Twitter's desire to fit ids into a 64-bit long - that affected my choice of suffix. I'm also not an experienced Erlang programmer, so was heavily influenced by the existing CouchDb utc_random algorithm as it could be adapted with minimal changes.

        I think Flake is overly complex, because they use a millisecond clock plus a separate sequence id. This is perhaps because they want an algorithm that can be implemented in any language. In Erlang we can use the full microsecond precision of the clock and rely on the fact that repeated calls to Now are guaranteed to be monotonic increasing. However, Flake's use of the MAC address as a machine identifier saves on config file entries, and might be adopted.

        Left to my own devices I would stick with the current proposal, but a more Flake-like one would suit me as well. But if I had to implement it, I'm not sure how long it would take with my limited Erlang knowledge and dev environment.

        Show
        Nick North added a comment - The Flake algorithm is interesting. I also started with Twitter's algorithm and, at the time, shared Twitter's desire to fit ids into a 64-bit long - that affected my choice of suffix. I'm also not an experienced Erlang programmer, so was heavily influenced by the existing CouchDb utc_random algorithm as it could be adapted with minimal changes. I think Flake is overly complex, because they use a millisecond clock plus a separate sequence id. This is perhaps because they want an algorithm that can be implemented in any language. In Erlang we can use the full microsecond precision of the clock and rely on the fact that repeated calls to Now are guaranteed to be monotonic increasing. However, Flake's use of the MAC address as a machine identifier saves on config file entries, and might be adopted. Left to my own devices I would stick with the current proposal, but a more Flake-like one would suit me as well. But if I had to implement it, I'm not sure how long it would take with my limited Erlang knowledge and dev environment.
        Hide
        Robert Newson added a comment -

        grabbing the MAC by name is as simple as;

        fun(Name) ->

        {ok, List}

        = inet:getifaddrs(), If = proplists:get_value(Name, List), proplists:get_value(hwaddr, If) end.

        So I suggest the value in the .ini file is the name of the interface ("eth0", for example) and not a uuid, this will ensure that any live node is using a different value.

        Show
        Robert Newson added a comment - grabbing the MAC by name is as simple as; fun(Name) -> {ok, List} = inet:getifaddrs(), If = proplists:get_value(Name, List), proplists:get_value(hwaddr, If) end. So I suggest the value in the .ini file is the name of the interface ("eth0", for example) and not a uuid, this will ensure that any live node is using a different value.
        Hide
        Nick North added a comment -

        Thanks - that looks pretty straightforward, except that the Erlang documentation at http://www.erlang.org/doc/man/inet.html#getifaddrs-0 says that Solaris does not return a

        {hwaddr, _}

        tuple. But we can sort out some workaround for that.

        I am about to go on holiday for three weeks, but can take a look at incorporating the MAC address when I return, unless someone else wants to have a go in the interim. It might also be nice to give the user the option of specifying a specific suffix if they want, or else give an interface name (and then also have a sensible default if this algorithm is specified with neither a suffix nor an interface).

        Show
        Nick North added a comment - Thanks - that looks pretty straightforward, except that the Erlang documentation at http://www.erlang.org/doc/man/inet.html#getifaddrs-0 says that Solaris does not return a {hwaddr, _} tuple. But we can sort out some workaround for that. I am about to go on holiday for three weeks, but can take a look at incorporating the MAC address when I return, unless someone else wants to have a go in the interim. It might also be nice to give the user the option of specifying a specific suffix if they want, or else give an interface name (and then also have a sensible default if this algorithm is specified with neither a suffix nor an interface).
        Hide
        Bob Dionne added a comment -

        I think the option of letting the user specify a suffix in the ini file is still good and supports a use case that occurs a lot, for example when some governing body is responsible for the provenance of work that is collaborative and distributed and is responsible for assigning unique suffixes to each instance.

        I'm also not convinced that using the MAC address doesn't have problems, so I'd see adding that as as additional approach. It seems reasonable, but it can be changed, what happens when the database is replicated to another, etc...

        Show
        Bob Dionne added a comment - I think the option of letting the user specify a suffix in the ini file is still good and supports a use case that occurs a lot, for example when some governing body is responsible for the provenance of work that is collaborative and distributed and is responsible for assigning unique suffixes to each instance. I'm also not convinced that using the MAC address doesn't have problems, so I'd see adding that as as additional approach. It seems reasonable, but it can be changed, what happens when the database is replicated to another, etc...
        Hide
        Nick North added a comment -

        Bumping this as there were no comments on my pull request https://github.com/apache/couchdb/pull/28 which was intended to address it. If I'm going about this the wrong way, then please do let me know so that I can improve my approach. Many thanks.

        Show
        Nick North added a comment - Bumping this as there were no comments on my pull request https://github.com/apache/couchdb/pull/28 which was intended to address it. If I'm going about this the wrong way, then please do let me know so that I can improve my approach. Many thanks.

          People

          • Assignee:
            Unassigned
            Reporter:
            Nick North
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development