CouchDB / COUCHDB-1303

Add a _bulk_update handler similar to _update but for bulk document changes

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Skill Level: Dont Know

      Description

      _update handlers are great (and getting better!) for building RESTful APIs inside CouchDB. One limitation I found tonight is that _update can only handle a single document at a time. If the API I'm building needs to update multiple docs (in a similar fashion to _bulk_docs), then an outside "proxy" script is required. It would be ideal to have a _bulk_update handler that offers the same functionality as _update, but with the ability to insert multiple documents at once.

      Perhaps the current _update handler API could be extended to support multiple IDs/documents, but a separate API endpoint would seem reasonable if needed.
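
For context, here is a minimal sketch of a single-document _update handler as it works today. It receives one document (or null) plus the request, and returns at most one document to save along with a response. The handler name and the `counter` and `updated_by` fields are hypothetical and only serve the illustration.

```javascript
// A single-document update handler: CouchDB calls it with the document
// named in the URL (or null if there is none) and the request object.
var bumpCounter = function (doc, req) {
  if (!doc) {
    // Returning null as the first element means nothing is saved.
    return [null, "No such document"];
  }
  doc.counter = (doc.counter || 0) + 1;  // hypothetical field
  doc.updated_by = req.userCtx.name;      // hypothetical field
  // Only this one document can be handed back for saving.
  return [doc, "Counter is now " + doc.counter];
};
```

The limitation described above is visible in the return value: there is room for exactly one document.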

      Thanks for considering this idea.

        Activity

        Benjamin Young added a comment -

        Others have hit this situation as well: http://stackoverflow.com/questions/4061766/couch-db-bulk-update-using-handlers/7669106

        Benoit Chesneau added a comment -

        Posted on dev@, copied here.

        I would really prefer to sit down a little and start to rethink the couchapp engine. RESTful means, IMO, that on a particular resource we could react to each action (here, HTTP verbs). Instead we made the same mistake Rails did by having a resource per action or so. Bulk updates would be another hack around this, so I'm -1 on such features. A new ticket about what I would like to design is coming along.

        Benjamin Young added a comment -

        I do agree that the more "RESTful" thing to do would actually be to allow a POST to the /db/ URL with a specific mimetype designating that one was sending a representation containing several documents to be added (which of course could use its own ticket). However, as we allow end users to extend CouchDB's functionality with "CouchApp" additions (which is fabulous!), we need to offer them as much power as possible so this can move out of being a "toy" (for some) into being a viable "stack."

        The developer-generated additions (such as _update handlers) can live wherever they need to in the CouchDB URL space (all of them are currently under _design/{app}). As long as they are wrappable by the URL rewriter, the developer can make their own APIs and vhost them wherever they'd like.

        That's the goal of this request. I agree that the whole CouchApp approach/concept could use a fresh approach, but I'm not sure it's a good reason to stall what we have moving now.

        I look forward to seeing your CouchApp-related proposals, but I'd prefer this (and any other) idea be measured on its merits relative to the current code--not potential rewrites. There are likely other reasons this might be bad, but "because we should start over" isn't one of them.

        Ari Najarian added a comment -

        So... it's been 2 years since the last comment on this thread, but I thought I'd throw my 2 cents in. I know this is much more complex an undertaking than I can imagine, but I would find this functionality extremely useful were it implemented. As far as the REST API is concerned, this is what I envision: since bulk methods already take an array of documents (that currently have to be specified), why not 'pipe' in the results of a view (with the same standard query parameters to select different portions), and just include an additional parameter to specify the design document / update handler to use on the operation?

        Say you want to 'reassign' a bunch of documents from one parent doc to another (e.g. tasks on a project, new manager for a bunch of staff, etc.). The bulk update API call could look something like this:

        /db/bulk_update?mode=view&source=ddoc/_view/tasks&handler=ddoc/_update/reassign_project&startkey=['task_id']&endkey=['task_id',{}]&new_project_id=1234

        or

        /db/bulk_update?mode=view&source=ddoc/_view/staff&handler=ddoc/_update/reassign_manager&startkey=['old_manager_id']&endkey=['old_manager_id',{}]&new_manager_id=1234

        The mode param could be 'view' or 'docs' - in the case of the former, you provide the view name via source. In the case of the latter, you provide the documents as JSON in the request body. Standard view query parameters (startkey, endkey, keys, etc.) could be used with mode=view, and any additional params in the request would simply get passed into the update handler, where they could be retrieved in req.form or req.query.
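
To make the proposed parameters concrete, here is a small sketch that assembles such a URL. The bulk_update endpoint is only a proposal in this thread and does not exist in CouchDB; `buildBulkUpdateUrl` and its option names are hypothetical. The sketch JSON-encodes the keys (view keys are JSON values, so double quotes rather than the single quotes used above) and percent-encodes every parameter.

```javascript
// Builds a URL for the *proposed* bulk_update endpoint (not a real CouchDB API).
function buildBulkUpdateUrl(db, options) {
  var params = {
    mode: "view",
    source: options.source,
    handler: options.handler,
    // View keys are JSON values, so they are serialized as JSON.
    startkey: JSON.stringify(options.startkey),
    endkey: JSON.stringify(options.endkey)
  };
  // Extra parameters would simply be passed through to the update handler.
  Object.keys(options.extra || {}).forEach(function (name) {
    params[name] = options.extra[name];
  });
  var query = Object.keys(params).map(function (name) {
    return encodeURIComponent(name) + "=" + encodeURIComponent(params[name]);
  }).join("&");
  return "/" + encodeURIComponent(db) + "/bulk_update?" + query;
}

var url = buildBulkUpdateUrl("db", {
  source: "ddoc/_view/tasks",
  handler: "ddoc/_update/reassign_project",
  startkey: ["task_id"],
  endkey: ["task_id", {}],
  extra: { new_project_id: "1234" }
});
```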

        While not the prettiest implementation (and could definitely use some refinement), all the pieces are there to source a document array from an existing view, and a handler from an existing design document.

        The real beauty of bulk update handlers implemented this way can be illustrated in the following scenario: say that you're using email addresses as unique IDs for all staff in your system. Some staff are managers, others are employees. Employees have a 'manager' key whose value is the email address of the manager, so you can group them all by manager by using a complex key on a view index. One manager changes their email address IRL. So we PUT to:

        db/_design/ddoc/_update/change_email/employee_id?new=new@email.com

        The change_email update handler looks like this (!!):

        function (doc, req) {
          // Keep the old address so related docs can be looked up by it.
          var old = doc.email;
          // "new" is a reserved word, so use bracket access.
          doc.email = req.query["new"];
          if (doc.manager) {
            // Redirect to the proposed bulk_update endpoint.
            var response = {
              code: 302,
              headers: {
                "Location": "/" + req.info.db_name + "/bulk_update?mode=view&source=ddoc/_view/staff&handler=ddoc/_update/reassign_manager&startkey=['" + old + "']&endkey=['" + old + "',{}]&new_manager_id=" + doc.email,
                "Content-Type": "application/json"
              },
              body: ""
            };
            return [doc, response];
          }
          else {
            return [doc, "Email has been updated"];
          }
        }
        

        So, we update one record using a single update handler, and then conditionally redirect to the _bulk_update handler to update all the "related" documents in the database! Any time a manager changes their email address, we no longer have to worry about updating all the other docs using middleware. This would be so awesome.

        I've come to rely very heavily on CouchApps, and the more I can do without leaving the stack, the happier / more productive / more effective I am. Bulk update handlers would fill a huge gap in my projects, by getting the server to do more heavy lifting, instead of relying on the client to do everything.

        I hope the possibilities I outlined above excite you guys as much as they excite me. Perhaps we can revisit bulk update handlers in an upcoming version?

        Benjamin Young added a comment - edited

        It would seem that allowing _update to store multiple documents would be the easier approach.

        This is the current code for saving documents and responding to an _update request:
        https://github.com/apache/couchdb/blob/master/src/couch_mrview/src/couch_mrview_show.erl#L131

        Perhaps making the first element of the return array support multiple JSON objects (i.e. return [[doc1, doc2], {..headers..}];) would be a good first step.

        The next step would be to allow _update (with no doc_id specified in the URL) to support the bulk_docs JSON format for POSTs.

        Likely this could be broken into two separate commits (and Jira tickets) as they have separate value:
        1. sending in a single _update request and having multiple documents changed
        2. sending in a bulk_docs formatted _update POST and having those documents handled and/or other documents generated/updated.
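
A rough sketch of what a handler could look like if both steps were implemented. This is hypothetical: CouchDB does not accept an array of documents as the first element of the return value, and the handler name, the `manager` field, and the `new_manager_id` parameter are assumptions made for illustration.

```javascript
// Hypothetical handler under the proposal above. Current CouchDB would NOT
// accept an array of documents as the first element of the return value.
var reassignManager = function (doc, req) {
  // With no doc_id in the URL, doc would be null; the documents would
  // arrive in the POST body in the _bulk_docs format: {"docs": [...]}.
  var docs = JSON.parse(req.body).docs;
  docs.forEach(function (d) {
    d.manager = req.query.new_manager_id;
  });
  // Proposed return shape: [[doc1, doc2, ...], response].
  return [docs, { body: JSON.stringify({ submitted: docs.length }) }];
};
```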

        I don't think (now) that a _bulk_update endpoint would be needed.

        Additionally, while the use of redirect responses from an _update handler might work now as a workaround, it's not something I'd care to depend on (nor see shipped in CouchDB), since there is no guarantee that the client would actually follow the Location header, and therefore no guarantee that the second (or later) updates would actually be done. It could certainly work if you control the whole stack and are OK with the potential of the redirects failing at times.

        Thanks for drawing attention to this issue in any case!

        Robert Newson added a comment -

        Is a bulk version of update handlers desirable given that each item in the bulk update can fail independently? Wouldn't that be very confusing? How would users handle that?

        Benjamin Young added a comment -

        Yeah, that needs exploration, discussion, etc. There'd certainly be value in _update (or similar) handling multiple document saving. Likely, there'd need to be ways to do some or all of the following:
        a. report per-doc failure info
        b. handle document-saving in the function (this seems like a Bad Thing due to blocking the engine)
        c. provide an "all or nothing" style feature (or make it mandatory, maybe) where nothing gets stored if they all can't be stored.
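
For option (a), one possibility is to mirror what _bulk_docs already returns: an array with one entry per document, where a saved document carries its new rev and a failed one carries error and reason fields. The sketch below splits such an array; the function name and the sample rows are made up for illustration.

```javascript
// Splits a _bulk_docs style result array into saved and failed entries.
function summarizeBulkResult(results) {
  var summary = { saved: [], failed: [] };
  results.forEach(function (row) {
    if (row.error) {
      summary.failed.push({ id: row.id, error: row.error, reason: row.reason });
    } else {
      summary.saved.push({ id: row.id, rev: row.rev });
    }
  });
  return summary;
}

// Made-up sample: one document saved, one rejected with a conflict.
var summary = summarizeBulkResult([
  { ok: true, id: "task-1", rev: "2-abc" },
  { id: "task-2", error: "conflict", reason: "Document update conflict." }
]);
```

A caller would then retry or report only the entries in `summary.failed`, which is exactly the per-document bookkeeping this list is worried about.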

        Does validate_doc_update run on _update?
        Are conflicts the key issue (as we'd not have read any previous docs prior to write-attempts in the mentioned setup at least)?

        The error condition handling is where the complexity/bloat/etc will happen...as ever...

        Robert Newson added a comment -

        validate_doc_update runs on all updates. An update handler is just a function that runs before a database update; the return value of the update handler includes the update to be attempted. This is also why an update handler (for a single document) can fail with a 409.

        "All or nothing" cannot happen; we are not building a distributed transaction engine.
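
As an illustration of the first point, here is a minimal validate_doc_update sketch. Because it runs on every write, a document returned by an update handler has to pass it as well. The rule itself (requiring an `email` string) is a hypothetical example.

```javascript
// A validate_doc_update function: it runs for every write, including
// writes that originate from an update handler's return value.
var validate = function (newDoc, oldDoc, userCtx, secObj) {
  if (newDoc._deleted) {
    return; // allow deletions
  }
  if (typeof newDoc.email !== "string") {  // hypothetical rule
    // Throwing an object with a "forbidden" key rejects the write.
    throw { forbidden: "email is required" };
  }
};
```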

        Benjamin Young added a comment -

        Good points.

        Given that, we'd need a straight "wrapper" for the _bulk_docs endpoint, but with some user-created handling code:
        http://docs.couchdb.org/en/latest/api/database/bulk-api.html#post--db-_bulk_docs

        Likely the perfect thing for someone to explore as a plugin.

        Ari Najarian added a comment -

        Thanks, all, for revisiting this discussion.

        Just so I'm clear, what Benjamin is proposing (the wrapper for _bulk_docs) would still require some kind of middleware in order to 'prepare' the list of documents to send, correct? I'd likely have to get client-side JS / Node / Lasso / PHP / whatever to query ddoc/_view/viewname?include_docs=true, grab those results, and pipe them into this update wrapper.

        Is there a design reason why piping a view directly into the wrapper wouldn't work? If it's technically feasible, this would cut down on the round-trips to the server (and latency, and bandwidth usage, and overall throughput between CouchDB and the client). It also means less code in fewer places for developers (like me!).
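
The middleware step described above can be sketched as a pure transformation: take the response of a view queried with include_docs=true, apply a change to each document, and produce a body suitable for POSTing to _bulk_docs. The function name, the `change` callback, and the sample data are hypothetical; the HTTP calls themselves are left out.

```javascript
// Turns a view response (queried with include_docs=true) into a _bulk_docs
// request body, applying a change to each document on the way.
function toBulkDocsBody(viewResponse, change) {
  var docs = viewResponse.rows
    .filter(function (row) { return row.doc; })  // skip rows without a doc
    .map(function (row) { return change(row.doc); });
  return { docs: docs };
}

// Made-up view response with two staff documents.
var body = toBulkDocsBody(
  { rows: [
    { id: "e1", key: ["old@example.com"], value: null,
      doc: { _id: "e1", _rev: "1-a", manager: "old@example.com" } },
    { id: "e2", key: ["old@example.com"], value: null,
      doc: { _id: "e2", _rev: "1-b", manager: "old@example.com" } }
  ] },
  function (doc) {
    doc.manager = "new@example.com";
    return doc;
  }
);
```

Since each document keeps its _id and _rev, CouchDB can still accept or reject every entry individually when the body is submitted.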

        Benjamin Young added a comment -

        No, this would (if built) handle multiple documents sent in, likely in a format similar to what _bulk_docs receives, and process them in the same manner, but with a JS function sitting in front.

        The "wrapper" comment was merely about the JS function wrapping the Erlang that does the _bulk_docs processing. Nothing more, and certainly with no requirements outside of CouchDB.

        There are ways (like what you describe) to get close to that now, but that conversation is best suited to the users' email list. Thanks!


          People

          • Assignee: Unassigned
          • Reporter: Benjamin Young
          • Votes: 5
          • Watchers: 7
