CouchDB / COUCHDB-442

Add a "view" or "format" function to process source doc on query

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: JavaScript View Server
    • Labels:
      None

      Description

      It is common practice to do emit(key, null) in a map function and then query with ?include_docs=true to retrieve the documents that were responsible for the entries. However, the full document may include information that is privileged or the full document may be substantially larger than the information needed to be transferred to the client.

      The proposed enhancement is to allow defining a "view" function in addition to the existing "map" and "reduce" on a view. If specified, the view function would take the id, key, value and doc and return a JSON value that would be added as the "view" member to the row in the result set.

      One of the use cases on http://wiki.apache.org/couchdb/Authentication_and_Authorization is to be able to specify that a user can retrieve the values from a view, but not add include_docs since that may expose information that they are not authorized to view. Without the "view" function, there would be pressure to start pushing things into the emitted value.

      Production of the "view" member would likely be controlled with an include_views=true parameter in the query string.
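
A minimal sketch of how the proposal might behave, written as plain JavaScript that simulates a view query rather than as anything CouchDB actually provides. The `view` function, the `include_views` option, the `queryView` harness and the sample documents are all hypothetical; only the `emit(key, null)` pattern comes from the description above.

```javascript
// Hypothetical illustration only: CouchDB has no "view" function.
// Sample documents; "ssn" stands in for privileged data.
var docs = [
  { _id: "emp2", name: "Bob", university: "UCLA", ssn: "000-00-0002" },
  { _id: "emp1", name: "Ann", university: "MIT", ssn: "000-00-0001" }
];

var viewDef = {
  // Ordinary map function: index by university, emit no value.
  map: function (doc) {
    emit(doc.university, null);
  },
  // Proposed (hypothetical) function: receives id, key, value and doc,
  // and returns the JSON value to expose as the row's "view" member.
  view: function (id, key, value, doc) {
    return { name: doc.name, university: doc.university };
  }
};

// Stand-in for the view server's emit(); it records rows for the
// document currently being mapped.
var rows = [];
var currentId = null;
function emit(key, value) {
  rows.push({ id: currentId, key: key, value: value });
}

// Hypothetical harness: builds the rows, sorts them by key (simple
// string comparison), then adds "view" only when include_views is set.
function queryView(def, allDocs, options) {
  rows = [];
  var byId = {};
  allDocs.forEach(function (doc) {
    byId[doc._id] = doc;
    currentId = doc._id;
    def.map(doc);
  });
  rows.sort(function (a, b) {
    return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
  });
  if (options.include_views && def.view) {
    rows.forEach(function (row) {
      row.view = def.view(row.id, row.key, row.value, byId[row.id]);
    });
  }
  return rows;
}

var result = queryView(viewDef, docs, { include_views: true });
console.log(JSON.stringify(result, null, 2));
```

In this sketch the extra function runs only while the result set is being assembled, so the stored rows stay as small as with emit(key, null), and the privileged field never reaches the response.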

        Activity

        Paul Joseph Davis added a comment -

        What reason is there for not just putting the extra data in the value? I can understand barring include_docs=true for clients, but this extra functionality seems quite unnecessary.

        Curt Arnold added a comment -

        Say we have a personnel system where documents contain some confidential info (perhaps Social Security Number) and some less restricted info like education location, phone number, blog entries, etc. Say the split is only 5% confidential and 95% public. I want to create views by university, graduation date, phone number, location, department, etc. On each of the queries, I'd like a class of users to see everything public about the person but not any of the confidential info.

        Without the view function, one option would be calling something like:

        emit(key, sanitize(doc));

        in the map function for each of the views which would be 95% as bad as doing emit(key, doc). Basically, everything that would motivate you to do emit(key, null) over emit(key, doc) comes into play, but just slightly reduced.
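
A rough sketch of what such a sanitize() helper could look like, and of why the emitted row grows. The whitelist, the field names and the stand-in emit() are assumptions made for illustration; none of them come from CouchDB itself.

```javascript
// Hypothetical whitelist-style sanitize(); the field names are assumptions.
var PUBLIC_FIELDS = ["name", "university", "phone"];

function sanitize(doc) {
  var out = {};
  PUBLIC_FIELDS.forEach(function (field) {
    if (Object.prototype.hasOwnProperty.call(doc, field)) {
      out[field] = doc[field];
    }
  });
  return out;
}

// Stand-in for the view server's emit(); it collects rows so that
// their serialized sizes can be compared.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

var doc = {
  _id: "emp1",
  name: "Ann",
  university: "MIT",
  phone: "555-0100",
  ssn: "000-00-0001"
};

emit(doc.university, null);           // small row, needs include_docs later
emit(doc.university, sanitize(doc));  // larger row, no confidential fields

// Serialized length of each row, as a crude proxy for index size.
var sizes = emitted.map(function (row) {
  return JSON.stringify(row).length;
});
console.log(sizes);
```

The second row carries every public field, so when most of a document is public the index grows almost as much as it would with emit(key, doc).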

        Another workaround would be to grab the ids and just attempt to retrieve all the underlying documents. Some other part of the authentication system would need to prevent retrieving the confidential info either by rejecting the request for the entire document or sanitizing it.

        Allowing access to views while disabling include_docs and preventing direct retrieval of docs is my best analogy to authorizing access to SQL views but restricting direct access to the tables.

        While the main motivation is preparing for more beefed-up authentication and authorization, it would seem to have some independent usefulness. Plus, it would not appear to require any additional resources until the result set is serialized, and then only if the user added ?include_views=true.

        Paul Joseph Davis added a comment -

        Still not convinced. I'm basically reading this as an argument for adding some special code for making list and show functions not be tied to the underlying document or view, which should be more than reasonable. include_docs=true would definitely need to respect ACLs of any sort, and it should do that with a trivial patch if it doesn't do it already.

        The example for emit(key, sanitize(doc)) is lost on me. If you have a view, and want the user to have access to extra information in the doc when reading that view, just include it in the view and use whatever access control we might have in place for reading views.

        And lastly, if your data is 95% public then I'd start suggesting that you might want to consider two databases, one for the private data and one for the public. The only analogy that I can think of is trying to keep water in a colander: you can try to plug every hole or turn off the tap. I'm a turn-off-the-tap kind of guy.

        Curt Arnold added a comment -

        Not sure if I'm understanding the first sentence. I am specifically suggesting additional code that is part of a view. A view/format/display function that would be a peer to the map and reduce function which could be used to process the matched documents so that only the information that is necessary to support the related business function is exposed. Something analogous to being able to include only certain columns in a SQL view.

        The case was a hypothetical and you would definitely need to partition information across documents if users who could not see the SSN could update other parts of the info on an individual. However, if you start defining 20 different business functions and each has need to see a slightly different subset of the data, then trying to accomplish that through partitioning (either document or database) becomes untenable. Partitioning sensitive info into a different database then would cost you the ability to do views that combined sensitive and less-sensitive info.

        Having the extra function execute at serialization time is an optimization over emitting part of the document at map time, but it seems like it would be a very desirable optimization once authentication and authorization are more mature, and it could be useful now. Just trying to get things that fall out of the authentication and authorization discussions visible in JIRA for elaboration and consideration.

        Paul Joseph Davis added a comment -

        This description "A view/format/display function that would be a peer to the map and reduce function which could be used to process the matched documents so that only the information that is necessary to support the related business function is exposed." is pretty much what _list and _show are meant for. If _list doesn't have access to include_docs (which theoretically it should already have) then we can definitely look at adding that.
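
For readers unfamiliar with _list, here is a hedged sketch of a list function that exposes only selected fields from each row's document. getRow(), send() and start() are the helpers CouchDB supplies to list functions; the stubs below are hypothetical stand-ins so the sketch runs under plain Node.js. It assumes the view was queried with include_docs=true and that each row therefore carries a doc member, which the comment above notes may not yet be the case. The sample rows and field names are invented.

```javascript
// Hypothetical harness: minimal stand-ins for the helpers that the
// CouchDB query server normally provides to a list function.
var pendingRows = [
  { id: "emp1", key: "MIT", value: null,
    doc: { _id: "emp1", name: "Ann", university: "MIT", ssn: "000-00-0001" } },
  { id: "emp2", key: "UCLA", value: null,
    doc: { _id: "emp2", name: "Bob", university: "UCLA", ssn: "000-00-0002" } }
];
var output = [];
function getRow() { return pendingRows.shift(); } // undefined when exhausted
function send(chunk) { output.push(chunk); }
function start(response) { /* headers are ignored in this harness */ }

// List function in the usual function (head, req) shape. It streams a
// JSON array containing only the public fields of each row's document.
function publicList(head, req) {
  start({ headers: { "Content-Type": "application/json" } });
  var row;
  var first = true;
  send("[");
  while ((row = getRow())) {
    var doc = row.doc || {};
    send((first ? "" : ",") + JSON.stringify({
      id: row.id,
      key: row.key,
      name: doc.name,
      university: doc.university
    }));
    first = false;
  }
  send("]");
}

publicList({}, { query: { include_docs: "true" } });
var body = output.join("");
console.log(body);
```

This covers the formatting half of the request. It does not by itself stop a client from querying the underlying view directly, which is the access-control gap discussed in the following comments.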

        I'd also prefer to keep general discussions on the dev@ list so that we can try and keep the JIRA SNR at a sane level.

        Curt Arnold added a comment -

        I wasn't familiar with _list and _show; I've been heads-down in a system using CouchDB for a while and hadn't tracked their development. There are some similarities in functionality, but they are not quite the same.

        To accomplish the authorization goal, you would need to be able to specify that a user could run a particular view only when it is in the context of a specific list request. That would require the view processor to be aware of its context and run with some elevated privilege which sounds like a recipe for problems.

        JIRA works well to focus discussion around specific feature requests. Having specific bite-sized enhancements defined in JIRA should make it easier for new people to pick them up and contribute to the project. However, I'll switch to starting discussion of new features on the mailing list first and then create the JIRA entry when the discussion dies down.

        Paul Joseph Davis added a comment -

        Right, we'd need to add ACL support to _list and _show to accomplish the original intent, but that was my reason for doubting the feature; we can just roll this functionality into the existing feature set.

        Closing this because it probably won't happen. When some of the upcoming auth/access patches hit, similar functionality will most likely get rolled into _list and _show. For interested listeners, keep an ear to the dev@ ground for updates.

        Chris Anderson added a comment -

        for what it's worth:

        emit(key, sanitize(doc));

        this is the "right" way to do it. include_docs should be avoided in production code paths (admin is fine) especially where you'll be selecting a non-trivial number of rows.

        Adam Kocoloski added a comment -

        Hmm Chris, I'm not sure I agree with that statement. I've seen production setups with ~5 GB view indexes where the view emitted (key,null) and used include_docs=true for the lookup. If the view had used sanitize(doc) it would've been much too big to be cached in main memory. Then you're pretty well screwed.

        Sam Bisbee added a comment -

        Resolved for a while. Closing.


          People

          • Assignee: Unassigned
          • Reporter: Curt Arnold
          • Votes: 0
          • Watchers: 0
