Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11
    • Fix Version/s: None
    • Component/s: JavaScript View Server
    • Labels:
      None
    • Skill Level:
      Committers Level (Medium to Hard)

      Description

      A common operation I find myself performing repeatedly is:

      • request a view (maybe with some basic filter like "keys" or a range of keys)
      • in my client, filter this view based on some complex criteria, leaving me with a small set of document IDs (complex as in array intersections, compound boolean operations, & other stuff not possible in the HTTP view API)
      • go back to Couch and fetch the complete documents for these IDs.
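      As a rough sketch of steps two and three on the client side (the row shape matches what a view query returns; the database name "mydb", the sample rows, and the helper names are assumptions made up for illustration, and step one's HTTP call is left out):

```javascript
// Step 2: filter view rows on the client. Each row has the shape CouchDB
// returns from a view query: { id, key, value }.
function matchingIds(rows, predicate) {
  return rows.filter(predicate).map(function (row) { return row.id; });
}

// Step 3: describe the follow-up bulk fetch. POSTing {"keys": [...]} to
// _all_docs with include_docs=true returns the full documents for those IDs.
// This only builds the request; sending it is left to the HTTP client.
function bulkFetchRequest(dbName, ids) {
  return {
    method: "POST",
    path: "/" + encodeURIComponent(dbName) + "/_all_docs?include_docs=true",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ keys: ids })
  };
}

// Hypothetical rows standing in for the response of step 1.
var rows = [
  { id: "doc-a", key: "a", value: { tags: ["x", "y"], owner: "luke" } },
  { id: "doc-b", key: "b", value: { tags: ["y"], owner: "erik" } },
  { id: "doc-c", key: "c", value: { tags: ["x"], owner: "luke" } }
];

// A "complex" criterion that a plain key or range query cannot express.
var ids = matchingIds(rows, function (row) {
  return row.value.owner === "luke" && row.value.tags.indexOf("x") !== -1;
});
var request = bulkFetchRequest("mydb", ids);
```

      The two round trips (view query, then bulk fetch) are what the rest of this request is trying to collapse into one.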

      List Views almost get me to the point of doing this purely in Couch. I can enumerate over a view and do some complex things with it. But I can't output entire documents unless I use the include_docs=true flag, which murders the performance of the list view, apparently because the entire view is fetched with the docs included and THEN passed on to the list view JS. Typically my complex filter criteria are contained in the view itself, so there is no need to fetch the entire document until I know I have a match.

      In summary, a Filter View would execute some arbitrary JavaScript on each view row, with access to HTTP request parameters, and return "true" for rows that match. The output would be a list of IDs for which the function returned true. include_docs=true would include the matching documents.
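      To make the proposed semantics concrete, here is a minimal model of what such a Filter View might do. Nothing here exists in CouchDB: runFilterView, the (row, req) filter signature, and loadDoc are all hypothetical names invented for this sketch.

```javascript
// Hypothetical driver for the proposed feature; not part of CouchDB.
// rows:     view rows of the form { id, key, value }
// filterFn: function (row, req) returning true for rows to keep
// req:      request object with the query parameters under req.query
// loadDoc:  function (id) returning the full document (the expensive step)
function runFilterView(rows, filterFn, req, loadDoc) {
  var includeDocs = req.query.include_docs === "true";
  var out = [];
  rows.forEach(function (row) {
    if (filterFn(row, req) !== true) return;   // skip non-matching rows
    var result = { id: row.id };
    if (includeDocs) result.doc = loadDoc(row.id); // fetch only for matches
    out.push(result);
  });
  return out;
}

// Example run that counts how many full documents get loaded.
var loads = 0;
var docs = { a: { _id: "a", size: "big" }, b: { _id: "b", size: "big" } };
var results = runFilterView(
  [{ id: "a", key: 1, value: 10 }, { id: "b", key: 2, value: 99 }],
  function (row, req) { return row.value > Number(req.query.min); },
  { query: { min: "50", include_docs: "true" } },
  function (id) { loads += 1; return docs[id]; }
);
```

      The point of the model is the ordering: the filter runs against the row first, and the document lookup happens only for rows that pass.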

      Performance would certainly not be as good as fetching a raw view, but it would indisputably be better than fetching the entire view over HTTP to a client, deserializing the JSON, doing some stuff, then making another HTTP request, and deserializing more JSON.

      I looked at the various entry points for list views in the Couch source. Unfortunately it will take me some time to come up to speed with the source (if I ever have the time ...), and I hope that what I'm asking for could be a simple extension to the List Views for someone very familiar with this area.

        Activity

        Robert Newson added a comment -


        How far could you get by using a list function that filters the rows it emits?

        Luke Burton added a comment -

        Not far, because the objective is to perform the filter on data available in the view, and only do the expensive fetch of the entire document when the filter criteria are met.

        Say I have a Couch database full of images with metadata. My goal is to fetch a bunch of images that contain a particular tag, from a particular author, of less than a particular focal length.

        To do this, I could build a filter view that emits [tags, author, focalLength]. I pass in my match criteria as HTTP parameters. The filter view would enumerate each row and check whether req.tag is in row.tags, whether req.author == row.author, and whether req.focalLength > row.focalLength. I could then emit only the complete documents that match.
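        A sketch of what that could look like. The map function uses the normal view server emit() call; the filter function and its (row, req) signature are hypothetical, since the feature does not exist. It assumes the match criteria arrive as strings under req.query, as they do for list and show functions, and the field names on the image documents are assumptions too.

```javascript
// Map function for the view: emits only the fields needed for matching.
// emit() is supplied by the CouchDB view server at index time.
function map(doc) {
  if (doc.tags && doc.author && typeof doc.focalLength === "number") {
    emit(doc._id, [doc.tags, doc.author, doc.focalLength]);
  }
}

// Hypothetical filter function for the proposed Filter View.
// Returns true when the row satisfies all three request criteria.
function imageFilter(row, req) {
  var tags = row.value[0];
  var author = row.value[1];
  var focalLength = row.value[2];
  var maxFocalLength = parseFloat(req.query.focalLength);
  return tags.indexOf(req.query.tag) !== -1 &&
         author === req.query.author &&
         focalLength < maxFocalLength;
}
```

        A request along the lines of ?tag=sunset&author=luke&focalLength=50 would then keep only rows tagged "sunset", by "luke", with a focal length under 50.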

        To do this with a list view, I would need to supply include_docs=true, to get access to the entire image document so I could actually return it upon a match. This means Couch is retrieving in memory a potentially multi-gigabyte view document, then handing it off to the list view javascript for transformation. Expensive! What if only five images actually match?

        As I mentioned above, you can do all this on the client side - fetch the view, process it, get a list of IDs, then fetch them - but it requires multiple calls over the wire. And it's putting what I consider to be "database oriented" stuff into the front end, rather than in the database itself ...

        Robert Newson added a comment -

        include_docs=true does not bring in the attachments binary data, just a stub. Another way of saying that is that you cannot emit the matching image data in a view anyway.

        To make it faster, you could emit the data you wish to match on in either the key or the value of the emit() call and then you don't need include_docs=true, which is another (internal) lookup.

        With a list function over a view, where the list function only emits rows that match your criteria, you would follow up with calls to pull the attachment itself.
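        A minimal sketch of such a list function. In a design document it would be stored as an anonymous function(head, req); it is named here only for readability. start(), send() and getRow() are provided by the view server. The "author" query parameter and the assumption that the view emitted the author in the row value are illustrative.

```javascript
// List function that streams back only the rows matching a request
// parameter. No include_docs is needed because the match data is
// already in the row value.
function filteredList(head, req) {
  start({ headers: { "Content-Type": "application/json" } });
  var wanted = req.query.author; // hypothetical query parameter
  var first = true;
  var row;
  send('{"rows":[');
  while ((row = getRow())) {
    // Assumes the map function emitted { author: ... } as the value.
    if (row.value && row.value.author === wanted) {
      send((first ? "" : ",") + JSON.stringify({ id: row.id, key: row.key }));
      first = false;
    }
  }
  send("]}");
}
```

        The response is a list of matching IDs and keys; fetching the documents or attachments for those IDs is still a separate request.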

        To your last point, it's true that you would do all this stuff server-side with an RDBMS (stored procedures, etc), but that's not always the natural thing to do with couch.

        Luke Burton added a comment -

        Hah, I picked a bad example. Forget attachments and images. In my case, I actually have rather large hashes stored in documents. I chose images thinking it would better illustrate the point of working with unwieldy documents. Sorry for the confusion!

        So in my case I don't have any affordances for using attachments, and I can't create a view that emits the values that I want along with the lookup keys because what I need is the entire document at the end (it populates records in a SproutCore frontend).

        As to what is "natural" with Couch: this is a database that can host its own apps! I don't think I'm crossing any natural couch boundaries here. Really this is just a small extension to the idea of list views. In fact you could use the list view entirely for this if we just added one JS call that effectively did: "emit these complete document IDs." But I think it's different enough that it might merit its own place.

        Luke Burton added a comment -

        Wait a second. What I'm asking for basically already exists, but I didn't know about it. You can apply exactly the kind of document filter I'm talking about to the _changes feed, as discussed in the CouchDB book.

        All I want to do is use exactly that against any view. Is that possible??

        Chris Anderson added a comment -

        The _changes filters are what I'd recommend for this process. For the moment, the best way to do what you are asking (with built-in Couch tools) is to use filtered replication to create a database for each set of filter parameters.

        In many cases this is easiest expressed as a database-per-user with filters used to ensure everyone only sees the data they are allowed to see.

        Then you would define your views on the filtered databases.

        What you request here is not insane, but it's essentially asking Couch to do the filter operation (potentially expensively) at each read operation, instead of incrementally allowing low-latency access to the computed dataset. I think using filtered replication is a good tradeoff between disk usage and app responsiveness, and would suggest it instead of a _list operation.
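        For reference, a sketch of the filtered-replication setup described above. The filter function has the (doc, req) shape used for _changes filters. The design document name "app", the database names, and the "owner" field are assumptions for illustration.

```javascript
// Filter function as it would be stored under "filters" in a design
// document. CouchDB calls it with each changed document and the request.
function byOwner(doc, req) {
  return doc.owner === req.query.owner;
}

// Body for a replication request that applies the filter, creating one
// filtered database per set of parameters (here, per user).
var replication = {
  source: "master",
  target: "user-persona",
  filter: "app/byOwner",
  query_params: { owner: "persona" }
};
```

        Views defined on "user-persona" would then only ever see the documents the filter let through, so reads pay no per-request filtering cost.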

        Luke Burton added a comment -

        _changes won't work for us in this case, I don't think. We need this feature to decide if the requester should have access to a document or not (and more fun stuff too, but this is a big one). Say for instance you break this out into a per-user database scenario:

        "PersonA" database gets a feed from "Master". Master has a new document added. The replication filter detects that PersonA does indeed have access to this document, so we replicate it across to PersonA. Now someone removes PersonA's access to that document. The replication filter looks at that change and at that point has to turn that request into a delete operation, not an update. Can it do that? From what I have seen, I don't think it can.

        Sure this operation is expensive on the Couch side, but in our stack we have to do it somewhere. Right now it's two expensive HTTP calls from the client. We could completely gut our middle tier layer if Couch allowed us to do this. Its only job at that point would be to attach information from our authorization system on any "view filter" requests coming through.

        Another problem is the massive duplication this approach would entail. We have tens of thousands of documents and hundreds of people. Basically we would be multiplying our database size by close to the number of users, all to filter out a few secure documents. I'd rather pay a small CPU cost to keep things simple. And AFAIK Couch doesn't yet bring up replication on restart, so I'd also have to write some script to trigger a few hundred replications ... maintenance hassle ...
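        The revocation problem in the PersonA scenario above can be shown with a toy model. This is not CouchDB's replicator, just a few lines that copy a change to the target only when the filter accepts it; the "readers" field and all names are assumptions.

```javascript
// Toy model of filtered replication: a change is written to the target
// only if the filter function returns true for it.
function replicate(changes, target, filterFn, req) {
  changes.forEach(function (doc) {
    if (filterFn(doc, req)) target[doc._id] = doc;
  });
  return target;
}

// Filter: replicate a document only if the user is listed as a reader.
function canRead(doc, req) {
  return doc.readers.indexOf(req.query.user) !== -1;
}

var req = { query: { user: "PersonA" } };
var personA = {};

// Revision 1: PersonA is a reader, so the document is copied across.
replicate([{ _id: "d1", _rev: "1-a", readers: ["PersonA"] }], personA, canRead, req);

// Revision 2: access is revoked. The filter rejects the change, so nothing
// is written, and the old revision stays in PersonA's database.
replicate([{ _id: "d1", _rev: "2-b", readers: [] }], personA, canRead, req);
```

        After both runs PersonA's copy still holds revision "1-a": the filter can only skip a change, it has no way to turn it into a delete on the target.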

        Chris Anderson added a comment -

        The forced deleting of remote documents for just some users isn't currently supported (and isn't really compatible with offline data.)

        I'd be happy to support include_docs lookups for JSON list responses. It's a fair amount of code, so I'd expect whoever will be using it to write it.

        If you do create this feature, I'm happy to commit it, but I want to note my architectural reservations here.

        Luke Burton added a comment -

        I'll have a crack at implementing something simple. At the very least, it's worth conducting the experiment!

        Erik Pearson added a comment -

        I see nothing has happened here for a while!

        Wondering if any movement has happened in this area?

        My use case is a little different, but I could really use a dynamic filter.

        I've been happily using couchdb as the data source for a cms. The cms rendering layer does not have direct access to querying couchdb, rather there are two types of adapter functions – for processing templates on the server (json, handlebars-like, lisp), and for processing templates in the browser (json, handlebars, javascript). The adapter functions basically massage the couchdb output, and provide a security layer (no direct access to couchdb from client code.)

        This has worked great.

        However, one function that I would love to move to couchdb is simple view filtering.
        Basically, I need view results filtering that is orthogonal to the view index. For example, I have a set of content that is ordered in alpha order by title, but I want to filter out any expired content. Expiration is determined by comparing the current time to the expiration time on the content. In order to use the view for filtering, I would need to sort by expiration time, but I don't want to do that. What I do currently is get the content in the order I need (there is already filtering going on to select type of content and category) and then apply a filter in my adapter code.

        What I envision is simply a dynamic filter and/or map function on the "other" side of the view, applied as the view is generated. A filter function would just return true or false, depending on whether the row should be included in the view or not, and the map would allow altering the view data. The function should be applicable to any view. The filter should be able to receive query parameters as well. Filters should also be chainable.
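        Roughly what I have in mind, as a sketch. None of this is an existing couchdb API: applyToView, the (row, req) signatures, the "now" and "category" query parameters, and the field names in the row values are all made up for illustration.

```javascript
// Hypothetical per-row filters. Each returns true to keep the row.
function notExpired(row, req) {
  return row.value.expires > Number(req.query.now);
}
function inCategory(row, req) {
  return row.value.category === req.query.category;
}

// Hypothetical per-row map that alters the data returned for a row.
function titleOnly(row) {
  return { id: row.id, key: row.key, value: row.value.title };
}

// Applies a chain of filters, then an optional map, to rows in view order.
function applyToView(rows, filters, mapFn, req) {
  return rows
    .filter(function (row) {
      return filters.every(function (f) { return f(row, req) === true; });
    })
    .map(function (row) { return mapFn ? mapFn(row, req) : row; });
}

// Rows as the view would return them: sorted by title, not by expiration.
var rows = [
  { id: "c1", key: "Apples", value: { title: "Apples", category: "news", expires: 2000 } },
  { id: "c2", key: "Bananas", value: { title: "Bananas", category: "news", expires: 500 } },
  { id: "c3", key: "Cherries", value: { title: "Cherries", category: "blog", expires: 2000 } },
  { id: "c4", key: "Dates", value: { title: "Dates", category: "news", expires: 3000 } }
];

var visible = applyToView(rows, [notExpired, inCategory], titleOnly,
                          { query: { now: "1000", category: "news" } });
```

        The rows that survive keep the title ordering of the view, which is the whole point: the filter is independent of the index key.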

        The list function seems like the wrong approach to this. I don't want to iterate over content, just supply one or more functions that are applied to each row, as I would expect a filter or map to do.

        So I guess I'm curious whether this proposal has ever taken off, any work done on it, how it might fit into the current architecture of couchdb. I see that no-one is assigned to it. I might be happy to try to work on this if no one else is ...


          People

          • Assignee:
            Unassigned
            Reporter:
            Luke Burton
          • Votes:
            2
            Watchers:
            3
