CouchDB / COUCHDB-834

startkey_docid/endkey_docid don't work without an exact startkey/endkey match

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: 1.0
    • Fix Version/s: None
    • Component/s: JavaScript View Server
    • Labels: None
    • Skill Level: Regular Contributors Level (Easy to Medium)

      Description

      This issue popped up when I wanted to paginate through a list of documents using a combined array key, using a startkey and endkey that's based solely on the first part of said key. First part is a reference to a different document, second part is a timestamp to keep the list sorted by creation time. The list of documents can be fetched using startkey=["key"] and endkey=["key", {}]

      Now, I wanted to add pagination to this list, only fetching so many documents starting at startkey_docid, which failed using this setup. It seems (and Jan validated that assumption by analyzing the source) that startkey needs to be an exact match for startkey_docid to have any effect. If there's no exact match, CouchDB will silently ignore the startkey_docid, a behaviour that's undocumented and, to be quite frank, unintuitive.

      Consider the following two documents, both pointing to the same other_id:

      {"_id": "one", "other_id": "other", "second_key": "one"}
      {"_id": "two", "other_id": "other", "second_key": "two"}

      And a simple map/reduce function that just emits the combined key:

      {
        "other_documents": {
          "reduce": "_sum",
          "map": "function(doc) {\n  emit([doc.other_id, doc.second_key], 1);\n}\n"
        }
      }

      Querying the view like this gives the expected results:

      curl 'http://localhost:5984/startkey_bug/_design/other_documents/_view/other_documents?reduce=false&startkey=["other"]&endkey=["other",{}]'

      {"total_rows":2,"offset":0,"rows":[
      {"id":"one","key":["other","one"],"value":1},
      {"id":"two","key":["other","two"],"value":1}
      ]}

      If I add in a startkey_docid of two, I'd expect CouchDB to skip to the second result in the list, skipping the first, but it doesn't:

      curl 'http://localhost:5984/startkey_bug/_design/other_documents/_view/other_documents?reduce=false&startkey=["other"]&endkey=["other",{}]&startkey_docid=two'

      {"total_rows":2,"offset":0,"rows":[
      {"id":"one","key":["other","one"],"value":1},
      {"id":"two","key":["other","two"],"value":1}
      ]}

      However, it does what I'd expect when I specify an exact startkey (the endkey is still the same):

      curl 'http://localhost:5984/startkey_bug/_design/other_documents/_view/other_documents?reduce=false&startkey=["other","one"]&endkey=["other",{}]&startkey_docid=two'

      {"total_rows":2,"offset":1,"rows":[
      {"id":"two","key":["other","two"],"value":1}
      ]}

      If you add in an exact endkey, the situation doesn't change, and the result is as expected.

      Having an exact startkey is an acceptable workaround, but I'd still say this behaviour is not intuitive. Either it should be fixed to work the same in all of the above situations or, failing that, the documentation should at least reflect these situations and explain the proper workarounds.

      Update: I just checked how this works out when using descending=true, and the same is true for the swapped endkey and startkey parameters. Specifying an endkey_docid requires specifying an exact endkey match.
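
      The exact-startkey workaround can be sketched as a small query builder. This is only an illustration: `nextPageQuery` is a hypothetical helper, not part of CouchDB, and it assumes the previous page's last row is at hand and that the standard `skip` and `limit` view parameters are used to step past the row already shown.

```javascript
// Hypothetical helper: build the query string for the page that follows `lastRow`.
// `lastRow` is the final row of the previous page, as returned by the view.
function nextPageQuery(lastRow, endkey, limit) {
  const params = [
    ["reduce", "false"],
    // Use the exact key of the last row seen, so startkey_docid can break the tie.
    ["startkey", JSON.stringify(lastRow.key)],
    ["startkey_docid", lastRow.id],
    ["endkey", JSON.stringify(endkey)],
    // The start of the range is inclusive, so skip the row that was already shown.
    ["skip", "1"],
    ["limit", String(limit)],
  ];
  return params
    .map(([name, value]) => `${name}=${encodeURIComponent(value)}`)
    .join("&");
}

const query = nextPageQuery(
  { id: "one", key: ["other", "one"], value: 1 },
  ["other", {}],
  10
);
console.log(query);
```

      The resulting string would be appended after the `?` of the view URL used in the curl calls above.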

        Activity

        Prudhvi added a comment -

        Hi Mathias Meyer,

        As you said,
        > However, it does what I'd expect when I specify an exact startkey (the endkey is still the same):
        >
        > curl 'http://localhost:5984/startkey_bug/_design/other_documents/_view/other_documents?reduce=false&startkey=["other","one"]&endkey=["other",{}]&startkey_docid=two'
        >
        > {"total_rows":2,"offset":1,"rows":[
        > {"id":"two","key":["other","two"],"value":1}
        > ]}

        Actually, it SEEMS to work if you specify an exact startkey along with the endkey filter and startkey_docid.
        However, it DOESN'T work in all cases.
        In my case the view has several rows with the same startkey (with different doc _ids), so when I query with exactly that key along with the endkey filter and startkey_docid, it fetches all the results with that same startkey, ignoring the startkey_docid.

        Like,
        {"total_rows":3,"offset":1,"rows":[
        { "id": "two", "key": ["other","two"], "value":1 },
        { "id": "three", "key": ["other","two"], "value":4 },
        { "id": "four", "key": ["other","two"], "value":5 }
        ]}

        The real problem here is with ignoring the startkey_docid while querying the view.

        Paul Joseph Davis added a comment - - edited

        startkey_docid and endkey_docid only come into effect when you have identical keys.

        Internally, view rows are stored like this:

        [key1, docid1], value
        [key2, docid2], value
        [key3, docid3], value
        [key4, docid4], value
        [key5, docid5], value

        Sorting with docids is the same as if you emit an array key. If the first elements of the array are different it will never consult the second element. It's similar to sorting strings. You only need to look at the prefix that's shared to figure out which comes first.
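
        A rough simulation of this ordering, applied to the two rows from the description. The `collate` function is a deliberately simplified stand-in that only handles strings and arrays and compares strings by code unit; CouchDB's real collation covers more types and uses ICU for strings. The names `collate`, `rows`, and `idsFrom` are made up for this sketch.

```javascript
// Simplified collation: arrays element by element (a shorter prefix sorts first),
// strings by plain code-unit comparison. Assumption: keys hold only strings and arrays.
function collate(a, b) {
  if (Array.isArray(a) && Array.isArray(b)) {
    const shared = Math.min(a.length, b.length);
    for (let i = 0; i < shared; i++) {
      const c = collate(a[i], b[i]);
      if (c !== 0) return c; // first difference decides; later elements are never read
    }
    return a.length - b.length;
  }
  if (a === b) return 0;
  return a < b ? -1 : 1;
}

// The view rows from the description, each treated as a [key, docid] pair.
const rows = [
  { key: ["other", "one"], id: "one" },
  { key: ["other", "two"], id: "two" },
];

// Ids of all rows that sort at or after [startkey, startkeyDocid].
function idsFrom(startkey, startkeyDocid) {
  const start = [startkey, startkeyDocid];
  return rows
    .filter((row) => collate([row.key, row.id], start) >= 0)
    .map((row) => row.id);
}

console.log(idsFrom(["other"], "two"));        // [ 'one', 'two' ]
console.log(idsFrom(["other", "one"], "two")); // [ 'two' ]
```

        In the first call ["other"] already sorts before ["other","one"], so the docid is never compared. In the second call the keys tie on the first row and the docid decides.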

        Prudhvi added a comment -

        Hi Paul,

        Is this a valid bug, or are we querying the view the wrong way when there are multiple similar keys?

        Randall Leeds added a comment -

        It sounds like the real ticket issue is that startkey_docid is ignored when startkey is specified. If that's the case, it makes sense to me. It's not clear to me how that query should behave and so I'd say it's unspecified, or, if we want to get nasty, we could throw an HTTP 400 back. When you have a doc id you don't need a key. If you can think of a compelling case for passing both parameters in a single query, please explain exactly how you think it should work and we can consider it a feature request.

        Paul Joseph Davis added a comment -

        @Prudhvi

        No, it's not a bug. You can basically think of this as if you had switched all of your emit calls from emit(key, val) to emit([key, doc._id], val) and then just changed all of your startkey values to [startkey, docid].
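
        One way to picture that equivalence, using the map function from the description. The `emit` defined here is only a stand-in that collects rows so the sketch runs outside the view server, and `combinedStartkey` is a name made up for illustration.

```javascript
// Stand-in for the view server's emit(): collects rows so the sketch runs anywhere.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// The issue's map function, rewritten so the doc id is an explicit part of the key.
function map(doc) {
  emit([[doc.other_id, doc.second_key], doc._id], 1);
}

map({ _id: "one", other_id: "other", second_key: "one" });
map({ _id: "two", other_id: "other", second_key: "two" });

// startkey=["other","one"] plus startkey_docid=two then corresponds to this single startkey.
const combinedStartkey = JSON.stringify([["other", "one"], "two"]);

console.log(JSON.stringify(emitted.map((row) => row.key)));
console.log(combinedStartkey);
```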

        The important part to remember here is that this is extremely simple. Consider a large sorted array. All that the various key-related options are doing is defining a slice of this array to return. At its most basic this is how all indexing works. You just need to find the part in a sorted list that is relevant to your query.

        In this particular case, it's just important that sorting only looks at as much of a key as is necessary to make a decision. Given something like these two keys:

        [1, 2, 3, 100]
        [1, 2, 4, 0]

        We have to look at the first three positions to determine the sorting. The first position is equal, so we check the second which is also equal, then the third position finally tells us that 3 < 4 and we can stop looking. The values 100 and 0 will never be considered in defining the sort order between these two keys. If a third key came in that was [1, 2, 3, 99] then we would have to compare 99 < 100 to figure out that it goes first in the list.
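
        The comparison just described can be written out as a short sketch. It assumes every key element is a number, which is a simplification of the full collation rules, and `compareArrayKeys` is a name chosen for illustration.

```javascript
// Compare two array keys element by element and stop at the first difference.
// Assumption: all elements are numbers.
function compareArrayKeys(a, b) {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  // One key is a prefix of the other (or they are equal): the shorter one sorts first.
  return a.length - b.length;
}

const keys = [
  [1, 2, 4, 0],
  [1, 2, 3, 100],
  [1, 2, 3, 99],
];
keys.sort(compareArrayKeys);

console.log(JSON.stringify(keys)); // [[1,2,3,99],[1,2,3,100],[1,2,4,0]]
```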

        The startkey_docid parameter is slightly special here. Internally all index keys are stored as a 2-tuple of {Key, DocId} for bookkeeping so that we can do incremental map/reduce. This also allows HTTP requests to differentiate between identical keys coming from multiple documents. But as in the example above, the DocId will never be consulted unless the Keys were identical.

        @Randall

        I think you said that backwards. The only issue that's similar is that startkey_docid has no effect if startkey isn't specified. That could be a 400, but whenever I try and make the HTTP query parsing strict people tell me to Relax and I die a little inside.

        Prudhvi added a comment - - edited

        @paul
        Thank you for the explanation.
        My view emits [ param1, param2 ] -> value.
        You said it is stored internally as [ key, docid ], where key is [ param1, param2 ]:
        [ [ "other", "two" ], doc_id1 ],
        [ [ "other", "two" ], doc_id2 ],
        [ [ "other", "two" ], doc_id3 ]
        We are querying the view for pagination with startkey=[ "other" ], endkey=[ "other", {} ], and startkey_docid=doc_id2.
        Since the first two params are exactly the same, it should consider the third param (doc_id) for an exact match.
        However, it's ignoring the third param (doc_id). We expected a list of results with the doc_id2 row as the first row, but it's returning a list with doc_id1 as the first row.
        This is confusing, as we queried for doc_id2 as the first row in the result for pagination.

        @randall
        Maybe it needs to be a feature ticket, or it could be a bug.
        Please look at the example mentioned above.
        How can we query for pagination when multiple identical keys exist?

        Paul Joseph Davis added a comment -

        @Prudhvi

        First, stop thinking that it's ignoring startkey_docid and reconsider your assumptions.

        Second, yes, the first element of ["other"] and ["other", {}] are the same.

        Third, your broken assumption is that assumption two is important.

        Fourth, what is important is that ["other"] is less than ["other", "two"].

        Fifth, what is also important is that [["other"], whatever_may_exist] is less than [["other", "two"], doc_id1]

        Sixth, this isn't a bug.

        Seventh, to change the startkey_docid behavior would require a tertiary index on view keys, which for the time being is highly unlikely. If you need the behavior you desire, you should create a view for it (because the tertiary index doesn't provide anything you can't accomplish with another view).

        Randall Leeds added a comment -

        My understanding of the purpose of startkey_docid and the structure of the index was wrong. Thanks for the correction, Paul.


          People

          • Assignee: Unassigned
          • Reporter: Mathias Meyer
          • Votes: 3
          • Watchers: 4
