CouchDB / COUCHDB-1288

More efficient builtin filters _doc_ids and _design

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: None
    • Labels: None

      Description

      We have had the _doc_ids and _design filters for _changes since CouchDB 1.1.0.
      While they meet the expectations of applications and users, they are far from efficient for large databases.
      The current implementation folds the entire seq btree and then filters values by document ID, which causes too much IO and thrashes caches. This makes replication by doc IDs less efficient than it could be.

      The proposed patch avoids this by doing direct lookups in the ID btree for _doc_ids, and a ranged fold for _design.
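
To make the difference concrete, here is a minimal Python sketch of the two access patterns. It is not CouchDB's Erlang code: the dictionaries `by_seq` and `by_id` and all function names are hypothetical stand-ins for the seq and ID btrees, chosen only to show why a per-ID lookup or a prefix-bounded range touches far fewer entries than a full fold.

```python
from bisect import bisect_left

# Toy stand-ins for the two indexes (hypothetical data, not CouchDB's format).
# by_seq maps update sequence -> doc ID; by_id maps doc ID -> latest sequence.
by_seq = {1: "a", 2: "_design/app", 3: "b", 4: "c", 5: "_design/views"}
by_id = {doc_id: seq for seq, doc_id in by_seq.items()}
sorted_ids = sorted(by_id)


def changes_full_scan(wanted_ids):
    """Old approach: visit every by-seq entry and keep the matching IDs."""
    wanted = set(wanted_ids)
    return [(seq, d) for seq, d in sorted(by_seq.items()) if d in wanted]


def changes_by_lookup(wanted_ids):
    """_doc_ids sketch: one lookup per requested ID, then order by sequence."""
    found = [(by_id[d], d) for d in set(wanted_ids) if d in by_id]
    return sorted(found)


def changes_design_range(prefix="_design/"):
    """_design sketch: fold only the contiguous ID range sharing the prefix."""
    start = bisect_left(sorted_ids, prefix)
    result = []
    for doc_id in sorted_ids[start:]:
        if not doc_id.startswith(prefix):
            break  # left the prefix range, stop folding
        result.append((by_id[doc_id], doc_id))
    return sorted(result)
```

Both variants return the same rows as the full scan for the same request; the lookup and range versions simply never visit entries outside the requested IDs or prefix.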

      If there are no objections, I would also apply this to branch 1.2.x.

      1. couchdb_1288_2.patch
        6 kB
        Filipe Manana
      2. couchdb_1288_3.patch
        20 kB
        Filipe Manana

        Activity

        Filipe Manana added a comment -

        Applied to trunk and branch 1.2.x

        Filipe Manana added a comment -

        Thanks Bob.

        If it's a separate issue, unrelated to any changes from this patch, it should go into a separate patch/ticket.

        Bob Dionne added a comment - - edited

        Filipe,

        I started reviewing this and it looks good so far. There's an edge case we ran into the other day that @davisp and @kocolosk ran down. When you have `feed=continuous`, a heartbeat, and a filter function that fails often enough, the heartbeat timeout never triggers and no changes are sent. It's easy to reproduce; you can see how it's handled in fabric [1]. I can add it to this patch or open a second ticket if you prefer.

        Also, as an aside, `couch_changes:get_changes_timeout` is slightly awkward in the way it handles heartbeat. It appears to allow `heartbeat=true`, in which case it defaults to the timeout in the config. That certainly does not agree with the documented semantics.

        Cheers,

        Bob

        [1] https://github.com/cloudant/fabric/commit/f9eea28e62496afcb
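
One plausible reading of the edge case described above can be simulated in a few lines of Python. This is only a sketch under assumptions: it supposes the heartbeat is driven by an idle timer that restarts whenever any update arrives, even one the filter rejects. The function `run_feed` and its parameters are hypothetical and are not taken from CouchDB or fabric.

```python
def run_feed(events, heartbeat, check_on_filtered):
    """Simulate a continuous feed.

    events: list of (timestamp, passes_filter) pairs in time order.
    heartbeat: interval after which a heartbeat should be emitted.
    check_on_filtered: if True, also consider a heartbeat whenever a
        change is rejected by the filter (the assumed fix).
    Returns the list of ("change" | "heartbeat", timestamp) outputs.
    """
    out = []
    last_sent = 0   # time of the last thing written to the client
    prev = 0        # time of the last update seen, filtered or not
    for ts, passes in events:
        # Idle timer: fires only if no update arrived for `heartbeat` units.
        if ts - prev >= heartbeat:
            out.append(("heartbeat", prev + heartbeat))
            last_sent = prev + heartbeat
        prev = ts
        if passes:
            out.append(("change", ts))
            last_sent = ts
        elif check_on_filtered and ts - last_sent >= heartbeat:
            # Assumed fix: measure time since the last output, not since
            # the last update, so rejected changes cannot starve the client.
            out.append(("heartbeat", ts))
            last_sent = ts
    return out


# A steady stream of updates that the filter always rejects.
rejected = [(t, False) for t in range(1, 11)]
```

With `rejected` and `heartbeat=3`, the idle-timer-only variant writes nothing at all, while the variant that also checks on rejected changes emits a heartbeat every three time units.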

        Filipe Manana added a comment -

        Added patch with test case, including the case for continuous changes.

        Filipe Manana added a comment -

        This still needs some small work for the continuous case and a test.

        Filipe Manana added a comment -

        Second version of the patch. For _doc_ids, the optimized code path is only triggered if the number of doc IDs is not greater than 100. This is to avoid loading too many full_doc_info records into memory, which can be big if the rev trees are long and/or have many branches.
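
The cutoff described above amounts to a simple dispatch on the size of the request. A minimal Python sketch, where `MAX_LOOKUP_DOC_IDS` and `pick_strategy` are hypothetical names and only the value 100 comes from the comment:

```python
MAX_LOOKUP_DOC_IDS = 100  # threshold quoted in the comment above


def pick_strategy(doc_ids):
    """Choose the ID-btree lookup path only for small requests.

    Larger requests fall back to folding the seq btree, so that only a
    bounded number of per-document records is held in memory at once.
    """
    if len(doc_ids) <= MAX_LOOKUP_DOC_IDS:
        return "id_lookup"
    return "seq_fold"
```

A request with exactly 100 IDs still takes the lookup path, since the condition is "not greater than 100".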


          People

          • Assignee:
            Unassigned
            Reporter:
            Filipe Manana