  1. CouchDB
  2. COUCHDB-2965

Race condition in replicator rescan logic

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Replication
    • Labels: None

    Description

      There is a race condition between the full rescan and regular change feed processing in the couch_replicator_manager code.

      This race condition leaves replication docs in an untriggered state when a rescan of all the docs is performed. The rescan can happen when nodes connect and disconnect. The race becomes more likely when many documents are being updated and messages back up in the replicator manager's mailbox.

      The race condition happens in the following way:

      • scan_all_dbs finds every database that looks like a replicator database and, for each one, sends a {resume_scan, DbName} message to the main couch_replicator_manager process.

      The race occurs because change feeds, when they stop, call the replicator manager with a {rep_db_checkpoint, DbName} message. That call updates the db_to_seq ets table with the latest change sequence: https://github.com/apache/couchdb-couch-replicator/blob/master/src/couch_replicator_manager.erl#L225 This means the following sequence of operations can happen:

      • db_to_seq is reset to 0, scan_all_dbs is spawned
      • change feed stops at sequence 1042, it calls {rep_db_checkpoint, <<"_replicator">>}
      • {rep_db_checkpoint, <<"_replicator">>} call is handled, now latest db_to_seq for _replicator is 1042
      • {resume_scan, <<"_replicator">>} is sent from the scan_all_dbs process and received by the replicator manager. It sees that db_to_seq has _replicator at sequence 1042, so it starts from that sequence instead of 0, skipping the updates from 0 to 1042.
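
      The interleaving above can be illustrated with a small, deterministic simulation. This is only a sketch in Python, not the Erlang code from couch_replicator_manager: the run_manager function, the dict standing in for the db_to_seq ets table, and the choice to carry the sequence number inside the checkpoint message are all assumptions made for illustration.

```python
from collections import deque


def run_manager(mailbox, db_to_seq):
    """Hypothetical stand-in for the manager's message loop.

    Processes messages strictly in mailbox order and returns the
    sequence each rescan would start from.
    """
    scan_starts = []
    while mailbox:
        msg = mailbox.popleft()
        kind, db = msg[0], msg[1]
        if kind == "rep_db_checkpoint":
            # A stopping change feed records the last sequence it processed.
            # (Assumption: the sequence travels inside the message here.)
            db_to_seq[db] = msg[2]
        elif kind == "resume_scan":
            # The rescan resumes from whatever db_to_seq currently holds.
            scan_starts.append((db, db_to_seq.get(db, 0)))
    return scan_starts


db = "_replicator"

# Racy order: db_to_seq was reset to 0, but the checkpoint from the old
# change feed is handled before resume_scan arrives.
racy_mailbox = deque([("rep_db_checkpoint", db, 1042), ("resume_scan", db)])
racy_starts = run_manager(racy_mailbox, {db: 0})

# Benign order: resume_scan is handled first, so the rescan starts at 0.
ok_mailbox = deque([("resume_scan", db), ("rep_db_checkpoint", db, 1042)])
ok_starts = run_manager(ok_mailbox, {db: 0})

print("racy order starts rescan at:", racy_starts)
print("benign order starts rescan at:", ok_starts)
```

      In the racy order the rescan starts at 1042 rather than 0, which corresponds to the skipped updates described above. The outcome depends only on which message the manager handles first.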

      This was observed in an experiment in which 1000 replication documents were being updated. Around document 700, node1 was killed (pkill -f node1). node2 hit the race condition on rescan and never picked up a number of documents that should have belonged to it.

          People

            Assignee: Unassigned
            Reporter: Nick Vatamaniuc (vatamane)