Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Description
A deadlock can cause newly started replications to get stuck (and never update their state to triggered). This happens with a remote source when using a single HTTP connection and a single worker.
The deadlock occurs in this case:
- The replication process starts and launches the changes reader: https://github.com/apache/couchdb-couch-replicator/blob/master/src/couch_replicator.erl#L276
- The changes reader takes the worker from the httpc pool. At some point it calls back into the replication process to report how much work it has done, using the gen_server call report_seq_done.
- In the meantime, the main replication process calls get_pending_changes to get the pending changes from the source. If the source is remote, it attempts to take a worker from the httpc pool. However, the only worker is in use by the changes feed process, so get_pending_changes blocks waiting for a worker to be released.
- The changes feed is therefore waiting for its report_seq_done call to the replication process to return while holding the worker, and the main replication process is waiting for the httpc pool to release that worker, so it never responds to report_seq_done.
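The cycle above can be sketched with plain Python threads. This is only an analogy, not the Erlang code: a semaphore stands in for the httpc pool, a queue stands in for the replication process mailbox, and the names simulate, changes_reader and replication_process are hypothetical. Timeouts are used so that the sketch reports the stuck parties instead of hanging forever.

```python
import queue
import threading


def simulate(pool_size=1, timeout=0.5):
    """Return a sorted list of the participants that got stuck."""
    pool = threading.Semaphore(pool_size)  # stands in for the httpc pool
    mailbox = queue.Queue()                # pending calls to the replication process
    reader_has_worker = threading.Event()
    stuck = []

    def changes_reader():
        pool.acquire()                     # take a worker for the changes feed
        reader_has_worker.set()
        reply = threading.Event()
        mailbox.put(reply)                 # synchronous report_seq_done call
        # Wait for the reply while still holding the worker. The longer
        # timeout keeps the worker held until the other side has given up.
        if not reply.wait(2 * timeout):
            stuck.append("changes_reader")
        pool.release()

    def replication_process():
        reader_has_worker.wait()
        # get_pending_changes needs a worker when the source is remote.
        if not pool.acquire(timeout=timeout):
            stuck.append("replication_process")
            return                         # never gets back to its mailbox
        pool.release()
        # Only after get_pending_changes returns is the call answered.
        mailbox.get(timeout=timeout).set()

    threads = [
        threading.Thread(target=changes_reader),
        threading.Thread(target=replication_process),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stuck)


if __name__ == "__main__":
    print(simulate(pool_size=1))
    print(simulate(pool_size=2))
```

With a pool of one worker both sides are reported as stuck; with two workers neither is, which matches the observation that the problem needs a single connection and a single worker.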
Attached a Python script (rep.py) to reproduce the issue. The script creates n databases (tested with n=1000), then replicates those databases to a single database. It also needs the Python CouchDB module from pip (or package repos).
1. It can be run from ipython by importing rep.
2. Start the dev cluster: ./dev/run --admin=adm:pass
3. rep.replicate_1_to_n(1000)
wait....
4. rep.check_untriggered()
When it fails, result might look like this:
{ 'rdyno_00001_00006': None, 'rdyno_00001_00158': None }
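For illustration, a check of this kind could look roughly like the sketch below. It is not the attached rep.py. It assumes the keys in the result are replication document IDs, that the state is stored in each document's _replication_state field, and that the documents have already been fetched into a list of dicts; the function name untriggered is hypothetical.

```python
def untriggered(docs):
    """Map replication doc ID -> state for docs that have no state yet."""
    return {
        doc["_id"]: doc.get("_replication_state")
        for doc in docs
        # Skip design documents; they are not replication documents.
        if not doc["_id"].startswith("_design/")
        # Assumption: a stuck replication never had its state field written.
        and doc.get("_replication_state") is None
    }


if __name__ == "__main__":
    sample = [
        {"_id": "rdyno_00001_00005", "_replication_state": "triggered"},
        {"_id": "rdyno_00001_00006"},
    ]
    print(untriggered(sample))
```

An empty result would mean every replication document has been assigned a state.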