Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: Replication
    • Labels:
      None

      Description

      04-replication-large-atts.t gets stuck around 558, sometimes a little earlier or later, but it times out eventually, regardless of the timeout. I tried doubling and such.

        Issue Links

          Activity

          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit eae95be9822b3b8af69d93c4ac33a8917e4403ec in couchdb's branch refs/heads/master from Robert Newson
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=eae95be ]

          Increase default socket buffer sizes for chttpd too (COUCHDB-1986)

          Show
          jira-bot ASF subversion and git services added a comment - Commit eae95be9822b3b8af69d93c4ac33a8917e4403ec in couchdb's branch refs/heads/master from Robert Newson [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=eae95be ] Increase default socket buffer sizes for chttpd too ( COUCHDB-1986 )
          Hide
          wohali Joan Touzet added a comment -

          I believe this was resolved in the last checkpoint and in rc.5.

          Show
          wohali Joan Touzet added a comment - I believe this was resolved in the last checkpoint and in rc.5.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit dd1c629fca71997c6c7dae5658ccb4b9f7855e68 in couchdb-couch-replicator's branch refs/heads/1843-feature-bigcouch from Benoit Chesneau
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb-couch-replicator.git;h=dd1c629 ]

          don't crash if ibrowse is already started

          fix #COUCHDB-1986

          Show
          jira-bot ASF subversion and git services added a comment - Commit dd1c629fca71997c6c7dae5658ccb4b9f7855e68 in couchdb-couch-replicator's branch refs/heads/1843-feature-bigcouch from Benoit Chesneau [ https://git-wip-us.apache.org/repos/asf?p=couchdb-couch-replicator.git;h=dd1c629 ] don't crash if ibrowse is already started fix # COUCHDB-1986
          Hide
          kxepal Alexander Shorin added a comment -

          I have another good idea: we all know that OSX 10.9.2 is affected to this issue. And we also know that multipart parser speedup (COUCHDB-1953) are eventually solves this issue for it. So why not to use this case as starting point for deeper debugging? At least we have two stable states there: (timeouts and no COUCHDB-1953) and (no timeouts and COUCHDB-1953). If we suspect that the root of this issue is in the code, such behaviour is the key, I'm sure. At least it should more easily give the answer on "why so?" question and after that, we could apply the answer correctly on the others systems like FreeBSD 10 where current multipart improvements makes no any sense in solving this issue.

          Suddenly, I have no any MacOS around, but is there anyone who would like to help with debugging in this direction?

          Show
          kxepal Alexander Shorin added a comment - I have another good idea: we all know that OSX 10.9.2 is affected to this issue. And we also know that multipart parser speedup ( COUCHDB-1953 ) are eventually solves this issue for it. So why not to use this case as starting point for deeper debugging? At least we have two stable states there: (timeouts and no COUCHDB-1953 ) and (no timeouts and COUCHDB-1953 ). If we suspect that the root of this issue is in the code, such behaviour is the key, I'm sure. At least it should more easily give the answer on "why so?" question and after that, we could apply the answer correctly on the others systems like FreeBSD 10 where current multipart improvements makes no any sense in solving this issue. Suddenly, I have no any MacOS around, but is there anyone who would like to help with debugging in this direction?
          Hide
          benoitc Benoit Chesneau added a comment -

          Nick North this sounds like a good candidate yes. from the tests done by Alexander Shorin on irc it seems that the issue doest happen on recv but somewhere else in the code.

          Show
          benoitc Benoit Chesneau added a comment - Nick North this sounds like a good candidate yes. from the tests done by Alexander Shorin on irc it seems that the issue doest happen on recv but somewhere else in the code.
          Hide
          wendall911 Wendall Cada added a comment -

          I'd also like to see this stay open until we know the root cause as well.

          Show
          wendall911 Wendall Cada added a comment - I'd also like to see this stay open until we know the root cause as well.
          Hide
          benoitc Benoit Chesneau added a comment -

          the file upload issue is interresting, it happens behind a nginx proxy so using the 1.0 HTTP prorocol. @oriol what kind of request was made to upload the file? Was it a chunked encoded request, a request with content-lenght or no content-length at all? Which client did you use?

          Anyway to help solving the issue I propose the following:

          • isolate the test that is failing (or write a new one) in it's own test. so we can make sure to not have any regression (we should add the file uploadinging mentionned by Oriol too so we can validate it too.
          • annotate the code to have log some checkpoints in the parser. During our last test we noticed that the problem wasn't raised on recv, which is a good reason to think that the problem is somewhere else.
          Show
          benoitc Benoit Chesneau added a comment - the file upload issue is interresting, it happens behind a nginx proxy so using the 1.0 HTTP prorocol. @oriol what kind of request was made to upload the file? Was it a chunked encoded request, a request with content-lenght or no content-length at all? Which client did you use? Anyway to help solving the issue I propose the following: isolate the test that is failing (or write a new one) in it's own test. so we can make sure to not have any regression (we should add the file uploadinging mentionned by Oriol too so we can validate it too. annotate the code to have log some checkpoints in the parser. During our last test we noticed that the problem wasn't raised on recv, which is a good reason to think that the problem is somewhere else.
          Hide
          benoitc Benoit Chesneau added a comment -

          the issue isn't solved yet. The current patch is a workaround. We didn't find the root cause.

          Show
          benoitc Benoit Chesneau added a comment - the issue isn't solved yet. The current patch is a workaround. We didn't find the root cause.
          Hide
          dch Dave Cottlehuber added a comment -

          I assume we're OK with resolving this atm after Alexander Shorin's last patch.

          Show
          dch Dave Cottlehuber added a comment - I assume we're OK with resolving this atm after Alexander Shorin 's last patch.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 9a0e561b1f3bc34810dd9112ab0a3fe3722c9a54 in couchdb's branch refs/heads/1.6.x from Alexander Shorin
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=9a0e561 ]

          Increase mochiweb socket buffers up to 256KiB

          This dramatically improves replication speed for large documents and
          attachments and resolves replication test timeout for FreeBSD 10 and
          OSX 10.9.2

          Fixes COUCHDB-1986

          Show
          jira-bot ASF subversion and git services added a comment - Commit 9a0e561b1f3bc34810dd9112ab0a3fe3722c9a54 in couchdb's branch refs/heads/1.6.x from Alexander Shorin [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=9a0e561 ] Increase mochiweb socket buffers up to 256KiB This dramatically improves replication speed for large documents and attachments and resolves replication test timeout for FreeBSD 10 and OSX 10.9.2 Fixes COUCHDB-1986
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 9a0e561b1f3bc34810dd9112ab0a3fe3722c9a54 in couchdb's branch refs/heads/master from Alexander Shorin
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=9a0e561 ]

          Increase mochiweb socket buffers up to 256KiB

          This dramatically improves replication speed for large documents and
          attachments and resolves replication test timeout for FreeBSD 10 and
          OSX 10.9.2

          Fixes COUCHDB-1986

          Show
          jira-bot ASF subversion and git services added a comment - Commit 9a0e561b1f3bc34810dd9112ab0a3fe3722c9a54 in couchdb's branch refs/heads/master from Alexander Shorin [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=9a0e561 ] Increase mochiweb socket buffers up to 256KiB This dramatically improves replication speed for large documents and attachments and resolves replication test timeout for FreeBSD 10 and OSX 10.9.2 Fixes COUCHDB-1986
          Hide
          kxepal Alexander Shorin added a comment -

          Hi Oriol,

          Thanks for feedback, it's very useful. We're planning to increase (or, actually uncomment this line ) the buffers for upcoming 1.6.0 release.

          We suspect the issue in code because the issue suddenly been solved for OSX (but only for) after applying multipart parser improvements. But I feel that's the question for deeper debugging.

          However, the older releases are also affected to this issue (slowing down replication of large docs/atts), so uncommenting this line is the most portable and universal solution for speedup process for all.

          Show
          kxepal Alexander Shorin added a comment - Hi Oriol , Thanks for feedback, it's very useful. We're planning to increase (or, actually uncomment this line ) the buffers for upcoming 1.6.0 release. We suspect the issue in code because the issue suddenly been solved for OSX (but only for) after applying multipart parser improvements. But I feel that's the question for deeper debugging. However, the older releases are also affected to this issue (slowing down replication of large docs/atts), so uncommenting this line is the most portable and universal solution for speedup process for all.
          Hide
          olp Oriol added a comment - - edited

          Hello,

          We encountered a similar issue while uploading files to couchdb. For us, what made it really slow is the fact that we have couchdb behind an nginx proxy, which means that all connections go through the 'loopback' interface.

          The issue was due to the MTU of the 'loopback' interface in linux (MTU:16436) that it is set much bigger than on a physical ethernet nic (MTU:1500). This has a consequence on the MSS on the TCP layer which in turn has a consequence on the amount of data that couchdb receives.

          When sending a big file, all TCP packets are loaded with maximum payload up until the MSS size, so more data is sent without any ACK from the server. That means that the receiver buffer will fill up quicker on a 'loopback' interface than it does on a physical one. The default socket parameters set in couchdb prevent couchdb to fetch data quick enough from the TCP layer ending up on a 'TCP window full' situation. Ultimately, the connection is throttling at TCP level making it really slow.

          Increasing the 'recv_buffer' size to 131072 prevents the TCP layer to throttle and the file upload went, in our case, from 90 sec down to 0.7 secs.

          Reducing the MTU size of the interface should also work though we haven't check that.

          I don't really know if it is related to your specific issue but it may not be really onto couchdb code but more on the way the connection is made at a TCP level.

          Cheers, Oriol

          Show
          olp Oriol added a comment - - edited Hello, We encountered a similar issue while uploading files to couchdb. For us, what made it really slow is the fact that we have couchdb behind an nginx proxy, which means that all connections go through the 'loopback' interface. The issue was due to the MTU of the 'loopback' interface in linux (MTU:16436) that it is set much bigger than on a physical ethernet nic (MTU:1500). This has a consequence on the MSS on the TCP layer which in turn has a consequence on the amount of data that couchdb receives. When sending a big file, all TCP packets are loaded with maximum payload up until the MSS size, so more data is sent without any ACK from the server. That means that the receiver buffer will fill up quicker on a 'loopback' interface than it does on a physical one. The default socket parameters set in couchdb prevent couchdb to fetch data quick enough from the TCP layer ending up on a 'TCP window full' situation. Ultimately, the connection is throttling at TCP level making it really slow. Increasing the 'recv_buffer' size to 131072 prevents the TCP layer to throttle and the file upload went, in our case, from 90 sec down to 0.7 secs. Reducing the MTU size of the interface should also work though we haven't check that. I don't really know if it is related to your specific issue but it may not be really onto couchdb code but more on the way the connection is made at a TCP level. Cheers, Oriol
          Hide
          nicknorth Nick North added a comment -

          Sorry - I was not very clear. The problem was caused by me when adding my new code, and I fixed it before committing, so there's not much to point at now. But, in essence, the problem was with couch_doc:mp_parse_atts - the body_end clause needed to change from:

          mp_parse_atts(body_end) ->
          fun mp_parse_atts/1.

          to:

          mp_parse_atts(body_end) ->
          receive

          {get_bytes, Ref, From}

          ->
          From !

          {body_end, Ref, <<>>}

          end,
          fun mp_parse_atts/1.

          to push the end-of-body information through. The change is necessary because, when attachment lengths become optional, you have to change from "parse the number of bytes specified as the attachment length" to "parse until you hit the end of the attachment, and then notify everyone that you have finished". The interesting thing is that the bad code sometimes worked, and sometimes didn't, depending on whether other circumstances caused a hung process to be killed (I think) - the actual document loading work had all been done, but the parser had not cleared up after itself, so kept a socket alive somewhere until the process died.

          That's makes me wonder if something similar is happening here - the replication worked has finished, but the system is waiting for a "finished parsing" indicator that never turns up. If the same sort of thing is happening with the replicator, it would be around the couch_replicator_api_wrap:mp_parse_mixed/1 function and the process it's communicating with. but that's as far as I have got so far.

          Show
          nicknorth Nick North added a comment - Sorry - I was not very clear. The problem was caused by me when adding my new code, and I fixed it before committing, so there's not much to point at now. But, in essence, the problem was with couch_doc:mp_parse_atts - the body_end clause needed to change from: mp_parse_atts(body_end) -> fun mp_parse_atts/1. to: mp_parse_atts(body_end) -> receive {get_bytes, Ref, From} -> From ! {body_end, Ref, <<>>} end, fun mp_parse_atts/1. to push the end-of-body information through. The change is necessary because, when attachment lengths become optional, you have to change from "parse the number of bytes specified as the attachment length" to "parse until you hit the end of the attachment, and then notify everyone that you have finished". The interesting thing is that the bad code sometimes worked, and sometimes didn't, depending on whether other circumstances caused a hung process to be killed (I think) - the actual document loading work had all been done, but the parser had not cleared up after itself, so kept a socket alive somewhere until the process died. That's makes me wonder if something similar is happening here - the replication worked has finished, but the system is waiting for a "finished parsing" indicator that never turns up. If the same sort of thing is happening with the replicator, it would be around the couch_replicator_api_wrap:mp_parse_mixed/1 function and the process it's communicating with. but that's as far as I have got so far.
          Hide
          benoitc Benoit Chesneau added a comment -

          confirming the behaviour on fbsd 10.

          Nick North can you point the part of the code where you saw the issue during your tests? That can help.

          Show
          benoitc Benoit Chesneau added a comment - confirming the behaviour on fbsd 10. Nick North can you point the part of the code where you saw the issue during your tests? That can help.
          Hide
          nicknorth Nick North added a comment -

          That's a possibility: some bits of parsing look for a specific number of
          bytes, while others read until an "end" token. While working on JIRA 1956 I
          encountered behaviour rather like the large attachments issue when I mixed
          the two up and the parser hung, waiting for an end token that never
          arrived. MP parsing in the replicator partly does its own thing, and partly
          uses the general MP code, so this sort of problem may have crept in. I'll
          take a look at the replicator code later, and see if anything jumps out.

          Nick

          Show
          nicknorth Nick North added a comment - That's a possibility: some bits of parsing look for a specific number of bytes, while others read until an "end" token. While working on JIRA 1956 I encountered behaviour rather like the large attachments issue when I mixed the two up and the parser hung, waiting for an end token that never arrived. MP parsing in the replicator partly does its own thing, and partly uses the general MP code, so this sort of problem may have crept in. I'll take a look at the replicator code later, and see if anything jumps out. Nick
          Hide
          benoitc Benoit Chesneau added a comment - - edited

          Alexander Shorin thanks for the branch. It helps. I am creating some vms to test on freebsd right now.

          loud thinking but `RECBUF` means that the socket can accept more data before asking for more on the remote.

          Like Alexander Shorin said on irc we are on localhost, so latency errors can likely happen. Which let me wonder if we have an error in the MP parser, incorrectly asking for more data at the end. (issue with the content-lenght?) . Not sure how to test it right now.

          Show
          benoitc Benoit Chesneau added a comment - - edited Alexander Shorin thanks for the branch. It helps. I am creating some vms to test on freebsd right now. loud thinking but `RECBUF` means that the socket can accept more data before asking for more on the remote. Like Alexander Shorin said on irc we are on localhost, so latency errors can likely happen. Which let me wonder if we have an error in the MP parser, incorrectly asking for more data at the end. (issue with the content-lenght?) . Not sure how to test it right now.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 7365f57714168e4fbfb603e51b95d201f7cce787 in couchdb's branch refs/heads/1986-recbuf-fix from Alexander Shorin
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=7365f57 ]

          Increase Mochiweb RECBUF up to 32768 bytes

          Magically solves slow big documents uploading issue
          for FreeBSD 10 and OSX 10.9.2

          COUCHDB-1986

          Show
          jira-bot ASF subversion and git services added a comment - Commit 7365f57714168e4fbfb603e51b95d201f7cce787 in couchdb's branch refs/heads/1986-recbuf-fix from Alexander Shorin [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=7365f57 ] Increase Mochiweb RECBUF up to 32768 bytes Magically solves slow big documents uploading issue for FreeBSD 10 and OSX 10.9.2 COUCHDB-1986
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 514339092e710482c6c8d4ec51c961718bc6bdb1 in couchdb's branch refs/heads/1986-recbuf-fix from Alexander Shorin
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=5143390 ]

          Increase Mochiweb RECBUF up to 32768 bytes

          Magically solves slow big documents uploading issue
          for FreeBSD 10 and OSX 10.9.2

          COUCHDB-1986

          Show
          jira-bot ASF subversion and git services added a comment - Commit 514339092e710482c6c8d4ec51c961718bc6bdb1 in couchdb's branch refs/heads/1986-recbuf-fix from Alexander Shorin [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=5143390 ] Increase Mochiweb RECBUF up to 32768 bytes Magically solves slow big documents uploading issue for FreeBSD 10 and OSX 10.9.2 COUCHDB-1986
          Hide
          benoitc Benoit Chesneau added a comment -

          seriously these reports mean nothing. Can we have a proper branch to test with all the patches applied?

          Show
          benoitc Benoit Chesneau added a comment - seriously these reports mean nothing. Can we have a proper branch to test with all the patches applied?
          Hide
          andywenk Andy Wenk added a comment -

          Dave - good news from my side

          All tests successful.
          Files=50, Tests=1190, 239 wallclock secs ( 0.54 usr 0.11 sys + 72.14 cusr 19.56 csys = 92.35 CPU)
          Result: PASS

          +1

          Show
          andywenk Andy Wenk added a comment - Dave - good news from my side All tests successful. Files=50, Tests=1190, 239 wallclock secs ( 0.54 usr 0.11 sys + 72.14 cusr 19.56 csys = 92.35 CPU) Result: PASS +1
          Hide
          dch Dave Cottlehuber added a comment -

          And rolled a tarball off git master + this patch, http://people.apache.org/~dch/snapshots/couchdb/20140318/ if anybody wants to give it a try.

          Show
          dch Dave Cottlehuber added a comment - And rolled a tarball off git master + this patch, http://people.apache.org/~dch/snapshots/couchdb/20140318/ if anybody wants to give it a try.
          Hide
          dch Dave Cottlehuber added a comment -

          master + with the RECBUF hack,

          04-replication-large-atts.t test results:

          FreeBSD 9.2 amd64, R15b03-1

          • without buffer hack: FAIL
          • with the buffer hack: OK
            FreeBSD 10.0 amd64, R15b03-1
          • without buffer hack: FAIL
          • with the buffer hack: OK

          I've had a few segfaults with R16B03-1 so can't report reliably there.

          Show
          dch Dave Cottlehuber added a comment - master + with the RECBUF hack, 04-replication-large-atts.t test results: FreeBSD 9.2 amd64, R15b03-1 without buffer hack: FAIL with the buffer hack: OK FreeBSD 10.0 amd64, R15b03-1 without buffer hack: FAIL with the buffer hack: OK I've had a few segfaults with R16B03-1 so can't report reliably there.
          Hide
          andywenk Andy Wenk added a comment -

          I added the recbuf patch in 1.6.x. The patch is to change

          -define(RECBUF_SIZE, 8192).

          to

          -define(RECBUF_SIZE, 8192 * 4).

          in src/mochiweb/internal.hrl. The result for make check is:

          /Users/andwen/project/couchdb/couchdb-src-clean-1-6/src/couch_replicator/test/04-replication-large-atts.t ... 531/? Bailout called. Further testing stopped: Timeout waiting for replication to finish
          FAILED--Further testing stopped: Timeout waiting for replication to finish
          make[2]: *** [check] Error 127
          make[1]: *** [check-recursive] Error 1
          make: *** [check-recursive] Error 1

          then changed the recbuf patch to src/mochiweb/internal.hrl:

          -define(RECBUF_SIZE, 8192 * 8).

          And all test successfull.

          R16B03-1 (erts-5.10.4), Mac OS X 10.9.2, 16 GB 1600 MHz DDR3, 2.8 GHz Intel Core i7, APPLE SSD SM0512F Media 500MB

          Show
          andywenk Andy Wenk added a comment - I added the recbuf patch in 1.6.x. The patch is to change -define(RECBUF_SIZE, 8192). to -define(RECBUF_SIZE, 8192 * 4). in src/mochiweb/internal.hrl. The result for make check is: /Users/andwen/project/couchdb/couchdb-src-clean-1-6/src/couch_replicator/test/04-replication-large-atts.t ... 531/? Bailout called. Further testing stopped: Timeout waiting for replication to finish FAILED--Further testing stopped: Timeout waiting for replication to finish make [2] : *** [check] Error 127 make [1] : *** [check-recursive] Error 1 make: *** [check-recursive] Error 1 then changed the recbuf patch to src/mochiweb/internal.hrl: -define(RECBUF_SIZE, 8192 * 8). And all test successfull. R16B03-1 (erts-5.10.4), Mac OS X 10.9.2, 16 GB 1600 MHz DDR3, 2.8 GHz Intel Core i7, APPLE SSD SM0512F Media 500MB
          Hide
          andywenk Andy Wenk added a comment -

          Benoit Chesneau Erlang R16B03-1 (erts-5.10.4)

          Show
          andywenk Andy Wenk added a comment - Benoit Chesneau Erlang R16B03-1 (erts-5.10.4)
          Hide
          kxepal Alexander Shorin added a comment -

          Andy Wenk timeout without recbuf patch or with it?

          Show
          kxepal Alexander Shorin added a comment - Andy Wenk timeout without recbuf patch or with it?
          Hide
          benoitc Benoit Chesneau added a comment -

          Odd,

          I can confirm that all etst spass here. Also on travis-ci: https://travis-ci.org/refuge/rcouch

          It's looks like disk related issue or something. Also what is your erlang version?

          Show
          benoitc Benoit Chesneau added a comment - Odd, I can confirm that all etst spass here. Also on travis-ci: https://travis-ci.org/refuge/rcouch It's looks like disk related issue or something. Also what is your erlang version?
          Hide
          andywenk Andy Wenk added a comment -

          Alexander Shorin result for 1.6 make check:

          Users/andwen/project/couchdb/couchdb-src-clean/src/couch_replicator/test/04-replication-large-atts.t ... 477/? Bailout called. Further testing stopped: Timeout waiting for replication to finish
          FAILED--Further testing stopped: Timeout waiting for replication to finish
          make[2]: *** [check] Error 127
          make[1]: *** [check-recursive] Error 1
          make: *** [check-recursive] Error 1

          Benoit Chesneau make test -> all erlang tests pass.
          make check:

          ....
          erlang_views.js ... fail
          javascript traceback:

          FAIL
          Reason: The database could not be created, the file already exists.
          Trace back (most recent call first):

          508: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/couch.js
          CouchError([object Object])
          471: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/couch.js
          ([object CouchHTTP])
          33: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/couch.js
          ()
          100: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/test/erlang_views.js
          ()
          378: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/couch_test_runner.js
          run_on_modified_server([object Array],(function () {var doc = {_id: "1
          25: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/test/erlang_views.js
          ()
          36: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/javascript/cli_runner.js
          runTest()
          47: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/javascript/cli_runner.js
          ....
          view_pagination.js ... ok
          view_sandboxing.js ... ok
          view_update_seq.js ... ok

          1/87 tests failed
          make: *** [testjs] Error 1

          I guess this is just because the db already exists?

          Show
          andywenk Andy Wenk added a comment - Alexander Shorin result for 1.6 make check: Users/andwen/project/couchdb/couchdb-src-clean/src/couch_replicator/test/04-replication-large-atts.t ... 477/? Bailout called. Further testing stopped: Timeout waiting for replication to finish FAILED--Further testing stopped: Timeout waiting for replication to finish make [2] : *** [check] Error 127 make [1] : *** [check-recursive] Error 1 make: *** [check-recursive] Error 1 Benoit Chesneau make test -> all erlang tests pass. make check: .... erlang_views.js ... fail javascript traceback: FAIL Reason: The database could not be created, the file already exists. Trace back (most recent call first): 508: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/couch.js CouchError( [object Object] ) 471: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/couch.js ( [object CouchHTTP] ) 33: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/couch.js () 100: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/test/erlang_views.js () 378: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/couch_test_runner.js run_on_modified_server( [object Array] ,(function () {var doc = {_id: "1 25: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/out/share/www/script/test/erlang_views.js () 36: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/javascript/cli_runner.js runTest() 47: /Users/andwen/project/couchdb/couchdb-src-clean/test/etap/../../test/javascript/cli_runner.js .... view_pagination.js ... ok view_sandboxing.js ... ok view_update_seq.js ... ok 1/87 tests failed make: *** [testjs] Error 1 I guess this is just because the db already exists?
          Hide
          benoitc Benoit Chesneau added a comment -

          Andy Wenk can you by chance try on the 1994-merge-rcouch branch? I don't reproduce the error anymore here, but it may be just my machines...

          Show
          benoitc Benoit Chesneau added a comment - Andy Wenk can you by chance try on the 1994-merge-rcouch branch? I don't reproduce the error anymore here, but it may be just my machines...
          Hide
          andywenk Andy Wenk added a comment -

          Jan Lehnardt same problem:

          Test Summary Report
          -------------------
          ./test/etap/200-view-group-no-db-leaks.t (Wstat: 0 Tests: 28 Failed: 2)
          Failed tests: 5, 9
          Files=51, Tests=1214, 253 wallclock secs ( 0.54 usr 0.11 sys + 83.10 cusr 23.09 csys = 106.84 CPU)
          Result: FAIL
          make: *** [check] Error 1

          Show
          andywenk Andy Wenk added a comment - Jan Lehnardt same problem: Test Summary Report ------------------- ./test/etap/200-view-group-no-db-leaks.t (Wstat: 0 Tests: 28 Failed: 2) Failed tests: 5, 9 Files=51, Tests=1214, 253 wallclock secs ( 0.54 usr 0.11 sys + 83.10 cusr 23.09 csys = 106.84 CPU) Result: FAIL make: *** [check] Error 1
          Hide
          andywenk Andy Wenk added a comment -

          Alexander Shorin remotes/origin/1.6.x ?

          Jan Lehnardt will do and report

          Show
          andywenk Andy Wenk added a comment - Alexander Shorin remotes/origin/1.6.x ? Jan Lehnardt will do and report
          Hide
          janl Jan Lehnardt added a comment -

          Andy Wenk can you re-run this? The 200 test is also one of the ones that fails spuriously, albeit a lot less often.

          Show
          janl Jan Lehnardt added a comment - Andy Wenk can you re-run this? The 200 test is also one of the ones that fails spuriously, albeit a lot less often.
          Hide
          kxepal Alexander Shorin added a comment -

          Andy Wenk what about 1.6.x branch?

          Show
          kxepal Alexander Shorin added a comment - Andy Wenk what about 1.6.x branch?
          Hide
          andywenk Andy Wenk added a comment -

          I tested against master 5989bb324a and all tests pass

          I tested against 1986-fix-ibrowse-infinite-async-timeout b35884580 and the result is:

          Test Summary Report
          -------------------
          ./test/etap/200-view-group-no-db-leaks.t (Wstat: 0 Tests: 28 Failed: 1)
          Failed test: 9
          Files=51, Tests=1214, 299 wallclock secs ( 0.76 usr 0.15 sys + 115.80 cusr 30.51 csys = 147.22 CPU)
          Result: FAIL
          make: *** [check] Error 1

          Show
          andywenk Andy Wenk added a comment - I tested against master 5989bb324a and all tests pass I tested against 1986-fix-ibrowse-infinite-async-timeout b35884580 and the result is: Test Summary Report ------------------- ./test/etap/200-view-group-no-db-leaks.t (Wstat: 0 Tests: 28 Failed: 1) Failed test: 9 Files=51, Tests=1214, 299 wallclock secs ( 0.76 usr 0.15 sys + 115.80 cusr 30.51 csys = 147.22 CPU) Result: FAIL make: *** [check] Error 1
          Hide
          kxepal Alexander Shorin added a comment -

          Adam Kocoloski yes, there is the trace. I had to set retry and workers amount to 1 to reduce the output. Here it is:
          https://www.friendpaste.com/7IH6C4z31Lu5ZvCpz4rU1r

          Show
          kxepal Alexander Shorin added a comment - Adam Kocoloski yes, there is the trace. I had to set retry and workers amount to 1 to reduce the output. Here it is: https://www.friendpaste.com/7IH6C4z31Lu5ZvCpz4rU1r
          Hide
          kocolosk Adam Kocoloski added a comment -

          Alexander Shorin the stacktrace would be very useful if you have it. I checked to see if that binary is a string – it's not.

          Show
          kocolosk Adam Kocoloski added a comment - Alexander Shorin the stacktrace would be very useful if you have it. I checked to see if that binary is a string – it's not.
          Hide
          nicknorth Nick North added a comment -

          That's definitely interesting to know - many thanks. I'll go away and work out what's going on there. It's clearly not as ready for merging as I thought.

          Show
          nicknorth Nick North added a comment - That's definitely interesting to know - many thanks. I'll go away and work out what's going on there. It's clearly not as ready for merging as I thought.
          Hide
          kxepal Alexander Shorin added a comment -

          Nick North have tested with 1956-attachment-handling branch. It fails with the error:

          [info] [<0.238.0>] Retrying GET to http://127.0.0.1:17672/couch_test_rep_db_a/doc8?revs=true&open_revs=%5B%221-c02137a5611f322918a182dd598e0e75%22%5D&latest=true in 1.0 seconds due to error {exit_error,{case_clause,<<31,139,8,0,0,0,0,0,0,3,0,9,64,246,191,186,84,230,221,150,241,153,231,58,150,239,92,232,199,177,94,24,223,17,57,22,55,106,142,61,18,223,104,248,120,196,56,48,18,140,22,40,42,214,197,202,41,82,56,159,221,197,194,174,12,107,94,83,123,150,226,81,19,74,69,104,35,207,184,145,79,84,91,68,109,45,7,157,133,23,248,164,209,162,184,120,162,109,183,20,48,92,243,241,0,42,176,140,202,179,146,80,215,13,151,109,146,240,125,193,236,183,185,38,116,249,207,254,8,136,24,45,237,199,231,75,98,117,18,161,140,194,250,113,170,185,44,24,66,203,247,245,174,199,93,45,156,11,212,8,15,141,185,86,167,53,123,129,235,176,151,147,122,185,220,162,238,158,157,108,252,112,89,231,25,25,144,153,0,238,220,222,52,78,222,187,194,104,81,224,3,130,152,90,224,136,107,132,202,236,208,37,2,201,216,55,106,232,101,74,89,219,34,207,40,123,57,141,235,196,15,101,111,25,6,191,188,106,188,214,14,4,13,199,65,239,171,214,105,130,85,237,73,223,188,73,189,226,28,60,18,119,220,158,162,33,7,198,77,99,213,250,172,92,83,39,205,165,243,53,142,56,9,170,65,48,127,146,32,149,157,128,121,153,217,156,46,133,81,92,244,95,19,215,9,121,28,65,92,216,34,130,165,89,116,10,26,47,130,89,170,72,44,108,26,69,76,229,134,78,238,243,113,108,178,134,137,243,170,78,76,127,172,29,190,144,160,134,72,103,133,145,82,66,162,6,156,127,153,149,188,114,70,63,107,43,114,114,230,142,81,8,246,55,61,246,60,229,249,94,71,246,168,248,41,180,44,217,116,25,246,233,102,246,172,249,173,70,72,47,43,98,57,185,173,235,240,105,85,98,110,20,153,125,220,100,179,79,181,219,93,59,23,181,57,242,196,237,250,52,55,157,16,112,4,67,35,64,84,152,31,223,112,61,202,174,189,216,202,197,221,203,186,162,243,253,164,177,146,251,12,22,179,250,13,159,66,186,72,215,54,21,142,210,32,14,219,137,10,198,149,82,167,212,55,131,176,186,50,115,188,59,245,106,28,187,71,249,57,31,174,212,246,141,220,225,247,162,22,6,38,57,39,67,144,126,31,10,87,250,129,58,193,4,148,24,239,179,70,76,157,49,193,64,139,129,29,182,205,26,150,150,105,206,8,14,98,45,181,4,176,224,98,166,252,230,44,210,143,1,133,6,107,0,27,6,221,138,88,210,56,165,23,23,189,18,109,144,188,235,79,219,192,176,2,142,59,83,183,122,213,125,72,187,35,253,231,202,241,140,19,236,70,246,182,242,203,223,42,126,93,23,196,39,4,25,155,235,233,209,174,93,112,54,56,175,213,47,213,2>>}}
          

          and the request repeats

          Show
          kxepal Alexander Shorin added a comment - Nick North have tested with 1956-attachment-handling branch. It fails with the error: [info] [<0.238.0>] Retrying GET to http: //127.0.0.1:17672/couch_test_rep_db_a/doc8?revs= true &open_revs=%5B%221-c02137a5611f322918a182dd598e0e75%22%5D&latest= true in 1.0 seconds due to error {exit_error,{case_clause,<<31,139,8,0,0,0,0,0,0,3,0,9,64,246,191,186,84,230,221,150,241,153,231,58,150,239,92,232,199,177,94,24,223,17,57,22,55,106,142,61,18,223,104,248,120,196,56,48,18,140,22,40,42,214,197,202,41,82,56,159,221,197,194,174,12,107,94,83,123,150,226,81,19,74,69,104,35,207,184,145,79,84,91,68,109,45,7,157,133,23,248,164,209,162,184,120,162,109,183,20,48,92,243,241,0,42,176,140,202,179,146,80,215,13,151,109,146,240,125,193,236,183,185,38,116,249,207,254,8,136,24,45,237,199,231,75,98,117,18,161,140,194,250,113,170,185,44,24,66,203,247,245,174,199,93,45,156,11,212,8,15,141,185,86,167,53,123,129,235,176,151,147,122,185,220,162,238,158,157,108,252,112,89,231,25,25,144,153,0,238,220,222,52,78,222,187,194,104,81,224,3,130,152,90,224,136,107,132,202,236,208,37,2,201,216,55,106,232,101,74,89,219,34,207,40,123,57,141,235,196,15,101,111,25,6,191,188,106,188,214,14,4,13,199,65,239,171,214,105,130,85,237,73,223,188,73,189,226,28,60,18,119,220,158,162,33,7,198,77,99,213,250,172,92,83,39,205,165,243,53,142,56,9,170,65,48,127,146,32,149,157,128,121,153,217,156,46,133,81,92,244,95,19,215,9,121,28,65,92,216,34,130,165,89,116,10,26,47,130,89,170,72,44,108,26,69,76,229,134,78,238,243,113,108,178,134,137,243,170,78,76,127,172,29,190,144,160,134,72,103,133,145,82,66,162,6,156,127,153,149,188,114,70,63,107,43,114,114,230,142,81,8,246,55,61,246,60,229,249,94,71,246,168,248,41,180,44,217,116,25,246,233,102,246,172,249,173,70,72,47,43,98,57,185,173,235,240,105,85,98,110,20,153,125,220,100,179,79,181,219,93,59,23,181,57,242,196,237,250,52,55,157,16,112,4,67,35,64,84,152,31,223,112,61,202,174,189,216,202,197,221,203,186,162,243,253,164,177,146,251,12,22,179,250,13,159,66,186,72,215,54,21,142,210,32,14,219,137,10,198,149,82,167,212,55,131,176,186,50,115,188,59,245,106,28,187,71,249,57,31,174,212,246,141,220,225,247,162,22,6,38,57,39,67,144,126,31,10,87,250,129,58,193,4,148,24,239,179,70,76,157,49,193,64,139,129,29,182,205,26,150,150,105,206,8,14,98,45,181,4,176,224,98,166,252,230,44,210,143,1,133,6,107,0,27,6,221,138,88,210,56,165,23,23,189,18,109,144,188,235,79,219,192,176,2,142,59,83,183,122,213,125,72,187,35,253,231,202,241,140,19,236,70,246,182,242,203,223,42,126,93,23,196,39,4,25,155,235,233,209,174,93,112,54,56,175,213,47,213,2>>}} and the request repeats
          Hide
          nicknorth Nick North added a comment -

          It would be interesting to try with the 1956-attachment-handling branch applied. This uses mochiweb's chunked input handler. There's no obvious reason for it to be better, but then there's no obvious reason why it is currently bad either. (And it might even make it worse, which would also be useful to know.)

          Show
          nicknorth Nick North added a comment - It would be interesting to try with the 1956-attachment-handling branch applied. This uses mochiweb's chunked input handler. There's no obvious reason for it to be better, but then there's no obvious reason why it is currently bad either. (And it might even make it worse, which would also be useful to know.)
          Hide
          kxepal Alexander Shorin added a comment -

          Benoit Chesneau on master's HEAD (sorry, lost my note about actually commit, but it's not older than 5 days ago) and with Nick's patch. Same picture I have for 1.6.x branch, so actually it doesn't really matters.

          Summarizing: set recbuf to a minimun 32768 to let the issue been solved. Currently it's 8192 by default. So factor is 4. But it's workaround.

          Nick North yes, I noticed that mochiweb reads by 4096 chunks from socket while ibrowse send 8192 sized chunks. If you take a look on logs against different recbuf factor, you'll notice the pattern how timings are changing over time. I left there only 1 replication worker to produce quite solid picture.

          Show
          kxepal Alexander Shorin added a comment - Benoit Chesneau on master's HEAD (sorry, lost my note about actually commit, but it's not older than 5 days ago) and with Nick's patch. Same picture I have for 1.6.x branch, so actually it doesn't really matters. Summarizing: set recbuf to a minimun 32768 to let the issue been solved. Currently it's 8192 by default. So factor is 4. But it's workaround. Nick North yes, I noticed that mochiweb reads by 4096 chunks from socket while ibrowse send 8192 sized chunks. If you take a look on logs against different recbuf factor, you'll notice the pattern how timings are changing over time. I left there only 1 replication worker to produce quite solid picture.
          Hide
          nicknorth Nick North added a comment -

          It may have absolutely no bearing on the problem, but there are a few places in the attachment handling code that specify fixed or default chunk sizes as well.

          couch_httpd_db:receive_request_data/2 receives data in chunks up 4096.
          couch_db:flush_att/2 uses attachment_stream_buffer_size from the config file, defaulting to 4096.
          couch_db:write_streamed_attachment/3 writes in chunks of 16#2000.

          I don't know if any of these interact badly with RECBUF value, but I've stumbled across them in work on issue 1956, so mention them in case they are important.

          Show
          nicknorth Nick North added a comment - It may have absolutely no bearing on the problem, but there are a few places in the attachment handling code that specify fixed or default chunk sizes as well. couch_httpd_db:receive_request_data/2 receives data in chunks up 4096. couch_db:flush_att/2 uses attachment_stream_buffer_size from the config file, defaulting to 4096. couch_db:write_streamed_attachment/3 writes in chunks of 16#2000. I don't know if any of these interact badly with RECBUF value, but I've stumbled across them in work on issue 1956, so mention them in case they are important.
          Hide
          benoitc Benoit Chesneau added a comment -

          also thanks for the tests

          Show
          benoitc Benoit Chesneau added a comment - also thanks for the tests
          Hide
          benoitc Benoit Chesneau added a comment -

          On which branch are you running the tests?

          can you summarise the results of "factors" ? Also which one make it works?

          Show
          benoitc Benoit Chesneau added a comment - On which branch are you running the tests? can you summarise the results of "factors" ? Also which one make it works?
          Hide
          kxepal Alexander Shorin added a comment - - edited

          Ok, here I got some numbers for the update. Long post.

          • FreeBSD 9.1 with Erlang R15B02, clean install (vbox, 512 MB, 2 CPU)
          • FreeBSD 10 with Erlang R16B02, clean install (vbox, 512 MB, 2 CPU)

          I ran 04-replication-large-atts.t test on all of them:

          $ ./test/etap/run src/couch_replicator/test/04-replication-large-atts.t
          

          And there is the results for the master's HEAD:

          • FreeBSD 9.1: Files=1, Tests=1192, 47 wallclock secs ( 0.19 usr 0.20 sys + 17.39 cusr 18.25 csys = 36.03 CPU)
          • FreeBSD 10: Bailout called. Further testing stopped: Timeout waiting for replication to finish

          Not the news at all.

          Then, I applied small patch to measure the response time:

          diff --git a/src/couch_replicator/src/couch_replicator_httpc.erl b/src/couch_replicator/src/couch_replicator_httpc.erl
          index b8fb31b..14f31f8 100644
          --- a/src/couch_replicator/src/couch_replicator_httpc.erl
          +++ b/src/couch_replicator/src/couch_replicator_httpc.erl
          @@ -61,8 +61,9 @@ send_ibrowse_req(#httpdb{headers = BaseHeaders} = HttpDb, Params) ->
                   lists:ukeymerge(1, get_value(ibrowse_options, Params, []),
                       HttpDb#httpdb.ibrowse_options)
               ],
          -    Response = ibrowse:send_req_direct(
          -        Worker, Url, Headers2, Method, Body, IbrowseOptions, infinity),
          +    {Delay, Response} = timer:tc(ibrowse, send_req_direct,
          +        [Worker, Url, Headers2, Method, Body, IbrowseOptions, infinity]),
          +    ?LOG_INFO("Made request ~p ~p in ~p secs ~n", [Method, Url, Delay / 1000000]),
               {Worker, Response}.
          

          Since the most heavies requests are attachments uploading, let's filter the output:

          $ ./test/etap/run src/couch_replicator/test/04-replication-large-atts.t | egrep "Made request (post|put)"
          

          FreeBSD 9.1:

          [info] [<0.345.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_a/_ensure_full_commit" in 0.101148 secs 
          [info] [<0.208.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_a/_local/ed1e1b8b6cc2aea9eaeb935affe17fdd" in 0.194595 secs 
          [info] [<0.351.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_a/_ensure_full_commit" in 0.115996 secs 
          [info] [<0.208.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_a/_local/ed1e1b8b6cc2aea9eaeb935affe17fdd" in 0.19801 secs 
          [info] [<0.376.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.198322 secs 
          [info] [<0.379.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.196964 secs 
          [info] [<0.378.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.197843 secs 
          [info] [<0.377.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.198358 secs 
          [info] [<0.377.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc8?new_edits=false" in 0.995802 secs 
          [info] [<0.378.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc7?new_edits=false" in 0.996697 secs 
          [info] [<0.376.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc9?new_edits=false" in 0.998658 secs 
          [info] [<0.379.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc6?new_edits=false" in 0.998032 secs 
          [info] [<0.377.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.198479 secs 
          [info] [<0.377.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc5?new_edits=false" in 0.328846 secs 
          [info] [<0.377.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc4?new_edits=false" in 0.329292 secs 
          [info] [<0.377.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc3?new_edits=false" in 0.339167 secs 
          [info] [<0.377.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc2?new_edits=false" in 0.329193 secs 
          [info] [<0.377.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc11?new_edits=false" in 0.329162 secs 
          [info] [<0.377.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc10?new_edits=false" in 0.329075 secs 
          [info] [<0.377.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc1?new_edits=false" in 0.329311 secs 
          [info] [<0.368.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_ensure_full_commit" in 0.099283 secs 
          [info] [<0.368.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/_local/76054f0aaccade46a4806b08a211bc16" in 0.198862 secs 
          [info] [<0.462.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.199692 secs 
          [info] [<0.466.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.198908 secs 
          [info] [<0.464.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.199588 secs 
          [info] [<0.468.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.199018 secs 
          [info] [<0.500.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc9?new_edits=false" in 2.522225 secs 
          [info] [<0.510.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc6?new_edits=false" in 2.514634 secs 
          [info] [<0.506.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc7?new_edits=false" in 2.51673 secs 
          [info] [<0.505.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc8?new_edits=false" in 2.517766 secs 
          [info] [<0.462.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.197848 secs 
          [info] [<0.616.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_a/_ensure_full_commit" in 0.108526 secs 
          [info] [<0.450.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_ensure_full_commit" in 0.107361 secs 
          [info] [<0.450.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_a/_local/e8009d39fc94c9267bf51e9863c46895" in 0.202483 secs 
          [info] [<0.450.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/_local/e8009d39fc94c9267bf51e9863c46895" in 0.197966 secs 
          [info] [<0.570.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc4?new_edits=false" in 4.332465 secs 
          [info] [<0.567.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc5?new_edits=false" in 4.33544 secs 
          [info] [<0.578.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc2?new_edits=false" in 4.335835 secs 
          [info] [<0.574.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc3?new_edits=false" in 4.33976 secs 
          [info] [<0.588.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc1?new_edits=false" in 4.328189 secs 
          [info] [<0.585.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc10?new_edits=false" in 4.329808 secs 
          [info] [<0.582.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/doc11?new_edits=false" in 4.331474 secs 
          [info] [<0.625.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_a/_ensure_full_commit" in 0.099479 secs 
          [info] [<0.450.0>] Made request post "http://127.0.0.1:64967/couch_test_rep_db_b/_ensure_full_commit" in 0.099653 secs 
          [info] [<0.450.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_a/_local/e8009d39fc94c9267bf51e9863c46895" in 0.198995 secs 
          [info] [<0.450.0>] Made request put "http://127.0.0.1:64967/couch_test_rep_db_b/_local/e8009d39fc94c9267bf51e9863c46895" in 0.199502 secs 
          
          

          Nothing looks criminal.

          FreeBSD 10:

          # Creating target database
          # Triggering replication
          [info] [<0.309.0>] 127.0.0.1 - - HEAD /couch_test_rep_db_b/ 200
          [info] [<0.370.0>] Made request head "http://127.0.0.1:11166/couch_test_rep_db_b/" in 0.004577 secs 
          [info] [<0.309.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/ 200
          [info] [<0.370.0>] Made request get "http://127.0.0.1:11166/couch_test_rep_db_b/" in 0.100691 secs 
          [info] [<0.309.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/2eaa8e3e950d010d82b20eff81e9f651 404
          [info] [<0.370.0>] Made request get "http://127.0.0.1:11166/couch_test_rep_db_b/_local/2eaa8e3e950d010d82b20eff81e9f651" in 0.098708 secs 
          [info] [<0.309.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/38bbd51d509064584b986ecfc71e443a 404
          [info] [<0.370.0>] Made request get "http://127.0.0.1:11166/couch_test_rep_db_b/_local/38bbd51d509064584b986ecfc71e443a" in 0.099318 secs 
          [info] [<0.309.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/eaecd74f7be532d2b82aa4c137cc0b11 404
          [info] [<0.370.0>] Made request get "http://127.0.0.1:11166/couch_test_rep_db_b/_local/eaecd74f7be532d2b82aa4c137cc0b11" in 0.099245 secs 
          [info] [<0.370.0>] Replication `"2eaa8e3e950d010d82b20eff81e9f651"` is using:
          	4 worker processes
          	a worker batch size of 500
          	20 HTTP connections
          	a connection timeout of 30000 milliseconds
          	10 retries per request
          	socket options are: [{keepalive,true},{nodelay,false}]
          [info] [<0.2.0>] starting new replication `2eaa8e3e950d010d82b20eff81e9f651` at <0.370.0> (`couch_test_rep_db_a` -> `http://127.0.0.1:11166/couch_test_rep_db_b/`)
          [info] [<0.309.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
          [info] [<0.307.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
          [info] [<0.310.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
          [info] [<0.302.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
          [info] [<0.378.0>] Made request post "http://127.0.0.1:11166/couch_test_rep_db_b/_revs_diff" in 0.198793 secs 
          [info] [<0.380.0>] Made request post "http://127.0.0.1:11166/couch_test_rep_db_b/_revs_diff" in 0.200831 secs 
          [info] [<0.379.0>] Made request post "http://127.0.0.1:11166/couch_test_rep_db_b/_revs_diff" in 0.20114 secs 
          [info] [<0.381.0>] Made request post "http://127.0.0.1:11166/couch_test_rep_db_b/_revs_diff" in 0.20127 secs 
          [info] [<0.310.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc8?new_edits=false 201
          [info] [<0.309.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc9?new_edits=false 201
          [info] [<0.302.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc6?new_edits=false 201
          [info] [<0.307.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc7?new_edits=false 201
          [info] [<0.378.0>] Made request put "http://127.0.0.1:11166/couch_test_rep_db_b/doc9?new_edits=false" in 116.229091 secs 
          [info] [<0.379.0>] Made request put "http://127.0.0.1:11166/couch_test_rep_db_b/doc8?new_edits=false" in 116.223251 secs 
          [info] [<0.381.0>] Made request put "http://127.0.0.1:11166/couch_test_rep_db_b/doc6?new_edits=false" in 116.226021 secs 
          [info] [<0.380.0>] Made request put "http://127.0.0.1:11166/couch_test_rep_db_b/doc7?new_edits=false" in 116.227487 secs 
          [info] [<0.310.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
          [info] [<0.378.0>] Made request post "http://127.0.0.1:11166/couch_test_rep_db_b/_revs_diff" in 0.198398 secs 
          [info] [<0.307.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_ensure_full_commit 201
          [info] [<0.370.0>] Made request post "http://127.0.0.1:11166/couch_test_rep_db_b/_ensure_full_commit" in 0.099793 secs 
          [info] [<0.370.0>] recording a checkpoint for `couch_test_rep_db_a` -> `http://127.0.0.1:11166/couch_test_rep_db_b/` at source update_seq 3
          [info] [<0.307.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/_local/2eaa8e3e950d010d82b20eff81e9f651 201
          [info] [<0.370.0>] Made request put "http://127.0.0.1:11166/couch_test_rep_db_b/_local/2eaa8e3e950d010d82b20eff81e9f651" in 0.20571 secs 
          [info] [<0.310.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc5?new_edits=false 201
          [info] [<0.378.0>] Made request put "http://127.0.0.1:11166/couch_test_rep_db_b/doc5?new_edits=false" in 115.208404 secs 
          Bailout called.  Further testing stopped:  Timeout waiting for replication to finish
          Bail out! Timeout waiting for replication to finish
          FAILED--Further testing stopped: Timeout waiting for replication to finish
          

          Hold on a second: 115 seconds for pushing single doc to localhost. Btw, on 1.6.x branch this time equals to 120 secs so Nicks patch really works, but doesn't actually solves the issue.

          Ok, to make testing more fair let's try use Erlang R15B03 on FreeBSD 10:

          # Creating target database
          # Triggering replication
          [info] [<0.307.0>] 127.0.0.1 - - HEAD /couch_test_rep_db_b/ 200
          [info] [<0.370.0>] Made request head "http://127.0.0.1:51750/couch_test_rep_db_b/" in 0.003978 secs 
          [info] [<0.307.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/ 200
          [info] [<0.370.0>] Made request get "http://127.0.0.1:51750/couch_test_rep_db_b/" in 0.098668 secs 
          [info] [<0.307.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/1fef0ca5eb7ea55b37727c12cbb474b6 404
          [info] [<0.370.0>] Made request get "http://127.0.0.1:51750/couch_test_rep_db_b/_local/1fef0ca5eb7ea55b37727c12cbb474b6" in 0.107729 secs 
          [info] [<0.307.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/0344c3dab39f52994346ee2f286a109c 404
          [info] [<0.370.0>] Made request get "http://127.0.0.1:51750/couch_test_rep_db_b/_local/0344c3dab39f52994346ee2f286a109c" in 0.09909 secs 
          [info] [<0.307.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/2c46e037f893bea49f746b1c89fb073b 404
          [info] [<0.370.0>] Made request get "http://127.0.0.1:51750/couch_test_rep_db_b/_local/2c46e037f893bea49f746b1c89fb073b" in 0.099404 secs 
          [info] [<0.370.0>] Replication `"1fef0ca5eb7ea55b37727c12cbb474b6"` is using:
          	4 worker processes
          	a worker batch size of 500
          	20 HTTP connections
          	a connection timeout of 30000 milliseconds
          	10 retries per request
          	socket options are: [{keepalive,true},{nodelay,false}]
          [info] [<0.2.0>] starting new replication `1fef0ca5eb7ea55b37727c12cbb474b6` at <0.370.0> (`couch_test_rep_db_a` -> `http://127.0.0.1:51750/couch_test_rep_db_b/`)
          [info] [<0.307.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
          [info] [<0.306.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
          [info] [<0.303.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
          [info] [<0.311.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
          [info] [<0.382.0>] Made request post "http://127.0.0.1:51750/couch_test_rep_db_b/_revs_diff" in 0.19943 secs 
          [info] [<0.379.0>] Made request post "http://127.0.0.1:51750/couch_test_rep_db_b/_revs_diff" in 0.201037 secs 
          [info] [<0.380.0>] Made request post "http://127.0.0.1:51750/couch_test_rep_db_b/_revs_diff" in 0.200908 secs 
          
          [info] [<0.381.0>] Made request post "http://127.0.0.1:51750/couch_test_rep_db_b/_revs_diff" in 0.200642 secs 
          [info] [<0.311.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc9?new_edits=false 201
          [info] [<0.303.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc7?new_edits=false 201
          [info] [<0.307.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc8?new_edits=false 201
          [info] [<0.306.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc6?new_edits=false 201
          [info] [<0.379.0>] Made request put "http://127.0.0.1:51750/couch_test_rep_db_b/doc9?new_edits=false" in 115.210925 secs 
          [info] [<0.381.0>] Made request put "http://127.0.0.1:51750/couch_test_rep_db_b/doc7?new_edits=false" in 115.212048 secs 
          [info] [<0.382.0>] Made request put "http://127.0.0.1:51750/couch_test_rep_db_b/doc6?new_edits=false" in 115.21568 secs 
          [info] [<0.380.0>] Made request put "http://127.0.0.1:51750/couch_test_rep_db_b/doc8?new_edits=false" in 115.216761 secs 
          [info] [<0.311.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
          [info] [<0.381.0>] Made request post "http://127.0.0.1:51750/couch_test_rep_db_b/_revs_diff" in 0.198427 secs 
          [info] [<0.307.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_ensure_full_commit 201
          [info] [<0.370.0>] Made request post "http://127.0.0.1:51750/couch_test_rep_db_b/_ensure_full_commit" in 0.107235 secs 
          [info] [<0.370.0>] recording a checkpoint for `couch_test_rep_db_a` -> `http://127.0.0.1:51750/couch_test_rep_db_b/` at source update_seq 2
          [info] [<0.307.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/_local/1fef0ca5eb7ea55b37727c12cbb474b6 201
          [info] [<0.370.0>] Made request put "http://127.0.0.1:51750/couch_test_rep_db_b/_local/1fef0ca5eb7ea55b37727c12cbb474b6" in 0.206531 secs 
          [info] [<0.311.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc5?new_edits=false 201
          [info] [<0.381.0>] Made request put "http://127.0.0.1:51750/couch_test_rep_db_b/doc5?new_edits=false" in 115.208751 secs 
          Bailout called.  Further testing stopped:  Timeout waiting for replication to finish
          Bail out! Timeout waiting for replication to finish
          FAILED--Further testing stopped: Timeout waiting for replication to finish
          

          Still the same picture. Ok, so the problem could be either in FreeBSD 10 default configuration or in our code. Let's assume that it's the latter.

          First, it looks like client (ibrowser) works too slow. On previous debug I found that gen_tcp:send() takes about 200 milliseconds to send chunk with 8192 bytes to server. Not really sure what I tried to find by searching 8192 string in the source code, but I found that exactly the same numbers are used by mochiweb to define read buffer chunk size. So that's the idea: what will happens if I increase it?

          diff --git a/src/mochiweb/internal.hrl b/src/mochiweb/internal.hrl
          index 6db899a..338edca 100644
          --- a/src/mochiweb/internal.hrl
          +++ b/src/mochiweb/internal.hrl
          @@ -1,3 +1,3 @@
           
          --define(RECBUF_SIZE, 8192).
          +-define(RECBUF_SIZE, 8192 * 4).
          
          

          After the patch:

          FreeBSD 9.1

          [info] [<0.343.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_a/_ensure_full_commit" in 0.098174 secs 
          [info] [<0.206.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_a/_local/5852f11ad174ce93069e8d2f2da026fc" in 0.197739 secs 
          [info] [<0.347.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_a/_ensure_full_commit" in 0.106469 secs 
          [info] [<0.206.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_a/_local/5852f11ad174ce93069e8d2f2da026fc" in 0.199129 secs 
          [info] [<0.372.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.198386 secs 
          [info] [<0.375.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.196966 secs 
          [info] [<0.374.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.197865 secs 
          [info] [<0.373.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.198386 secs 
          [info] [<0.373.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc8?new_edits=false" in 1.035908 secs 
          [info] [<0.374.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc7?new_edits=false" in 1.036799 secs 
          [info] [<0.372.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc9?new_edits=false" in 1.038673 secs 
          [info] [<0.375.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc6?new_edits=false" in 1.038161 secs 
          [info] [<0.373.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.198448 secs 
          [info] [<0.373.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc5?new_edits=false" in 0.339051 secs 
          [info] [<0.373.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc4?new_edits=false" in 0.329205 secs 
          [info] [<0.373.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc3?new_edits=false" in 0.339112 secs 
          [info] [<0.373.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc2?new_edits=false" in 0.329233 secs 
          [info] [<0.373.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc11?new_edits=false" in 0.368982 secs 
          [info] [<0.373.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc10?new_edits=false" in 0.329167 secs 
          [info] [<0.373.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc1?new_edits=false" in 0.389066 secs 
          [info] [<0.364.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_ensure_full_commit" in 0.1194 secs 
          [info] [<0.364.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/_local/85777ae3396849e65c872e08a7b42144" in 0.198753 secs 
          [info] [<0.459.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.199771 secs 
          [info] [<0.463.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.198931 secs 
          [info] [<0.461.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.199617 secs 
          [info] [<0.465.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.198968 secs 
          [info] [<0.496.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc9?new_edits=false" in 2.842012 secs 
          [info] [<0.505.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc6?new_edits=false" in 2.835165 secs 
          [info] [<0.504.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc7?new_edits=false" in 2.836418 secs 
          [info] [<0.500.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc8?new_edits=false" in 2.83799 secs 
          [info] [<0.459.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.197864 secs 
          [info] [<0.446.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_ensure_full_commit" in 0.106626 secs 
          [info] [<0.613.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_a/_ensure_full_commit" in 0.108089 secs 
          [info] [<0.446.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_a/_local/91cabaf2492f2bd2a4ca92a53f02ca2f" in 0.203431 secs 
          [info] [<0.446.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/_local/91cabaf2492f2bd2a4ca92a53f02ca2f" in 0.19678 secs 
          [info] [<0.574.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc2?new_edits=false" in 4.516424 secs 
          [info] [<0.570.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc3?new_edits=false" in 4.520756 secs 
          [info] [<0.566.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc4?new_edits=false" in 4.523575 secs 
          [info] [<0.563.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc5?new_edits=false" in 4.527971 secs 
          [info] [<0.584.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc1?new_edits=false" in 4.510487 secs 
          [info] [<0.581.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc10?new_edits=false" in 4.512293 secs 
          [info] [<0.578.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/doc11?new_edits=false" in 4.522901 secs 
          [info] [<0.622.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_a/_ensure_full_commit" in 0.099445 secs 
          [info] [<0.446.0>] Made request post "http://127.0.0.1:14927/couch_test_rep_db_b/_ensure_full_commit" in 0.099696 secs 
          [info] [<0.446.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_a/_local/91cabaf2492f2bd2a4ca92a53f02ca2f" in 0.198469 secs 
          [info] [<0.446.0>] Made request put "http://127.0.0.1:14927/couch_test_rep_db_b/_local/91cabaf2492f2bd2a4ca92a53f02ca2f" in 0.199494 secs 
          

          Nothing changed at all.

          FreeBSD 10:

          $ ./test/etap/run src/couch_replicator/test/04-replication-large-atts.t
          src/couch_replicator/test/04-replication-large-atts.t .. 
          ...
          All tests successful.
          Files=1, Tests=1192, 47 wallclock secs ( 0.43 usr  0.09 sys + 33.72 cusr 13.70 csys = 47.95 CPU)
          Result: PASS
          

          Whoa!

          Part of output with POST/PUT requests:

          info] [<0.349.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_a/_ensure_full_commit" in 0.094218 secs 
          [info] [<0.205.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_a/_local/05e6f7426223b3f0da8371efda4fa1a6" in 0.198086 secs 
          [info] [<0.352.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_a/_ensure_full_commit" in 0.109214 secs 
          [info] [<0.205.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_a/_local/05e6f7426223b3f0da8371efda4fa1a6" in 0.198422 secs 
          [info] [<0.378.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.195682 secs 
          [info] [<0.379.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.195524 secs 
          [info] [<0.381.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.195412 secs 
          [info] [<0.380.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.195646 secs 
          [info] [<0.381.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc6?new_edits=false" in 0.706663 secs 
          [info] [<0.380.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc7?new_edits=false" in 0.754916 secs 
          [info] [<0.379.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc8?new_edits=false" in 0.755129 secs 
          [info] [<0.378.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc9?new_edits=false" in 0.871189 secs 
          [info] [<0.381.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.203772 secs 
          [info] [<0.381.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc5?new_edits=false" in 0.378326 secs 
          [info] [<0.381.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc4?new_edits=false" in 0.409232 secs 
          [info] [<0.381.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc3?new_edits=false" in 0.418802 secs 
          [info] [<0.381.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc2?new_edits=false" in 0.399084 secs 
          [info] [<0.381.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc11?new_edits=false" in 0.404278 secs 
          [info] [<0.381.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc10?new_edits=false" in 0.323305 secs 
          [info] [<0.381.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc1?new_edits=false" in 0.359326 secs 
          [info] [<0.369.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_ensure_full_commit" in 0.108647 secs 
          [info] [<0.369.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/_local/76d88aed4242930e5cc84e6f22fb5cc7" in 0.206168 secs 
          [info] [<0.465.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.199665 secs 
          [info] [<0.467.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.199372 secs 
          [info] [<0.471.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.199052 secs 
          [info] [<0.469.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.199648 secs 
          [info] [<0.502.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc9?new_edits=false" in 1.746967 secs 
          [info] [<0.507.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc8?new_edits=false" in 1.879406 secs 
          [info] [<0.513.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc7?new_edits=false" in 1.890523 secs 
          [info] [<0.465.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.203363 secs 
          [info] [<0.506.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc6?new_edits=false" in 1.993508 secs 
          [info] [<0.623.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_a/_ensure_full_commit" in 0.100955 secs 
          [info] [<0.452.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_ensure_full_commit" in 0.101679 secs 
          [info] [<0.452.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_a/_local/f3b6cede324c784647b987424bb0aa7c" in 0.23693 secs 
          [info] [<0.452.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/_local/f3b6cede324c784647b987424bb0aa7c" in 0.191251 secs 
          [info] [<0.579.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc1?new_edits=false" in 3.288234 secs 
          [info] [<0.569.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc5?new_edits=false" in 3.306066 secs 
          [info] [<0.590.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc11?new_edits=false" in 3.319111 secs 
          [info] [<0.572.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc4?new_edits=false" in 3.343141 secs 
          [info] [<0.576.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc2?new_edits=false" in 3.352749 secs 
          [info] [<0.584.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc10?new_edits=false" in 3.480288 secs 
          [info] [<0.594.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/doc3?new_edits=false" in 3.478705 secs 
          [info] [<0.452.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_b/_ensure_full_commit" in 0.109566 secs 
          [info] [<0.628.0>] Made request post "http://127.0.0.1:27440/couch_test_rep_db_a/_ensure_full_commit" in 0.109595 secs 
          [info] [<0.452.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_a/_local/f3b6cede324c784647b987424bb0aa7c" in 0.198072 secs 
          [info] [<0.452.0>] Made request put "http://127.0.0.1:27440/couch_test_rep_db_b/_local/f3b6cede324c784647b987424bb0aa7c" in 0.199271 secs 
          

          Looks quite like as expected. But why?

          Paul Joseph Davis suggested me to track down the place where ?RECVBUF_SIZE matters - that's actually socket base options definition - and wrap with timer gen_tcp:recv/3 calls for mochiweb. The result was pretty interesting.

          Patch for logging:

          diff --git a/src/ibrowse/ibrowse_http_client.erl b/src/ibrowse/ibrowse_http_client.erl
          index a1cf6eb..6d05f29 100644
          --- a/src/ibrowse/ibrowse_http_client.erl
          +++ b/src/ibrowse/ibrowse_http_client.erl
          @@ -557,7 +557,11 @@ do_send(Req, #state{socket = Sock,
                               use_http_proxy = true,
                               proxy_tunnel_setup = Pts}) when Pts /= done ->  gen_tcp:send(Sock, Req);
           do_send(Req, #state{socket = Sock, is_ssl = true})  ->  ssl:send(Sock, Req);
          -do_send(Req, #state{socket = Sock, is_ssl = false}) ->  gen_tcp:send(Sock, Req).
          +do_send(Req, #state{socket = Sock, is_ssl = false}) ->  
          +  {Delay, Value} = timer:tc(gen_tcp, send, [Sock, Req]),
          +  io:fwrite("Sent ~p bytes in ~p sec ~n", [size(iolist_to_binary(Req)), Delay / 1000000]),
          +  Value.
          +
           
           %% @spec do_send_body(Sock::socket_descriptor(), Source::source_descriptor(), IsSSL::boolean()) -> ok | error()
           %% source_descriptor() = fun_arity_0           |
          
          diff --git a/src/mochiweb/mochiweb_socket.erl b/src/mochiweb/mochiweb_socket.erl
          index ad27204..8917f06 100644
          --- a/src/mochiweb/mochiweb_socket.erl
          +++ b/src/mochiweb/mochiweb_socket.erl
          @@ -44,7 +44,9 @@ after_accept(_Socket) -> ok.
           recv({ssl, Socket}, Length, Timeout) ->
               ssl:recv(Socket, Length, Timeout);
           recv(Socket, Length, Timeout) ->
          -    gen_tcp:recv(Socket, Length, Timeout).
          +    {Time, Value} = timer:tc(gen_tcp, recv, [Socket, Length, Timeout]),
          +    io:fwrite("  mochiweb_socket:recv / gen_tcp:recv(Socket, ~p, ~p) -- ~p sec ~n", [Length, Timeout, Time / 1000000]),
          +    Value.
           
           send({ssl, Socket}, Data) ->
               ssl:send(Socket, Data);
          

          And here is the output:

          So, actually this problem could be easily solved by changing socket options in config file with adding {recbuf,32768} value to the list (or by using one from example without patching anything.

          But this looks as workaround, isn't so? The question why so is remains opened.

          Show
          kxepal Alexander Shorin added a comment - - edited Ok, here I got some numbers for the update. Long post. FreeBSD 9.1 with Erlang R15B02, clean install (vbox, 512 MB, 2 CPU) FreeBSD 10 with Erlang R16B02, clean install (vbox, 512 MB, 2 CPU) I ran 04-replication-large-atts.t test on all of them: $ ./test/etap/run src/couch_replicator/test/04-replication-large-atts.t And there is the results for the master's HEAD : FreeBSD 9.1: Files=1, Tests=1192, 47 wallclock secs ( 0.19 usr 0.20 sys + 17.39 cusr 18.25 csys = 36.03 CPU) FreeBSD 10: Bailout called. Further testing stopped: Timeout waiting for replication to finish Not the news at all. Then, I applied small patch to measure the response time: diff --git a/src/couch_replicator/src/couch_replicator_httpc.erl b/src/couch_replicator/src/couch_replicator_httpc.erl index b8fb31b..14f31f8 100644 --- a/src/couch_replicator/src/couch_replicator_httpc.erl +++ b/src/couch_replicator/src/couch_replicator_httpc.erl @@ -61,8 +61,9 @@ send_ibrowse_req(#httpdb{headers = BaseHeaders} = HttpDb, Params) -> lists:ukeymerge(1, get_value(ibrowse_options, Params, []), HttpDb#httpdb.ibrowse_options) ], - Response = ibrowse:send_req_direct( - Worker, Url, Headers2, Method, Body, IbrowseOptions, infinity), + {Delay, Response} = timer:tc(ibrowse, send_req_direct, + [Worker, Url, Headers2, Method, Body, IbrowseOptions, infinity]), + ?LOG_INFO( "Made request ~p ~p in ~p secs ~n" , [Method, Url, Delay / 1000000]), {Worker, Response}. Since the most heavies requests are attachments uploading, let's filter the output: $ ./test/etap/run src/couch_replicator/test/04-replication-large-atts.t | egrep "Made request (post|put)" FreeBSD 9.1: [info] [<0.345.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_a/_ensure_full_commit" in 0.101148 secs [info] [<0.208.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_a/_local/ed1e1b8b6cc2aea9eaeb935affe17fdd" in 0.194595 secs [info] [<0.351.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_a/_ensure_full_commit" in 0.115996 secs [info] [<0.208.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_a/_local/ed1e1b8b6cc2aea9eaeb935affe17fdd" in 0.19801 secs [info] [<0.376.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.198322 secs [info] [<0.379.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.196964 secs [info] [<0.378.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.197843 secs [info] [<0.377.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.198358 secs [info] [<0.377.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc8?new_edits= false " in 0.995802 secs [info] [<0.378.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc7?new_edits= false " in 0.996697 secs [info] [<0.376.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc9?new_edits= false " in 0.998658 secs [info] [<0.379.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc6?new_edits= false " in 0.998032 secs [info] [<0.377.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.198479 secs [info] [<0.377.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc5?new_edits= false " in 0.328846 secs [info] [<0.377.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc4?new_edits= false " in 0.329292 secs [info] [<0.377.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc3?new_edits= false " in 0.339167 secs [info] [<0.377.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc2?new_edits= false " in 0.329193 secs [info] [<0.377.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc11?new_edits= false " in 0.329162 secs [info] [<0.377.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc10?new_edits= false " in 0.329075 secs [info] [<0.377.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc1?new_edits= false " in 0.329311 secs [info] [<0.368.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_ensure_full_commit" in 0.099283 secs [info] [<0.368.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/_local/76054f0aaccade46a4806b08a211bc16" in 0.198862 secs [info] [<0.462.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.199692 secs [info] [<0.466.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.198908 secs [info] [<0.464.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.199588 secs [info] [<0.468.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.199018 secs [info] [<0.500.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc9?new_edits= false " in 2.522225 secs [info] [<0.510.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc6?new_edits= false " in 2.514634 secs [info] [<0.506.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc7?new_edits= false " in 2.51673 secs [info] [<0.505.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc8?new_edits= false " in 2.517766 secs [info] [<0.462.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_revs_diff" in 0.197848 secs [info] [<0.616.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_a/_ensure_full_commit" in 0.108526 secs [info] [<0.450.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_ensure_full_commit" in 0.107361 secs [info] [<0.450.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_a/_local/e8009d39fc94c9267bf51e9863c46895" in 0.202483 secs [info] [<0.450.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/_local/e8009d39fc94c9267bf51e9863c46895" in 0.197966 secs [info] [<0.570.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc4?new_edits= false " in 4.332465 secs [info] [<0.567.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc5?new_edits= false " in 4.33544 secs [info] [<0.578.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc2?new_edits= false " in 4.335835 secs [info] [<0.574.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc3?new_edits= false " in 4.33976 secs [info] [<0.588.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc1?new_edits= false " in 4.328189 secs [info] [<0.585.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc10?new_edits= false " in 4.329808 secs [info] [<0.582.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/doc11?new_edits= false " in 4.331474 secs [info] [<0.625.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_a/_ensure_full_commit" in 0.099479 secs [info] [<0.450.0>] Made request post "http: //127.0.0.1:64967/couch_test_rep_db_b/_ensure_full_commit" in 0.099653 secs [info] [<0.450.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_a/_local/e8009d39fc94c9267bf51e9863c46895" in 0.198995 secs [info] [<0.450.0>] Made request put "http: //127.0.0.1:64967/couch_test_rep_db_b/_local/e8009d39fc94c9267bf51e9863c46895" in 0.199502 secs Nothing looks criminal. FreeBSD 10: # Creating target database # Triggering replication [info] [<0.309.0>] 127.0.0.1 - - HEAD /couch_test_rep_db_b/ 200 [info] [<0.370.0>] Made request head "http: //127.0.0.1:11166/couch_test_rep_db_b/" in 0.004577 secs [info] [<0.309.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/ 200 [info] [<0.370.0>] Made request get "http: //127.0.0.1:11166/couch_test_rep_db_b/" in 0.100691 secs [info] [<0.309.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/2eaa8e3e950d010d82b20eff81e9f651 404 [info] [<0.370.0>] Made request get "http: //127.0.0.1:11166/couch_test_rep_db_b/_local/2eaa8e3e950d010d82b20eff81e9f651" in 0.098708 secs [info] [<0.309.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/38bbd51d509064584b986ecfc71e443a 404 [info] [<0.370.0>] Made request get "http: //127.0.0.1:11166/couch_test_rep_db_b/_local/38bbd51d509064584b986ecfc71e443a" in 0.099318 secs [info] [<0.309.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/eaecd74f7be532d2b82aa4c137cc0b11 404 [info] [<0.370.0>] Made request get "http: //127.0.0.1:11166/couch_test_rep_db_b/_local/eaecd74f7be532d2b82aa4c137cc0b11" in 0.099245 secs [info] [<0.370.0>] Replication ` "2eaa8e3e950d010d82b20eff81e9f651" ` is using: 4 worker processes a worker batch size of 500 20 HTTP connections a connection timeout of 30000 milliseconds 10 retries per request socket options are: [{keepalive, true },{nodelay, false }] [info] [<0.2.0>] starting new replication `2eaa8e3e950d010d82b20eff81e9f651` at <0.370.0> (`couch_test_rep_db_a` -> `http: //127.0.0.1:11166/couch_test_rep_db_b/`) [info] [<0.309.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [<0.307.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [<0.310.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [<0.302.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [<0.378.0>] Made request post "http: //127.0.0.1:11166/couch_test_rep_db_b/_revs_diff" in 0.198793 secs [info] [<0.380.0>] Made request post "http: //127.0.0.1:11166/couch_test_rep_db_b/_revs_diff" in 0.200831 secs [info] [<0.379.0>] Made request post "http: //127.0.0.1:11166/couch_test_rep_db_b/_revs_diff" in 0.20114 secs [info] [<0.381.0>] Made request post "http: //127.0.0.1:11166/couch_test_rep_db_b/_revs_diff" in 0.20127 secs [info] [<0.310.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc8?new_edits= false 201 [info] [<0.309.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc9?new_edits= false 201 [info] [<0.302.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc6?new_edits= false 201 [info] [<0.307.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc7?new_edits= false 201 [info] [<0.378.0>] Made request put "http: //127.0.0.1:11166/couch_test_rep_db_b/doc9?new_edits= false " in 116.229091 secs [info] [<0.379.0>] Made request put "http: //127.0.0.1:11166/couch_test_rep_db_b/doc8?new_edits= false " in 116.223251 secs [info] [<0.381.0>] Made request put "http: //127.0.0.1:11166/couch_test_rep_db_b/doc6?new_edits= false " in 116.226021 secs [info] [<0.380.0>] Made request put "http: //127.0.0.1:11166/couch_test_rep_db_b/doc7?new_edits= false " in 116.227487 secs [info] [<0.310.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [<0.378.0>] Made request post "http: //127.0.0.1:11166/couch_test_rep_db_b/_revs_diff" in 0.198398 secs [info] [<0.307.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_ensure_full_commit 201 [info] [<0.370.0>] Made request post "http: //127.0.0.1:11166/couch_test_rep_db_b/_ensure_full_commit" in 0.099793 secs [info] [<0.370.0>] recording a checkpoint for `couch_test_rep_db_a` -> `http: //127.0.0.1:11166/couch_test_rep_db_b/` at source update_seq 3 [info] [<0.307.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/_local/2eaa8e3e950d010d82b20eff81e9f651 201 [info] [<0.370.0>] Made request put "http: //127.0.0.1:11166/couch_test_rep_db_b/_local/2eaa8e3e950d010d82b20eff81e9f651" in 0.20571 secs [info] [<0.310.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc5?new_edits= false 201 [info] [<0.378.0>] Made request put "http: //127.0.0.1:11166/couch_test_rep_db_b/doc5?new_edits= false " in 115.208404 secs Bailout called. Further testing stopped: Timeout waiting for replication to finish Bail out! Timeout waiting for replication to finish FAILED--Further testing stopped: Timeout waiting for replication to finish Hold on a second: 115 seconds for pushing single doc to localhost. Btw, on 1.6.x branch this time equals to 120 secs so Nicks patch really works, but doesn't actually solves the issue. Ok, to make testing more fair let's try use Erlang R15B03 on FreeBSD 10: # Creating target database # Triggering replication [info] [<0.307.0>] 127.0.0.1 - - HEAD /couch_test_rep_db_b/ 200 [info] [<0.370.0>] Made request head "http: //127.0.0.1:51750/couch_test_rep_db_b/" in 0.003978 secs [info] [<0.307.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/ 200 [info] [<0.370.0>] Made request get "http: //127.0.0.1:51750/couch_test_rep_db_b/" in 0.098668 secs [info] [<0.307.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/1fef0ca5eb7ea55b37727c12cbb474b6 404 [info] [<0.370.0>] Made request get "http: //127.0.0.1:51750/couch_test_rep_db_b/_local/1fef0ca5eb7ea55b37727c12cbb474b6" in 0.107729 secs [info] [<0.307.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/0344c3dab39f52994346ee2f286a109c 404 [info] [<0.370.0>] Made request get "http: //127.0.0.1:51750/couch_test_rep_db_b/_local/0344c3dab39f52994346ee2f286a109c" in 0.09909 secs [info] [<0.307.0>] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/2c46e037f893bea49f746b1c89fb073b 404 [info] [<0.370.0>] Made request get "http: //127.0.0.1:51750/couch_test_rep_db_b/_local/2c46e037f893bea49f746b1c89fb073b" in 0.099404 secs [info] [<0.370.0>] Replication ` "1fef0ca5eb7ea55b37727c12cbb474b6" ` is using: 4 worker processes a worker batch size of 500 20 HTTP connections a connection timeout of 30000 milliseconds 10 retries per request socket options are: [{keepalive, true },{nodelay, false }] [info] [<0.2.0>] starting new replication `1fef0ca5eb7ea55b37727c12cbb474b6` at <0.370.0> (`couch_test_rep_db_a` -> `http: //127.0.0.1:51750/couch_test_rep_db_b/`) [info] [<0.307.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [<0.306.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [<0.303.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [<0.311.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [<0.382.0>] Made request post "http: //127.0.0.1:51750/couch_test_rep_db_b/_revs_diff" in 0.19943 secs [info] [<0.379.0>] Made request post "http: //127.0.0.1:51750/couch_test_rep_db_b/_revs_diff" in 0.201037 secs [info] [<0.380.0>] Made request post "http: //127.0.0.1:51750/couch_test_rep_db_b/_revs_diff" in 0.200908 secs [info] [<0.381.0>] Made request post "http: //127.0.0.1:51750/couch_test_rep_db_b/_revs_diff" in 0.200642 secs [info] [<0.311.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc9?new_edits= false 201 [info] [<0.303.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc7?new_edits= false 201 [info] [<0.307.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc8?new_edits= false 201 [info] [<0.306.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc6?new_edits= false 201 [info] [<0.379.0>] Made request put "http: //127.0.0.1:51750/couch_test_rep_db_b/doc9?new_edits= false " in 115.210925 secs [info] [<0.381.0>] Made request put "http: //127.0.0.1:51750/couch_test_rep_db_b/doc7?new_edits= false " in 115.212048 secs [info] [<0.382.0>] Made request put "http: //127.0.0.1:51750/couch_test_rep_db_b/doc6?new_edits= false " in 115.21568 secs [info] [<0.380.0>] Made request put "http: //127.0.0.1:51750/couch_test_rep_db_b/doc8?new_edits= false " in 115.216761 secs [info] [<0.311.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [<0.381.0>] Made request post "http: //127.0.0.1:51750/couch_test_rep_db_b/_revs_diff" in 0.198427 secs [info] [<0.307.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_ensure_full_commit 201 [info] [<0.370.0>] Made request post "http: //127.0.0.1:51750/couch_test_rep_db_b/_ensure_full_commit" in 0.107235 secs [info] [<0.370.0>] recording a checkpoint for `couch_test_rep_db_a` -> `http: //127.0.0.1:51750/couch_test_rep_db_b/` at source update_seq 2 [info] [<0.307.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/_local/1fef0ca5eb7ea55b37727c12cbb474b6 201 [info] [<0.370.0>] Made request put "http: //127.0.0.1:51750/couch_test_rep_db_b/_local/1fef0ca5eb7ea55b37727c12cbb474b6" in 0.206531 secs [info] [<0.311.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc5?new_edits= false 201 [info] [<0.381.0>] Made request put "http: //127.0.0.1:51750/couch_test_rep_db_b/doc5?new_edits= false " in 115.208751 secs Bailout called. Further testing stopped: Timeout waiting for replication to finish Bail out! Timeout waiting for replication to finish FAILED--Further testing stopped: Timeout waiting for replication to finish Still the same picture. Ok, so the problem could be either in FreeBSD 10 default configuration or in our code. Let's assume that it's the latter. First, it looks like client (ibrowser) works too slow. On previous debug I found that gen_tcp:send() takes about 200 milliseconds to send chunk with 8192 bytes to server. Not really sure what I tried to find by searching 8192 string in the source code, but I found that exactly the same numbers are used by mochiweb to define read buffer chunk size. So that's the idea: what will happens if I increase it? diff --git a/src/mochiweb/internal.hrl b/src/mochiweb/internal.hrl index 6db899a..338edca 100644 --- a/src/mochiweb/internal.hrl +++ b/src/mochiweb/internal.hrl @@ -1,3 +1,3 @@ --define(RECBUF_SIZE, 8192). +-define(RECBUF_SIZE, 8192 * 4). After the patch: FreeBSD 9.1 [info] [<0.343.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_a/_ensure_full_commit" in 0.098174 secs [info] [<0.206.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_a/_local/5852f11ad174ce93069e8d2f2da026fc" in 0.197739 secs [info] [<0.347.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_a/_ensure_full_commit" in 0.106469 secs [info] [<0.206.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_a/_local/5852f11ad174ce93069e8d2f2da026fc" in 0.199129 secs [info] [<0.372.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.198386 secs [info] [<0.375.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.196966 secs [info] [<0.374.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.197865 secs [info] [<0.373.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.198386 secs [info] [<0.373.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc8?new_edits= false " in 1.035908 secs [info] [<0.374.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc7?new_edits= false " in 1.036799 secs [info] [<0.372.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc9?new_edits= false " in 1.038673 secs [info] [<0.375.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc6?new_edits= false " in 1.038161 secs [info] [<0.373.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.198448 secs [info] [<0.373.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc5?new_edits= false " in 0.339051 secs [info] [<0.373.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc4?new_edits= false " in 0.329205 secs [info] [<0.373.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc3?new_edits= false " in 0.339112 secs [info] [<0.373.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc2?new_edits= false " in 0.329233 secs [info] [<0.373.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc11?new_edits= false " in 0.368982 secs [info] [<0.373.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc10?new_edits= false " in 0.329167 secs [info] [<0.373.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc1?new_edits= false " in 0.389066 secs [info] [<0.364.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_ensure_full_commit" in 0.1194 secs [info] [<0.364.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/_local/85777ae3396849e65c872e08a7b42144" in 0.198753 secs [info] [<0.459.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.199771 secs [info] [<0.463.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.198931 secs [info] [<0.461.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.199617 secs [info] [<0.465.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.198968 secs [info] [<0.496.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc9?new_edits= false " in 2.842012 secs [info] [<0.505.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc6?new_edits= false " in 2.835165 secs [info] [<0.504.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc7?new_edits= false " in 2.836418 secs [info] [<0.500.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc8?new_edits= false " in 2.83799 secs [info] [<0.459.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_revs_diff" in 0.197864 secs [info] [<0.446.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_ensure_full_commit" in 0.106626 secs [info] [<0.613.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_a/_ensure_full_commit" in 0.108089 secs [info] [<0.446.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_a/_local/91cabaf2492f2bd2a4ca92a53f02ca2f" in 0.203431 secs [info] [<0.446.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/_local/91cabaf2492f2bd2a4ca92a53f02ca2f" in 0.19678 secs [info] [<0.574.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc2?new_edits= false " in 4.516424 secs [info] [<0.570.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc3?new_edits= false " in 4.520756 secs [info] [<0.566.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc4?new_edits= false " in 4.523575 secs [info] [<0.563.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc5?new_edits= false " in 4.527971 secs [info] [<0.584.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc1?new_edits= false " in 4.510487 secs [info] [<0.581.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc10?new_edits= false " in 4.512293 secs [info] [<0.578.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/doc11?new_edits= false " in 4.522901 secs [info] [<0.622.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_a/_ensure_full_commit" in 0.099445 secs [info] [<0.446.0>] Made request post "http: //127.0.0.1:14927/couch_test_rep_db_b/_ensure_full_commit" in 0.099696 secs [info] [<0.446.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_a/_local/91cabaf2492f2bd2a4ca92a53f02ca2f" in 0.198469 secs [info] [<0.446.0>] Made request put "http: //127.0.0.1:14927/couch_test_rep_db_b/_local/91cabaf2492f2bd2a4ca92a53f02ca2f" in 0.199494 secs Nothing changed at all. FreeBSD 10: $ ./test/etap/run src/couch_replicator/test/04-replication-large-atts.t src/couch_replicator/test/04-replication-large-atts.t .. ... All tests successful. Files=1, Tests=1192, 47 wallclock secs ( 0.43 usr 0.09 sys + 33.72 cusr 13.70 csys = 47.95 CPU) Result: PASS Whoa! Part of output with POST/PUT requests: info] [<0.349.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_a/_ensure_full_commit" in 0.094218 secs [info] [<0.205.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_a/_local/05e6f7426223b3f0da8371efda4fa1a6" in 0.198086 secs [info] [<0.352.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_a/_ensure_full_commit" in 0.109214 secs [info] [<0.205.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_a/_local/05e6f7426223b3f0da8371efda4fa1a6" in 0.198422 secs [info] [<0.378.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.195682 secs [info] [<0.379.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.195524 secs [info] [<0.381.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.195412 secs [info] [<0.380.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.195646 secs [info] [<0.381.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc6?new_edits= false " in 0.706663 secs [info] [<0.380.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc7?new_edits= false " in 0.754916 secs [info] [<0.379.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc8?new_edits= false " in 0.755129 secs [info] [<0.378.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc9?new_edits= false " in 0.871189 secs [info] [<0.381.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.203772 secs [info] [<0.381.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc5?new_edits= false " in 0.378326 secs [info] [<0.381.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc4?new_edits= false " in 0.409232 secs [info] [<0.381.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc3?new_edits= false " in 0.418802 secs [info] [<0.381.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc2?new_edits= false " in 0.399084 secs [info] [<0.381.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc11?new_edits= false " in 0.404278 secs [info] [<0.381.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc10?new_edits= false " in 0.323305 secs [info] [<0.381.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc1?new_edits= false " in 0.359326 secs [info] [<0.369.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_ensure_full_commit" in 0.108647 secs [info] [<0.369.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/_local/76d88aed4242930e5cc84e6f22fb5cc7" in 0.206168 secs [info] [<0.465.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.199665 secs [info] [<0.467.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.199372 secs [info] [<0.471.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.199052 secs [info] [<0.469.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.199648 secs [info] [<0.502.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc9?new_edits= false " in 1.746967 secs [info] [<0.507.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc8?new_edits= false " in 1.879406 secs [info] [<0.513.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc7?new_edits= false " in 1.890523 secs [info] [<0.465.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_revs_diff" in 0.203363 secs [info] [<0.506.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc6?new_edits= false " in 1.993508 secs [info] [<0.623.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_a/_ensure_full_commit" in 0.100955 secs [info] [<0.452.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_ensure_full_commit" in 0.101679 secs [info] [<0.452.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_a/_local/f3b6cede324c784647b987424bb0aa7c" in 0.23693 secs [info] [<0.452.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/_local/f3b6cede324c784647b987424bb0aa7c" in 0.191251 secs [info] [<0.579.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc1?new_edits= false " in 3.288234 secs [info] [<0.569.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc5?new_edits= false " in 3.306066 secs [info] [<0.590.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc11?new_edits= false " in 3.319111 secs [info] [<0.572.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc4?new_edits= false " in 3.343141 secs [info] [<0.576.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc2?new_edits= false " in 3.352749 secs [info] [<0.584.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc10?new_edits= false " in 3.480288 secs [info] [<0.594.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/doc3?new_edits= false " in 3.478705 secs [info] [<0.452.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_b/_ensure_full_commit" in 0.109566 secs [info] [<0.628.0>] Made request post "http: //127.0.0.1:27440/couch_test_rep_db_a/_ensure_full_commit" in 0.109595 secs [info] [<0.452.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_a/_local/f3b6cede324c784647b987424bb0aa7c" in 0.198072 secs [info] [<0.452.0>] Made request put "http: //127.0.0.1:27440/couch_test_rep_db_b/_local/f3b6cede324c784647b987424bb0aa7c" in 0.199271 secs Looks quite like as expected. But why? Paul Joseph Davis suggested me to track down the place where ?RECVBUF_SIZE matters - that's actually socket base options definition - and wrap with timer gen_tcp:recv/3 calls for mochiweb. The result was pretty interesting. Patch for logging: diff --git a/src/ibrowse/ibrowse_http_client.erl b/src/ibrowse/ibrowse_http_client.erl index a1cf6eb..6d05f29 100644 --- a/src/ibrowse/ibrowse_http_client.erl +++ b/src/ibrowse/ibrowse_http_client.erl @@ -557,7 +557,11 @@ do_send(Req, #state{socket = Sock, use_http_proxy = true , proxy_tunnel_setup = Pts}) when Pts /= done -> gen_tcp:send(Sock, Req); do_send(Req, #state{socket = Sock, is_ssl = true }) -> ssl:send(Sock, Req); -do_send(Req, #state{socket = Sock, is_ssl = false }) -> gen_tcp:send(Sock, Req). +do_send(Req, #state{socket = Sock, is_ssl = false }) -> + {Delay, Value} = timer:tc(gen_tcp, send, [Sock, Req]), + io:fwrite( "Sent ~p bytes in ~p sec ~n" , [size(iolist_to_binary(Req)), Delay / 1000000]), + Value. + %% @spec do_send_body(Sock::socket_descriptor(), Source::source_descriptor(), IsSSL:: boolean ()) -> ok | error() %% source_descriptor() = fun_arity_0 | diff --git a/src/mochiweb/mochiweb_socket.erl b/src/mochiweb/mochiweb_socket.erl index ad27204..8917f06 100644 --- a/src/mochiweb/mochiweb_socket.erl +++ b/src/mochiweb/mochiweb_socket.erl @@ -44,7 +44,9 @@ after_accept(_Socket) -> ok. recv({ssl, Socket}, Length, Timeout) -> ssl:recv(Socket, Length, Timeout); recv(Socket, Length, Timeout) -> - gen_tcp:recv(Socket, Length, Timeout). + {Time, Value} = timer:tc(gen_tcp, recv, [Socket, Length, Timeout]), + io:fwrite( " mochiweb_socket:recv / gen_tcp:recv(Socket, ~p, ~p) -- ~p sec ~n" , [Length, Timeout, Time / 1000000]), + Value. send({ssl, Socket}, Data) -> ssl:send(Socket, Data); And here is the output: Factor 1 (RECBUF == 8192) Factor 2 (RECBUF == 16384) Factor 4 (RECBUF == 32768) - blindly fast Factor 1/2 (RECBUF == 4096) - recbuf == read chunk size So, actually this problem could be easily solved by changing socket options in config file with adding { recbuf,32768 } value to the list (or by using one from example without patching anything. But this looks as workaround, isn't so? The question why so is remains opened.
          Hide
          andywenk Andy Wenk added a comment - - edited

          branch 1986-fix-ibrowse-infinite-async-timeout commit b35884580436824adc7: All tests successful.

          When reverting the find_in_binary commit (https://github.com/apache/couchdb/commit/0bf823c6e5d6a7054e7cb1979eabc7695c2707e2) in this branch, make check fails again

          Show
          andywenk Andy Wenk added a comment - - edited branch 1986-fix-ibrowse-infinite-async-timeout commit b35884580436824adc7: All tests successful. When reverting the find_in_binary commit ( https://github.com/apache/couchdb/commit/0bf823c6e5d6a7054e7cb1979eabc7695c2707e2 ) in this branch, make check fails again
          Hide
          kxepal Alexander Shorin added a comment -

          Repost from dev@ ML with small fixes.

          I added a bit more logging to ibrowse and couch_replicator and it seems that it takes ~120 secs to transfer single document with 9MB attachment in case remote->local where remote is the same localhost. Here is the output:

          [info] [<0.342.0>] <0.336.0>: requesting post "http://127.0.0.1:27786/couch_test_rep_db_b/_revs_diff" 
              [{"Accept","application/json"},
               {"Content-Type","application/json"},
               {"User-Agent","CouchDB/0.0.0"}]
          [info] [<0.342.0>] <0.336.0>: got response in 0 sec
          [info] [<0.342.0>] <0.336.0>: requesting put "http://127.0.0.1:27786/couch_test_rep_db_b/doc9?new_edits=false"
              [{"Accept","application/json"},
               {"Content-Length",9019252},
               {"Content-Type","multipart/related;boundary=\"d0b1be0e0754a41b1dd5db669ed0f29b\""},
               {"User-Agent","CouchDB/0.0.0"}]
          [info] [<0.342.0>] <0.336.0>: got response in 120 sec
          [info] [<0.342.0>] <0.336.0>: requesting post "http://127.0.0.1:27786/couch_test_rep_db_b/_revs_diff" 
              [{"Accept","application/json"},
               {"Content-Type","application/json"},
               {"User-Agent","CouchDB/0.0.0"}]
          [info] [<0.342.0>] <0.336.0>: got response in 0 sec
          

          the test fails with timeout during transferring of the third document.

          And here is more interesting logs:
          https://www.friendpaste.com/7BOCpdlWxnDpmRWXsNQjI7
          I added logging around gen_tcp:send call at:
          https://github.com/apache/couchdb/blob/master/src/ibrowse/ibrowse_http_client.erl#L560

          do_send/2 >>> - before gen_tcp:send call
          do_send/2 <<< - after gen_tcp:send call

          Note how the timings rapidly starts but very soon it hits some limitation and transfer goes by very short chunks which are split by the gen_tcp:send call and timeout lasts always around 200 milliseconds. Also sometimes the next message arrives from ibrowse tracing:
          2014-1-29_9:6:25:259 – (127.0.0.1:55867) - Inactivity timeout triggered. Shutting down connection

          Any ideas?

          Show
          kxepal Alexander Shorin added a comment - Repost from dev@ ML with small fixes. I added a bit more logging to ibrowse and couch_replicator and it seems that it takes ~120 secs to transfer single document with 9MB attachment in case remote->local where remote is the same localhost. Here is the output: [info] [<0.342.0>] <0.336.0>: requesting post "http: //127.0.0.1:27786/couch_test_rep_db_b/_revs_diff" [{ "Accept" , "application/json" }, { "Content-Type" , "application/json" }, { "User-Agent" , "CouchDB/0.0.0" }] [info] [<0.342.0>] <0.336.0>: got response in 0 sec [info] [<0.342.0>] <0.336.0>: requesting put "http: //127.0.0.1:27786/couch_test_rep_db_b/doc9?new_edits= false " [{ "Accept" , "application/json" }, { "Content-Length" ,9019252}, { "Content-Type" , "multipart/related;boundary=\" d0b1be0e0754a41b1dd5db669ed0f29b\""}, { "User-Agent" , "CouchDB/0.0.0" }] [info] [<0.342.0>] <0.336.0>: got response in 120 sec [info] [<0.342.0>] <0.336.0>: requesting post "http: //127.0.0.1:27786/couch_test_rep_db_b/_revs_diff" [{ "Accept" , "application/json" }, { "Content-Type" , "application/json" }, { "User-Agent" , "CouchDB/0.0.0" }] [info] [<0.342.0>] <0.336.0>: got response in 0 sec the test fails with timeout during transferring of the third document. And here is more interesting logs: https://www.friendpaste.com/7BOCpdlWxnDpmRWXsNQjI7 I added logging around gen_tcp:send call at: https://github.com/apache/couchdb/blob/master/src/ibrowse/ibrowse_http_client.erl#L560 do_send/2 >>> - before gen_tcp:send call do_send/2 <<< - after gen_tcp:send call Note how the timings rapidly starts but very soon it hits some limitation and transfer goes by very short chunks which are split by the gen_tcp:send call and timeout lasts always around 200 milliseconds. Also sometimes the next message arrives from ibrowse tracing: 2014-1-29_9:6:25:259 – (127.0.0.1:55867) - Inactivity timeout triggered. Shutting down connection Any ideas?
          Hide
          benoitc Benoit Chesneau added a comment -

          I finally reproduced it on macosx when only using the ibrowse patch. So it is not enough unfortunately

          Show
          benoitc Benoit Chesneau added a comment - I finally reproduced it on macosx when only using the ibrowse patch. So it is not enough unfortunately
          Hide
          benoitc Benoit Chesneau added a comment -

          Andy Wenk well just the latest mba with all the top options I retried today on a new build without reproducing the issue

          Dave Cottlehuber I can't say for sure that the patch above solve the issue, since it was already solved for me but all tests pass with it too anyway.

          Show
          benoitc Benoit Chesneau added a comment - Andy Wenk well just the latest mba with all the top options I retried today on a new build without reproducing the issue Dave Cottlehuber I can't say for sure that the patch above solve the issue, since it was already solved for me but all tests pass with it too anyway.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit b35884580436824adc7f250d9adbe857440026d7 in branch refs/heads/1986-fix-ibrowse-infinite-async-timeout from Dave Cottlehuber
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=b358845 ]

          ibrowse: fix async hang due to infinite timeout

          [1]: https://github.com/cmullaparthi/ibrowse/pull/91.patch

          Show
          jira-bot ASF subversion and git services added a comment - Commit b35884580436824adc7f250d9adbe857440026d7 in branch refs/heads/1986-fix-ibrowse-infinite-async-timeout from Dave Cottlehuber [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=b358845 ] ibrowse: fix async hang due to infinite timeout import upstream timeout fix [1] addresses COUCHDB-1986 and quite possibly COUCHDB-1874 as well [1] : https://github.com/cmullaparthi/ibrowse/pull/91.patch
          Hide
          andywenk Andy Wenk added a comment -

          Benoit - what hyper performance Mac book do you have . I mean mine is really new with a ssd hd. So maybe I am doing something wrong. I did run the tests against master - not the 1.6. release branch because I assumed that the fix has not been merged ...

          Any idea what I could have possibly done wrong? No one else with a MBP having the problem?

          Show
          andywenk Andy Wenk added a comment - Benoit - what hyper performance Mac book do you have . I mean mine is really new with a ssd hd. So maybe I am doing something wrong. I did run the tests against master - not the 1.6. release branch because I assumed that the fix has not been merged ... Any idea what I could have possibly done wrong? No one else with a MBP having the problem?
          Hide
          klaus_trainer Klaus Trainer added a comment -

          Sorry, I forgot to mention in my previous comment that all tests finished successfully every time.

          Show
          klaus_trainer Klaus Trainer added a comment - Sorry, I forgot to mention in my previous comment that all tests finished successfully every time.
          Hide
          klaus_trainer Klaus Trainer added a comment - - edited

          I ran the tests in 'src/couch_replicator/test/' both against the current 1.6.x release branch and the current master branch. I executed those tests individually and ran the whole test suite as well for multiple times on each branch.

          Note that I've been running those tests on my notebook, which still has that spinning HDD (maybe you remember those) and is about five years old now.

          More details regarding the system configuration:

          Ubuntu 12.10 x86_64

          Linux 3.5.0-45-generic

          Erlang R16B03 (erts-5.10.4) [source] [64-bit halfword] [smp:2:2] [async-threads:10] [kernel-poll:false]

          Show
          klaus_trainer Klaus Trainer added a comment - - edited I ran the tests in 'src/couch_replicator/test/' both against the current 1.6.x release branch and the current master branch. I executed those tests individually and ran the whole test suite as well for multiple times on each branch. Note that I've been running those tests on my notebook, which still has that spinning HDD (maybe you remember those) and is about five years old now. More details regarding the system configuration: Ubuntu 12.10 x86_64 Linux 3.5.0-45-generic Erlang R16B03 (erts-5.10.4) [source] [64-bit halfword] [smp:2:2] [async-threads:10] [kernel-poll:false]
          Hide
          benoitc Benoit Chesneau added a comment - - edited

          I just relaunched the test 4 times and couldn't reproduce it. Probably because the small speed improvements with COUCHDB-1953 were enough on my machine. At least this is what I guess since you're still reproducing it. Linux is also notably faster than macosx for I/O operations which could explain what I never reproduce it on it.

          Which let me think that something prevents the replicator to timeout. Not sure why right now.

          Show
          benoitc Benoit Chesneau added a comment - - edited I just relaunched the test 4 times and couldn't reproduce it. Probably because the small speed improvements with COUCHDB-1953 were enough on my machine. At least this is what I guess since you're still reproducing it. Linux is also notably faster than macosx for I/O operations which could explain what I never reproduce it on it. Which let me think that something prevents the replicator to timeout. Not sure why right now.
          Hide
          andywenk Andy Wenk added a comment -

          I ran make check against master (6fd38404137e2c61) and have stil the problem:

          /Users/andwen/project/couchdb/couchdb-src-clean/src/couch_replicator/test/04-replication-large-atts.t ... 493/? Bailout called. Further testing stopped: Timeout waiting for replication to finish
          FAILED--Further testing stopped: Timeout waiting for replication to finish
          make[2]: *** [check] Error 127
          make[1]: *** [check-recursive] Error 1
          make: *** [check-recursive] Error 1

          Show
          andywenk Andy Wenk added a comment - I ran make check against master (6fd38404137e2c61) and have stil the problem: /Users/andwen/project/couchdb/couchdb-src-clean/src/couch_replicator/test/04-replication-large-atts.t ... 493/? Bailout called. Further testing stopped: Timeout waiting for replication to finish FAILED--Further testing stopped: Timeout waiting for replication to finish make [2] : *** [check] Error 127 make [1] : *** [check-recursive] Error 1 make: *** [check-recursive] Error 1
          Hide
          benoitc Benoit Chesneau added a comment - - edited

          Side effect? Applying #COUCHDB-1953 seems to solve the issue on my mac. At least the test pass.

          Show
          benoitc Benoit Chesneau added a comment - - edited Side effect? Applying # COUCHDB-1953 seems to solve the issue on my mac. At least the test pass.
          Hide
          dch Dave Cottlehuber added a comment -

          actually my last comment doesn't make sense, test 07 obviously runs after 04... scratch that.

          Show
          dch Dave Cottlehuber added a comment - actually my last comment doesn't make sense, test 07 obviously runs after 04... scratch that.
          Hide
          dch Dave Cottlehuber added a comment -

          NB passes after repeating a few times on:

          • OSX Lion 10.7.5 incl R16B03

          In my tests this often gets stuck around 576 in 07-use-checkpoints.t

          This is very hand-wavey but https://github.com/apache/couchdb/commit/03bc1641639667f26e219cc36885118448c24158 was added recently; if somebody can revert this & re-try `make distcheck` that will be very helpful. I'll re-read through this tonight too.

          Show
          dch Dave Cottlehuber added a comment - NB passes after repeating a few times on: OSX Lion 10.7.5 incl R16B03 In my tests this often gets stuck around 576 in 07-use-checkpoints.t This is very hand-wavey but https://github.com/apache/couchdb/commit/03bc1641639667f26e219cc36885118448c24158 was added recently; if somebody can revert this & re-try `make distcheck` that will be very helpful. I'll re-read through this tonight too.
          Hide
          dch Dave Cottlehuber added a comment -

          Reproducible by me on:

          • FreeBSD 10.0 amd64 with R16B02
          • OSX 10.9.1 as well, incl R14B04 and R15B03-1
          Show
          dch Dave Cottlehuber added a comment - Reproducible by me on: FreeBSD 10.0 amd64 with R16B02 OSX 10.9.1 as well, incl R14B04 and R15B03-1
          Hide
          benoitc Benoit Chesneau added a comment -

          The changes made in the branch is not enough and probably useless on most installation.

          I confirm the behaviour reported anyway. We are expecting a timeout of 30s from the replicator and eventually wait for it, but it never come: the replicator process is still alive, but no activity is reported. I can only reproduce myself on macosx (latest maverick) but not on my linux machine. not sure if CI reproduces it on another platform (cc Jan Lehnardt) .

          I reproduce the error on macosx using Erlang R16x (homebrew and Erlang Solutions versions). I didn't test with an earlier version.

          Show
          benoitc Benoit Chesneau added a comment - The changes made in the branch is not enough and probably useless on most installation. I confirm the behaviour reported anyway. We are expecting a timeout of 30s from the replicator and eventually wait for it, but it never come: the replicator process is still alive, but no activity is reported. I can only reproduce myself on macosx (latest maverick) but not on my linux machine. not sure if CI reproduces it on another platform (cc Jan Lehnardt ) . I reproduce the error on macosx using Erlang R16x (homebrew and Erlang Solutions versions). I didn't test with an earlier version.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 74e0f08f72f8c002674159a4f765991f101abddc in branch refs/heads/1958-only-fetch-db-per-page from Benoit Chesneau
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=74e0f08 ]

          don't crash if ibrowse is already started

          fix #COUCHDB-1986

          Show
          jira-bot ASF subversion and git services added a comment - Commit 74e0f08f72f8c002674159a4f765991f101abddc in branch refs/heads/1958-only-fetch-db-per-page from Benoit Chesneau [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=74e0f08 ] don't crash if ibrowse is already started fix # COUCHDB-1986
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 74e0f08f72f8c002674159a4f765991f101abddc in branch refs/heads/1744-single-config-file from Benoit Chesneau
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=74e0f08 ]

          don't crash if ibrowse is already started

          fix #COUCHDB-1986

          Show
          jira-bot ASF subversion and git services added a comment - Commit 74e0f08f72f8c002674159a4f765991f101abddc in branch refs/heads/1744-single-config-file from Benoit Chesneau [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=74e0f08 ] don't crash if ibrowse is already started fix # COUCHDB-1986
          Hide
          janl Jan Lehnardt added a comment -

          Some more testing today, bumping the test timeout waiting for replication to finish from 300000 to 600000 makes this pass again.

          Looking at the request log can see that PUTing docs to the remote db takes disproportionally long (1 minute + a few seconds each):

          [Mon, 16 Dec 2013 12:23:27 GMT] [info] [<0.321.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/_local/8831f8738b3333d5d17e00da7
          1581b23 201
          [Mon, 16 Dec 2013 12:24:30 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc5?new_edits=false 201
          [Mon, 16 Dec 2013 12:25:38 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc4?new_edits=false 201
          [Mon, 16 Dec 2013 12:26:51 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc3?new_edits=false 201
          [Mon, 16 Dec 2013 12:27:58 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc2?new_edits=false 201
          [Mon, 16 Dec 2013 12:29:06 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc11?new_edits=false 201
          [Mon, 16 Dec 2013 12:30:13 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc10?new_edits=false 201
          [Mon, 16 Dec 2013 12:31:20 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc1?new_edits=false 201
          [Mon, 16 Dec 2013 12:31:20 GMT] [info] [<0.322.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_ensure_full_commit 201

          After the last one all other requests come in at expected speeds.

          I’m not too familiar with this part of the replicator code, so I couldn’t quickly find why either the requests take that long, or why the replicator waits a minute until it continues.

          Another thing I tried was seeing whether the new R16B scheduling behaviour introduced this, but running the test with

          > ERL_FLAGS="+sws legacy" ./test/etap/run -v src/couch_replicator/test/04-replication-large-atts.t

          and the original timeout still times out.

          Show
          janl Jan Lehnardt added a comment - Some more testing today, bumping the test timeout waiting for replication to finish from 300000 to 600000 makes this pass again. Looking at the request log can see that PUTing docs to the remote db takes disproportionally long (1 minute + a few seconds each): [Mon, 16 Dec 2013 12:23:27 GMT] [info] [<0.321.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/_local/8831f8738b3333d5d17e00da7 1581b23 201 [Mon, 16 Dec 2013 12:24:30 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc5?new_edits=false 201 [Mon, 16 Dec 2013 12:25:38 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc4?new_edits=false 201 [Mon, 16 Dec 2013 12:26:51 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc3?new_edits=false 201 [Mon, 16 Dec 2013 12:27:58 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc2?new_edits=false 201 [Mon, 16 Dec 2013 12:29:06 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc11?new_edits=false 201 [Mon, 16 Dec 2013 12:30:13 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc10?new_edits=false 201 [Mon, 16 Dec 2013 12:31:20 GMT] [info] [<0.322.0>] 127.0.0.1 - - PUT /couch_test_rep_db_b/doc1?new_edits=false 201 [Mon, 16 Dec 2013 12:31:20 GMT] [info] [<0.322.0>] 127.0.0.1 - - POST /couch_test_rep_db_b/_ensure_full_commit 201 After the last one all other requests come in at expected speeds. I’m not too familiar with this part of the replicator code, so I couldn’t quickly find why either the requests take that long, or why the replicator waits a minute until it continues. Another thing I tried was seeing whether the new R16B scheduling behaviour introduced this, but running the test with > ERL_FLAGS="+sws legacy" ./test/etap/run -v src/couch_replicator/test/04-replication-large-atts.t and the original timeout still times out.
          Hide
          janl Jan Lehnardt added a comment -

          More info:

          The test tries to replicate between 4 pairs of databases: local to local, local to remote, remote to local and remote to remote, it runs the same set of tests across all pairs.

          It starts hanging in the local to remote replication after local to local and remote to local have succeeded.

          IIRC this doesn’t happen on R15B or earlier.

          Show
          janl Jan Lehnardt added a comment - More info: The test tries to replicate between 4 pairs of databases: local to local, local to remote, remote to local and remote to remote, it runs the same set of tests across all pairs. It starts hanging in the local to remote replication after local to local and remote to local have succeeded. IIRC this doesn’t happen on R15B or earlier.
          Hide
          janl Jan Lehnardt added a comment -

          I’m not seeing a crash. the test case just hangs there. Even with this patch.

          As additional info. beam.smp just sits there and does nothing until the timeout hits. I’d be happy about any pointers on how to debug this

          Show
          janl Jan Lehnardt added a comment - I’m not seeing a crash. the test case just hangs there. Even with this patch. As additional info. beam.smp just sits there and does nothing until the timeout hits. I’d be happy about any pointers on how to debug this
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 74e0f08f72f8c002674159a4f765991f101abddc in branch refs/heads/master from Benoit Chesneau
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=74e0f08 ]

          don't crash if ibrowse is already started

          fix #COUCHDB-1986

          Show
          jira-bot ASF subversion and git services added a comment - Commit 74e0f08f72f8c002674159a4f765991f101abddc in branch refs/heads/master from Benoit Chesneau [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=74e0f08 ] don't crash if ibrowse is already started fix # COUCHDB-1986
          Hide
          janl Jan Lehnardt added a comment -

          This is on R16B02 and Mac OS X Mavericks.

          Show
          janl Jan Lehnardt added a comment - This is on R16B02 and Mac OS X Mavericks.

            People

            • Assignee:
              dch Dave Cottlehuber
              Reporter:
              janl Jan Lehnardt
            • Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development