CouchDB / COUCHDB-1424

make check hangs when compiling with R15B

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3
    • Fix Version/s: 1.4.0
    • Component/s: Test Suite
    • Labels:
      None

      Description

      make check hangs when running under R15B. For me it is 160-vhosts.t where execution stops, but if I recall correctly others have reported other tests. The crux here is that running the tests individually succeeds.

        Activity

        Dave Cottlehuber added a comment -

        +1 too, applying Filipe's modified test passes. Bob's patch alone was not sufficient for me though.

        Jan Lehnardt added a comment -

        I'm happy to report that with Filipe's patch and the timer:sleep(100) in src/etap/etap.erl the test suite runs successfully on my Mac Mini on R15B. I've run it four times now and couldn't get it to hiccup. I'm +1 on getting this into 1.2.x to unblock the 1.2.0 release and then investigating the shutdown procedure in a new ticket.

        Wendall Cada added a comment -

        You are 100% correct Filipe, sorry I missed that in the output. The package required for CentOS/RHEL/Fedora is erlang-os_mon.

        All tests pass for me with this patch!

        Filipe Manana added a comment -

        Wendall, like on debian/ubuntu, you probably need to install a missing erlang package. On debian it's named 'erlang-os-mon'.

        Wendall Cada added a comment - - edited

        Test after applying 0001-Fix-very-slow-test-test-etap-220-compaction-daemon.t.patch

        ./test/etap/run ./test/etap/220*.t > 220-compaction-daemon.t.out

        Friendpaste for quick reference: http://friendpaste.com/3OkZoUMvXxQaGCRIFs55M0

        Bob Dionne added a comment -

        yes, but we do use them a lot in etaps to co-ordinate events.

        I did try debugging a bit with distel, but didn't try debugging the VM. To produce the hangs you need to run "make check" which runs all escripts and they hang in different places. You might have better luck

        Filipe Manana added a comment -

        As for the R15B related issues, I agree the sleeps are a poor choice.

        Why not use gdb to see where the Erlang VM is hanging? Just build your own Erlang with the -g flag to add debug info and then use gdb -p <beam.smp pid>. Seems more reliable than adding io:format calls everywhere.
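The gdb suggestion above can be scripted so that a hung VM is inspected without an interactive session. This is a minimal sketch, not something taken from the ticket: `gdb_backtrace_cmd` is a hypothetical helper name, and it only prints the command so it can be reviewed before being run. The gdb options used (`-p` to attach, `-batch` to exit when done, `-ex` to run one command) are standard; useful symbol names still assume an Erlang built with debug info, as the comment says.

```shell
#!/bin/sh
# Print the gdb command that dumps a backtrace of every thread in a process.
#   -p PID   attach to the running process
#   -batch   exit once the commands have run
#   -ex CMD  execute one gdb command
gdb_backtrace_cmd() {
    printf "gdb -p %s -batch -ex 'thread apply all bt'\n" "$1"
}

# Hypothetical usage (assumes pgrep is available and a beam.smp is running):
#   pid=$(pgrep -n beam.smp) && gdb_backtrace_cmd "$pid" | sh
```

Printing the command instead of executing it keeps the helper harmless on machines without gdb; piping the output to `sh` runs it.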

        Filipe Manana added a comment -

        The slowness of the compaction daemon test is due to the recent change in the 1.2.x branch to make the view updater collect larger batches of map values.
        The attached patch describes the issue and makes the test much faster.
        Will commit this shortly.

        Bob Dionne added a comment -

        so it seems the 220-compaction-daemon.t test has some additional issues, as it's the only one not fixed with the simple addition of a sleep in etap:end_tests()

        Wendall Cada added a comment -

        This is also an issue with Fedora 16. R14B04. I can agree with Dan's comment here that reducing the fragmentation percentage results in a working test.

        Not sure if this is related. But more than once, I had random failures on some of the earlier tests. Upon re-running them, they succeed just fine.

        Running the tests separately does not keep this test (220-compaction-daemon.t) from running for a ridiculously long time. On the CentOS box, I let this run for approximately 3 hours with the unpatched test. I ended up stopping the test. The resulting database was at 702M when I stopped it.
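A run like this can be bounded with a time limit so that a stuck or very slow test is reported instead of tying up the machine for hours. This is a minimal sketch, assuming GNU coreutils `timeout` is installed (it is not part of a stock OS X install); `run_with_limit` is a hypothetical helper, not part of the CouchDB tree.

```shell
#!/bin/sh
# run_with_limit SECONDS COMMAND [ARGS...]
# Runs COMMAND under GNU timeout and prints one word describing the outcome.
# GNU timeout exits with status 124 when the time limit was reached.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HANG"
    elif [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "FAIL($status)"
    fi
}
```

For example, `run_with_limit 600 ./test/etap/run ./test/etap/220*.t` would give the compaction daemon test ten minutes before reporting it as hung. The limit is arbitrary and cannot tell a true hang from a test that is merely slow.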

        Dan Everton added a comment - - edited

        Actually it may not be hanging, just taking a really, really long time. I changed the test to only run up to 10% fragmentation rather than 70% and it completed and passed in a reasonable time. Maybe the compaction test is just too thorough and takes too long?

        I'd include a patch but it's a bit long for a comment. It's essentially s/70/10/g in the 220-compaction-daemon.t file.
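The substitution described above can be tried without touching the original file. This is a minimal sketch: `lower_fragmentation` is a hypothetical helper name, and the blanket `s/70/10/g` rewrites every occurrence of "70" in the file, not only the fragmentation thresholds, which is presumably why the comment says "essentially".

```shell
#!/bin/sh
# Print a copy of the given file with every "70" replaced by "10".
# The input file is left unchanged; redirect the output to keep the result.
lower_fragmentation() {
    sed 's/70/10/g' "$1"
}

# Hypothetical usage from the top of a CouchDB checkout:
#   lower_fragmentation test/etap/220-compaction-daemon.t > /tmp/220-fast.t
#   diff test/etap/220-compaction-daemon.t /tmp/220-fast.t
```

Reviewing the `diff` output before running the modified copy shows whether any unrelated "70" was changed.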

        Wendall Cada added a comment -

        I can echo Dan's results here. Same OS, CentOS 6.2 running Erlang R14B04, and similar hardware: E5520 @ 2.27GHz with 3GB RAM, 15K SCSI/Raid 10

        Dan Everton added a comment - - edited

        Well it's not exactly fast hardware. It's a quad-core Xeon E5405 @ 2.00GHz with 8GB of RAM and a single 7200 RPM HDD running on EXT4.

        I changed the run script to do prove -v and here's the output for the stuck test:

        /builds/couchdb/test/etap/220-compaction-daemon.t ........
        /builds/couchdb/test/etap/220-compaction-daemon.t:165: Warning: variable 'Docs' is unused

        # Current time local 2012-03-07 08:56:40
        # Using etap version "0.3.4"
        1..10
        Apache CouchDB 0.0.0 (LogLevel=info) is starting.
        Apache CouchDB has started. Time to relax.
        [info] [<0.2.0>] Apache CouchDB has started on http://127.0.0.1:43534/
        [info] [<0.80.0>] 127.0.0.1 - - GET /couch_test_compaction_daemon/_design/foo/_view/foo 200
        [info] [<0.81.0>] 127.0.0.1 - - GET /couch_test_compaction_daemon/_design/foo/_view/foo 200
        [info] [<0.82.0>] 127.0.0.1 - - GET /couch_test_compaction_daemon/_design/foo/_view/foo 200
        [info] [<0.83.0>] 127.0.0.1 - - GET /couch_test_compaction_daemon/_design/foo/_view/foo 200

        Repeat the last line to infinity.
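Output like this may help separate a test that is looping from one that is deadlocked: if the number of logged view requests keeps growing, the test process is still doing work. This is a minimal sketch that counts those requests in a captured output file; `count_view_requests` is a hypothetical helper, and the pattern is copied from the log lines above.

```shell
#!/bin/sh
# Count the view requests logged for the compaction daemon test database.
# $1 is a file holding the captured test output (for example from prove -v).
count_view_requests() {
    grep -c 'GET /couch_test_compaction_daemon/_design/foo/_view/foo 200' "$1"
}
```

Calling it twice, a minute apart, on the same growing output file gives a rough indication: an increasing count suggests a slow or endless loop, while a constant count suggests the process has stopped making requests.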

        Jan Lehnardt added a comment -

        Good news Dan! Can you share which hardware this runs on? My running theory is still that faster CPUs will make this show up more

        Dan Everton added a comment -

        I'm seeing this issue on a CentOS 6.2 running Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [kernel-poll:false]. It hangs in 220-compaction-daemon.t. So I don't think it's an R15B or OS X issue.

        Bob Dionne added a comment -

        It's not easy, breakpoints aren't very useful because it's fairly random as to where it hangs, so I've just used io:format

        I think the 075-auth example shows that it's not in escript, though you can just put a copy of that on your source path and add some io:formats to those functions I commented on.

        When it hangs there also doesn't seem to be any VM you can remsh to and poke around, which is why I was wondering about Prove. There are also cases where you can clearly see the escript function looping a few times to check that the process is not running but those cases don't hang.

        If I had to bet I'd put my money on something we're doing when shutting down and/or deleting dbs and R15B is exposing it, or it is an OS X issue.

        Jan Lehnardt added a comment -

        Bob, can I ask how you debug this? Other than prove -v, are you adding any printf() or looking at log files, or the running system? I'd like to help with the chase. Also, if it helps, I could give you remote access to the Mini.

        Jan Lehnardt added a comment -

        I got around to trying the timer:sleep(100); patch on my MacBook Air. The test suite passes all the time. The differences between the MBA and the Mini are: i7 1.8GHz (MBA) vs. i7 2.7GHz (Mini), 4GB RAM (MBA) vs. 16GB RAM (Mini).

        This feels like the timer:sleep() just shadows the underlying issue on the slower CPU, but the faster CPU still produces it. I doubt the amount of RAM makes a difference.

        Bob Dionne added a comment -

        I found one case 075-auth-cache.t where it hangs before etap:end_tests() is called. However it did complete the tests and was somewhere in the cleanup phase, deleting dbs and calling couch_server_sup:stop().

        I'm wondering if there's something in couchdb we're not doing quite right that is finally catching up to us with R15B

        The addition of the small sleep works perfectly for me on the MBA and I believe Noah also reports success. So far we know it doesn't work on a Mini.

        Bob Dionne added a comment - - edited

        I'm not able to consistently see it hang on 220-compaction-daemon.t unless I disable the sleep, in which case it hangs at arbitrary places as we've seen.

        I looked at escript.erl a bit, comparing R15B to R14, and there have been no changes. If you look at my_halt(Reason) [1] you can see where it could get into an infinite loop, and it's not clear to me what status code the script returns when an error is thrown. Anyway in all our cases I'm seeing it return 0 as expected so the next culprit might be Prove.

        I notice also on this MBA that if I up the receive timeout in my_halt to 100ms or so, everything runs fine with no hangs. I'm not sure why they use `receive after 1 -> ok end,` and `erlang:yield()` in the same function, given what the documentation [2] says about the difference between the two.

        Some more eyeballs might be helpful. I'll poke at this some more tomorrow if I have time. My religion forbids me to touch Perl on Sunday.

        [1] https://gist.github.com/1974569
        [2] http://www.erlang.org/doc/man/erlang.html#yield-0

        Bob Dionne added a comment -

        yes, sorry I didn't push a diff

        I think the real cause is the interaction between prove and escript, but couch_util:shutdown_sync seems to surface it. 220-compaction-daemon.t is one of the ones that will consistently hang. It appears that whenever you see a hang, sysouts will also reveal a call to couch_util:shutdown_sync waiting on a DOWN message. I also tried using "kill" instead of "shutdown" with no luck

        Jan Lehnardt added a comment -

        Is the diff below the correct one?

        If yes, I can't confirm that it resolves the hangs for me. I tried 200, 300 and 1000 as values for sleep() and it consistently hangs at 220-compaction-daemon.t

        diff --git a/src/etap/etap.erl b/src/etap/etap.erl
        index 5ad5dba..ba83385 100644
        --- a/src/etap/etap.erl
        +++ b/src/etap/etap.erl
        @@ -77,6 +77,7 @@ plan(N) when is_integer(N), N > 0 ->
         %% @doc End the current test plan and output test results.
         %% @todo This should probably be done in the test_server process.
         end_tests() ->
        +    timer:sleep(100),
             ensure_coverage_ends(),
             etap_server ! {self(), state},
             State = receive X -> X end,

        Bob Dionne added a comment -

        I've tested with prove -v and in all cases where it hangs it appears a couch_server_sup:stop() or db delete call is involved where couch_util:shutdown_sync is still waiting on a DOWN message. If etap:end_tests() returns and the main in the escript returns before those shutdowns are complete it hangs.

        A simple sleep of as little as 100ms at the beginning of etap:end_tests() seems sufficient to stop the hanging. I'd be curious if others could confirm that. I'm using a MBA with Lion, the standardnew machine. I notice the space key is starting to get lazy

        Dave Cottlehuber added a comment -

        FYI. I've not had these hangs on mint linux debian either, over 5 runs.

        Jan Lehnardt added a comment -

        I did some more tests, running make check in R15B on Snow Leopard and it didn't hang. Since this ran in a VM as opposed to real hardware on my previous reports, I re-ran this on a Lion VM with R15B and got the hang.

        Unless there's reports to the contrary, this would narrow the problem to R15B/Lion.

        Robert Newson added a comment -

        I ran make check on Ubuntu 11.10, 5 times in a row. It never hung, though the auth cache test failed on runs 3-5 inclusive.
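Repeated runs like these are easy to script, which makes intermittent failures simpler to count and compare across machines. This is a minimal sketch; `repeat_runs` is a hypothetical helper, not something from the CouchDB build.

```shell
#!/bin/sh
# repeat_runs N COMMAND [ARGS...]
# Runs COMMAND N times, discarding its output, and prints a pass/fail tally
# based on the command's exit status.
repeat_runs() {
    n=$1
    shift
    pass=0
    fail=0
    i=0
    while [ "$i" -lt "$n" ]; do
        if "$@" >/dev/null 2>&1; then
            pass=$((pass + 1))
        else
            fail=$((fail + 1))
        fi
        i=$((i + 1))
    done
    echo "$pass passed, $fail failed"
}
```

For example, `repeat_runs 5 make check` would reproduce the five-run experiment. A run that hangs blocks the loop indefinitely, so the tally only covers runs that actually exit.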

        Jan Lehnardt added a comment -

        I ran the test suite on a FreeBSD 9.0 VM five times and never got it to hang, unlike on Mac OS X (10.7.3, real hardware, SSD, lots of RAM), where I can get it to hang most of the time.

        Robert Newson added a comment -

        hangs for me too, the last thing printed is the 075-auth-cache.t line.

        Jan Lehnardt added a comment -

        Not saying your result is not valid or anything, but I can't reproduce that hang solo. Fwiw, today, make test hangs at 076

        Brian Mitchell added a comment -

        Actually, 076-file-compression.t seems to hang for me when run alone and in sequence during make check for R15B on OS X.


          People

          • Assignee:
            Unassigned
            Reporter:
            Jan Lehnardt