CouchDB / COUCHDB-1449

Couchdb returns stopped status before process exits

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3
    • Fix Version/s: 1.2.2, 1.3
    • Component/s: None
    • Labels:
    • Environment: *NIX

      Description

      When restarting CouchDB via the init script, the stop command returns a success status before the process has exited. If a start is issued before the old process ends, CouchDB fails to start but still returns success.

        Activity

        Wendall Cada added a comment -

        Patch to wait until process exits before returning success on shutdown.

        Wendall Cada added a comment -

        Correct patch.

        Sam Bisbee added a comment -

        FYI, this issue was found and resolved in the Debian package a while ago (v0.10.1-2). Here's a link to the patch file: http://patch-tracker.debian.org/patch/series/view/couchdb/0.11.0-2.3/init.patch

        Cheers.

        Wendall Cada added a comment -

        This is a much better patch. Thanks Sam.

        Wendall Cada added a comment -

        The Debian patch would need to be reworked a bit, but I like the approach of checking for the parent heart process and waiting until it exits before returning a success. However, I don't know if it's overkill, or if it even matters.

        Wendall Cada added a comment -

        See: use-sname-rpc-not-kill.patch

        Here is what I figured out while testing. The whole concept of using a PID file and kill -1 $PID with erlang is just not going to work consistently.

        Here is a way to replicate what happens sometimes when issuing a restart (stop/start), and beam hasn't stopped yet.

        For example, try: couchdb -b && couchdb -d && couchdb -b
        Apache CouchDB has started, time to relax.
        Apache CouchDB is not running.
        Apache CouchDB has started, time to relax.
        $ echo `cat /var/run/couchdb/couchdb.pid`
        10229
        $ ps -A | grep beam.smp
        10193 pts/2 00:00:00 beam.smp

        However, adding -sname couchdb to the command options results in the second start failing silently, but couchdb does stop. A stale PID is left in the pid file from the second start command.

        I then modified start_couchdb so that it actually checks whether the process ID returned from the erl command is running, then waits 2 seconds so the pid file can hit the disk. I also modified stop_couchdb to eliminate the use of kill -1 and to wait for the process to actually exit. Now everything works as intended, no matter what bizarre scenario is encountered.
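
The start-side check and the stop-side wait described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patch attached to this issue: the function names pid_is_running and wait_for_exit, the one-second polling interval, and the default timeout are all hypothetical. It relies only on kill -0, which sends no signal and merely reports whether the PID can be signalled.

```shell
#!/bin/sh
# Hypothetical helpers; names and defaults are illustrative, not from the patch.

# pid_is_running PID: succeed if a process with this PID can be signalled.
pid_is_running() {
    # Signal 0 runs the existence/permission check without sending anything.
    # It also fails if the caller lacks permission to signal the process.
    kill -0 "$1" 2>/dev/null
}

# wait_for_exit PID [TIMEOUT]: poll once per second until the process is gone.
# Returns 0 once it has exited, 1 if it is still alive after TIMEOUT seconds.
wait_for_exit() {
    pid=$1
    timeout=${2:-10}
    elapsed=0
    while pid_is_running "$pid"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

A stop routine built on these might request shutdown, then call wait_for_exit "$(cat /var/run/couchdb/couchdb.pid)" 30 and report success only when it returns 0, so that a following start never races the old process.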

        So for just pure stupid, I can do this:
        for i in {1..5} ; do couchdb -d; couchdb -b ; done
        The last command is a start and sure enough, couchdb is running and has restarted completely five times.
        Same stupid in reverse:
        for i in {1..5} ; do couchdb -b; couchdb -d ; done
        CouchDB is stopped.

        Now clearly there is going to be an issue with the use of sname and multiple couchdb instances up and running, but I think it will be worthwhile to fix. Every single resource I read and my own experience with erlang is that using kill to shut down is just waiting for problems.

        I've temporarily appended the pid to start and stop messages for clarity on what's happening.

        Wendall Cada added a comment -

        Patch using sname and rpc halt() instead of kill -1

        Robert Newson added a comment -

        Removing 1.0.4/1.1.2.

        Wendall Cada added a comment -

        Created a patch for this. https://github.com/apache/couchdb/pull/48

        Wendall Cada added a comment -

        Patch submitted in pull request on github

        Dave Cottlehuber added a comment -

        Can I get some eyes on this please? I'm not able to decide if this is good to go or not.

        Wendall Cada added a comment -

        Randall tested and gives it a +1 in this post to the dev list: http://mail-archives.apache.org/mod_mbox/couchdb-dev/201303.mbox/%3CCAAL6JQjuiSQOkjuF6jLoZB_ee3Ki7bouxeF1grW9CQ6dFTms8A%40mail.gmail.com%3E

        Wendall Cada added a comment -

        I'm thinking that we may want to land this as well. https://github.com/apache/couchdb/commit/410f4c980e6f3dbb02f0432280523e19210bb83e

        This would need a solution for windows, maybe taskkill, but I'd defer to Dave on this.

        Dave Cottlehuber added a comment -

        Re the kill-9 branch, I don't want to distribute a 3rd party binary (or a dependency on one) for Windows. @jan what's the problem we are trying to fix here? the noisy logs around _restart ? Or something else that I miss?

        Dave Cottlehuber added a comment -

        We discussed options on IRC. taskkill is part of Windows from XP onwards, this would leave windows 2000 out. I'm not crying over this one. It doesn't require admin privileges i.e. you can kill your own procs. So +1 moving ahead with an amended version of this.

        Ref: http://ss64.com/nt/taskkill.html and http://technet.microsoft.com/en-us/library/bb491009.aspx. Full taskkill.exe should be specified.

        Jan Lehnardt added a comment -

        moving off the blocker list, put on again, if you think we should fix this in 1.3.0

        Jan Lehnardt added a comment -

        the kill -TERM patch should be its own ticket.

        Wendall Cada added a comment -

        My opinion is that this has been a blocker for 2+ years. It makes issuing a restart on a production service a risky operation that can leave the service in a state where it cannot be started without manually killing all of the orphans left behind.

        I agree that the other patch should be a separate ticket. I'll do so right now.

        ASF subversion and git services added a comment -

        Commit 9fcac1c49d51ed6fa6b32bdfceca5c43c727d8bf in branch refs/heads/1.3.x from Wendall Cada
        [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=9fcac1c ]

        Fix for COUCHDB-1449 stopped status returned before couchdb process exits.

        ASF subversion and git services added a comment -

        Commit 15c1a97e42c5bafad6f5fa83d753d3a7cf64d6d7 in branch refs/heads/master from Wendall Cada
        [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=15c1a97 ]

        Fix for COUCHDB-1449 stopped status returned before couchdb process exits.

        Jan Lehnardt added a comment -

        ok, committed.

        Wendall Cada added a comment -

        Ticket for -TERM issue is here for reference: https://issues.apache.org/jira/browse/COUCHDB-1714

        ASF subversion and git services added a comment -

        Commit 4bd0adce93ce0d5c45463d87cb186541d11f11f3 in branch refs/heads/1.2.x from Wendall Cada <wendallc@apache.org>
        [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=4bd0adc ]

        Fix for COUCHDB-1449 stopped status returned before couchdb process exits.

        ASF subversion and git services added a comment -

        Commit b5d18fe2ae456d1fea9fd3e3990ccabf815326c2 in branch refs/heads/1.2.x from Wendall Cada
        [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=b5d18fe ]

        Added comment for COUCHDB-1449, and added sections for 1.2.2 to CHANGES

        ASF subversion and git services added a comment -

        Commit 9f07a4b86b24457b8076d4dcc92524eda6de0a8a in branch refs/heads/1.3.x from Wendall Cada
        [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=9f07a4b ]

        Added comment for COUCHDB-1449 to CHANGES

        ASF subversion and git services added a comment -

        Commit b5d18fe2ae456d1fea9fd3e3990ccabf815326c2 in branch refs/heads/1832-fix-empty-attachment-name from Wendall Cada
        [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=b5d18fe ]

        Added comment for COUCHDB-1449, and added sections for 1.2.2 to CHANGES


          People

          • Assignee: Wendall Cada
          • Reporter: Wendall Cada
          • Votes: 0
          • Watchers: 4