  1. Apache Drill
  2. DRILL-6010

Working drillbit showing as in QUIESCENT state

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14.0
    • Component/s: None
    • Labels: None

    Description

      After DRILL-4286, I once ran into a situation where, after running all functional tests, three drillbits were in the ONLINE state and one other was in QUIESCENT. However, I could still run queries from the drillbit reported as QUIESCENT, so it was effectively online. drillbit.sh stop could not shut it down, and I had to kill -9 the process (see online_vs_quiescent.JPG).

      Attachments

        1. online_vs_quiescent.JPG
          75 kB
          Arina Ielchiieva

        Issue Links

          Activity

            vdonapati Venkata Jyothsna Donapati added a comment

            arina Can you please give more details, such as what caused that Drillbit to shut down? What grace period is set (there is a ZooKeeper delay in updating the other drillbits about its state)? drillbit.sh stop typically waits for 120 sec before forcefully shutting down. Did you see it still running after 120 sec?
            arina Arina Ielchiieva added a comment - edited

            I saw this on a test cluster that runs functional / advanced tests. The cluster has 4 nodes. Nodes are restarted using the drillbit.sh script so that tests run on a clean environment. All nodes are restarted at the same time using clush (example: clush -a /drill/bin/drillbit.sh restart). I ran the tests on master, so the grace period was set to 0 (the current default).

            Did you see it running even after 120 sec?

            Yes, it did.
            So there are two possibilities:
            1. The Drillbit was restarted (stop + start) and somehow its status was not updated. The status remained QUIESCENT, though the drillbit was running and its mode was supposed to be ONLINE. Maybe ZooKeeper missed the status update.
            2. The Drillbit really was in QUIESCENT mode and the rest of the shutdown somehow hung.


            vdonapati Venkata Jyothsna Donapati added a comment

            I'm trying to understand the scenario, but drillbit.sh stop basically does "kill -9" after waiting for 120 seconds. Is the page stale?

            arina Arina Ielchiieva added a comment

            I don't think the page was stale. Anyway, you might try issuing a graceful shutdown in a cluster with more than 2 nodes, simultaneously from different nodes, and see how it goes. It looks like it could be a concurrency issue or a lost update. If you can't reproduce the issue, you can close the ticket with 'Cannot reproduce' status. If users encounter this issue later on, they will report it.

            arina Arina Ielchiieva added a comment

            It turns out the restart command does not stop the drillbit forcefully if the stop time exceeds the timeout (DRILL-6213). That might be the root cause.
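
The difference discussed in this thread, between a stop that force-kills after a timeout and one that does not, can be illustrated with a small Python sketch. This is not the drillbit.sh logic itself: the function names `stop_with_timeout` and `spawn_stubborn_process` are hypothetical, the child process merely stands in for a drillbit whose shutdown hangs, and the sketch assumes a POSIX system. The thread mentions a 120-second wait; a much shorter timeout is used here so the example finishes quickly.

```python
import subprocess
import sys
import time


def stop_with_timeout(proc, timeout_sec, force=True, poll_sec=0.05):
    """Ask a process to stop, wait up to timeout_sec, then optionally kill it.

    Returns "graceful" if the process exited on its own, "forced" if it had
    to be killed, and "still_running" if it survived and force is False.
    """
    proc.terminate()  # SIGTERM: request a normal shutdown
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return "graceful"
        time.sleep(poll_sec)
    if proc.poll() is not None:
        return "graceful"
    if not force:
        # Without a forced kill, a hung process simply keeps running.
        return "still_running"
    proc.kill()  # SIGKILL: the equivalent of kill -9
    proc.wait()
    return "forced"


def spawn_stubborn_process():
    """Start a child that ignores SIGTERM (a stand-in for a hung shutdown)."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has installed its handler
    proc.stdout.close()
    return proc


if __name__ == "__main__":
    child = spawn_stubborn_process()
    print(stop_with_timeout(child, 0.3, force=False))
    print(stop_with_timeout(child, 0.3, force=True))
```

With `force=False` the stubborn child survives the stop attempt, which mirrors the symptom reported here: a process that is supposed to be shut down but is still alive and answering requests.
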
            vdonapati Venkata Jyothsna Donapati added a comment - edited

            Yes, it looks like that's the issue. Closing the issue.

            People

              Assignee: vdonapati Venkata Jyothsna Donapati
              Reporter: arina Arina Ielchiieva
              Votes: 0
              Watchers: 2
