[DRILL-6010] Working drillbit showing as in QUIESCENT state - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.14.0
Component/s: None
Labels: None

Description

After DRILL-4286, I once hit a situation where, after running all functional tests, three drillbits were in the ONLINE state and one was in QUIESCENT. However, I could still run queries from the drillbit in the QUIESCENT state, so it was effectively online. drillbit.sh stop could not shut it down, and I had to kill -9 the process (see online_vs_quiescent.JPG).

Attachments

online_vs_quiescent.JPG
05/Dec/17 13:08
75 kB
Arina Ielchiieva

Issue Links

is related to

DRILL-6023 Graceful shutdown improvements (umbrella jira)

Open

Activity

Venkata Jyothsna Donapati added a comment - 06/Dec/17 16:15

Arina, can you please give more details, such as what caused the shutdown of that drillbit? What grace period is set (there is a ZooKeeper delay in updating the other drillbits about its state)? drillbit.sh stop typically waits for 120 seconds before forcefully shutting down. Did you see it running even after 120 seconds?

Arina Ielchiieva added a comment - 06/Dec/17 16:36 - edited

I saw this on a test cluster that runs functional and advanced tests. The cluster has 4 nodes. Nodes are restarted using the drillbit.sh script so that tests run on a clean environment. All nodes are restarted at the same time using clush (example: clush -a /drill/bin/drillbit.sh restart). I ran the tests on master, so the grace period was set to 0 (the current default).

> Did you see it running even after 120 seconds?

Yes, it did.
So there are two possibilities:
1. The drillbit was restarted (stop + start) and somehow its status was not updated. The status remained QUIESCENT even though the drillbit was running and was supposed to be ONLINE. Maybe ZooKeeper missed the status update.
2. The drillbit really was in QUIESCENT mode and the rest of the shutdown somehow just hung.

Venkata Jyothsna Donapati added a comment - 22/Feb/18 21:22

I'm trying to understand the scenario but drillbit.sh stop basically does "kill -9" after waiting for 120 seconds. Is the page stale?

Arina Ielchiieva added a comment - 23/Feb/18 13:23

I don't think the page was stale. Anyway, you might try issuing a graceful shutdown simultaneously from different nodes in a cluster with more than 2 nodes and see how it goes. It looks like it could be a concurrency issue or a lost update. If you are not able to reproduce the issue, you can close the ticket with 'Cannot reproduce' status. If users encounter this issue later on, they will report it.

Arina Ielchiieva added a comment - 06/Mar/18 18:49

It turns out that the restart command does not forcefully stop the drillbit when the stop time exceeds the timeout (DRILL-6213). That might be the root cause.
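
For illustration only, the stop behavior discussed in this thread (wait for a timeout, then force-kill) can be sketched as below. This is not the drillbit.sh code: it is a minimal Python sketch, and the names `stop_with_timeout` and `spawn_stubborn_child` are hypothetical. The stubborn child is a stand-in for a process whose shutdown hangs; the 120-second wait mentioned above would be passed as `timeout_s`.

```python
import subprocess
import sys


def stop_with_timeout(proc: subprocess.Popen, timeout_s: float) -> str:
    """Send SIGTERM, wait up to timeout_s, then fall back to SIGKILL.

    Returns "graceful" if the process exited within the timeout,
    "forced" if it had to be killed.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout_s)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, the equivalent of `kill -9`
        proc.wait()  # reap the process so it does not linger as a zombie
        return "forced"


def spawn_stubborn_child() -> subprocess.Popen:
    """Start a stand-in process that ignores SIGTERM (a hypothetical hung shutdown)."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has installed its handler
    return proc
```

A restart path that runs only the SIGTERM step and then starts a new process would leave the old one alive whenever its shutdown hangs, which matches the symptom reported here.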

Venkata Jyothsna Donapati added a comment - 13/Mar/18 21:29 - edited

Yes, it looks like that's the issue. Closing the issue.
