Hadoop Common / HADOOP-13837

Always get "Unable to kill" error message even though the hadoop process was successfully killed


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Component/s: scripts

    Description

      Reproduce steps

      1. Set up a Hadoop cluster
      2. Stop the resource manager: yarn --daemon stop resourcemanager
      3. Stop the node manager: yarn --daemon stop nodemanager
        WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
        ERROR: Unable to kill 20325

      It always prints the "Unable to kill <nm_pid>" error message, which gives the user the impression that something is wrong with the node manager process because it could not be forcibly killed. In fact, the kill command works as expected.

      This is because hadoop-functions.sh does not check process existence properly after the kill. Currently it checks process liveness immediately after issuing the kill command:

      ...
      kill -9 "${pid}" >/dev/null 2>&1
      if ps -p "${pid}" > /dev/null 2>&1; then
            hadoop_error "ERROR: Unable to kill ${pid}"
      ...
      

      When the resource manager is stopped before the node managers, it always takes some additional time for the node manager process to terminate completely. I printed the output of ps -p <nm_pid> in a while loop after kill -9 and observed the following (a sketch of the loop appears after the output):

      16212 ?        00:00:11 java <defunct>
      0
        PID TTY          TIME CMD
      16212 ?        00:00:11 java <defunct>
      0
        PID TTY          TIME CMD
      16212 ?        00:00:11 java <defunct>
      0
        PID TTY          TIME CMD
      1
        PID TTY          TIME CMD
      1
        PID TTY          TIME CMD
      1
        PID TTY          TIME CMD
      ...
      

      In the first three iterations of the loop, the process had not yet terminated, so the exit code of ps -p was still 0; only in later iterations, once the process was gone, did it become 1.
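
      For reference, a minimal sketch of the kind of loop used for this observation is shown below. It is illustrative only and not necessarily identical to the attached check_proc.sh; the pid value is just an example.

      pid=16212
      kill -9 "${pid}" >/dev/null 2>&1
      # Poll a few times: ps -p prints the header (plus a row while the process
      # still exists) and returns 0 while the pid is visible, 1 once it is gone.
      for i in 1 2 3 4 5 6; do
        ps -p "${pid}"
        echo $?
        sleep 1
      done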

      Proposal of a fix

      My first thought was to add a more comprehensive pid check that keeps polling pid liveness until HADOOP_STOP_TIMEOUT is reached, but that seems to add too much complexity. The second option is to simply add a sleep 3 after kill -9, which should fix the error in most cases with relatively small changes to the script.
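
      For illustration, here is a rough sketch of both options against the existing check in hadoop-functions.sh. The variable names follow the snippet above; the exact logic in the attached patches may differ.

      # Option 1 (sketch): poll for process exit until HADOOP_STOP_TIMEOUT is reached.
      kill -9 "${pid}" >/dev/null 2>&1
      waited=0
      while ps -p "${pid}" > /dev/null 2>&1 && [[ ${waited} -lt ${HADOOP_STOP_TIMEOUT} ]]; do
        sleep 1
        waited=$((waited + 1))
      done
      if ps -p "${pid}" > /dev/null 2>&1; then
            hadoop_error "ERROR: Unable to kill ${pid}"
      fi

      # Option 2 (sketch): simply wait a few seconds before the single re-check.
      kill -9 "${pid}" >/dev/null 2>&1
      sleep 3
      if ps -p "${pid}" > /dev/null 2>&1; then
            hadoop_error "ERROR: Unable to kill ${pid}"
      fi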

      Attachments

        1. HADOOP-13837.05.patch
          1 kB
          Weiwei Yang
        2. HADOOP-13837.04.patch
          1 kB
          Weiwei Yang
        3. HADOOP-13837.03.patch
          1 kB
          Weiwei Yang
        4. HADOOP-13837.02.patch
          2 kB
          Weiwei Yang
        5. HADOOP-13837.01.patch
          2 kB
          Weiwei Yang
        6. check_proc.sh
          0.2 kB
          Weiwei Yang

    Issue Links

    Activity

    People

      Assignee: cheersyang Weiwei Yang
      Reporter: cheersyang Weiwei Yang
      Votes: 0
      Watchers: 4

    Dates

      Created:
      Updated:
      Resolved: