[YARN-68] NodeManager will refuse to shutdown indefinitely due to container log aggregation - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.23.3
Fix Version/s: 2.0.2-alpha, 0.23.3
Component/s: nodemanager
Labels: None
Environment: QE

Description

The NodeManager can get into a state where containermanager.logaggregation.AppLogAggregatorImpl waits, apparently indefinitely,
for log aggregation to complete for an application, even if that application has terminated abnormally and is no longer present.

The observed behavior is that an attempt to stop the NodeManager daemon returns but has no effect. The NM log continually displays messages similar to this:

[Thread-1]2012-08-21 17:44:07,581 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
Waiting for aggregation to complete for application_1345221477405_2733

The only recovery we found to work was to 'kill -9' the nm process.

What exactly causes the NM to enter this state is unclear, but we see this behavior reliably when the NM has run a task that failed. For example, when debugging Oozie distcp actions in which a distcp map task fails, the NM that was running the container enters this state, and a shutdown of that NM never completes. 'Never' in this case means we waited 2 hours before killing the NodeManager process.
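
For illustration only, the following is a minimal Java sketch of the general pattern that avoids this kind of hang: a worker that waits for an application to finish also checks an abort flag, and the shutdown path sets that flag and joins with a bounded timeout. It is not the attached YARN-68 patch and does not use any Hadoop API. The names AggregatorShutdownSketch, AppAggregator, finishApplication, abort and stop are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AggregatorShutdownSketch {

    /** Hypothetical stand-in for a per-application log aggregator (not Hadoop code). */
    static class AppAggregator implements Runnable {
        private final AtomicBoolean appFinished = new AtomicBoolean(false);
        private final AtomicBoolean aborted = new AtomicBoolean(false);

        @Override
        public void run() {
            synchronized (this) {
                // Wait for the application to finish, but also leave the loop
                // when shutdown has requested an abort.
                while (!appFinished.get() && !aborted.get()) {
                    try {
                        wait(1000);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
            // A real aggregator would upload the collected logs here.
        }

        /** Normal path: the application finished, so aggregation can complete. */
        synchronized void finishApplication() {
            appFinished.set(true);
            notifyAll();
        }

        /** Shutdown path: stop waiting even though the application never reported completion. */
        synchronized void abort() {
            aborted.set(true);
            notifyAll();
        }
    }

    /**
     * Requests an abort and waits at most timeoutMillis (must be greater than 0)
     * for the worker to exit, interrupting it as a last resort.
     * Returns true if the worker thread has terminated.
     */
    static boolean stop(AppAggregator aggregator, Thread worker, long timeoutMillis)
            throws InterruptedException {
        aggregator.abort();
        worker.join(timeoutMillis);
        if (worker.isAlive()) {
            worker.interrupt();
            worker.join(timeoutMillis);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        AppAggregator aggregator = new AppAggregator();
        Thread worker = new Thread(aggregator, "app-log-aggregator");
        worker.start();
        // The application never reports completion, as in the failed-task case above.
        boolean stopped = stop(aggregator, worker, 2000);
        System.out.println("aggregator stopped: " + stopped);
    }
}
```

The point of the sketch is that the wait loop has an exit condition that the shutdown path controls, so stopping the daemon does not depend on the application ever reporting completion.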

Attachments

YARN-68.patch
31/Aug/12 19:12
9 kB
Daryn Sharp
YARN-68-1.patch
31/Aug/12 21:04
10 kB
Daryn Sharp

People

Assignee: Daryn Sharp

Reporter: patrick white

Votes: 0

Watchers: 6

Dates

Created: 31/Aug/12 18:19

Updated: 11/Oct/12 17:48

Resolved: 05/Sep/12 19:46
