[MAPREDUCE-4442] Accessing hadoop counters from a job is unreliable in yarn during AM process cleanup window - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.0-alpha
Fix Version/s: None
Component/s: None
Labels:
- usability

Description

We found this issue while testing our migration from MapReduceV1 to MapReduceV2. A few of our applications access job counters at several points:

a) After the job is submitted, while it is executing (works fine)

b) Right after the job-complete notification is received (works fine)

c) A few seconds after the job-complete notification (fails most of the time)

The error snippet is as follows:

2012-07-12 19:12:29,039 WARN  [Client] Unexpected error reading responses on connection Thread[IPC Client (1252749669) connection to sjc1-ciq-ibm-grid07.carrieriq.com/10.202.50.187:47944 from hadoop,5,main]
java.lang.NullPointerException
	at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852)
	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781)
2012-07-12 19:12:29,044 INFO  [ClientServiceDelegate] Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2012-07-12 19:12:29,132 INFO  [ClientServiceDelegate] Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2012-07-12 19:12:29,216 ERROR [UserGroupInformation] PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException
2012-07-12 19:12:29,216 WARN  [BaseOutputStageJob] getJobCounters: Unable to retrieve counters. null
java.io.IOException
	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315)
	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:335)
	at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:470)
	at org.apache.hadoop.mapreduce.Job$8.run(Job.java:719)
	at org.apache.hadoop.mapreduce.Job$8.run(Job.java:716)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
	at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:716)
	at org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:396)

The connection to 10.202.50.187:47944 is actually the connection to the AM. It appears that the client is still contacting the AM to get the counters for the successful job, rather than the history server.

I'll attach the AM and ResourceManager logs separately; however, no unusual activity is seen in them.

This makes me suspect a race condition in the code path that accesses job counters while the AM is shutting down and the job has not yet moved to the history server.
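
As a client-side stopgap, the counter lookup could be retried for a short period after job completion. The following is only a sketch under the assumption that the failure is transient and that a later attempt is served successfully; it is not part of Hadoop. `CounterRetry`, `IoSupplier`, and `fetchWithRetry` are hypothetical names, and the attempt count and delay are arbitrary. It treats both an `IOException` and a `null` result as "not available yet".

```java
import java.io.IOException;

// Hypothetical helper, not a Hadoop class: retries a lookup that may fail
// transiently, e.g. while the AM is shutting down.
final class CounterRetry {

    // Hypothetical functional interface for a lookup that may throw IOException.
    @FunctionalInterface
    interface IoSupplier<T> {
        T get() throws IOException;
    }

    private CounterRetry() {}

    /**
     * Calls fetch up to maxAttempts times, sleeping delayMillis between
     * attempts. Returns the first non-null result; otherwise rethrows the
     * last IOException (a null result is converted into an IOException).
     */
    static <T> T fetchWithRetry(IoSupplier<T> fetch, int maxAttempts, long delayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = fetch.get();
                if (result != null) {
                    return result;
                }
                last = new IOException("lookup returned null on attempt " + attempt);
            } catch (IOException e) {
                last = e; // assumed transient; try again
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis);
            }
        }
        throw last;
    }
}
```

With a `Job` instance named `job` (assumed here), the call might look like `CounterRetry.fetchWithRetry(() -> job.getCounters(), 5, 2000L)`. This only masks the symptom on the client; it does not address the suspected race itself.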

Attachments

- am_logs_counter_failure.html (13/Jul/12 20:35, 2.39 MB, Rahul Jain)
- rsrc_mgr_logs_counter_failed.txt (13/Jul/12 20:37, 4 kB, Rahul Jain)

Issue Links

is related to

MAPREDUCE-3755 Add the equivalent of JobStatus to end of JobHistory file

Resolved

People

Assignee: Unassigned

Reporter: Rahul Jain

Votes: 1

Watchers: 10

Dates

Created: 13/Jul/12 20:22

Updated: 02/May/13 02:30