[MAPREDUCE-4443] MR AM and job history server should be resilient to jobs that exceed counter limits - ASF JIRA

Details

Type: Bug
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.0-alpha
Fix Version/s: None
Component/s: None
Labels:
- BB2015-05-TBR
- usability

Description

We saw this problem while migrating applications to MapReduce v2:

Our applications use Hadoop counters extensively (1000+ counters for certain jobs). While this may not be a recommended best practice in Hadoop, the real issue here is the reliability of the framework when applications exceed counter limits.

The Hadoop servers (YARN, job history server) were originally brought up with mapreduce.job.counters.max=1000 in core-site.xml.

We then ran a MapReduce job from an application that uses its own job-specific overrides, including mapreduce.job.counters.max=10000.

All tasks for the job finished successfully; however, the overall job still failed because the AM hit the following exception:

2012-07-12 17:31:43,485 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 71
2012-07-12 17:31:43,502 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 1001 max=1000
        at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:58)
        at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:65)
        at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:77)
        at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:94)
        at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:105)
        at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:202)
        at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:337)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1212)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1198)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1179)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:711)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.checkJobCompleteSuccess(JobImpl.java:737)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.checkJobForCompletion(JobImpl.java:1360)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1340)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1323)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:380)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:666)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:113)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:890)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:886)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:125)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
        at java.lang.Thread.run(Thread.java:662)
2012-07-12 17:31:43,502 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
2012-07-12 17:31:43,503 INFO [Thread-1] org.apache.had

The overall job failed, and its history was not accessible after the job ended either (it did not show up in the job history server).
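To make the failure mode concrete, here is a minimal, self-contained sketch of what the trace suggests: the process that merges per-task counters enforces its own limit (1000) even though the tasks were allowed to create more. Everything in it (`CounterLimitSketch`, `aggregate`, `makeCounters`, and the local `LimitExceededException`) is a hypothetical stand-in written for illustration, not Hadoop's actual `Limits` or `AbstractCounterGroup` code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CounterLimitSketch {

    /** Hypothetical stand-in for the exception type seen in the trace. */
    static class LimitExceededException extends RuntimeException {
        LimitExceededException(String message) {
            super(message);
        }
    }

    /**
     * Merges task counters into one map while enforcing the limit of the
     * process doing the merge (assumption: this mirrors what the AM does
     * when it builds the final job counters).
     */
    static Map<String, Long> aggregate(Map<String, Long> taskCounters, int processLimit) {
        Map<String, Long> total = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : taskCounters.entrySet()) {
            boolean isNew = !total.containsKey(entry.getKey());
            if (isNew && total.size() + 1 > processLimit) {
                throw new LimitExceededException(
                    "Too many counters: " + (total.size() + 1) + " max=" + processLimit);
            }
            total.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
        return total;
    }

    /** Builds n distinct counters, as tasks running under a higher limit could. */
    static Map<String, Long> makeCounters(int n) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            counters.put("counter-" + i, 1L);
        }
        return counters;
    }

    public static void main(String[] args) {
        // Tasks ran with a job-level override and produced 1001 counters.
        Map<String, Long> fromTasks = makeCounters(1001);
        try {
            // The aggregating process still uses the limit it started with.
            aggregate(fromTasks, 1000);
        } catch (LimitExceededException e) {
            System.out.println(e.getMessage()); // Too many counters: 1001 max=1000
        }
    }
}
```

In this sketch the exception escapes `aggregate` after every task has already succeeded, which is the same shape as the report: the work is done, but the final bookkeeping step fails the job.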

We were able to work around the issue by raising the limits in core-site.xml and restarting the YARN servers. However, that forced us to raise the global counter limit to the highest value any individual application might use, which is hard to predict.

The original job then succeeded with the new global limits.

However, since we had not restarted the job history server, it was unable to display the job history page for the successful job at all, because it still hit the counter-limit exception. Only after we restarted the job history server did the application become available under job history.
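The restart dependency described here is consistent with a server that reads its counter limit once at startup. The sketch below models only that behavior; `StartupLimitSketch`, `Server`, and `canLoad` are hypothetical names, and the assumption that the limit is fixed at startup is inferred from this report rather than taken from the Hadoop source.

```java
public class StartupLimitSketch {

    /** Hypothetical server that captures its counter limit when it starts. */
    static final class Server {
        private final int counterLimit;

        Server(int configuredLimit) {
            this.counterLimit = configuredLimit;
        }

        /** Returns true if a job with this many counters can be loaded. */
        boolean canLoad(int jobCounterCount) {
            return jobCounterCount <= counterLimit;
        }
    }

    public static void main(String[] args) {
        int configFileLimit = 1000;
        Server historyServer = new Server(configFileLimit);

        // The config file is edited, but the running server keeps its old value.
        configFileLimit = 10000;
        System.out.println(historyServer.canLoad(1001)); // false

        // A restart re-reads the configuration.
        historyServer = new Server(configFileLimit);
        System.out.println(historyServer.canLoad(1001)); // true
    }
}
```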

I'll also attach the AM logs to help debug the issue.

Attachments


- am_failed_counter_limits.txt | 13/Jul/12 20:55 | 2.40 MB | Rahul Jain
- MAPREDUCE-4443-trunk-draft.patch | 13/Apr/13 00:55 | 5 kB | Mayank Bansal
- MAPREDUCE-4443-trunk-1.patch | 16/Apr/13 22:52 | 11 kB | Mayank Bansal
- MAPREDUCE-4443-trunk-2.patch | 17/Apr/13 07:06 | 12 kB | Mayank Bansal
- MAPREDUCE-4443-trunk-3.patch | 17/Apr/13 18:31 | 12 kB | Mayank Bansal

Issue Links

- is depended upon by: MAPREDUCE-5149 If job has more counters Job History server is not able to show them. (Open)
- is required by: MAPREDUCE-5680 Reconsider limits (Open)
- relates to: MAPREDUCE-5875 Make Counter limits consistent across JobClient, MRAppMaster, and YarnChild (Closed)
