Hadoop Map/Reduce / MAPREDUCE-4109

Availability of job info in the HS should be atomic

      Description

      It seems that the HS starts serving info about a job before it has all the info available.

      In the trace below, a RunningJob throws an NPE when trying to access the counters.

      This happens intermittently, so I assume it is caused either by the AM not flushing all job info to HDFS before notifying the HS, or by the HS not loading all the job info from HDFS before it starts serving it.

      In case it helps to diagnose the issue, this is happening in a secure cluster.

      This causes Oozie to mark jobs as failed.
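
      To illustrate what "atomic" availability could mean here, the following is a minimal sketch and not the actual HS code. It assumes a hypothetical in-memory cache (JobCache) and a hypothetical immutable record (JobSnapshot): the record is fully built and validated before it is published, so a reader sees either no entry at all or a complete one, never a job whose counters are still null.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AtomicJobInfoPublication {

    /** Hypothetical immutable snapshot of a finished job; cannot be built incomplete. */
    public static final class JobSnapshot {
        public final String jobId;
        public final Map<String, Long> counters;

        public JobSnapshot(String jobId, Map<String, Long> counters) {
            if (jobId == null || counters == null) {
                throw new IllegalArgumentException("incomplete job info");
            }
            this.jobId = jobId;
            // Defensive copy so later changes by the loader cannot leak into the snapshot.
            this.counters = Collections.unmodifiableMap(new HashMap<>(counters));
        }
    }

    /** Hypothetical stand-in for the server-side cache of jobs that may be served. */
    public static final class JobCache {
        private final Map<String, JobSnapshot> served = new ConcurrentHashMap<>();

        /** Publishes a complete snapshot in a single step. */
        public void publish(JobSnapshot snapshot) {
            served.put(snapshot.jobId, snapshot);
        }

        /** Returns the complete snapshot, or null if the job is not yet available. */
        public JobSnapshot lookup(String jobId) {
            return served.get(jobId);
        }
    }

    public static void main(String[] args) {
        JobCache cache = new JobCache();

        // Before publication the job is simply unknown to readers.
        System.out.println(cache.lookup("job_1") == null); // prints true

        // The loader gathers everything first, then publishes once.
        Map<String, Long> counters = new HashMap<>();
        counters.put("MAP_INPUT_RECORDS", 42L);
        cache.publish(new JobSnapshot("job_1", counters));

        System.out.println(cache.lookup("job_1").counters.get("MAP_INPUT_RECORDS")); // prints 42
    }
}
```

      With this shape, a request that arrives too early gets a clean "not available yet" answer instead of a partially loaded job.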

      java.lang.NullPointerException
      	at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$MRClientProtocolHandler.getCounters(HistoryClientService.java:214)
      	at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getCounters(MRClientProtocolPBServiceImpl.java:149)
      	at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:206)
      	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:355)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1660)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1656)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:396)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1654)
       at LocalTrace: 
      	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
      	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:163)
      	at $Proxy31.getCounters(Unknown Source)
      	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getCounters(MRClientProtocolPBClientImpl.java:162)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:616)
      	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:296)
      	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:325)
      	at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:472)
      	at org.apache.hadoop.mapreduce.Job$8.run(Job.java:714)
      	at org.apache.hadoop.mapreduce.Job$8.run(Job.java:711)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:416)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
      	at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:711)
      	at org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:396)
      	at org.apache.oozie.action.hadoop.LauncherMapper.hasIdSwap(LauncherMapper.java:296)
      	at org.apache.oozie.action.hadoop.JavaActionExecutor.check(JavaActionExecutor.java:886)
      	at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:162)
      	at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:51)
      	at org.apache.oozie.command.XCommand.call(XCommand.java:260)
      	at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:166)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      	at java.lang.Thread.run(Thread.java:679)
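
      The trace above shows the failure reaching Oozie through a getCounters call. Until the server-side race is fixed, a client could treat "not available yet" as transient. The sketch below is a hypothetical client-side mitigation, not Oozie or Hadoop code: fetchWithRetry and its parameters are invented for illustration, and the fetch is simulated rather than a real RPC.

```java
import java.util.concurrent.Callable;

public class CounterFetchRetry {

    /**
     * Calls fetch until it returns a non-null value or maxAttempts is reached.
     * Throws IllegalStateException if the value never becomes available.
     */
    public static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = fetch.call();
                if (result != null) {
                    return result;
                }
            } catch (Exception e) {
                // Sketch only: a real client would retry just the errors known to be transient.
                lastFailure = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw new IllegalStateException(
            "job info still unavailable after " + maxAttempts + " attempts", lastFailure);
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Simulated fetch: unavailable (null) on the first two calls, then a value.
        String counters = fetchWithRetry(
            () -> ++calls[0] < 3 ? null : "MAP_INPUT_RECORDS=42", 5, 10L);
        System.out.println(counters + " after " + calls[0] + " calls");
        // prints MAP_INPUT_RECORDS=42 after 3 calls
    }
}
```

      This only hides the symptom from the caller; the atomicity problem described in this issue still needs a server-side fix.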
      

            People

            • Assignee: Unassigned
            • Reporter: Alejandro Abdelnur (tucu00)
            • Votes: 0
            • Watchers: 2
