Oozie / OOZIE-3291

Oozie workflow hangs in running state even when the underlying action failed

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.1.0
    • Fix Version/s: None
    • Component/s: workflow
    • Labels: None

    Description

      We have multiple distcp actions in a fork/join. We use Hadoop 2.6.0 (CDH 5.5.1). We are hitting

      https://issues.apache.org/jira/browse/MAPREDUCE-6478

      When this happens, the distcp action fails with the exception below.

      2018-06-10 15:19:39,179 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1520068304865_972654_m_000000_0: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/xxx/oozie-oozi/1951586-180303074950833-oozie-oozi-W/distcp-to-dr-0-update-action--distcp/output/_temporary/1/_temporary/attempt_1520068304865_972654_m_000000_0/part-00000 (inode 192492374): File does not exist. Holder DFSClient_NONMAPREDUCE_-2068852542_1 does not have any open files.
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3604)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3690)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3660)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:738)
      at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:243)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:528)
      at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
      
      

      At this point we expect the WF to be killed and the subsequent WF to start. Instead, this WF is stuck in RUNNING state and other WFs get stacked up through the coordinator, leaving no option but to kill the running WF. After the defective WF is killed, the other WFs are processed without problems.
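
To make the stuck state above easier to spot, here is a minimal sketch of a check that flags workflows still reported as RUNNING although one of their actions has already ended in a failure state. It is not part of the original report. The function name `find_stuck_workflows`, the dict layout (`id`, `status`, `actions`), the sample job ids, and the set of failure states are all assumptions for illustration. In practice the records would have to come from whatever job information the Oozie server exposes.

```python
# Hypothetical helper: flags workflows that are still RUNNING although
# at least one of their actions has already ended in a failure state.
# The dict layout and the status names are assumptions for illustration.

FAILED_ACTION_STATES = {"ERROR", "FAILED", "KILLED"}


def find_stuck_workflows(jobs):
    """Return the ids of jobs with status RUNNING and at least one failed action."""
    stuck = []
    for job in jobs:
        # Only workflows that still claim to be running are candidates.
        if job.get("status") != "RUNNING":
            continue
        actions = job.get("actions", [])
        if any(a.get("status") in FAILED_ACTION_STATES for a in actions):
            stuck.append(job["id"])
    return stuck


# Made-up sample records; ids and action names are placeholders.
jobs = [
    {"id": "wf-1", "status": "RUNNING",
     "actions": [{"name": "distcp-a", "status": "ERROR"},
                 {"name": "distcp-b", "status": "OK"}]},
    {"id": "wf-2", "status": "RUNNING",
     "actions": [{"name": "distcp-a", "status": "RUNNING"}]},
    {"id": "wf-3", "status": "KILLED",
     "actions": [{"name": "distcp-a", "status": "ERROR"}]},
]

print(find_stuck_workflows(jobs))  # prints ['wf-1']
```

Only `wf-1` is flagged: `wf-2` has no failed action yet, and `wf-3` has already left the RUNNING state. A check like this only identifies candidates to kill by hand; it does not address why the workflow fails to transition.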


          People

            Assignee: Unassigned
            Reporter: Rohit Pegallapati (rohit.peg)