  1. Oozie
  2. OOZIE-3291

Oozie workflow hangs in running state even when the underlying action failed

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.1.0
    • Fix Version/s: None
    • Component/s: workflow
    • Labels:
      None

      Description

      We have multiple distcp actions in a fork/join, running on Hadoop 2.6.0 (CDH 5.5.1). We are hitting

      https://issues.apache.org/jira/browse/MAPREDUCE-6478

      and when this happens the distcp action fails with the exception below.

      2018-06-10 15:19:39,179 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1520068304865_972654_m_000000_0: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/xxx/oozie-oozi/1951586-180303074950833-oozie-oozi-W/distcp-to-dr-0-update-action--distcp/output/_temporary/1/_temporary/attempt_1520068304865_972654_m_000000_0/part-00000 (inode 192492374): File does not exist. Holder DFSClient_NONMAPREDUCE_-2068852542_1 does not have any open files.
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3604)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3690)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3660)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:738)
      at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:243)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:528)
      at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
      
      

      At this point we expect the WF to be killed and the subsequent WF to start. Instead, this WF stays stuck in RUNNING state and other WFs pile up behind it through the coordinator, leaving no option but to kill the running WF manually. After the defective WF is killed, the other WFs process perfectly fine.
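
      As a stopgap until the root cause is fixed, the manual kill described above could be preceded by an automated check for workflows that have been RUNNING for too long. The following is only a minimal sketch, not part of Oozie: the function name `find_stuck_workflows`, the sample job ids, and the four-hour threshold are hypothetical, and it assumes each job record carries `id`, `status`, and a GMT-formatted `startTime` string, as a jobs listing from the Oozie web services API is expected to.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def find_stuck_workflows(jobs, now, max_age):
    """Return ids of workflows that are RUNNING and started more than max_age ago."""
    stuck = []
    for job in jobs:
        if job.get("status") != "RUNNING":
            continue
        # Assumed format: "Sun, 10 Jun 2018 09:00:00 GMT"
        started = parsedate_to_datetime(job["startTime"])
        if started.tzinfo is None:
            # Treat a timestamp without zone information as UTC
            started = started.replace(tzinfo=timezone.utc)
        if now - started > max_age:
            stuck.append(job["id"])
    return stuck


# Hypothetical sample records standing in for a real jobs listing
sample_jobs = [
    {"id": "wf-A", "status": "RUNNING", "startTime": "Sun, 10 Jun 2018 09:00:00 GMT"},
    {"id": "wf-B", "status": "RUNNING", "startTime": "Sun, 10 Jun 2018 15:00:00 GMT"},
    {"id": "wf-C", "status": "KILLED", "startTime": "Sun, 10 Jun 2018 08:00:00 GMT"},
]
now = datetime(2018, 6, 10, 16, 0, tzinfo=timezone.utc)
print(find_stuck_workflows(sample_jobs, now, timedelta(hours=4)))
```

      Each id returned this way could then be reviewed and passed to `oozie job -kill <job-id>`. A long-running workflow is not necessarily a hung one, so the threshold would need to be chosen per workflow.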

            People

            • Assignee: Unassigned
            • Reporter: Rohit Pegallapati (rohit.peg)
            • Votes: 0
            • Watchers: 3
