[OOZIE-3291] Oozie workflow hangs in running state even when the underlying action failed - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.1.0
Fix Version/s: None
Component/s: workflow
Labels:
None

Description

We have mutiple distcp actions in fork join. We use hadoop 2.6.0 (cdh 5.5.1). We are hitting

https://issues.apache.org/jira/browse/MAPREDUCE-6478

at this time the distcp action fails with the below exception.

2018-06-10 15:19:39,179 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1520068304865_972654_m_000000_0: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/xxx/oozie-oozi/1951586-180303074950833-oozie-oozi-W/distcp-to-dr-0-update-action--distcp/output/_temporary/1/_temporary/attempt_1520068304865_972654_m_000000_0/part-00000 (inode 192492374): File does not exist. Holder DFSClient_NONMAPREDUCE_-2068852542_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3604)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3690)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3660)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:738)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:243)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:528)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

At this time we expect that WF should be killed and subsequent WF should start. But this WF is stuck in RUNNING state and other WFs get stacked up through the coordinator, leaving no option but to kill the running WF. After this defective WF is killed, other WF's process perfectly fine

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Rohit Pegallapati

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 15/Jun/18 23:09

Updated:: 28/Jun/18 18:24