Details
Type: Bug
Status: Open
Priority: Blocker
Resolution: Unresolved
Affects Version/s: 5.2.0
None
None
Description
When my cluster is under load, sub-workflows hang in the "RUNNING" status. I first hit this error when working with Hive tables, but I also managed to reproduce it by running the standard pi calculation in many sub-workflows to imitate the load.
I launch an Oozie workflow with the following structure:
-- Oozie workflow
------> subworkflow_1
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n
------> subworkflow_2
---------- fork_1
---------- fork_2
---------- ...
---------- fork_n
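For readers who want to reproduce a similar layout, here is a minimal Python sketch that generates a parent workflow definition with one fork fanning out to n sub-workflow actions. It is not the reporter's actual workflow.xml: the helper name `build_parent_workflow`, the `join`/`fail` node names, the schema version `uri:oozie:workflow:0.5`, and the use of `propagate-configuration` are all assumptions made for illustration. Only the `fork`/`forkN` naming and the app path mirror the output in this report.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; adjust to whatever the cluster actually uses.
OOZIE_NS = "uri:oozie:workflow:0.5"


def build_parent_workflow(name, sub_app_path, n_paths):
    """Return workflow XML with one fork that starts n sub-workflow actions."""
    root = ET.Element("workflow-app", {"xmlns": OOZIE_NS, "name": name})
    ET.SubElement(root, "start", {"to": "fork"})

    # The fork node lists one <path> per parallel branch.
    fork = ET.SubElement(root, "fork", {"name": "fork"})
    for i in range(1, n_paths + 1):
        ET.SubElement(fork, "path", {"start": f"fork{i}"})

    # Each branch is a sub-workflow action that transitions to a common join.
    for i in range(1, n_paths + 1):
        action = ET.SubElement(root, "action", {"name": f"fork{i}"})
        sub = ET.SubElement(action, "sub-workflow")
        ET.SubElement(sub, "app-path").text = sub_app_path
        ET.SubElement(sub, "propagate-configuration")
        ET.SubElement(action, "ok", {"to": "join"})
        ET.SubElement(action, "error", {"to": "fail"})

    ET.SubElement(root, "join", {"name": "join", "to": "end"})
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Sub-workflow failed"
    ET.SubElement(root, "end", {"name": "end"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_parent_workflow(
        "test-subworkflow", "hdfs://mycluster:8020/user/cecyl/subwf", 3))
```

The sub-workflow itself would follow the same fork/join shape, with shell actions in place of the sub-workflow actions.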
One of the forked actions shows the status "RUNNING", but if you open it, it has the status "SUCCEEDED".
Parent workflow:
Job ID : 0061971-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path : hdfs://mycluster:8020/user/cecyl/subwf/job
Status : RUNNING
Run : 0
User : cecyl
Group : -
Created : 2024-01-25 15:55 GMT
Started : 2024-01-25 15:55 GMT
Last Modified : 2024-01-30 06:24 GMT
Ended : -
CoordAction ID: -
Actions
-------------------------------------------------------------------------------------------------------------------------
ID Status Ext ID Ext Status Err Code
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@:start: OK - OK -
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork OK - OK -
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork7 OK 0067643-240125161152217-oozie-oozi-W SUCCEEDED -
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork9 OK 0067640-240125161152217-oozie-oozi-W SUCCEEDED -
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork10 RUNNING 0067641-240125161152217-oozie-oozi-W RUNNING -
-------------------------------------------------------------------------------------------------------------------------
0061971-240125161152217-oozie-oozi-W@fork5 OK 0067645-240125161152217-oozie-oozi-W SUCCEEDED -
-------------------------------------------------------------------------------------------------------------------------
Running subworkflow:
Job ID : 0067641-240125161152217-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : test-subworkflow
App Path : hdfs://mycluster:8020/user/cecyl/subwf
Status : RUNNING
Run : 0
User : cecyl
Group : -
Created : 2024-01-26 04:20 GMT
Started : 2024-01-26 04:20 GMT
Last Modified : 2024-01-26 08:23 GMT
Ended : -
CoordAction ID: 0061971-240125161152217-oozie-oozi-W
Actions
-------------------------------------------------------------------------------------------------------------------------
ID Status Ext ID Ext Status Err Code
-------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@:start: OK - OK -
-------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork OK - OK -
-------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork21 RUNNING application_1706187939089_147514 RUNNING -
-------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork22 RUNNING application_1706187939089_147519 RUNNING -
-------------------------------------------------------------------------------------------------------------------------
0067641-240125161152217-oozie-oozi-W@fork18 RUNNING application_1706187939089_147518 RUNNING -
-------------------------------------------------------------------------------------------------------------------------
However, the YARN application behind an action that Oozie reports as "RUNNING" has state "FINISHED" and final state "SUCCEEDED":
Application Report :
Application-Id : application_1706187939089_147514
Application-Name : oozie:launcher:T=shell:W=test-subworkflow:A=fork21:ID=0067641-240125161152217-oozie-oozi-W
Application-Type : Oozie Launcher
User : cecyl
Queue : default
Application Priority : 0
Start-Time : 1706259786568
Finish-Time : 1706259853156
Progress : 100%
State : FINISHED
Final-State : SUCCEEDED
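The mismatch shown above (Oozie action "RUNNING", YARN application "FINISHED") can be checked mechanically. The following is a minimal sketch, not a tool that ships with Oozie: the function name `find_stale_actions` and the dictionary shapes are assumptions. The keys `name`, `status`, `externalId`, `state` and `finalStatus` are modeled on what the Oozie and YARN REST APIs return, and the data would have to be collected separately, for example from `oozie job -info` and `yarn application -status`.

```python
# YARN application states after which no further progress is expected.
TERMINAL_YARN_STATES = ("FINISHED", "FAILED", "KILLED")


def find_stale_actions(oozie_actions, yarn_apps):
    """Return actions Oozie still reports as RUNNING whose YARN app has ended.

    oozie_actions: list of dicts with keys "name", "status", "externalId"
                   (assumed shape).
    yarn_apps:     dict mapping application id to a dict with keys
                   "state" and "finalStatus" (assumed shape).
    """
    stale = []
    for action in oozie_actions:
        if action.get("status") != "RUNNING":
            continue
        app = yarn_apps.get(action.get("externalId"))
        # Actions with no known YARN report are skipped, not flagged.
        if app and app.get("state") in TERMINAL_YARN_STATES:
            stale.append((action["name"], action["externalId"],
                          app["state"], app.get("finalStatus")))
    return stale


if __name__ == "__main__":
    # Values copied from the listings in this report.
    actions = [
        {"name": "fork21", "status": "RUNNING",
         "externalId": "application_1706187939089_147514"},
        {"name": "fork22", "status": "RUNNING",
         "externalId": "application_1706187939089_147519"},
    ]
    apps = {
        "application_1706187939089_147514":
            {"state": "FINISHED", "finalStatus": "SUCCEEDED"},
    }
    print(find_stale_actions(actions, apps))
```

With the sample data, only fork21 is flagged, because it is the only action for which an application report is included here.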
The problem began to appear more often after configuring HA. The only workaround so far is to reduce the load and restart the application, but that is not an acceptable solution for me.
There are no signs in the launcher or server logs that anything is going wrong. Does anyone have an idea why this behavior might occur?