- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Cannot Reproduce
- Affects Version/s: v2.6.1
- Fix Version/s: None
- Component/s: Job Engine
- Labels: None
Auto-triggered merge tasks all hung at the step "Convert Cuboid Data to HFile", the last reduce step in "Kylin_HFile_Generator_mycube_Step".
The Hadoop UI shows the task status:
attempt_1555603539483_1251_r_000000_0 78.56 RUNNING reduce > reduce
From the syslog, the following output is printed during the first few hours:
2019-05-10 22:27:48,699 INFO [IPC Server handler 27 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.8629564
2019-05-10 22:28:24,743 INFO [Socket Reader #1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1555603539483_1149 (auth:SIMPLE)
2019-05-10 22:28:24,746 INFO [IPC Server handler 23 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.86303306
2019-05-10 22:29:00,794 INFO [Socket Reader #1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1555603539483_1149 (auth:SIMPLE)
2019-05-10 22:29:00,796 INFO [IPC Server handler 20 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.86305153
2019-05-10 22:29:36,833 INFO [Socket Reader #1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1555603539483_1149 (auth:SIMPLE)
2019-05-10 22:29:36,836 INFO [IPC Server handler 3 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.8634469
2019-05-10 22:30:12,883 INFO [Socket Reader #1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1555603539483_1149 (auth:SIMPLE)
2019-05-10 22:30:12,887 INFO [IPC Server handler 18 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.8643631
After that, only the following ZooKeeper-related log lines keep being printed, constantly:
2019-05-11 00:02:23,445 WARN [Thread-1388778] org.apache.hadoop.hdfs.DataStreamer: These favored nodes were specified but not chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
2019-05-11 00:02:23,452 INFO [main] org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Connect 0x29758ab9 to boron:2181,carbon:2181,neon:2181 with session timeout=90000ms, retries 6, retry interval 1000ms, keepAlive=60000ms
2019-05-11 00:02:23,452 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9] org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=boron:2181,carbon:2181,neon:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$24/171414890@19fbce98
2019-05-11 00:02:23,452 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server boron/172.16.211.20:2181. Will not attempt to authenticate using SASL (unknown error)
2019-05-11 00:02:23,453 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)] org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /172.16.211.8:52214, server: boron/172.16.211.20:2181
2019-05-11 00:02:23,453 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)] org.apache.zookeeper.ClientCnxn: Session establishment complete on server boron/172.16.211.20:2181, sessionid = 0x16a061f0770a354, negotiated timeout = 60000
2019-05-11 00:02:23,455 INFO [main] org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Close zookeeper connection 0x29758ab9 to boron:2181,carbon:2181,neon:2181
2019-05-11 00:02:23,455 INFO [main] org.apache.hadoop.hbase.io.hfile.CacheConfig: Created cacheConfig: CacheConfig:disabled
2019-05-11 00:02:23,456 INFO [main] org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2: Writer=hdfs://berylium:8020/kylin/kylin_metadata/kylin-772502e8-c126-fec2-5808-08aa2df1fdea/pbs_uv_daily/hfile/_temporary/1/_temporary/attempt_1555603539483_1149_r_000000_0/F1/fdbf1bd8933242f2bd1da40a03c7a635, wrote=43
2019-05-11 00:02:23,457 WARN [Thread-1388781] org.apache.hadoop.hdfs.DataStreamer: These favored nodes were specified but not chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
This lasted for tens of hours, with no errors and no exceptions. In the end, it had created too many blocks in HDFS. The problem only occurs when a merge task is triggered; daily build jobs ran without issue.