Kylin / KYLIN-4000

Merge task hang at step #5: Convert Cuboid Data to HFile


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Cannot Reproduce
    • Affects Version/s: v2.6.1
    • Fix Version/s: None
    • Component/s: Job Engine
    • Labels: None

      Description

      Automatically triggered merge tasks all hang at the step "Convert Cuboid Data to HFile", in the last reduce phase of "Kylin_HFile_Generator_mycube_Step".

      The Hadoop UI shows the task attempt status:
      attempt_1555603539483_1251_r_000000_0 78.56 RUNNING reduce > reduce

      In the syslog, the following lines are printed during the first few hours:

       2019-05-10 22:27:48,699 INFO [IPC Server handler 27 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.8629564
       2019-05-10 22:28:24,743 INFO [Socket Reader #1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1555603539483_1149 (auth:SIMPLE)
       2019-05-10 22:28:24,746 INFO [IPC Server handler 23 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.86303306
       2019-05-10 22:29:00,794 INFO [Socket Reader #1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1555603539483_1149 (auth:SIMPLE)
       2019-05-10 22:29:00,796 INFO [IPC Server handler 20 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.86305153
       2019-05-10 22:29:36,833 INFO [Socket Reader #1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1555603539483_1149 (auth:SIMPLE)
       2019-05-10 22:29:36,836 INFO [IPC Server handler 3 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.8634469
       2019-05-10 22:30:12,883 INFO [Socket Reader #1 for port 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1555603539483_1149 (auth:SIMPLE)
       2019-05-10 22:30:12,887 INFO [IPC Server handler 18 on 46084] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1555603539483_1149_r_000000_0 is : 0.8643631
      

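      Since the reported progress is still inching forward, one quick check is to compute the deltas between successive progress reports directly from the syslog. A minimal sketch, assuming the AM syslog has been saved locally to a file named `syslog` (a hypothetical path):

```shell
# Pull the reported progress values out of the AM syslog and print the
# delta between consecutive reports. A long run of near-zero deltas means
# the reducer is crawling rather than dead. "syslog" is a placeholder path.
grep -o 'is : [0-9.]*' syslog \
  | awk '{print $3}' \
  | awk 'NR > 1 { printf "%.7f\n", $1 - prev } { prev = $1 }'
```

      On the excerpt above, the deltas are on the order of 1e-5 to 1e-4 per ~36-second report interval, which already implies many more hours of run time at that rate.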
      After that, only ZK-related logs keep printing the following lines constantly:

      2019-05-11 00:02:23,445 WARN [Thread-1388778] org.apache.hadoop.hdfs.DataStreamer: These favored nodes were specified but not chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
       2019-05-11 00:02:23,452 INFO [main] org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Connect 0x29758ab9 to boron:2181,carbon:2181,neon:2181 with session timeout=90000ms, retries 6, retry interval 1000ms, keepAlive=60000ms
       2019-05-11 00:02:23,452 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9] org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=boron:2181,carbon:2181,neon:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$24/171414890@19fbce98
       2019-05-11 00:02:23,452 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server boron/172.16.211.20:2181. Will not attempt to authenticate using SASL (unknown error)
       2019-05-11 00:02:23,453 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)] org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /172.16.211.8:52214, server: boron/172.16.211.20:2181
       2019-05-11 00:02:23,453 INFO [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)] org.apache.zookeeper.ClientCnxn: Session establishment complete on server boron/172.16.211.20:2181, sessionid = 0x16a061f0770a354, negotiated timeout = 60000
       2019-05-11 00:02:23,455 INFO [main] org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Close zookeeper connection 0x29758ab9 to boron:2181,carbon:2181,neon:2181
       2019-05-11 00:02:23,455 INFO [main] org.apache.hadoop.hbase.io.hfile.CacheConfig: Created cacheConfig: CacheConfig:disabled
       2019-05-11 00:02:23,456 INFO [main] org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2: Writer=hdfs://berylium:8020/kylin/kylin_metadata/kylin-772502e8-c126-fec2-5808-08aa2df1fdea/pbs_uv_daily/hfile/_temporary/1/_temporary/attempt_1555603539483_1149_r_000000_0/F1/fdbf1bd8933242f2bd1da40a03c7a635, wrote=43
       2019-05-11 00:02:23,457 WARN [Thread-1388781] org.apache.hadoop.hdfs.DataStreamer: These favored nodes were specified but not chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
      

      This lasts for tens of hours, with no errors or exceptions. In the end it creates too many blocks in HDFS. The problem only occurs when a merge task is triggered; daily build jobs run fine.

            People

            • Assignee: zhangyaqian (Yaqian Zhang)
            • Reporter: sickcate (Tuo Zhu)
            • Votes: 0
            • Watchers: 2
