  1. Hadoop HDFS
  2. HDFS-16912

Block is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1)

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.1
    • Fix Version/s: None
    • Component/s: block placement
    • Labels: None
    • Environment: hadoop:3.3.1

    Description

      We run HDFS in federation mode with two nameservices, ns1 and ns2. Table data is written under dc-hdfs, and each database is assigned to a specific ns according to business division.
      We use parquetWriter to write data to each table's temporary staging directory under its ns. This does not happen on every write, and it looks as if the dn and nn temporarily lose communication. When it does happen, closing the Writer throws an exception, which triggers our operation to recover the file lease, and the lease recovery itself then fails with:

      java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1hdfs
          at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
          at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
          at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
          at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
          at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
          at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
          at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
          at com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:118)
          at com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:35)
          at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
          at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
          at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
          at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
          at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
          at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:419)
          at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
          at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
          at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
          at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
          at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
          at java.lang.Thread.run(Thread.java:748)
          Suppressed: java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1hdfs
              at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
              at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
              at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
              at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
              at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
              at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
              at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
              at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
              at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
              at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:837)
              at org.apache.flink.streaming.runtime.tasks.StreamTask.runAndSuppressThrowable(StreamTask.java:816)
              at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:733)
              at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
              ... 3 more
          Caused by: java.net.UnknownHostException: ns1hdfs
              ... 16 more
      Caused by: java.net.UnknownHostException: ns1hdfs
          ... 21 more 

      I can't tell what the problem is from this exception, but the namenode log shows that the file's block could not change state from COMMITTED to COMPLETE. When a file is closed, the dn has to send an incremental block report (IBR) to the namenode, and the file is only closed once that report has been acknowledged. Here the dn failed to report the IBR, so the file could not be closed. The client retries each time the close fails, doubling its wait before each retry: 400ms, 800ms, 1600ms, 3200ms, 6400ms. These retries can be seen in the namenode log.
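      The doubling wait described above can be sketched as follows. This is only an illustration of the schedule, not the HDFS client code: the class and method names are hypothetical, and the 400 ms starting delay and five retries are taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the client's doubling wait between close retries.
// Hypothetical names; this is not the actual HDFS client implementation.
class CloseRetryBackoff {

    // Returns the wait before each retry: initialDelayMs, then doubled each time.
    static List<Long> delays(long initialDelayMs, int retries) {
        List<Long> result = new ArrayList<>();
        long delay = initialDelayMs;
        for (int i = 0; i < retries; i++) {
            result.add(delay);
            delay *= 2;
        }
        return result;
    }

    // Sums the waits, i.e. how long the client sleeps before giving up.
    static long totalWaitMs(List<Long> delays) {
        long sum = 0;
        for (long d : delays) {
            sum += d;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> schedule = delays(400, 5);
        System.out.println("waits (ms): " + schedule);                   // [400, 800, 1600, 3200, 6400]
        System.out.println("total wait (ms): " + totalWaitMs(schedule)); // 12400
    }
}
```

      With these values the client waits about 12.4 seconds in total before it stops retrying.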

       

      2023-02-09 10:00:10,638 INFO  hdfs.StateChange (FSDirWriteFileOp.java:logAllocatedBlock(802)) - BLOCK* allocate blk_1092451654_18751000, replicas=10.146.144.69:1019, 10.146.80.45:1019 for /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
      2023-02-09 10:00:11,072 INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
      2023-02-09 10:00:11,474 INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
      2023-02-09 10:00:12,285 INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
      2023-02-09 10:00:13,887 INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
      2023-02-09 10:00:17,089 INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
      2023-02-09 10:00:23,490 INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
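      To check that the log entries above follow the doubling pattern, the gaps between consecutive "COMMITTED but not COMPLETE" messages can be computed from their timestamps. The sketch below is a hypothetical helper, not part of Hadoop. It assumes every line starts with a 23-character timestamp in the format shown in the log, and the input is the six timestamps copied from the entries above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: measures the time between consecutive namenode log lines.
class RetryGapCheck {

    // Timestamp layout used at the start of each namenode log line.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    // Parses the leading timestamp of each line and returns the gaps in milliseconds.
    static List<Long> gapsMs(List<String> lines) {
        List<Long> gaps = new ArrayList<>();
        LocalDateTime previous = null;
        for (String line : lines) {
            LocalDateTime current = LocalDateTime.parse(line.substring(0, 23), TS);
            if (previous != null) {
                gaps.add(Duration.between(previous, current).toMillis());
            }
            previous = current;
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Timestamps of the six "COMMITTED but not COMPLETE" entries in the log above.
        List<String> timestamps = Arrays.asList(
                "2023-02-09 10:00:11,072",
                "2023-02-09 10:00:11,474",
                "2023-02-09 10:00:12,285",
                "2023-02-09 10:00:13,887",
                "2023-02-09 10:00:17,089",
                "2023-02-09 10:00:23,490");
        System.out.println(gapsMs(timestamps)); // [402, 811, 1602, 3202, 6401]
    }
}
```

      The measured gaps are within a few milliseconds of 400, 800, 1600, 3200 and 6400 ms, which matches the client retry schedule.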

       

       

      Attachments

        1. image-2023-02-09-16-36-42-818.png
          51 kB
          jiangchunyang

          People

            Assignee: Unassigned
            Reporter: jianjiao jiangchunyang
            Votes: 0
            Watchers: 2
