Hadoop HDFS / HDFS-11260

Slow writer threads are not stopped


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.0
    • Fix Version/s: None
    • Component/s: datanode
    • Labels: None
    • Environment: CDH5.8.0
    • Target Version/s:

      Description

      If a DataNode receives a transferred block, it tries to stop the writer to the same block. However, this may not work; we saw the following error message and stack trace.

      Fundamentally, the assumption in ReplicaInPipeline#stopWriter is wrong: it assumes the writer thread must be a DataXceiver thread, which can be interrupted and then terminates. However, an IPC thread may also be the writer thread (for example, via initReplicaRecovery), and IPC threads ignore interrupts and do not terminate.

      2016-12-16 19:58:56,167 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Join on writer thread Thread[IPC Server handler 6 on 50020,5,main] timed out
      sun.misc.Unsafe.park(Native Method)
      java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
      java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
      org.apache.hadoop.ipc.CallQueueManager.take(CallQueueManager.java:135)
      org.apache.hadoop.ipc.Server$Handler.run(Server.java:2052)
      
      2016-12-16 19:58:56,167 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver constructor. Cause is
      2016-12-16 19:58:56,168 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: sj1dra082.corp.adobe.com:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.10.0.80:44105 dst: /10.10.0.82:50010
      java.io.IOException: Join on writer thread Thread[IPC Server handler 6 on 50020,5,main] timed out
              at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:212)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1579)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:195)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:669)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
              at java.lang.Thread.run(Thread.java:745)
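
      For reference, below is a minimal sketch of the interrupt-and-join pattern in ReplicaInPipeline#stopWriter, with simplified names and signatures (not the exact Hadoop code). The pattern only works if the writer thread actually exits when interrupted; an IPC handler instead treats the interrupt as spurious and goes back to polling its call queue, which is exactly where the first stack trace above shows it parked.

      import java.io.IOException;

      class StopWriterSketch {
        private volatile Thread writer; // thread currently writing the replica

        void stopWriter(long xceiverStopTimeoutMs) throws IOException {
          Thread t = writer;
          if (t != null && t != Thread.currentThread() && t.isAlive()) {
            t.interrupt(); // a DataXceiver thread terminates on interrupt
            try {
              t.join(xceiverStopTimeoutMs);
            } catch (InterruptedException e) {
              throw new IOException("Waiting for writer thread is interrupted.");
            }
            if (t.isAlive()) {
              // An IPC handler thread swallows the interrupt and resumes
              // polling its call queue, so the join above times out.
              throw new IOException("Join on writer thread " + t + " timed out");
            }
          }
        }
      }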
      

      There is also a logic error in FsDatasetImpl#createTemporary, as sketched below: if the code in the synchronized block executes for more than 60 seconds (in theory), the method throws an exception without ever trying to stop the existing slow writer.
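
      To illustrate the ordering problem, here is a hedged sketch of the control flow, with placeholder names (ReplicaStub, findExistingWriter, and stopWriterAndWait are illustrative, not real Hadoop types): the elapsed-time check sits between the synchronized lookup and the call that stops the writer, so the deadline can expire before any stop attempt has been made.

      import java.io.IOException;

      class CreateTemporarySketch {
        // Placeholder for the existing replica's writer (hypothetical type).
        static class ReplicaStub {
          void stopWriterAndWait(long timeoutMs) throws IOException { }
        }

        // Placeholder lookup for a conflicting writer to the same block.
        private ReplicaStub findExistingWriter() { return null; }

        void createTemporary(String block, long writerStopTimeoutMs)
            throws IOException {
          long startTimeMs = System.nanoTime() / 1_000_000L;
          while (true) {
            ReplicaStub existing;
            synchronized (this) {
              existing = findExistingWriter();
              if (existing == null) {
                return; // no conflicting writer: create the temporary replica
              }
            }
            long writerStopMs = System.nanoTime() / 1_000_000L - startTimeMs;
            if (writerStopMs > writerStopTimeoutMs) {
              // The logic error: if the synchronized block alone exceeded the
              // timeout, we throw without ever calling stopWriterAndWait().
              throw new IOException("Unable to stop existing writer for block "
                  + block + " after " + writerStopMs + " ms.");
            }
            // Only reached while the deadline has not yet expired.
            existing.stopWriterAndWait(writerStopTimeoutMs - writerStopMs);
          }
        }
      }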

      We saw an FsDatasetImpl#createTemporary call fail after nearly 10 minutes, and it is not yet clear why. My understanding is that the code intends to stop slow writers after 1 minute by default. Some code rewrite is probably needed to get the logic right.

      2016-12-16 23:12:24,636 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Unable to stop existing writer for block BP-1527842723-10.0.0.180-1367984731269:blk_4313782210_1103780331023 after 568320 miniseconds.
      


    People

    • Assignee: weichiu (Wei-Chiu Chuang)
    • Reporter: weichiu (Wei-Chiu Chuang)
    • Votes: 0
    • Watchers: 7
