  1. Hadoop HDFS
  2. HDFS-8496

Calling stopWriter() with FSDatasetImpl lock held may block other threads

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: None
    • Labels: None

      Description

      On a DataNode (DN) in an HDFS 2.6 cluster, we noticed that some DataXceiver threads and heartbeat threads were blocked for quite a while on the FsDatasetImpl lock. The stack traces show that a thread calling stopWriter() while holding the FsDatasetImpl lock was blocking all of them.

      The heartbeat thread's stack is one example of how threads are blocked on the FsDatasetImpl lock:

         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:152)
              - waiting to lock <0x00000007701badc0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getAvailable(FsVolumeImpl.java:191)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:144)
              - locked <0x0000000770465dc0> (a java.lang.Object)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:575)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:680)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:850)
              at java.lang.Thread.run(Thread.java:662)
      

      The thread holding the FsDatasetImpl lock (<0x00000007701badc0>) is simply waiting in stopWriter() for another thread to exit. Its stack is:

         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Thread.join(Thread.java:1194)
              - locked <0x00000007636953b8> (a org.apache.hadoop.util.Daemon)
              at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:183)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverCheck(FsDatasetImpl.java:982)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverClose(FsDatasetImpl.java:1026)
              - locked <0x00000007701badc0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:624)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
              at java.lang.Thread.run(Thread.java:662)
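
      The two stacks above can be reproduced in miniature. The following is only a sketch under assumed names: datasetLock stands in for the FsDatasetImpl monitor, and the "writer", "recoverer" and "heartbeat" threads are hypothetical stand-ins for the real DataNode threads, not HDFS code. It shows that a thread calling Thread.join() inside a synchronized block keeps the monitor for the whole wait, so an unrelated thread that needs the same monitor ends up BLOCKED.

```java
import java.util.concurrent.CountDownLatch;

class JoinUnderLockDemo {

    /** Returns the state of a "heartbeat" thread while another thread joins a writer under the same lock. */
    static Thread.State observeHeartbeatState() {
        final Object datasetLock = new Object();                   // hypothetical stand-in for the FsDatasetImpl monitor
        final CountDownLatch writerMayExit = new CountDownLatch(1);
        final CountDownLatch lockTaken = new CountDownLatch(1);

        // A slow writer: it does not exit until the demo allows it to.
        Thread writer = new Thread(() -> {
            try { writerMayExit.await(); } catch (InterruptedException ignored) { }
        }, "writer");

        // Mimics recoverClose(): takes the lock, then waits for the writer to exit.
        Thread recoverer = new Thread(() -> {
            synchronized (datasetLock) {
                lockTaken.countDown();
                try { writer.join(); } catch (InterruptedException ignored) { }
            }
        }, "recoverer");

        // Mimics the heartbeat: it only needs the lock briefly.
        Thread heartbeat = new Thread(() -> {
            synchronized (datasetLock) { /* e.g. read usage counters */ }
        }, "heartbeat");

        try {
            writer.start();
            recoverer.start();
            lockTaken.await();                                     // the recoverer now owns the lock
            heartbeat.start();

            // Poll until the heartbeat thread is parked on the monitor (bounded wait).
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (heartbeat.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(1);
            }
            Thread.State observed = heartbeat.getState();

            writerMayExit.countDown();                             // let everything finish
            recoverer.join();
            heartbeat.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeat state while recoverer joins: " + observeHeartbeatState());
    }
}
```

      In this sketch the heartbeat thread should stay BLOCKED for exactly as long as the writer takes to exit, which matches the behavior reported above when the writer is slowed down by a busy disk.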
      

      In this case, we had deployed many other workloads on the DN, so the local file system and disks were quite busy. We guess this is why stopWriter() took so long.
      In any case, it is not reasonable to call stopWriter() with the FsDatasetImpl lock held. In HDFS-7999, createTemporary() was changed to call stopWriter() without the FsDatasetImpl lock. We think the same should be done in the other three methods: recoverClose(), recoverAppend(), and recoverRbw().
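
      One possible shape for "stop the writer without the lock" is sketched below. It is an illustration only, not the attached patch: the class StopWriterSketch, its fields and its recover() method are all hypothetical names. The idea is to look up the writer under the lock, release the lock, interrupt and join the writer, then retake the lock and re-check, because the state may have changed while the lock was released. The demo writer takes the lock on its way out purely to make the benefit observable; whether a real writer does so is an assumption.

```java
class StopWriterSketch {
    private final Object datasetLock = new Object();   // hypothetical stand-in for the FsDatasetImpl monitor
    private Thread writer;                             // guarded by datasetLock

    void setWriter(Thread t) {
        synchronized (datasetLock) { writer = t; }
    }

    /** Returns true once no live writer remains, false if the writer did not stop within the timeout. */
    boolean recover(long stopTimeoutMs) throws InterruptedException {
        while (true) {
            Thread toStop;
            synchronized (datasetLock) {
                if (writer == null || !writer.isAlive()) {
                    writer = null;                     // recovery bookkeeping would happen here, under the lock
                    return true;
                }
                toStop = writer;                       // remember whom to stop, then drop the lock
            }
            // The lock is NOT held here, so other threads can use it while we wait.
            toStop.interrupt();
            toStop.join(stopTimeoutMs);
            if (toStop.isAlive()) {
                return false;                          // the writer ignored the stop request
            }
            // Loop: retake the lock and re-validate before touching shared state.
        }
    }

    /** Runs the sketch against a writer that needs the lock during its own cleanup. */
    static boolean demo() {
        StopWriterSketch dataset = new StopWriterSketch();
        Thread w = new Thread(() -> {
            try {
                Thread.sleep(60_000);                  // pretend to be busy writing
            } catch (InterruptedException stopRequested) {
                synchronized (dataset.datasetLock) { /* cleanup that needs the lock */ }
            }
        }, "writer");
        w.setDaemon(true);
        w.start();
        dataset.setWriter(w);
        try {
            return dataset.recover(5_000) && !w.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("recovered without holding the lock during join: " + demo());
    }
}
```

      The cost of this design is the retry loop: because the lock is dropped while waiting, every check made before the wait has to be repeated after it.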

      I'll try to finish a patch for this today.

        Attachments

        1. HDFS-8496-001.patch
          17 kB
          zhouyingchao
        2. HDFS-8496.004.patch
          25 kB
          Colin P. McCabe
        3. HDFS-8496.003.patch
          24 kB
          Colin P. McCabe
        4. HDFS-8496.002.patch
          14 kB
          Colin P. McCabe

              People

              • Assignee: Colin P. McCabe (cmccabe)
              • Reporter: zhouyingchao (sinago)
              • Votes: 0
              • Watchers: 15
