Hadoop HDFS / HDFS-15398

EC: hdfs client hangs due to exception during addBlock


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.1, 3.4.0
    • Component/s: ec, hdfs-client
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

   When writing an EC file, if the client's addBlock() call for the second (or any subsequent) block group happens to exceed the space quota, the client program hangs forever.
      See the demo below:

      $ hadoop fs -mkdir -p /user/wanghongbing/quota/ec
      $ hdfs dfsadmin -setSpaceQuota 2g /user/wanghongbing/quota
      $ hdfs ec -setPolicy -path /user/wanghongbing/quota/ec -policy RS-6-3-1024k
      Set RS-6-3-1024k erasure coding policy on /user/wanghongbing/quota/ec
      $ hadoop fs -put 800m /user/wanghongbing/quota/ec
      ^@^@^@^@^@^@^@^@^Z
      

      In the case of blocksize=128M, spaceQuota=2g and the RS-6-3 policy, a block group must allocate 1152M of physical space to write 768M of logical data. Writing 800M of data therefore exceeds the quota when the second block group is requested, and at that point the client hangs forever.
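      To make the arithmetic concrete, here is a small, self-contained sketch (illustrative only, not code from the patch); the constants mirror the numbers above:

      // Physical footprint of one full RS-6-3-1024k block group at the
      // default 128M block size, per the description above.
      public class EcQuotaMath {
        public static void main(String[] args) {
          final long blockSizeMb = 128;  // dfs.blocksize = 128M
          final int dataUnits = 6;       // RS-6-3: six data blocks
          final int parityUnits = 3;     // RS-6-3: three parity blocks

          long logicalMb = dataUnits * blockSizeMb;                  // 768M of user data
          long physicalMb = (dataUnits + parityUnits) * blockSizeMb; // 1152M charged to quota

          // One block group fits under the 2g (2048M) quota, but a second one
          // would bring the total to 2 * 1152M = 2304M, so the second
          // addBlock() call trips the quota check.
          System.out.printf("logical=%dM, physical=%dM%n", logicalMb, physicalMb);
        }
      }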

      The hung client's thread stack is as follows:

      java.lang.Thread.State: TIMED_WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x000000008009d5d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
              at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
              at org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.takeWithTimeout(DFSStripedOutputStream.java:117)
              at org.apache.hadoop.hdfs.DFSStripedOutputStream.waitEndBlocks(DFSStripedOutputStream.java:453)
              at org.apache.hadoop.hdfs.DFSStripedOutputStream.allocateNewBlock(DFSStripedOutputStream.java:477)
              at org.apache.hadoop.hdfs.DFSStripedOutputStream.writeChunk(DFSStripedOutputStream.java:541)
              - locked <0x000000008009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
              at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217)
              at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:164)
              - locked <0x000000008009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
              at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:145)
              - locked <0x000000008009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
              at org.apache.hadoop.hdfs.DFSStripedOutputStream.closeImpl(DFSStripedOutputStream.java:1182)
              - locked <0x000000008009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
              at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:847)
              - locked <0x000000008009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
              at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
              at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
              at org.apache.hadoop.io.IOUtils.cleanupWithLogger(IOUtils.java:280)
              at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:298)
              at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:77)
              at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
              at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:485)
              at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:407)
              at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:342)
              at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:277)
              at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:262)
      

      When an exception occurs in addBlock(), closing the stream calls DFSStripedOutputStream.closeImpl() -> flushBuffer() -> writeChunk() -> allocateNewBlock() -> waitEndBlocks(), and waitEndBlocks() enters an infinite loop because the coordinator's endBlocks queue is empty.

      private void waitEndBlocks(int i) throws IOException {
        // addBlock() failed before any streamer could fail, so the streamer
        // still reports healthy and this condition never becomes false.
        while (getStripedDataStreamer(i).isHealthy()) {
          // Nothing enqueues an ended block after the addBlock() failure, so
          // takeWithTimeout() returns null on every iteration and the loop spins.
          final ExtendedBlock b = coordinator.endBlocks.takeWithTimeout(i);
          if (b != null) {
            StripedBlockUtil.checkBlocks(currentBlockGroup, i, b);
            return;
          }
        }
      }
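
      For context, the stack trace shows takeWithTimeout() parked inside LinkedBlockingQueue.poll(), i.e. a timed poll of the per-streamer queue. A behavioral sketch of that contract (class shape and timeout value are assumptions inferred from the stack trace, not the verbatim Hadoop source):

      import java.util.concurrent.LinkedBlockingQueue;
      import java.util.concurrent.TimeUnit;

      // Sketch of the MultipleBlockingQueue.takeWithTimeout() behavior implied
      // by the stack trace: a timed poll that returns null on timeout.
      class TimedPollSketch<T> {
        private final LinkedBlockingQueue<T> queue = new LinkedBlockingQueue<>();

        T takeWithTimeout() throws InterruptedException {
          // Returns null when nothing arrives in the window, so a caller that
          // loops on "streamer is healthy" simply polls again -- forever, if
          // the queue is never fed and the streamer never turns unhealthy.
          return queue.poll(100, TimeUnit.MILLISECONDS);  // timeout value illustrative
        }
      }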
      

      So, to fix it, I close all the striped data streamers when an exception occurs in addBlock(), which flips isHealthy() to false and lets waitEndBlocks() exit.
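      A minimal sketch of that approach (the call site in allocateNewBlock() and the closeAllStreamers() helper name are assumptions here; see the attached patches for the actual change):

      // Inside DFSStripedOutputStream.allocateNewBlock() -- sketch, not the
      // committed diff. If addBlock() throws (e.g. the quota check fails),
      // close every streamer so isHealthy() turns false and waitEndBlocks()
      // stops polling the empty endBlocks queue.
      LocatedBlock lb;
      try {
        lb = addBlock(excludedNodes, dfsClient, src, currentBlockGroup,
            fileId, favoredNodes, getAddBlockFlags());
      } catch (IOException ioe) {
        closeAllStreamers();  // assumed helper: streamer.close(true) for each streamer
        throw ioe;
      }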

       

      Attachments

        1. HDFS-15398.001.patch
          1 kB
          Hongbing Wang
        2. HDFS-15398.002.patch
          4 kB
          Hongbing Wang
        3. HDFS-15398.003.patch
          3 kB
          Hongbing Wang
        4. HDFS-15398.004.patch
          3 kB
          Hongbing Wang


            People

              Assignee: Hongbing Wang (wanghongbing)
              Reporter: Hongbing Wang (wanghongbing)
              Votes: 0
              Watchers: 6
