Hadoop HDFS / HDFS-15398

EC: hdfs client hangs due to exception during addBlock


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.1, 3.4.0
    • Component/s: ec, hdfs-client
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

   When writing an EC file, if the client's addBlock() call for the second (or any subsequent) block group happens to exceed the space quota, the client program hangs forever.
      See the demo below:

      $ hadoop fs -mkdir -p /user/wanghongbing/quota/ec
      $ hdfs dfsadmin -setSpaceQuota 2g /user/wanghongbing/quota
      $ hdfs ec -setPolicy -path /user/wanghongbing/quota/ec -policy RS-6-3-1024k
      Set RS-6-3-1024k erasure coding policy on /user/wanghongbing/quota/ec
      $ hadoop fs -put 800m /user/wanghongbing/quota/ec
      ^@^@^@^@^@^@^@^@^Z
      

      In the case of blocksize=128M, spaceQuota=2g and the RS-6-3 policy, a block group must allocate 1152M of physical space to write 768M of logical data. Writing 800M of data therefore exceeds the quota when the second block group is requested, and at that point the client hangs forever.
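      To make the arithmetic concrete, here is a small, self-contained sketch (illustrative only, not code from the patch); the constants mirror the numbers above:

      // Physical footprint of one full RS-6-3-1024k block group at the
      // default 128M block size, per the description above.
      public class EcQuotaMath {
        public static void main(String[] args) {
          final long blockSizeMb = 128;  // dfs.blocksize = 128M
          final int dataUnits = 6;       // RS-6-3: six data blocks
          final int parityUnits = 3;     // RS-6-3: three parity blocks

          long logicalMb = dataUnits * blockSizeMb;                  // 768M of user data
          long physicalMb = (dataUnits + parityUnits) * blockSizeMb; // 1152M charged to quota

          // One block group fits under the 2g (2048M) quota, but a second one
          // would bring the total to 2 * 1152M = 2304M, so the second
          // addBlock() call trips the quota check.
          System.out.printf("logical=%dM, physical=%dM%n", logicalMb, physicalMb);
        }
      }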

      The hung client's thread stack is as follows:

      java.lang.Thread.State: TIMED_WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x000000008009d5d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
              at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
              at org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.takeWithTimeout(DFSStripedOutputStream.java:117)
              at org.apache.hadoop.hdfs.DFSStripedOutputStream.waitEndBlocks(DFSStripedOutputStream.java:453)
              at org.apache.hadoop.hdfs.DFSStripedOutputStream.allocateNewBlock(DFSStripedOutputStream.java:477)
              at org.apache.hadoop.hdfs.DFSStripedOutputStream.writeChunk(DFSStripedOutputStream.java:541)
              - locked <0x000000008009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
              at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217)
              at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:164)
              - locked <0x000000008009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
              at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:145)
              - locked <0x000000008009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
              at org.apache.hadoop.hdfs.DFSStripedOutputStream.closeImpl(DFSStripedOutputStream.java:1182)
              - locked <0x000000008009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
              at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:847)
              - locked <0x000000008009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
              at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
              at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
              at org.apache.hadoop.io.IOUtils.cleanupWithLogger(IOUtils.java:280)
              at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:298)
              at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:77)
              at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
              at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:485)
              at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:407)
              at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:342)
              at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:277)
              at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:262)
      

      When an exception occurs in addBlock(), closing the stream calls DFSStripedOutputStream.closeImpl() -> flushBuffer() -> writeChunk() -> allocateNewBlock() -> waitEndBlocks(), and waitEndBlocks() enters an infinite loop because the coordinator's endBlocks queue is empty.

      private void waitEndBlocks(int i) throws IOException {
        // addBlock() failed before any streamer could fail, so the streamer
        // still reports healthy and this condition never becomes false.
        while (getStripedDataStreamer(i).isHealthy()) {
          // Nothing enqueues an ended block after the addBlock() failure, so
          // takeWithTimeout() returns null on every iteration and the loop spins.
          final ExtendedBlock b = coordinator.endBlocks.takeWithTimeout(i);
          if (b != null) {
            StripedBlockUtil.checkBlocks(currentBlockGroup, i, b);
            return;
          }
        }
      }
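
      For context, the stack trace shows takeWithTimeout() parked inside LinkedBlockingQueue.poll(), i.e. a timed poll of the per-streamer queue. A behavioral sketch of that contract (class shape and timeout value are assumptions inferred from the stack trace, not the verbatim Hadoop source):

      import java.util.concurrent.LinkedBlockingQueue;
      import java.util.concurrent.TimeUnit;

      // Sketch of the MultipleBlockingQueue.takeWithTimeout() behavior implied
      // by the stack trace: a timed poll that returns null on timeout.
      class TimedPollSketch<T> {
        private final LinkedBlockingQueue<T> queue = new LinkedBlockingQueue<>();

        T takeWithTimeout() throws InterruptedException {
          // Returns null when nothing arrives in the window, so a caller that
          // loops on "streamer is healthy" simply polls again -- forever, if
          // the queue is never fed and the streamer never turns unhealthy.
          return queue.poll(100, TimeUnit.MILLISECONDS);  // timeout value illustrative
        }
      }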
      

      So, to fix it, I close all the striped data streamers when an exception occurs in addBlock(), which flips isHealthy() to false and lets waitEndBlocks() exit.
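      A minimal sketch of that approach (the call site in allocateNewBlock() and the closeAllStreamers() helper name are assumptions here; see the attached patches for the actual change):

      // Inside DFSStripedOutputStream.allocateNewBlock() -- sketch, not the
      // committed diff. If addBlock() throws (e.g. the quota check fails),
      // close every streamer so isHealthy() turns false and waitEndBlocks()
      // stops polling the empty endBlocks queue.
      LocatedBlock lb;
      try {
        lb = addBlock(excludedNodes, dfsClient, src, currentBlockGroup,
            fileId, favoredNodes, getAddBlockFlags());
      } catch (IOException ioe) {
        closeAllStreamers();  // assumed helper: streamer.close(true) for each streamer
        throw ioe;
      }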

       

      Attachments

        1. HDFS-15398.001.patch
          1 kB
          Hongbing Wang
        2. HDFS-15398.002.patch
          4 kB
          Hongbing Wang
        3. HDFS-15398.003.patch
          3 kB
          Hongbing Wang
        4. HDFS-15398.004.patch
          3 kB
          Hongbing Wang


            People

              Assignee: Hongbing Wang (wanghongbing)
              Reporter: Hongbing Wang (wanghongbing)
              Votes: 0
              Watchers: 6
