Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-11070

NPE in BlockSender due to race condition

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0-alpha2
    • Fix Version/s: None
    • Component/s: datanode
    • Labels:
      None

      Description

      Saw the following NPE in a unit test:

      2016-10-27 14:42:58,450 ERROR DataNode - 127.0.0.1:51987:DataXceiver error processing READ_BLOCK operation src: /127.0.0.1:52429 dst: /127.0.0.1:51987
      java.lang.NullPointerException
      at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:284)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:581)
      at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:150)
      at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:102)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:289)
      at java.lang.Thread.run(Thread.java:745)

      The NPE occurred here:

      BlockSender.<init>
            // Obtain a reference before reading data
            this.volumeRef = datanode.data.getVolume(block).obtainReference();
      

      Right before the NPE was a few lines of debug message that indicated the replica was appended and updated.

      2016-10-27 14:42:58,442 DEBUG DataNode - block=BP-1071315328-172.16.1.88-1477604513635:blk_1073741825_1192, replica=FinalizedReplica, blk_1073741825_1192, FINALIZED
      getNumBytes() = 192
      getBytesOnDisk() = 192
      getVisibleLength()= 192
      getVolume() = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
      getBlockURI() = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825
      2016-10-27 14:42:58,442 INFO FsDatasetImpl - Appending to FinalizedReplica, blk_1073741825_1192, FINALIZED
      getNumBytes() = 192
      getBytesOnDisk() = 192
      getVisibleLength()= 192
      getVolume() = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
      getBlockURI() = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825
      2016-10-27 14:42:58,442 DEBUG FsDatasetCache - Block with id 1073741825, pool BP-1071315328-172.16.1.88-1477604513635 does not need to be uncached, because it is not currently in the mappableBlockMap.
      2016-10-27 14:42:58,450 DEBUG LocalReplica - Renaming /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825_1192.meta to /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825_1193.meta
      2016-10-27 14:42:58,450 DEBUG LocalReplica - Renaming /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825 to /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825, file length=192
      2016-10-27 14:42:58,450 DEBUG DataNode - writeTo blockfile is /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825 of size 192
      2016-10-27 14:42:58,450 DEBUG DataNode - writeTo metafile is /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825_1193.meta of size 11

      The block object's genstamp should have been the same as the ondisk replica. However, the log suggests the replica's genstamp may have been updated after:

      BlockSender.<init>
            try(AutoCloseableLock lock = datanode.data.acquireDatasetLock()) {
              replica = getReplica(block, datanode);
              replicaVisibleLength = replica.getVisibleLength();
            }
            // if there is a write in progress
            ChunkChecksum chunkChecksum = null;
            if (replica.getState() == ReplicaState.RBW) {
              final ReplicaInPipeline rbw = (ReplicaInPipeline) replica;
              waitForMinLength(rbw, startOffset + length);
              chunkChecksum = rbw.getLastChecksumAndDataLen();
            }
      

      In summary, I think the assumption here is not valid, because a write-in-progress may happen after that check.

        Attachments

        1. HDFS-11070.001.patch
          0.9 kB
          Wei-Chiu Chuang

          Issue Links

            Activity

              People

              • Assignee:
                weichiu Wei-Chiu Chuang
                Reporter:
                weichiu Wei-Chiu Chuang
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated: