  1. Hadoop HDFS
  2. HDFS-14736

DataNode fails to start when a subdirectory in the data directory is corrupted

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.2
    • Fix Version/s: None
    • Component/s: datanode
    • Labels:
      None

      Description

      If a subdirectory in a DataNode data directory is corrupted for some reason, for example after a sudden power failure in the machine room, the DataNode fails to restart. The DataNode log shows the following error:

      datanode log:

      2019-08-09 10:01:06,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data06/block/current...
      2019-08-09 10:01:06,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data07/block/current...
      2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data08/block/current...
      2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data09/block/current...
      2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data10/block/current...
      2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data11/block/current...
      2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data12/block/current...
      2019-08-09 10:01:06,707 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught exception while scanning /data05/block/current. Will throw later.
      java.io.IOException: Mkdirs failed to create /data05/block/current/BP-518068284-10.252.12.3-1523416911512/tmp
          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:138)
          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:837)
          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:406)
      2019-08-09 10:01:15,330 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data06/block/current: 8627ms
      2019-08-09 10:01:15,348 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data11/block/current: 8645ms
      2019-08-09 10:01:15,352 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data01/block/current: 8649ms
      2019-08-09 10:01:15,361 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data12/block/current: 8658ms
      2019-08-09 10:01:15,362 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data03/block/current: 8659ms

       

      I traced the code for the whole startup path and found some odd logic in DataNode and FsVolumeImpl, shown below:

      // DataNode
      void initBlockPool(BPOfferService bpos) throws IOException {
        NamespaceInfo nsInfo = bpos.getNamespaceInfo();
        if (nsInfo == null) {
          throw new IOException("NamespaceInfo not found: Block pool " + bpos
              + " should have retrieved namespace info before initBlockPool.");
        }
        
        setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());
      
        // Register the new block pool with the BP manager.
        blockPoolManager.addBlockPool(bpos);
        
        // In the case that this is the first block pool to connect, initialize
        // the dataset, block scanners, etc.
        initStorage(nsInfo);
      
        // Exclude failed disks before initializing the block pools to avoid startup
        // failures.
        checkDiskError();
      
        data.addBlockPool(nsInfo.getBlockPoolID(), conf);
        blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
        initDirectoryScanner(conf);
      }
      
      // FsVolumeImpl
      void checkDirs() throws DiskErrorException {
        // TODO:FEDERATION valid synchronization
        for(BlockPoolSlice s : bpSlices.values()) {
          s.checkDirs();
        }
      }

      When the DataNode restarts, BPServiceActor invokes initBlockPool to initialize the storage for the block pool. initBlockPool runs the disk check (checkDiskError, which ends up in checkDirs) before data.addBlockPool. However, I found that bpSlices is still empty when checkDirs executes, so the loop checks nothing. To understand why, I looked at where bpSlices is populated:

      // FsVolumeImpl
      void addBlockPool(String bpid, Configuration conf) throws IOException {
        File bpdir = new File(currentDir, bpid);
        BlockPoolSlice bp = new BlockPoolSlice(bpid, this, bpdir, conf);
        bpSlices.put(bpid, bp);
      }
      

      As you can see, bpSlices is only populated by addBlockPool, which runs after checkDirs. bpSlices is therefore empty during the check, and the corrupted directory is not detected. addBlockPool then throws java.io.IOException: Mkdirs failed to create /data05/block/current/BP-518068284-10.252.12.3-1523416911512/tmp, and the DataNode fails to restart.
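
The ordering problem described above can be reproduced in isolation. The following is a minimal sketch, not Hadoop code: StartupOrderSketch, VolumeSketch, simulate and the block pool ID "BP-example" are hypothetical names, and the damaged tmp entry is only simulated by placing a regular file at the tmp path (a real I/O error cannot be produced portably). It shows that a check iterating an empty map passes without checking anything, and that the failure only surfaces in the later add step.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

class StartupOrderSketch {

  /** Simplified stand-in for a volume; it is NOT the real FsVolumeImpl. */
  static class VolumeSketch {
    private final File currentDir;
    // Plays the role of bpSlices: block pool ID -> block pool directory.
    private final Map<String, File> bpSlices = new HashMap<>();

    VolumeSketch(File currentDir) {
      this.currentDir = currentDir;
    }

    /** Checks only the registered slices and returns how many were checked. */
    int checkDirs() throws IOException {
      int checked = 0;
      for (File bpDir : bpSlices.values()) {
        if (!bpDir.isDirectory()) {
          throw new IOException("Not a directory: " + bpDir);
        }
        checked++;
      }
      return checked;
    }

    /** Creates the slice's tmp directory, then registers the slice. */
    void addBlockPool(String bpid) throws IOException {
      File bpDir = new File(currentDir, bpid);
      File tmpDir = new File(bpDir, "tmp");
      if (!tmpDir.mkdirs() && !tmpDir.isDirectory()) {
        throw new IOException("Mkdirs failed to create " + tmpDir);
      }
      bpSlices.put(bpid, bpDir);
    }
  }

  /** Runs check-then-add against a volume whose tmp path is unusable. */
  static String simulate() {
    try {
      File current = Files.createTempDirectory("volume-sketch").toFile();
      File bpDir = new File(current, "BP-example"); // hypothetical block pool ID
      File tmp = new File(bpDir, "tmp");
      // Clean up on JVM exit (entries are deleted in reverse registration order).
      current.deleteOnExit();
      bpDir.deleteOnExit();
      tmp.deleteOnExit();
      // Assumption: a regular file at the tmp path stands in for the damaged entry.
      if (!bpDir.mkdirs() || !tmp.createNewFile()) {
        throw new IOException("Could not set up " + tmp);
      }

      VolumeSketch volume = new VolumeSketch(current);
      int checked = volume.checkDirs(); // bpSlices is still empty here
      try {
        volume.addBlockPool("BP-example");
        return "checked=" + checked + ", addBlockPool succeeded";
      } catch (IOException e) {
        return "checked=" + checked + ", addBlockPool failed: " + e.getMessage();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(simulate());
  }
}
```

In this sketch the check reports zero slices examined, and the "Mkdirs failed to create" error appears only once addBlockPool runs, which matches the order of events in the log.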

      For example, the tmp directory was corrupted, and ls reported the following:

      ls: cannot access tmp: Input/output error
      total 0
      d????????? ? ? ? ? ? tmp
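
For illustration, a directory in the state shown above could be probed from Java roughly as follows. This is a sketch under assumptions: SubdirProbeSketch, isUsableDir and findUnusable are hypothetical helpers, not part of Hadoop. It relies on File.isDirectory() returning false when the entry cannot be examined and on File.list() returning null on an I/O error, so an entry like the tmp directory above would most likely be reported as unusable. A subdirectory that is simply missing is reported as well.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class SubdirProbeSketch {

  /** True only if the path is a directory whose entries can be listed. */
  static boolean isUsableDir(File dir) {
    // File.list() returns null if the path is not a directory
    // or if an I/O error occurs while reading it.
    return dir.isDirectory() && dir.list() != null;
  }

  /** Returns the given subdirectory names that are missing or not usable. */
  static List<String> findUnusable(File bpDir, String... names) {
    List<String> bad = new ArrayList<>();
    for (String name : names) {
      if (!isUsableDir(new File(bpDir, name))) {
        bad.add(name);
      }
    }
    return bad;
  }

  public static void main(String[] args) {
    // Usage: java SubdirProbeSketch <block pool directory>
    File bpDir = new File(args.length > 0 ? args[0] : ".");
    System.out.println(findUnusable(bpDir, "tmp"));
  }
}
```

A probe like this looks at the directories on disk directly, so it does not depend on bpSlices having been populated first.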

            People

            • Assignee:
              alexking_lee liying
              Reporter:
              alexking_lee liying