Hadoop HDFS / HDFS-13768

Adding replicas to volume map makes DataNode start slowly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 2.10.0, 3.2.0, 3.1.2
    • Component/s: None
    • Labels:
      None

      Description

      We found that DataNodes (DNs) start very slowly during a rolling upgrade of our cluster. After a restart, a DN takes a long time to come up and does not register with the NameNode (NN) promptly. This causes many errors like the following:

      DataXceiver error processing WRITE_BLOCK operation  src: /xx.xx.xx.xx:64360 dst: /xx.xx.xx.xx:50010
      java.io.IOException: Not ready to serve the block pool, BP-1508644862-xx.xx.xx.xx-1493781183457.
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAndWaitForBP(DataXceiver.java:1290)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAccess(DataXceiver.java:1298)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:630)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
              at java.lang.Thread.run(Thread.java:745)
      

      Looking into the DN startup logic, the DN initializes its block pools before it registers with the NN. During block pool initialization, we found that adding replicas to the volume map is the most expensive operation. Related log:

      2018-07-26 10:46:23,771 INFO [Thread-105] org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to add replicas to map for block pool BP-1508644862-xx.xx.xx.xx-1493781183457 on volume /home/hard_disk/1/dfs/dn/current: 242722ms
      2018-07-26 10:46:26,231 INFO [Thread-109] org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to add replicas to map for block pool BP-1508644862-xx.xx.xx.xx-1493781183457 on volume /home/hard_disk/5/dfs/dn/current: 245182ms
      2018-07-26 10:46:32,146 INFO [Thread-112] org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to add replicas to map for block pool BP-1508644862-xx.xx.xx.xx-1493781183457 on volume /home/hard_disk/8/dfs/dn/current: 251097ms
      2018-07-26 10:47:08,283 INFO [Thread-106] org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to add replicas to map for block pool BP-1508644862-xx.xx.xx.xx-1493781183457 on volume /home/hard_disk/2/dfs/dn/current: 287235ms
      

      Currently the DN uses an independent thread per volume to scan and add replicas, but startup still has to wait for the slowest thread to finish its work. So the main question here is how to make each of these threads run faster.
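
      The effect described above can be illustrated with a small model. This is only a sketch and not the FsVolumeList code: the class name PerVolumeScanModel, the helpers scanAll and simulatedScan, and the sleep durations are all hypothetical. It starts one thread per volume, joins them all, and shows that the total wall time is bounded below by the slowest volume.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of "one scan thread per volume"; not Hadoop code.
public class PerVolumeScanModel {

  /** Returns a stand-in for a volume scan that simply sleeps for the given time. */
  static Runnable simulatedScan(long millis) {
    return () -> {
      try {
        Thread.sleep(millis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
  }

  /**
   * Runs each scan on its own thread and waits for all of them.
   * Per-volume elapsed nanoseconds are stored in perVolumeNanos;
   * the return value is the total wall time in nanoseconds.
   */
  static long scanAll(Map<String, Runnable> scans, Map<String, Long> perVolumeNanos) {
    long start = System.nanoTime();
    List<Thread> threads = new ArrayList<>();
    for (Map.Entry<String, Runnable> e : scans.entrySet()) {
      Thread t = new Thread(() -> {
        long begin = System.nanoTime();
        e.getValue().run();
        perVolumeNanos.put(e.getKey(), System.nanoTime() - begin);
      });
      threads.add(t);
      t.start();
    }
    for (Thread t : threads) {
      try {
        t.join(); // startup cannot continue until the slowest volume is done
      } catch (InterruptedException ex) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for scans", ex);
      }
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) {
    Map<String, Runnable> scans = new LinkedHashMap<>();
    scans.put("/volume/1", simulatedScan(20)); // made-up durations
    scans.put("/volume/2", simulatedScan(40));
    scans.put("/volume/3", simulatedScan(80));
    Map<String, Long> perVolume = new ConcurrentHashMap<>();
    long wall = scanAll(scans, perVolume);
    perVolume.forEach((v, n) -> System.out.println(v + ": " + n / 1_000_000 + "ms"));
    System.out.println("total: " + wall / 1_000_000 + "ms");
  }
}
```

      In this model, adding more volume threads does not help once each volume already has one; only making the scan inside a single volume faster reduces the total.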

      The jstack output captured while the DN was blocked adding replicas:

      "Thread-113" #419 daemon prio=5 os_prio=0 tid=0x00007f40879ff000 nid=0x145da runnable [0x00007f4043a38000]
         java.lang.Thread.State: RUNNABLE
      	at java.io.UnixFileSystem.list(Native Method)
      	at java.io.File.list(File.java:1122)
      	at java.io.File.listFiles(File.java:1207)
      	at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1165)
      	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:445)
      	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448)
      	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448)
      	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.getVolumeMap(BlockPoolSlice.java:342)
      	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getVolumeMap(FsVolumeImpl.java:864)
      	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$1.run(FsVolumeList.java:191)
      

      One possible improvement is to use a ForkJoinPool to run this recursive task in parallel, rather than synchronously. This would be a significant improvement because it could greatly speed up the recovery process.
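
      A minimal sketch of the ForkJoinPool idea follows. It is not the attached patch: the class ParallelReplicaScan, the ScanTask type, and the helpers scan and demoCount are hypothetical, and treating every file whose name starts with "blk_" and does not end with ".meta" as a block file is a simplification. Each task lists one directory and forks a subtask per subdirectory, so sibling directories are listed in parallel instead of one after another.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical illustration of a recursive directory scan on a ForkJoinPool.
public class ParallelReplicaScan {

  /** Lists one directory and forks one subtask per subdirectory. */
  static class ScanTask extends RecursiveAction {
    private final File dir;
    private final Queue<File> found; // thread-safe sink shared by all tasks

    ScanTask(File dir, Queue<File> found) {
      this.dir = dir;
      this.found = found;
    }

    @Override
    protected void compute() {
      File[] entries = dir.listFiles();
      if (entries == null) {
        return; // not a directory, or an I/O error while listing
      }
      List<ScanTask> subTasks = new ArrayList<>();
      for (File entry : entries) {
        if (entry.isDirectory()) {
          subTasks.add(new ScanTask(entry, found));
        } else if (entry.getName().startsWith("blk_")
            && !entry.getName().endsWith(".meta")) {
          found.add(entry); // simplified stand-in for adding a replica to the map
        }
      }
      invokeAll(subTasks); // run the subdirectory scans in parallel and wait
    }
  }

  /** Scans the tree under root with the given parallelism and returns block files. */
  static List<File> scan(File root, int parallelism) {
    Queue<File> found = new ConcurrentLinkedQueue<>();
    ForkJoinPool pool = new ForkJoinPool(parallelism);
    try {
      pool.invoke(new ScanTask(root, found));
    } finally {
      pool.shutdown();
    }
    return new ArrayList<>(found);
  }

  /** Builds a tiny temporary tree with two block files and one meta file. */
  static int demoCount() {
    try {
      Path root = Files.createTempDirectory("finalized");
      Path sub = Files.createDirectories(root.resolve("subdir0").resolve("subdir1"));
      Files.createFile(root.resolve("blk_1"));
      Files.createFile(sub.resolve("blk_2"));
      Files.createFile(sub.resolve("blk_2_1001.meta"));
      return scan(root.toFile(), 4).size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("block files found: " + demoCount());
  }
}
```

      Whether this helps in practice depends on how well the underlying disk handles concurrent directory listings, so the parallelism would likely need to be configurable.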

        Attachments

        1. HDFS-13768.01.patch
          9 kB
          Surendra Singh Lilhore
        2. HDFS-13768.01-branch-2.patch
          25 kB
          Surendra Singh Lilhore
        3. HDFS-13768.02.patch
          9 kB
          Surendra Singh Lilhore
        4. HDFS-13768.03.patch
          21 kB
          Surendra Singh Lilhore
        5. HDFS-13768.04.patch
          21 kB
          Surendra Singh Lilhore
        6. HDFS-13768.05.patch
          23 kB
          Surendra Singh Lilhore
        7. HDFS-13768.06.patch
          23 kB
          Surendra Singh Lilhore
        8. HDFS-13768.07.patch
          23 kB
          Yiqun Lin
        9. HDFS-13768.patch
          4 kB
          Ranith Sardar
        10. HDFS-13768-branch-2.01.patch
          25 kB
          Yiqun Lin
        11. HDFS-13768-branch-2.02.patch
          25 kB
          Surendra Singh Lilhore
        12. HDFS-13768-branch-2.03.patch
          25 kB
          Surendra Singh Lilhore
        13. screenshot-1.png
          32 kB
          Surendra Singh Lilhore

              People

              • Assignee:
                surendrasingh Surendra Singh Lilhore
                Reporter:
                linyiqun Yiqun Lin
              • Votes:
                0
                Watchers:
                12
