  1. Hadoop Common
  2. HADOOP-1170

Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.2
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None

    Description

      While investigating performance issues in our Hadoop DFS/MapReduce cluster, I saw very high CPU usage by the DataNode processes.

      Stack traces showed the following on most of the data nodes:
      "org.apache.hadoop.dfs.DataNode$DataXceiveServer@528acf6e" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable [0x000000004166a000..0x000000004166ac00]
      at java.io.UnixFileSystem.checkAccess(Native Method)
      at java.io.File.canRead(File.java:660)
      at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
      at org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
      at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
      - locked <0x00002aaab6fb8960> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
      at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
      at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
      at java.lang.Thread.run(Thread.java:595)

      I understand that checking the entire data directory takes a while, since we have some 180,000 blocks/files in there. What really bothers me is that, from the code, this check is executed for every client connection to the DataNode, which also means for every task executed in the cluster. Once I commented out the check and restarted the datanodes, performance went up and CPU usage dropped to a reasonable level.
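
One possible way to keep a directory check without paying for it on every connection is to rate-limit it. The sketch below is only an illustration under that assumption and is not taken from the attached patches; `ThrottledCheck`, `maybeRun`, and the interval parameter are hypothetical names that do not exist in Hadoop.

```java
/**
 * Hypothetical helper (not part of Hadoop): wraps an expensive check,
 * such as a full data directory scan, so that it runs at most once
 * per configured interval no matter how often it is requested.
 */
public final class ThrottledCheck {
    private final Runnable expensiveCheck;
    private final long minIntervalMillis;
    private boolean hasRun = false;
    private long lastRunMillis = 0L;

    public ThrottledCheck(Runnable expensiveCheck, long minIntervalMillis) {
        this.expensiveCheck = expensiveCheck;
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Runs the wrapped check only if it has never run, or if at least
     * minIntervalMillis have passed since the last run.
     *
     * @param nowMillis current time, passed in so callers and tests control the clock
     * @return true if the check actually ran, false if it was skipped
     */
    public synchronized boolean maybeRun(long nowMillis) {
        if (hasRun && nowMillis - lastRunMillis < minIntervalMillis) {
            return false; // checked recently enough; skip the expensive scan
        }
        expensiveCheck.run();
        hasRun = true;
        lastRunMillis = nowMillis;
        return true;
    }
}
```

In a connection-accepting loop, a caller could invoke `maybeRun(System.currentTimeMillis())` in place of an unconditional check, so the cost of the scan is bounded by the interval rather than by the connection rate.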

      Attachments

        1. 1170.patch
          0.6 kB
          Igor Bolotin
        2. 1170-v2.patch
          2 kB
          Igor Bolotin


            People

              Assignee: Unassigned
              Reporter: Igor Bolotin (ibolotin)