Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.2
    • Fix Version/s: None
    • Component/s: datanode
    • Labels:
      None
    • Tags:
      datanode startup, volume parallel, hard links

      Description

      One of the factors slowing down cluster restart is the startup time of the Datanodes. In particular, if an Upgrade is needed, each Datanode must take a Snapshot, which can take 5-15 minutes per volume, and the volumes are processed serially. Thus, for a 4-disk Datanode, it may be 45 minutes before it is ready to send its initial Block Report to the Namenode. This is an umbrella bug for the following four pieces of work to improve Datanode startup time:

      1. Batch the calls in DataStorage to FileUtil.createHardLink(), so that it is called once per directory instead of once per file. This is the biggest villain, responsible for 90% of that 45-minute delay. See the subordinate bug for details, and the batching sketch after this list.

      2. Refactor the Upgrade process in DataStorage to run volume-parallel. There is already a bug open for this, HDFS-270, and the volume-parallel work in DirectoryScanner from HDFS-854 is a good foundation to build on; a thread-per-volume sketch follows the list.

      3. Refactor the FSDir() and getVolumeMap() call chains in FSDataset so that they share data and run volume-parallel. Currently, the two constructors for the in-memory directory tree and the replicas map run THREE full scans of the entire disk: once in FSDir(), once in recoverTempUnlinkedBlock(), and once in addToReplicasMap(). During each scan, a new File object is created for each of the roughly 100,000 items in the native file system (for a 50,000-block node), which hurts GC as well as disk traffic. A single-pass sketch follows the list.

      4. Make getGenerationStampFromFile() more efficient. Currently this routine is called by addToReplicasMap() for every block file in the directory tree, and it walks the listing of the file's containing directory on every call, creating still more unnecessary File objects. A simple refactoring, sketched below, makes the per-call walk unnecessary.
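
      A minimal sketch of the batching idea in item 1, assuming a hypothetical batched helper (createHardLinkMult below is not the current FileUtil API); the existing per-file pattern forks one "ln" process per file, on the order of 100,000 forks for a 50,000-block node:

        import java.io.File;
        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;

        public class BatchedHardLink {

          // Existing pattern, one fork/exec per file:
          //   for (String name : fromDir.list()) {
          //     FileUtil.createHardLink(new File(fromDir, name),
          //                             new File(toDir, name));
          //   }

          // Hypothetical batched helper: "ln TARGET... DIRECTORY" links every
          // TARGET into DIRECTORY in one fork/exec per directory. A real
          // implementation would chunk the argument list to stay under the
          // OS command-line length limit.
          static void createHardLinkMult(File fromDir, String[] names, File toDir)
              throws IOException, InterruptedException {
            List<String> cmd = new ArrayList<String>();
            cmd.add("ln");
            for (String name : names) {
              cmd.add(new File(fromDir, name).getAbsolutePath());
            }
            cmd.add(toDir.getAbsolutePath());
            Process p = new ProcessBuilder(cmd).start();
            if (p.waitFor() != 0) {
              throw new IOException("ln failed for directory " + fromDir);
            }
          }
        }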
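
      A sketch of the volume-parallel Upgrade in item 2, in the spirit of the DirectoryScanner work in HDFS-854. doUpgradeOneVolume() is a stand-in for the body of the existing serial per-volume loop in DataStorage, not a real method name:

        import java.io.File;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        public class ParallelUpgrade {

          // Placeholder for the existing per-volume snapshot + hard-link work.
          static void doUpgradeOneVolume(File volume) {
          }

          // One worker thread per volume, so the 5-15 minute snapshots overlap
          // across disks instead of accumulating serially.
          static void upgradeAllVolumes(List<File> volumes) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(volumes.size());
            try {
              List<Future<?>> results = new ArrayList<Future<?>>();
              for (final File vol : volumes) {
                results.add(pool.submit(new Runnable() {
                  public void run() {
                    doUpgradeOneVolume(vol);
                  }
                }));
              }
              for (Future<?> f : results) {
                f.get();  // wait for every volume; failures surface as ExecutionException
              }
            } finally {
              pool.shutdown();
            }
          }
        }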
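
      A sketch of the single-pass scan for item 3. FSDirNode and the helper predicates are placeholders shaped after the names in the description; the point is that each directory is listed exactly once, and the same File objects feed the in-memory tree, temp-block recovery, and the replicas map:

        import java.io.File;
        import java.util.Map;

        public class SingleScan {

          static void scanOnce(File dir, FSDirNode node,
                               Map<Long, Object> replicas) {
            File[] entries = dir.listFiles();  // the only listing of this directory
            if (entries == null) {
              return;
            }
            for (File f : entries) {
              if (f.isDirectory()) {
                scanOnce(f, node.addChild(f), replicas);
              } else if (isUnlinkedTmp(f)) {
                recoverTempUnlinkedBlock(f);             // was its own full scan
              } else if (isBlockFile(f)) {
                addToReplicasMap(f, entries, replicas);  // was its own full scan
              }
            }
          }

          // ---- placeholders standing in for the existing FSDataset machinery ----
          static class FSDirNode {
            FSDirNode addChild(File d) { return new FSDirNode(); }
          }
          static boolean isUnlinkedTmp(File f) {
            return f.getName().endsWith(".unlinked");
          }
          static boolean isBlockFile(File f) {
            return f.getName().startsWith("blk_") && !f.getName().endsWith(".meta");
          }
          static void recoverTempUnlinkedBlock(File f) { }
          static void addToReplicasMap(File f, File[] siblings,
                                       Map<Long, Object> replicas) { }
        }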
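
      A sketch of the getGenerationStampFromFile() refactoring in item 4. The signature is an assumption; the idea is that addToReplicasMap() already holds the File[] listing of the directory it is scanning, so one pass over that listing can map every block to its generation stamp, replacing the per-blockfile directory walk with an O(1) lookup:

        import java.io.File;
        import java.util.HashMap;
        import java.util.Map;

        public class GenStampLookup {

          // Build the block-name -> generation-stamp map once per directory,
          // from the listing the caller already has. Meta files are named
          // blk_<blockId>_<genStamp>.meta.
          static Map<String, Long> mapGenerationStamps(File[] listing) {
            Map<String, Long> stamps = new HashMap<String, Long>();
            for (File f : listing) {
              String name = f.getName();
              if (name.startsWith("blk_") && name.endsWith(".meta")) {
                int us = name.lastIndexOf('_');
                String blockName = name.substring(0, us);  // e.g. blk_123
                long genStamp = Long.parseLong(name.substring(
                    us + 1, name.length() - ".meta".length()));
                stamps.put(blockName, genStamp);
              }
            }
            return stamps;
          }
        }

      With this, addToReplicasMap() would call mapGenerationStamps() once per directory and resolve each block file by name, instead of re-walking the listing (and allocating fresh File objects) on every call.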

        Activity

        Matt Foley created issue -
        Matt Foley made changes -
        Field: Description
        Nigel Daley made changes -
        Fix Version/s 0.22.0 [ 12314241 ]

          People

          • Assignee:
            Matt Foley
          • Reporter:
            Matt Foley
          • Votes:
            0
          • Watchers:
            14
