  1. Hadoop HDFS
  2. HDFS-1687

HDFS Federation: DirectoryScanner changes for federation

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Federation Branch
    • Fix Version/s: Federation Branch
    • Component/s: datanode
    • Labels: None

    Description

      DirectoryScanner scans the entire directory tree of each volume. It needs to be extended to work with block pools in Federation.

      Design notes:

      1. Only the subdirectories of active bpids will be scanned. Active bpids are those associated with currently connected Namenodes. Each volume knows the set of all active bpids, via volume.map.keySet(). I'll add a package-private accessor in FSVolume that returns the set of active bpids, for use by DirectoryScanner, DataBlockScanner, etc. DirectoryScanner will ignore the subdirectories of inactive bpids; see item 3 below.

      2. There is no need to compare each volume's set of active bpids with the global set because, given the way the code works, they cannot really differ. If differences do arise, they will be fixed automatically by the next restart of either the Datanode or the Namenode.

      3. Inactive bpids will be ignored. Until we are connected to the owning Namenode, we cannot know whether a bpid subdirectory is correctly formatted, has snapshot data, etc., so it doesn't make sense to try to manage the data under an inactive bpid.

      4. DirectoryScanner is currently instantiated and periodically triggered by DataBlockScanner. Other than both being "scanners", these two modules have little in common, and the triggering code is confusing. (DirectoryScanner scans filesystem directory trees every hour, to detect and fix inconsistencies between disk directories and ReplicasMap. DataBlockScanner runs every 3 weeks, and traverses all block files, actually reading them out and checksumming them to detect block corruption.)

      Separating them, and running DirectoryScanner under its own periodic scheduler, is a small change that will make the code much clearer. It already runs on its own FixedThreadPool Executor, so it is easy to change it to a ScheduledThreadPool, and instantiate it from DataNode.postStartInit() at the same time as initBlockScanner() is called.
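
The design notes above can be sketched roughly as follows. This is a minimal illustration and not the code in the attached patch: `DirectoryScanSketch`, `Volume`, `Scanner`, `getActiveBpids()`, and `scanOnce()` are hypothetical names, and the directory paths are placeholders. It shows a volume exposing a snapshot of its active bpids through a package-private accessor, and a scanner that runs on its own scheduled executor and visits only the subdirectories of active bpids.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; these are stand-ins, not the real HDFS classes. */
class DirectoryScanSketch {

    /** Stand-in for FSVolume: maps each active bpid to its subdirectory. */
    static class Volume {
        private final Map<String, String> map = new ConcurrentHashMap<>();

        void addBlockPool(String bpid, String dir) { map.put(bpid, dir); }

        void removeBlockPool(String bpid) { map.remove(bpid); }

        /** Package-private accessor: a sorted, read-only snapshot of the active bpids. */
        Set<String> getActiveBpids() {
            return Collections.unmodifiableSet(new TreeSet<>(map.keySet()));
        }

        String getDir(String bpid) { return map.get(bpid); }
    }

    /** Stand-in for DirectoryScanner, driven by its own periodic scheduler. */
    static class Scanner implements Runnable {
        private final List<Volume> volumes;
        private final ScheduledExecutorService scheduler =
                Executors.newScheduledThreadPool(1);

        Scanner(List<Volume> volumes) { this.volumes = volumes; }

        /** Starts periodic scanning, independent of any other scanner. */
        void start(long period, TimeUnit unit) {
            scheduler.scheduleAtFixedRate(this, 0, period, unit);
        }

        void shutdown() { scheduler.shutdownNow(); }

        /** Returns the directories one pass would visit: active bpids only. */
        List<String> scanOnce() {
            List<String> visited = new ArrayList<>();
            for (Volume v : volumes) {
                for (String bpid : v.getActiveBpids()) {
                    String dir = v.getDir(bpid);
                    if (dir != null) { // bpid may have gone inactive since the snapshot
                        visited.add(dir);
                    }
                }
            }
            return visited;
        }

        @Override
        public void run() {
            // A real scanner would reconcile on-disk state with the replica map here.
            scanOnce();
        }
    }
}
```

Under these assumptions, the caller would construct the scanner once during Datanode startup and call something like `start(1, TimeUnit.HOURS)`. Subdirectories of bpids that are not in a volume's map are never visited, which matches item 3.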

      Attachments

        1. HDFS-1687_DirScan_v1.patch
          34 kB
          Matthew Foley

          People

            Assignee: Matthew Foley (mattf)
            Reporter: Matthew Foley (mattf)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: