HADOOP-50: dfs datanode should store blocks in multiple directories

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.6.0
    • Component/s: None
    • Labels: None

      Description

      The datanode currently stores all file blocks in a single directory. With 32MB blocks and terabyte filesystems, this will create too many files in a single directory for many filesystems. Thus blocks should be stored in multiple directories, perhaps even a shallow hierarchy.
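
      For a sense of scale, the arithmetic behind this concern is simple: at 32 MB per block, a single terabyte of data already means 1 TB / 32 MB = 32,768 block files, so a datanode holding a few terabytes accumulates on the order of a hundred thousand entries in one directory.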

      Attachments

      1. hadoop.50.patch.1 (13 kB, Mike Cafarella)


          Activity

          Andrzej Bialecki added a comment -

          I think this is a valid concern. Most filesystems work poorly with thousands of files in a single directory. My recent tests on ext3 show that listing the data directory with 50,000 blocks takes several seconds.

          FSDataset:80 contains a commented-out section which seems to address this issue. Does anyone know why it's not used?

          Mike Cafarella added a comment -

          Hi Andrzej,

          I wrote this code and got it 90% working some time ago, but then had to abandon it for a more important bug. It is not ready to go in its current state, but finishing it shouldn't be too hard. I can bring this code back to life.

          --Mike

          Andrzej Bialecki added a comment -

          That would be very useful. I've seen similar solutions in many places (e.g. Squid, or the Mozilla cache directory).

          Currently, each time a block report is sent we need to list this huge directory. That's still OK, since it's infrequent enough. However, each time we need to access a block, the correct file needs to be opened. Inside its native code the JVM uses an open(2) call, which causes the OS to perform a name-to-inode lookup. Even though the OS caches partial results of this lookup (in Linux this is known as the dcache/dentry cache), depending on the size of this LRU cache and the FS implementation details, doing real lookups, e.g. for new or newly requested blocks, may take a long time.

          Having said that, I'm not sure what the real performance benefit of this change would be; perhaps you could come up with a simpler test first?
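
          A minimal sketch, in Java, of the kind of simpler test suggested above: populate a directory with placeholder block files and time a full listing, which is what a block report has to do. The class name, default path, and file count are illustrative, not taken from the issue.

          import java.io.File;

          public class ListDirTest {
              public static void main(String[] args) throws Exception {
                  // Hypothetical scratch directory; pass a real path as the first argument.
                  File dataDir = new File(args.length > 0 ? args[0] : "listdir-test");
                  dataDir.mkdirs();

                  // Create many empty files standing in for block files.
                  int count = 50000;
                  for (int i = 0; i < count; i++) {
                      new File(dataDir, "blk_" + i).createNewFile();
                  }

                  // Time a full directory listing.
                  long start = System.currentTimeMillis();
                  String[] names = dataDir.list();
                  long elapsed = System.currentTimeMillis() - start;
                  System.out.println("Listed " + names.length + " entries in " + elapsed + " ms");
              }
          }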

          Mike Cafarella added a comment -

          This fixes the multiple-directory storage problem. It lazily creates a single level of 512 subdirectories, into which blocks are allocated according to the lower 9 bits of the block id. If mankind ever needs more than this, it is easy to add an additional subdirectory layer and select on the next-lowest 9 bits of the block id.

          This change is backwards-compatible with the previous block layout. Old blocks in the single-level directory are always kept in that format; we don't migrate them. New blocks are always added to the new hierarchy.

          If both versions of the storage layout are present, we always test the new one first. If that fails, we test the old one. (The new test should be faster, so we do it first.)

          Please let me know if this patch works for you.
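
          For readers following along, here is a minimal sketch in Java of the layout described above, with hypothetical class, method, and file names (the actual change lives in FSDataset): the lower 9 bits of the block id pick one of 512 lazily created subdirectories, and lookups check the new hierarchy before falling back to the old flat directory.

          import java.io.File;

          public class BlockLayout {
              private final File dataDir;

              public BlockLayout(File dataDir) {
                  this.dataDir = dataDir;
              }

              // Subdirectory for a block under the new layout: the lower 9 bits
              // of the id. Masking with 0x1FF yields an index in [0, 511].
              private File subdirFor(long blockId) {
                  File subdir = new File(dataDir, "subdir" + (blockId & 0x1FF));
                  subdir.mkdirs(); // created lazily, on first use
                  return subdir;
              }

              // New blocks always go into the hierarchy.
              public File fileForNewBlock(long blockId) {
                  return new File(subdirFor(blockId), "blk_" + blockId);
              }

              // Lookups try the new layout first, then fall back to the old
              // flat directory; no directories are created on the read path.
              public File findBlock(long blockId) {
                  File inSubdir = new File(new File(dataDir, "subdir" + (blockId & 0x1FF)), "blk_" + blockId);
                  if (inSubdir.exists()) {
                      return inSubdir;
                  }
                  File flat = new File(dataDir, "blk_" + blockId); // pre-existing block
                  return flat.exists() ? flat : null;
              }
          }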

          alan wootton added a comment -

          +1. I didn't look at the patch, but I carefully read the commented-out code. I vote yes. (Is it already in? If so, close this issue.)

          Sameer Paranjpye added a comment -

          This was done as part of HADOOP-64.


            People

             • Assignee: Milind Bhandarkar
             • Reporter: Doug Cutting
             • Votes: 1
             • Watchers: 0
