Hadoop HDFS · HDFS-1120

Make DataNode's block-to-device placement policy pluggable

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: datanode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Make the DataNode's block-volume choosing policy pluggable.
    • Tags:
      datanode, volumes, policy

      Description

      As discussed on the mailing list, as the number of disk drives per server increases, it would be useful to allow the DataNode's policy for new block placement to grow in sophistication from the current round-robin strategy.

      Attachments

      1. HDFS-1120.r1.diff (12 kB, Harsh J)
      2. HDFS-1120.r2.diff (13 kB, Harsh J)
      3. HDFS-1120.r3.diff (12 kB, Harsh J)
      4. HDFS-1120.r4.diff (12 kB, Harsh J)

        Issue Links

          • is related to: HDFS-1121, HDFS-1312, HDFS-1804, HDFS-2887
          • incorporates: HDFS-1797

          Activity

          Transition                     Time In Source Status   Execution Times   Last Executer   Last Execution Date
          Patch Available → Open         16h 18m                 1                 Harsh J         27/Mar/11 04:43
          Open → Patch Available         331d 4h 22m             2                 Harsh J         27/Mar/11 07:36
          Patch Available → Resolved     1d 18h 32m              1                 Todd Lipcon     29/Mar/11 02:09
          Harsh J added a comment -

          Typo, HDFS-2931.

          Harsh J added a comment -

           Thanks for taking the time to explain it out, Nicholas; that does make sense. +1 on switching it down.

          I've opened HDFS-2391 for the same, please review!

          Tsz Wo Nicholas Sze added a comment -

           You are right that public sounds correct. However, if we annotate it as public, all the classes associated with it should also be annotated as public, and whenever we change the interface or any of the associated classes, it becomes an incompatible change.

           In our case, BlockVolumeChoosingPolicy uses FSVolumeInterface, which is a part of FSDatasetInterface. FSDatasetInterface contains many classes that should not be exposed. One way to solve this is to make FSVolumeInterface independent of FSDatasetInterface. However, FSVolumeInterface is not yet a well-designed interface for the public.

           For these reasons, it is justified to annotate it as private, the same as BlockPlacementPolicy.

          Harsh J added a comment -

           Well, I don't mind, as it's an advanced extension anyway; I'm just curious why the present public+evolving marking does not cut it. Why mark it private when it is actually publicly extensible (if only by advanced users)?

          Tsz Wo Nicholas Sze added a comment -

           Similar to BlockPlacementPolicy, BlockVolumeChoosingPolicy is actually better annotated as @InterfaceAudience.Private. Harsh, how does that sound to you?

          Tsz Wo Nicholas Sze added a comment -

          Hi Harsh, thanks for taking a look!

          Harsh J added a comment -

           Oh, never mind my earlier comment; I think I understand now. The public interface class was using a private interface object earlier, the mentioned JIRA has since made that interface public, and hence we're good to go ahead.

          Harsh J added a comment -

          Nicholas,

          Thanks for linking back!

          Do we really need per-method audience levels in interfaces?

          This one added in:

          +@InterfaceAudience.Public
          +@InterfaceStability.Evolving
          +public interface BlockVolumeChoosingPolicy {
          

          While I see the changes in your patch as just one backwards-incompatible change:

          Index: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockVolumeChoosingPolicy.java
          ===================================================================
          --- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockVolumeChoosingPolicy.java	(revision 1241740)
          +++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockVolumeChoosingPolicy.java	(working copy)
          @@ -22,7 +22,7 @@
           
           import org.apache.hadoop.classification.InterfaceAudience;
           import org.apache.hadoop.classification.InterfaceStability;
          -import org.apache.hadoop.hdfs.server.datanode.FSDataset.FSVolume;
          +import org.apache.hadoop.hdfs.server.datanode.FSDatasetInterface.FSVolumeInterface;
           
           /**************************************************
            * BlockVolumeChoosingPolicy allows a DataNode to
          @@ -46,7 +46,7 @@
              * @return the chosen volume to store the block.
              * @throws IOException when disks are unavailable or are full.
              */
          -  public FSVolume chooseVolume(List<FSVolume> volumes, long blockSize)
          +  public FSVolumeInterface chooseVolume(List<FSVolumeInterface> volumes, long blockSize)
               throws IOException;
          

           So I am not sure what you meant by the InterfaceAudience.Private comment above?

          Tsz Wo Nicholas Sze added a comment -

          BlockVolumeChoosingPolicy is @InterfaceAudience.Public but declares the chooseVolume(..) method with a @InterfaceAudience.Private class FSVolume. The method signature is updated by HDFS-2887.

          Tsz Wo Nicholas Sze made changes -
          Link This issue is related to HDFS-2887 [ HDFS-2887 ]
          Harsh J made changes -
          Link This issue incorporates HDFS-1797 [ HDFS-1797 ]
          Harsh J made changes -
          Link This issue is related to HDFS-1804 [ HDFS-1804 ]
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #582 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/582/)

          Todd Lipcon added a comment -

          Hey Nicholas, sorry I missed that. I've just filed HDFS-1797 to fix.

          Tsz Wo Nicholas Sze added a comment -

          > -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          Hey, there is a findbugs warning.

          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Todd Lipcon added a comment -

          Committed to trunk, thanks Harsh!

          dhruba borthakur added a comment -

          +1

          Todd Lipcon added a comment -

          +1, looks good to me. Since there are so many watchers, I'll wait til the end of the day to commit this in case there are any other comments.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12474789/HDFS-1120.r4.diff
          against trunk revision 1085509.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/295//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/295//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/295//console

          This message is automatically generated.

          Harsh J made changes -
          Attachment HDFS-1120.r4.diff [ 12474789 ]
          Harsh J added a comment -

          Patch that addresses @Todd's last comments.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12474716/HDFS-1120.r3.diff
          against trunk revision 1085509.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/292//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/292//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/292//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          A few small nits:

           • Can you use the Configuration.getClass() overload which also takes an interface argument? That way, if someone specifies an incorrect class, they'll get a nicer error, plus you don't have to typecast.
           • Also for that call, you should pass RoundRobinVolumesPolicy.class as the default, rather than null (see the sketch after this comment).
          • Very minor thing: can you rename TestRRVolumeChoicePolicy to TestRoundRobinVolumesPolicy? Where possible I think it's better to make the tests for Foo.java be TestFoo.java (I have a handy emacs macro which switches back and forth between the two!)
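
           As an illustration of the two getClass() points above (an editor's sketch, not code from the committed patch; the config key name is hypothetical):

           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.util.ReflectionUtils;

           public class VolumePolicyLoader {
             // Hypothetical key name, for illustration only.
             static final String POLICY_KEY = "dfs.datanode.block.volume.choice.policy";

             static BlockVolumeChoosingPolicy load(Configuration conf) {
               // Passing the interface as the third argument makes getClass() fail
               // fast with a clear error if the configured class does not implement
               // BlockVolumeChoosingPolicy, and no typecast is needed. Passing
               // RoundRobinVolumesPolicy.class as the default avoids a null check.
               Class<? extends BlockVolumeChoosingPolicy> clazz =
                   conf.getClass(POLICY_KEY,
                                 RoundRobinVolumesPolicy.class,
                                 BlockVolumeChoosingPolicy.class);
               // newInstance() also injects conf if the class implements Configurable.
               return ReflectionUtils.newInstance(clazz, conf);
             }
           }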
          Harsh J made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Tags datanode, volumes, policy
          Harsh J made changes -
          Attachment HDFS-1120.r3.diff [ 12474716 ]
          Hide
          Harsh J added a comment -

          @Todd - Great comments, thank you!

          I had initially wanted to construct FSVolumes myself, but I was not aware of Mockito's existence. It is a wonderful tool!

          Attached a new patch with (hopefully) all your comments addressed.

          Show
          Harsh J added a comment - @Todd - Great comments, thank you! I had initially wanted to construct FSVolumes myself, but I was not aware of Mockito's existence. It is a wonderful tool! Attached a new patch with (hopefully) all your comments addressed.
          Harsh J made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Todd Lipcon added a comment -

          Mostly looks good - just a few comments:

           • when constructing the policy class, you can use conf.getClass(...). That also has the behavior that if a non-existent class is specified, it will throw an exception rather than continuing on, which I think is a good thing.
           • For the unit test, can you make mock FSVolumes with Mockito rather than starting a minicluster? It should be pretty simple, and the more we can avoid unnecessary miniclusters, the faster our tests will run (see the sketch after this comment).
           • RoundRobinVolumesPolicy isn't thread-safe, since you didn't make chooseVolume synchronized. This isn't really a problem since the caller is itself synchronized, but I think it would make more sense from an API perspective to say that policies that keep state should be thread-safe, and to make chooseVolume synchronized.
           • I don't think BlockVolumeChoosingPolicy should extend Configurable - this forces you to add setConf and getConf even though you don't use them. Since we reflectively check for the Configurable interface in ReflectionUtils.newInstance, it will still work OK for other implementers who may want configuration.
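
           As an illustration of the Mockito suggestion above (an editor's sketch, not the committed test), assuming FSVolume exposes getAvailable() and that the round-robin policy cycles through volumes in order:

           import static org.junit.Assert.assertEquals;
           import static org.mockito.Mockito.mock;
           import static org.mockito.Mockito.when;

           import java.io.IOException;
           import java.util.Arrays;
           import java.util.List;

           import org.junit.Test;

           public class TestRoundRobinVolumesPolicy {
             @Test
             public void testRoundRobinOrder() throws IOException {
               // Two mocked volumes instead of a minicluster; getAvailable() is
               // stubbed so the policy believes both have room for the block.
               FSVolume volA = mock(FSVolume.class);
               FSVolume volB = mock(FSVolume.class);
               when(volA.getAvailable()).thenReturn(1024L);
               when(volB.getAvailable()).thenReturn(1024L);
               List<FSVolume> volumes = Arrays.asList(volA, volB);

               BlockVolumeChoosingPolicy policy = new RoundRobinVolumesPolicy();
               // Successive calls should alternate between the two volumes.
               assertEquals(volA, policy.chooseVolume(volumes, 100));
               assertEquals(volB, policy.chooseVolume(volumes, 100));
               assertEquals(volA, policy.chooseVolume(volumes, 100));
             }
           }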
          Harsh J made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note Make the DataNode's block-volume choosing policy pluggable.
          Fix Version/s 0.23.0 [ 12315571 ]
          Harsh J made changes -
          Attachment HDFS-1120.r2.diff [ 12474692 ]
          Harsh J added a comment -

          Forgot the hdfs-default.xml changes in the previous patch. Here's a new one with hdfs-default additions alone (over the previous).

          Harsh J made changes -
          Attachment HDFS-1120.r1.diff [ 12474691 ]
          Harsh J added a comment -

          Attached a patch that refactors the round-robin default choosing mechanism to something more pluggable via a defined public interface and a datanode configurable option.

          Harsh J added a comment -

          @Todd - Marked

          @Dhruba - Yes that's the idea! I'll post a patch soon.

          dhruba borthakur added a comment -

           Looks like a reasonable start. Can we make the datanode use a config option that specifies the classname of the class implementing the above interface?

          Todd Lipcon added a comment -

          This seems like a reasonable first implementation. Let's be sure to mark the interface as Evolving so that if we need to change the interface later we won't have to go through any deprecation dance.

          Harsh J added a comment -

          I've done some initial work for making the volume-block choosing policy pluggable (so that methods other than round-robin may be provided).

           The following is my initial interface design, and I am looking for comments/critique/etc. w.r.t. this ticket's scope, before I start pushing out some tests + patches:

          /**************************************************
           * BlockVolumeChoosingPolicy allows a DataNode to
           * specify what policy is to be used while choosing
           * a volume for a block request.
           *
           ***************************************************/
          public interface BlockVolumeChoosingPolicy extends Configurable {
          
           /**
            * Returns a specific FSVolume after applying a suitable choice algorithm
            * to place a given block, given a list of FSVolumes and the block
            * size sought for storage.
            * @param volumes - the array of FSVolumes that are available.
            * @param blockSize - the size of the block for which a volume is sought.
            * @return the chosen volume to store the block.
            * @throws IOException when disks are unavailable or are full.
            */
           public FSVolume chooseVolume(FSVolume[] volumes, long blockSize)
             throws IOException;
          
          }
          

           This can be neatly used within FSVolumeSet.getNextVolume(). [Maybe that method needs renaming too, since 'next' may not make sense once the choice becomes pluggable.]

          Looking forward to a discussion.

           P.S. I also posted this to hdfs-dev, but I guess most of Hadoop's dev discussions occur on the JIRA itself?
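
           To make the proposal concrete, here is an editor's sketch (not code from this issue's patches) of a round-robin implementation of the interface above, using the FSVolume[] signature as proposed here (the committed version later took a List); FSVolume.getAvailable() and DiskOutOfSpaceException are assumed from the DataNode codebase:

           import java.io.IOException;

           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.util.DiskChecker.DiskOutOfSpaceException;

           public class RoundRobinVolumesPolicy implements BlockVolumeChoosingPolicy {
             private Configuration conf;
             private int curVolume = 0;

             @Override
             public synchronized FSVolume chooseVolume(FSVolume[] volumes, long blockSize)
                 throws IOException {
               if (volumes.length < 1) {
                 throw new DiskOutOfSpaceException("No more available volumes");
               }
               // Remember where we started so we can tell when every volume was tried.
               int startVolume = curVolume;
               while (true) {
                 FSVolume volume = volumes[curVolume];
                 curVolume = (curVolume + 1) % volumes.length;
                 if (volume.getAvailable() > blockSize) {
                   return volume;
                 }
                 if (curVolume == startVolume) {
                   throw new DiskOutOfSpaceException(
                       "Insufficient space for an additional block of size " + blockSize);
                 }
               }
             }

             // Required only because the proposed interface extends Configurable.
             @Override
             public void setConf(Configuration conf) { this.conf = conf; }

             @Override
             public Configuration getConf() { return conf; }
           }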

          Harsh J made changes -
          Assignee Harsh J Chouraria [ qwertymaniac ]
          Steve Loughran added a comment -

           @Ambikeshwar - you make some good points. I think we should:

           1. Start with some monitoring of what's going on, especially on 6-12 HDD servers.
           2. Add a plugin point where people can play with block placement.
           3. Let people with time on their hands write the good plugins. For the complex decisions, you probably do want some history: filename, created, last read, owner.

           I think your point about rebalancing is a good one; it may be better to schedule some rebalancing work on a DN that can spread data across the disks than to try to make smarter block placement decisions. If you can rebalance the disks, then you can just use round-robin or roulette placement. This handles the new-HDD problem better and allows the block placer to be simple and fast; rebalancing can be done on selected nodes at a time of one's choosing.

          Ambikeshwar Raj Merchia added a comment -

           Whatever policy is developed, the following should be taken into consideration:

           • The expected lifetime of the blocks. In my experience, the oldest data (from creation) is the first to be removed.
           • Usage of the blocks. In my experience, the newest data (from creation) is the most likely to be read again.

           Disk scrub reads and validates the data. To do the balancing, reading and writing are both necessary. Therefore, should the disk scrub be overloaded to do the balancing, cutting the work of balancing by 50%?

          Travis Crawford added a comment -

           Moving this comment from my duplicate.

           Filing this issue in response to "full disk woes" on hdfs-user.

           Datanodes fill their storage directories unevenly, leading to situations where certain disks are full while others are significantly less used. Users at many different sites have experienced this issue, and HDFS administrators are taking steps like:

           • Manually rebalancing blocks in storage directories
           • Decommissioning nodes and later re-adding them

           There's a tradeoff between making use of all available spindles and filling disks at roughly the same rate. Possible solutions include:

           • Weighting less-used disks more heavily when placing new blocks on the datanode (see the sketch after this comment). In write-heavy environments this will still make use of all spindles, while equalizing disk use over time.
           • Rebalancing blocks locally. This would help equalize disk use as disks are added or replaced in older cluster nodes.

           Datanodes should actively manage their local disks so that operator intervention is not needed.
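
           As an editor's illustration of the weighting idea above (not part of this issue's patches), a policy could pick volumes with probability proportional to their usable free space, using the List-based chooseVolume signature from the committed interface shown earlier on this page; FSVolume.getAvailable() and DiskOutOfSpaceException are assumed from the DataNode codebase:

           import java.io.IOException;
           import java.util.List;
           import java.util.Random;

           import org.apache.hadoop.util.DiskChecker.DiskOutOfSpaceException;

           public class AvailableSpaceWeightedPolicy implements BlockVolumeChoosingPolicy {
             private final Random random = new Random();

             @Override
             public synchronized FSVolume chooseVolume(List<FSVolume> volumes, long blockSize)
                 throws IOException {
               // Weight each volume by the free space it would still have after
               // storing this block; volumes that cannot fit the block weigh zero.
               long totalWeight = 0;
               for (FSVolume v : volumes) {
                 totalWeight += Math.max(0L, v.getAvailable() - blockSize);
               }
               if (totalWeight <= 0) {
                 throw new DiskOutOfSpaceException(
                     "Insufficient space for an additional block of size " + blockSize);
               }
               // Roulette-wheel selection: emptier disks fill faster, but every
               // spindle keeps receiving some share of the writes.
               long pick = (long) (random.nextDouble() * totalWeight);
               for (FSVolume v : volumes) {
                 long weight = Math.max(0L, v.getAvailable() - blockSize);
                 if (pick < weight) {
                   return v;
                 }
                 pick -= weight;
               }
               // Unreachable given totalWeight > 0; satisfies the compiler.
               return volumes.get(volumes.size() - 1);
             }
           }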

          Allen Wittenauer made changes -
          Link This issue relates to HDFS-1312 [ HDFS-1312 ]
          Jeff Hammerbacher made changes -
          Link This issue is related to HDFS-1121 [ HDFS-1121 ]
          Steve Loughran added a comment -

           I think the probability gets larger the more disks per server, and now that 12-HDD units are coming out, you can plan to see it some time after you spec out your next datacentre.

           Causes:

           1. Deletion of large-block-size files can leave a disk unbalanced.
           2. MR temp space on the same disks can fill up and then free the disks.
           3. Replacement of a failed HDD leaves that disk permanently underutilised.

           The third one is new. On a 12-disk server with most of all 12 disks allocated to HDFS, one block in 12 would go to any specific disk. If one disk is replaced, it still only gets 1/12 of the blocks, even though, if all the other disks were 70-80% full, it's the disk with the most space. The disks would only become balanced if the new disk got more of the writes (which could have adverse consequences for future IO rates), or if some rebalancing on a single machine moved data from one disk to another (or, to be precise, copied, validated the block checksums, then deleted).

           I actually think HDFS-1121 should come first: provide a way of measuring the distribution across the disks on a single server. Once we have the data, we can start worrying about ways to correct any distribution issues.

          dhruba borthakur added a comment -

          Agreed, it is a possible scenario. I have not looked deeply enough to figure out if this scenario occurs in our cluster.

          Eli Collins added a comment -

           This is the result of discussion in the "dfs.data.dir" thread on general. The high-level bit is that disk utilization gets imbalanced, so you effectively get fewer spindles for writes, since some disks will fill up much faster than others. Do you see this in your clusters?

          dhruba borthakur added a comment -

           Do you have a use-case in mind where such a policy is beneficial, and why the current policy is inadequate?

           Is a policy that tries to keep all disks at the same usage capacity better than the round-robin one, especially when disks can die, get repaired, and then be put back into the same system?

          Jeff Hammerbacher created issue -

             People

             • Assignee: Harsh J
             • Reporter: Jeff Hammerbacher
             • Votes: 4
             • Watchers: 24