Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: datanode
    • Labels: None

      Description

      Filing this issue in response to the "full disk woes" thread on hdfs-user.

      Datanodes fill their storage directories unevenly, leading to situations where certain disks are full while others are significantly less used. Users at many different sites have experienced this issue, and HDFS administrators are taking steps like:

      • Manually rebalancing blocks in storage directories
      • Decommissioning nodes & later re-adding them

      There's a tradeoff between making use of all available spindles and filling disks at roughly the same rate. Possible solutions include:

      • Weighting less-used disks heavier when placing new blocks on the datanode. In write-heavy environments this will still make use of all spindles, equalizing disk use over time.
      • Rebalancing blocks locally. This would help equalize disk use as disks are added/replaced in older cluster nodes.

      Datanodes should actively manage their local disk so operator intervention is not needed.

        Issue Links

          Activity

          dhruba borthakur added a comment -

          Rebalancing blocks locally on each individual datanode sounds like a good idea. We have seen this problem in our cluster too.

          Allen Wittenauer added a comment -

          Aha! Found it.

          One of these should get closed as a dupe of the other, though.

          Scott Carey added a comment -

          A roulette-like algorithm for choosing which disk to write a block to should be relatively easy, and it would keep the disks generally balanced except when major changes occur, such as adding a new disk.

          I think the safest default option is to make the relative odds of choosing a disk equal to the free space available. The disk placement code already has access to that information and checks it to see if the disk is full. Changing it from round-robin to weighted is straightforward unless this needs to be pluggable.
          A larger disk will get more writes, but there is no avoiding that if you want them to be balanced. Balancing after placement causes more overall I/O than placing it right the first time.

          Rebalancing will always be needed from time to time for other reasons, however.
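The weighted ("roulette") selection described above can be sketched in a few lines. This is an illustrative sketch under assumed names and data shapes, not the DataNode's actual volume-choosing code:

```python
import random

def choose_volume(volumes):
    """Pick a storage directory with probability proportional to its free
    space, instead of pure round-robin. `volumes` is a hypothetical list of
    dicts with an "available" byte count per data dir."""
    free = [v["available"] for v in volumes]
    total = sum(free)
    if total <= 0:
        raise IOError("all volumes are full")
    # Roulette-wheel selection: spin a point into the cumulative free space.
    spin = random.uniform(0, total)
    cumulative = 0
    for vol, f in zip(volumes, free):
        cumulative += f
        if spin <= cumulative:
            return vol
    return volumes[-1]  # guard against floating-point edge at the top end
```

As Scott notes, an emptier disk simply receives proportionally more of the new writes, so usage converges over time without any block migration.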

          Travis Crawford added a comment -

          HDFS-1120 looks like a better place to track this issue. Marking this as a duplicate.

          Steve Loughran added a comment -

          HDFS-1120 is part of the solution, but some rebalancing util is needed too. Maybe this issue should be re-opened as the underlying problem, with HDFS-1120 and HDFS-1121 as parts of the fix, along with "add a way to rebalance HDDs on a single DN".

          Eli Collins added a comment -

          Agree, re-opening, let's have this issue track re-balancing disks within a DN.

          Wang Xu added a comment -

          I think re-balancing is just the next step after substituting a failed disk, as in HDFS-1362. Since I've just finished the patch for HDFS-1362, I would like to implement this feature if no one has done it yet.

          Steve Loughran added a comment -

          HDFS-1362 and this issue are part of the HDFS-664 problem "support efficient hotswap".
          Before worrying about this one, consider HDFS-1121, which is to provide a way to monitor the distribution (i.e. a web view). That web/management view would be how we'd test that the rebalancing works, so it's a pre-req. Also, it's best to keep the issues independent (where possible), so worry about getting HDFS-1362 in first before trying to extend it.

          That said, because the number of HDDs per server is growing to 12 or more at 2TB per unit, with 3TB on the horizon, we will need this feature in the 0.23-0.24 timeframe.

          Wang Xu added a comment -

          Hi Steve,

          I have not understood the necessity of HDFS-1121. IMHO, you can monitor the file distribution among disks with external tools such as Ganglia; is it required to integrate this into the web interface of HDFS?

          I think the regular routine is to find the problem in the cluster management system and then trigger the rebalance action in HDFS.

          Wang Xu added a comment -

          And @Steve, I did not intend to extend HDFS-1362 to cover this issue; I only meant that they are related.

          Steve Loughran added a comment -

          I think having a remote web view is useful in two ways:
          • it lets people see the basics of what is going on within the entire cluster (yes, that will need some aggregation eventually)
          • it lets you write tests that hit the status pages and so verify that the rebalancing worked.

          Wang Xu added a comment -

          Hi Steve,

          From my point of view, there are mainly two cases for re-balancing:

          1. The admin replaces a disk, then triggers the re-balance at once or at a specific time.
          2. The monitoring system observes a disk space distribution issue and raises a warning; the admin then triggers the re-balance at once or at a specific time.

          Do you mean we should integrate a monitoring system inside HDFS?

          Wang Xu added a comment -

          Hi folks,

          Here is the basic design of the process. Are there any other considerations?

          The basic flow is:

          1. Re-balancing should only proceed while the node is not under heavy load (should this be guaranteed by the administrator?).
          2. Calculate the total and average available & used space of the dirs.
          3. Find the disks with the most and least space, and decide the move direction. We need to define an imbalance threshold here to decide whether re-balancing is worthwhile.
          4. Lock the origin disks: stop writing to them and wait for finalization on them.
          5. Find the deepest dirs in each selected disk and move blocks from those dirs. If a dir becomes empty, the dir should also be removed.
          6. Check the balance status while the blocks are migrated, and break from the loop once it reaches the threshold.
          7. Release the lock.

          Cases that should be taken into account:

          • If a disk has much less capacity than the other disks, it might have the least available space yet still be unable to migrate blocks out.
          • If two or more dirs are located on the same disk, they might confuse the space calculation. This is exactly the case in a MiniDFSCluster deployment.
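The threshold decision in the flow above (calculate usage, pick the fullest and emptiest volumes, move only if the spread is worth it) might look like the following sketch. The `plan_move` name and the `volumes` shape are hypothetical, not actual DataNode code:

```python
def plan_move(volumes, threshold=0.10):
    """volumes: dict mapping a data-dir path to (used_bytes, capacity_bytes).
    Returns a (source, target) pair for migration, or None when the spread
    between the fullest and emptiest volume is within the threshold."""
    usage = {path: used / cap for path, (used, cap) in volumes.items()}
    src = max(usage, key=usage.get)  # fullest volume: migration source
    dst = min(usage, key=usage.get)  # emptiest volume: migration target
    if usage[src] - usage[dst] <= threshold:
        return None                  # not worth re-balancing
    return (src, dst)
```

A single pass yields one source/target pair; the surrounding loop would re-run it after each batch of block moves until it returns None (step 6 above).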
          Allen Wittenauer added a comment -

          > IMHO, you can monitor the file distribution among disks with external tools
          > such as ganglia, is it required to integrate it in the Web interface of HDFS?

          Yes. First off, Ganglia sucks at large scales for too many reasons to go into here. Secondly, only Hadoop really knows what file systems are in play for HDFS.

          > From my point, there are mainly two cases of re-balance:

          There's a third, and one I suspect is more common than people realize: mass deletes.

          > Re-balance should be only process while it is not in heavy load
          > (should this be guaranteed by the administrator?)

          and

          > Lock origin disks: stop written to them and wait finalization on them.

          I don't think it is realistic to expect the system to be idle during a rebalance. Ultimately, it shouldn't matter from the rebalancer's perspective anyway; the only performance hit that should be noticeable would be for blocks in the middle of being moved. Even then, the DN process knows what blocks are where and can read from the 'old' location.

          If DNs being idle is a requirement, then one is better off just shutting down the DN (and TT?) processes and doing the moves offline.

          > Find the deepest dirs in every selected disk and move blocks from those dirs.
          > And if a dir is empty, then the dir should also be removed.

          Why does depth matter?

          > If two or more dirs are located in a same disk, they might confuse the
          > space calculation. And this is just the case in MiniDFSCluster deployment.

          This is also the case with pooled storage systems (such as ZFS) on real clusters already.

          Wang Xu added a comment -

          Hi Allen,

          For the cluster monitoring issue, we cannot expect an integrated monitor to substitute for the external monitoring system. Thus, we need to make clear what information-gathering requirements should be considered inside HDFS.

          And the "lock" I mention is to stop writing new blocks into the volume, which is the simplest way to migrate blocks. And I do think depth matters: from my understanding, the deeper the dir, the more I/O is wasted reaching the block.

          Allen Wittenauer added a comment -

          >For the cluster monitoring issue, we could not expect an integrated monitoring
          > substitute the external monitoring system. Thus, we need to make clear
          > what information gathering requirements should be considered inside hdfs.

          There is nothing preventing us from building a jsp that runs on the datanode that shows the file systems in use and the relevant stats for those fs's.

          >And the "lock" I mentions is to stop write new blocks into the volume,
          > which is the simplest way to migrate blocks.

          This is going to be a big performance hit. I think the focus should be on already written/closed blocks and just ignore new blocks.

          Wang Xu added a comment -

          Hi Allen,

          For the jsp showing the local filesystem, I will comment on HDFS-1121 with detailed requirements.

          And to simplify the design, I agree that we could move only finalized blocks and not touch existing dirs.

          Sanjay Radia added a comment -

          >> From my point, there are mainly two cases of re-balance:
          >There's a third, and one I suspect is more common than people realize: mass deletes.
          Since the creates are distributed, there is a high probability that the deletes will be distributed as well.

          Steve Loughran added a comment -

          Another driver for rebalancing is a cluster set up with mapred temp space allocated to the same partitions as HDFS. This delivers best IO rate for temp storage, but once that temp space is reclaimed, the disks can get unbalanced.

          Steve Hoffman added a comment -

          Wow, I can't believe this is still lingering out there as a new feature request. I'd argue this is a bug – and a big one. Here's why:

          • You have Nx 12x3TB machines in your cluster.
          • 1 disk fails on 12 drive machine. Let's say they each were 80% full.
          • You install the replacement drive (0% full), but by the time you do this the under-replicated blocks have been fixed (on this and other nodes)
          • The 0% full drive will fill at the same rate as the other disks. That machine's other 11 disks will fill to 100%, since block placement is done at the node level and the node seems to use a round-robin algorithm across disks even though one has more space.

          The only way we have found to move blocks internally (without taking the cluster down completely) is to decommission the node, let it empty, and then re-add it to the cluster so the balancer can take over and move blocks back onto it.

          Hard drives fail. This isn't news to anybody. The larger (12 disk) nodes only make the problem worse in terms of time to empty and fill again. Even if you had a 1U 4-disk machine it is still bad 'cause you lose 25% of your capacity on 1 disk failure, whereas the impact on the 12-disk machine is less than 9%.

          The remove/add of a complete node seems like a pretty poor option.
          Or am I alone in this? Can we please revive this JIRA?

          Allen Wittenauer added a comment -

          The only way we have found to move blocks internally (without taking the cluster down completely) is to decommission the node and have it empty and then re-add it to the cluster so the balancer can take over and move block back onto it.

          That's the slow way. You can also take down the data node, move the blocks around on the disk, restart the data node.

          The other thing is that as you grow a grid, you care less and less about the balance on individual nodes. This issue is of primary importance to smaller installations, which are likely under-provisioned hardware-wise anyway.

          Andrew Purtell added a comment -

          It might be worth adding a note to the manual that after adding or replacing drives on a DataNode, while it's temporarily offline anyway, blocks and their associated metadata files can both be moved to any defined data directory for local rebalancing.

          Steve Hoffman added a comment -

          The other thing is that as you grow a grid, you care less and less about the balance on individual nodes. This issue is of primary important to smaller installations who likely are under-provisioned hardware-wise anyway.

          Our installation is about 1PB so I think we can say we are past "small". We typically run at 70-80% full as we are not made of money. And at 90% the disk alarms start waking people out of bed.
          I would say we very much care about the balance of a single node. When that node fills, it'll take out the region server and the M/R jobs running on it, and generally anger people whose jobs have to be restarted.

          I wouldn't be so quick to discount this. And when you have enough machines, you are replacing disks more and more frequently. So ANY manual process is $ wasted in people's time: time to re-run jobs, time to take down the datanode and move blocks. Time = $. To turn Hadoop into a more mature product, shouldn't we be striving for "it just works"?

          Allen Wittenauer added a comment -

          We typically run at 70-80% full as we are not made of money.

          No, you aren't made of money, but you are under-provisioned.

          Allen Wittenauer added a comment -

          Since someone (off-jira) asked:

          • Yes, I think this should be fixed.
          • No, I don't think this is as big of an issue as most people think.
          • At 70-80% full, you start to run the risk that the NN is going to have trouble placing blocks, esp if . Also, if you are like most places and put the MR spill space on the same file system as HDFS, that 70-80% is more like 100%, especially if you don't clean up after MR. (Thus why I always put MR area on a separate file system...)
          • As you scale, you care less about the health of individual nodes and more about total framework health.
          • 1PB isn't that big. At 12 drives per node, we're looking at ~50-60 nodes.
          Steve Hoffman added a comment -

          Yes, I think this should be fixed.

          This was my original question really. Since it hasn't made the cut in over 2 years, I was wondering what it would take to either do something with this, or whether it should be closed as a "won't fix" with script/documentation support for the admins.

          No, I don't think this is as big of an issue as most people think.

          Basically, I agree with you. There are worse things that can go wrong.

          At 70-80% full, you start to run the risk that the NN is going to have trouble placing blocks, esp if . Also, if you are like most places and put the MR spill space on the same file system as HDFS, that 70-80% is more like 100%, especially if you don't clean up after MR. (Thus why I always put MR area on a separate file system...)

          Agreed. More getting installed Friday. Just don't want bad timing/luck to be a factor here – and we do clean up after the MR.

          As you scale, you care less about the health of individual nodes and more about total framework health.

          Sorry, have to disagree here. The total framework is made up of its parts. While I agree there is enough redundancy built in to handle most cases once your node count gets above a certain level, you are basically saying it doesn't have to work well in all cases because more $ can be thrown at it.

          1PB isn't that big. At 12 drives per node, we're looking at ~50-60 nodes.

          Our cluster is storage dense yes, so a loss of 1 node is noticeable.

          Steve Loughran added a comment -

          I don't think it's a wontfix, just that nobody has sat down to fix it. With the trend towards very storage-intense servers (12-16x 3TB), this problem has grown, and it does need fixing.

          What does it take?

          1. tests
          2. solution
          3. patch submit and review.

          It may be possible to implement the rebalance operation as some script (Please, not bash, something better like Python); ops could run this script after installing a new HDD.

          A way to assess HDD imbalance in a DN or across the cluster would be good too; that could pick up problems which manual/automated intervention could fix.
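One simple way to assess HDD imbalance along these lines is the spread between a node's fullest and emptiest data dirs. The sketch below assumes an external tool has already collected the per-dir used fractions; `flag_imbalanced` is a hypothetical helper, not an existing HDFS utility:

```python
def flag_imbalanced(report, threshold=0.20):
    """report: dict mapping each DN name to a list of used fractions,
    one per data dir. Returns the nodes whose fullest and emptiest
    disks differ by more than `threshold`, with the spread rounded
    for display."""
    flagged = {}
    for node, fractions in report.items():
        spread = max(fractions) - min(fractions)
        if spread > threshold:
            flagged[node] = round(spread, 2)
    return flagged
```

Run against a whole cluster's stats, this would surface the DNs worth a manual or automated local rebalance.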

BTW, not everyone splits MR and DFS onto separate disk partitions, as that removes flexibility and the ability to handle MR jobs with large intermediate output.

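The "way to assess HDD imbalance" could start as small as this sketch. Nothing here is an existing Hadoop tool; the data-dir paths are placeholders, and `volume_usage`/`imbalance_report` are hypothetical helpers that just print per-volume usage and the max-min spread:

```python
#!/usr/bin/env python
# Sketch of the "assess HDD imbalance" idea: report how unevenly a set of
# dfs.data.dir volumes is filled. Paths are placeholders for the real
# storage directories passed on the command line.
import os
import sys

def volume_usage(path):
    """Return (used_bytes, total_bytes) for the filesystem holding path."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return total - free, total

def imbalance_report(data_dirs):
    """Print per-volume usage; return the max-min spread in points."""
    usages = []
    for d in data_dirs:
        used, total = volume_usage(d)
        pct = 100.0 * used / total if total else 0.0
        usages.append(pct)
        print("%-40s %5.1f%% used" % (d, pct))
    spread = max(usages) - min(usages)
    print("max-min spread: %.1f percentage points" % spread)
    return spread

if __name__ == "__main__":
    # e.g. ./imbalance.py /data/1/dfs/dn /data/2/dfs/dn ...
    dirs = [d for d in sys.argv[1:] if os.path.isdir(d)] or ["/tmp"]
    imbalance_report(dirs)
```

A cluster-wide view could aggregate the same numbers over SSH, or from each DN's storage report.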
          Hide
          Eli Collins added a comment -

This one is on my radar, just haven't had time to get to it. The most common use cases I've seen for this are an admin adding new drives, or drives of mixed capacities where the smaller drives fill faster. For this occasional imbalance, a DN startup option that rebalances the disks and then shuts down is probably sufficient. OTOH it would be pretty easy to have a background task that moves finalized blocks from heavily loaded disks to lightly loaded ones.

          Hide
          Steve Loughran added a comment -

Let's start with a Python script that people can use while the DN is offline, as that would work against all versions. Moving blocks as a background task would be much better, but I can imagine surprises: need to make sure a block isn't in use locally or remotely, handling of very unbalanced disks (where disk #4 is a 50TB NFS mount), etc.

          Hide
          Scott Carey added a comment -

          Isn't the datanode internal block placement policy an easier/simpler solution?

IMO if you simply placed blocks on disks weighted by available free space then this would not be a big issue. All drives would run out of space at nearly the same time. The drawback would be write performance bottlenecks in the more extreme cases.

If you were 90% full on 11 drives and completely empty on one, then ~50% of new blocks would go to the new drive (however, few reads would hit this drive). That is not ideal for performance, but not a big problem either, since the disks should rapidly become more balanced.

In most situations, we would be talking about systems that have 3 to 11 drives that are 50% to 70% full and one empty drive. This would lead to between ~17% and 55% of writes going to that drive, instead of the 8% to 25% that round-robin would give.

IMO the default datanode block placement should be weighted towards disks with more free space. There are other cases besides disk failure that can lead to imbalanced space usage, including heterogeneous partition sizes. That would mitigate the need for any complicated background rebalance tasks.

          Perhaps on start-up a datanode could optionally do some local rebalancing before joining the cluster.

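The weighting described above is easy to model. Below is a sketch, not DataNode code: `pick_volume` is a hypothetical helper, and the numbers reproduce the eleven-1TB-drives-at-90%-full-plus-one-empty-drive example, where the empty drive's expected share of new writes works out to about 48%:

```python
import random

def pick_volume(free_bytes):
    """Pick a volume index with probability proportional to free space."""
    total_free = float(sum(free_bytes))
    r = random.uniform(0, total_free)
    acc = 0.0
    for i, free in enumerate(free_bytes):
        acc += free
        if r <= acc:
            return i
    return len(free_bytes) - 1  # guard against float rounding

# Scenario from the comment: eleven 1 TB drives at 90% full, one empty.
TB = 10 ** 12
free = [0.1 * TB] * 11 + [1.0 * TB]  # 0.1 TB free each, 1 TB on the new one

# Expected share for the empty drive: 1.0 / (11 * 0.1 + 1.0) ~= 48%,
# matching the "~50% of new blocks" figure above.
share = free[-1] / sum(free)
print("empty drive's expected write share: %.1f%%" % (100 * share))
```

Under round-robin the same drive would get only 1/12 ~= 8% of writes, which is why it stays empty for so long.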
          Hide
          Eli Collins added a comment -

          There are issues with just modifying the placement policy:

1. It only solves the problem for new blocks. If you add a bunch of new disks you want to rebalance the cluster immediately to get better read throughput. And if you have to implement block balancing (e.g. on startup) anyway, then you don't need to modify the placement policy.
2. The internal policy intentionally ignores disk usage to optimize for performance (round-robining blocks across all spindles). As you point out, in some cases this won't be much of a hit, but on a 12-disk machine where half the disks are new the impact will be noticeable.
3. Now that placement policies are pluggable there are multiple of them, so this requires every policy to solve the problem vs solving it once.

IMO a background process would actually be easier than modifying the placement policy. Just balancing on DN startup is simplest and would solve most people's issues, though it would require a rolling DN restart if you wanted to do it on-line.

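The background-task idea can be modeled with a toy rebalancer. This is a sketch under simplifying assumptions: the volume dicts and block tuples are stand-ins for real volume state, and a real implementation would move only finalized blocks, together with their .meta files, while guarding against concurrent readers:

```python
def rebalance_step(volumes, tolerance=0.10):
    """Move one block from the fullest to the emptiest volume.

    volumes: list of dicts {'used': int, 'total': int,
                            'blocks': list of (name, size)}.
    Returns True if a block was moved, False once the usage gap is within
    `tolerance` (a fraction) or the fullest volume has no blocks left.
    """
    by_pct = sorted(volumes, key=lambda v: v['used'] / float(v['total']))
    emptiest, fullest = by_pct[0], by_pct[-1]
    gap = (fullest['used'] / float(fullest['total'])
           - emptiest['used'] / float(emptiest['total']))
    if gap <= tolerance or not fullest['blocks']:
        return False
    name, size = fullest['blocks'].pop()
    emptiest['blocks'].append((name, size))
    fullest['used'] -= size
    emptiest['used'] += size
    return True

def rebalance(volumes, tolerance=0.10):
    """Run rebalance steps until converged; return the number of moves."""
    moves = 0
    while rebalance_step(volumes, tolerance):
        moves += 1
    return moves
```

A throttle (moves per second, or bytes per second) would keep such a task from competing with client I/O, much as the cluster-level balancer limits its bandwidth.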
          Hide
          Kevin Lyda added a comment -

          Continuing on Eli's comment, modifying the placement policy also fails to handle deletions.

I'm currently experiencing this on my cluster, where the first datadir is both smaller and getting more of the data (for reasons I'm still trying to figure out - it might be due to how the machines were configured historically). The offline rebalance script sounds like a good first step.

          Hide
          Kevin Lyda added a comment -

          As an alternative, this issue could be avoided if clusters were configured with LVM and data striping. Well, not avoided, but with the problem pushed down a layer. However tuning docs discourage using LVM.

Besides performance, another obvious downside to putting a single fs across all disks is that a single disk failure would take down the entire node rather than a portion of it. However, I'm new to Hadoop, so I'm not sure how a partial data loss event is handled. If a datanode with partial data loss can be restored without re-copying the data it has not lost, then avoiding striped LVM is a huge win. Particularly for the XX TB case mentioned above, where balancing in a new node from scratch would involve a huge data transfer with time and network consequences.

          Hide
          Steve Loughran added a comment -

@Kevin: loss of a single disk not only preserves the rest of the data on the server; the server keeps going. You get 1-3TB of network traffic as the under-replicated data is re-replicated, but that's all.

          Hide
          Steve Hoffman added a comment -

          Given the general nature of HDFS and its many uses (HBase, M/R, etc) as much as I'd like it to "just work", it is clear it always depends on the use. Maybe one day we won't need a balancer script for disks (or for the cluster).

          I'm totally OK with having a machine-level balancer script. We use the HDFS balancer to fix inter-machine imbalances when they crop up (again, for a variety of reasons). It makes sense to have a manual script for intra-machine imbalances for people who DO have issues and make it part of the standard install (like the HDFS balancer).

          Hide
          Kevin Lyda added a comment -

          Are there any decent code examples for how one might do this? I've never done hadoop dev so some pointers for where to start would be appreciated.

          Hide
          Michael Schmitz added a comment -

          I wrote this code to balance blocks on a single data node. I'm not sure if it's the right approach at all--particularly with how I deal with subdirectories. I wish there were better documentation on how to manually balance a disk.

          https://github.com/schmmd/hadoop-balancer

          My biggest problem with my code is it's way too slow and it requires the cluster to be down.

          Hide
          Kevin Lyda added a comment -

          I wrote the following in Java which... is not really my favourite language. However Hadoop is written in Java and if the code will ever make it into Hadoop proper that seems like an important requirement.

          My particular cluster that I need this working on is a 1.0.3 cluster so my code is written for that. Hence ant, etc. It assumes a built 1.0.3 tree lives in ../hadoop-common. Again, not a Java person so I know this isn't the greatest setup.

          The code is here:

          https://bitbucket.org/lyda/intranode-balance

          I've run it on a test cluster but haven't tried it on "real" data yet. In the meantime pull requests are accepted / desired / encouraged / etc.

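In the same offline spirit as these scripts, here is a minimal sketch of the core move operation. It assumes the 1.x on-disk layout (a `current/` tree holding `blk_*` data files with `blk_*_<genstamp>.meta` companions); that layout and the helper names are assumptions, and anything like this should only run with the DN stopped, against a copy of real data first:

```python
#!/usr/bin/env python
# Offline sketch: move blk_* data files together with their
# blk_*_<genstamp>.meta companions from one data dir to another.
# Assumes the 1.x layout (data_dir/current/...); run only with the
# DataNode stopped.
import glob
import os
import shutil

def blocks_in(data_dir):
    """List block data files (not .meta) under data_dir/current."""
    out = []
    for root, _dirs, files in os.walk(os.path.join(data_dir, "current")):
        for f in files:
            if f.startswith("blk_") and not f.endswith(".meta"):
                out.append(os.path.join(root, f))
    return out

def move_block(block_path, dest_dir):
    """Move one block file and its .meta companion into dest_dir/current."""
    dest = os.path.join(dest_dir, "current")
    # The meta file shares the block name plus a generation stamp suffix;
    # both files must move together or the block becomes unreadable.
    for path in [block_path] + glob.glob(block_path + "_*.meta"):
        shutil.move(path, os.path.join(dest, os.path.basename(path)))
```

A full script would pick source and destination dirs by free space (moving from fullest to emptiest) and fsync after each move; a dry-run mode is cheap insurance.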

            People

            • Assignee:
              Unassigned
              Reporter:
              Travis Crawford
            • Votes:
              16 Vote for this issue
              Watchers:
              56 Start watching this issue

              Dates

              • Created:
                Updated:

                Development