Hadoop HDFS / HDFS-325

DFS should not use round robin policy in determining on which volume (file system partition) to allocate the next block

    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      When multiple file system partitions are configured for the data storage of a data node,
      it uses a strict round robin policy to decide which partition to use for writing the next block.
      This may result in anomalous cases in which the blocks of a file are not evenly distributed
      across the partitions. For example, when we use distcp to copy files with 4 mappers running
      concurrently on each node, those 4 mappers write to DFS at about the same rate. Thus, it is
      possible that the 4 mappers write out blocks in an interleaved fashion. If there are 4 file
      system partitions configured for the local data node, it is possible that each mapper will
      continue to write its blocks onto the same file system partition.

      A simple random placement policy would avoid such anomalous cases and does not have any
      obvious drawbacks; a sketch contrasting the two policies follows.
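      As an illustration only (a minimal sketch with invented names, not the actual DataNode code),
      consider 4 volumes and 4 writers whose block allocations interleave in lock-step: round robin
      pins each writer to one partition, while a uniformly random pick does not.

```java
import java.util.Random;

// Hypothetical volume chooser for illustration; not the real DataNode code.
// With 4 volumes and 4 writers whose block allocations interleave in
// lock-step, round robin hands writer i the volume (i + 4k) % 4 == i every
// time, pinning each writer to one partition. A random pick is uncorrelated.
public class VolumeChoice {
    private static final int NUM_VOLUMES = 4;

    private int next = 0;                       // round-robin cursor
    private final Random random = new Random();

    int roundRobin() {
        int v = next;
        next = (next + 1) % NUM_VOLUMES;
        return v;
    }

    int randomPick() {
        return random.nextInt(NUM_VOLUMES);
    }

    public static void main(String[] args) {
        VolumeChoice chooser = new VolumeChoice();
        // Blocks arrive interleaved from writers 0..3; under round robin,
        // writer i always lands on volume i.
        for (int block = 0; block < 8; block++) {
            int writer = block % 4;
            System.out.printf("writer %d -> round robin: volume %d, random: volume %d%n",
                    writer, chooser.roundRobin(), chooser.randomPick());
        }
    }
}
```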


          Activity

          Runping Qi added a comment -

          A similar problem, due to the round robin placement policy, happens for map output data.

          dhruba borthakur added a comment -

          This makes the DataNode pick a disk randomly for allocating a new block.

          Raghu Angadi added a comment -

          Random partition selection is fine and the patch looks fine. But with 4 partitions and
          two concurrent writers, there is a 25% probability that both write to the same partition;
          with 3 writers it becomes 62.5% (that at least 2 are writing to the same disk), about 90%
          for 4, etc. If that is acceptable, then this patch is fine. Assuming these apps are
          typically I/O bound, this sounds like a pretty large penalty.

          But I don't see how this fixes the problem reported in the description; actually, I did
          not quite understand the problem anyway.
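
          These figures follow birthday-problem arithmetic; a quick sanity check, assuming 4
          partitions chosen uniformly and independently by each writer:

```java
// Sanity check of the collision probabilities quoted above, assuming 4
// partitions chosen uniformly and independently by each writer.
// P(all writers on distinct partitions) = 4!/(4-n)! / 4^n; the chance of
// some shared disk is its complement: 25%, 62.5%, 90.6% for n = 2, 3, 4.
public class CollisionOdds {
    public static void main(String[] args) {
        int partitions = 4;
        for (int writers = 2; writers <= 4; writers++) {
            double allDistinct = 1.0;
            for (int i = 0; i < writers; i++) {
                allDistinct *= (double) (partitions - i) / partitions;
            }
            System.out.printf("%d writers: P(some disk shared) = %.1f%%%n",
                    writers, 100 * (1 - allDistinct));
        }
    }
}
```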

          Raghu Angadi added a comment -

          A better (but much more complicated) policy could be to select a random partition from among the least loaded disks.
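
          A sketch of that idea, with invented load numbers standing in for a real per-disk metric
          (queued writes or outstanding bytes, say):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustration of the "random among least loaded" idea, not actual
// DataNode code. "load" is a hypothetical stand-in for a real per-disk
// metric such as queued writes or outstanding bytes.
public class LeastLoadedChooser {
    private final Random random = new Random();

    int choose(long[] load) {
        long min = Long.MAX_VALUE;
        for (long l : load) {
            min = Math.min(min, l);
        }
        // Collect all volumes tied for the minimum load, then pick
        // uniformly at random among them.
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < load.length; i++) {
            if (load[i] == min) {
                candidates.add(i);
            }
        }
        return candidates.get(random.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        // Volumes 1 and 3 are tied for least loaded; one of them wins.
        long[] load = {500, 100, 300, 100};
        System.out.println("chose volume " + new LeastLoadedChooser().choose(load));
    }
}
```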

          Hairong Kuang added a comment -

          If the new block allocation strategy proposed in HADOOP-2559 removes the local node copy, the problem described in this JIRA will go away.

          Runping Qi added a comment -

          Is HADOOP-2559 accepted? Will it be in 0.17?

          dhruba borthakur added a comment -

          If HADOOP-2559 gets committed, I plan on closing this JIRA as "won't fix".

          dhruba borthakur added a comment -

          Duplicate of HADOOP-2559.

          Runping Qi added a comment -

          Have we (in HADOOP-2559) decided not to place the first block replica on the local node?

          dhruba borthakur added a comment -

          From my understanding, HADOOP-2559 places the first replica on a random node on the local rack. I will check with Lohit to confirm this one.

          Lohit Vijayarenu added a comment -

          Dhruba, it was decided that we commit patch1 in HADOOP-2559, which places the first replica on the local node, the second replica on a node on a different rack, and the third on the same rack as the 2nd replica but on a different node.
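
          Schematically, that placement rule looks like the following; the Node type and the
          cluster list are invented for illustration and are not the real NameNode structures:

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Toy schematic of the HADOOP-2559 patch1 rule as described above; the
// Node record and cluster list are invented and do not reflect the real
// NameNode data structures.
public class ReplicaPlacement {
    record Node(String name, String rack) {}

    static final Random RANDOM = new Random();

    static Node pick(List<Node> candidates) {
        return candidates.get(RANDOM.nextInt(candidates.size()));
    }

    static List<Node> place(Node writer, List<Node> cluster) {
        Node first = writer;                        // 1st replica: local node
        Node second = pick(cluster.stream()         // 2nd: node on a different rack
                .filter(n -> !n.rack().equals(first.rack()))
                .collect(Collectors.toList()));
        Node third = pick(cluster.stream()          // 3rd: 2nd's rack, different node
                .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                .collect(Collectors.toList()));
        return List.of(first, second, third);
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("a1", "rackA"), new Node("a2", "rackA"),
                new Node("b1", "rackB"), new Node("b2", "rackB"));
        System.out.println(place(cluster.get(0), cluster));
    }
}
```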

          dhruba borthakur added a comment -

          It appears that HADOOP-2559 still keeps the first replica on the local machine. Thus, reopening this JIRA.

          Runping Qi added a comment -

          By analyzing disk utilization data, we have found that the four disks on each node were not evenly utilized.
          It seems that the first disk was the most heavily utilized, which is consistent with the potential impact
          of the current policy for selecting a volume for a new block on data nodes.

          Allen Wittenauer added a comment -

          Rather than random, I'd like the capability to give weights, as mentioned in HADOOP-2150.
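
          A minimal sketch of weighted selection; the weights here are invented (HADOOP-2150 is
          where the weighting idea is raised):

```java
import java.util.Random;

// Illustration of weighted volume selection, not actual Hadoop code: each
// volume gets a weight, and its chance of being picked is its weight
// divided by the total. The weights below are invented.
public class WeightedChooser {
    private final Random random = new Random();

    int choose(double[] weights) {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        double r = random.nextDouble() * total;   // point in [0, total)
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0) {
                return i;
            }
        }
        return weights.length - 1;                // guard against rounding
    }

    public static void main(String[] args) {
        // E.g. favor a larger or faster disk with a higher weight.
        double[] weights = {1.0, 1.0, 2.0, 4.0};
        System.out.println("chose volume " + new WeightedChooser().choose(weights));
    }
}
```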


            People

            • Assignee: dhruba borthakur
            • Reporter: Runping Qi
            • Votes: 0
            • Watchers: 5
