Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Change DFS block placement to allocate the first replica locally, the second off-rack, and the third intra-rack from the second.

      Description

      Currently, when writing out a block, DFS will place one copy on the local data node, one copy on a rack-local node,
      and another on a node in a remote rack. This leads to a number of undesirable properties:

      1. The block will be rack-local to only two racks instead of three, reducing the advantage of rack-locality-based scheduling by 1/3.

      2. The blocks of a file (especially a large file) are unevenly distributed over the nodes: one third will be on the local node, and two thirds on nodes in the same rack. This may fill some nodes much faster than others,
      increasing the need for rebalancing. Furthermore, it can also turn some nodes into "hot spots" if those big
      files are popular and accessed by many applications.
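
      To make the placement orders concrete, below is a small, self-contained toy sketch contrasting the current (trunk) order described above with the order that was eventually adopted (see the release note). This is not Hadoop code: the class, the cluster layout, and the helper methods (PlacementSketch, Node, pick, trunkTargets, proposedTargets) are invented purely for illustration.

      import java.util.ArrayList;
      import java.util.List;
      import java.util.Random;

      /** Toy illustration of trunk vs. proposed replica placement (not the real DFS code). */
      public class PlacementSketch {

          static class Node {
              final String name, rack;
              Node(String name, String rack) { this.name = name; this.rack = rack; }
              public String toString() { return name + "(" + rack + ")"; }
          }

          static final Random RND = new Random(0);

          /** Pick a random node whose rack membership matches the request, optionally excluding one node. */
          static Node pick(List<Node> cluster, String rack, boolean sameRack, Node exclude) {
              List<Node> candidates = new ArrayList<Node>();
              for (Node n : cluster) {
                  if (n != exclude && n.rack.equals(rack) == sameRack) {
                      candidates.add(n);
                  }
              }
              return candidates.get(RND.nextInt(candidates.size()));
          }

          /** Current (trunk) order: the writer's node, a second node on the writer's rack, a node on a remote rack. */
          static Node[] trunkTargets(Node writer, List<Node> cluster) {
              Node second = pick(cluster, writer.rack, true, writer);   // rack-local
              Node third  = pick(cluster, writer.rack, false, null);    // off-rack
              return new Node[] { writer, second, third };
          }

          /** Proposed order: the writer's node, a node on a remote rack, another node on that same remote rack. */
          static Node[] proposedTargets(Node writer, List<Node> cluster) {
              Node second = pick(cluster, writer.rack, false, null);    // off-rack
              Node third  = pick(cluster, second.rack, true, second);   // same rack as the second replica
              return new Node[] { writer, second, third };
          }

          public static void main(String[] args) {
              List<Node> cluster = new ArrayList<Node>();
              for (int r = 0; r < 3; r++) {
                  for (int i = 0; i < 4; i++) {
                      cluster.add(new Node("node" + r + "-" + i, "rack" + r));
                  }
              }
              Node writer = cluster.get(0);
              System.out.println("trunk    : " + java.util.Arrays.toString(trunkTargets(writer, cluster)));
              System.out.println("proposed : " + java.util.Arrays.toString(proposedTargets(writer, cluster)));
          }
      }

      Both orders still span only two racks per block, but in the proposed order the writer's rack keeps just one copy, so a small number of busy writers no longer concentrates two thirds of the replicas on their own nodes and racks.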

      Attachments

      1. HADOOP-2559-1-4.patch
        7 kB
        Lohit Vijayarenu
      2. HADOOP-2559-1-3.patch
        7 kB
        Lohit Vijayarenu
      3. HADOOP-2559-1-2.patch
        7 kB
        Lohit Vijayarenu
      4. HADOOP-2559-1.patch
        8 kB
        Lohit Vijayarenu
      5. Patch2_Rack_Node_Mapping.jpg
        35 kB
        Lohit Vijayarenu
      6. Patch1_Rack_Node_Mapping.jpg
        34 kB
        Lohit Vijayarenu
      7. Trunk_Rack_Node_Mapping.jpg
        33 kB
        Lohit Vijayarenu
      8. Patch2 Block Report.jpg
        56 kB
        Lohit Vijayarenu
      9. Patch1_Block_Report.png.jpg
        52 kB
        Lohit Vijayarenu
      10. Trunk_Block_Report.png
        30 kB
        Lohit Vijayarenu
      11. HADOOP-2559-2.patch
        11 kB
        Lohit Vijayarenu
      12. HADOOP-2559-1.patch
        8 kB
        Lohit Vijayarenu

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #432 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/432/ )

          Chris Douglas added a comment -

          I just committed this. Thanks, Lohit!

          Lohit Vijayarenu added a comment -

          The failed test was due to a timeout. I re-ran the test locally and it passed.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378053/HADOOP-2559-1-4.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1984/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1984/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1984/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1984/console

          This message is automatically generated.

          Hairong Kuang added a comment -

          +1 The patch looks good.

          Lohit Vijayarenu added a comment -

          Thanks Hairong. Attaching a patch modified as suggested by your comments.

          Hairong Kuang added a comment -

          There is no need to add a new method chooseNodeOnSameRack. Instead, call the existing method chooseLocalRack, but pass the second target as the first parameter.
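
          (For clarity, the suggestion amounts to something like the fragment below. This is only an illustrative sketch: the variable type, helper names, and elided argument lists are taken from this comment and from common DFS naming, not from the actual method signatures.)

          // Instead of introducing a dedicated helper for the third target:
          //     DatanodeDescriptor third = chooseNodeOnSameRack(secondTarget, ...);
          // reuse the existing rack-local chooser, seeded with the second target
          // rather than with the writer:
          DatanodeDescriptor third = chooseLocalRack(secondTarget /* , remaining arguments elided */);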

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12377544/HADOOP-2559-1.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1933/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1933/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1933/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1933/console

          This message is automatically generated.

          Lohit Vijayarenu added a comment -

          Submitting Patch1 for testing.

          Runping Qi added a comment -

          Lohit,

          Great work!

          Clearly, the block distribution with patch2 is better than the one with patch1, and much better than that with trunk.
          For the scan job, patch1 and patch2 performed about the same, and both were better than trunk by about 20%.

          The numbers for the random writers are interesting.
          The first test, where 200 nodes write concurrently, shows trunk and patch1 were better than patch2 by about 15%.
          Test 2 shows that patch2 performed best, and both patch1 and patch2 were better than trunk!
          I suspect that the disk space of the nodes running the mappers might have reached its limit, so
          blocks could not be placed on local nodes.

          Lohit Vijayarenu added a comment -

          Node allocation across racks while running against Patch2

          Lohit Vijayarenu added a comment -

          Node allocation across racks while running against Patch1

          Lohit Vijayarenu added a comment -

          Node allocation across racks while running against Trunk

          Lohit Vijayarenu added a comment -

          Screenshot of Block distribution after RandomWriter against Patch2

          Lohit Vijayarenu added a comment -

          Screenshot of Block Distribution after RandomWriter against Patch1

          Lohit Vijayarenu added a comment -

          Block Report across nodes after RandomWriter for Trunk

          Lohit Vijayarenu added a comment -

          Here are the test results

          1. This test was a 200-node random writer followed by just a scan.

          Job             Trunk     Patch1    Patch2
          RandWriter      460       471       520
          Scan            718       583       601
          MapTasks        16307     16307     16241
          Data-Local Maps 15743     15741     15741
          Rack-Local Maps 86        84        54
          

          2. This test was 10 mappers on a 200-node cluster writing random text (2 TB total), followed by a 200-node scan.

          Job             Trunk     Patch1    Patch2
          RandWriter      13633     12923     12299
          Scan            764       606       600
          MapTasks        16517     16453     16538
          Data-Local Maps 15276     15335     15695
          Rack-Local Maps 433       363       121
          
          Owen O'Malley added a comment -

          Ok, I'd like a little more data please, Lohit. Please make sure you test with HADOOP-1985.

          I'd like to see two tests:

          A full 200 node test:
          1. random writer
          2. scan (just map-reduce input, no shuffle or reduces)
          a. record number of node and rack local maps

          A lopsided 200 node test:
          1. random writer with 10 maps
          a. block distribution by node
          b. node distribution by rack
          2. 200 node scan
          a. node distribution by rack
          b. number of node and rack local maps

          Thanks!

          Hairong Kuang added a comment -

          The performance of Trunk and Trunk+patch1 should be the same. I do not know why there is a performance difference between these two strategies on most of the experiments.

          Runping Qi added a comment -

          Another piece of useful information that is missing is the percentage of data-local mappers for the runs.

          Owen O'Malley added a comment -

          I think you missed the point of his experiment. He ran only 20 maps on a 200-node cluster to try and generate an unbalanced distribution of blocks. Because word count is largely a scanning operation, you can see the better distribution of the blocks leading to improved times. Naturally, this will be better once HADOOP-1985 is committed.

          The most relevant missing piece of information would be the distribution of blocks in the input directory, both per node and per rack. With trunk or patch1, he'd end up with those 20 nodes each having 5% of the blocks. (His 20 nodes probably hit 80% of the 23 racks in the 900-node HOD cluster he was running on, so it makes sense that trunk and patch1 aren't that far apart.) Patch2 should have generated a pretty even distribution across the nodes and racks (although the 20% non-local racks would probably have 33% fewer blocks than the local racks).

          Sameer Paranjpye added a comment -

          A single type of job on a static allocation will likely not show much benefit. Can we run the gridmix benchmark against these patches, with and without HoD? That data will help us better evaluate these changes.

          Lohit Vijayarenu added a comment -

          Some more experiments. This time, on a 200-node cluster allocated by HOD, I ran randomtextwriter with 20 mappers (20 nodes) writing out data (2 TB in total). Then I used wordcount to scan the data across the allocated 200 nodes.

          Job                 Trunk    Trunk+Patch1    Trunk+Patch2
          RandomTextWriter     5642            6483            6720
          RandomTextWriter     6304            6488            7074
          WordCount            3118            3094            3082
          WordCount            3096            3102            3084
          
          Lohit Vijayarenu added a comment -

          I ran the same set of experiments 4 times instead of 2. Here are the results:

          Job             Trunk    Trunk+patch1    Trunk+patch2
          RandomWriter     1346             923            1607
          RandomWriter      743             571            1111
          RandomWriter      698             497            1003
          RandomWriter      776             508             963
          Sort             1535            2027            1802
          Sort             1466            1869            1768
          Sort             1618            1787            1738
          Sort             1699            2044            1515
          

          An interesting note is that Trunk+patch1 writes show better times than Trunk.

          During sort, I see many tasks failing due to a ChecksumException; they succeed on other nodes on retry, which affects the times shown for the sort jobs.

          log:org.apache.hadoop.fs.ChecksumException: Checksum error: /tmps/3/gs203727-22269-2527764705241834/mapred-tt/mapred-local/task_200802222124_0007_m_001749_0/file.out at 17018368

          Runping suggested we run wordcount instead; I will do that and post the results.

          Lohit Vijayarenu added a comment -

          I ran the random writer and sort benchmarks on trunk, trunk+patch1 (which allocates the 3rd replica on the same rack as the 2nd) and trunk+patch2 (which includes patch1 and also allocates the first replica on the local rack instead of on the local node). The tests were run on 100 nodes. I ran 2 randomwriters back to back and 2 sorts back to back on the data generated by the random writers.

          Here are the results

          Job             TRUNK    TRUNK+Patch1    TRUNK+Patch2
          RandomWriter      500             563             689
          RandomWriter      495             486             625
          Sort             1563            1737            1614
          Sort             1680            1675            1678
          
          Lohit Vijayarenu added a comment -

          Thanks Hairong. I am attaching another patch which takes care of allocating the first replica on the local rack instead of on the local node.

          Hairong Kuang added a comment -

          We should try placing the first replica on a random node on the local rack. Lohit, could you please run random writer and sort to see if the new strategy has any significant performance degradation?

          Lohit Vijayarenu added a comment -

          Attached is the patch which changes the block allocation as suggested by everyone.

          Hairong Kuang added a comment -

          I also think it is a good compromise. But it is still worth placing blocks in a way that maximizes the number of racks, especially for data that is written once but read many times.

          Arun C Murthy added a comment -

          I have to place all the blame for this one on Owen's head... :)

          +1

          I think it's a good compromise too.

          Owen O'Malley added a comment -

          +1

          Runping Qi added a comment -

          A few folks are concerned about the write performance if three replicas are placed on three racks.
          Arun and Owen proposed a compromise as follows:

          DFS client should place one replica on a random node of its local rack, if that is possible, and then choose a remote rack randomly and
          place the other two replicas on two random nodes from that remote rack. This gives us pretty good distribution and needs
          only one remote rack write.

          Arun and Owen, please correct me if the above quote is inaccurate.

          I like this proposal.
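
          (For illustration, this compromise maps onto the toy sketch attached to the description above roughly as follows. Again, the Node/pick helpers and the layout are invented for illustration, and the real implementation may pick the remote rack and its two nodes differently.)

          /** Compromise order: a random node on the writer's rack, then two nodes on one remote rack. */
          static Node[] compromiseTargets(Node writer, List<Node> cluster) {
              Node first  = pick(cluster, writer.rack, true, null);     // any node on the local rack (possibly the writer itself)
              Node second = pick(cluster, writer.rack, false, null);    // a node on some remote rack
              Node third  = pick(cluster, second.rack, true, second);   // another node on that same remote rack
              return new Node[] { first, second, third };
          }

          In this order the block data crosses a rack boundary only once in the write pipeline (between the first and second replicas), since the third copy stays within the remote rack; that is where the "only one remote rack write" in the proposal comes from.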

          Mark Butler added a comment -

          Is this related to https://issues.apache.org/jira/browse/HADOOP-2767 as this highlights one possible cause of uneven distribution of data on a cluster?

          eric baldeschwieler added a comment -

          A further issue is that, as discussed on the list, we get a very uneven distribution of data on the cluster when a small number of clients write a lot of data.

          A preference for one copy on the same rack, unless that rack is substantially more full than most, does make sense, but a preference for the same node seems problematic.

          Likewise, the choice of putting two blocks on the source rack seems to lead to a lot of imbalance. We could get the same bandwidth reduction by putting 2 copies on the second rack if it has more free space than the source rack. We could also choose to not allow two copies on a rack in the standard 3-replica case, but that is a separable issue.


            People

            • Assignee:
              Lohit Vijayarenu
            • Reporter:
              Runping Qi
            • Votes:
              0
            • Watchers:
              2

              Dates

              • Created:
                Updated:
                Resolved:

                Development