|
Is this related to https://issues.apache.org/jira/browse/HADOOP-2767
A few folks are concerned about the write performance if three replicas are placed on three racks. DFS client should place one replica on a random node of its local rack, if that is possible, and then choose a remote rack randomly and Arun amd Owen, please correct if the above quote is inaccurate. I like this proposal. I have to place all the blame for this one on Owen's head... smile
+1 I think it's a good compromise too. I also think it is a good comprise. But it is still worth placing blocks that maximizes the number of racks especially for those data that are written once but read lots lots of times.
Attached is the patch which changes the block allocation as suggested by everyone.
We should try placing the first replica on a random node on the local rack. Lohit, could you please run random writer and sort to see if the new strategy has any significant performance degradation?
Thanks Hairong. I am attaching another patch which takes care of allocating first block in local rack instead of local node.
I ran random writer and sort benchmark on trunk, trunk+patch1 (allocates 3rd block on same rack as 2nd block is located) and trunk+patch2 (which includes patch1 and also allocates first block on local rack, instead of local node). The tests were run on 100 nodes. I ran 2 randomwriters back to back and 2 sorts back to back on the data generated by random writers.
Here are the results Job TRUNK TRUNK+Path1 TRUNK+Patch2 RandomWriter 500 563 689 RandomWriter 495 486 625 Sort 1563 1737 1614 Sort 1680 1675 1678 I ran same set of experiments 4 times instead of 2. Here are the results
Job Trunk Trunk+patch1 Trunk+patch2 RandomWriter 1346 923 1607 RandomWriter 743 571 1111 RandomWriter 698 497 1003 RandomWriter 776 508 963 Sort 1535 2027 1802 Sort 1466 1869 1768 Sort 1618 1787 1738 Sort 1699 2044 1515 Interesting note is that Trunk+patch1 writes have better time compared to Trunk. While sort, I see many tasks failing due to ChecksumException which succeeds on other nodes in retry which affects the time show for sort jobs. log:org.apache.hadoop.fs.ChecksumException: Checksum error: /tmps/3/gs203727-22269-2527764705241834/mapred-tt/mapred-local/task_200802222124_0007_m_001749_0/file.out at 17018368 Runping suggested we run wordcount instead, will do that and post the results. Some more experiments. This time on a 200 node cluster allocated by HOD, ran randomtextwriter with 20 mappers (20 nodes) writing out data (2 TB in total). Then I used wordcount to scan the data across the allocated 200 nodes.
Job Trunk Trunk+Patch1 Trunk+Patch2 RandomTextWriter 5642 6483 6720 RandomTextWriter 6304 6488 7074 WordCount 3118 3094 3082 WordCount 3096 3102 3084 A single type of job on a static allocation will likely not show much benefit. Can we run the gridmix benchmark against these patches, with and without HoD? That data will help us better evaluate these changes.
I think you missed the point of his experiment. He ran only 20 maps on a 200 node cluster to try and generate unbalanced distribution of blocks. Because word count largely is a scanning operation, you can see the better distribution of the blocks leading to improved times. Naturally, this will be better once hadoop-1985 is committed.
The most relevant missing piece of information would be the distribution of blocks in the input directory both per a node and per a rack. With trunk or patch1, he'd end up with those 20 nodes each having 5% of the blocks. (His 20 nodes probably hit 80% of the 23 racks in the 900 node hod cluster he was running on, so it makes sense that trunk and patch1 aren't that far apart.) Patch 2, should have generated pretty even distribution across the nodes and racks (although the 20% non-local racks would probably have 33% fewer blocks than the local racks). Another piece of useful information missing is the percentages of data local mappers of the runs. The performance of Trunk and Trunk+patch1 should be the same. I do not know why there is a performance difference between these two strategies on most of the experiments.
Ok, I'd like a little more data please, Lohit. Please make sure you test with
I'd like to see two tests: A full 200 node test: A lopsided 200 node test: Thanks! Here are the test results
1. This test was 200 node random writer followed by just Scan. Job Trunk Patch1 Patch2 RandWriter 460 471 520 Scan 718 583 601 MapTasks 16307 16307 16241 Data-Local Maps 15743 15741 15741 Rack-Local Maps 86 84 54 2. This test was 10 mappers on 200 node writing random text (total 2TB) followed by 200 node Scan Job Trunk Patch1 Patch2 RandWriter 13633 12923 12299 Scan 764 606 600 MapTasks 16517 16453 16538 Data-Local Maps 15276 15335 15695 Rack-Local Maps 433 363 121 Block Report across nodes after RandomWriter for Trunk
Screenshot of Block Distribution after RandomWriter against Patch1
Screenshot of Block distribution after RandomWriter against Patch2
Node alloation across Racks while running against Trunk
Node alloation across Racks while running against Patch1
Node alloation across Racks while running against Patch2
Lohit, Great work! Clearly, the block distributation with patch2 is better than the one with patch1, and much better than that with trunk. The number for random writers are interesting. Submitting Patch1 for test
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377544/HADOOP-2559-1.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 3 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1933/testReport/ This message is automatically generated. No need to add a new method chooseNodeOnSameRack. Should call the method chooseLocalRack instead but pass the second target as the first parameter.
Thanks Hairong, Attaching patch modifying as suggested by your comments
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378053/HADOOP-2559-1-4.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 3 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1984/testReport/ This message is automatically generated. the failed test was due to timeout. I re-ran the test locally and it passed.
I just committed this. Thanks, Lohit!
Integrated in Hadoop-trunk #432 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/432/
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A preference for one copy on the same rack unless that rack is substantially more full than most does make sense, but a preference for the same node seems problematic.
Likewise, the choice of putting two blocks on the source rack seems to lead to a lot of imbalance. We could get the same bandwidth reduction by putting 2 copies on the second rack if it has more free space than the source rack. We could also choose the not allow two copies on a rack in the standard 3 replica case, but that is a separable issue.