Issue Details (XML | Word | Printable)

Key: HADOOP-2559
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Lohit Vijayarenu
Reporter: Runping Qi
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

DFS should place one replica per rack

Created: 09/Jan/08 04:02 PM   Updated: 08/Jul/09 04:42 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 0.17.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-2559-1-2.patch 2008-03-13 11:57 PM Lohit Vijayarenu 7 kB
Text File Licensed for inclusion in ASF works HADOOP-2559-1-3.patch 2008-03-14 12:11 AM Lohit Vijayarenu 7 kB
Text File Licensed for inclusion in ASF works HADOOP-2559-1-4.patch 2008-03-17 05:58 PM Lohit Vijayarenu 7 kB
Text File Licensed for inclusion in ASF works HADOOP-2559-1.patch 2008-03-10 06:44 PM Lohit Vijayarenu 8 kB
Text File Licensed for inclusion in ASF works HADOOP-2559-1.patch 2008-02-21 03:59 PM Lohit Vijayarenu 8 kB
Text File Licensed for inclusion in ASF works HADOOP-2559-2.patch 2008-02-22 09:54 AM Lohit Vijayarenu 11 kB
Image Attachments:

1. Patch1_Block_Report.png.jpg
(52 kB)

2. Patch1_Rack_Node_Mapping.jpg
(34 kB)

3. Patch2 Block Report.jpg
(56 kB)

4. Patch2_Rack_Node_Mapping.jpg
(35 kB)

5. Trunk_Block_Report.png
(30 kB)

6. Trunk_Rack_Node_Mapping.jpg
(33 kB)
Issue Links:
dependent
 

Release Note: Change DFS block placement to allocate the first replica locally, the second off-rack, and the third intra-rack from the second.
Resolution Date: 17/Mar/08 11:54 PM


 Description  « Hide
Currently, when writing out a block, dfs will place one copy to a local data node, one copy to a rack local node
and another one to a remote node. This leads to a number of undesired properties:

1. The block will be rack-local to two tacks instead of three, reducing the advantage of rack locality based scheduling by 1/3.

2. The Blocks of a file (especiallya large file) are unevenly distributed over the nodes: One third will be on the local node, and two thirds on the nodes on the same rack. This may make some nodes full much faster than others,
increasing the need of rebalancing. Furthermore, this also make some nodes become "hot spots" if those big
files are popular and accessed by many applications.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
eric baldeschwieler added a comment - 13/Jan/08 11:07 PM
A further issue is that as discussed on the list, we get very uneven distribution of data on the cluster when you have a small number of clients writing a lot of data.

A preference for one copy on the same rack unless that rack is substantially more full than most does make sense, but a preference for the same node seems problematic.

Likewise, the choice of putting two blocks on the source rack seems to lead to a lot of imbalance. We could get the same bandwidth reduction by putting 2 copies on the second rack if it has more free space than the source rack. We could also choose the not allow two copies on a rack in the standard 3 replica case, but that is a separable issue.


Robert Chansler made changes - 31/Jan/08 11:59 PM
Field Original Value New Value
Assignee lohit vijayarenu [ lohit ]
Mark Butler added a comment - 04/Feb/08 05:49 PM
Is this related to https://issues.apache.org/jira/browse/HADOOP-2767 as this highlights one possible cause of uneven distribution of data on a cluster?

Runping Qi added a comment - 08/Feb/08 05:53 PM

A few folks are concerned about the write performance if three replicas are placed on three racks.
Arun and Owen proposed a comprise as follow:

DFS client should place one replica on a random node of its local rack, if that is possible, and then choose a remote rack randomly and
place the other two replicas on two random nodes from that remote rack. This gives us pretty good distribution and needs
only one remote rack write.

Arun amd Owen, please correct if the above quote is inaccurate.

I like this proposal.


Owen O'Malley added a comment - 12/Feb/08 05:06 PM
+1

Arun C Murthy added a comment - 12/Feb/08 05:46 PM
I have to place all the blame for this one on Owen's head... smile

+1

I think it's a good compromise too.


Hairong Kuang added a comment - 12/Feb/08 10:50 PM
I also think it is a good comprise. But it is still worth placing blocks that maximizes the number of racks especially for those data that are written once but read lots lots of times.

Lohit Vijayarenu added a comment - 21/Feb/08 03:59 PM
Attached is the patch which changes the block allocation as suggested by everyone.

Lohit Vijayarenu made changes - 21/Feb/08 03:59 PM
Attachment HADOOP-2559-1.patch [ 12376130 ]
Hairong Kuang added a comment - 21/Feb/08 09:10 PM
We should try placing the first replica on a random node on the local rack. Lohit, could you please run random writer and sort to see if the new strategy has any significant performance degradation?

Lohit Vijayarenu added a comment - 22/Feb/08 09:54 AM
Thanks Hairong. I am attaching another patch which takes care of allocating first block in local rack instead of local node.

Lohit Vijayarenu made changes - 22/Feb/08 09:54 AM
Attachment HADOOP-2559-2.patch [ 12376211 ]
Lohit Vijayarenu added a comment - 22/Feb/08 09:59 AM
I ran random writer and sort benchmark on trunk, trunk+patch1 (allocates 3rd block on same rack as 2nd block is located) and trunk+patch2 (which includes patch1 and also allocates first block on local rack, instead of local node). The tests were run on 100 nodes. I ran 2 randomwriters back to back and 2 sorts back to back on the data generated by random writers.

Here are the results

Job			    TRUNK		TRUNK+Path1 	TRUNK+Patch2
RandomWriter		500			563			689
RandomWriter		495			486			625
Sort				1563			1737			1614
Sort				1680			1675			1678

Lohit Vijayarenu added a comment - 23/Feb/08 12:59 AM
I ran same set of experiments 4 times instead of 2. Here are the results
Job				Trunk			Trunk+patch1	Trunk+patch2
RandomWriter	1346			923				1607
RandomWriter	743				571				1111
RandomWriter	698				497				1003
RandomWriter	776				508				963
Sort			1535			2027			1802
Sort			1466			1869			1768
Sort			1618			1787			1738
Sort 			1699			2044			1515

Interesting note is that Trunk+patch1 writes have better time compared to Trunk.

While sort, I see many tasks failing due to ChecksumException which succeeds on other nodes in retry which affects the time show for sort jobs.

log:org.apache.hadoop.fs.ChecksumException: Checksum error: /tmps/3/gs203727-22269-2527764705241834/mapred-tt/mapred-local/task_200802222124_0007_m_001749_0/file.out at 17018368

Runping suggested we run wordcount instead, will do that and post the results.


Lohit Vijayarenu added a comment - 26/Feb/08 07:39 PM
Some more experiments. This time on a 200 node cluster allocated by HOD, ran randomtextwriter with 20 mappers (20 nodes) writing out data (2 TB in total). Then I used wordcount to scan the data across the allocated 200 nodes.
Job 			Trunk		Trunk+Patch1	Trunk+Patch2
RandomTextWriter	5642		6483			6720
RandomTextWriter	6304		6488			7074
WordCount		3118		3094			3082
WordCount		3096		3102			3084

Sameer Paranjpye added a comment - 26/Feb/08 08:49 PM
A single type of job on a static allocation will likely not show much benefit. Can we run the gridmix benchmark against these patches, with and without HoD? That data will help us better evaluate these changes.

Owen O'Malley added a comment - 27/Feb/08 06:32 AM
I think you missed the point of his experiment. He ran only 20 maps on a 200 node cluster to try and generate unbalanced distribution of blocks. Because word count largely is a scanning operation, you can see the better distribution of the blocks leading to improved times. Naturally, this will be better once hadoop-1985 is committed.

The most relevant missing piece of information would be the distribution of blocks in the input directory both per a node and per a rack. With trunk or patch1, he'd end up with those 20 nodes each having 5% of the blocks. (His 20 nodes probably hit 80% of the 23 racks in the 900 node hod cluster he was running on, so it makes sense that trunk and patch1 aren't that far apart.) Patch 2, should have generated pretty even distribution across the nodes and racks (although the 20% non-local racks would probably have 33% fewer blocks than the local racks).


Runping Qi added a comment - 27/Feb/08 06:42 AM

Another piece of useful information missing is the percentages of data local mappers of the runs.


Hairong Kuang added a comment - 27/Feb/08 10:04 PM
The performance of Trunk and Trunk+patch1 should be the same. I do not know why there is a performance difference between these two strategies on most of the experiments.

Owen O'Malley added a comment - 29/Feb/08 06:35 PM
Ok, I'd like a little more data please, Lohit. Please make sure you test with HADOOP-1985.

I'd like to see two tests:

A full 200 node test:
1. random writer
2. scan (just map-reduce input, no shuffle or reduces)
a. record number of node and rack local maps

A lopsided 200 node test:
1. random writer with 10 maps
a. block distribution by node
b. node distribution by rack
2. 200 node scan
a. node distribution by rack
b. number of node and rack local maps

Thanks!


Lohit Vijayarenu added a comment - 05/Mar/08 10:55 AM
Here are the test results

1. This test was 200 node random writer followed by just Scan.

Job             Trunk     Patch1    Patch2
RandWriter      460       471       520
Scan            718       583       601
MapTasks        16307     16307     16241
Data-Local Maps 15743     15741     15741
Rack-Local Maps 86        84        54

2. This test was 10 mappers on 200 node writing random text (total 2TB) followed by 200 node Scan

Job             Trunk     Patch1    Patch2
RandWriter      13633     12923     12299
Scan            764       606       600
MapTasks        16517     16453     16538
Data-Local Maps 15276     15335     15695
Rack-Local Maps 433       363       121

Lohit Vijayarenu added a comment - 05/Mar/08 10:56 AM
Block Report across nodes after RandomWriter for Trunk

Lohit Vijayarenu made changes - 05/Mar/08 10:56 AM
Attachment Trunk_Block_Report.png [ 12377152 ]
Lohit Vijayarenu added a comment - 05/Mar/08 10:59 AM
Screenshot of Block Distribution after RandomWriter against Patch1

Lohit Vijayarenu made changes - 05/Mar/08 10:59 AM
Attachment Patch1_Block_Report.png.jpg [ 12377153 ]
Lohit Vijayarenu added a comment - 05/Mar/08 11:00 AM
Screenshot of Block distribution after RandomWriter against Patch2

Lohit Vijayarenu made changes - 05/Mar/08 11:00 AM
Attachment Patch2 Block Report.jpg [ 12377154 ]
Lohit Vijayarenu added a comment - 05/Mar/08 11:02 AM
Node alloation across Racks while running against Trunk

Lohit Vijayarenu made changes - 05/Mar/08 11:02 AM
Attachment Trunk_Rack_Node_Mapping.jpg [ 12377155 ]
Lohit Vijayarenu added a comment - 05/Mar/08 11:03 AM
Node alloation across Racks while running against Patch1

Lohit Vijayarenu made changes - 05/Mar/08 11:03 AM
Attachment Patch1_Rack_Node_Mapping.jpg [ 12377156 ]
Lohit Vijayarenu added a comment - 05/Mar/08 11:04 AM
Node alloation across Racks while running against Patch2

Lohit Vijayarenu made changes - 05/Mar/08 11:04 AM
Attachment Patch2_Rack_Node_Mapping.jpg [ 12377157 ]
Runping Qi added a comment - 05/Mar/08 02:29 PM

Lohit,

Great work!

Clearly, the block distributation with patch2 is better than the one with patch1, and much better than that with trunk.
For the scan job, patch1 and patch2 performed about the same, and both better than the trunk for about 20%

The number for random writers are interesting.
The first test, where 200 nodes writes concurrently, shows trunk and patch1 were better than patch2 for about 15%.
Test 2 shows that patch2 performed best, and both patch1 and patch2 were better than trunk!
I suspect that the disk space of those nodes running the mappers might have reached the limit, thus,
blocks could not be placed on local nodes.


Lohit Vijayarenu added a comment - 10/Mar/08 06:44 PM
Submitting Patch1 for test

Lohit Vijayarenu made changes - 10/Mar/08 06:44 PM
Attachment HADOOP-2559-1.patch [ 12377544 ]
Lohit Vijayarenu made changes - 10/Mar/08 06:45 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 11/Mar/08 06:20 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377544/HADOOP-2559-1.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1933/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1933/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1933/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1933/console

This message is automatically generated.


Hairong Kuang added a comment - 13/Mar/08 11:17 PM
No need to add a new method chooseNodeOnSameRack. Should call the method chooseLocalRack instead but pass the second target as the first parameter.

Lohit Vijayarenu made changes - 13/Mar/08 11:57 PM
Attachment HADOOP-2559-1-2.patch [ 12377849 ]
Lohit Vijayarenu made changes - 14/Mar/08 12:11 AM
Attachment HADOOP-2559-1-3.patch [ 12377850 ]
dhruba borthakur made changes - 17/Mar/08 06:49 AM
Link This issue is depended upon by HADOOP-2094 [ HADOOP-2094 ]
Lohit Vijayarenu made changes - 17/Mar/08 05:57 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Lohit Vijayarenu added a comment - 17/Mar/08 05:58 PM
Thanks Hairong, Attaching patch modifying as suggested by your comments

Lohit Vijayarenu made changes - 17/Mar/08 05:58 PM
Attachment HADOOP-2559-1-4.patch [ 12378053 ]
Hairong Kuang added a comment - 17/Mar/08 06:00 PM
+1 The patch looks good.

Lohit Vijayarenu made changes - 17/Mar/08 06:05 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 17/Mar/08 08:26 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378053/HADOOP-2559-1-4.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests -1. The patch failed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1984/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1984/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1984/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1984/console

This message is automatically generated.


Lohit Vijayarenu added a comment - 17/Mar/08 09:32 PM
the failed test was due to timeout. I re-ran the test locally and it passed.

Chris Douglas added a comment - 17/Mar/08 11:54 PM
I just committed this. Thanks, Lohit!

Chris Douglas made changes - 17/Mar/08 11:54 PM
Resolution Fixed [ 1 ]
Fix Version/s 0.17.0 [ 12312913 ]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Repository Revision Date User Message
ASF #638144 Mon Mar 17 23:55:09 UTC 2008 cdouglas HADOOP-2559. Change DFS block placement to allocate the first replica
locally, the second off-rack, and the third intra-rack from the
second. Contributed by Lohit Vijayarenu.
Files Changed
MODIFY /hadoop/core/trunk/CHANGES.txt
MODIFY /hadoop/core/trunk/src/test/org/apache/hadoop/dfs/TestReplicationPolicy.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/ReplicationTargetChooser.java

Hudson added a comment - 18/Mar/08 03:35 PM

Lohit Vijayarenu made changes - 17/Apr/08 05:10 AM
Description
Currently, when writing out a block, dfs will place one copy to a local data node, one copy to a rack local node
and another one to a remote node. This leads to a number of undesired properties:

1. The block will be rack-local to two tacks instead of three, reducing the advantage of rack locality based scheduling by 1/3.

2. The Blocks of a file (especiallya large file) are unevenly distributed over the nodes: One third will be on the local node, and two thirds on the nodes on the same rack. This may make some nodes full much faster than others,
increasing the need of rebalancing. Furthermore, this also make some nodes become "hot spots" if those big
files are popular and accessed by many applications.


Currently, when writing out a block, dfs will place one copy to a local data node, one copy to a rack local node
and another one to a remote node. This leads to a number of undesired properties:

1. The block will be rack-local to two tacks instead of three, reducing the advantage of rack locality based scheduling by 1/3.

2. The Blocks of a file (especiallya large file) are unevenly distributed over the nodes: One third will be on the local node, and two thirds on the nodes on the same rack. This may make some nodes full much faster than others,
increasing the need of rebalancing. Furthermore, this also make some nodes become "hot spots" if those big
files are popular and accessed by many applications.


Release Note Change DFS block placement to allocate the first replica locally, the second off-rack, and the third intra-rack from the second.
Nigel Daley made changes - 21/May/08 08:05 PM
Status Resolved [ 5 ] Closed [ 6 ]
Owen O'Malley made changes - 08/Jul/09 04:42 PM
Component/s dfs [ 12310710 ]