Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- Hadoop Flags: Reviewed
Description
I ran a simple map/reduce job counting the number of records in the input data.
The number of reducers was set to 1.
I did not set the number of map tasks, so by default every split except the last split of each file contained one DFS block (128MB in my case).
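For context, here is a minimal sketch of such a record-counting job using the old org.apache.hadoop.mapred API. The class names, job name, and paths are placeholders and not the actual job used in this report; only the single-reducer and default-mapper settings mirror the setup described above.
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class RecordCount {
  // Emits one ("records", 1) pair per input record.
  public static class CountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final Text KEY = new Text("records");
    private static final LongWritable ONE = new LongWritable(1);
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      out.collect(KEY, ONE);
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(RecordCount.class);
    conf.setJobName("record-count");
    conf.setMapperClass(CountMapper.class);
    conf.setReducerClass(LongSumReducer.class);   // sums the 1s per key
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setNumReduceTasks(1);                    // single reducer, as in the report
    // Number of map tasks left at the default, so each split covers one DFS block.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
{code}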
The web GUI indicated that 99% of the map tasks had data-local splits.
I therefore expected most of the DFS reads to come from the local datanodes.
However, when I examined the traffic on the network interfaces,
I found that about 50% of each node's traffic went through the loopback interface and the other 50% went through the Ethernet card!
The switch monitoring also indicated that a lot of traffic went over the uplinks and across racks!
This indicates that the data-locality feature does not work as expected.
To confirm that, I set the number of map tasks to a very high number so that the split size was forced down to about 27MB (see the sketch at the end of this description).
The web GUI again indicated that 99% of the map tasks had data-local splits, as expected.
This time the interface monitor showed that almost 100% of the traffic went through the loopback interface, as it should.
The switch monitoring also indicated very little traffic over the uplinks and across racks.
This implies that some corner case is not handled properly, apparently when the split size equals the DFS block size.
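For reference, the split size in the second run was forced down through the map-task hint. This is a hedged sketch under the old FileInputFormat behavior; the value 5000 is a placeholder, not the count actually used in the run described above.
{code:java}
// With the old mapred FileInputFormat, goalSize = totalInputSize / numMapTasks
// and splitSize = max(minSize, min(goalSize, blockSize)), so a large enough
// map-task hint shrinks each split well below the 128MB block size
// (about 27MB in the second run described here).
conf.setNumMapTasks(5000);  // placeholder value
{code}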