Key: HDFS-355
Type: Improvement
Status: Open
Priority: Major
Assignee: Unassigned
Reporter: Pete Wyckoff
Votes: 0
Watchers: 4
Hadoop HDFS

Ability to throttle DFS/MR so as not to overwhelm colo to colo switches

Created: 19/Sep/08 05:51 PM   Updated: 20/Jun/09 07:42 AM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

Issue Links:
Reference
 


Description
Motivation:

This would allow people to put less frequently used data in a non-co-located HDFS instance and pull it from the other cluster when needed.
This is useful in the context of Hive, where a Metastore tells the runtime system where the data is located, either as a full URI or through symbolic links.

The problem:

This does not work right now because a job reading across clusters may overwhelm the switches between the two instances.

Workaround:

Make the files unsplittable, or choose a block size such that you only get 2-3 mappers.
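
As a rough sketch of the arithmetic behind the block-size workaround, assuming one mapper per block-sized split (real input formats may apply their own split rules and block-size constraints, so treat the result as an estimate). The class and method names here are hypothetical and not part of Hadoop:

```java
public class SplitSizing {
    /**
     * Smallest block size in bytes for which a file of fileBytes yields
     * at most maxMappers splits, assuming one mapper per block-sized split.
     */
    public static long minBlockSize(long fileBytes, int maxMappers) {
        if (fileBytes <= 0 || maxMappers <= 0) {
            throw new IllegalArgumentException("fileBytes and maxMappers must be positive");
        }
        // Ceiling division: round up so the last block absorbs the remainder.
        return (fileBytes + maxMappers - 1) / maxMappers;
    }

    /** Number of mappers under the same one-mapper-per-block assumption. */
    public static long mapperCount(long fileBytes, long blockBytes) {
        if (fileBytes <= 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("fileBytes and blockBytes must be positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long fileBytes = 10L * 1024 * 1024 * 1024; // hypothetical 10 GiB input file
        long block = minBlockSize(fileBytes, 3);
        System.out.println(block + " bytes per block gives "
                + mapperCount(fileBytes, block) + " mappers");
    }
}
```

The drawback is visible in the numbers: capping mappers this way forces very large blocks, which is why it is only a workaround.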

Possible solution:

Throttle parallelism in the scheduler by letting a job specify that only X mappers may run at a time, no matter how many slots are free. (This makes some assumptions about the reliability of the JobTracker's failure detector.)
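
The idea of capping concurrency below the number of free slots can be illustrated with a small standalone sketch. This is not JobTracker or scheduler code; `ThrottledRunner` and its methods are hypothetical names, and a plain thread pool stands in for the free slots:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThrottledRunner {
    private final Semaphore permits;                       // at most X tasks hold a permit
    private final AtomicInteger running = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger(); // highest concurrency observed

    public ThrottledRunner(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the task once a permit is available, blocking until then. */
    public void run(Runnable task) throws InterruptedException {
        permits.acquire();
        try {
            int now = running.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            task.run();
        } finally {
            running.decrementAndGet();
            permits.release();
        }
    }

    public int peak() {
        return peak.get();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Submits `tasks` short tasks to `slots` threads with a cap of `cap`; returns peak concurrency. */
    public static int demo(int slots, int cap, int tasks) {
        ThrottledRunner runner = new ThrottledRunner(cap);
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    runner.run(() -> sleepQuietly(10)); // stand-in for a mapper's work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return runner.peak();
    }

    public static void main(String[] args) {
        // 8 "slots" are free, but the cap keeps observed concurrency at or below 3.
        System.out.println("peak concurrency: " + demo(8, 3, 20));
    }
}
```

Note that in this sketch a task that hangs keeps its permit until it returns, which mirrors the dependence on failure detection: a cap of X is only useful if stuck tasks are detected and their capacity reclaimed.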



Pete Wyckoff added a comment - 23/Sep/08 12:05 AM
If we think of the switch as a resource (or, at a higher level, the HDFS instance in the other colo as a resource for "this" mapred instance, which is still not truly global across all mapred instances), this JIRA relates to HADOOP-3421.