Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Motivation:
This would allow people to keep less frequently used data in a non-co-located HDFS instance and pull it from that cluster when needed.
This is useful in the context of Hive, where the Metastore tells the runtime system where the data is located, either as a full URI or through symbolic links.
The problem:
This does not work right now because the mappers reading across clusters may overwhelm the switches between the two instances.
Workaround:
Make the files unsplittable, or set the block size so that the job gets only 2-3 mappers.
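To make the block-size arithmetic in the workaround concrete, here is a rough sketch. It assumes one mapper per split and uniformly sized splits over a single input, and it ignores block boundaries and any slack the input format applies, so treat the result as an estimate. The function names are hypothetical and are not part of Hadoop.

```python
def split_size_for_mapper_cap(total_bytes: int, max_mappers: int) -> int:
    """Smallest uniform split size (in bytes) that yields at most max_mappers splits.

    Hypothetical helper: assumes one mapper per split and equal-sized splits.
    """
    if total_bytes < 0 or max_mappers < 1:
        raise ValueError("total_bytes must be >= 0 and max_mappers >= 1")
    # Integer ceiling division, so large byte counts stay exact.
    return max(1, (total_bytes + max_mappers - 1) // max_mappers)


def mapper_count(total_bytes: int, split_bytes: int) -> int:
    """Number of splits (and so mappers) for a given uniform split size."""
    if split_bytes < 1:
        raise ValueError("split_bytes must be >= 1")
    return (total_bytes + split_bytes - 1) // split_bytes


# Example: cap a 10 GiB input at 3 mappers.
ten_gib = 10 * 1024**3
size = split_size_for_mapper_cap(ten_gib, 3)
assert mapper_count(ten_gib, size) <= 3
```

Each mapper then reads a very large split, so this trades cross-cluster fan-out for longer individual tasks.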
Possible solution:
Throttle parallelism in the scheduler by letting a job specify that only X mappers run at a time, no matter how many slots are free. This makes some assumptions about the reliability of the JobTracker's failure detector.
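A minimal sketch of the throttling idea, not the actual JobTracker code: the scheduler launches new mappers only up to a per-job cap X, even when more slots are free. All names here (`mappers_to_launch`, `simulate`, `cap`) are hypothetical, and the simulation assumes every task takes exactly one scheduling tick and that the scheduler's view of running tasks is accurate.

```python
def mappers_to_launch(free_slots: int, running: int, pending: int, cap: int) -> int:
    """How many new mappers to start for a job under a per-job concurrency cap.

    free_slots: map slots currently free in the cluster
    running:    mappers of this job believed to be running
    pending:    mappers of this job not yet started
    cap:        the per-job limit X (hypothetical setting)
    """
    headroom = max(0, cap - running)
    return min(free_slots, pending, headroom)


def simulate(total_tasks: int, free_slots: int, cap: int) -> tuple[int, int]:
    """Toy simulation: returns (peak concurrent mappers, ticks to finish).

    Assumes every task finishes within one tick, so nothing is
    still running when the next tick starts.
    """
    if free_slots < 1 or cap < 1:
        raise ValueError("free_slots and cap must be >= 1")
    pending, peak, ticks = total_tasks, 0, 0
    while pending > 0:
        running = 0  # tasks from the previous tick have all completed
        started = mappers_to_launch(free_slots, running, pending, cap)
        running += started
        pending -= started
        peak = max(peak, running)
        ticks += 1
    return peak, ticks


# 10 mappers, 8 free slots, cap of 3: never more than 3 run at once.
assert simulate(10, 8, 3) == (3, 4)
```

The cap only holds as long as the `running` count is right. If the scheduler wrongly declares a live task dead, it may start a replacement and exceed X. If it is slow to notice a dead task, the job sits below X for a while. That is presumably where the failure-detector assumption comes in.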
Issue Links
- relates to HADOOP-3421 Requirements for a Resource Manager for Hadoop (Resolved)