[HADOOP-1985] Abstract node to switch mapping into a topology service class used by namenode and jobtracker - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.17.0
Component/s: None
Labels: None

Hadoop Flags: Incompatible change, Reviewed
Release Note:

This issue introduces rack awareness for map tasks. It also moves the rack resolution logic to the central servers, the NameNode and the JobTracker. The administrator can set topology.node.switch.mapping.impl to name a loadable class that implements the rack resolution logic. The class must implement a method, resolve(List<String> names), where names is the list of DNS names or IP addresses to resolve. The return value is a list of resolved network paths of the form /foo/rack, where rack is the ID of the rack the node belongs to, foo is the switch that connects multiple racks, and so on. The default implementation, org.apache.hadoop.net.ScriptBasedMapping, is packaged with Hadoop and loads a script that performs the rack resolution. The script location is configurable through topology.script.file.name and defaults to an empty script name. When the script name is empty, /default-rack is returned for all DNS names and IP addresses. The loadable topology.node.switch.mapping.impl gives administrators the flexibility to define how node resolution happens at their site.
For MapReduce, one can also specify the cache level, expressed as a number of levels in the resolved network path; it defaults to two. This means that the JobTracker caches tasks at the host level and at the rack level.
Known issue: task caching does not work with levels greater than 2 (beyond racks). This bug is tracked in HADOOP-3296.
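
As an illustration of the script-based resolution described above, here is a minimal sketch of a topology script that could be named in topology.script.file.name. It assumes the commonly documented contract in which the script receives one or more DNS names or IP addresses as command-line arguments and prints one network path per argument; check this against your Hadoop version. The host table, the example addresses, and the function name `resolve` are hypothetical, and falling back to /default-rack for unknown hosts is a choice made for this sketch.

```python
#!/usr/bin/env python3
"""Hypothetical topology script: prints one network path per host argument."""
import sys

# Assumed site-specific table (illustrative values only). A real deployment
# would more likely load this from a file maintained by the administrator.
RACK_BY_HOST = {
    "10.1.1.11": "/switch1/rack1",
    "10.1.1.12": "/switch1/rack1",
    "10.1.2.21": "/switch1/rack2",
    "node31.example.com": "/switch2/rack3",
}

# Fallback for hosts missing from the table (a choice made for this sketch).
DEFAULT_RACK = "/default-rack"


def resolve(names):
    """Return a network path for each name, in the same order as the input."""
    return [RACK_BY_HOST.get(name, DEFAULT_RACK) for name in names]


if __name__ == "__main__":
    # One path per argument, separated by spaces, on a single output line.
    print(" ".join(resolve(sys.argv[1:])))
```

Keeping the lookup in a single table on the central servers is what lets the mapping be maintained in one place rather than on every node.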


Description

In order to implement switch locality in MapReduce, we need to have the switch location in both the namenode and the job tracker. Currently the namenode asks the data nodes for this info, and they run a local script to answer the question. In our environment, and in others that I know of, there is no reason to push this to each node. It is easier to maintain a centralized script that maps node DNS names to switch strings.

I propose that we build a new class that caches known DNS-name-to-switch mappings and invokes a loadable class or a configurable system call to resolve unknown ones. We can then add this to the namenode to support the current block-to-switch mapping needs and simplify the data nodes. We can also add this same callout to the job tracker and then implement rack locality logic there without needing to change the filesystem API or the split planning API.

Not only is this the least intrusive path to building rack-local MR that I can identify, it is also compatible with future infrastructures that may derive topology on the fly.
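
The caching class proposed above could look roughly like the following sketch. It is not the code from the attached patches: the class name `CachedSwitchMapping`, the `raw_resolve` callback, and the error handling are all hypothetical. It only illustrates the idea of answering known names from a cache and handing the unknown ones to a pluggable resolver in one batch.

```python
from typing import Callable, Dict, List


class CachedSwitchMapping:
    """Hypothetical cache in front of a pluggable name-to-switch resolver."""

    def __init__(self, raw_resolve: Callable[[List[str]], List[str]]):
        # raw_resolve stands in for the loadable class or system call;
        # it must return one network path per input name, in order.
        self._raw_resolve = raw_resolve
        self._cache: Dict[str, str] = {}

    def resolve(self, names: List[str]) -> List[str]:
        # Collect names not seen before, dropping duplicates but keeping order.
        unknown = [n for n in dict.fromkeys(names) if n not in self._cache]
        if unknown:
            resolved = self._raw_resolve(unknown)
            if len(resolved) != len(unknown):
                raise ValueError("resolver returned a wrong number of paths")
            self._cache.update(zip(unknown, resolved))
        # Every requested name is now cached; answer in input order.
        return [self._cache[n] for n in names]
```

Under these assumptions, the namenode and the job tracker could each hold one such object, so a repeated lookup for the same node never reaches the underlying script or class again.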

Attachments

| File | Date | Size | Uploaded by |
| --- | --- | --- | --- |
| jobinprogress.patch | 16/Jan/08 19:27 | 11 kB | Devaraj Das |
| 1985.v9.patch | 18/Jan/08 11:35 | 82 kB | Devaraj Das |
| 1985.v6.patch | 14/Jan/08 04:58 | 83 kB | Devaraj Das |
| 1985.v5.patch | 07/Jan/08 15:49 | 78 kB | Devaraj Das |
| 1985.v4.patch | 02/Jan/08 19:57 | 78 kB | Devaraj Das |
| 1985.v3.patch | 24/Dec/07 06:51 | 81 kB | Devaraj Das |
| 1985.v25.patch | 28/Feb/08 16:05 | 101 kB | Devaraj Das |
| 1985.v24.patch | 27/Feb/08 04:52 | 101 kB | Devaraj Das |
| 1985.v23.patch | 25/Feb/08 11:41 | 100 kB | Devaraj Das |
| 1985.v20.patch | 19/Feb/08 17:39 | 100 kB | Devaraj Das |
| 1985.v2.patch | 15/Dec/07 17:27 | 80 kB | Devaraj Das |
| 1985.v19.patch | 15/Feb/08 04:28 | 99 kB | Devaraj Das |
| 1985.v11.patch | 25/Jan/08 15:04 | 88 kB | Devaraj Das |
| 1985.v10.patch | 22/Jan/08 11:49 | 81 kB | Devaraj Das |
| 1985.v1.patch | 14/Dec/07 12:27 | 78 kB | Devaraj Das |
| 1985.new.patch | 19/Nov/07 19:07 | 27 kB | Devaraj Das |

Issue Links

blocks

- MAPREDUCE-315 Bias the decision of task scheduling (both for not-running and running) on node metrics (load, processing rate etc). (Open)

is depended upon by

- HADOOP-2119 JobTracker becomes non-responsive if the task trackers finish task too fast (Closed)
- MAPREDUCE-267 Rack level copy of map outputs (Open)

is related to

- HDFS-891 DataNode no longer needs to check for dfs.network.script (Closed)
