Description
On startup, the Datanode creates an InetSocketAddress to register with each namenode. Though there are retries on connection failure throughout the stack, the same InetSocketAddress is reused.
InetSocketAddress is an interesting class: it resolves DNS names to IP addresses at construction time, and that resolution is never refreshed for the lifetime of the object. Hadoop re-creates the InetSocketAddress in some cases, just in case the remote IP has changed for a particular DNS name: https://issues.apache.org/jira/browse/HADOOP-7472.
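To make the construction-time behavior concrete, here is a minimal standalone illustration (the hostname is just the one from the logs below, standing in for any namenode address):

{code:java}
import java.net.InetSocketAddress;

public class ResolveOnConstruct {
  public static void main(String[] args) {
    // DNS resolution happens here, at construction time. If the lookup
    // fails at this moment, the failure is baked into the object.
    InetSocketAddress addr = new InetSocketAddress("cluster-32f5-m", 8020);

    // There is no API to re-trigger resolution on the same instance;
    // isUnresolved() will report true for this object forever.
    System.out.println("unresolved? " + addr.isUnresolved());

    // The only way to retry the lookup is to construct a new instance,
    // which is exactly what the HADOOP-7472 logic does on retries.
    InetSocketAddress fresh =
        new InetSocketAddress(addr.getHostName(), addr.getPort());
    System.out.println("after re-create, unresolved? " + fresh.isUnresolved());
  }
}
{code}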
Anyway, on startup, you can see the Datanode log: "Namenode...remains unresolved" – referring to the fact that the DNS lookup failed.
2017-11-02 16:01:55,115 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh request received for nameservices: null
2017-11-02 16:01:55,153 WARN org.apache.hadoop.hdfs.DFSUtilClient: Namenode for null remains unresolved for ID null. Check your hdfs-site.xml file to ensure namenodes are configured properly.
2017-11-02 16:01:55,156 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting BPOfferServices for nameservices: <default>
2017-11-02 16:01:55,169 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (Datanode Uuid unassigned) service to cluster-32f5-m:8020 starting to offer service
The Datanode then proceeds to use this unresolved address, since it may still work if the DN is configured to use a proxy. Since I'm not using a proxy, it prints this message forever:
2017-12-15 00:13:40,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:13:45,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:13:50,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:13:55,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:14:00,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
Unfortunately, the log doesn't contain the exception that triggered it, but the culprit is actually in IPC Client: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java#L444.
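Paraphrasing that line and its surroundings (treat this as a sketch of the Connection constructor, not a verbatim quote – the exact shape varies by branch):

{code:java}
// Sketch of the check in Client.Connection's constructor (paraphrased).
// remoteId.getAddress() returns the InetSocketAddress that was
// resolved -- or not -- when it was constructed.
this.server = remoteId.getAddress();
if (server.isUnresolved()) {
  // Thrown while the Connection is being constructed, i.e. inside
  // Client#getConnection, before any connection retry logic runs.
  throw NetUtils.wrapException(server.getHostName(), server.getPort(),
      null, 0, new UnknownHostException());
}
{code}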
This line was introduced in https://issues.apache.org/jira/browse/HADOOP-487 to give a clear error message when somebody misspells an address.
However, the fix in HADOOP-7472 doesn't apply here: its re-resolution logic runs in Client#getConnection only after the Connection has been constructed, and the exception above is thrown during construction, so that logic is never reached.
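For reference, the HADOOP-7472 logic in question is Connection#updateAddress, which re-resolves the address when invoked from the connection-failure handlers; roughly (again paraphrased, not verbatim):

{code:java}
// Roughly the HADOOP-7472 re-resolution logic in Client.Connection.
// It is only invoked from the connection-failure handlers, which the
// constructor-time exception never reaches.
private synchronized boolean updateAddress() throws IOException {
  // Build a brand-new InetSocketAddress, forcing a fresh DNS lookup.
  InetSocketAddress currentAddr = NetUtils.createSocketAddrForHost(
      server.getHostName(), server.getPort());
  if (!server.equals(currentAddr)) {
    LOG.warn("Address change detected. Old: " + server +
        " New: " + currentAddr);
    server = currentAddr;
    return true;
  }
  return false;
}
{code}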
My proposed fix (will attach a patch) is to move this exception out of the constructor and into a place that triggers HADOOP-7472's logic to re-resolve addresses. If the DNS failure was temporary, this allows the connection to succeed; if not, the connection fails after the ipc client exhausts its retries (by default, 10 seconds' worth).
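Sketched out, the idea is to relocate the check into the retry loop in Connection#setupConnection, where handleConnectionFailure (and, through it, updateAddress) gets a chance to run. This is the direction, not the actual patch; details will differ:

{code:java}
// Sketch of the proposed fix (assumes the check moves into the retry
// loop of Connection#setupConnection; the actual patch may differ).
private synchronized void setupConnection() throws IOException {
  int ioFailures = 0;
  while (true) {
    try {
      if (server.isUnresolved()) {
        // Same exception as before, but now thrown inside the retry
        // loop, so the catch block below can re-resolve the address.
        throw NetUtils.wrapException(server.getHostName(), server.getPort(),
            null, 0, new UnknownHostException());
      }
      this.socket = socketFactory.createSocket();
      // ... existing socket setup ...
      return;
    } catch (IOException ie) {
      // handleConnectionFailure calls updateAddress(), re-resolving
      // DNS before the next attempt (HADOOP-7472).
      handleConnectionFailure(ioFailures++, ie);
    }
  }
}
{code}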
I want to fix this in ipc client rather than just in Datanode startup, as this fixes temporary DNS issues for all of Hadoop.
Attachments
Issue Links
- breaks
  - HDFS-8068 Do not retry rpc calls If the proxy contains unresolved address (Patch Available)
- relates to
  - HADOOP-12125 Retrying UnknownHostException on a proxy does not actually retry hostname resolution (Open)
  - HADOOP-487 misspelt DFS host name gives null pointer exception in getProtocolVersion (Closed)
- links to