Details
- Type: Improvement
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Versions: 3.4.0, 3.3.5, 3.3.4
- Environment: Running an HA configuration in Kubernetes, using Java 11.
Description
Running in dynamic or managed environments is a challenge because we can't assume that all services will have DNS entries, will be started in a specific order, will maintain constant IP addresses, etc. I'm using the following assumptions to guide the changes necessary to operate in this kind of environment:
- The configuration files are an expression of desired state
- If a referenced service instance is not resolvable or reachable at a given moment, it eventually will be, and it should then be able to participate as if it had been there originally, without requiring manual intervention
- IP address changes should be handled in a way that not only allows distributed calls to continue to function, but also avoids having to re-resolve the address over and over
- Code that requires resolved names (Kerberos and DataNode registration) should fall back to DNS reverse lookups to work around temporary issues caused by caching. Example: The DataNode registration is only performed at startup, and yet the extra check that allows it to succeed in registering with the NameNode isn’t performed
- If an HA system is supposed to only require a quorum, then we shouldn’t require the full set, allowing the called service to bring the remaining instances into compliance
- Managing a service should be independent of other services. Example: You should be able to perform a rolling restart of JournalNodes without worrying about causing an issue with NameNodes as long as a quorum is present.
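The assumption about IP address changes can be sketched as a small cache that re-resolves a host only when asked (for example, after a connection failure) and reports a change exactly once. This is a minimal illustration, not code from Hadoop: the class `AddressTracker`, its methods, and the injected resolver function are all hypothetical names chosen for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper (not a Hadoop class): remembers the last address seen per host. */
class AddressTracker {
    private final Map<String, String> lastKnown = new ConcurrentHashMap<>();
    // Assumed contract: returns the current IP for a host name, or null if unresolvable.
    private final Function<String, String> resolver;

    AddressTracker(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /**
     * Re-resolves the host; intended to be called only after a failure,
     * not on every request. Returns true when the cached address changed
     * (including the first successful resolution), so the caller can log
     * the update once instead of on every call.
     */
    boolean refresh(String host) {
        String resolved = resolver.apply(host);
        if (resolved == null) {
            return false; // still unresolvable: keep the desired state and retry later
        }
        String previous = lastKnown.put(host, resolved);
        return !resolved.equals(previous);
    }

    /** Last successfully resolved address, or null if the host was never resolved. */
    String current(String host) {
        return lastKnown.get(host);
    }
}
```

In a real deployment the resolver could wrap `InetAddress.getByName(host).getHostAddress()` and map `UnknownHostException` to `null`; injecting it as a function keeps the sketch testable without DNS.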
A proof of these concepts would be the ability to:
- Start with fewer than the full replica count of a service, while still providing the required quorum or minimum count, and have the cluster start and function. Example: 2 out of 3 configured JournalNodes should still allow the NameNode to format, function, roll over to the standby, etc.
- Introduce missing instances and have them join the existing cluster without manual intervention. Example: A newly started 3rd JournalNode should automatically be formatted and brought up to date
- Perform rolling restarts of individual services without negatively impacting other services (causing failures, restarts, etc.). Example: Rolling restarts of JournalNodes shouldn't cause problems in NameNodes; Rolling restarts of NameNodes shouldn't cause problems with DataNodes
- See logs report each updated IP address only once (per dependent), confirming that costly re-resolution is avoided
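The "2 out of 3 configured JournalNodes" example comes down to simple arithmetic. The sketch below assumes the quorum is a strict majority of the configured instances; the class and method names are hypothetical and are not taken from the Hadoop code base.

```java
/** Hypothetical helper illustrating majority-quorum arithmetic; not a Hadoop class. */
final class QuorumCheck {
    private QuorumCheck() {}

    /** Smallest number of instances that forms a strict majority of the configured set. */
    static int majoritySize(int configured) {
        if (configured <= 0) {
            throw new IllegalArgumentException("configured must be positive");
        }
        return configured / 2 + 1; // integer division: 3 -> 2, 4 -> 3, 5 -> 3
    }

    /** True when enough instances are reachable to proceed without the full set. */
    static boolean hasQuorum(int configured, int reachable) {
        return reachable >= majoritySize(configured);
    }
}
```

Under this assumption, `QuorumCheck.hasQuorum(3, 2)` is true, so an operation such as a format could proceed with one JournalNode missing and leave the straggler to be brought up to date later.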
Issue Links
is a parent of:
- HDFS-4043 Namenode Kerberos Login does not use proper hostname for host qualified hdfs principal name. (Resolved)
- HDFS-16685 DataNode registration fails because getHostName returns an IP address (In Progress)
- HDFS-16688 Unresolved Hosts during startup are not synced by JournalNodes (In Progress)
- HDFS-16690 Automatically format new unformatted JournalNodes using JournalNodeSyncer (In Progress)
- HDFS-16691 Use quorum instead of requiring full JN set for NN format (In Progress)
- HADOOP-18365 Updated addresses are still accessed using the old IP address (Resolved)
- HDFS-16684 Exclude self from JournalNodeSyncer when using a bind host (Resolved)
- HDFS-16686 GetJournalEditServlet fails to authorize valid Kerberos request (Resolved)