Hadoop Common / HADOOP-18396

Issues running in dynamic / managed environments



    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Versions: 3.4.0, 3.3.5, 3.3.4
    • Environment: Running an HA configuration in Kubernetes, using Java 11.


      Running in dynamic or managed environments is a challenge because we can't assume that all services will have DNS entries, will be started in a specific order, will maintain constant IP addresses, etc.  I'm using the following assumptions to guide the changes necessary to operate in this kind of environment:

      1. The configuration files are an expression of desired state
      2. If a referenced service instance is not resolvable or reachable at a given moment, it eventually will be, and it should then be able to participate as if it had been there originally, without requiring manual intervention
      3. IP address changes should be handled in a way that not only allows distributed calls to continue to function, but also avoids having to re-resolve the address over and over
      4. Code that requires resolved names (Kerberos and DataNode registration) should fall back to DNS reverse lookups to work around temporary issues caused by caching.  Example: The DataNode registration is only performed at startup, and yet the extra check that allows it to succeed in registering with the NameNode isn’t performed
      5. If an HA system is supposed to only require a quorum, then we shouldn’t require the full set, allowing the called service to bring the remaining instances into compliance
      6. Managing a service should be independent of other services.  Example: You should be able to perform a rolling restart of JournalNodes without worrying about causing an issue with NameNodes as long as a quorum is present.
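
      Assumptions 2 and 3 can be illustrated with a minimal sketch. It is not Hadoop code: CachedResolver and its Lookup interface are hypothetical names introduced here, and the sketch assumes callers ask for a refresh only after a connection failure. The address is cached, an unresolvable host is simply retried later, and a change is reported exactly once.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;

/** Hypothetical helper (not a Hadoop class): caches one host's resolved address. */
final class CachedResolver {
    /** Lookup abstraction so DNS can be stubbed; InetAddress::getByName fits this shape. */
    interface Lookup {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private final String host;
    private final Lookup lookup;
    private InetAddress current; // null until the first successful resolution

    CachedResolver(String host, Lookup lookup) {
        this.host = host;
        this.lookup = lookup;
    }

    /** Returns the cached address, resolving lazily; empty if the host is not resolvable yet. */
    synchronized Optional<InetAddress> address() {
        if (current == null) {
            refresh();
        }
        return Optional.ofNullable(current);
    }

    /**
     * Re-resolves the host; intended to be called after a connection failure.
     * Returns true only when the address actually changed, so the caller can
     * log the new address once instead of on every retry.
     */
    synchronized boolean refresh() {
        try {
            InetAddress fresh = lookup.resolve(host);
            if (fresh.equals(current)) {
                return false; // same address: nothing to report
            }
            current = fresh;
            return true;
        } catch (UnknownHostException e) {
            return false; // not resolvable right now: keep any prior address and retry later
        }
    }
}
```

      A caller would construct it as new CachedResolver(host, InetAddress::getByName), use address() for normal calls, and call refresh() only when a connection attempt fails.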

      A proof of these concepts would be the ability to:

      • Start with fewer than the full replica count of a service, while still providing the required quorum or minimal count, and have the cluster start and function.  Example: 2 out of 3 configured JournalNodes should still allow the NameNode to format, function, rollover to the standby, etc.
      • Introduce missing instances and have them join the existing cluster without manual intervention.  Example: Starting the 3rd JournalNode should cause it to be automatically formatted and brought up to date
      • Perform rolling restarts of individual services without negatively impacting other services (causing failures, restarts, etc.).  Example: Rolling restarts of JournalNodes shouldn't cause problems in NameNodes; Rolling restarts of NameNodes shouldn't cause problems with DataNodes
      • See logs report each updated IP address only once (per dependent), avoiding costly re-resolution
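
      The "2 out of 3" example above comes down to simple arithmetic, sketched below. This assumes a simple-majority quorum; QuorumCheck is a hypothetical name used for illustration, not an existing Hadoop class.

```java
/** Hypothetical helper (not a Hadoop class): simple-majority quorum arithmetic. */
final class QuorumCheck {
    private QuorumCheck() {
    }

    /** Smallest number of instances that forms a majority of the configured set. */
    static int majority(int configured) {
        if (configured <= 0) {
            throw new IllegalArgumentException("configured must be positive");
        }
        return configured / 2 + 1;
    }

    /** True when the reachable instances alone are enough to proceed. */
    static boolean hasQuorum(int reachable, int configured) {
        if (reachable < 0 || reachable > configured) {
            throw new IllegalArgumentException("reachable must be between 0 and configured");
        }
        return reachable >= majority(configured);
    }
}
```

      Under that assumption, hasQuorum(2, 3) is true, so a NameNode that can reach 2 of 3 configured JournalNodes would proceed, while hasQuorum(1, 3) is false.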





              svaughan Steve Vaughan