Hadoop YARN / YARN-11249

YARN Client Retry Policy inadvertently changed several years ago



    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: client
    • Labels: None


      A non-HA & non-Federation YARN client should be using OtherThanRemoteAndSaslExceptionDependentRetry (previously named OtherThanRemoteExceptionDependentRetry).

      An HA or Federation YARN client should be using FailoverOnNetworkExceptionRetry.

      However, the logic surrounding which Retry Policy gets used (by YARN client) has been inconsistent/unexpected due to past changes in the code.

      A commit from approximately 2017-09 introduced what is arguably a bug: FailoverOnNetworkExceptionRetry gets used for non-HA & non-Federation YARN clusters because the default yarn-site configuration sets "yarn.federation.failover.enabled=true". This is arguably a bug because "yarn.federation.failover.enabled" defaults to true even when YARN federation is not enabled at all. As a consequence, non-HA & non-Federation YARN clusters running a Hadoop release from after 2017-09 have unexpectedly been using FailoverOnNetworkExceptionRetry when they should have been using OtherThanRemoteAndSaslExceptionDependentRetry.

      The only YARN clusters (using a Hadoop release from after 2017-09) that would not be impacted by the retry policy change are those with a custom yarn-site configuration of "yarn.federation.failover.enabled=false". I assume this is a fairly small fraction of YARN clusters, so I suspect the retry policy change affected the majority of clusters using an affected Hadoop release.

      The code was recently corrected: a change committed in 2022-07 causes the correct Retry Policy, OtherThanRemoteAndSaslExceptionDependentRetry, to be used for non-HA & non-Federation YARN clusters. The change reports federation failover as not enabled unless federation itself is enabled.
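
      The effect of that change on policy selection can be sketched as follows. This is a minimal, self-contained model and not Hadoop's actual code: the class RetryPolicySelection, its method names, and the Map-based configuration are hypothetical, and the default of false for "yarn.federation.enabled" is an assumption not stated in this issue.

```java
import java.util.Map;

// Hypothetical model of the YARN client retry policy selection; not Hadoop's real classes.
class RetryPolicySelection {
    // yarn-site property names. The "yarn.federation.enabled" default (false) is an assumption.
    static final String FEDERATION_ENABLED = "yarn.federation.enabled";
    static final String FEDERATION_FAILOVER_ENABLED = "yarn.federation.failover.enabled";

    enum Policy {
        OTHER_THAN_REMOTE_AND_SASL_EXCEPTION_DEPENDENT_RETRY,
        FAILOVER_ON_NETWORK_EXCEPTION_RETRY
    }

    // Reads a boolean property, falling back to the given default when it is unset.
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    // 2017-09 to 2022-07: only the failover flag is consulted, and it defaults to true.
    static boolean failoverEnabledBeforeFix(Map<String, String> conf) {
        return getBoolean(conf, FEDERATION_FAILOVER_ENABLED, true);
    }

    // After 2022-07: failover is reported as enabled only if federation itself is enabled.
    static boolean failoverEnabledAfterFix(Map<String, String> conf) {
        return getBoolean(conf, FEDERATION_ENABLED, false)
            && getBoolean(conf, FEDERATION_FAILOVER_ENABLED, true);
    }

    // Assumed selection rule: HA or federation failover implies the failover policy.
    static Policy select(boolean haEnabled, boolean federationFailoverEnabled) {
        return (haEnabled || federationFailoverEnabled)
            ? Policy.FAILOVER_ON_NETWORK_EXCEPTION_RETRY
            : Policy.OTHER_THAN_REMOTE_AND_SASL_EXCEPTION_DEPENDENT_RETRY;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = Map.of(); // a non-HA cluster with a default yarn-site
        System.out.println("before fix: " + select(false, failoverEnabledBeforeFix(defaults)));
        System.out.println("after fix:  " + select(false, failoverEnabledAfterFix(defaults)));
    }
}
```

      With an empty configuration the pre-fix check selects the failover policy and the post-fix check does not, which is the behavior change described above.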

      Hadoop clusters will have a different "default" YARN client retry policy depending on the approximate time range in which their Hadoop release was created ("default" in the sense that you get it with a default yarn-site file containing "yarn.federation.failover.enabled=true"):

      • before 2017-09 ; YARN uses OtherThanRemoteAndSaslExceptionDependentRetry by default
      • 2017-09 to 2022-07 ; YARN uses FailoverOnNetworkExceptionRetry by default
      • 2022-07 into future ; YARN uses OtherThanRemoteAndSaslExceptionDependentRetry by default

      Part of the problem here is that FailoverOnNetworkExceptionRetry is arguably a superior retry policy to OtherThanRemoteAndSaslExceptionDependentRetry because it does not retry on InvalidToken or AccessControlException.

      My concern is that YARN users whose YARN client previously did not retry on InvalidToken or AccessControlException will now see the YARN client retry for 15 minutes on these exceptions, because the Retry Policy for non-HA & non-Federation YARN clients has been reverted to OtherThanRemoteAndSaslExceptionDependentRetry.

      I see a few options to correct this:

      • remove OtherThanRemoteAndSaslExceptionDependentRetry entirely & always use FailoverOnNetworkExceptionRetry
        • the benefit of this approach is that it reduces lines of code & avoids the divergence in exception handling code that comes from maintaining multiple retry policies
        • I don't see any major risk in doing this because all Hadoop releases since 2017-09 have been using FailoverOnNetworkExceptionRetry without sysadmins even being aware of it & without any issues reported
      • bring OtherThanRemoteAndSaslExceptionDependentRetry to parity with FailoverOnNetworkExceptionRetry in terms of not retrying on InvalidToken or AccessControlException
        • downside is that the exception handling in these retry policies may diverge again in the future
      • refactor so that both retry policies share the same exception handling code but may differ in other areas such as the RetryAction being returned
        • may be as simple as moving FailoverOnNetworkExceptionRetry's exception handling logic into a parent class & returning a different RetryAction depending on the subclass (i.e. FAILOVER_AND_RETRY for HA/federation vs RETRY for non-HA/non-federation)
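
      The third option could look roughly like the sketch below. It is only an illustration under stated assumptions: every type here (SharedRetryHandling, BaseRetryPolicy, the nested exception classes, the simplified RetryAction enum, and the maxRetries parameter) is a hypothetical stand-in and does not reproduce Hadoop's RetryPolicy API.

```java
import java.io.IOException;

// Hypothetical sketch of shared exception handling with subclass-specific RetryAction.
class SharedRetryHandling {
    // Simplified stand-in for the retry decision; not Hadoop's RetryAction class.
    enum RetryAction { FAIL, RETRY, FAILOVER_AND_RETRY }

    // Hypothetical stand-ins for Hadoop's InvalidToken and AccessControlException.
    static class InvalidToken extends IOException { }
    static class AccessControlException extends IOException { }

    abstract static class BaseRetryPolicy {
        // Shared exception handling: both policies refuse to retry on the same exceptions.
        final RetryAction shouldRetry(Exception e, int retries, int maxRetries) {
            if (e instanceof InvalidToken || e instanceof AccessControlException) {
                return RetryAction.FAIL; // retrying cannot fix an auth failure
            }
            if (retries >= maxRetries) {
                return RetryAction.FAIL; // retry budget exhausted
            }
            return retryAction();
        }

        // The only point where subclasses differ.
        abstract RetryAction retryAction();
    }

    // HA or federation client: fail over to another ResourceManager, then retry.
    static class FailoverPolicy extends BaseRetryPolicy {
        @Override
        RetryAction retryAction() { return RetryAction.FAILOVER_AND_RETRY; }
    }

    // Non-HA & non-federation client: retry against the same ResourceManager.
    static class SingleRmPolicy extends BaseRetryPolicy {
        @Override
        RetryAction retryAction() { return RetryAction.RETRY; }
    }

    public static void main(String[] args) {
        BaseRetryPolicy policy = new SingleRmPolicy();
        System.out.println(policy.shouldRetry(new InvalidToken(), 0, 3));
        System.out.println(policy.shouldRetry(new IOException("connection reset"), 0, 3));
    }
}
```

      Keeping shouldRetry final in the parent class means the two policies cannot drift apart in which exceptions they retry on, which is the divergence the second option leaves open.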




            Assignee: Unassigned
            Reporter: Kevin Wikant (KevinWikant)