  1. Apache Knox
  2. KNOX-1093

Knox not handling safe mode state of one of the NameNodes in HA


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 1.2.0
    • Component/s: Server
    • Labels: None

    Description

      Per the code in WebHdfsHaDispatch.java, when a SafeModeException occurs it calls the retryRequest() method, which in turn calls executeRequest() just as a failover request does. However, the NameNode used by the thread does not change across any of its iterations, up to maxRetryAttempts=300 with retrySleep=1000 (1 second).
      That is a maximum of about 5 minutes (300 attempts at 1 second each), after which client retries should pick the right NameNode, at least on the next attempt.
      But if we need to copy a set of files within a stipulated time, X% of the connections land on this NameNode and fail. Can we handle that better?

      try {
         inboundResponse = executeOutboundRequest(outboundRequest);
         writeOutboundResponse(outboundRequest, inboundRequest, outboundResponse, inboundResponse);
      } catch (StandbyException e) {
         LOG.errorReceivedFromStandbyNode(e);
         failoverRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
      } catch (SafeModeException e) {
         LOG.errorReceivedFromSafeModeNode(e);
         retryRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
      } catch (IOException e) {
         LOG.errorConnectingToServer(outboundRequest.getURI().toString(), e);
         failoverRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
      }
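
      For reference, the roughly 5 minute worst case follows directly from the two settings quoted in this report. The snippet below is only an illustration of that arithmetic, assuming every one of the maxRetryAttempts attempts is followed by one retrySleep pause; RetryBudget and worstCaseWait are hypothetical names and not part of Knox.

```java
import java.time.Duration;

// Hypothetical helper, not part of Knox: estimates how long a single
// request thread can stay pinned to a NameNode that is in safe mode.
final class RetryBudget {

    // Assumption: each attempt is followed by exactly one sleep of retrySleepMillis.
    static Duration worstCaseWait(int maxRetryAttempts, long retrySleepMillis) {
        return Duration.ofMillis(Math.multiplyExact((long) maxRetryAttempts, retrySleepMillis));
    }

    public static void main(String[] args) {
        // Values taken from the report: maxRetryAttempts=300, retrySleep=1000 ms.
        Duration wait = worstCaseWait(300, 1000L);
        System.out.println(wait.toMinutes() + " minutes"); // prints "5 minutes"
    }
}
```

      The time spent on each failed request itself comes on top of this, so the real stall per thread can be somewhat longer.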
      

      We need to change the SafeModeException handling in the Knox HA dispatch code so that it flags the NameNode that is stuck in safe mode, maintains a "don't try" queue, and redirects all further connections only to the healthy active NameNode. This way we can handle the X% of failures. What do you think?
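
      To make the proposal concrete, here is a minimal sketch of such a "don't try" list. It is not the Knox implementation: SafeModeSkipList, its methods, and the cooldown after which a flagged NameNode becomes eligible again are all assumptions made for illustration. The clock is injected so the behavior can be checked without sleeping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Knox: remembers which NameNodes recently
// answered with a safe-mode error so new requests can be sent elsewhere.
final class SafeModeSkipList {
    private final List<String> nameNodeUrls;
    private final long cooldownMillis;   // assumed setting: how long a flagged node is avoided
    private final LongSupplier clock;    // injectable time source, e.g. System::currentTimeMillis
    private final Map<String, Long> skipUntil = new ConcurrentHashMap<>();

    SafeModeSkipList(List<String> nameNodeUrls, long cooldownMillis, LongSupplier clock) {
        this.nameNodeUrls = Collections.unmodifiableList(new ArrayList<>(nameNodeUrls));
        this.cooldownMillis = cooldownMillis;
        this.clock = clock;
    }

    // Record that this node reported safe mode; it is avoided until the cooldown expires.
    void markSafeMode(String url) {
        skipUntil.put(url, clock.getAsLong() + cooldownMillis);
    }

    // True while the node is flagged and its cooldown has not yet expired.
    boolean isSkipped(String url) {
        Long until = skipUntil.get(url);
        if (until == null) {
            return false;
        }
        if (clock.getAsLong() >= until) {
            skipUntil.remove(url, until); // cooldown over: node is eligible again
            return false;
        }
        return true;
    }

    // First configured node that is not flagged; empty if every node is flagged.
    Optional<String> nextTarget() {
        return nameNodeUrls.stream().filter(u -> !isSkipped(u)).findFirst();
    }
}
```

      In the catch (SafeModeException e) branch, a dispatcher could call markSafeMode() for the current NameNode and then ask nextTarget() for a different one, falling back to the existing retry-with-sleep behavior only when nextTarget() is empty. The cooldown keeps a NameNode that has left safe mode from being excluded forever.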


          People

            Assignee: Matthew Sharp (MatthewSharp)
            Reporter: Rajesh Chandramohan (rajeshhadoop)
            Votes: 0
            Watchers: 7
