Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: timelineserver
    • Labels:

      Description

      I found that the client retry has some problems:

      1. The new put methods will retry on all exception, but they should only do it upon ConnectException.
      2. We can reuse TimelineClientConnectionRetry to simplify the retry logic.

      1. YARN-3471.1.patch
        12 kB
        Zhijie Shen
      2. YARN-3471.2.patch
        14 kB
        Zhijie Shen

        Issue Links

          Activity

          Hide
          zjshen Zhijie Shen added a comment -

          Upload a patch to fix the problem, and add test cases to verify the retry is working properly

          Show
          zjshen Zhijie Shen added a comment - Upload a patch to fix the problem, and add test cases to verify the retry is working properly
          Hide
          stevel@apache.org Steve Loughran added a comment -
          1. all raised exceptions need to include the URL of the timeline server in them. Otherwise nobody will ever be able to track down the problem if its any
          2. you can actually test TimelineClient (or any Yarn service) in a try-with-resources clause, to get the service automatically stopped {{ Service extends Closeable}}, see
          try(Timeline client = createClient()) {
          }
          
          Show
          stevel@apache.org Steve Loughran added a comment - all raised exceptions need to include the URL of the timeline server in them. Otherwise nobody will ever be able to track down the problem if its any you can actually test TimelineClient (or any Yarn service) in a try-with-resources clause, to get the service automatically stopped {{ Service extends Closeable}}, see try (Timeline client = createClient()) { }
          Hide
          zjshen Zhijie Shen added a comment -

          Uploaded a new patch.

          all raised exceptions need to include the URL of the timeline server in them. Otherwise nobody will ever be able to track down the problem if its any

          Make sense because in v2 the url can be changed. Log the url if exception happens when connecting the timeline server.

          you can actually test TimelineClient (or any Yarn service) in a try-with-resources clause, to get the service automatically stopped

          Done

          In addition, I fixed another bug of retry: When we use up retry quota, we should throw IOException (which is declared in the method signature) other than RuntimeException, such that users can catch and process the exception.

          Show
          zjshen Zhijie Shen added a comment - Uploaded a new patch. all raised exceptions need to include the URL of the timeline server in them. Otherwise nobody will ever be able to track down the problem if its any Make sense because in v2 the url can be changed. Log the url if exception happens when connecting the timeline server. you can actually test TimelineClient (or any Yarn service) in a try-with-resources clause, to get the service automatically stopped Done In addition, I fixed another bug of retry: When we use up retry quota, we should throw IOException (which is declared in the method signature) other than RuntimeException, such that users can catch and process the exception.
          Hide
          stevel@apache.org Steve Loughran added a comment -

          Make sense because in v2 the url can be changed. Log the url if exception happens when connecting the timeline server.

          makes sense in v1 as people can misconfigure their URL.

          Oddly enough, I'm deep in trying to fix some timeline problems, and was about to add better diagnostics myself. I'd like to see this patch (or variant thereof) in branch-2

          I'm linking this patch to YARN-3477, in which the TL client doesn't rethrow the underlying exception on a retry timeout. It should, as that preserves the most information.

          1. line 407: if e.getCause()==IOE, cast and throw rather than wrap-and-throw
          Show
          stevel@apache.org Steve Loughran added a comment - Make sense because in v2 the url can be changed. Log the url if exception happens when connecting the timeline server. makes sense in v1 as people can misconfigure their URL. Oddly enough, I'm deep in trying to fix some timeline problems, and was about to add better diagnostics myself. I'd like to see this patch (or variant thereof) in branch-2 I'm linking this patch to YARN-3477 , in which the TL client doesn't rethrow the underlying exception on a retry timeout. It should, as that preserves the most information. line 407: if e.getCause()==IOE, cast and throw rather than wrap-and-throw
          Hide
          zjshen Zhijie Shen added a comment -

          makes sense in v1 as people can misconfigure their URL.

          In v1, when we initialize the client, it will print the url, which won't change during the lifecycle of the client.

          I'd like to see this patch (or variant thereof) in branch-2

          This patch mainly focus on reusing TimelineClientConnectionRetry for v2 APIs to only retry on connect exception, while it fix some related problems of TimelineClientConnectionRetry.

          But I agree that we can have a jira to fix TimelineClientConnectionRetry generally on trunk/branch-2.

          Show
          zjshen Zhijie Shen added a comment - makes sense in v1 as people can misconfigure their URL. In v1, when we initialize the client, it will print the url, which won't change during the lifecycle of the client. I'd like to see this patch (or variant thereof) in branch-2 This patch mainly focus on reusing TimelineClientConnectionRetry for v2 APIs to only retry on connect exception, while it fix some related problems of TimelineClientConnectionRetry. But I agree that we can have a jira to fix TimelineClientConnectionRetry generally on trunk/branch-2.
          Hide
          stevel@apache.org Steve Loughran added a comment -

          Looking at this patch, it doesn't address the issues I've encountered in YARN-3477

          1. when the retries time out, the exception causing the attempts to fail is not rethrown
          2. interrupts are being swallowed, making it impossible to reliably interrupt the thread

          I'd like to get the YARN-3477 patches in as well as these ones, so lets see if we can do them in turn.

          Show
          stevel@apache.org Steve Loughran added a comment - Looking at this patch, it doesn't address the issues I've encountered in YARN-3477 when the retries time out, the exception causing the attempts to fail is not rethrown interrupts are being swallowed, making it impossible to reliably interrupt the thread I'd like to get the YARN-3477 patches in as well as these ones, so lets see if we can do them in turn.
          Hide
          haibochen Haibo Chen added a comment -

          Varun Saxena Are you working on this? If not, I'd like to take it.

          Show
          haibochen Haibo Chen added a comment - Varun Saxena Are you working on this? If not, I'd like to take it.
          Hide
          haibochen Haibo Chen added a comment -

          Assigning this to myself, per offline discussion in weekly call

          Show
          haibochen Haibo Chen added a comment - Assigning this to myself, per offline discussion in weekly call
          Hide
          haibochen Haibo Chen added a comment - - edited

          Both 1) and 2) have been resolved after YARN-4675. (TimelineClientConnectionRetry is called in a Jersey client filter that only retries requests upon connection-related exceptions) Closing this. Feel free to reopen it if you disagree.

          Show
          haibochen Haibo Chen added a comment - - edited Both 1) and 2) have been resolved after YARN-4675 . (TimelineClientConnectionRetry is called in a Jersey client filter that only retries requests upon connection-related exceptions) Closing this. Feel free to reopen it if you disagree.

            People

            • Assignee:
              haibochen Haibo Chen
              Reporter:
              zjshen Zhijie Shen
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development