Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.4, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      This change introduces a new configuration key used by the RPC server to decide whether to send a backoff signal to the RPC client when the RPC call queue is full. When the feature is enabled, the RPC server will no longer block on the processing of RPC requests when the call queue is full. This helps improve quality of service when the server is under heavy load. The configuration key is in the format of "ipc.#port#.backoff.enable", where #port# is the port number that the RPC server listens on. For example, if you want to enable the feature for the RPC server that listens on 8020, set ipc.8020.backoff.enable to true.
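As a worked example of the release note above, the key could be set in core-site.xml (8020 is used here only because it is the conventional NameNode RPC port; substitute the port your server actually listens on):

```xml
<property>
  <name>ipc.8020.backoff.enable</name>
  <value>true</value>
  <description>Ask clients to back off instead of blocking when the
    RPC call queue of the server listening on port 8020 is full.</description>
</property>
```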

      Description

      Currently, if an application hits the NN too hard, RPC requests sit in a blocking state, assuming the OS doesn't run out of connections. Alternatively, the RPC layer or NN could throw a well-defined exception back to the client, based on certain policies, when it is under heavy load; the client would understand such an exception and do exponential back off, as another implementation of RetryInvocationHandler.

      1. HADOOP-10597.patch
        19 kB
        Ming Ma
      2. HADOOP-10597-2.patch
        34 kB
        Ming Ma
      3. HADOOP-10597-3.patch
        32 kB
        Ming Ma
      4. HADOOP-10597-4.patch
        34 kB
        Ming Ma
      5. HADOOP-10597-5.patch
        11 kB
        Ming Ma
      6. HADOOP-10597-6.patch
        11 kB
        Ming Ma
      7. HADOOP-10597-branch-2.7.patch
        13 kB
        Zhe Zhang
      8. MoreRPCClientBackoffEvaluation.pdf
        55 kB
        Ming Ma
      9. RPCClientBackoffDesignAndEvaluation.pdf
        169 kB
        Ming Ma

        Activity

        stevel@apache.org Steve Loughran added a comment -

        this could be useful for clients of other services too, where the back-off message could trigger a redirect.

        maybe the response could include some hints of:

        1. where else to go
        2. backoff parameter hints: sleep time, growth, jitter. This gives the NN more control of the clients, lets you spread the jitter, and grow the backoff time as load increases, reducing socket connection load.
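The parameters Steve lists compose into a simple delay formula. A minimal sketch, assuming hypothetical hint names (baseSleepMs, growth, jitterFraction) — not the interface the patch ultimately defines:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative only: turns hypothetical server-supplied backoff hints
 *  (base sleep, growth factor, jitter fraction) into a retry delay. */
public class BackoffHint {
    public static long delayMillis(long baseSleepMs, double growth,
                                   double jitterFraction, int attempt) {
        // Exponential growth: base * growth^attempt, capped to stay bounded.
        double raw = baseSleepMs * Math.pow(growth, attempt);
        long capped = (long) Math.min(raw, 60_000L);  // cap at one minute
        // Jitter spreads retries out so clients don't thunder back together.
        long jitter = (long) (capped * jitterFraction
                * ThreadLocalRandom.current().nextDouble());
        return capped + jitter;
    }
}
```

With growth 2.0 and no jitter, attempt 3 waits 8x the base; a nonzero jitterFraction widens each delay into a random band, which is the "spread the jitter" effect Steve describes.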
        mingma Ming Ma added a comment -

        Steve, thanks for the suggestions. Yes, it is useful to have the server provide backoff parameters to the client.

        Here is the initial patch to cover the basic idea. After we believe this is the right direction, unit test and load tests will follow.

        1. Enhance RetriableException for this purpose. When RPC queue is full, RetriableException will be thrown back to the client with the backoff parameters in PB. The backoff parameters are based on exponential back off.

        2. Client/RetryPolicy can decide if and how to use the server hint; it is up to each retry policy implementation. For the NN HA RetryPolicy failoverOnNetworkException, it will use the server hint when available. For the NN non-HA scenario, HDFS-6478 needs to be fixed first. A new policy called RetryUpToMaximumCountWithFixedSleepAndServerHint is provided as an example.

        3. Backoff feature is turned off by default in RPC server.
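The client side of points 1 and 2 can be sketched as a retry loop that prefers the server's hint and falls back to a client-side exponential default. This is a hypothetical sketch — the class and field names here (BackoffRequested, suggestedSleepMs) are stand-ins, not the actual RetriableException/RetryPolicy plumbing in the patch:

```java
/** Hypothetical sketch of a client honoring a server backoff hint. */
public class RetryLoop {
    /** Unchecked stand-in for RetriableException carrying an optional hint. */
    static class BackoffRequested extends RuntimeException {
        final long suggestedSleepMs;            // server hint; <= 0 means none
        BackoffRequested(long hint) { this.suggestedSleepMs = hint; }
    }

    interface Call<T> { T run(); }

    /** Retries up to maxRetries, preferring the server's hint over the
     *  client-side exponential default. */
    public static <T> T invoke(Call<T> call, int maxRetries, long defaultSleepMs) {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.run();
            } catch (BackoffRequested e) {
                if (attempt >= maxRetries) throw e;
                long sleep = e.suggestedSleepMs > 0
                    ? e.suggestedSleepMs
                    : defaultSleepMs << attempt;   // exponential fallback
                try {
                    Thread.sleep(sleep);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```

The key design point matches item 2 above: the hint is advisory, and each retry policy implementation decides whether to use it.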

        mingma Ming Ma added a comment -

        Lohit Vijayarenu provided some feedback. Here is the design document with some evaluation results. The updated patch also includes unit tests and makes the server-side retry policy pluggable.

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12654055/HADOOP-10597-2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        -1 javac. The applied patch generated 1261 javac compiler warnings (more than the trunk's current 1260 warnings).

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common:

        org.apache.hadoop.ipc.TestRPC

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4207//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/4207//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-common.html
        Javac warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/4207//artifact/trunk/patchprocess/diffJavacWarnings.txt
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4207//console

        This message is automatically generated.

        jingzhao Jing Zhao added a comment -

        This can be a very useful feature. For the evaluation, maybe it will be helpful to see how the change affects request latency and NN throughput as the NN workload increases from low to high. For example, one of my questions is: when the NN workload is only relatively high (some, but not many, requests are blocking on the queue), will the average latency and throughput be affected if the NN directly lets clients back off instead of blocking and timing out?

        arpitagarwal Arpit Agarwal added a comment -

        Hi Ming Ma, I just looked at the attached doc. If I understand correctly, the server tells the client which backoff policy to use. The backoff policy could just be a client-side configuration, since the server's suggestion is only advisory, unless the server has a way to penalize clients that fail to follow it.

        Also you have probably seen the RPC Congestion Control work under HADOOP-9460. Is there any overlap?

        mingma Ming Ma added a comment -

        Thanks, Jing and Arpit.

        1. In the current implementation, the RPC server only throws RetriableException back to the client when the RPC queue is full, or, more specifically with HADOOP-9460, when the RPC queue is full for that RPC user. So before the RPC queue is full, there should be no difference. It might be interesting to verify the "large number of connections" scenario. The blocking approach could hold up lots of TCP connections, and thus other users' requests can't connect.

        2. The value of a server-defined backoff policy. So far I don't have any use case that requires the server to specify the backoff policy, so it is possible all we need is to have the RPC server throw RetriableException without a backoff policy. I put it there for extensibility, based on Steve's suggestion; it might still be useful later. What if the client doesn't honor the policy? In a controlled environment, we can assume a single client will use the hadoop RPC client, which enforces the policy; if we have many clients, then the backoff policy component in the RPC server, such as LinearClientBackoffPolicy, can keep state and adjust the backoff policy parameters.

        3. How this relates to HADOOP-9640. HADOOP-9640 is quite useful, and client backoff can be complementary to it. FairQueue currently is blocking; if a given RPC request's enqueue to FairQueue is blocked due to FairQueue policy, it will hold up a TCP connection and the reader threads. If we use FairQueue together with client backoff, requests from a heavily loaded application won't hold up TCP connections and the reader threads, thus allowing other applications' requests to be processed more quickly. Some evaluation comparing HADOOP-9640 with "HADOOP-9640 + client backoff" might be useful. I will follow up with Chris Li on that.

        Are there any other scenarios? For example, we could have RPC reject requests based on user id, method name or machine IP for some operational situations. Granted, these can also be handled at a higher layer.

        mingma Ming Ma added a comment -

        Here are some more evaluation results to compare FIFO RPC queue, FairCallQueue without backoff and FairCallQueue with backoff. In general, the more RPC connections, the more useful RPC client backoff is.

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12670619/MoreRPCClientBackoffEvaluation.pdf
        against trunk revision a9a55db.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4789//console

        This message is automatically generated.

        chrilisf Chris Li added a comment -

        Hi Ming Ma, thanks for adding some numbers. If I understand correctly from the graph, the latency spike is a result of maxing out the call queue's capacity, which FairCallQueue will not solve, since FCQ has no choice but to enqueue a call somewhere. Just to double check, were all these calls made under the same user? I'd guess that RPC client backoff would work just as well when FairCallQueue is disabled too, since it solves the different problem of alleviating a full queue. I do agree with Steve that we'll want some fuzz on the retry method, since linear could cause load to be periodic over time.

        mingma Ming Ma added a comment -

        Thanks, Chris.

        1. The first two experiments belong to the "user A is doing some bad things; measure user B's read latency" category. The rest of the experiments were done under a single user to measure the performance implications under different loads.
        2. We can use client backoff without FCQ, but it is less interesting, given it could penalize good clients. That is because, in the current implementation, the criterion the RPC server uses to decide whether to ask the client to back off is whether the RPC call queue is full. We can improve this criterion later if it isn't enough.
        3. The experiment results are based on the "client driven retry interval" policy. It means the server only asks the client to back off; the RPC client decides the retry policy. In an NN HA setup, that will be FailoverOnNetworkExceptionRetry, which does exponential back off.

        chrilisf Chris Li added a comment -

        Cool. I think this is a good feature to have.

        One small question about the code:
        LOG.warn("Element " + e + " was queued properly." +
            "But client is asked to retry.");

        From my brief study of the code, it seems like isCallQueued is passed pretty deep in order to maintain some sort of reference count on how many pending requests each handler has waiting client-side to retry. Does this count always balance to zero? What if a client makes a request, is denied, and then terminates before it can make a request that successfully queues?

        Also, under what conditions will the element be queued correctly but the client get a retry?

        Also, kind of a small thing, but instead of recentBackOffCount.set(oldValue) it would be clearer to create a new variable newValue and call recentBackOffCount.set(newValue) instead of mutating oldValue, or perhaps just rename the oldValue variable to something which doesn't imply immutability.
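The naming fix Chris suggests can be sketched with a compare-and-set loop; this is an illustrative sketch only, assuming recentBackOffCount is an AtomicInteger, and is not the actual code in the patch:

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch: a counter of recently backed-off calls that
 *  decrements toward zero as calls queue successfully. */
public class BackoffCounter {
    private final AtomicInteger recentBackOffCount = new AtomicInteger();

    /** Called when a request is denied and the client is asked to retry. */
    public int onCallBackedOff() {
        return recentBackOffCount.incrementAndGet();
    }

    /** Called when a request is queued successfully; returns the new count.
     *  A distinct `updated` variable replaces the mutated `oldValue`. */
    public int onCallQueued() {
        while (true) {
            int current = recentBackOffCount.get();
            if (current == 0) return 0;          // already balanced
            int updated = current - 1;           // new value, named as such
            if (recentBackOffCount.compareAndSet(current, updated)) {
                return updated;
            }
            // Lost a race with another thread; re-read and try again.
        }
    }
}
```

The CAS loop also answers the balance question in spirit: the count drains on any successfully queued call, not only on a retry from the same client.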

        mingma Ming Ma added a comment -

        Thanks, Chris.

        The backoff retry policy is defined by interface ClientBackoffPolicy. There are two implementations of the interface, NullClientBackoffPolicy and LinearClientBackoffPolicy.

        The experiment results are based on NullClientBackoffPolicy, which doesn't specify any retry policy. Thus the RPC server will return an empty RetriableException and let the client decide the retry policy. We can start with this policy when we enable the feature in production. That will provide useful info and help us improve the feature and make necessary modifications to ClientBackoffPolicy and its implementations in later iterations.

        LinearClientBackoffPolicy specifies the retry policy based on the numbers of succeeded and denied requests. The policy is then returned to the client, and the client is expected to honor it. recentBackOffCount decreases with each successfully queued request. So in your case, if a client is denied first and then terminates before it retries, as long as enough requests from other clients are queued successfully, recentBackOffCount will still drop to zero.

        There shouldn't be a case where the element is queued correctly but the client gets a retry. The warn message is there to catch bad implementations of ClientBackoffPolicy. We can remove it as it doesn't seem to be necessary.

        Yes, it is better to rename oldValue to something else.

        I will provide an updated patch after rebase to address your comments.

        chrilisf Chris Li added a comment -

        Ah okay, thanks for clarifying, makes sense now

        mingma Ming Ma added a comment -

        Rebased for trunk and addressed Chris' comments. I'd appreciate it if Steve Loughran, Jing Zhao, Arpit Agarwal and others have additional input. We have enabled this feature with FairCallQueue in one of our production clusters.

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12688388/HADOOP-10597-3.patch
        against trunk revision 6635ccd.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        -1 javac. The applied patch generated 1217 javac compiler warnings (more than the trunk's current 1216 warnings).

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 4 new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5318//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/5318//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html
        Javac warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/5318//artifact/patchprocess/diffJavacWarnings.txt
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5318//console

        This message is automatically generated.

        stevel@apache.org Steve Loughran added a comment -

        Looks OK to me...

        stevel@apache.org Steve Loughran added a comment -

        Sorry, browser submitted too early.

        Looks OK to me, though it's gone deep enough into the RPC stack that I'm out of my depth.

        Minor recommendations

        • tag things as audience=private as well as unstable
        • LinearClientBackoffPolicy p 46-49, can we pull these inline strings out as public constants? It keeps errors down in tests & other code setting things.
        • NullClientBackoffPolicy should just extend Configured to remove boiler plate set/get conf logic
        • TestRPC.testClientBackOff(). Recommend saving any caught IOE and, if !succeeded, rethrowing it. It'll help debugging failing tests.

        One thing I will highlight is that I'm not that enamoured of how the retriable exception protobuf data is being marshalled into the string value of the exception. Why choose this approach?

        mingma Ming Ma added a comment -

        Thanks, Steve Loughran! Here is the new patch with your suggestions.

        Regarding the serialization of RetryAction via the RetriableException message string, I agree it is not necessarily the best approach. Here we need to serialize RetryAction and have the RPC server send it back to the RPC client. Possible options that I know of:

        • Current RPC Header structure RpcHeaderProtos includes Exception message field; thus it is convenient to use RetriableException message.
        • We could consider adding an optional RetryAction field to the RPC header RpcHeaderProtos. That requires more changes.
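The first option, piggybacking on the existing exception message field, amounts to an encode/decode pair over a string. A minimal sketch under assumed names (the "backoff.sleepMs=" key format here is invented for illustration; it is not the wire format the patch uses):

```java
/** Illustrative only: carrying a backoff hint through an exception
 *  message string, as in option 1 above. */
public class HintCodec {
    private static final String PREFIX = "backoff.sleepMs=";

    /** Builds a message string embedding the suggested sleep time. */
    public static String encode(long sleepMs) {
        return PREFIX + sleepMs;
    }

    /** Parses the hint back out; returns -1 when the message carries none
     *  or is malformed, so plain messages still work unchanged. */
    public static long decode(String message) {
        if (message == null || !message.startsWith(PREFIX)) return -1;
        try {
            return Long.parseLong(message.substring(PREFIX.length()));
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

The sketch also illustrates the cost Steve objects to: the receiver must parse structure out of free text, whereas a dedicated optional protobuf field would be typed and self-describing.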
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12690499/HADOOP-10597-4.patch
        against trunk revision 788ee35.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        -1 javac. The applied patch generated 1215 javac compiler warnings (more than the trunk's current 1214 warnings).

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5373//testReport/
        Javac warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/5373//artifact/patchprocess/diffJavacWarnings.txt
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5373//console

        This message is automatically generated.

        stevel@apache.org Steve Loughran added a comment -

        I think this is a topic for the RPC people, not me. I would just prefer to see a new field, not stuff hidden in the text payload. That way lies WS-*

        arpitagarwal Arpit Agarwal added a comment -

        Hi Ming Ma, thanks for proposing this work and posting the patch. I still prefer that we don't support multiple retry policies. Exponential back-off is known to be the only mechanism that converges to fairness in a loaded system.

        If Steve does not object then the v1 implementation can avoid sending retry policies. It should significantly simplify the patch and sidestep the issue of how to communicate the policy.

        mingma Ming Ma added a comment -

        Thanks, Arpit Agarwal. We have been using the "no server retry policy" option in production clusters for some time and things are fine. So I agree with you; at least we don't have to add it in the initial version. If we later have scenarios that require the server to supply retry policies, we can add it at that point.

        arpitagarwal Arpit Agarwal added a comment -

        We have been using "no server retry policy" option in production clusters for some time and things are fine.

        That's a good datapoint, thanks. Couple of questions:

        1. What kind of back-off policy are you using on the client? Do you have a separate Jira for the client side work? I just realized today that we already have configurable retry policies via DFS_CLIENT_RETRY_POLICY_SPEC_KEY. We also have an ExponentialBackoffRetry, but it doesn't look widely used. Did you use either of these?
        2. Did you use the same trigger on the server in production (RPC queue being full)?
        mingma Ming Ma added a comment -

        Did you use the same trigger on the server in production (RPC queue being full)?

        Yes, that is what we use.

        What kind of back-off policy are you using on the client?

        This patch doesn't add any new policy on the client side. It tries to use the policy passed from the server if one is specified. But given that we don't plan to support a server-side policy, the new patch doesn't need to change anything on the client side. The client side will receive RetriableException and retry accordingly.

        Regarding the client side retry policy we use, we don't configure anything specifically; we use the default, with no config for DFS_CLIENT_RETRY_POLICY_SPEC_KEY. Thus we end up with FailoverOnNetworkExceptionRetry, which uses exponential backoff. The actual parameters used in the backoff are based on DFS_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY, DFS_CLIENT_FAILOVER_SLEEPTIME_BASE_KEY and DFS_CLIENT_FAILOVER_SLEEPTIME_MAX_KEY.
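        For readers unfamiliar with the mechanism, exponential backoff of this shape — a base sleep that doubles per attempt, a cap, and a max attempt count — can be sketched as follows. This is a hedged illustration, not the actual FailoverOnNetworkExceptionRetry implementation; the class and method names are hypothetical, and only the comments reference the real config keys:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of exponential backoff with jitter. The three
// parameters correspond conceptually to the config keys named above,
// but the values and names here are purely illustrative.
public class ExponentialBackoff {
    private final long baseSleepMillis; // cf. DFS_CLIENT_FAILOVER_SLEEPTIME_BASE_KEY
    private final long maxSleepMillis;  // cf. DFS_CLIENT_FAILOVER_SLEEPTIME_MAX_KEY
    private final int maxAttempts;      // cf. DFS_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY

    public ExponentialBackoff(long base, long max, int attempts) {
        this.baseSleepMillis = base;
        this.maxSleepMillis = max;
        this.maxAttempts = attempts;
    }

    public boolean shouldRetry(int attempt) {
        return attempt < maxAttempts;
    }

    // Sleep time doubles each attempt, capped at maxSleepMillis, with
    // random jitter so backed-off clients don't all retry in lockstep.
    public long sleepMillis(int attempt) {
        long raw = baseSleepMillis * (1L << Math.min(attempt, 30));
        long capped = Math.min(raw, maxSleepMillis);
        return capped / 2 + ThreadLocalRandom.current().nextLong(capped / 2 + 1);
    }
}
```

        The jitter matters for the fairness argument Arpit raised: without it, all clients that backed off together would hammer the NameNode again at the same instant.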

        mingma Ming Ma added a comment -

        Updated patch based on Arpit's suggestion of removing the server side retry policy.

        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12724949/HADOOP-10597-5.patch
        against trunk revision 174d8b3.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6099//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6099//console

        This message is automatically generated.

        arpitagarwal Arpit Agarwal added a comment -

        The code change looks good to me. I am reviewing the test case.

        I won't commit it until at least next week in case Steve or Chris Li have comments.

        stevel@apache.org Steve Loughran added a comment -

        If Arpit is happy, I'm happy.

        The one thing I would change is in the test, where assertTrue("RetriableException not received", succeeded); is in the finally clause, followed by the cleanup actions. That assert should appear outside the finally block, which should be kept for the cleanup on all exit points, normal and exceptional.
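        The suggested structure — cleanup inside the finally, assertion after it — is the standard JUnit pattern. A minimal sketch (the class, interface, and method names here are hypothetical stand-ins, not the actual TestRPC code):

```java
// Hypothetical sketch of the recommended test shape: the finally block
// only performs cleanup, which runs on every exit path; the success
// check happens after it, on the normal exit path only.
public class BackoffTestSketch {
    interface Server { void stop(); }

    static boolean callAndCheck(Server server) {
        boolean succeeded = false;
        try {
            // Stand-in for the RPC call expected to raise RetriableException.
            succeeded = true;
        } finally {
            server.stop(); // cleanup only: runs whether or not the call threw
        }
        // The assertion belongs here, outside the finally block, so an
        // exception from the call itself isn't masked by a failed assert.
        return succeeded;
    }
}
```

        Putting the assert inside the finally would also fire it while an exception from the try body is still in flight, replacing the informative original failure with a bare assertion error.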

        mingma Ming Ma added a comment -

        Here is the updated patch with Steve's suggestion. Thanks, Steve and Arpit.

        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12725491/HADOOP-10597-6.patch
        against trunk revision fddd552.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6104//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6104//console

        This message is automatically generated.

        arpitagarwal Arpit Agarwal added a comment - - edited

        +1 for the v6 patch. Reviewed the test case and it looks like Steve's comment was addressed. Thanks Ming.

        I'll commit this by tomorrow.

        arpitagarwal Arpit Agarwal added a comment -

        I committed it to trunk and branch-2.

        Thanks Ming and Steve. Ming Ma, could you please add a short release note to the Jira?

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #7647 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7647/)
        HADOOP-10597. RPC Server signals backoff to clients when all request queues are full. (Contributed by Ming Ma) (arp: rev 49f6e3d35e0f89637ae9ea970f249c13bdc0fd49)

        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestCallQueueManager.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestRPC.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/metrics/RpcMetrics.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/CallQueueManager.java
        mingma Ming Ma added a comment -

        I have updated the release note. Thanks for your suggestions, Chris, Arpit and Steve.

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #2105 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2105/)
        HADOOP-10597. RPC Server signals backoff to clients when all request queues are full. (Contributed by Ming Ma) (arp: rev 49f6e3d35e0f89637ae9ea970f249c13bdc0fd49)

        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/CallQueueManager.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestCallQueueManager.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/metrics/RpcMetrics.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestRPC.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #164 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/164/)
        HADOOP-10597. RPC Server signals backoff to clients when all request queues are full. (Contributed by Ming Ma) (arp: rev 49f6e3d35e0f89637ae9ea970f249c13bdc0fd49)

        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/metrics/RpcMetrics.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/CallQueueManager.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestRPC.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestCallQueueManager.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #173 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/173/)
        HADOOP-10597. RPC Server signals backoff to clients when all request queues are full. (Contributed by Ming Ma) (arp: rev 49f6e3d35e0f89637ae9ea970f249c13bdc0fd49)

        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestCallQueueManager.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/metrics/RpcMetrics.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/CallQueueManager.java
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestRPC.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #907 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/907/)
        HADOOP-10597. RPC Server signals backoff to clients when all request queues are full. (Contributed by Ming Ma) (arp: rev 49f6e3d35e0f89637ae9ea970f249c13bdc0fd49)

        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/CallQueueManager.java
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestCallQueueManager.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestRPC.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/metrics/RpcMetrics.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #174 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/174/)
        HADOOP-10597. RPC Server signals backoff to clients when all request queues are full. (Contributed by Ming Ma) (arp: rev 49f6e3d35e0f89637ae9ea970f249c13bdc0fd49)

        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/metrics/RpcMetrics.java
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestCallQueueManager.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/CallQueueManager.java
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestRPC.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #2123 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2123/)
        HADOOP-10597. RPC Server signals backoff to clients when all request queues are full. (Contributed by Ming Ma) (arp: rev 49f6e3d35e0f89637ae9ea970f249c13bdc0fd49)

        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestRPC.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/metrics/RpcMetrics.java
        • hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestCallQueueManager.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
        • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/CallQueueManager.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        zhz Zhe Zhang added a comment -

        Thanks Ming Ma for the nice work!

        Since the umbrella JIRA HADOOP-9640 (and the FairCallQueue feature) is available in 2.7 and 2.6, I think it makes sense to backport this to at least 2.7. Reopening to test branch-2.7 patch. Please let me know if you have any concern about adding this to branch-2.7.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 23s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
        +1 mvninstall 6m 5s branch-2.7 passed
        +1 compile 5m 23s branch-2.7 passed with JDK v1.8.0_111
        +1 compile 6m 14s branch-2.7 passed with JDK v1.7.0_111
        +1 checkstyle 0m 26s branch-2.7 passed
        +1 mvnsite 0m 45s branch-2.7 passed
        +1 mvneclipse 0m 15s branch-2.7 passed
        -1 findbugs 1m 34s hadoop-common-project/hadoop-common in branch-2.7 has 3 extant Findbugs warnings.
        +1 javadoc 0m 41s branch-2.7 passed with JDK v1.8.0_111
        +1 javadoc 0m 54s branch-2.7 passed with JDK v1.7.0_111
        +1 mvninstall 0m 39s the patch passed
        +1 compile 5m 12s the patch passed with JDK v1.8.0_111
        +1 javac 5m 12s the patch passed
        +1 compile 6m 11s the patch passed with JDK v1.7.0_111
        +1 javac 6m 11s the patch passed
        -0 checkstyle 0m 26s hadoop-common-project/hadoop-common: The patch generated 4 new + 293 unchanged - 1 fixed = 297 total (was 294)
        +1 mvnsite 0m 45s the patch passed
        +1 mvneclipse 0m 14s the patch passed
        -1 whitespace 0m 1s The patch has 3026 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
        -1 whitespace 1m 24s The patch 189 line(s) with tabs.
        +1 findbugs 1m 47s the patch passed
        +1 javadoc 0m 40s the patch passed with JDK v1.8.0_111
        +1 javadoc 0m 53s the patch passed with JDK v1.7.0_111
        -1 unit 21m 6s hadoop-common in the patch failed with JDK v1.7.0_111.
        -1 asflicense 1m 9s The patch generated 1 ASF License warnings.
        84m 8s



        Reason Tests
        JDK v1.8.0_111 Failed junit tests hadoop.metrics2.impl.TestGangliaMetrics
          hadoop.security.ssl.TestSSLFactory
          hadoop.util.bloom.TestBloomFilters
          hadoop.security.TestLdapGroupsMapping
        JDK v1.8.0_111 Timed out junit tests org.apache.hadoop.conf.TestConfiguration
        JDK v1.7.0_111 Failed junit tests hadoop.ha.TestZKFailoverController
          hadoop.util.bloom.TestBloomFilters
        JDK v1.7.0_111 Timed out junit tests org.apache.hadoop.conf.TestConfiguration



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:c420dfe
        JIRA Issue HADOOP-10597
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835927/HADOOP-10597-branch-2.7.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux aafcf2b703de 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision branch-2.7 / 5aa26aa
        Default Java 1.7.0_111
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_111 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
        findbugs v3.0.0
        findbugs https://builds.apache.org/job/PreCommit-HADOOP-Build/10926/artifact/patchprocess/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html
        checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10926/artifact/patchprocess/diff-checkstyle-hadoop-common-project_hadoop-common.txt
        whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/10926/artifact/patchprocess/whitespace-eol.txt
        whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/10926/artifact/patchprocess/whitespace-tabs.txt
        unit https://builds.apache.org/job/PreCommit-HADOOP-Build/10926/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.7.0_111.txt
        JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10926/testReport/
        asflicense https://builds.apache.org/job/PreCommit-HADOOP-Build/10926/artifact/patchprocess/patch-asflicense-problems.txt
        modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
        Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10926/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.
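The whitespace findings above suggest running `git apply --whitespace=fix` on the patch file. As a hedged illustration (the repository, file names, and patch path below are invented purely for the demo, not taken from the HADOOP-10597 patch), this sketch shows the flag stripping trailing whitespace while a patch is applied:

```shell
# Demo: git apply --whitespace=fix removes trailing whitespace
# introduced by a patch. All paths here are throwaway scratch files.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email qa@example.com
git config user.name "QA Demo"

printf 'clean line\n' > file.txt
git add file.txt
git commit -qm base

# Produce a patch whose added line ends in trailing spaces
printf 'clean line\nnew line with trailing space   \n' > file.txt
git diff > ../change.patch
git checkout -q -- file.txt

# Apply it with --whitespace=fix; git warns and strips the trailing spaces
git apply --whitespace=fix ../change.patch

# Verify: no line in the result ends in whitespace
grep -n ' $' file.txt || echo "no trailing whitespace"
```

Note that `--whitespace=fix` rewrites the applied lines rather than merely warning, so it only helps when you control how the patch is applied; the Yetus check still flags the patch file itself.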

        mingma Ming Ma added a comment -

        Zhe Zhang that sounds good. Thank you.

        zhz Zhe Zhang added a comment -

        Thanks Ming for confirming this. I verified that the reported test failures cannot be reproduced locally, and pushed to branch-2.7.


          People

          • Assignee: Ming Ma
          • Reporter: Ming Ma
          • Votes: 0
          • Watchers: 21
