Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: ipc, performance
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      The Client#call() methods that are deprecated since 0.23 have been removed.

      Description

        private static ClientCache CLIENTS=new ClientCache();
      ...
          this.client = CLIENTS.getClient(conf, factory);
      

      Meanwhile in ClientCache

      public synchronized Client getClient(Configuration conf,
            SocketFactory factory, Class<? extends Writable> valueClass) {
      ...
         Client client = clients.get(factory);
          if (client == null) {
            client = new Client(valueClass, conf, factory);
            clients.put(factory, client);
          } else {
            client.incCount();
          }
      

      All invokers end up calling these methods, resulting in IPC clients choking up.



      1. after-ipc-fix.png
        19 kB
        Gopal V
      2. cached-connections.png
        52 kB
        Gopal V
      3. cached-locking.png
        36 kB
        Gopal V
      4. dfs-sync-ipc.png
        103 kB
        Gopal V
      5. HADOOP-11772.004.patch
        8 kB
        Haohui Mai
      6. HADOOP-11772-001.patch
        3 kB
        Akira Ajisaka
      7. HADOOP-11772-002.patch
        13 kB
        Akira Ajisaka
      8. HADOOP-11772-003.patch
        12 kB
        Akira Ajisaka
      9. HADOOP-11772-wip-001.patch
        7 kB
        Akira Ajisaka
      10. HADOOP-11772-wip-002.patch
        7 kB
        Akira Ajisaka
      11. sync-client-bt.png
        69 kB
        Gopal V
      12. sync-client-threads.png
        18 kB
        Gopal V

        Issue Links

          Activity

          Hide
          ajisakaa Akira Ajisaka added a comment -

          Thanks Gopal V for the report. Attaching a patch to

          • Use ConcurrentHashMap for caching clients
          • Remove unnecessarily synchronization
          Show
          ajisakaa Akira Ajisaka added a comment - Thanks Gopal V for the report. Attaching a patch to Use ConcurrentHashMap for caching clients Remove unnecessarily synchronization
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708113/HADOOP-11772-001.patch
          against trunk revision 1ed9fb7.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6024//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6024//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708113/HADOOP-11772-001.patch against trunk revision 1ed9fb7. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6024//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6024//console This message is automatically generated.
          Hide
          gopalv Gopal V added a comment -

          Akira Ajisaka: Does this patch fix the requirement of needing >1 IPC client per socket-factory?

          From a quick read, the single factory -> single IPC client mapping still exists, so the same hash-bucket will be locked by all processes using regular RPC invoker.

          Show
          gopalv Gopal V added a comment - Akira Ajisaka : Does this patch fix the requirement of needing >1 IPC client per socket-factory? From a quick read, the single factory -> single IPC client mapping still exists, so the same hash-bucket will be locked by all processes using regular RPC invoker.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708380/dfs-sync-ipc.png
          against trunk revision b5a22e9.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6034//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708380/dfs-sync-ipc.png against trunk revision b5a22e9. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6034//console This message is automatically generated.
          Hide
          ajisakaa Akira Ajisaka added a comment -

          Gopal V thank you for your comment.

          From a quick read, the single factory -> single IPC client mapping still exists

          Yes, you are right.

          the requirement of needing >1 IPC client per socket-factory

          Do you mean we should create IPC client pool in ClientCache.java?

          Show
          ajisakaa Akira Ajisaka added a comment - Gopal V thank you for your comment. From a quick read, the single factory -> single IPC client mapping still exists Yes, you are right. the requirement of needing >1 IPC client per socket-factory Do you mean we should create IPC client pool in ClientCache.java?
          Hide
          gopalv Gopal V added a comment -

          Do you mean we should create IPC client pool in ClientCache.java?

          A round-robin pool or at least something better than 1 client object per socket-factory would be nice.

          Right now, listing block locations in parallel (for compute splits, or opening files etc) produces very bad synchronization within that.

          Show
          gopalv Gopal V added a comment - Do you mean we should create IPC client pool in ClientCache.java? A round-robin pool or at least something better than 1 client object per socket-factory would be nice. Right now, listing block locations in parallel (for compute splits, or opening files etc) produces very bad synchronization within that.
          Hide
          ajisakaa Akira Ajisaka added a comment -

          Thanks Gopal V for your comment. Attaching a sample patch to create pool for Clients.
          TODO:

          • Add a document for the new parameter
          • Create a test
          Show
          ajisakaa Akira Ajisaka added a comment - Thanks Gopal V for your comment. Attaching a sample patch to create pool for Clients. TODO: Add a document for the new parameter Create a test
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12709214/HADOOP-11772-wip-001.patch
          against trunk revision 72f6bd4.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common:

          org.apache.hadoop.ipc.TestRPCCallBenchmark

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6057//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6057//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12709214/HADOOP-11772-wip-001.patch against trunk revision 72f6bd4. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-common-project/hadoop-common: org.apache.hadoop.ipc.TestRPCCallBenchmark Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6057//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6057//console This message is automatically generated.
          Hide
          ajisakaa Akira Ajisaka added a comment -

          Fixed the test failure.

          Show
          ajisakaa Akira Ajisaka added a comment - Fixed the test failure.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12723325/HADOOP-11772-wip-002.patch
          against trunk revision 96d7211.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6066//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6066//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723325/HADOOP-11772-wip-002.patch against trunk revision 96d7211. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6066//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6066//console This message is automatically generated.
          Hide
          gopalv Gopal V added a comment -

          Akira Ajisaka: added to today's builds.

          Show
          gopalv Gopal V added a comment - Akira Ajisaka : added to today's builds.
          Hide
          gopalv Gopal V added a comment -

          Akira Ajisaka: looks much better with the patch.

          I still see the occasional blocked getConnection(), but that's because I'm running 24 threads in parallel with 10 IPC Client instances.

          Show
          gopalv Gopal V added a comment - Akira Ajisaka : looks much better with the patch. I still see the occasional blocked getConnection(), but that's because I'm running 24 threads in parallel with 10 IPC Client instances.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12723867/after-ipc-fix.png
          against trunk revision dd852f5.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6074//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723867/after-ipc-fix.png against trunk revision dd852f5. -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6074//console This message is automatically generated.
          Hide
          ajisakaa Akira Ajisaka added a comment -

          Hi Gopal V, thank you for verifying the patch. Now I have two questions:

          1. Did the patch actually reduce the elapsed time?
          2. How many clients per socket factory would be good for default?

          I'll add a test and document the new parameter.

          Show
          ajisakaa Akira Ajisaka added a comment - Hi Gopal V , thank you for verifying the patch. Now I have two questions: Did the patch actually reduce the elapsed time? How many clients per socket factory would be good for default? I'll add a test and document the new parameter.
          Hide
          gopalv Gopal V added a comment -

          Akira Ajisaka:

          1) Yes, went from about 28s to 4s - the thread doing listBlockLocations was blocked for about 10ms (BLOCKED, not WAITING) for each file open (~2500 files)
          2) I think 1 should be the default number - I will configure it particularly for the multi-threaded applications instead.

          Show
          gopalv Gopal V added a comment - Akira Ajisaka : 1) Yes, went from about 28s to 4s - the thread doing listBlockLocations was blocked for about 10ms (BLOCKED, not WAITING) for each file open (~2500 files) 2) I think 1 should be the default number - I will configure it particularly for the multi-threaded applications instead.
          Hide
          gopalv Gopal V added a comment -

          FYI, some of the delays are additive (24 threads x 10ms delays on average, several threads were blocked for 200+ms, stuck behind 20 others each blocked).

          The latency issue is probably due to the fact that the blocked threads get sched_yield out for a HZ tick.

          Show
          gopalv Gopal V added a comment - FYI, some of the delays are additive (24 threads x 10ms delays on average, several threads were blocked for 200+ms, stuck behind 20 others each blocked). The latency issue is probably due to the fact that the blocked threads get sched_yield out for a HZ tick.
          Hide
          ajisakaa Akira Ajisaka added a comment -

          Thanks Gopal V for answering the questions! Good to hear that.
          Attaching v2 patch.

          • Added test cases
          • Changed the default value of the new parameter to 1.
          • Refactored ClientCache#getClientWithLeastReferences
          Show
          ajisakaa Akira Ajisaka added a comment - Thanks Gopal V for answering the questions! Good to hear that. Attaching v2 patch. Added test cases Changed the default value of the new parameter to 1. Refactored ClientCache#getClientWithLeastReferences
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12726398/HADOOP-11772-002.patch
          against trunk revision 497c86b.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6122//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6122//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726398/HADOOP-11772-002.patch against trunk revision 497c86b. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6122//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6122//console This message is automatically generated.
          Hide
          ajisakaa Akira Ajisaka added a comment -

          v3 patch removes an unused @VisibleForTesting method.

          Show
          ajisakaa Akira Ajisaka added a comment - v3 patch removes an unused @VisibleForTesting method.
          Hide
          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 14m 38s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 javac 7m 30s There were no new javac warning messages.
          +1 javadoc 9m 34s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 3m 57s The applied patch generated 2 additional checkstyle issues.
          +1 install 1m 34s mvn install still works.
          +1 eclipse:eclipse 0m 32s The patch built with eclipse:eclipse.
          +1 findbugs 1m 40s The patch does not introduce any new Findbugs (version 2.0.3) warnings.
          -1 common tests 36m 50s Tests failed in hadoop-common.
              76m 40s  



          Reason Tests
          Timed out tests org.apache.hadoop.http.TestHttpCookieFlag



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12727523/HADOOP-11772-003.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / a100be6
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/6161/artifact/patchprocess/checkstyle-result-diff.txt
          hadoop-common test log https://builds.apache.org/job/PreCommit-HADOOP-Build/6161/artifact/patchprocess/testrun_hadoop-common.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/6161/testReport/
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/6161//console

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 pre-patch 14m 38s Pre-patch trunk compilation is healthy. +1 @author 0m 0s The patch does not contain any @author tags. +1 tests included 0m 0s The patch appears to include 1 new or modified test files. +1 whitespace 0m 0s The patch has no lines that end in whitespace. +1 javac 7m 30s There were no new javac warning messages. +1 javadoc 9m 34s There were no new javadoc warning messages. +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings. -1 checkstyle 3m 57s The applied patch generated 2 additional checkstyle issues. +1 install 1m 34s mvn install still works. +1 eclipse:eclipse 0m 32s The patch built with eclipse:eclipse. +1 findbugs 1m 40s The patch does not introduce any new Findbugs (version 2.0.3) warnings. -1 common tests 36m 50s Tests failed in hadoop-common.     76m 40s   Reason Tests Timed out tests org.apache.hadoop.http.TestHttpCookieFlag Subsystem Report/Notes Patch URL http://issues.apache.org/jira/secure/attachment/12727523/HADOOP-11772-003.patch Optional Tests javadoc javac unit findbugs checkstyle git revision trunk / a100be6 checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/6161/artifact/patchprocess/checkstyle-result-diff.txt hadoop-common test log https://builds.apache.org/job/PreCommit-HADOOP-Build/6161/artifact/patchprocess/testrun_hadoop-common.txt Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/6161/testReport/ Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/6161//console This message was automatically generated.
          Hide
          wheat9 Haohui Mai added a comment -

          A round-robin pool or at least something better than 1 client object per socket-factory would be nice.
          Right now, listing block locations in parallel (for compute splits, or opening files etc) produces very bad synchronization within that.

          Maybe I'm missing something. The RPC client will send out the request asynchronously. I don't see how allowing multiple clients per socket factory tying to fixing the synchronization issue reported in this jira.

          Show
          wheat9 Haohui Mai added a comment - A round-robin pool or at least something better than 1 client object per socket-factory would be nice. Right now, listing block locations in parallel (for compute splits, or opening files etc) produces very bad synchronization within that. Maybe I'm missing something. The RPC client will send out the request asynchronously. I don't see how allowing multiple clients per socket factory tying to fixing the synchronization issue reported in this jira.
          Hide
          gopalv Gopal V added a comment -

          The RPC client will send out the request asynchronously.

          Asynchronously is what it does - so it does not fail even without this patch.

          The problem is that it takes 200-300ms to send it out, by which time another IPC update has already queued up for the same connection.

          See the two threads locked against each other in the bug report, where one is doing a NameNode operation and another is doing an ApplicationMaster update - which need never lock against each other in reality.

          Because they both use the same ipc.Client singleton.

          If you want to revisit this fix, please remove the Client singleton or find another way to remove the synchronization barrier around the getConnection() & the way it prevents reopening connections for IPC.

          The current IPC implementation works asynchronously, but is too slow to keep up with sub-second performance on a multi-threaded daemon which uses a singleton locked object for 24 cores doing everything (namenode lookups, app master heartbeats, data movement events, statistic updates, error recovery).

          Show
          gopalv Gopal V added a comment - The RPC client will send out the request asynchronously. Asynchronously is what it does - so it does not fail even without this patch. The problem is that it takes 200-300ms to send it out, by which time another IPC update has already queued up for the same connection. See the two threads locked against each other in the bug report, where one is doing a NameNode operation and another is doing an ApplicationMaster update - which need never lock against each other in reality. Because they both use the same ipc.Client singleton. If you want to revisit this fix, please remove the Client singleton or find another way to remove the synchronization barrier around the getConnection() & the way it prevents reopening connections for IPC. The current IPC implementation works asynchronously, but is too slow to keep up with sub-second performance on a multi-threaded daemon which uses a singleton locked object for 24 cores doing everything (namenode lookups, app master heartbeats, data movement events, statistic updates, error recovery).
          Hide
          wheat9 Haohui Mai added a comment -

          It seems that the profiling result points to the connection cache in the Client class.

          The v4 patch use Guava's Loading cache to implement the connection cache, where the read path should be lock-free. Gopal V, can you try it out?

          Show
          wheat9 Haohui Mai added a comment - It seems that the profiling result points to the connection cache in the Client class. The v4 patch use Guava's Loading cache to implement the connection cache, where the read path should be lock-free. Gopal V , can you try it out?
          Hide
          gopalv Gopal V added a comment -

          Haohui Mai: have you got any profiles on this with a multi-threaded test?

          Show
          gopalv Gopal V added a comment - Haohui Mai : have you got any profiles on this with a multi-threaded test?
          Hide
          wheat9 Haohui Mai added a comment -

          I think that it might be feasible to reproduce the problem is to spawn a client that talks to 200 nodes concurrently, but unfortunately I don't have the access of the cluster nor YourKit.

          Show
          wheat9 Haohui Mai added a comment - I think that it might be feasible to reproduce the problem is to spawn a client that talks to 200 nodes concurrently, but unfortunately I don't have the access of the cluster nor YourKit.
          Hide
          gopalv Gopal V added a comment -

          reproduce the problem is to spawn a client that talks to 200 nodes concurrently, but unfortunately I don't have the access of the cluster nor YourKit.

          The problem was reported as being visible on 1 process when it talks to 1 NameNode. You do not need 200 nodes to reproduce this bug - I reported this as observed using 1 single process and 1 namenode instance (not even HA).

          I got my yourkit license for use with Apache Hive for free - see section (G) of their license and email their sales folks to get a free license.

          Those arguments aside, the earlier patch had a unit test - the testClientCacheFromMultiThreads() that Akira Ajisaka wrote, when you run that does that show blocked threads or de-scheduled threads with the new patch?

          This is an important fix late in the cycle, the new patch should get as much testing as early as possible.

          Show
          gopalv Gopal V added a comment - reproduce the problem is to spawn a client that talks to 200 nodes concurrently, but unfortunately I don't have the access of the cluster nor YourKit. The problem was reported as being visible on 1 process when it talks to 1 NameNode. You do not need 200 nodes to reproduce this bug - I reported this as observed using 1 single process and 1 namenode instance (not even HA). I got my yourkit license for use with Apache Hive for free - see section (G) of their license and email their sales folks to get a free license. Those arguments aside, the earlier patch had a unit test - the testClientCacheFromMultiThreads() that Akira Ajisaka wrote, when you run that does that show blocked threads or de-scheduled threads with the new patch? This is an important fix late in the cycle, the new patch should get as much testing as early as possible.
          Hide
          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 14m 57s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 tests included 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 javac 7m 48s There were no new javac warning messages.
          +1 javadoc 9m 55s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 1m 6s There were no new checkstyle issues.
          -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 install 1m 36s mvn install still works.
          +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
          +1 findbugs 1m 42s The patch does not introduce any new Findbugs (version 2.0.3) warnings.
          -1 common tests 22m 44s Tests failed in hadoop-common.
              60m 47s  



          Reason Tests
          Failed unit tests hadoop.security.token.delegation.web.TestWebDelegationToken



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12733258/HADOOP-11772.004.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / f7e051c
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/6705/artifact/patchprocess/whitespace.txt
          hadoop-common test log https://builds.apache.org/job/PreCommit-HADOOP-Build/6705/artifact/patchprocess/testrun_hadoop-common.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/6705/testReport/
          Java 1.7.0_55
          uname Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/6705/console

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 pre-patch 14m 57s Pre-patch trunk compilation is healthy. +1 @author 0m 0s The patch does not contain any @author tags. -1 tests included 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac 7m 48s There were no new javac warning messages. +1 javadoc 9m 55s There were no new javadoc warning messages. +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings. +1 checkstyle 1m 6s There were no new checkstyle issues. -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. +1 install 1m 36s mvn install still works. +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse. +1 findbugs 1m 42s The patch does not introduce any new Findbugs (version 2.0.3) warnings. -1 common tests 22m 44s Tests failed in hadoop-common.     60m 47s   Reason Tests Failed unit tests hadoop.security.token.delegation.web.TestWebDelegationToken Subsystem Report/Notes Patch URL http://issues.apache.org/jira/secure/attachment/12733258/HADOOP-11772.004.patch Optional Tests javadoc javac unit findbugs checkstyle git revision trunk / f7e051c whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/6705/artifact/patchprocess/whitespace.txt hadoop-common test log https://builds.apache.org/job/PreCommit-HADOOP-Build/6705/artifact/patchprocess/testrun_hadoop-common.txt Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/6705/testReport/ Java 1.7.0_55 uname Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/6705/console This message was automatically generated.
          Hide
          gopalv Gopal V added a comment -

          Added patch to next week's build + deploy queue.

          Show
          gopalv Gopal V added a comment - Added patch to next week's build + deploy queue.
          Hide
          gopalv Gopal V added a comment -

          The patch looks good, there's no performance degradation due to locking

          Tried to check total held connections, which seems to also be contained with the guava cache patch.

          Show
          gopalv Gopal V added a comment - The patch looks good, there's no performance degradation due to locking Tried to check total held connections, which seems to also be contained with the guava cache patch.
          Hide
          ajisakaa Akira Ajisaka added a comment -

          Thanks Haohui Mai for taking this issue. Using Guava cache looks better.

          Show
          ajisakaa Akira Ajisaka added a comment - Thanks Haohui Mai for taking this issue. Using Guava cache looks better.
          Hide
          jingzhao Jing Zhao added a comment -

          The patch looks good to me. One concern is that it removes several deprecated API from Client.java, which is marked as "LimitedPrivate". Please make sure it is OK for other projects. Other than that +1.

          Show
          jingzhao Jing Zhao added a comment - The patch looks good to me. One concern is that it removes several deprecated API from Client.java, which is marked as "LimitedPrivate". Please make sure it is OK for other projects. Other than that +1.
          Hide
          wheat9 Haohui Mai added a comment -
          @InterfaceAudience.LimitedPrivate(value = { "Common", "HDFS", "MapReduce", "Yarn" })
          

          I verified the above projects do not use the deprecated methods.

          Show
          wheat9 Haohui Mai added a comment - @InterfaceAudience.LimitedPrivate(value = { "Common" , "HDFS" , "MapReduce" , "Yarn" }) I verified the above projects do not use the deprecated methods.
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #7878 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7878/)
          HADOOP-11772. RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7)

          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #7878 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7878/ ) HADOOP-11772 . RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #934 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/934/)
          HADOOP-11772. RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7)

          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk #934 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/934/ ) HADOOP-11772 . RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java hadoop-common-project/hadoop-common/CHANGES.txt
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #203 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/203/)
          HADOOP-11772. RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7)

          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #203 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/203/ ) HADOOP-11772 . RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java hadoop-common-project/hadoop-common/CHANGES.txt
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2132 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2132/)
          HADOOP-11772. RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7)

          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk #2132 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2132/ ) HADOOP-11772 . RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #202 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/202/)
          HADOOP-11772. RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7)

          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #202 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/202/ ) HADOOP-11772 . RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java hadoop-common-project/hadoop-common/CHANGES.txt
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #192 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/192/)
          HADOOP-11772. RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7)

          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #192 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/192/ ) HADOOP-11772 . RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2150 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2150/)
          HADOOP-11772. RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7)

          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2150 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2150/ ) HADOOP-11772 . RPC Invoker relies on static ClientCache which has synchronized(this) blocks. Contributed by Haohui Mai. (wheat9: rev fb6b38d67d8b997eca498fc5010b037e3081ace7) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java hadoop-common-project/hadoop-common/CHANGES.txt
          Hide
          daryn Daryn Sharp added a comment -

          A very delayed -1. CacheBuilder is obscenely expensive for concurrent map, and it requires generating unnecessary garbage even just to look up a key. Replace it with ConcurrentHashMap.

          I identified this issue that impaired my own perf testing under load. The slowdown isn't just the sync. It's the expensive of Connection's ctor stalling other connections. The expensive of ConnectionId#equals causes delays. Synch'ing on connections causes unfair contention unlike a sync'ed method. Concurrency simply hides this.

          If I can ever escape from the, um, "fun" of stabilizing our production clusters, I'll dig out a patch that is more efficient than a CHM. In the meantime, use a CHM. Either re-fix on this jira, or file another critical jira.

          Show
          daryn Daryn Sharp added a comment - A very delayed -1. CacheBuilder is obscenely expensive for concurrent map, and it requires generating unnecessary garbage even just to look up a key. Replace it with ConcurrentHashMap. I identified this issue that impaired my own perf testing under load. The slowdown isn't just the sync. It's the expensive of Connection's ctor stalling other connections. The expensive of ConnectionId#equals causes delays. Synch'ing on connections causes unfair contention unlike a sync'ed method. Concurrency simply hides this. If I can ever escape from the, um, "fun" of stabilizing our production clusters, I'll dig out a patch that is more efficient than a CHM. In the meantime, use a CHM. Either re-fix on this jira, or file another critical jira.
          Hide
          kihwal Kihwal Lee added a comment -

          Daryn Sharp, would you file a jira to address the issue you pointed out? May be we could make it a subtask of HADOOP-13425.

          Show
          kihwal Kihwal Lee added a comment - Daryn Sharp , would you file a jira to address the issue you pointed out? May be we could make it a subtask of HADOOP-13425 .

            People

            • Assignee:
              wheat9 Haohui Mai
              Reporter:
              gopalv Gopal V
            • Votes:
              0 Vote for this issue
              Watchers:
              18 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development