Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-4816

transitionToActive blocks if the SBN is doing checkpoint image transfer

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 2.0.4-alpha
    • Fix Version/s: 2.3.0
    • Component/s: namenode
    • Labels:
      None

      Description

      The NN and SBN do this dance during checkpoint image transfer with nested HTTP GETs via HttpURLConnection. When an admin does a -transitionToActive during this transfer, part of that is interrupting an ongoing checkpoint so we can transition immediately.

      However, the thread.interrupt() in StandbyCheckpointer#stop gets swallowed by connection.getResponseCode() in TransferFsImage#doGetUrl. None of the methods in HttpURLConnection throw InterruptedException, so we need to do something else (perhaps HttpClient [1]):

      [1]: http://hc.apache.org/httpclient-3.x/

      1. stacks.out
        770 kB
        Andrew Wang
      2. hdfs-4816-slow-shutdown.txt
        103 kB
        Andrew Wang
      3. hdfs-4816-4.patch
        5 kB
        Andrew Wang
      4. hdfs-4816-3.patch
        5 kB
        Andrew Wang
      5. hdfs-4816-2.patch
        4 kB
        Andrew Wang
      6. hdfs-4816-1.patch
        2 kB
        Andrew Wang

        Issue Links

          Activity

          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Patch Available Patch Available Open Open
          8d 15h 51m 1 Andrew Wang 22/May/13 19:25
          Open Open Patch Available Patch Available
          51d 4h 45m 2 Andrew Wang 09/Jul/13 22:02
          Patch Available Patch Available Resolved Resolved
          36d 2h 36m 1 Andrew Wang 15/Aug/13 00:38
          Resolved Resolved Closed Closed
          193d 21h 18m 1 Arun C Murthy 24/Feb/14 20:57
          Chris Nauroth made changes -
          Link This issue is related to HDFS-6243 [ HDFS-6243 ]
          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Arun C Murthy made changes -
          Fix Version/s 2.3.0 [ 12325255 ]
          Fix Version/s 2.4.0 [ 12324588 ]
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1519 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1519/)
          HDFS-4816. transitionToActive blocks if the SBN is doing checkpoint image transfer. (Andrew Wang) (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514095)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1519 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1519/ ) HDFS-4816 . transitionToActive blocks if the SBN is doing checkpoint image transfer. (Andrew Wang) (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514095 ) /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          Hide
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1492 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1492/)
          HDFS-4816. transitionToActive blocks if the SBN is doing checkpoint image transfer. (Andrew Wang) (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514095)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          Show
          Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk #1492 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1492/ ) HDFS-4816 . transitionToActive blocks if the SBN is doing checkpoint image transfer. (Andrew Wang) (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514095 ) /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          Hide
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #302 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/302/)
          HDFS-4816. transitionToActive blocks if the SBN is doing checkpoint image transfer. (Andrew Wang) (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514095)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          Show
          Hudson added a comment - SUCCESS: Integrated in Hadoop-Yarn-trunk #302 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/302/ ) HDFS-4816 . transitionToActive blocks if the SBN is doing checkpoint image transfer. (Andrew Wang) (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514095 ) /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          Hide
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #4261 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4261/)
          HDFS-4816. transitionToActive blocks if the SBN is doing checkpoint image transfer. (Andrew Wang) (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514095)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          Show
          Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #4261 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4261/ ) HDFS-4816 . transitionToActive blocks if the SBN is doing checkpoint image transfer. (Andrew Wang) (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1514095 ) /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyCheckpointer.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          Andrew Wang made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 2.3.0 [ 12324588 ]
          Resolution Fixed [ 1 ]
          Hide
          Andrew Wang added a comment -

          Great, thanks atm. Committed to trunk and branch-2.

          Show
          Andrew Wang added a comment - Great, thanks atm. Committed to trunk and branch-2.
          Hide
          Aaron T. Myers added a comment -

          +1, the latest patch looks good to me.

          I also checked with Todd offline and confirmed that he's good with the latest patch.

          Thanks a lot, Andrew.

          Show
          Aaron T. Myers added a comment - +1, the latest patch looks good to me. I also checked with Todd offline and confirmed that he's good with the latest patch. Thanks a lot, Andrew.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12589793/hdfs-4816-4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4614//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4614//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12589793/hdfs-4816-4.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4614//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4614//console This message is automatically generated.
          Andrew Wang made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Andrew Wang made changes -
          Attachment hdfs-4816-4.patch [ 12589793 ]
          Hide
          Andrew Wang added a comment -

          Thanks for the review Colin. Newest patch adds a print on InterruptedException, my test output shows the expected interrupt during the get on the Future.

          2013-06-26 13:49:15,797 INFO  ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(332)) - Interrupted during checkpointing
          java.lang.InterruptedException
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:979)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
          	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
          	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
          	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:200)
          	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:61)
          	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:325)
          	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:238)
          	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:258)
          	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
          	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:254)
          2013-06-26 13:49:15,798 WARN  ha.EditLogTailer (EditLogTailer.java:doWork(336)) - Edit log tailer interrupted
          java.lang.InterruptedException: sleep interrupted
          	at java.lang.Thread.sleep(Native Method)
          	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:334)
          	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
          	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
          	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
          	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
          
          Show
          Andrew Wang added a comment - Thanks for the review Colin. Newest patch adds a print on InterruptedException, my test output shows the expected interrupt during the get on the Future. 2013-06-26 13:49:15,797 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(332)) - Interrupted during checkpointing java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:979) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281) at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:200) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:61) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:325) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:238) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:258) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:254) 2013-06-26 13:49:15,798 WARN ha.EditLogTailer (EditLogTailer.java:doWork(336)) - Edit log tailer interrupted java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:334) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
          Hide
          Colin Patrick McCabe added a comment -

          I think we don't see an exception because StandbyCheckpointer#stop does a thread.interrupt, so the blocking get on the future throws an uncaught InterruptedException rather than an ExecutionException. This is then silently caught in #doWork and the StandbyCheckpointer exits.

          Can you add a log to StandbyCheckpointer#doWork for the case where it gets interrupted?

          Show
          Colin Patrick McCabe added a comment - I think we don't see an exception because StandbyCheckpointer#stop does a thread.interrupt, so the blocking get on the future throws an uncaught InterruptedException rather than an ExecutionException. This is then silently caught in #doWork and the StandbyCheckpointer exits. Can you add a log to StandbyCheckpointer#doWork for the case where it gets interrupted?
          Andrew Wang made changes -
          Attachment stacks.out [ 12584360 ]
          Hide
          Andrew Wang added a comment -

          Attaching my jstack output, in case I missed something.

          I think we don't see an exception because StandbyCheckpointer#stop does a thread.interrupt, so the blocking get on the future throws an uncaught InterruptedException rather than an ExecutionException. This is then silently caught in #doWork and the StandbyCheckpointer exits.

          I'm a bit surprised that I don't see a "TransferFsImageUpload" thread in the jstack (is it automatically cleaned up when the StandbyCheckpointer terminates? is the executor reusing the current thread?), but that's kind of a side note.

          All I do see are the two servlets talking to each other and MiniDFSCluster#shutdown blocking on them.

          Show
          Andrew Wang added a comment - Attaching my jstack output, in case I missed something. I think we don't see an exception because StandbyCheckpointer#stop does a thread.interrupt , so the blocking get on the future throws an uncaught InterruptedException rather than an ExecutionException . This is then silently caught in #doWork and the StandbyCheckpointer exits. I'm a bit surprised that I don't see a "TransferFsImageUpload" thread in the jstack (is it automatically cleaned up when the StandbyCheckpointer terminates? is the executor reusing the current thread?), but that's kind of a side note. All I do see are the two servlets talking to each other and MiniDFSCluster#shutdown blocking on them.
          Andrew Wang made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Todd Lipcon added a comment -

          Hey Andrew. I don't see the line "Exception during image upload" anywhere in the log you uploaded. Am I missing something? Shouldn't we see the slow checkpoint upload getting an exception due to being interrupted by the "cancellation"?

          Show
          Todd Lipcon added a comment - Hey Andrew. I don't see the line "Exception during image upload" anywhere in the log you uploaded. Am I missing something? Shouldn't we see the slow checkpoint upload getting an exception due to being interrupted by the "cancellation"?
          Hide
          Andrew Wang added a comment -

          I gathered some jstacks. It looks like there's a SBN servlet upload thread blocked on a DataTransferThrottler, a NN servlet thread doing the image download, and the main thread is hung in MiniDFSCluster#shutdown, specifically HttpServer#stop. The javadoc for this says:

          Stops the component. The component may wait for current activities to complete normally, but it can be interrupted.

          Apparently, we're dealing with a non-interrupting stop. Further googling indicates that shutdown blocks until active requests are finished [1]. I tried setting setGracefulShutdown to no avail. Dunno how to get around this without patching Jetty.

          Are we okay with the 100B/s set in the current patch? It easily hits the timing window for me.

          Setting 1B/s also does succeed, it just takes 9 minutes to complete the test.

          [1]: http://docs.codehaus.org/display/JETTY/How+to+gracefully+shutdown

          Show
          Andrew Wang added a comment - I gathered some jstacks. It looks like there's a SBN servlet upload thread blocked on a DataTransferThrottler, a NN servlet thread doing the image download, and the main thread is hung in MiniDFSCluster#shutdown, specifically HttpServer#stop. The javadoc for this says: Stops the component. The component may wait for current activities to complete normally, but it can be interrupted. Apparently, we're dealing with a non-interrupting stop. Further googling indicates that shutdown blocks until active requests are finished [1] . I tried setting setGracefulShutdown to no avail. Dunno how to get around this without patching Jetty. Are we okay with the 100B/s set in the current patch? It easily hits the timing window for me. Setting 1B/s also does succeed, it just takes 9 minutes to complete the test. [1] : http://docs.codehaus.org/display/JETTY/How+to+gracefully+shutdown
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12583572/hdfs-4816-3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4410//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4410//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583572/hdfs-4816-3.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4410//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4410//console This message is automatically generated.
          Hide
          Todd Lipcon added a comment -

          Did you try gathering a jstack of the test process while it was hung? That's the interesting bit that will tell why it's hanging.

          Show
          Todd Lipcon added a comment - Did you try gathering a jstack of the test process while it was hung? That's the interesting bit that will tell why it's hanging.
          Andrew Wang made changes -
          Attachment hdfs-4816-3.patch [ 12583572 ]
          Hide
          Andrew Wang added a comment -

          Here's a new patch using a thread factory, setting daemon=true and a thread name.

          I also attached a log from where I throttled the test bandwidth to 1 B/s. You see it basically hang for 8 minutes after trying to shutdown the MiniDFSCluster, then this:

          2013-05-16 15:38:46,950 WARN  mortbay.log (Slf4jLog.java:warn(76)) - 1 threads could not be stopped
          

          Down the road, I still want to make image transfer actually cancellable, it'd fix this issue.

          Show
          Andrew Wang added a comment - Here's a new patch using a thread factory, setting daemon=true and a thread name. I also attached a log from where I throttled the test bandwidth to 1 B/s. You see it basically hang for 8 minutes after trying to shutdown the MiniDFSCluster, then this: 2013-05-16 15:38:46,950 WARN mortbay.log (Slf4jLog.java:warn(76)) - 1 threads could not be stopped Down the road, I still want to make image transfer actually cancellable, it'd fix this issue.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12583570/hdfs-4816-slow-shutdown.txt
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4408//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583570/hdfs-4816-slow-shutdown.txt against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4408//console This message is automatically generated.
          Andrew Wang made changes -
          Attachment hdfs-4816-slow-shutdown.txt [ 12583570 ]
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12583234/hdfs-4816-2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4396//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4396//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12583234/hdfs-4816-2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4396//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4396//console This message is automatically generated.
          Hide
          Todd Lipcon added a comment -

          A few thoughts that might help with the above:

          • make the thread a daemon thread - that way it won't block shutdown, I think? Are you sure it's the cluster.shutdown() and not just the JVM exit waiting on the thread?
          • the thread could probably do with a name as well. You can use the guava ThreadFactoryBuilder to make this easier.
          Show
          Todd Lipcon added a comment - A few thoughts that might help with the above: make the thread a daemon thread - that way it won't block shutdown, I think? Are you sure it's the cluster.shutdown() and not just the JVM exit waiting on the thread? the thread could probably do with a name as well. You can use the guava ThreadFactoryBuilder to make this easier.
          Andrew Wang made changes -
          Attachment hdfs-4816-2.patch [ 12583234 ]
          Hide
          Andrew Wang added a comment -

          Thanks for the review, Todd.

          The attached test case isn't perfect since there's a potential timing issue. I believe cluster.shutdown() in the test's @After hangs on the background transfer thread. When I had the transfer throttled to 1 B/s it took 9 minutes to complete (successfully), so I set it to 100B/s for faster test runs. The log shows it's still testing the behavior correctly though, so it's probably fine.

          I also do the rethrow of any exceptions from the Future, but we'll miss any exceptions thrown after it interrupts out. Probably fine also.

          Show
          Andrew Wang added a comment - Thanks for the review, Todd. The attached test case isn't perfect since there's a potential timing issue. I believe cluster.shutdown() in the test's @After hangs on the background transfer thread. When I had the transfer throttled to 1 B/s it took 9 minutes to complete (successfully), so I set it to 100B/s for faster test runs. The log shows it's still testing the behavior correctly though, so it's probably fine. I also do the rethrow of any exceptions from the Future, but we'll miss any exceptions thrown after it interrupts out. Probably fine also.
          Hide
          Todd Lipcon added a comment -

          May want to actually wait on the submitted future itself, so that if there's an exception it propagates back to the main thread.

          For a unit test, maybe you can set the transfer throttler to something very slow like 1 byte/second?

          Show
          Todd Lipcon added a comment - May want to actually wait on the submitted future itself, so that if there's an exception it propagates back to the main thread. For a unit test, maybe you can set the transfer throttler to something very slow like 1 byte/second?
          Andrew Wang made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Assignee Andrew Wang [ andrew.wang ]
          Andrew Wang made changes -
          Field Original Value New Value
          Attachment hdfs-4816-1.patch [ 12583064 ]
          Hide
          Andrew Wang added a comment -

          Here's a patch making it into a thread. I've tested this manually with a pseudo-distributed HA cluster, and am able to transition from standby to active without blocking on the transfer.

          Making a unit test for this is proving difficult. Posting it up for now sans test.

          Show
          Andrew Wang added a comment - Here's a patch making it into a thread. I've tested this manually with a pseudo-distributed HA cluster, and am able to transition from standby to active without blocking on the transfer. Making a unit test for this is proving difficult. Posting it up for now sans test.
          Hide
          Todd Lipcon added a comment -

          Another option would be to run the checkpoint upload in a separate thread, so that we don't block on it while becoming active. It's OK for the transfer to continue even as the NN goes into active state, since it's just transferring an immutable file anyway.

          Show
          Todd Lipcon added a comment - Another option would be to run the checkpoint upload in a separate thread, so that we don't block on it while becoming active. It's OK for the transfer to continue even as the NN goes into active state, since it's just transferring an immutable file anyway.
          Andrew Wang created issue -

            People

            • Assignee:
              Andrew Wang
              Reporter:
              Andrew Wang
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development