Uploaded image for project: 'Tephra'
  1. Tephra
  2. TEPHRA-242

Transaction Service sometimes does not shutdown

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.0-incubating
    • Fix Version/s: 0.13.0-incubating
    • Component/s: manager
    • Labels:
      None

      Description

      If shutting down encounters an exception (for example, OutOfMemory), it can happen that the process does not exit. We have observed this after an NPE from DFS (closing the transaction log), the process stayed alive and the\ prune thread (executor) was not shut down. It was not clear why the process stayed alive, but it would certainly help to handle exceptions better in the shutdown sequence.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/incubator-tephra/pull/49

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/incubator-tephra/pull/49
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user anew commented on a diff in the pull request:

          https://github.com/apache/incubator-tephra/pull/49#discussion_r137879378

          — Diff: tephra-core/src/main/java/org/apache/tephra/distributed/TransactionService.java —
          @@ -121,15 +121,15 @@ public void failed(State from, Throwable failure) {
          @Override
          public void follower() {
          ListenableFuture<State> stopFuture = null;
          + if (pruningService != null && pruningService.isRunning())

          { + // Wait for pruning service to stop after un-registering from discovery + stopFuture = pruningService.stop(); + }

          // First stop the transaction server as un-registering from discovery can block sometimes.
          // That can lead to multiple transaction servers being active at the same time.
          if (server != null && server.isRunning())

          { server.stopAndWait(); }
          • if (pruningService != null && pruningService.isRunning()) { - // Wait for pruning service to stop after un-registering from discovery - stopFuture = pruningService.stop(); - }

            undoRegister();

          if (stopFuture != null) {
          — End diff –

          pruningService.shutdown() itself has a 30 second timeout. So this future is guaranteed to return, and we don't need a timeout here.

          Show
          githubbot ASF GitHub Bot added a comment - Github user anew commented on a diff in the pull request: https://github.com/apache/incubator-tephra/pull/49#discussion_r137879378 — Diff: tephra-core/src/main/java/org/apache/tephra/distributed/TransactionService.java — @@ -121,15 +121,15 @@ public void failed(State from, Throwable failure) { @Override public void follower() { ListenableFuture<State> stopFuture = null; + if (pruningService != null && pruningService.isRunning()) { + // Wait for pruning service to stop after un-registering from discovery + stopFuture = pruningService.stop(); + } // First stop the transaction server as un-registering from discovery can block sometimes. // That can lead to multiple transaction servers being active at the same time. if (server != null && server.isRunning()) { server.stopAndWait(); } if (pruningService != null && pruningService.isRunning()) { - // Wait for pruning service to stop after un-registering from discovery - stopFuture = pruningService.stop(); - } undoRegister(); if (stopFuture != null) { — End diff – pruningService.shutdown() itself has a 30 second timeout. So this future is guaranteed to return, and we don't need a timeout here.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user poornachandra commented on a diff in the pull request:

          https://github.com/apache/incubator-tephra/pull/49#discussion_r137865504

          — Diff: tephra-core/src/main/java/org/apache/tephra/distributed/TransactionService.java —
          @@ -121,15 +121,15 @@ public void failed(State from, Throwable failure) {
          @Override
          public void follower() {
          ListenableFuture<State> stopFuture = null;
          + if (pruningService != null && pruningService.isRunning())

          { + // Wait for pruning service to stop after un-registering from discovery + stopFuture = pruningService.stop(); + }

          // First stop the transaction server as un-registering from discovery can block sometimes.
          // That can lead to multiple transaction servers being active at the same time.
          if (server != null && server.isRunning())

          { server.stopAndWait(); }
          • if (pruningService != null && pruningService.isRunning()) { - // Wait for pruning service to stop after un-registering from discovery - stopFuture = pruningService.stop(); - }

            undoRegister();

          if (stopFuture != null) {
          — End diff –

          It would be good to add a timeout to the `Futures.getUnchecked(stopFuture)` in the next line, so that we don't block forever.

          Show
          githubbot ASF GitHub Bot added a comment - Github user poornachandra commented on a diff in the pull request: https://github.com/apache/incubator-tephra/pull/49#discussion_r137865504 — Diff: tephra-core/src/main/java/org/apache/tephra/distributed/TransactionService.java — @@ -121,15 +121,15 @@ public void failed(State from, Throwable failure) { @Override public void follower() { ListenableFuture<State> stopFuture = null; + if (pruningService != null && pruningService.isRunning()) { + // Wait for pruning service to stop after un-registering from discovery + stopFuture = pruningService.stop(); + } // First stop the transaction server as un-registering from discovery can block sometimes. // That can lead to multiple transaction servers being active at the same time. if (server != null && server.isRunning()) { server.stopAndWait(); } if (pruningService != null && pruningService.isRunning()) { - // Wait for pruning service to stop after un-registering from discovery - stopFuture = pruningService.stop(); - } undoRegister(); if (stopFuture != null) { — End diff – It would be good to add a timeout to the `Futures.getUnchecked(stopFuture)` in the next line, so that we don't block forever.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user gokulavasan commented on the issue:

          https://github.com/apache/incubator-tephra/pull/49

          LGTM 👍

          Show
          githubbot ASF GitHub Bot added a comment - Github user gokulavasan commented on the issue: https://github.com/apache/incubator-tephra/pull/49 LGTM 👍
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user anew opened a pull request:

          https://github.com/apache/incubator-tephra/pull/49

          TEPHRA-242 Ensure Pruning Service shuts down

          Shutdown the pruning service before the tx server, to ensure it shuts down even if tx server shutdown is interrupted.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/anew/incubator-tephra tephra-242

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/incubator-tephra/pull/49.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #49


          commit 944509fc25167fc52815673b01a7b03295628711
          Author: anew <anew@apache.org>
          Date: 2017-08-31T22:41:10Z

          TEPHRA-242 Ensure Pruning Service shuts down


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user anew opened a pull request: https://github.com/apache/incubator-tephra/pull/49 TEPHRA-242 Ensure Pruning Service shuts down Shutdown the pruning service before the tx server, to ensure it shuts down even if tx server shutdown is interrupted. You can merge this pull request into a Git repository by running: $ git pull https://github.com/anew/incubator-tephra tephra-242 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-tephra/pull/49.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #49 commit 944509fc25167fc52815673b01a7b03295628711 Author: anew <anew@apache.org> Date: 2017-08-31T22:41:10Z TEPHRA-242 Ensure Pruning Service shuts down
          Hide
          anew Andreas Neumann added a comment -

          It is hard to say what happened in the scenario that we observed: The pruning service was still running after the transaction server had shutdown. The only way to explain this is that in this code:

                  if (server != null && server.isRunning()) {
                    server.stopAndWait();
                  }
                  if (pruningService != null && pruningService.isRunning()) {
                    // Wait for pruning service to stop after un-registering from discovery
                    stopFuture = pruningService.stop();
                  }
          

          the stopping of the server threw an exception (we were facing OutOfMemory issues) or an InterruptedException. In that case, the exception goes uncaught and pruningService.stop() is never called. It seems safe to stop the prune service first (it needs the tx server anyway).

          Show
          anew Andreas Neumann added a comment - It is hard to say what happened in the scenario that we observed: The pruning service was still running after the transaction server had shutdown. The only way to explain this is that in this code: if (server != null && server.isRunning()) { server.stopAndWait(); } if (pruningService != null && pruningService.isRunning()) { // Wait for pruning service to stop after un-registering from discovery stopFuture = pruningService.stop(); } the stopping of the server threw an exception (we were facing OutOfMemory issues) or an InterruptedException. In that case, the exception goes uncaught and pruningService.stop() is never called. It seems safe to stop the prune service first (it needs the tx server anyway).

            People

            • Assignee:
              anew Andreas Neumann
              Reporter:
              anew Andreas Neumann
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development