Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.3, 1.7.1, 1.8.0
    • Component/s: test, tserver
    • Labels:
    • Environment:

      1.6.0-SNAPSHOT (sha-1: 0da9a56)
      cdh4.5.0

      Description

      I was running continuous ingest on a 7 node cluster (5 slaves) and after enabling HDFS agitation, my clients died.

      ingest.err
      Thread "org.apache.accumulo.test.continuous.ContinuousIngest" died java.lang.reflect.InvocationTargetException
      java.lang.reflect.InvocationTargetException
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.accumulo.start.Main$1.run(Main.java:137)
      at java.lang.Thread.run(Thread.java:662)
      Caused by: java.lang.reflect.UndeclaredThrowableException
      at $Proxy9.addMutation(Unknown Source)
      at org.apache.accumulo.test.continuous.ContinuousIngest.main(ContinuousIngest.java:212)
      ... 6 more
      Caused by: java.lang.reflect.InvocationTargetException
      at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.accumulo.trace.instrument.TraceProxy$2.invoke(TraceProxy.java:43)
      ... 8 more
      Caused by: org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: {} # server errors 1 # exceptions 0
      at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
      at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:258)
      at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:43)
      ... 12 more
      
      ingest.out
      UUID 1392844086463 f822a6a9-9592-4b3a-ab3b-1c172be20b96
      FLUSH 1392844135523 49047 6165 1000000 1000000
      FLUSH 1392844165594 30071 7787 2000000 1000000
      FLUSH 1392844195875 30281 7816 3000000 1000000
      FLUSH 1392844226787 30912 8086 4000000 1000000
      FLUSH 1392844257194 30407 7989 5000000 1000000
      FLUSH 1392844287518 30324 7743 6000000 1000000
      FLUSH 1392844325833 38315 10933 7000000 1000000
      FLUSH 1392844364708 38875 7916 8000000 1000000
      FLUSH 1392844395818 31110 8104 9000000 1000000
      2014-02-19 13:16:57,444 [impl.TabletServerBatchWriter] ERROR: Server side error on tserver1:10011: org.apache.thrift.TApplicationException: Internal error processing applyUpdates
      2014-02-19 13:16:57,446 [impl.TabletServerBatchWriter] ERROR: Failed to send tablet server tserver1:10011 its batch : Error on server tserver1:10011
      org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server tserver1:10011
      at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:937)
      at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.access$1600(TabletServerBatchWriter.java:616)
      at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter$SendTask.send(TabletServerBatchWriter.java:801)
      at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter$SendTask.run(TabletServerBatchWriter.java:765)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
      at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
      at java.lang.Thread.run(Thread.java:662)
      Caused by: org.apache.thrift.TApplicationException: Internal error processing applyUpdates
      at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
      at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
      at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_closeUpdate(TabletClientService.java:431)
      at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.closeUpdate(TabletClientService.java:417)
      at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:899)
      ... 11 more
      
      tserver.log
      2014-02-19 13:16:56,156 [util.TServerUtils$THsHaServer] WARN : Got an IOException in internalRead!
      java.io.IOException: Connection reset by peer
      at sun.nio.ch.FileDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      at sun.nio.ch.IOUtil.read(IOUtil.java:171)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
      at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
      at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
      at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:355)
      at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
      at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
      at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
      

      Note that this last message was not propagated to the monitor for some reason, but that is likely a different issue. (I had been seeing other WARN messages show up earlier.)

      1. tserver1.log
        12 kB
        Mike Drob
      2. tracer.debug.log
        47 kB
        Mike Drob
      3. ACCUMULO-2388-1.patch
        1 kB
        Keith Turner

        Issue Links

          Activity

          elserj Josh Elser added a comment -

          Filed ACCUMULO-3908

          kturner Keith Turner added a comment -

          Not sure what it should be instrumented with.

          elserj Josh Elser added a comment -

          I would really like to see info collected from instrumented clients

          Instrumented with what?

          kturner Keith Turner added a comment -

          > Sounds like another good ticket. We don't have any centralized metrics system for the client, do we?

          I would really like to see info collected from instrumented clients. Seems like it would be very useful info.

          elserj Josh Elser added a comment -

          For some fraction of the time when the exception is thrown the client will be gone. I wish I had more data about what that fraction was.

          Sounds like another good ticket. We don't have any centralized metrics system for the client, do we? We have metrics2 on the server. Not sure if that's the best choice for the client though. May be better to look at something more self-contained.

          kturner Keith Turner added a comment -

          > This would eliminate the unnecessary location-cache eviction this change creates. I'd like to see us think about this for 1.8/2.0. Thoughts?

          I was worried about the cache eviction when I first made the change. However I am not sure how frequently it will actually occur. For some fraction of the time when the exception is thrown the client will be gone. I wish I had more data about what that fraction was.

          elserj Josh Elser added a comment -

          Sorry for the belated review. I made some time to look over these changes in light of the nearly immediate RC that followed.

          I did a quick glance at how the client handles the new thrown Exception from the server, and it does appear to line up with what your comments suggested. I did cringe a bit at it being a hack, but can understand the choice for this solution for patch releases.

          It seems like this is another one of those areas in the code where we could seriously benefit from a well-defined retry policy instead of piggy-backing on code which happens to do an implicit retry. It seems like we'd want a specific exception for the "go away, come back in a little while" logic. This would eliminate the unnecessary location-cache eviction this change creates. I'd like to see us think about this for 1.8/2.0. Thoughts?
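
          To make the "go away, come back in a little while" idea concrete, here is a minimal, self-contained sketch of what such a retry signal and client loop could look like. Every name in it (RetryLaterException, sendBatch, MAX_ATTEMPTS) is hypothetical and for illustration only; it is not the project's actual API and not the change under review.

          ```
          public class RetryLaterSketch {

            // Hypothetical exception a server could throw to mean "back off and retry";
            // crucially, it does not imply the cached tablet location is invalid.
            static class RetryLaterException extends Exception {
              final long suggestedWaitMillis;

              RetryLaterException(String msg, long suggestedWaitMillis) {
                super(msg);
                this.suggestedWaitMillis = suggestedWaitMillis;
              }
            }

            static final int MAX_ATTEMPTS = 5;

            // Retry loop that sleeps for the server-suggested interval instead of
            // treating the failure as a server error and evicting cached locations.
            static void sendWithRetry() throws Exception {
              for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try {
                  sendBatch();
                  return;
                } catch (RetryLaterException e) {
                  Thread.sleep(e.suggestedWaitMillis);
                }
              }
              throw new Exception("gave up after " + MAX_ATTEMPTS + " attempts");
            }

            // Stand-in for the real RPC; always succeeds in this sketch.
            static void sendBatch() throws RetryLaterException {
            }
          }
          ```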

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/accumulo/pull/38

          githubbot ASF GitHub Bot added a comment -

          GitHub user keith-turner opened a pull request:

          https://github.com/apache/accumulo/pull/38

          ACCUMULO-2388 Make clients retry in case of HoldTimeoutException

          To test this patch, I made the following changes and ran the WriteLotsIT and ConditionalWriterIT

          ```
          diff --git a/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java b/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java
          index 7c0eedc..4483fe8 100644
          --- a/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java
          +++ b/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java
          @@ -469,6 +469,9 @@ public class TabletServerResourceManager {
             }
           
             void waitUntilCommitsAreEnabled() {
          +    if(Math.random() < .1)
          +      throw new HoldTimeoutException("test");
          +
               if (holdCommits) {
                 long timeout = System.currentTimeMillis() + conf.getConfiguration().getTimeInMillis(Property.GENERAL_RPC_TIMEOUT);
                 synchronized (commitHold) {
          ```

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/keith-turner/accumulo ACCUMULO-2388

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/accumulo/pull/38.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #38


          commit 6965fb07c97cd752d6f5415a39e87b9ff3c1ba7e
          Author: Keith Turner <kturner@apache.org>
          Date: 2015-06-12T21:47:18Z

          ACCUMULO-2388 Make clients retry in case of HoldTimeoutException


          ctubbsii Christopher Tubbs added a comment -

          Since there hasn't been a response, and this has already been punted before, I can't imagine why it should be a blocker for 1.6.3. I'm dropping priority and making it Critical instead. (If somebody objects to lowering priority, please volunteer to provide a timely fix, before moving it back to "Blocker"; otherwise, I see no reason to permit this to hold up other fixes included in 1.6.3).

          ctubbsii Christopher Tubbs added a comment -

          Does this need to be a blocker for 1.6.3?

          kturner Keith Turner added a comment -

          I was not aware of that until Eric Newton mentioned it to me earlier today.

          elserj Josh Elser added a comment -

          I was not doing any agitation. I was seeing a lot of error messages from HDFS when writing to walogs stating that it could not replicate to all datanodes. I was using hadoop 2.7.0.

          Oh, ok. BTW, are you aware of the "not for production" label on Hadoop 2.7.0? I think a 2.7.1 is incoming shortly, but I don't remember if the found issues would have a direct impact on us.

          kturner Keith Turner added a comment - edited

          I was not doing any agitation. I was seeing a lot of error messages from HDFS when writing to walogs stating that it could not replicate to all datanodes. I was using hadoop 2.7.0. I am going to try switching to hadoop 2.6, updating Accumulo settings, and restarting test.

          elserj Josh Elser added a comment -

          Copied from ACCUMULO-3811 because it might be relevant here:

          I'm seeing this pretty regularly with DN agitation on, commits held being the cause each time. I am surprised that I'm seeing this little resilience coming out of HDFS:

          Big CI picture w/ agitation

          20150513 08:58:26 Killing datanode on cn021
          20150513 09:08:26 Starting datanode on cn021
          
          2015-05-13 09:08:16,473 [tserver.TabletServer$ThriftClientHandler] ERROR: Commits are held
          org.apache.accumulo.tserver.HoldTimeoutException: Commits are held
          
          2015-05-13 09:08:16,479 [impl.TabletServerBatchWriter] ERROR: Server side error on cn022:9997: org.apache.thrift.TApplicationException: Internal error processing closeUpdate
          2015-05-13 09:08:16,483 [start.Main] ERROR: Thread 'org.apache.accumulo.test.continuous.ContinuousIngest' died.
          

          Maybe some blocks aren't fully replicated? I'm not sure but I feel like things shouldn't bog down like this.

          elserj Josh Elser added a comment -

          I was seeing this off and on again too, ACCUMULO-3811

          kturner Keith Turner added a comment -

          Just ran into this again while testing 1.7.0rc3. I would like to get it fixed in the next round of bug fix releases.

          mdrob Mike Drob added a comment -

          I'm running another CI test and seeing the

          Got an IOException in internalRead!
          	java.io.IOException: Connection reset by peer
          

          traceback show up, but none of the clients appeared to die this time.

          jira-bot ASF subversion and git services added a comment -

          Commit 1588315 from kturner@apache.org in branch 'site/trunk'
          [ https://svn.apache.org/r1588315 ]

          ACCUMULO-2396 added ACCUMULO-2388 to release notes

          kturner Keith Turner added a comment -

          I was able to do another run w/ ACCUMULO-2668 and it was successful. I analyzed all of the tserver logs and still saw some high walog times. So the potential for the problem I saw previously was still there; it just did not all occur on one node at the same time.

          [cluster@ip-10-1-2-10 continuous]$ pssh -i -h ingesters.txt 'grep -a "UpSess" /opt/accumulo-1.6.0/logs/*tserver* | egrep -a -v "lt=[0-9]?[0-9][.][0-9]"'
          [12] 17:31:55 [SUCCESS] ip-10-1-2-18
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-18.ec2.internal.debug.log:2014-04-15 06:17:24,337 [tserver.TabletServer] DEBUG: UpSess 10.1.2.28:48032 16,975 in 102.868s, at=[0 0 0.00 32] ft=102.667s(pt=0.090s lt=102.317s ct=0.260s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-18.ec2.internal.debug.log:2014-04-15 06:17:38,671 [tserver.TabletServer] DEBUG: UpSess 10.1.2.19:52736 16,842 in 118.418s, at=[0 0 0.00 32] ft=118.355s(pt=0.008s lt=118.178s ct=0.169s)
          [17] 17:31:58 [SUCCESS] ip-10-1-2-23
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:36,850 [tserver.TabletServer] DEBUG: UpSess 10.1.2.20:50978 17,092 in 137.192s, at=[0 1 0.03 32] ft=137.122s(pt=0.006s lt=136.790s ct=0.326s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:36,896 [tserver.TabletServer] DEBUG: UpSess 10.1.2.16:38184 24,779 in 137.509s, at=[0 0 0.00 32] ft=137.295s(pt=0.008s lt=136.912s ct=0.375s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:37,429 [tserver.TabletServer] DEBUG: UpSess 10.1.2.21:59812 17,032 in 138.041s, at=[0 0 0.00 32] ft=137.855s(pt=0.013s lt=137.715s ct=0.127s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:39,004 [tserver.TabletServer] DEBUG: UpSess 10.1.2.24:32948 8,003 in 139.603s, at=[0 0 0.00 32] ft=139.400s(pt=0.003s lt=139.190s ct=0.207s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:39,070 [tserver.TabletServer] DEBUG: UpSess 10.1.2.14:37183 16,372 in 139.701s, at=[0 0 0.00 32] ft=139.506s(pt=0.008s lt=139.228s ct=0.270s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:39,071 [tserver.TabletServer] DEBUG: UpSess 10.1.2.27:40775 16,313 in 139.670s, at=[0 1 0.03 32] ft=139.497s(pt=0.012s lt=139.211s ct=0.274s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:39,345 [tserver.TabletServer] DEBUG: UpSess 10.1.2.22:37906 33,858 in 140.193s, at=[0 0 0.00 32] ft=140.124s(pt=0.045s lt=139.540s ct=0.539s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:41,388 [tserver.TabletServer] DEBUG: UpSess 10.1.2.17:39697 33,352 in 142.488s, at=[0 0 0.00 32] ft=142.351s(pt=0.015s lt=141.952s ct=0.384s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:47,494 [tserver.TabletServer] DEBUG: UpSess 10.1.2.15:43946 8,477 in 120.655s, at=[0 1 0.03 32] ft=120.607s(pt=0.002s lt=120.210s ct=0.395s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:47,669 [tserver.TabletServer] DEBUG: UpSess 10.1.2.13:57710 17,033 in 115.454s, at=[0 1 0.03 32] ft=115.211s(pt=0.013s lt=114.630s ct=0.568s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:47,675 [tserver.TabletServer] DEBUG: UpSess 10.1.2.28:44022 16,907 in 116.482s, at=[0 0 0.00 32] ft=116.322s(pt=0.004s lt=115.743s ct=0.575s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:47,784 [tserver.TabletServer] DEBUG: UpSess 10.1.2.29:48898 25,146 in 126.834s, at=[0 0 0.00 32] ft=126.749s(pt=0.007s lt=126.057s ct=0.685s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:50,020 [tserver.TabletServer] DEBUG: UpSess 10.1.2.10:60386 2,533 in 140.164s, at=[0 0 0.00 1] ft=140.155s(pt=0.000s lt=140.129s ct=0.026s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:50,388 [tserver.TabletServer] DEBUG: UpSess 10.1.2.19:43589 16,838 in 132.732s, at=[0 1 0.03 32] ft=132.671s(pt=0.004s lt=132.276s ct=0.391s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:50,503 [tserver.TabletServer] DEBUG: UpSess 10.1.2.25:50677 25,491 in 136.344s, at=[0 0 0.00 32] ft=136.211s(pt=0.016s lt=135.684s ct=0.511s)
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-23.ec2.internal.debug.log:2014-04-15 06:17:53,409 [tserver.TabletServer] DEBUG: UpSess 10.1.2.18:36106 16,961 in 143.786s, at=[0 1 0.06 32] ft=143.719s(pt=0.004s lt=143.373s ct=0.342s)
          
          [cluster@ip-10-1-2-10 continuous]$ pssh -i -h ingesters.txt 'grep -a "writeTime" /opt/accumulo-1.6.0/logs/*tserver* | egrep -v "writeTime:[1-2]?[0-9]?[0-9]?[0-9]?[0-9]m" '
          [4] 17:34:02 [SUCCESS] ip-10-1-2-13
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-13.ec2.internal.debug.log:2014-04-15 06:17:00,243 [log.TabletServerLogger] DEBUG:  wrote MinC finish  64556: writeTime:30902ms 
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-13.ec2.internal.debug.log:2014-04-15 06:17:00,246 [log.TabletServerLogger] DEBUG:  wrote MinC finish  64557: writeTime:30663ms 
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-13.ec2.internal.debug.log:2014-04-15 06:17:03,548 [log.TabletServerLogger] DEBUG:  wrote MinC finish  64558: writeTime:33712ms 
          [12] 17:34:02 [SUCCESS] ip-10-1-2-24
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-16 04:32:31,653 [log.TabletServerLogger] DEBUG:  wrote MinC finish  822843: writeTime:35119ms 
          [14] 17:34:02 [SUCCESS] ip-10-1-2-22
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-22.ec2.internal.debug.log:2014-04-15 18:15:27,592 [log.TabletServerLogger] DEBUG:  wrote MinC finish  153363: writeTime:39875ms 
          [16] 17:34:02 [SUCCESS] ip-10-1-2-29
          /opt/accumulo-1.6.0/logs/tserver_ip-10-1-2-29.ec2.internal.debug.log:2014-04-15 06:21:25,684 [log.TabletServerLogger] DEBUG:  wrote MinC finish  62514: writeTime:50973ms 
          
          mdrob Mike Drob added a comment -

          Running 1.6.0-RC2, I saw this happen again with both traces and one of the ingesters. I believe this build includes ACCUMULO-2668, so that didn't solve the issue.

          The ingester died, the tracer is still running.

          mdrob Mike Drob added a comment -

          Another possibility for the short term is to list this as a known issue in the release notes w/ recommendations for working around.

          +1

          kturner Keith Turner added a comment -

          I feel like we should build some retry capability into clients for this.

          I agree, this is the long term solution. The batch writer should handle this situation transparently. It would probably be best if, when the server suspected the client had timed out, it threw a specific thrift exception for this case. What, if anything, should be done for the short term?

          Does increasing the rpc timeout have adverse effects on anything else? Like causing servers to hold on to more requests in memory?

          Yes, it could certainly increase memory used. I will attach a patch to outline the short term work around I was thinking about. Maybe it's too risky because it along w/ ACCUMULO-1905 may lead to too much memory used to buffer incoming data. Also it does not prevent this issue from occurring, just reduces the probability that it will occur. Another possibility for the short term is to list this as a known issue in the release notes w/ recommendations for working around.

          Maybe ACCUMULO-2668 will speed up walogs and decrease the probability this situation will occur. I am thinking of trying that patch w/ my next run instead of increasing general.rpc.timeout.
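
          As a reference point for the short-term workaround discussed in this thread (letting the server hold commits for roughly 2x general.rpc.timeout before giving up), here is a self-contained sketch of the general shape of that change. It is not the attached ACCUMULO-2388-1.patch; the class, field names, wait-loop body, and the assumed 120s default are stand-ins for the real TabletServerResourceManager code.

          ```
          public class HoldCommitsSketch {
            // Stand-ins for the real tserver state; the actual code lives in
            // TabletServerResourceManager.waitUntilCommitsAreEnabled().
            private volatile boolean holdCommits = false;
            private final Object commitHold = new Object();
            private final long rpcTimeoutMillis = 120_000; // assumed general.rpc.timeout of 120s

            static class HoldTimeout extends RuntimeException {
              HoldTimeout(String msg) {
                super(msg);
              }
            }

            void waitUntilCommitsAreEnabled() {
              if (!holdCommits)
                return;
              // Wait up to 2x the RPC timeout (instead of 1x) before throwing, so the
              // server does not give up at the same moment the client's RPC times out.
              long deadline = System.currentTimeMillis() + 2 * rpcTimeoutMillis;
              synchronized (commitHold) {
                while (holdCommits && System.currentTimeMillis() < deadline) {
                  try {
                    commitHold.wait(1000);
                  } catch (InterruptedException e) {
                    // the real code also swallows interrupts and keeps waiting
                  }
                }
              }
              if (holdCommits)
                throw new HoldTimeout("Commits are held");
            }
          }
          ```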

          mdrob Mike Drob added a comment -

          By increasing the tserver rpc timeout, we're not really doing anything to address the issue. It deals with the proximal cause of clients dying, but eventually somebody will reach the point where 4 minutes is not enough due to the load on their cluster and up it goes again. Does increasing the rpc timeout have adverse effects on anything else? Like causing servers to hold on to more requests in memory?

          I feel like we should build some retry capability into clients for this. Probably would be good to do it inside of the [Batch]Scanner implementation so that users don't even need to worry about it. I remember discussing the possibility of scanners that could advertise their own timeout; did that ever get a JIRA filed? If not, it should.

          kturner Keith Turner added a comment -

          I ran continuous ingest on a 20 node AWS cluster using 1.6.0-RC1 and 6 of my 17 ingest clients died with this issue. About 11 hours after starting, 6 of the clients died around the same time because of an internal error on the same tserver. The tserver had many org.apache.accumulo.tserver.HoldTimeoutException around that time.

          [cluster@ip-10-1-2-24 logs]$ grep "Commits held" * | egrep [0-9][0-9][0-9][.][0-9][0-9]
          tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:08:24,800 [tserver.TabletServerResourceManager] DEBUG: Commits held for 184.93 secs
          

          Looking at the tserver logs it seems that walog writes taking too long were the culprit. The tablet server ran for hours, but excessively long walog times only occurred around this short period of time.

          [cluster@ip-10-1-2-24 logs]$ grep writeTime * | egrep -v "writeTime:[1-2]?[0-9]?[0-9]?[0-9]?[0-9]m"
          tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:03:21,790 [log.TabletServerLogger] DEBUG:  wrote MinC finish  60370: writeTime:39247ms 
          tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:08:24,665 [log.TabletServerLogger] DEBUG:  wrote MinC finish  60410: writeTime:58963ms 
          tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:08:24,748 [log.TabletServerLogger] DEBUG:  wrote MinC finish  60412: writeTime:58104ms 
          tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:08:24,748 [log.TabletServerLogger] DEBUG:  wrote MinC finish  60413: writeTime:57303ms 
          tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:08:24,748 [log.TabletServerLogger] DEBUG:  wrote MinC finish  60411: writeTime:58922ms 
          
          [cluster@ip-10-1-2-24 logs]$ grep UpSess tserver_ip-10-1-2-24.ec2.internal.debug.log | egrep -v "lt=[0-9]?[0-9][.][0-9]"
          2014-04-12 11:04:21,370 [tserver.TabletServer] DEBUG: UpSess 10.1.2.24:41076 16,893 in 113.972s, at=[0 0 0.00 32] ft=113.917s(pt=0.005s lt=113.564s ct=0.348s)
          2014-04-12 11:04:21,528 [tserver.TabletServer] DEBUG: UpSess 10.1.2.18:45239 25,402 in 114.076s, at=[0 0 0.00 32] ft=113.998s(pt=0.009s lt=113.485s ct=0.504s)
          2014-04-12 11:04:34,246 [tserver.TabletServer] DEBUG: UpSess 10.1.2.14:34925 25,495 in 127.707s, at=[0 0 0.00 32] ft=127.524s(pt=0.008s lt=127.188s ct=0.328s)
          2014-04-12 11:05:12,730 [tserver.TabletServer] DEBUG: UpSess 10.1.2.27:47808 41,559 in 110.530s, at=[0 1 0.03 32] ft=110.297s(pt=0.027s lt=109.579s ct=0.691s)
          2014-04-12 11:05:12,731 [tserver.TabletServer] DEBUG: UpSess 10.1.2.16:36129 42,009 in 110.808s, at=[0 0 0.00 32] ft=110.699s(pt=0.027s lt=109.802s ct=0.870s)
          2014-04-12 11:05:13,091 [tserver.TabletServer] DEBUG: UpSess 10.1.2.19:38228 41,187 in 114.117s, at=[0 1 0.03 32] ft=114.022s(pt=0.050s lt=113.574s ct=0.398s)
          2014-04-12 11:05:39,362 [tserver.TabletServer] DEBUG: UpSess 10.1.2.22:37561 41,799 in 124.262s, at=[0 0 0.00 32] ft=124.169s(pt=0.019s lt=123.414s ct=0.736s)
          2014-04-12 11:06:41,348 [tserver.TabletServer] DEBUG: UpSess 10.1.2.17:57689 33,547 in 130.710s, at=[0 1 0.06 32] ft=130.631s(pt=0.014s lt=130.119s ct=0.498s)
          2014-04-12 11:06:53,563 [tserver.TabletServer] DEBUG: UpSess 10.1.2.14:35186 33,503 in 146.692s, at=[0 1 0.03 32] ft=146.587s(pt=0.017s lt=146.300s ct=0.270s)
          2014-04-12 11:06:58,482 [tserver.TabletServer] DEBUG: UpSess 10.1.2.18:45239 33,394 in 156.948s, at=[0 1 0.03 32] ft=156.845s(pt=0.017s lt=156.575s ct=0.253s)
          2014-04-12 11:07:05,899 [tserver.TabletServer] DEBUG: UpSess 10.1.2.13:40356 41,804 in 226.959s, at=[0 1 0.03 32] ft=226.786s(pt=0.020s lt=226.168s ct=0.598s)
          2014-04-12 11:07:06,025 [tserver.TabletServer] DEBUG: UpSess 10.1.2.16:36129 7,962 in 113.265s, at=[0 1 0.03 32] ft=113.243s(pt=0.002s lt=113.082s ct=0.159s)
          2014-04-12 11:07:06,044 [tserver.TabletServer] DEBUG: UpSess 10.1.2.19:38228 8,314 in 111.536s, at=[0 0 0.00 32] ft=111.497s(pt=0.003s lt=111.318s ct=0.176s)
          2014-04-12 11:07:06,143 [tserver.TabletServer] DEBUG: UpSess 10.1.2.27:47808 16,831 in 110.390s, at=[0 0 0.00 32] ft=110.359s(pt=0.005s lt=110.081s ct=0.273s)
          2014-04-12 11:07:14,716 [tserver.TabletServer] DEBUG: UpSess 10.1.2.23:54944 8,485 in 138.654s, at=[0 0 0.00 32] ft=138.638s(pt=0.003s lt=138.596s ct=0.039s)
          2014-04-12 11:08:25,465 [tserver.TabletServer] DEBUG: UpSess 10.1.2.24:41076 41,923 in 244.077s, at=[0 1 0.03 32] ft=243.878s(pt=78.546s lt=164.889s ct=0.443s)
          2014-04-12 11:08:25,891 [tserver.TabletServer] DEBUG: UpSess 10.1.2.20:41926 41,910 in 235.374s, at=[0 1 0.03 32] ft=235.277s(pt=98.752s lt=135.996s ct=0.529s)
          2014-04-12 11:08:25,891 [tserver.TabletServer] DEBUG: UpSess 10.1.2.29:41553 41,691 in 228.196s, at=[0 1 0.03 32] ft=228.041s(pt=107.973s lt=119.670s ct=0.398s)
          2014-04-12 11:08:26,307 [tserver.TabletServer] DEBUG: UpSess 10.1.2.15:50332 42,013 in 228.612s, at=[0 0 0.00 32] ft=228.434s(pt=118.835s lt=109.381s ct=0.218s)
          

          Since this was running on AWS maybe these walog hiccups are expected, not sure. In any case the clients should not error. I am going to retry w/ setting general.rpc.timeout=240s in accumulo-site.xml.

          mdrob Mike Drob added a comment -

          I think we can bump resolving it properly to 1.6.1 since there are workarounds present.

          kturner Keith Turner added a comment -

          Mike Drob what are your thoughts on this for 1.6.0? I am thinking that waiting 2x general.rpc.timeout on the server side might be a good option. If we do this, we may want to consider doing it for 1.4 and 1.5.

          mdrob Mike Drob added a comment -

          I brought the timeout back to the default, but instead of starting the table from scratch I pre-split it into 10 tablets and started a continuous ingest run. None of the clients died under this configuration.

          mdrob Mike Drob added a comment -

          Wow, that's pretty buried in the code. Thanks, Keith.

          I started a run yesterday with general.rpc.timeout=240s in my accumulo-site.xml, and everything seems to be going well. The longest hold time was 215s, and all held commits occurred in the first two hours of ingest. I'm going to try to piece together a timeline to see where things correspond to split/flush/compact and see if I can get more insight into potential issues.

          kturner Keith Turner added a comment -

          TabletServerBatchWriter.MutationWriter.sendMutationsToTabletServer() calls the following method in ThriftUtil, which will probably just get the default timeout, so the client should time out. Another possible problem is that the timeout code on the client is not working properly.

          class ThriftUtil {
            public static TabletClientService.Client getTServerClient(String address, AccumuloConfiguration conf) throws TTransportException {
              return getClient(new TabletClientService.Client.Factory(), address, Property.GENERAL_RPC_TIMEOUT, conf);
            }
          }
          
          mdrob Mike Drob added a comment -

          Do the continuous ingest clients read the value for general.rpc.timeout out of the accumulo-site.xml or zookeeper? I don't see that happening (not saying that it doesn't, just that I am not aware of it, if it is).

          Like I said earlier, I didn't think the client has any notion of time-out, so it will wait forever for a response (well, MAX_LONG, but close enough). It is also possible I am misinterpreting client values.

          kturner Keith Turner added a comment -

          How will doubling the default value for general.rpc.timeout resolve the issue? This still leaves the server and client timeout the same, which is the root of the problem.

          mdrob Mike Drob added a comment -

          Doubling the hold time in just that one spot doesn't sit quite right with me - I think that is surprising behaviour and I would not want to push that onto an operator ever. What do you think of doubling the underlying property - general.rpc.timeout - either in this specific case, or even in the default case?

          kturner Keith Turner added a comment -

          I think having the server double the configured hold time is a good short term solution. Having clients pass timeout to the server via thrift seems like a good long term solution along with throwing an explicit thrift timeout exception. The client needs to know this happened so it can differentiate it from an unexpected server side error.

          mdrob Mike Drob added a comment -

          Started looking more into the time-out issues, and discovered that I'm not setting --batchTimeout and it is defaulting to Long.MAX_VALUE. So the client is always going to still be waiting and the tablet server will never be able to safely abort the operation. At least, I hope that I'm reading these properties correctly.

          The longest hold time that I see in the logs is 170s, and my general.rpc.timeout is 120s, so the HoldTimeoutException makes sense, from a sanity check point of view. Doubling the wait time on the tserver would likely solve the issue here, yes, but I'm worried that's not a stable approach.

          The question that remains is how best to prevent ingest clients from dying, now. There may be different short term and long term solutions that are appropriate.
          Short term: Let ingest retry the failed mutations. If the tablet server rejects the mutation for whatever reason, and the client is aware of it, then this wouldn't constitute data loss.
          Long term: Provide an API for clients to specify their wait time to the servers. Then, tablet servers can more intelligently decide when to abort scans and when not to. Would need to protect the cluster from several poorly configured clients asking for very long hold times. Possibly other availability issues would be present as well.

          Thoughts?

          mdrob Mike Drob added a comment -

Finally caught the moment when things die - before commits are held and all but one of my ingest clients die, I see a write rate of about 300k entries per second.

          mdrob Mike Drob added a comment -

          Attaching tracer debug log. Doesn't look like there is anything new, unfortunately.

          mdrob Mike Drob added a comment -

          Keith Turner - looking at the waitUntilCommitsAreEnabled method, I see that we also ignore InterruptedException and just continue waiting. Seems like a code smell.

          I had another realization on this - I've only seen the error on writes that are traced. After the fixes from ACCUMULO-2390 I see better exceptions in the monitor, and it looks like the commits held are specifically traces - not ingest data. I'm going to fish around in the logs for something concrete and hopefully more useful.

          kturner Keith Turner added a comment -

If the HoldTimeoutException is what's causing problems, it may be that the timing is too tight. The server code assumes that if the write was held more than general.rpc.timeout, the client has timed out and is no longer interested. Therefore the server code throws the exception to abort, with the expectation that the client will not be impacted. However, if the client has not timed out for some reason, it will see this exception. Maybe change the timeout in TabletServerResourceManager.waitUntilCommitsAreEnabled() to 2x general.rpc.timeout and see if that helps avoid this problem.

          elserj Josh Elser added a comment -

          we're up to ~100 tablets per server currently

Does that mean you didn't presplit your table or set the split threshold higher than the default 1G?

mdrob Mike Drob added a comment - edited

          As an unrelated note, while trying to understand what is going on with the exception handling, I feel like TraceProxy.trace should not throw Throwable...

          I'll create a separate JIRA for that this afternoon.

          mdrob Mike Drob added a comment -

Neither minors nor majors appear to be backing up. I let the single client continue writing overnight and we're up to ~100 tablets per server currently. The ingest rate is still steady at 150k entries/second. Actually, steady at 175k, but occasionally dropping to 100k when there is a spike in the number of flushes happening. I did not restart the other clients which died.

          Where do I get an aggregate write rate? From the table on the "Recent Traces" page:

Type             Total   Min     Max        Avg
majorCompaction  4       465ms   810ms      612ms
minorCompaction  199     114ms   7s 709ms   4s 323ms

          When I drill down into minors, and look at the longest few, here are the details:

          time entries size
          5815 471697 19258626
          5879 436972 17837345
          5760 471619 19260212
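(Back-of-the-envelope, assuming those columns are milliseconds, entries, and bytes: the first row works out to roughly 471697 entries / 5.8 s ≈ 81k entries/s and 19258626 bytes / 5.8 s ≈ 3.3 MB/s for a single minor compaction.)
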
          ecn Eric Newton added a comment -

          150K aggregate? That is significantly less than I would expect on that kind of hardware.

          What's the write rate you are seeing on minor compactions? The traces have information about the size and times.

          Are minor compactions backing up? What's the rate when the number of tablets per server is ~80?

          mdrob Mike Drob added a comment -

          Also saw this same error with 1.5.1-rc2

mdrob Mike Drob added a comment - edited

Physical cluster; 5 slaves + 5 ingest clients; 11 drives/node.
Ingest with 1 client (since the rest died...) is 150k entries/second, which is approximately 17 MB/s.
IMM = 8G, Tserver Heap = 2G, Total RAM = 48G

          Edit: I was looking at the wrong logs, updated the number of ingesters accordingly.

          ecn Eric Newton added a comment -

          Can you provide some details about your cluster (drives, virtual vs bare metal)?

          What ingest rate are you seeing without agitation?

          How big is your IMM?

          elserj Josh Elser added a comment -

          Without any HDFS errors, might you be oversaturating the tservers? I typically found that one CI ingest client can saturate a couple of tservers once majcs really kick off.

          Also, have you taken a look at the MinC traces to see if some part of the chain is taking unexpectedly long?

          mdrob Mike Drob added a comment -

Was running again without any agitation; it took about an hour for the first client to die. From the tserver log:

          2014-02-20 11:35:33,570 [thrift.ProcessFunction] ERROR: Internal error processing applyUpdates
          org.apache.accumulo.tserver.HoldTimeoutException: Commits are held
                  at org.apache.accumulo.tserver.TabletServerResourceManager.waitUntilCommitsAreEnabled(TabletServerResourceManager.java:412)
                  at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.flush(TabletServer.java:1556)
                  at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.applyUpdates(TabletServer.java:1535)
                  at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                  at java.lang.reflect.Method.invoke(Method.java:597)
                  at org.apache.accumulo.trace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:63)
                  at $Proxy16.applyUpdates(Unknown Source)
                  at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$applyUpdates.getResult(TabletClientService.java:2347)
                  at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$applyUpdates.getResult(TabletClientService.java:2333)
                  at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
                  at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
                  at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:171)
                  at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:478)
                  at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:231)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                  at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
                  at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
                  at java.lang.Thread.run(Thread.java:662)
          

          Nothing interesting before, and a little while later there's a read error because the client died.

elserj Josh Elser added a comment - edited

          Are you sure you had Accumulo pointed at the correct HDFS config directory? The NameNodeProxies class will try to determine which kind of class to instantiate based on the contents of core-site and hdfs-site. Perhaps it was incorrectly only talking to one of the NNs in your (I'm assuming from other chatter) HA/QJM installation and would fail on renewing a lease when that NN wasn't active?

          Edit: nevermind, I re-read your post. My questions are likely irrelevant.

          mdrob Mike Drob added a comment -

          Looking later in the tserver logs, this line is repeated a bunch:
          2014-02-20 07:23:22,907 [retry.RetryInvocationHandler] WARN : Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB. Trying to fail over immediately.

          It only started later, so I didn't catch it at first, but now I am thinking it might be relevant. Actually... It looks like it corresponds quite strongly to NN failover, and doesn't show up until later (because I didn't start NN agitation until later).

          So... potentially this has nothing to do with HDFS agitation. That does not make me feel better at all.

          ecn Eric Newton added a comment -

The log shows this tablet server was held; minor compactions were probably hanging. But there are no errors bubbling up from DFSClient.

          Do you know what sort of agitation was going on at this time? NN or a bunch of DN's?

          mdrob Mike Drob added a comment -

          Attaching a larger section of the tserver log.

          ecn Eric Newton added a comment -

          Mike Drob can you post more of tserver1's log? This traceback is the loss of a connection to a client, and not an error from HDFS.

          mdrob Mike Drob added a comment -

It's easy to modify the CI client code to retry when faced with a MutationsRejectedException, but I wonder if that would hide other issues that we could find. Probably need to find the code in the tserver and make it more robust.
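A minimal sketch of what that client-side retry could look like, assuming the caller buffers the batch and a fresh writer is created per attempt (a BatchWriter that has thrown MutationsRejectedException is generally not reusable); writeWithRetry and maxAttempts are illustrative names, not CI code:

import java.util.List;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Mutation;

// Sketch only: retry a batch after a server-error rejection, but never retry
// constraint violations, which are permanent. Not the actual CI client code.
public class RetryingIngestSketch {
  public static void writeWithRetry(Connector conn, String table, List<Mutation> batch, int maxAttempts)
      throws MutationsRejectedException, TableNotFoundException {
    for (int attempt = 1; ; attempt++) {
      BatchWriter bw = conn.createBatchWriter(table, new BatchWriterConfig());
      try {
        for (Mutation m : batch)
          bw.addMutation(m);
        bw.close(); // flush; any remaining failures surface here
        return;
      } catch (MutationsRejectedException e) {
        try {
          bw.close();
        } catch (MutationsRejectedException ignored) {
          // already handling the original failure
        }
        // Constraint violations (and authorization failures) won't succeed on retry.
        if (!e.getConstraintViolationSummaries().isEmpty() || attempt >= maxAttempts)
          throw e;
      }
    }
  }
}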


People

  • Assignee:
    kturner Keith Turner
  • Reporter:
    mdrob Mike Drob
  • Votes:
    0
  • Watchers:
    7

Time Tracking

  • Original Estimate: Not Specified
  • Remaining Estimate: 0h
  • Time Spent: 1h
