Accumulo / ACCUMULO-2213

tracer reports: IllegalStateException: Closed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.5, 1.5.1, 1.6.0
    • Component/s: trace
    • Labels:
      None

      Description

      During a 24-hour continuous ingest test with agitation, the following was reported 42k times:

      Timer task failed org.apache.accumulo.tracer.TraceServer$1 Closed
      	java.lang.IllegalStateException: Closed
      		at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.flush(TabletServerBatchWriter.java:302)
      		at org.apache.accumulo.core.client.impl.BatchWriterImpl.flush(BatchWriterImpl.java:59)
      		at org.apache.accumulo.tracer.TraceServer.flush(TraceServer.java:225)
      		at org.apache.accumulo.tracer.TraceServer.access$400(TraceServer.java:75)
      		at org.apache.accumulo.tracer.TraceServer$1.run(TraceServer.java:217)
      		at org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
      		at java.util.TimerThread.mainLoop(Timer.java:512)
      		at java.util.TimerThread.run(Timer.java:462)
      

        Issue Links

          Activity

          Eric Newton created issue -
          Sean Busbey added a comment -

          I can make this happen reliably on 1.4.5 while using gremlins to disrupt the node the tracer is on.

          I also get exceptions in the main loop, which blows up the log for the service:

          2014-01-15 23:41:10,533 [trace.TraceServer] ERROR: Unable to write mutation to table: org.apache.accumulo.core.data.Mutation@0
          java.lang.IllegalStateException: Closed
                  at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:192)
                  at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:40)
                  at org.apache.accumulo.server.trace.TraceServer$Receiver.span(TraceServer.java:136)
                  at org.apache.accumulo.cloudtrace.thrift.SpanReceiver$Processor$span.process(SpanReceiver.java:205)
                  at org.apache.accumulo.cloudtrace.thrift.SpanReceiver$Processor.process(SpanReceiver.java:185)
                  at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                  at java.lang.Thread.run(Thread.java:662)
          2014-01-15 23:41:10,533 [trace.TraceServer] ERROR: Unable to write mutation to table: org.apache.accumulo.core.data.Mutation@0
          java.lang.IllegalStateException: Closed
                  at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:192)
                  at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:40)
                  at org.apache.accumulo.server.trace.TraceServer$Receiver.span(TraceServer.java:137)
                  at org.apache.accumulo.cloudtrace.thrift.SpanReceiver$Processor$span.process(SpanReceiver.java:205)
                  at org.apache.accumulo.cloudtrace.thrift.SpanReceiver$Processor.process(SpanReceiver.java:185)
                  at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                  at java.lang.Thread.run(Thread.java:662)
          

          I think there are two issues here: the writer gets closed, and our error handling for resetting it doesn't account for enough error conditions (right now it only handles MutationsRejectedException).

          I've been talking in IRC as I try to figure out how the tracer's writer could get closed. Can we coordinate there?
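
To illustrate the second issue, here is a minimal, self-contained sketch of how handling only the rejected-mutations case lets a "Closed" failure escape. `FakeWriter`, `RejectedException`, `narrowHandling`, and `widerHandling` are hypothetical names made up for this example; they stand in for the real writer and exception types and are not the TraceServer code.

```java
import java.util.ArrayList;
import java.util.List;

public class ClosedWriterDemo {
    /** Hypothetical checked exception standing in for a "mutations rejected" failure. */
    static class RejectedException extends Exception {
        RejectedException(String msg) { super(msg); }
    }

    /** Hypothetical writer that refuses work once closed, like the trace above. */
    static class FakeWriter {
        private boolean closed = false;
        final List<String> buffered = new ArrayList<>();

        void addMutation(String m) throws RejectedException {
            if (closed) throw new IllegalStateException("Closed");
            buffered.add(m);
        }

        void close() { closed = true; }
    }

    /** Handles only the rejected case; a closed writer escapes as an unchecked exception. */
    static String narrowHandling(FakeWriter w, String m) {
        try {
            w.addMutation(m);
            return "written";
        } catch (RejectedException e) {
            return "reset";
        }
    }

    /** Also treats a closed writer as a reason to reset. */
    static String widerHandling(FakeWriter w, String m) {
        try {
            w.addMutation(m);
            return "written";
        } catch (RejectedException | IllegalStateException e) {
            return "reset";
        }
    }

    public static void main(String[] args) {
        FakeWriter w = new FakeWriter();
        w.close();
        System.out.println(widerHandling(w, "span-1"));
        try {
            narrowHandling(w, "span-2");
        } catch (IllegalStateException e) {
            System.out.println("escaped: " + e.getMessage());
        }
    }
}
```

With the narrow handler, every call against a closed writer throws again, which would match the repeated log entries above.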

          Eric Newton added a comment -

          Ugh... I checked in the fix for this under ACCUMULO-2209. I'm still investigating how the writer gets closed.

          Sean Busbey added a comment -

          I was worried about causing a deadlock if the recovery was expanded, because the only way I could think of at first for the writer to be seen as closed is for the thread doing the reset to block after the close but before the reset to null.
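
The window described here can be sketched as a deterministic, single-threaded replay of one possible interleaving. This is an illustration under assumptions, not the tracer's code: `FakeWriter`, `receiverStep`, and `interleave` are hypothetical names, and the reset is modeled as two separate steps (close, then set to null).

```java
public class ResetWindowDemo {
    /** Hypothetical writer that throws "Closed" once closed. */
    static class FakeWriter {
        private boolean closed = false;

        void add(String m) {
            if (closed) throw new IllegalStateException("Closed");
        }

        void close() { closed = true; }
    }

    private FakeWriter writer = new FakeWriter();

    /** What a receiver thread would observe if it ran at this instant. */
    String receiverStep() {
        FakeWriter w = writer;
        if (w == null) return "no writer";
        try {
            w.add("span");
            return "ok";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    /** Replays: receiver, reset step 1 (close), receiver, reset step 2 (null), receiver. */
    String[] interleave() {
        String before = receiverStep();
        writer.close();                 // reset thread closes the writer...
        String during = receiverStep(); // ...and stalls here; a receiver still sees it
        writer = null;                  // reset thread finally clears the reference
        String after = receiverStep();
        return new String[] {before, during, after};
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", new ResetWindowDemo().interleave()));
    }
}
```

Only the middle observation hits the closed writer; before the close the add succeeds, and after the reference is cleared the receiver sees no writer at all.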

          Sean Busbey made changes -
          Remote Link: (none) -> This issue links to "review board (Web Link)" [ 13746 ]
          Sean Busbey added a comment -

          Attaching a follow up patch for consideration.

          Unfortunately, I can no longer get this particular failure case to show up.

          Proposed patch:

          • isolates reset to the flush() thread; we don't need to queue up a bunch of resets in failure
          • adds explanatory comments
          • avoids catching Exception when adding mutations or flushing
          • adjusts logging to reflect retrying
          • makes it easier to leave a test cluster configured to watch for the source of the closing (by setting TSBW's log level to trace in tracer_logger.xml)

          Eric Newton, let me know if this looks reasonable
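
The first and third bullets can be sketched roughly as follows. This is a minimal illustration under assumptions, not the attached patch: the `Writer` interface, `FakeWriter`, and the `FlushOnlyReset` class are hypothetical stand-ins, and the real BatchWriter API differs. Receiver threads only report a failed add, while the single flush thread is the one place where the writer is replaced, and only specific exception types are caught.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class FlushOnlyReset {
    /** Hypothetical minimal writer interface; not the Accumulo BatchWriter API. */
    interface Writer {
        void add(String mutation);
        void flush();
        void close();
    }

    /** Hypothetical writer that throws "Closed" once closed. */
    static class FakeWriter implements Writer {
        private volatile boolean closed = false;

        public void add(String mutation) {
            if (closed) throw new IllegalStateException("Closed");
        }

        public void flush() {
            if (closed) throw new IllegalStateException("Closed");
        }

        public void close() { closed = true; }
    }

    private final AtomicReference<Writer> writer = new AtomicReference<>();
    private final Supplier<Writer> factory;
    int resets = 0; // modified only by the flush thread

    FlushOnlyReset(Supplier<Writer> factory) {
        this.factory = factory;
        writer.set(factory.get());
    }

    /** Called by many receiver threads: never resets, only reports failure. */
    boolean add(String mutation) {
        Writer w = writer.get();
        if (w == null) return false;
        try {
            w.add(mutation);
            return true;
        } catch (IllegalStateException e) {
            // Recovery is left to the flush thread, so failures do not queue up resets.
            return false;
        }
    }

    /** Called by a single timer thread: the only place a reset happens. */
    void flush() {
        Writer w = writer.get();
        try {
            if (w == null) throw new IllegalStateException("no writer");
            w.flush();
        } catch (IllegalStateException e) {
            reset(w);
        }
    }

    private void reset(Writer old) {
        writer.set(null);
        try {
            if (old != null) old.close();
        } finally {
            writer.set(factory.get());
            resets++;
        }
    }

    public static void main(String[] args) {
        FlushOnlyReset tracer = new FlushOnlyReset(FakeWriter::new);
        System.out.println("add accepted: " + tracer.add("span-1"));
        tracer.flush();
        System.out.println("resets so far: " + tracer.resets);
    }
}
```

In this sketch, an add that fails between the failure and the next flush is simply dropped; whether to retry or buffer instead is a separate design decision that the sketch does not take a position on.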

          Sean Busbey added a comment -

          Eric Newton, do you want the source of the close call that triggered this error chased down before closing this ticket?

          Eric Newton added a comment -

          If you have it, great. If not, I'm not worried about it.

          Sean Busbey made changes -
          Assignee: Eric Newton [ ecn ] -> Sean Busbey [ busbey ]
          Sean Busbey added a comment -

          I couldn't reproduce the failure condition upon rerunning a 24hr version of the test.

          I did, however, see our new code handle some failures in the network. So I think we're good to close.

          Sean Busbey made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Resolution: (none) -> Fixed [ 1 ]

            People

            • Assignee:
              Sean Busbey
              Reporter:
              Eric Newton