  Geode / GEODE-5358

Interrupting a thread writing to a socket can result in a hang due to a lost message

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: messaging
    • Labels: None

    Description

      If a thread performing a Geode operation is interrupted, the system can hang waiting for a reply. I have a dunit test that demonstrates this issue: it interrupts a thread while we are doing function execution, and the system is then stuck waiting for replies:

        [vm0] [warn 2018/06/28 11:14:13.715 PDT <Thread-264> tid=454] 15 seconds have elapsed while waiting for replies: <FunctionStreamingResultCollector 11084 waiting for 1 replies from [10.118.20.71(server-1:90978)<v6>:32771]> on 10.118.20.71(server-0:90977)<v5>:32770 whose current membership list is: [[10.118.20.71(server-1:90978)<v6>:32771, 10.118.20.71(90975:locator)<ec><v0>:32769, 10.118.20.71(server-0:90977)<v5>:32770]]
      
      "Thread-264" #454 daemon prio=5 os_prio=31 tid=0x00007fd30b9f8000 nid=0x8727 waiting on condition [0x000070000b300000]
         java.lang.Thread.State: TIMED_WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x00000007b8c10360> (a java.util.concurrent.CountDownLatch$Sync)
      	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
      	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
      	at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:61)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:714)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:789)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:765)
      	at org.apache.geode.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:139)
      	at org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest.executeFunction(InterruptTcpConduitDUnitTest.java:91)
      	at org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest.lambda$doInterruptTest$1(InterruptTcpConduitDUnitTest.java:67)
      	at org.apache.geode.distributed.internal.InterruptTcpConduitDUnitTest$$Lambda$68/1495662507.run(Unknown Source)
      	at java.lang.Thread.run(Thread.java:748)
      

      I think what is going on here is that two threads write messages to the same socket. If the second thread is interrupted, it gets a ClosedByInterruptException, which closes the socket. A message from the first thread can then be lost because the socket is closed. The reply to that message never arrives, so the system hangs.
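
      The suspected mechanism can be reproduced outside Geode. The following is a minimal sketch, assuming a plain blocking java.nio SocketChannel on the loopback interface rather than Geode's actual connection classes; the class name InterruptClosesSharedChannelDemo and its Result holder are made up for illustration. One thread writes with its interrupt status set, which closes the shared channel, and a later write from an uninterrupted thread then fails.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Geode.
final class InterruptClosesSharedChannelDemo {

  /** Outcome of one run of the demo. */
  static final class Result {
    boolean interruptedWriterSawClosedByInterrupt;
    boolean channelOpenAfterInterrupt;
    boolean otherWriterMessageLost;
  }

  static Result run() {
    Result result = new Result();
    try (ServerSocketChannel server = ServerSocketChannel.open()) {
      // Bind to an ephemeral loopback port so the demo needs no configuration.
      server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
      try (SocketChannel shared = SocketChannel.open(server.getLocalAddress());
          SocketChannel accepted = server.accept()) {

        // "Second thread": writes while its interrupt status is set.
        Thread interruptedWriter = new Thread(() -> {
          Thread.currentThread().interrupt();
          try {
            shared.write(bytes("message from interrupted thread"));
          } catch (ClosedByInterruptException e) {
            result.interruptedWriterSawClosedByInterrupt = true;
          } catch (IOException e) {
            // Any other I/O failure leaves the flag false.
          }
        });
        interruptedWriter.start();
        interruptedWriter.join();

        // "First thread": was never interrupted, but shares the channel.
        result.channelOpenAfterInterrupt = shared.isOpen();
        try {
          shared.write(bytes("message from uninterrupted thread"));
        } catch (ClosedChannelException e) {
          // The message could not be sent because the channel is closed.
          result.otherWriterMessageLost = true;
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return result;
  }

  private static ByteBuffer bytes(String text) {
    return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    Result r = run();
    System.out.println("interrupted writer saw ClosedByInterruptException: "
        + r.interruptedWriterSawClosedByInterrupt);
    System.out.println("channel open after interrupt: " + r.channelOpenAfterInterrupt);
    System.out.println("other writer's message lost: " + r.otherWriterMessageLost);
  }
}
```

      In this sketch the uninterrupted writer at least sees an exception. In the scenario described above the message may already have been handed to the socket when it is closed, so the sender may not notice the loss at all.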

      A suggested fix would be to implement a layer that can replay a certain window of sent messages if a TCP connection between peers is lost and re-established.
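
      One possible shape for such a layer, as a rough sketch only: the sender numbers each message and keeps it in a bounded window until the peer acknowledges it, and after a reconnect it resends whatever is still unacknowledged. The ReplayWindow class, its method names, and the assumption that the peer acknowledges sequence numbers cumulatively are all hypothetical; nothing like this exists in Geode's messaging code as far as this issue states.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sender-side replay window; not an existing Geode class.
final class ReplayWindow {

  /** A sent message together with the sequence number assigned to it. */
  static final class Entry {
    final long sequence;
    final byte[] payload;

    Entry(long sequence, byte[] payload) {
      this.sequence = sequence;
      this.payload = payload;
    }
  }

  private final int capacity;
  private final Deque<Entry> unacked = new ArrayDeque<>();
  private long nextSequence = 1;

  ReplayWindow(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Records a message before it is written and returns its sequence number. */
  synchronized long record(byte[] payload) {
    if (unacked.size() == capacity) {
      // Refuse rather than evict: dropping an unacknowledged message
      // would reintroduce the loss this layer is meant to prevent.
      throw new IllegalStateException("replay window full");
    }
    long sequence = nextSequence++;
    unacked.addLast(new Entry(sequence, payload.clone()));
    return sequence;
  }

  /** Discards every message up to and including the acknowledged sequence. */
  synchronized void acknowledge(long sequence) {
    while (!unacked.isEmpty() && unacked.peekFirst().sequence <= sequence) {
      unacked.removeFirst();
    }
  }

  /** Messages to resend, in order, once the connection is re-established. */
  synchronized List<Entry> pendingReplay() {
    return new ArrayList<>(unacked);
  }
}
```

      A receiver would also have to drop duplicates by sequence number, since a replayed message may already have been delivered before the connection was lost.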

      Attachments

        1. GEODE-5358.diff
          5 kB
          Dan Smith

          People

            Assignee: Unassigned
            Reporter: Dan Smith (upthewaterspout)
            Votes: 0
            Watchers: 2
