Hadoop Common / HADOOP-2789

Race condition in ipc.Server prevents response from being written back to client.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.16.0
    • Fix Version/s: 0.16.1
    • Component/s: ipc
    • Labels: None

    Description

      I encountered a race condition in ipc.Server when writing the response
      back to the socket. Sometimes the write SelectionKey is canceled
      when it should not be, so the full response never gets written.
      As a result, clients time out on the socket while waiting for the response.

      I am attaching a unit test that demonstrates the problem. It closely
      follows the TestIPC test, but the socket output buffer is set
      smaller than the result being sent back, so that partial writes
      occur. I also put a random sleep in the client to help provoke the race
      condition.

      On my machine this fails over half of the time.
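
To illustrate the partial-write precondition described above, here is a minimal sketch (not the attached test) using plain java.nio. The class name PartialWriteDemo and the method firstWriteSize are hypothetical, and the 64 MB payload is an arbitrary size assumed to exceed the loopback socket buffers. The peer never reads, so a single non-blocking write should accept only part of the response.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class; not part of Hadoop or of the attached test.
class PartialWriteDemo {

  /**
   * Connects a loopback socket pair, requests a small send buffer on the
   * server side, and returns how many bytes one non-blocking write accepts.
   * The peer never reads, so kernel buffers are the only place for the data.
   */
  static int firstWriteSize(int payloadSize, int sendBufferSize) {
    try (ServerSocketChannel server = ServerSocketChannel.open()) {
      server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port
      try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
           SocketChannel accepted = server.accept()) {
        // SO_SNDBUF is only a hint; the OS may round the value up.
        accepted.setOption(StandardSocketOptions.SO_SNDBUF, sendBufferSize);
        accepted.configureBlocking(false);
        ByteBuffer response = ByteBuffer.allocate(payloadSize);
        // A non-blocking write returns as soon as the buffers are full.
        return accepted.write(response);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    int payload = 64 * 1024 * 1024; // assumed larger than any loopback buffer
    int written = firstWriteSize(payload, 1024);
    System.out.println("wrote " + written + " of " + payload + " bytes");
  }
}
```

Whenever the returned count is smaller than the payload, the remainder has to be written later from a selector callback, which is the code path where the key cancellation matters.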

      Looking at the code in ipc.Server.java, the problem manifests in
      Responder.doAsyncWrite(). If I comment out the key.cancel() line,
      everything works fine.

      So we need to identify when to safely cancel the key.

      I tried the following:

          private void doAsyncWrite(SelectionKey key) throws IOException {
            Call call = (Call) key.attachment();
            if (call == null) {
              return;
            }
            if (key.channel() != call.connection.channel) {
              throw new IOException("doAsyncWrite: bad channel");
            }
            if (processResponse(call.connection.responseQueue)) {
              // Attempted fix: cancel only if the queue is empty while holding its lock.
              synchronized (call.connection.responseQueue) {
                if (call.connection.responseQueue.size() == 0) {
                  LOG.info("Cancelling key for call " + call + " key: " + key);
                  key.cancel();  // remove the key from the selector
                } else {
                  LOG.warn("NOT REALLY DONE: " + call + " key: " + key);
                }
              }
            }
          }
      

      This does catch some of the cases (e.g., the LOG.warn message gets hit), but I still hit the race condition.
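
The report does not say which interleaving remains. As a general illustration only (this is not the content of any attached patch), one way to rule out this class of race is to have both sides take the same lock: the thread that enqueues a response also requests write interest under the queue lock, and the thread that drains the queue checks for emptiness and clears write interest under that same lock. The sketch below models this with a plain boolean instead of a real SelectionKey. WriteInterestDemo, Connection, and all method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model; the boolean stands in for a selector registration.
class WriteInterestDemo {

  static final class Connection {
    private final Deque<String> responseQueue = new ArrayDeque<>();
    private boolean writeInterest; // guarded by responseQueue

    /** Handler side: enqueue and request write interest atomically. */
    void enqueue(String response) {
      synchronized (responseQueue) {
        responseQueue.addLast(response);
        writeInterest = true;
      }
    }

    /** Responder side: write one response; drop interest only if drained. */
    boolean writeOne() {
      synchronized (responseQueue) {
        boolean wrote = responseQueue.pollFirst() != null;
        if (responseQueue.isEmpty()) {
          writeInterest = false; // no enqueue can slip in before this line
        }
        return wrote;
      }
    }

    boolean hasWriteInterest() {
      synchronized (responseQueue) {
        return writeInterest;
      }
    }

    int pending() {
      synchronized (responseQueue) {
        return responseQueue.size();
      }
    }
  }

  /** Runs a handler thread against a responder loop; returns responses written. */
  static int run(int responses) {
    Connection conn = new Connection();
    Thread handler = new Thread(() -> {
      for (int i = 0; i < responses; i++) {
        conn.enqueue("response-" + i);
      }
    });
    handler.start();
    int written = 0;
    // Keep going while more responses may arrive or interest is still set.
    while (handler.isAlive() || conn.hasWriteInterest()) {
      if (conn.writeOne()) {
        written++;
      }
    }
    if (conn.pending() != 0) {
      throw new IllegalStateException("responses stranded without write interest");
    }
    return written;
  }

  public static void main(String[] args) {
    System.out.println("written: " + run(10000));
  }
}
```

Because the invariant "non-empty queue implies write interest" holds whenever the lock is released, the responder loop can stop as soon as interest is clear and the handler has finished. In real NIO code the same idea would apply to whatever call registers or deregisters the key, with the caveat that selector registration has its own blocking behavior to account for.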

      Attachments

        1. failure.log (3.84 MB, Clint Morgan)
        2. failure-after-patch.log.gz (909 kB, Clint Morgan)
        3. failure-with-patch.log (3.85 MB, Clint Morgan)
        4. HADOOP-2789.patch (18 kB, Raghu Angadi)
        5. HADOOP-2789.patch (18 kB, Raghu Angadi)
        6. HADOOP-2789.patch (13 kB, Raghu Angadi)
        7. HADOOP-2789.patch (13 kB, Raghu Angadi)
        8. HADOOP-2789.patch (12 kB, Raghu Angadi)
        9. HADOOP-2789.patch (4 kB, Clint Morgan)
        10. HADOOP-2789-correction.patch (1 kB, Raghu Angadi)
        11. HADOOP-2789-correction.patch (1 kB, Raghu Angadi)
        12. HADOOP-2789-Test.patch (5 kB, Clint Morgan)
        13. success.log (1.92 MB, Clint Morgan)

          People

            Assignee: Raghu Angadi (rangadi)
            Reporter: Clint Morgan (clint.morgan)