Solr
  1. Solr
  2. SOLR-8152

Overseer Task Processor/Queue can miss responses, leading to timeouts

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      I noticed some jenkins reports of timeouts in the TestConfigSetsAPIExclusivityTest, which seemed strange given the amount of work to be done is small and the timeout generous at 300 seconds.

      I added some statistics gathering and started beasting the test and sure enough, some tests reported tasks taking slightly more than 300 seconds, while most tests ran with a maximum task run of less than a second. This suggested something was hanging until the timeout.

      Some investigation lead to this code:
      https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194

      There appears to be a few issues here:

       String path = createData(dir + "/" + PREFIX, data,
                CreateMode.PERSISTENT_SEQUENTIAL);
            String watchID = createData(
                dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") + 1),
                null, CreateMode.EPHEMERAL);
      
            Object lock = new Object();
            LatchWatcher watcher = new LatchWatcher(lock);
            synchronized (lock) {
              if (zookeeper.exists(watchID, watcher, true) != null) {
                watcher.await(timeout);
              }
            }
      

      For one, the request object is created before the response object. If the request is quickly picked up and processed, two things can happen:
      1) The response is written before the watch is set, which means we wait until the timeout even though the response is ready. This will still pass the test because the response is available, the client will just wait needlessly.
      2) The response is attempted to be written before the response node is even created. The fact that the response node doesn't exist is ignored:
      https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
      In this case, the task is processed but the client will actually see a failure because there is no response.

      1. SOLR-8152.patch
        3 kB
        Gregory Chanan

        Issue Links

          Activity

          Hide
          Gregory Chanan added a comment -

          There are a few ways to solve this, the most straightforward of which appears to be:
          1) Create the response node first using SEQUENTIAL (to generate the path to the request node)
          2) Watch the response node, so we can't possible miss the response (because the request node isn't even created yet)
          3) Create the request mode

          At this point, before we wait, we check that the watch didn't already fire (otherwise we will wait unnecessarily).

          Show
          Gregory Chanan added a comment - There are a few ways to solve this, the most straightforward of which appears to be: 1) Create the response node first using SEQUENTIAL (to generate the path to the request node) 2) Watch the response node, so we can't possible miss the response (because the request node isn't even created yet) 3) Create the request mode At this point, before we wait, we check that the watch didn't already fire (otherwise we will wait unnecessarily).
          Hide
          Gregory Chanan added a comment -

          Here's a patch implementing the above. I haven't beasted it for too long, but it's gotten further than I ever got without the changes.

          Show
          Gregory Chanan added a comment - Here's a patch implementing the above. I haven't beasted it for too long, but it's gotten further than I ever got without the changes.
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Gregory, good catch! Your patch changes the response node from EPHEMERAL to EPHEMERAL_SEQUENTIAL. Was that intentional?

          Another option could be to create both nodes together in a 'multi' operation but this approach also looks fine.

          Show
          Shalin Shekhar Mangar added a comment - Thanks Gregory, good catch! Your patch changes the response node from EPHEMERAL to EPHEMERAL_SEQUENTIAL. Was that intentional? Another option could be to create both nodes together in a 'multi' operation but this approach also looks fine.
          Hide
          Gregory Chanan added a comment -

          Thanks for taking a look, Shalin.

          Your patch changes the response node from EPHEMERAL to EPHEMERAL_SEQUENTIAL. Was that intentional?

          Yes, see below.

          Another option could be to create both nodes together in a 'multi' operation but this approach also looks fine.

          I considered using the 'multi' operation, but it doesn't appear trivial. The issue is that the request and response node share a suffix as determined by ZooKeeper's SEQUENTIAL, but you don't know that until you actually create the node. So, to use multi to create the request and response znodes together, you'd have to do something like first create a SEQUENTIAL node, note the suffix, then use multi to create the request and response nodes together. This seemed to lead to more of a tracking / cleanup issue than not using multi. The fact that the response is EPHEMERAL_SEQUENTIAL rather than EPHEMERAL is because it is now created first (so we don't miss the response).

          Show
          Gregory Chanan added a comment - Thanks for taking a look, Shalin. Your patch changes the response node from EPHEMERAL to EPHEMERAL_SEQUENTIAL. Was that intentional? Yes, see below. Another option could be to create both nodes together in a 'multi' operation but this approach also looks fine. I considered using the 'multi' operation, but it doesn't appear trivial. The issue is that the request and response node share a suffix as determined by ZooKeeper's SEQUENTIAL, but you don't know that until you actually create the node. So, to use multi to create the request and response znodes together, you'd have to do something like first create a SEQUENTIAL node, note the suffix, then use multi to create the request and response nodes together. This seemed to lead to more of a tracking / cleanup issue than not using multi. The fact that the response is EPHEMERAL_SEQUENTIAL rather than EPHEMERAL is because it is now created first (so we don't miss the response).
          Hide
          Shalin Shekhar Mangar added a comment -

          Okay I understand now. So we first create the response node as an EPHEMERAL_SEQUENTIAL and then use its sequence ID to create the persistent request node. Sounds good to me. Thanks for explaining.

          Show
          Shalin Shekhar Mangar added a comment - Okay I understand now. So we first create the response node as an EPHEMERAL_SEQUENTIAL and then use its sequence ID to create the persistent request node. Sounds good to me. Thanks for explaining.
          Hide
          ASF subversion and git services added a comment -

          Commit 1708538 from gchanan@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1708538 ]

          SOLR-8152: Overseer Task Processor/Queue can miss responses, leading to timeouts

          Show
          ASF subversion and git services added a comment - Commit 1708538 from gchanan@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1708538 ] SOLR-8152 : Overseer Task Processor/Queue can miss responses, leading to timeouts
          Hide
          ASF subversion and git services added a comment -

          Commit 1708539 from gchanan@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1708539 ]

          SOLR-8152: Overseer Task Processor/Queue can miss responses, leading to timeouts

          Show
          ASF subversion and git services added a comment - Commit 1708539 from gchanan@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1708539 ] SOLR-8152 : Overseer Task Processor/Queue can miss responses, leading to timeouts

            People

            • Assignee:
              Gregory Chanan
              Reporter:
              Gregory Chanan
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development