Solr / SOLR-8152

Overseer Task Processor/Queue can miss responses, leading to timeouts


Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 5.4, 6.0
    • SolrCloud
    • None

    Description

      I noticed some Jenkins reports of timeouts in TestConfigSetsAPIExclusivityTest, which seemed strange given that the amount of work to be done is small and the timeout is a generous 300 seconds.

      I added some statistics gathering and started beasting the test. Sure enough, some runs reported tasks taking slightly more than 300 seconds, while most runs had a maximum task time of less than a second. This suggested something was hanging until the timeout.

      Some investigation led to this code:
      https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194

      There appear to be a few issues here:

      String path = createData(dir + "/" + PREFIX, data,
          CreateMode.PERSISTENT_SEQUENTIAL);
      String watchID = createData(
          dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") + 1),
          null, CreateMode.EPHEMERAL);

      Object lock = new Object();
      LatchWatcher watcher = new LatchWatcher(lock);
      synchronized (lock) {
        if (zookeeper.exists(watchID, watcher, true) != null) {
          watcher.await(timeout);
        }
      }
      

      For one, the request node is created before the response node. If the request is picked up and processed quickly, two things can happen:
      1) The response is written before the watch is set, so the client waits until the timeout even though the response is ready. The test still passes because the response is available; the client just waits needlessly.
      2) The processor attempts to write the response before the response node has even been created. The fact that the response node doesn't exist is ignored:
      https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
      In this case, the task is processed but the client actually sees a failure because there is no response.
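
The ordering problem can be reproduced without ZooKeeper. What follows is only a sketch under stated assumptions, not the attached patch: `FakeStore`, the paths `/qn-1` and `/qnr-1`, and `processImmediately` are all hypothetical stand-ins. The store mimics two behaviours described above: a write to a missing node is silently dropped, and a watch fires only on a later write. The "fast processor" is modelled by answering synchronously the moment the request node becomes visible.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** In-memory illustration of the ordering issue; every name here is hypothetical. */
public class ResponseOrderingSketch {

  /** Toy node store with one-shot data watches (an assumption, not the ZooKeeper API). */
  static class FakeStore {
    private static final byte[] EMPTY = new byte[0];
    private final Map<String, byte[]> nodes = new ConcurrentHashMap<>();
    private final Map<String, CountDownLatch> watches = new ConcurrentHashMap<>();

    void create(String path) {
      nodes.put(path, EMPTY);
    }

    /** Registers a watch on the path and reports whether the node currently exists. */
    boolean exists(String path, CountDownLatch watch) {
      watches.put(path, watch);
      return nodes.containsKey(path);
    }

    /** Writes data and fires the watch; a missing node is ignored and false is returned. */
    boolean setData(String path, byte[] data) {
      if (!nodes.containsKey(path)) {
        return false;
      }
      nodes.put(path, data);
      CountDownLatch watch = watches.remove(path);
      if (watch != null) {
        watch.countDown();
      }
      return true;
    }
  }

  /** Models a processor that answers as soon as the request node is visible. */
  static void processImmediately(FakeStore store, String responsePath) {
    store.setData(responsePath, "done".getBytes(StandardCharsets.UTF_8));
  }

  /** Waits for the watch to fire; returns false on timeout or interruption. */
  static boolean awaitQuietly(CountDownLatch latch, long timeoutMs) {
    try {
      return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Order from the quoted snippet: request node, then response node, then watch. */
  static boolean requestFirst(FakeStore store, long timeoutMs) {
    store.create("/qn-1");                 // request becomes visible
    processImmediately(store, "/qnr-1");   // response write is dropped: node missing
    store.create("/qnr-1");                // response node is created too late
    CountDownLatch latch = new CountDownLatch(1);
    if (store.exists("/qnr-1", latch)) {
      return awaitQuietly(latch, timeoutMs); // no write is left to fire the watch
    }
    return false;
  }

  /** Alternative order: response node and watch first, request node last. */
  static boolean watchFirst(FakeStore store, long timeoutMs) {
    store.create("/qnr-1");
    CountDownLatch latch = new CountDownLatch(1);
    boolean present = store.exists("/qnr-1", latch);
    store.create("/qn-1");                 // request is visible only after the watch is set
    processImmediately(store, "/qnr-1");   // write succeeds and fires the watch
    return present && awaitQuietly(latch, timeoutMs);
  }

  public static void main(String[] args) {
    System.out.println("request first, response seen: " + requestFirst(new FakeStore(), 200));
    System.out.println("watch first, response seen:   " + watchFirst(new FakeStore(), 200));
  }
}
```

In this model, `requestFirst` waits out the full timeout and never sees a response, while `watchFirst` returns as soon as the response is written. Because the request is not visible until the watch is registered, the second ordering also rules out case 1, where the response lands before the watch exists.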

      Attachments

        1. SOLR-8152.patch
          3 kB
          Gregory Chanan

            People

              gchanan Gregory Chanan
              gchanan Gregory Chanan