Solr / SOLR-5216

Document updates to SolrCloud can cause a distributed deadlock.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, Trunk
    • Component/s: SolrCloud
    • Labels: None

        Activity

        Tim Vaillancourt added a comment - edited

        Hey guys,

        We tested this patch and unfortunately encountered some serious issues after a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are writing about 5000 docs/sec total, using autoCommit to commit the updates (no explicit commits).

        Our environment:

        • Solr 4.3.1 w/SOLR-5216 patch.
        • Jetty 9, Java 1.7.
        • 3 solr instances, 1 per physical server.
        • 1 collection.
        • 3 shards.
        • 2 replicas (each instance is a leader and a replica).
        • Soft autoCommit is 1000ms.
        • Hard autoCommit is 15000ms.
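
        For reference, commit intervals like these are configured in solrconfig.xml. A minimal sketch matching the values above, assuming the stock DirectUpdateHandler2 update handler (not the actual config used in this test):

          <updateHandler class="solr.DirectUpdateHandler2">
            <!-- Hard commit: flush index changes to stable storage every 15 seconds
                 without opening a new searcher -->
            <autoCommit>
              <maxTime>15000</maxTime>
              <openSearcher>false</openSearcher>
            </autoCommit>
            <!-- Soft commit: make new documents searchable every second -->
            <autoSoftCommit>
              <maxTime>1000</maxTime>
            </autoSoftCommit>
          </updateHandler>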

        After about 6 hours of stress-testing this patch, we see many of these stalled transactions (below), and the Solr instances start to see each other as down, flooding our Solr logs with "Connection Refused" exceptions; otherwise there are no obviously useful logs that I could see.

        I did notice some stalled transactions on both /select and /update, however. This never occurred without this patch.

        Stack trace that /select seems stalled on: http://pastebin.com/Y1NCrXGC
        Stack trace that /update seems stalled on: http://pastebin.com/cFLbC8Y9

        Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak. My script "normalizes" the ERROR-severity stack traces and returns them in order of occurrence.

        Summary of my solr.log: http://pastebin.com/pBdMAWeb

        Thanks!

        Tim Vaillancourt

        Mark Miller added a comment -

        I think the only thing this patch could possibly do is eliminate the deadlock while allowing for more threads. I only expect up to a 2x max increase in thread use, but that could be off and allow for more than that if there is some small bug I'm missing. In any case, I'm sure the idea is the right one for the deadlock. I worry the problems you get after many hours might be due to the sheer number of threads and requests. I worry about spending too much time trying to get this solution working, though - the energy might be better spent moving in a better direction: SOLR-5232 (SolrCloud should distribute updates via streaming rather than buffering).

        Mark Miller added a comment -

        I'm going to resolve this for 4.6 with SOLR-5232 - look forward to any help testing it out.

        ASF subversion and git services added a comment -

        Commit 1533649 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1533649 ]

        SOLR-5216: Document updates to SolrCloud can cause a distributed deadlock.
        SOLR-5232: SolrCloud should distribute updates via streaming rather than buffering.

        ASF subversion and git services added a comment -

        Commit 1533652 from Mark Miller in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1533652 ]

        SOLR-5216: Document updates to SolrCloud can cause a distributed deadlock.
        SOLR-5232: SolrCloud should distribute updates via streaming rather than buffering.

        Ricardo Merizalde added a comment -

        Mark, can this issue affect SolrCloud deployments with a single shard? We've been running SolrCloud since April, and today we experienced an odd outage we've never seen before.

        We are currently running Solr 4.5.1 with 4 slaves, and we use CloudSolrServer to send updates. The number of threads went from under 100 to almost 400 on each of the instances in less than one minute. The heap filled up quickly as well, about 2GB in a couple of minutes, until the instances ran out of memory. Of course, all four JVMs started doing major collections one after another but couldn't free any heap memory.

        Unfortunately, we forgot to take thread dumps in the rush for recovering our site. All we have are the heap dumps.

        Also, we do auto hard commits every 5 minutes or every 10k documents.

        We'll be trying 4.6 soon; however, I want to check whether we are headed in the right direction.
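
        For context, the CloudSolrServer update path mentioned above looks roughly like this in SolrJ 4.x. This is only a minimal sketch: the ZooKeeper address, collection name, and field values are placeholders, and no explicit commit is sent, so visibility depends on the server-side autoCommit settings.

          import java.util.ArrayList;
          import java.util.List;

          import org.apache.solr.client.solrj.impl.CloudSolrServer;
          import org.apache.solr.common.SolrInputDocument;

          public class CloudUpdateSketch {
            public static void main(String[] args) throws Exception {
              // ZooKeeper ensemble and collection name are placeholders.
              CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
              server.setDefaultCollection("collection1");

              // Send a small batch of documents in one request; no explicit
              // commit here, so the autoCommit settings on the servers decide
              // when the documents become durable and searchable.
              List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
              for (int i = 0; i < 10; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                batch.add(doc);
              }
              server.add(batch);

              server.shutdown();
            }
          }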

        Erick Erickson added a comment -

        This shouldn't affect single-shard setups. The deadlock, as I remember, showed up when lots of nodes split up incoming batches of documents to forward to lots of leaders. Since a single shard won't split up the documents, I doubt this is the root of what you're seeing.

        But yeah, a stack trace would tell us for certain.

        And Mark committed SOLR-5232, which uses a different mechanism anyway.

        To recap: I doubt this issue is a problem in single-shard setups.


          People

          • Assignee: Mark Miller
          • Reporter: Mark Miller
          • Votes: 2
          • Watchers: 9
